Google's Gemma 4 launched on April 2 to strong benchmark numbers, but people trying to actually run the 31B model on their own hardware are hitting a wall: the KV cache (the memory a model uses to keep track of your conversation so far) is enormous.
The Numbers Don't Add Up
Here's the problem in concrete terms. The Gemma 4 31B model quantized to 8-bit precision (a common compression method that shrinks model files while keeping quality high) takes up about 35GB. That should leave room to spare on a 40GB GPU. But even at a tiny 2,048-token context window - roughly four pages of text - the KV cache pushes total memory past what fits. You'd need to further compress the KV cache itself down to 4-bit precision just to squeeze it in.
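To make the squeeze concrete, here is a minimal sketch of the budget arithmetic. The 40GB and 35GB figures come from the paragraph above. The 1.5GB runtime overhead (activations, framework buffers) is a hypothetical placeholder, not a measured value.

```python
def kv_cache_headroom_gb(vram_gb: float, weights_gb: float, overhead_gb: float = 0.0) -> float:
    """Memory left for the KV cache once weights and runtime overhead are loaded."""
    return vram_gb - weights_gb - overhead_gb


# 40 GB card and 35 GB of 8-bit weights, as quoted in the article.
# overhead_gb=1.5 is an assumed placeholder for activations and framework buffers.
headroom = kv_cache_headroom_gb(vram_gb=40.0, weights_gb=35.0, overhead_gb=1.5)
print(f"{headroom:.1f} GB left for the KV cache")
```

Whatever the real overhead turns out to be, the cache has at most about 5GB to work with, so it overflows if it needs more than that.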
For comparison, Qwen 3.5 27B at the same 8-bit precision fits comfortably on the same hardware at full context length without any KV cache tricks. And Qwen 3.5 27B actually scores higher than Gemma 4 31B on most public benchmarks.
So the trade-off is a bigger model, lower benchmark scores, and dramatically higher memory requirements. That's a tough sell.
What's Causing the Bloat
Gemma 4 31B is a dense model, meaning every parameter is active during inference (unlike Google's own 26B-A4B variant, which uses a mixture-of-experts architecture where only 4 billion parameters fire for any given token). Dense models generally produce better output quality, but they pay for it in compute and memory. The 31B variant supports context windows up to 256,000 tokens, and the KV cache grows with context length. The architecture appears to allocate cache memory aggressively even at short contexts.
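For a rough sense of how the cache grows, the standard estimate for a transformer's KV cache multiplies two tensors (keys and values) by layer count, KV head count, head dimension, context length, and bytes per value. The sketch below applies that formula. The layer, head, and dimension values are hypothetical placeholders, not Gemma 4's published configuration, and the formula ignores optimizations such as sliding-window or shared-KV layers that a real model may use.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_tokens: int, bits_per_value: int = 16,
                   batch_size: int = 1) -> int:
    """Estimate KV cache size for plain full attention.

    The factor 2 covers the separate key and value tensors stored per layer.
    """
    total_bits = (2 * num_layers * num_kv_heads * head_dim
                  * bits_per_value * context_tokens * batch_size)
    return total_bits // 8


# Hypothetical model shape, chosen only to illustrate the scaling.
LAYERS, KV_HEADS, HEAD_DIM = 60, 16, 128

for ctx in (2_048, 32_768, 256_000):
    for bits in (16, 4):
        gb = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, ctx, bits) / 1e9
        print(f"context={ctx:>7} tokens, {bits:>2}-bit cache: {gb:8.2f} GB")
```

Under this formula the cache is linear in context length, and dropping from 16-bit to 4-bit values cuts it to a quarter, which is why KV cache quantization is the usual first lever.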
The official VRAM requirements tell the story: 17-20GB at 4-bit quantization, 34-38GB at 8-bit, and a full 62GB at native FP16 precision. Those numbers are for model weights alone, before you add the KV cache overhead that scales with every token of conversation.
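Those figures can be sanity-checked with back-of-the-envelope arithmetic: parameter count times bits per weight, divided by eight. The snippet below is a sketch that assumes exactly 31 billion parameters and uniform precision across all tensors, which real quantized files rarely have.

```python
def weights_gb(num_params: float, bits_per_weight: int) -> float:
    """Raw weight storage in GB (10^9 bytes), assuming uniform precision."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 31e9  # assumed: exactly 31 billion parameters

for bits in (4, 8, 16):
    print(f"{bits:>2}-bit: {weights_gb(PARAMS, bits):.1f} GB")
```

The 16-bit result matches the quoted 62GB. The 4-bit and 8-bit results come in below the quoted ranges. A likely reason is that quantized formats store scaling metadata and keep some tensors at higher precision.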
Practical Advice for Local Users
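If you're set on running Gemma 4 locally, the 26B-A4B variant is the better bet. It has fewer total parameters to load, and its mixture-of-experts design means only about 4 billion of them are active for any given token, which cuts compute per token substantially. The full set of expert weights still has to be held in memory, so the weight memory savings come from the smaller parameter count rather than from the sparsity itself. For the 31B model, Unsloth's documentation recommends starting with 32K context and 4-bit quantization as a baseline.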
But honestly, if you have a single consumer GPU, Qwen 3.5 27B delivers comparable or better results with far less hassle. Gemma 4's strengths - multimodal support, strong benchmark scores on reasoning tasks like AIME 2026 (89.2%) - are real, but they're hard to appreciate when you can't load the model.