
One Flag Cuts Gemma 4's VRAM Usage by 3x in llama.cpp

Running Google's Gemma 4 locally? There's a good chance you're burning through GPU memory before you even start generating text, and the fix is a single flag.

The problem is Gemma 4's Sliding Window Attention (SWA) cache - the chunk of memory the model reserves to keep track of recent context while generating responses. By default, llama.cpp allocates this cache in F16 (16-bit floating point, which takes more memory per value than a quantized format) and sizes it for multiple simultaneous users. If you're the only person using the model on your machine - which, let's be honest, you almost certainly are - that's pure waste.

Adding -np 1 (short for --parallel 1) to your llama.cpp server launch command tells it to allocate cache for exactly one user instead of the multi-user default. The result: SWA cache VRAM drops by roughly 3x. On a 16GB GPU, that can be the difference between out-of-memory crashes and actually running the model.
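For anyone launching the server from a script, here is a minimal sketch of what that looks like. The model path is a hypothetical placeholder and the helper name is invented for illustration; only the llama-server binary, -m, and -np come from llama.cpp itself.

```python
import shlex


def build_server_command(model_path, parallel_slots=1, extra_args=()):
    """Return the argv list for a llama-server launch with a fixed slot count."""
    return [
        "llama-server",
        "-m", str(model_path),        # path to the GGUF model file
        "-np", str(parallel_slots),   # number of parallel slots; 1 = single user
        *extra_args,
    ]


# Hypothetical model path; substitute the file you actually downloaded.
cmd = build_server_command("models/gemma-4.gguf")
print(shlex.join(cmd))
```

To actually start the server, pass the list to subprocess.run(cmd), or paste the printed line into a terminal.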

This matters most for the dense (non-mixture-of-experts) version of Gemma 4, which is already tight on 16GB cards. The cache allocation happens at startup, so you won't see the savings mid-session - you need to relaunch with the flag. The model's weights are usually quantized (stored at lower precision to save memory), but the SWA cache stays in F16 by default, which is why it stands out as such a memory hog.
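To see why the slot count dominates, here is a back-of-the-envelope sketch. Everything in it is an assumption made for illustration: the sizing rule (one attention window per slot plus a fixed batch headroom), the default of four slots, and every model dimension. None of these are confirmed Gemma 4 or llama.cpp values, so treat the output as a shape, not a measurement.

```python
def swa_cache_bytes(n_slots, window_tokens, headroom_tokens,
                    n_swa_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Estimate SWA cache size under an assumed sizing rule.

    Assumption: the cache holds one window per slot plus a fixed headroom.
    bytes_per_value=2 corresponds to F16.
    """
    cells = n_slots * window_tokens + headroom_tokens
    bytes_per_cell = 2 * n_kv_heads * head_dim * bytes_per_value  # K and V
    return cells * bytes_per_cell * n_swa_layers


# Illustrative numbers only; not real Gemma 4 hyperparameters.
shape = dict(window_tokens=1024, headroom_tokens=512,
             n_swa_layers=40, n_kv_heads=8, head_dim=256)

multi = swa_cache_bytes(n_slots=4, **shape)   # assumed multi-user default
single = swa_cache_bytes(n_slots=1, **shape)  # what -np 1 requests

print(f"{multi / 2**20:.0f} MiB vs {single / 2**20:.0f} MiB "
      f"({multi / single:.1f}x)")
```

Under these made-up numbers the fixed headroom is why four slots cost three times as much as one rather than four times, which is at least consistent with the roughly 3x saving described above.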

If you've been struggling to fit Gemma 4 on consumer hardware, try this before reaching for more aggressive quantization settings that might hurt output quality. As long as you're the only one sending requests, it's the rare optimization that costs you nothing.