
One Flag Cuts Gemma 4's SWA Cache VRAM by 3x in llama.cpp

April 3, 2026


Running Google's Gemma 4 locally? There's a good chance you're burning through GPU memory before you even start generating text, and the fix is a single flag.

The problem is Gemma 4's Sliding Window Attention (SWA) cache, the chunk of memory the model reserves to keep track of recent context while generating responses. By default, llama.cpp allocates this cache in F16 (16-bit floating point, which takes more memory per value than the quantized formats typically used for model weights) and sizes it for multiple simultaneous users. If you're the only person using the model on your machine (which, let's be honest, you almost certainly are), that's pure waste.

Adding -np 1 (short for --parallel 1) to your llama.cpp launch command tells the server to allocate cache for exactly one user instead of the default multi-user allocation. The result: the SWA cache shrinks to roughly a third of its default VRAM footprint. On a 16GB GPU, that can be the difference between out-of-memory crashes and actually running the model.
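As a minimal sketch of what that launch command looks like, the Python snippet below assembles an argument list for llama-server with the flag set. The helper name `build_server_command` and the model path are hypothetical placeholders; only `llama-server`, `-m`, and `-np` come from llama.cpp itself.

```python
import shlex


def build_server_command(model_path: str, parallel_slots: int = 1) -> list[str]:
    """Return an argv list for llama-server limited to `parallel_slots` slots.

    Hypothetical helper for illustration; it only builds the command and
    does not start the server.
    """
    if parallel_slots < 1:
        raise ValueError("parallel_slots must be at least 1")
    return [
        "llama-server",
        "-m", model_path,            # path to the GGUF model file
        "-np", str(parallel_slots),  # number of parallel slots (users)
    ]


# The model path is a placeholder; substitute your own GGUF file.
command = build_server_command("models/gemma-4.gguf")
print(shlex.join(command))
```

The printed line can be pasted into a terminal, or the list can be passed to `subprocess.run` if you prefer to launch the server from a script. Any other options you already use go alongside `-np 1` unchanged.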

This matters most for the dense (non-mixture-of-experts) version of Gemma 4, which is already tight on 16GB cards. The cache allocation happens at startup, so you won't see the savings mid-session; you need to relaunch with the flag. The model's weights get quantized (compressed to use less precision), but the SWA cache doesn't benefit from that by default, which is why it stands out as such a memory hog.
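To get a feel for why the slot count matters so much, here is a back-of-the-envelope model in Python. Everything in it is an assumption for illustration: the cell-count formula (window size times slots, plus one micro-batch), the default of 4 slots, and every model dimension are made-up placeholders, not Gemma 4's real configuration or llama.cpp's actual allocation code.

```python
def swa_cache_cells(window: int, slots: int, ubatch: int) -> int:
    """Assumed cell count: one window per slot plus one micro-batch of headroom."""
    return window * slots + ubatch


def swa_cache_bytes(cells: int, swa_layers: int, kv_heads: int,
                    head_dim: int, bytes_per_value: int = 2) -> int:
    """Keys and values (factor of 2) for every SWA layer, stored as F16 by default."""
    return 2 * swa_layers * kv_heads * head_dim * cells * bytes_per_value


# Hypothetical numbers, chosen only to make the arithmetic concrete.
WINDOW, UBATCH = 1024, 512
SWA_LAYERS, KV_HEADS, HEAD_DIM = 40, 8, 128

multi_user = swa_cache_bytes(swa_cache_cells(WINDOW, 4, UBATCH),
                             SWA_LAYERS, KV_HEADS, HEAD_DIM)
single_user = swa_cache_bytes(swa_cache_cells(WINDOW, 1, UBATCH),
                              SWA_LAYERS, KV_HEADS, HEAD_DIM)

print(f"4 slots: {multi_user / 2**20:.0f} MiB")
print(f"1 slot:  {single_user / 2**20:.0f} MiB")
print(f"ratio:   {multi_user / single_user:.1f}x")
```

Under these assumed numbers the single-slot cache comes out at a third of the four-slot one. The real ratio on your machine depends on the model's actual dimensions and on how your llama.cpp build sizes the cache.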

If you've been struggling to fit Gemma 4 on consumer hardware, try this before reaching for more aggressive quantization settings that might hurt output quality. It's the rare optimization that costs you nothing.
