A 31-billion-parameter AI model with a 256K-token context window (roughly 600 pages of text), running on a single consumer graphics card. That's what TurboQuant KV cache compression has made possible with Google's Gemma 4 on an NVIDIA RTX 5090.
The key bottleneck with large context windows isn't the model weights themselves - it's the KV cache, a chunk of memory the model uses to "remember" earlier parts of a conversation or document. The cache grows with every token in the context, and at 256K tokens it alone would normally blow past the 32GB of VRAM on an RTX 5090. TurboQuant compresses these cached values. With it, the reported memory footprint of Gemma 4 31B drops from 30.4GB to around 18.9GB, leaving headroom for a long context window.
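To see why the cache dominates at long contexts, here is a minimal back-of-the-envelope sketch. It assumes a standard transformer cache that stores one key vector and one value vector per token, per layer, per KV head. The model shape in `SHAPE` and the 16-bit and 4-bit widths are hypothetical values chosen for illustration; they are not Gemma 4's real configuration or TurboQuant's actual bit rate.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits_per_value):
    """Approximate KV cache size for a standard transformer cache."""
    # 2 = one key vector plus one value vector per token, layer, and KV head.
    values = 2 * tokens * layers * kv_heads * head_dim
    return values * bits_per_value / 8


# Hypothetical model shape, for illustration only (NOT Gemma 4's real config).
SHAPE = dict(layers=48, kv_heads=8, head_dim=128)
GIB = 1024 ** 3


def cache_gib(tokens, bits_per_value):
    """Cache size in GiB for the hypothetical shape above."""
    return kv_cache_bytes(tokens, bits_per_value=bits_per_value, **SHAPE) / GIB


if __name__ == "__main__":
    for tokens in (4 * 1024, 128 * 1024, 256 * 1024):
        fp16 = cache_gib(tokens, 16)  # uncompressed 16-bit cache
        q4 = cache_gib(tokens, 4)     # illustrative 4-bit compressed cache
        print(f"{tokens:>7} tokens: {fp16:6.2f} GiB at 16-bit, {q4:6.2f} GiB at 4-bit")
```

For this made-up shape, a 256K-token cache comes to 48 GiB at 16 bits per value and 12 GiB at 4 bits. The point is the scaling: the cache grows linearly with context length, so cutting the bits stored per value cuts the whole cache by the same factor.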
Performance at Scale
Benchmarks using Q4_K_M quantization (a compression method that reduces the numerical precision of the model weights while preserving most quality) on the RTX 5090 show practical speeds across context lengths:
- 4K context: 61 tokens/second generation, 3,395 tokens/second prompt processing
- 32K context: 55 tokens/second generation, 2,229 tokens/second prompt processing
- 64K context: 51 tokens/second generation, 1,459 tokens/second prompt processing
- 128K context: 43 tokens/second generation, 900 tokens/second prompt processing
At 61 tokens per second on short contexts and 43 tokens per second at 128K, generation stays faster than a comfortable reading speed. VRAM usage scales from 20GB at 4K context up to 30GB at 128K. The full 256K context pushes usage to roughly 40GB, which exceeds the card's 32GB of VRAM and requires spillover into system RAM, but remains functional.
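The reported numbers hang together under simple arithmetic. The sketch below makes two assumptions that the benchmark does not state: that VRAM grows linearly with context length between the two reported points, and that each prompt-processing rate is an average over the whole prompt. The function names are illustrative.

```python
# Figures reported in the benchmark above.
VRAM_GB = {4: 20.0, 128: 30.0}  # context length (K tokens) -> VRAM in GB
PROMPT_TPS = {4: 3395, 32: 2229, 64: 1459, 128: 900}  # context (K tokens) -> prompt tokens/s


def extrapolate_vram_gb(context_k):
    """Linear fit through the two reported VRAM points (assumes linear growth)."""
    (k0, v0), (k1, v1) = sorted(VRAM_GB.items())
    gb_per_k = (v1 - v0) / (k1 - k0)
    return v0 + gb_per_k * (context_k - k0)


def prefill_seconds(context_k, tokens_per_k=1024):
    """Time to ingest a full prompt, assuming the reported rate is a whole-prompt average."""
    return context_k * tokens_per_k / PROMPT_TPS[context_k]


if __name__ == "__main__":
    print(f"Estimated VRAM at 256K: {extrapolate_vram_gb(256):.1f} GB")
    for k in PROMPT_TPS:
        print(f"{k:>3}K prompt: about {prefill_seconds(k):.0f} s to process")
```

The linear fit gives about 40.3GB at 256K, in line with the roughly 40GB figure. Under the averaging assumption, a full 128K prompt takes around two and a half minutes to ingest before the first generated token appears.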
Gemma 4's architecture helps here. It uses a shared KV cache where later layers reuse key/value data from earlier layers, cutting memory and compute during inference (the process of actually generating text). Dual RoPE - a positioning system that helps the model track where it is in long documents - keeps output quality stable even at extreme context lengths.
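As a rough illustration of why KV sharing saves memory, the sketch below maps each layer to the layer whose cache it reads. The sharing pattern (the last `num_shared` layers all reuse the cache of the final layer that still computes its own keys and values) and the layer counts are assumptions made for this example, not a description of Gemma 4's actual layout.

```python
def build_kv_sources(num_layers, num_shared):
    """Map each layer index to the layer whose KV cache it reads.

    Assumed, hypothetical scheme: the last `num_shared` layers reuse the cache
    of the final layer that still computes its own keys and values.
    """
    if not 0 <= num_shared < num_layers:
        raise ValueError("num_shared must be in [0, num_layers)")
    last_own = num_layers - num_shared - 1
    return [min(layer, last_own) for layer in range(num_layers)]


def cache_savings(num_layers, num_shared):
    """Fraction of per-token KV storage avoided by sharing."""
    sources = build_kv_sources(num_layers, num_shared)
    return 1 - len(set(sources)) / num_layers


if __name__ == "__main__":
    # Toy sizes: 6 layers, of which the last 2 reuse an earlier layer's cache.
    print(build_kv_sources(6, 2))
    print(f"KV storage avoided: {cache_savings(6, 2):.0%}")
```

Only layers that own a cache slot contribute to per-token memory, so in this toy setup sharing two of six layers removes a third of the KV storage. The sharing layers also skip computing their own key/value projections, which is where the compute saving comes from.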
Local AI Gets Serious
NVIDIA has been actively promoting Gemma 4 for local deployment through its RTX AI Garage initiative. The combination of a strong open-weight model, aggressive quantization, and a $2,000 consumer GPU makes 256K-context AI genuinely accessible outside the cloud for the first time at this model size.
For anyone processing long documents, codebases, or extended conversations locally - without sending data to an API - this is the new baseline. The RTX 5090 isn't cheap, but it's a one-time purchase versus ongoing API costs that add up fast at these context lengths.
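The one-time-purchase argument can be made concrete with a break-even estimate. The API price used below is a hypothetical placeholder, not a quote from any provider. The calculation also ignores electricity, output-token charges, and the rest of the PC, so treat it as a rough sketch rather than a cost model.

```python
import math


def break_even_requests(gpu_cost_usd, price_per_million_input_tokens_usd, tokens_per_request):
    """Number of API requests whose input-token cost equals the GPU purchase price."""
    cost_per_request = price_per_million_input_tokens_usd * tokens_per_request / 1_000_000
    return math.ceil(gpu_cost_usd / cost_per_request)


if __name__ == "__main__":
    GPU_COST_USD = 2000            # the consumer GPU price cited in the article
    HYPOTHETICAL_PRICE_USD = 1.00  # placeholder price per million input tokens
    requests = break_even_requests(GPU_COST_USD, HYPOTHETICAL_PRICE_USD, 128 * 1024)
    print(f"Break-even after about {requests} requests of 128K tokens each")
```

Plugging in a real provider's price and a realistic request volume shows how quickly, or slowly, the card pays for itself for a given workload.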