Nvidia released a quantized version of Google's Gemma 4 26B model in its proprietary NVFP4 format, designed to run on RTX 50-series consumer GPUs.
A quick translation of the model name: "26B-A4B" means 26 billion total parameters, of which only about 4 billion are active for any given token. This Mixture of Experts design routes each token through a small fraction of the model's total capacity, which makes it cheaper to compute than a conventional (dense) 26B model, although all 26 billion parameters still have to sit in memory. NVFP4 is Nvidia's 4-bit floating point quantization format - quantization reduces a model's numerical precision to shrink its memory footprint, trading a small quality reduction for the ability to run on hardware with limited VRAM.
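To make the memory trade-off concrete, here is a rough back-of-the-envelope sketch in Python. It counts only the raw weights at a given bit width. It ignores any scaling metadata a format like NVFP4 stores alongside the weights, as well as the KV cache and activations, so treat the results as a lower bound rather than a measurement. The function and constant names are made up for illustration.

```python
# Rough weight-only memory estimate. Assumption: every parameter is stored
# at the same bit width and nothing else (scales, KV cache, activations)
# is counted, so real usage will be higher.

TOTAL_PARAMS = 26e9   # "26B": all experts must be resident in memory
ACTIVE_PARAMS = 4e9   # "A4B": parameters actually used per token


def weight_memory_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return total_params * bits_per_weight / 8 / 1e9


def active_fraction(active_params: float, total_params: float) -> float:
    """Share of the model's parameters that take part in each token."""
    return active_params / total_params


if __name__ == "__main__":
    for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
        size = weight_memory_gb(TOTAL_PARAMS, bits)
        print(f"{label}: ~{size:.0f} GB of weights")
    share = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
    print(f"active per token: ~{share:.0%} of parameters")
```

At 4 bits, 26 billion weights come to roughly 13 GB, against roughly 52 GB at 16 bits. That gap is what brings the model within reach of a single consumer card, while the 4-billion active count affects compute per token rather than memory.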
On an RTX 5090 (32GB VRAM), the model runs with about 26GB in use and handles roughly 50,000 tokens of context - around 130 pages of text.
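As a quick sanity check on those figures, the short Python sketch below derives what they imply: the VRAM left over and the tokens-per-page ratio behind the "130 pages" estimate. It uses only the numbers quoted above. The helper name `pages_for` is hypothetical, and the ratio is simply whatever the article's own figures work out to, not a standard conversion.

```python
# Figures quoted in the article for an RTX 5090.
VRAM_TOTAL_GB = 32
VRAM_IN_USE_GB = 26
CONTEXT_TOKENS = 50_000
PAGES = 130

# VRAM left for everything else (display, other processes, longer context).
headroom_gb = VRAM_TOTAL_GB - VRAM_IN_USE_GB

# Tokens-per-page ratio implied by "50,000 tokens ~ 130 pages".
tokens_per_page = CONTEXT_TOKENS / PAGES


def pages_for(tokens: int, ratio: float = tokens_per_page) -> float:
    """Convert a token count to pages using the implied ratio (an estimate)."""
    return tokens / ratio


if __name__ == "__main__":
    print(f"headroom: {headroom_gb} GB")
    print(f"implied density: ~{tokens_per_page:.0f} tokens per page")
    print(f"a 20,000-token document: ~{pages_for(20_000):.0f} pages")
```

The implied density is about 385 tokens per page, and the card is left with about 6 GB of headroom at that context length.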
The hard constraint: NVFP4 only works on Blackwell-architecture GPUs, which for consumer cards means the RTX 50 series. Owners of RTX 40 or older cards can't use this format and will need a different quantization instead, such as GGUF through Ollama or LM Studio, or GPTQ through a runtime that supports it.
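One way to express that rule in code is to branch on the GPU's CUDA compute capability. The sketch below is a minimal illustration, not an official check. It assumes Blackwell parts report a major version of 10 or higher (the RTX 50 series reports 12.0, while the RTX 40 series reports 8.9), and the function name and threshold constant are hypothetical. In practice the tuple could come from something like PyTorch's `torch.cuda.get_device_capability()`.

```python
# Assumption: a compute-capability major version of 10 or above indicates
# Blackwell (for example 12.0 on the RTX 50 series). Earlier consumer cards
# such as the RTX 40 series report 8.9. This threshold is illustrative only.
BLACKWELL_MIN_MAJOR = 10


def pick_quant_format(compute_capability: tuple[int, int]) -> str:
    """Suggest a quantization format family for a given compute capability."""
    major, _minor = compute_capability
    if major >= BLACKWELL_MIN_MAJOR:
        return "NVFP4"
    # Older architectures: fall back to formats that do not need FP4 hardware.
    return "GGUF or GPTQ"


if __name__ == "__main__":
    for name, cap in [("RTX 5090", (12, 0)), ("RTX 4090", (8, 9))]:
        print(f"{name} (sm_{cap[0]}{cap[1]}): {pick_quant_format(cap)}")
```

Keying on compute capability rather than the card's marketing name avoids maintaining a list of model strings, though the threshold would need revisiting for future architectures.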
Gemma 4 26B is Google's mid-range model from the Gemma 4 family. Running it locally on a single consumer card means no API costs and no data leaving your machine - which matters for developers building private applications or processing sensitive documents. The RTX 5090 starts at around $2,000, so this isn't a mass-market story, but it's relevant for anyone already on Blackwell hardware wondering what the architecture can actually run.