NVIDIA Releases Quantized Qwen3 35B MoE in FP4 Format for Local Inference

NVIDIA has published a quantized build of Qwen3.6-35B-A3B on Hugging Face in the nvidia/Qwen3.6-35B-A3B-NVFP4 repository, aimed at developers who run large language models locally on NVIDIA hardware.

The base model uses a Mixture of Experts (MoE) architecture: it has roughly 35 billion total parameters but activates only about 3 billion of them for each token it generates. That makes it cheaper to compute and faster per token than a dense 35-billion-parameter model, where every weight is used on every token. All 35 billion weights still have to be held in memory, however, which is where quantization matters.
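To make the "only some parameters are active" idea concrete, here is a minimal sketch of top-k expert routing in plain Python. The expert count, the value of k, and the toy expert functions are illustrative assumptions, not Qwen's actual configuration, and real implementations operate on tensors rather than scalars.

```python
import math
import random

def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected experts only, so their weights sum to 1.
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_layer(x, experts, router_logits, k):
    """Mix the outputs of the k selected experts; the others are never called."""
    out = 0.0
    for idx, weight in top_k_route(router_logits, k):
        out += weight * experts[idx](x)
    return out

# Hypothetical toy setup: 8 experts, 2 active per token.
NUM_EXPERTS, TOP_K = 8, 2
calls = [0] * NUM_EXPERTS  # counts how often each expert actually runs

def make_expert(i):
    def expert(x):
        calls[i] += 1
        return (i + 1) * x  # stand-in for a feed-forward block
    return expert

experts = [make_expert(i) for i in range(NUM_EXPERTS)]
rng = random.Random(0)
logits = [rng.gauss(0, 1) for _ in range(NUM_EXPERTS)]  # stand-in router scores
y = moe_layer(1.0, experts, logits, TOP_K)
print(sum(calls), "of", NUM_EXPERTS, "experts ran for this token")
```

Every expert's weights exist in the `experts` list, but only `TOP_K` of them do any work for a given token, which is why compute scales with the active parameter count while memory scales with the total.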

NVFP4 is NVIDIA's 4-bit floating-point quantization format. It stores each weight in 4 bits plus a small scale factor shared by each block of 16 weights, which brings the memory footprint to a little over one-quarter of a 16-bit (FP16) version. The format is designed for NVIDIA's Blackwell-generation hardware, such as B200 data center GPUs and RTX 50-series consumer cards, whose tensor cores support FP4 natively. Hopper data center GPUs (H100 and H200), Ada Lovelace consumer cards like the RTX 4090, and older RTX 30-series GPUs lack native FP4 tensor core support.
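The idea behind block-scaled 4-bit floating point can be sketched in a few lines. The toy below rounds each weight to the nearest value a 4-bit E2M1 float can represent, after scaling the block so its largest magnitude lands on the top of that grid. It is a simplification under stated assumptions: the scale is kept as an ordinary Python float rather than an 8-bit value, the example weights are made up, and this is not NVIDIA's quantization code.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign is stored separately).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # weights that share one scale factor

def quantize_block(block):
    """Map one block of floats to (scale, list of signed E2M1 grid values)."""
    amax = max(abs(v) for v in block)
    # Scale so the largest magnitude in the block maps to the top grid value.
    scale = amax / E2M1_VALUES[-1] if amax > 0 else 1.0
    codes = []
    for v in block:
        target = abs(v) / scale
        nearest = min(E2M1_VALUES, key=lambda g: abs(g - target))
        codes.append(-nearest if v < 0 else nearest)
    return scale, codes

def dequantize_block(scale, codes):
    """Recover approximate weights from the shared scale and the 4-bit codes."""
    return [scale * c for c in codes]

weights = [0.02 * (i - 8) for i in range(BLOCK_SIZE)]  # made-up example weights
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.4f}, max abs error={max_err:.4f}")
```

Because the grid is coarse near its top end (the gap between 4 and 6), the worst-case rounding error in a block is one unit of the shared scale, which is why small blocks with their own scales preserve accuracy better than a single scale for a whole tensor.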

For developers with compatible hardware, this makes running a 35B-class model on a single high-memory consumer GPU a practical option rather than a data center project. Qwen3's MoE models have benchmarked competitively against dense models with two to three times their active parameter count on reasoning and coding tasks, so the architecture's efficiency does not come at the cost of capability.
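A rough back-of-the-envelope estimate shows why the single-GPU claim is plausible. The sketch below assumes every one of the roughly 35 billion weights is stored in NVFP4 at 4 bits plus one 8-bit scale per 16 weights; in practice some layers are usually kept at higher precision, and the KV cache and activations need additional memory, so treat the result as a lower bound.

```python
TOTAL_PARAMS = 35e9        # "roughly 35 billion", per the article
FP16_BITS = 16
NVFP4_BITS = 4 + 8 / 16    # 4-bit value plus one 8-bit scale shared by 16 weights

def weight_gib(params, bits_per_weight):
    """Size of the weights alone, in GiB."""
    return params * bits_per_weight / 8 / 2**30

fp16 = weight_gib(TOTAL_PARAMS, FP16_BITS)
nvfp4 = weight_gib(TOTAL_PARAMS, NVFP4_BITS)
print(f"FP16 : {fp16:.1f} GiB")
print(f"NVFP4: {nvfp4:.1f} GiB ({nvfp4 / fp16:.0%} of FP16)")
```

Under these assumptions the weights shrink from about 65 GiB to about 18 GiB, which moves the model from multi-GPU territory to within reach of a single card with 24 GB or more of memory.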

NVIDIA has been building out a library of optimized model releases on Hugging Face for several months, typically taking popular open-source models and publishing NVFP4 builds to reduce the setup friction of local deployment on its hardware.