A 120 billion parameter model that only uses 12 billion parameters at a time. That's the pitch behind NVIDIA's Nemotron 3 Super, a fully open-source model built for AI agents that need to reason through long, complex tasks.
The architecture is genuinely unusual. Nemotron 3 Super is a hybrid of three different model designs working together: Mamba-2 layers handle long sequences efficiently without the quadratic cost of standard attention, Transformer attention layers handle tasks where the model needs precise recall across the full context, and a "Latent MoE" (Mixture of Experts) layer routes each token through only a fraction of the model's total capacity. That last piece is why the model has 120 billion parameters on paper but only activates 12 billion per token - you get a much larger model's knowledge at a fraction of the compute cost.
1 Million Token Context, 5x Faster Throughput
The numbers NVIDIA is reporting are substantial. Nemotron 3 Super supports a native 1 million token context window - roughly 2,500 pages of text in a single conversation. Throughput is over 5x better than the previous Nemotron Super version. On PinchBench, a benchmark for agent-style tasks like tool calling and multi-step reasoning, it scored 85.6%, which NVIDIA says makes it the top-performing open model in its weight class.
Two technical choices stand out. First, the model was trained natively in NVIDIA's NVFP4 format (a 4-bit floating point), which means it runs 4x faster on Blackwell GPUs compared to FP8 inference on H100s. Second, it uses multi-token prediction - forecasting several tokens at once instead of one at a time - which enables up to 3x wall-clock speedups during structured output generation like code or JSON.
Built for Agent Workloads
NVIDIA is positioning this squarely at the agentic AI use case. Their argument: when you chain multiple AI agents together, the conversation context explodes - they claim multi-agent workflows generate roughly 15x more tokens than a standard chat. That's where the Mamba layers and million-token context earn their keep, processing long histories without the memory blowup that standard Transformers suffer from.
The model was post-trained using reinforcement learning across 21 different environment configurations with 1.2 million rollouts, focused on reasoning, coding, safety, and multi-step tool use. The pre-training diet was 25 trillion tokens total, with 10 trillion unique curated tokens.
Fully Open, Widely Available
This is a genuinely open release. NVIDIA published the model weights, the full 40-million-sample training dataset, complete training recipes, and evaluation methods under the NVIDIA Nemotron Open Model License. You can run it through vLLM, SGLang, TensorRT LLM, or NVIDIA NIM, and it's already available on Hugging Face, build.nvidia.com, Perplexity, and OpenRouter.
For anyone running local models, the 12B active parameter count puts this in a surprisingly accessible range for the capability it delivers. The MoE approach means you still need enough VRAM to load the full 120B model, but inference costs scale with the 12B active slice. If you're building agent pipelines that need long context and structured tool calling, Nemotron 3 Super is now the open-source model to beat.