NVIDIA's Nemotron Cascade 2 Scores IMO Gold With Only 3B Active Parameters

Thirty billion parameters in total, but only 3 billion active for any given token. That ratio is what makes NVIDIA's Nemotron Cascade 2 worth paying attention to.

Released March 19, the model uses a Mixture of Experts (MoE) architecture: a design in which the model contains many specialized sub-networks but activates only a small subset of them for each token, keeping compute costs low while retaining a large pool of parameters. The result: benchmark scores that compete with models running 4 to 10 times as many active parameters.
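
To make the routing idea concrete, here is a minimal sketch of top-k expert selection in plain Python. It is a toy illustration, not NVIDIA's implementation: the eight experts, the choice of k=2, and the random router scores are all assumptions made for the example. In a real MoE layer the router scores come from a learned layer applied to each token.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    # Renormalize so the weights of the selected experts sum to 1.
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))

def moe_layer(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    chosen = route_top_k(router_logits, k)
    output = sum(w * experts[i](x) for i, w in chosen)
    return output, [i for i, _ in chosen]

# Hypothetical toy setup: 8 "experts", each of which just scales its input.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
random.seed(0)
router_logits = [random.gauss(0, 1) for _ in experts]  # stand-in for a learned router

output, used = moe_layer(3.0, experts, router_logits, k=2)
print(f"experts used: {used} of {len(experts)}; output: {output:.3f}")
```

Only two of the eight expert functions are ever called for this input. That is the source of the compute savings: the unused experts still exist and still take up storage, but they cost nothing to evaluate.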

The Numbers

The headline result is a score of 35 out of a possible 42 points on IMO 2025 (the International Mathematical Olympiad), which puts it in gold medal territory. In competitive programming, it solved 10 of 12 problems from the ICPC World Finals 2025 and scored 439.3 out of 600 on IOI 2025 (the International Olympiad in Informatics).

On more standard benchmarks:

  • AIME 2025 (math competition): 92.4, or 98.6 with tool-integrated reasoning (where the model can execute Python code to check its work)
  • LiveCodeBench v6 (real-world coding): 87.2
  • MMLU-Pro (broad knowledge): 79.8
  • GPQA-Diamond (graduate-level science questions): 76.1
  • 1M token context (long document processing): 99.0% accuracy on needle-in-a-haystack tests

For comparison, NVIDIA claims it outperforms both Qwen3.5-35B-A3B (a similarly sized MoE model from Alibaba) and NVIDIA's own earlier Nemotron-3-Super-120B-A12B, even though that model activates four times as many parameters per token.

Why 3B Active Matters

Active parameter count determines how much compute each generated token costs, and therefore how fast the model runs. Total parameter count still sets the memory footprint, since all 30B weights have to be stored somewhere, but a 30B-total, 3B-active model can generate at usable speeds on hardware where a dense 30B model would crawl. For people running models locally, on a beefy desktop GPU or a small server, this is the metric that decides whether a model is practical or theoretical.
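
A rough back-of-the-envelope calculation helps separate the two costs involved. The sketch below uses two common rules of thumb, neither taken from NVIDIA's materials: weight storage is roughly total parameters times bytes per parameter, and a forward pass costs roughly 2 FLOPs per active parameter per token. It ignores the KV cache, activations, and runtime overhead, so treat the numbers as order-of-magnitude estimates only.

```python
def weight_memory_gb(total_params, bytes_per_param):
    """Approximate weight storage in GB: every expert must be stored, active or not."""
    return total_params * bytes_per_param / 1e9

def forward_flops_per_token(active_params):
    """Rule of thumb: about 2 FLOPs (one multiply, one add) per active parameter."""
    return 2 * active_params

TOTAL_PARAMS = 30e9   # from the article: 30B total
ACTIVE_PARAMS = 3e9   # from the article: 3B active

# Bytes per parameter for a few common precisions (illustrative choices).
for label, bytes_per_param in [("16-bit", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    gb = weight_memory_gb(TOTAL_PARAMS, bytes_per_param)
    print(f"{label} weights: ~{gb:.0f} GB")

moe_flops = forward_flops_per_token(ACTIVE_PARAMS)
dense_flops = forward_flops_per_token(TOTAL_PARAMS)
print(f"MoE (3B active): ~{moe_flops / 1e9:.0f} GFLOPs per token")
print(f"Dense 30B:       ~{dense_flops / 1e9:.0f} GFLOPs per token "
      f"({dense_flops / moe_flops:.0f}x more)")
```

Under these assumptions the weights take the same space as a dense 30B model at a given precision, while each token costs about a tenth of the compute. The practical win of the 3B-active design is speed per token rather than a smaller download.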

The model supports two modes: a "thinking" mode where it reasons step-by-step inside <think> tags before answering (the same reason-then-answer pattern used by reasoning models such as OpenAI's o1 and o3), and a standard instruct mode for quick responses. It also supports tool use, including executing Python code mid-response to verify calculations.
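
For anyone consuming thinking-mode output programmatically, the reasoning usually needs to be separated from the final answer. Here is a minimal sketch of one way to do that, assuming the raw text contains literal <think>...</think> blocks followed by the answer. The helper name split_reasoning and the sample string are made up for illustration, and the model's exact output format may differ.

```python
import re

# Non-greedy match so multiple <think> blocks are captured separately;
# DOTALL lets the reasoning span several lines.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text):
    """Return (reasoning, answer); reasoning is '' if no <think> block is present."""
    reasoning = "\n".join(part.strip() for part in THINK_RE.findall(text))
    answer = THINK_RE.sub("", text).strip()
    return reasoning, answer

# Hypothetical model output, written by hand for this example.
sample = "<think>17 * 3 = 51, then 51 + 4 = 55.</think>The result is 55."
reasoning, answer = split_reasoning(sample)
print("reasoning:", reasoning)
print("answer:", answer)
```

Text with no <think> block passes through unchanged as the answer, so the same helper works for instruct-mode responses.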

Not Based on Qwen

Despite the similar sizing convention (30B-A3B mirrors Qwen's naming scheme), NVIDIA built this on its own Nemotron architecture, not on Qwen. That matters because many recent open-weight MoE models have been Qwen derivatives. A genuinely different architecture competing at this level gives the local AI community another option with different strengths and failure modes.

The model is available on Hugging Face under NVIDIA's Open Model License. The training datasets (both supervised fine-tuning and reinforcement learning data) are also published, which is unusually transparent for a model at this performance tier.

For anyone running local models, Nemotron Cascade 2 looks like the new efficiency benchmark to beat at the 3B-active scale.