Related ToolsChatgpt

NVIDIA Shrinks OpenAI's 120B Open Model to 88B Parameters, Runs 2.8x Faster

OpenAI
Image: OpenAI

OpenAI released its gpt-oss-120b as an open-weight reasoning model earlier this year. Now NVIDIA has taken that 120-billion-parameter model and squeezed it down to 88 billion parameters while actually matching or slightly beating the original's accuracy. The result, gpt-oss-puzzle-88B, landed on Hugging Face today under NVIDIA's Open Model License, cleared for commercial use.

The speed gains are the real story here. On a single H100 GPU, NVIDIA claims 2.82x faster inference (the time it takes the model to generate a response). On an 8-GPU H100 setup handling long documents, it's 1.63x faster. For short prompts on the same 8-GPU rig, 1.22x. Those aren't theoretical numbers from a paper - they're throughput benchmarks on production hardware.

How They Did It

NVIDIA used three techniques stacked on top of each other:

Heterogeneous expert pruning. The model uses a Mixture-of-Experts (MoE) architecture, where instead of activating the entire network for every query, it routes each request to a subset of specialized "expert" sub-networks. NVIDIA pruned different numbers of experts from different layers, keeping more where they matter and cutting aggressively where they don't. That's the "heterogeneous" part - previous pruning methods cut uniformly.

Selective window attention. Standard transformer models let every token attend to every other token in the context window. NVIDIA replaced most of those full-attention layers with windowed attention that only looks at the nearest 8,000 tokens. This cuts the KV-cache (the memory the model uses to track conversation context) by roughly 40%. The practical upside: you can fit longer conversations into the same GPU memory.

Aggressive quantization. The expert weights use MXFP4 (4-bit floating point), and the KV-cache uses FP8 (8-bit). Together, this roughly doubles how many tokens the model can hold in memory at once compared to the parent.

What You Can Actually Run

The model supports a 128K token context window - roughly 300 pages of text. It ships with three reasoning effort modes (low, medium, high) that let you trade speed for depth depending on the task. Ask it the capital of France, set effort to low. Ask it to work through a multi-step math proof, set it to high.

Deployment uses vLLM or Hugging Face Transformers. NVIDIA published Docker commands for single-GPU serving, which is notable - an 88B MoE model fitting on one H100 is genuinely useful for teams that don't want to manage multi-GPU setups.

The benchmark suite covers MMLU-Pro, GPQA-Diamond (graduate-level science questions), AIME 2025 (competition math), and RULER 128K (long-context evaluation). NVIDIA says the compressed model matches the 120B parent across all of them.

Who This Is For

Self-hosters and companies running their own inference infrastructure. If you're already deploying gpt-oss-120b locally, this is a straightforward upgrade: same accuracy, significantly less compute per request. The NVIDIA Open Model License allows commercial use, so startups building on top of gpt-oss can cut their GPU bills without switching models.

For everyone else using AI through APIs, this matters indirectly. It demonstrates that post-training compression of MoE reasoning models can deliver major efficiency gains without quality loss. Expect this technique to show up in more model releases throughout 2026.