NVIDIA Nemotron 3 Nano Omni Can Process 5+ Hours of Video, Audio, and Docs in One Pass

Image: "Introducing NVIDIA Nemotron 3 Nano Omni: Long-Context Multimodal Intelligence for Documents, Audio and Video Agents" (Hugging Face)

The name is a bit misleading. NVIDIA is calling this model "Nano," but it has 30 billion parameters - not exactly small. The naming makes sense once you understand the architecture: it uses a Mixture of Experts (MoE) design, meaning only 3 billion parameters are active on any given forward pass. A router selects a small subset of experts for each token, and the rest sit idle until a later token is routed to them. That's where the efficiency comes from, and why NVIDIA claims 9x higher throughput on multimodal workloads compared to alternatives.
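To make the "active parameters" idea concrete, here is a minimal sketch of top-k expert routing, the mechanism MoE layers typically use. The expert count, the top-k value, and the random router scores are hypothetical placeholders, not Nemotron's actual configuration; only the 30 billion and 3 billion figures come from the announcement.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_scores, top_k):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts."""
    probs = softmax(router_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    # Renormalize so the weights of the selected experts sum to 1.
    return [(i, probs[i] / norm) for i in chosen]

NUM_EXPERTS = 16  # hypothetical, for illustration only
TOP_K = 2         # hypothetical, for illustration only

rng = random.Random(0)
scores = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in for a learned router
selected = route_token(scores, TOP_K)

TOTAL_PARAMS = 30e9   # from the announcement
ACTIVE_PARAMS = 3e9   # from the announcement
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print("experts used for this token:", [i for i, _ in selected])
print(f"share of parameters active per pass: {active_fraction:.0%}")
```

Only the experts in `selected` would run for that token; every other expert's weights stay in memory but cost no compute, which is how a 30-billion-parameter model can run at roughly the cost of a 3-billion-parameter one.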

Nemotron 3 Nano Omni launched April 28 as an open-weights model on Hugging Face. It's designed specifically for agent workflows that need to process multiple types of input together: long documents, audio recordings, video files, and GUI screenshots - all within the same context window.

What Sets the Context Window Apart

Most multimodal AI models today handle images alongside text reasonably well. Far fewer can process audio without first transcribing it to text, a step that discards tone, pacing, and non-speech sounds like background noise or music. Nemotron Omni takes audio directly as input: audio sampled at 16 kHz is processed alongside video frames and text tokens, with no transcription step required.

The practical result is a context window spanning more than 5 hours of combined multimodal content. Training covered continuous audio clips of up to 20 minutes, and documents scale to 100+ pages. Among open-weights models specifically, that's a meaningful step beyond what has been available.
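For a sense of scale, the raw sample counts behind those durations are easy to work out. This is plain arithmetic on the 16 kHz rate quoted above; it says nothing about how many tokens the model's audio encoder produces, which the article does not state. The 5-hour case is a hypothetical audio-only recording, since the 5-hour figure refers to combined multimodal content.

```python
SAMPLE_RATE_HZ = 16_000  # sampling rate quoted in the article

def num_samples(duration_seconds, sample_rate_hz=SAMPLE_RATE_HZ):
    """Number of raw audio samples in a mono clip of the given length."""
    return int(duration_seconds * sample_rate_hz)

twenty_minutes = num_samples(20 * 60)    # longest continuous clip mentioned
five_hours = num_samples(5 * 60 * 60)    # hypothetical audio-only 5-hour input

print(f"20 minutes: {twenty_minutes:,} samples")  # 19,200,000
print(f"5 hours:    {five_hours:,} samples")      # 288,000,000
```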

The video pipeline includes a technique called Conv3D temporal compression, which fuses consecutive video frames into grouped "tubelets" to cut the number of visual tokens in half - letting the model cover more footage without proportionally blowing up computation.
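The token saving can be illustrated with a toy version of tubelet fusion. This sketch swaps the learned Conv3D kernel for a plain mean over each pair of frames, so it shows only the grouping and the resulting token count, not NVIDIA's actual layer. The tubelet size of 2 is inferred from the "in half" claim, and `TOKENS_PER_FRAME` is a hypothetical value.

```python
def fuse_into_tubelets(frames, tubelet_size=2):
    """Fuse each run of `tubelet_size` consecutive frames into one token grid.

    Each frame is a list of token values. A real Conv3D layer would apply
    learned weights across time; the mean used here is only a stand-in.
    """
    if len(frames) % tubelet_size != 0:
        raise ValueError("frame count must be a multiple of tubelet_size")
    fused = []
    for start in range(0, len(frames), tubelet_size):
        group = frames[start:start + tubelet_size]
        # Element-wise mean across the time axis of the group.
        fused.append([sum(values) / tubelet_size for values in zip(*group)])
    return fused

TOKENS_PER_FRAME = 4  # hypothetical, for illustration only

# Eight dummy frames; frame t holds the value t in every token slot.
frames = [[float(t)] * TOKENS_PER_FRAME for t in range(8)]
tubelets = fuse_into_tubelets(frames)

tokens_before = len(frames) * TOKENS_PER_FRAME
tokens_after = len(tubelets) * TOKENS_PER_FRAME
print(tokens_before, "->", tokens_after)  # 32 -> 16
```

Because the per-frame token grid keeps its size while the number of time steps halves, the visual token count drops by half for the same amount of footage.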

Benchmark Numbers Worth Examining

The individual benchmark results give a clearer picture than NVIDIA's aggregate throughput claims:

  • OSWorld 47.4 - this benchmark measures whether an AI can operate a computer by reading its screen and taking actions; 47.4 is competitive with models that have twice the active parameter count
  • Video-MME 72.2 - a standard video understanding benchmark
  • VoiceBench 89.4 - measures voice interaction quality across accents and speaking styles
  • MMLongBench-Doc 57.5 - long document comprehension across tables, charts, and dense text

The document numbers got a notable boost from synthetic training data: NVIDIA generated 11.4 million question-answer pairs (~45 billion tokens) from a PDF corpus using their NeMo Data Designer tool, which improved long-document accuracy by 2.19x over the base model. That data pipeline detail matters - it suggests you don't need to scrape the entire web to get strong domain-specific performance.
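A quick back-of-the-envelope calculation shows how large each synthetic example is on average. Both inputs are the rounded figures quoted above, so the result is approximate.

```python
QA_PAIRS = 11_400_000          # 11.4 million pairs, as reported
TOTAL_TOKENS = 45_000_000_000  # ~45 billion tokens, as reported (rounded)

avg_tokens_per_pair = TOTAL_TOKENS / QA_PAIRS
print(f"average tokens per QA pair: {avg_tokens_per_pair:,.0f}")  # roughly 3,947
```

At close to 4,000 tokens each, these are long examples, which would be consistent with pairs that carry substantial document context rather than short trivia-style questions.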

Who This Is Built For

NVIDIA isn't positioning this as a chatbot. Every use case in the announcement is agentic: computer use (interpreting GUI layouts and selecting actions), multi-document analysis, transcription of long meetings, and video understanding for recorded tutorials or screen captures.

Three precision formats ship on day one for different hardware setups - BF16 (the unquantized 16-bit baseline), FP8 (8-bit precision for faster inference), and NVFP4 (4-bit, most compressed, lowest memory footprint). All three are available on Hugging Face now.
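A rough estimate of weight memory for each format follows from the bit widths alone. This sketch assumes every one of the 30 billion parameters is stored at the nominal width, which is a simplification: it ignores quantization scale factors, any layers kept at higher precision, activations, and the KV cache, so real footprints will differ.

```python
TOTAL_PARAMS = 30e9  # total parameter count from the announcement

# Nominal storage per parameter; an assumption that ignores scale-factor overhead.
BYTES_PER_PARAM = {
    "BF16": 2.0,   # 16 bits
    "FP8": 1.0,    # 8 bits
    "NVFP4": 0.5,  # 4 bits
}

def weight_memory_gb(num_params, bytes_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

estimates = {
    name: weight_memory_gb(TOTAL_PARAMS, width)
    for name, width in BYTES_PER_PARAM.items()
}
for name, gb in estimates.items():
    print(f"{name}: ~{gb:.0f} GB")  # BF16 ~60, FP8 ~30, NVFP4 ~15
```

Note that all 30 billion parameters have to be resident in memory even though only about 3 billion are active per pass, so the MoE design cuts compute much more than it cuts memory.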

For teams building pipelines that ingest mixed-media inputs - a recorded sales call alongside its presentation slides, or a product tutorial video paired with documentation - this is one of the few open models that handles that combination without requiring separate specialized models stitched together.

NVIDIA also published the technical report alongside 9 runnable training recipes in their Megatron-Bridge repository, which is unusually thorough for an open-weights release.