NVIDIA Cosmos 3 Replaces Four Physical AI Models With One Open Checkpoint

June 1, 2026. Source: Hugging Face Blog

Welcome NVIDIA Cosmos 3: The First Open Omni-model for Physical AI Reasoning and Action

Training a robot used to mean juggling four separate NVIDIA models: Cosmos Predict for world generation, Cosmos Reason for scene understanding, Cosmos Transfer for controlled output, and Cosmos Policy for action generation. Cosmos 3 collapses all four into one open checkpoint.

NVIDIA released Cosmos 3 on June 1 in two sizes: Nano (16 billion total parameters) and Super (65 billion total parameters). Both use a dual-pathway design. One pathway handles autoregressive reasoning (understanding text, images, and video), while the other handles diffusion-based generation, the iterative process of refining noisy data into clean output. The two pathways share a single forward pass rather than operating as separate models.

The architecture uses what NVIDIA calls Mixture-of-Transformers (MoT), which routes each modality (text, images, video, audio, and robot action signals) through its own specialized parameter set while keeping everything inside one unified model.
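
The routing idea can be illustrated with a toy sketch. This is not NVIDIA's implementation: the names `Token`, `ModalityParams`, and `mot_layer`, and all parameter values, are hypothetical, and a scalar stands in for each hidden-state vector. It only shows the one property described above, that each token is processed by the parameter set belonging to its modality inside a single layer.

```python
from dataclasses import dataclass

# Modalities named in the article.
MODALITIES = ("text", "image", "video", "audio", "action")


@dataclass
class Token:
    modality: str
    value: float  # stand-in for a hidden-state vector (assumption)


@dataclass
class ModalityParams:
    """Hypothetical per-modality parameter set: a scale and a bias."""
    scale: float
    bias: float

    def apply(self, x: float) -> float:
        return self.scale * x + self.bias


def make_params() -> dict:
    # Illustrative values only: text gets scale 1, image 2, video 3, ...
    return {m: ModalityParams(scale=float(i + 1), bias=0.0)
            for i, m in enumerate(MODALITIES)}


def mot_layer(tokens: list, params: dict) -> list:
    """Route every token through the parameters of its own modality."""
    out = []
    for tok in tokens:
        if tok.modality not in params:
            raise ValueError(f"unknown modality: {tok.modality}")
        out.append(Token(tok.modality, params[tok.modality].apply(tok.value)))
    return out


# One mixed sequence goes through one layer; routing is by modality tag.
sequence = [Token("text", 1.0), Token("video", 1.0), Token("action", 1.0)]
routed = mot_layer(sequence, make_params())
```

Unlike a learned mixture-of-experts router, the routing here is deterministic: the modality tag alone decides which parameters a token sees.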

Six Task Modes, One Checkpoint

Cosmos 3 handles six distinct configurations from a single set of weights, including text-to-video generation, video-to-text description, forward dynamics (given an action input, predict what the world looks like next), inverse dynamics (given before-and-after video, infer what action caused the change), and full policy modeling (given an image and a text instruction, output both the next video frame and the next action to take).
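
One way to keep the modes straight is to write each one down as an input/output signature. The table below is a reading aid built only from the descriptions in this article; the identifiers (`TASK_MODES`, `modes_producing`, the mode keys) are hypothetical and do not come from any NVIDIA API.

```python
# Each mode maps to (inputs, outputs), using the modality names from the article.
# Only the modes the article spells out are listed.
TASK_MODES = {
    "text_to_video": (("text",), ("video",)),
    "video_to_text": (("video",), ("text",)),
    # The article mentions only the action input for forward dynamics;
    # any additional conditioning is not specified there.
    "forward_dynamics": (("action",), ("video",)),
    "inverse_dynamics": (("video",), ("action",)),
    "policy": (("image", "text"), ("video", "action")),
}


def modes_producing(output: str) -> list:
    """Return the modes whose outputs include the given modality."""
    return [name for name, (_, outs) in TASK_MODES.items() if output in outs]


action_modes = modes_producing("action")
video_modes = modes_producing("video")
```

Read this way, forward and inverse dynamics are mirror images of each other, and policy modeling is the only listed mode that emits two modalities at once.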

The Nano variant runs on workstation-class hardware like the RTX PRO 6000. The Super model requires NVIDIA Hopper or Blackwell datacenter GPUs, putting it in reach of labs and companies doing large-scale synthetic data generation rather than individual developers.
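
A rough back-of-envelope calculation shows why the two sizes land on different hardware tiers. The sketch assumes every one of the total parameters is held in GPU memory and counts weights only, ignoring activations, caches, and framework overhead; the precision options and the function name `weight_memory_gb` are illustrative assumptions, not figures from NVIDIA.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp8": 1}


def weight_memory_gb(total_params_billions: float, dtype: str = "bf16") -> float:
    """Approximate weight footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = total_params_billions * 1e9 * BYTES_PER_PARAM[dtype]
    return total_bytes / 1e9


# Parameter counts as stated in the article.
nano_gb = weight_memory_gb(16)    # Nano: 16 billion total parameters
super_gb = weight_memory_gb(65)   # Super: 65 billion total parameters

print(f"Nano weights at bf16:  ~{nano_gb:.0f} GB")
print(f"Super weights at bf16: ~{super_gb:.0f} GB")
```

Under these assumptions the Super weights alone are about four times the size of the Nano weights, before any working memory for video generation is counted. Actual requirements depend on precision, batch size, and resolution.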

Six Datasets Released Alongside

Along with the model weights, NVIDIA released six synthetic training datasets on Hugging Face: robotics simulation scenes, physical interaction data from Isaac Sim, spatial reasoning problems, digital human motion, autonomous driving scenarios, and warehouse operations footage. These are intended for fine-tuning, as starting points for teams adapting the base model to specific robots, environments, or tasks.

The model integrates into Hugging Face Diffusers via Cosmos3OmniPipeline, and post-training scripts are available in the Cosmos Framework GitHub repository. A full technical paper is available from NVIDIA Research.

For robotics and autonomy teams, the practical argument is straightforward: maintaining four model checkpoints, coordinating their inputs and outputs, and paying inference costs four times over is operationally expensive. Whether a single omni-model actually matches four specialized ones at each subtask is the harder question, and answering it will take production deployments.

Source

Hugging Face Blog: Welcome NVIDIA Cosmos 3: The First Open Omni-model for Physical AI Reasoning and Action
