Training a robot used to mean juggling four separate NVIDIA models: Cosmos Predict for world generation, Cosmos Reason for scene understanding, Cosmos Transfer for controlled output, and Cosmos Policy for action generation. Cosmos 3 collapses all four into one.
NVIDIA released Cosmos 3 on June 1 in two sizes: Nano (16 billion total parameters) and Super (65 billion total parameters). Both use a dual-pathway design. One pathway handles autoregressive reasoning over text, images, and video, while the other handles diffusion-based generation, which iteratively refines noisy data into clean output. The two pathways run in a single forward pass rather than operating as separate models.
The architecture uses what NVIDIA calls Mixture-of-Transformers (MoT), which routes each modality (text, images, video, audio, and robot action signals) through its own specialized parameter set while keeping everything inside one unified model.
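To make the routing idea concrete, here is a toy sketch in plain Python. It assumes the common reading of Mixture-of-Transformers, in which each modality owns its projection and feed-forward weights while self-attention still spans the whole mixed sequence. The class name ToyMoTLayer, the dimensions, and the three-modality subset are illustrative choices, not details of NVIDIA's implementation.

```python
import math
import random

DIM = 4
MODALITIES = ("text", "video", "action")  # a subset of the listed modalities, for brevity


def make_matrix(rng, rows, cols):
    """Small random weight matrix stored as a list of rows."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Matrix-vector product on plain lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class ToyMoTLayer:
    """Hypothetical single layer: per-modality weights, shared attention."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        # One separate parameter set per modality.
        self.params = {
            m: {name: make_matrix(rng, DIM, DIM) for name in ("q", "k", "v", "ffn")}
            for m in MODALITIES
        }

    def forward(self, tokens):
        """tokens: list of (modality, vector) pairs forming one mixed sequence."""
        # Each token is projected with the weights of its own modality.
        q = [matvec(self.params[m]["q"], x) for m, x in tokens]
        k = [matvec(self.params[m]["k"], x) for m, x in tokens]
        v = [matvec(self.params[m]["v"], x) for m, x in tokens]
        outputs = []
        for i, (m, x) in enumerate(tokens):
            # Attention is global: every token attends to all modalities.
            scores = [sum(a * b for a, b in zip(q[i], kj)) / math.sqrt(DIM) for kj in k]
            weights = softmax(scores)
            attended = [sum(w * vj[d] for w, vj in zip(weights, v)) for d in range(DIM)]
            hidden = [a + b for a, b in zip(x, attended)]  # residual connection
            # The feed-forward step is again modality-specific.
            ffn_out = matvec(self.params[m]["ffn"], hidden)
            outputs.append((m, [h + f for h, f in zip(hidden, ffn_out)]))
        return outputs


layer = ToyMoTLayer()
sequence = [
    ("text", [1.0, 0.0, 0.0, 0.0]),
    ("video", [0.0, 1.0, 0.0, 0.0]),
    ("action", [0.0, 0.0, 1.0, 0.0]),
]
for modality, vector in layer.forward(sequence):
    print(modality, [round(value, 3) for value in vector])
```

The point of the sketch is that all three tokens pass through one layer call, yet the same input vector tagged as text or as video would be transformed by different weights.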
Six Task Modes, One Checkpoint
Cosmos 3 handles six distinct configurations from a single set of weights. They include text-to-video generation, video-to-text description, forward dynamics (given an action input, predict what the world looks like next), inverse dynamics (given before-and-after video, infer what action caused the change), and full policy modeling (given an image and a text instruction, output both the next video frame and the next action to take).
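One way to picture the task modes is as a table of input and output modalities over a single model. The sketch below encodes only the five modes named above. The type names, the select_mode helper, and the exact modality sets are illustrative assumptions rather than part of any NVIDIA API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional


class Modality(Enum):
    TEXT = "text"
    IMAGE = "image"
    VIDEO = "video"
    ACTION = "action"


@dataclass(frozen=True)
class TaskMode:
    """Hypothetical description of one task mode: what goes in, what comes out."""
    name: str
    inputs: FrozenSet[Modality]
    outputs: FrozenSet[Modality]


def mode(name, inputs, outputs):
    return TaskMode(name, frozenset(inputs), frozenset(outputs))


# Only the modes named in the article. Forward dynamics presumably also
# conditions on a current observation, but the article mentions only the action.
TASK_MODES = [
    mode("text_to_video", {Modality.TEXT}, {Modality.VIDEO}),
    mode("video_to_text", {Modality.VIDEO}, {Modality.TEXT}),
    mode("forward_dynamics", {Modality.ACTION}, {Modality.VIDEO}),
    mode("inverse_dynamics", {Modality.VIDEO}, {Modality.ACTION}),
    mode("policy", {Modality.IMAGE, Modality.TEXT}, {Modality.VIDEO, Modality.ACTION}),
]


def select_mode(inputs, outputs) -> Optional[TaskMode]:
    """Return the mode whose input/output modalities match exactly, else None."""
    wanted_in, wanted_out = frozenset(inputs), frozenset(outputs)
    for candidate in TASK_MODES:
        if candidate.inputs == wanted_in and candidate.outputs == wanted_out:
            return candidate
    return None


chosen = select_mode({Modality.IMAGE, Modality.TEXT}, {Modality.VIDEO, Modality.ACTION})
print(chosen.name if chosen else "no matching mode")
```

Seen this way, switching tasks means changing which modalities are supplied and requested, not swapping checkpoints.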
The Nano variant runs on workstation-class hardware like the RTX PRO 6000. The Super model requires NVIDIA Hopper or Blackwell datacenter GPUs, putting it in reach of labs and companies doing large-scale synthetic data generation rather than individual developers.
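A rough back-of-the-envelope calculation shows why the two sizes land on different hardware tiers. The sketch below estimates the memory needed for the weights alone from the stated parameter counts. The precisions are assumptions, since the article does not say which formats the checkpoints ship in, and the figures exclude activations, caches, and any other runtime overhead.

```python
# Bytes needed to store one parameter at each (assumed) precision.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp8": 1}

# Total parameter counts as stated in the article.
MODEL_PARAMS = {"Nano": 16_000_000_000, "Super": 65_000_000_000}


def weight_footprint_gb(num_params: int, precision: str) -> float:
    """Memory for the weights only, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


for model_name, num_params in MODEL_PARAMS.items():
    for precision in BYTES_PER_PARAM:
        size_gb = weight_footprint_gb(num_params, precision)
        print(f"{model_name:>5} @ {precision}: {size_gb:.0f} GB")
```

At 16-bit precision this works out to about 32 GB of weights for Nano and about 130 GB for Super, before any runtime overhead is counted.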
Six Datasets Released Alongside
Along with the model weights, NVIDIA released six synthetic training datasets on Hugging Face: robotics simulation scenes, physical interaction data from Isaac Sim, spatial reasoning problems, digital human motion, autonomous driving scenarios, and warehouse operations footage. These are intended as fine-tuning datasets, giving teams a starting point for adapting the base model to specific robots, environments, or tasks.
The model integrates into Hugging Face Diffusers via Cosmos3OmniPipeline, and post-training scripts are available in the Cosmos Framework GitHub repository. A full technical paper is available from NVIDIA Research.
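For readers who want to try the Diffusers route, loading would likely follow the library's usual pattern. The sketch below uses the generic DiffusionPipeline.from_pretrained entry point, which picks the pipeline class declared in the repository's config, so it should resolve to Cosmos3OmniPipeline if the installed Diffusers version ships that class. The helper name load_cosmos3_pipeline is hypothetical, the repository ID is left as a parameter because the article does not give one, and the bfloat16 dtype is an assumption. The arguments for actually running the pipeline are not covered here and should be taken from the official documentation.

```python
def load_cosmos3_pipeline(repo_id: str, device: str = "cuda"):
    """Load a Cosmos 3 checkpoint through Diffusers (hypothetical helper).

    repo_id must be the real Hugging Face repository ID of the checkpoint;
    this sketch deliberately does not guess one.
    """
    # Imported lazily so the function can be defined without a GPU stack installed.
    import torch
    from diffusers import DiffusionPipeline

    # from_pretrained reads the repository's model_index.json and instantiates
    # whichever pipeline class it names.
    pipeline = DiffusionPipeline.from_pretrained(
        repo_id,
        torch_dtype=torch.bfloat16,  # assumed precision, not confirmed by the article
    )
    return pipeline.to(device)
```

Calling the helper downloads the full checkpoint, so the hardware notes above apply before running it.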
For robotics and autonomy teams, the practical argument is straightforward: maintaining four model checkpoints, coordinating their inputs and outputs, and paying inference costs four times over is operationally expensive. Whether a single omni-model actually matches four specialized ones at each subtask is the harder question, and that will take production deployment to answer.