DaVinci-MagiHuman: Open-Source 15B Model Generates Talking Head Video in 2 Seconds

A new open-source model called daVinci-MagiHuman can generate a 5-second lip-synced talking head video in 2 seconds at 256p on a single H100 GPU. The model, jointly developed by Sand.ai and SII-GAIR Lab, ships under Apache 2.0, meaning anyone can download, modify, and use it commercially, subject only to the license's attribution and notice requirements.

The 15-billion-parameter model stands out for a specific architectural choice: it processes text, video, and audio together inside a single unified transformer (a type of neural network architecture), rather than using separate models for each and stitching the results together afterward. The practical upshot is that lip movements and facial dynamics are generated in sync with audio from the start, rather than being corrected in post-processing.

Speed and Quality Numbers

The inference speed (how fast it produces output) scales with resolution:

  • 256p: 2.0 seconds for a 5-second clip
  • 540p: 8.0 seconds
  • 1080p: 38.4 seconds

All benchmarks were run on a single NVIDIA H100 GPU using the distilled version of the model, which needs only 8 denoising steps and no classifier-free guidance. That combination is how it hits those speeds.
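To put those timings in perspective, it helps to express them as a ratio of video produced to compute time. The sketch below is only arithmetic on the published numbers; the function and variable names are made up for illustration and are not part of the release.

```python
# Published timings for a 5-second clip on a single H100 (from the list above).
CLIP_SECONDS = 5.0
GENERATION_SECONDS = {"256p": 2.0, "540p": 8.0, "1080p": 38.4}


def speed_vs_real_time(clip_seconds: float, generation_seconds: float) -> float:
    """Seconds of video produced per second of compute (above 1.0 beats real time)."""
    return clip_seconds / generation_seconds


def speed_table() -> dict[str, float]:
    """Map each resolution to its real-time ratio, rounded to 3 decimals."""
    return {
        res: round(speed_vs_real_time(CLIP_SECONDS, secs), 3)
        for res, secs in GENERATION_SECONDS.items()
    }


if __name__ == "__main__":
    for res, ratio in speed_table().items():
        label = "faster than" if ratio > 1 else "slower than"
        print(f"{res}: {ratio:.3f}x ({label} real time)")
```

By this measure only the 256p setting runs faster than real time. At 540p and 1080p the model takes longer to generate a clip than the clip lasts.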

In human evaluation across 2,000 pairwise comparisons, daVinci-MagiHuman won 80% of head-to-head matchups against OVI 1.1 and 60.9% against LTX 2.3, two of the strongest existing models in this category. Its word error rate (a measure of how intelligible the generated speech is; lower is better) is 14.60%, compared with 19.23% for LTX 2.3 and a much worse 40.45% for OVI 1.1.
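For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The article does not describe the evaluation pipeline used here, so the following is a minimal sketch of the standard definition only. The example sentences are hypothetical, and `relative_reduction` is a helper introduced for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edits needed to turn the first i-1 reference words
    # into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, candidate: float) -> float:
    """Fraction by which `candidate` lowers the error rate of `baseline`."""
    return (baseline - candidate) / baseline


if __name__ == "__main__":
    # Hypothetical sentences: one substituted word out of six.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
    # Reported WERs from the article, in percent.
    print(f"vs OVI 1.1: {relative_reduction(40.45, 14.60):.1%} fewer errors")
    print(f"vs LTX 2.3: {relative_reduction(19.23, 14.60):.1%} fewer errors")
```

Applied to the reported figures, daVinci-MagiHuman's 14.60% works out to roughly 64% fewer word errors than OVI 1.1 and roughly 24% fewer than LTX 2.3.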

The Full Stack Is Open

The release includes the base model, a distilled (faster) model, a super-resolution model for upscaling, and all inference code. It supports six languages: Chinese (both Mandarin and Cantonese), English, Japanese, Korean, German, and French. You'll need Python 3.12+, PyTorch 2.9+, and an H100 to run it at the published speeds.
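Before downloading the weights, it may be worth confirming that a machine meets the stated software minimums. The sketch below uses only the Python standard library and is not part of the release. The helper names are hypothetical, and it checks versions only, not whether an H100 is present.

```python
import re
import sys
from importlib import metadata

# Minimums quoted in the article.
MIN_PYTHON = (3, 12)
MIN_TORCH = (2, 9)


def parse_major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a version string such as '2.9.1+cu128'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def check_environment() -> list[str]:
    """Return a list of problems; an empty list means the minimums are met."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        found = ".".join(map(str, sys.version_info[:2]))
        problems.append(f"Python {found} found, 3.12+ required")
    try:
        # Reads installed package metadata without importing torch itself.
        torch_version = metadata.version("torch")
    except metadata.PackageNotFoundError:
        problems.append("PyTorch is not installed")
    else:
        if parse_major_minor(torch_version) < MIN_TORCH:
            problems.append(f"PyTorch {torch_version} found, 2.9+ required")
    return problems


if __name__ == "__main__":
    issues = check_environment()
    print("Environment OK" if not issues else "\n".join(issues))
```

Comparing versions as integer tuples avoids the string-comparison trap where "2.10" sorts before "2.9".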

For anyone building products that need AI-generated talking avatars, training videos, or personalized video content, this removes a significant cost barrier. Previously, comparable quality required either expensive API calls to closed services like D-ID or HeyGen, or cobbling together multiple open-source models with manual alignment work. A single Apache 2.0 model that handles text, audio, and video generation in one pass simplifies that stack considerably.

The model weights and code are available on Hugging Face and GitHub under the GAIR-NLP organization.