
Google's Gemini Omni Generates Video From Text, Images, and Audio at Once

May 19, 2026

At I/O 2026, Google announced Gemini Omni, a model that can process text, images, audio, and video simultaneously and turn all of that into generated or edited video through a conversational interface.

Most AI video tools today take a text prompt. Some accept an image. Gemini Omni's approach is different: it reasons across all input types at once, building a unified understanding before generating anything. You could show it a product photo, describe the mood you want verbally, and have a back-and-forth conversation about the final clip before it renders a frame.

The first version to ship is Gemini Omni Flash. Google uses "Flash" to denote its faster, lower-cost model variants, which are optimized for speed over raw capability. Larger Omni variants are expected to follow.

Why the "Reasoning Across Inputs" Framing Matters

When a model reasons across modalities (different input types like text, images, and audio), it isn't handling each one in isolation and stitching the results together afterward. It processes all inputs as a unified whole before generating output. That's the theoretical advantage over chaining tools, such as transcribing audio to text and then feeding that text to a video generator: a transcript keeps the words but drops tone and pacing, so every handoff loses information. A model that sees all the inputs together has fewer handoffs where that can happen.
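To make the chaining point concrete, here is a minimal toy sketch in Python. It does not call any Gemini API and says nothing about how Omni works internally; the `AudioClip` type, the function names, and the file name are all hypothetical, invented for illustration. It only shows that once audio is reduced to a transcript, the later step never sees how the words were spoken.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioClip:
    """Hypothetical stand-in for a spoken instruction."""
    words: str  # what was said
    tone: str   # how it was said, e.g. "calm" or "urgent"
    pace: str   # e.g. "slow" or "fast"


def transcribe(clip: AudioClip) -> str:
    """Chained step 1: speech-to-text keeps only the words."""
    return clip.words


def chained_request(clip: AudioClip, image_name: str) -> dict:
    """Chained pipeline: the video step receives only the transcript."""
    return {"prompt": transcribe(clip), "image": image_name}


def joint_request(clip: AudioClip, image_name: str) -> dict:
    """Joint input: every attribute of the clip reaches the video step."""
    return {
        "prompt": clip.words,
        "tone": clip.tone,
        "pace": clip.pace,
        "image": image_name,
    }


def lost_fields(chained: dict, joint: dict) -> set:
    """Fields available to the joint request but missing from the chained one."""
    return set(joint) - set(chained)


clip = AudioClip(words="show the sneaker at sunrise", tone="calm", pace="slow")
chained = chained_request(clip, "sneaker.jpg")
joint = joint_request(clip, "sneaker.jpg")

print(sorted(lost_fields(chained, joint)))  # ['pace', 'tone']
```

In this toy setup the chained request ends up without the speaker's tone and pace. That is the kind of loss the "reasoning across inputs" framing claims to avoid.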

Whether Gemini Omni actually delivers on that in practice will only become clear once creators start using it at scale. Google has a pattern of demonstrating capabilities at I/O that then take several quarters to reach general availability with consistent results.

How Omni Differs From D-ID, HeyGen, and Synthesia

D-ID, HeyGen, and Synthesia are all established in AI video generation, and InVideo AI has built a strong product around text-to-video for marketers. None of them currently combines all four input types with a conversational editing loop that remembers context across the conversation. That's the specific pitch Google is making with Omni: video editing as a dialogue, not a one-shot prompt you have to rewrite from scratch each time.

Gemini Omni Flash is beginning to roll out now. Two things will determine whether it changes anything for working video creators: how quickly it reaches the broader user base, and whether the conversational editing holds up on real production tasks.
