Why AI Video Will Stay Expensive for a Long Time

Text generation costs keep falling. Video generation costs barely budge. The gap is not a temporary optimization problem. It is structural.

When a text model like GPT-4o or Claude generates a response, it works with tokens, compact representations of words and word fragments. A 1,000-word essay might be 1,300 tokens. The model predicts one token at a time, each prediction referencing the tokens before it. The math is intensive but bounded.

Video has no equivalent shortcut. A single second of 1080p video at 30 frames per second contains 30 full images, each with over 2 million pixels, and every pixel carries its own color values. And where a text model only has to stay coherent across a sequence of a few thousand tokens, video frames must stay consistent across all of those pixels: the same face, the same lighting, the same physics, frame after frame. An object that appears in frame 1 needs to look right in frame 90, even if the camera angle has changed.
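The arithmetic behind that comparison can be checked with a short Python sketch. The three color values per pixel (RGB) are an assumption for illustration; the resolution, frame rate, and token count come from the figures above.

```python
# Rough size comparison: one second of raw 1080p video vs. a 1,000-word essay.
WIDTH, HEIGHT = 1920, 1080   # 1080p resolution
FPS = 30                     # frames per second, as in the text
CHANNELS = 3                 # assumption: RGB, one value per channel
ESSAY_TOKENS = 1_300         # the text's estimate for a 1,000-word essay


def pixels_per_frame(width: int = WIDTH, height: int = HEIGHT) -> int:
    """Number of pixels in a single frame."""
    return width * height


def raw_values_per_second(fps: int = FPS, channels: int = CHANNELS) -> int:
    """Number of raw color values in one second of uncompressed video."""
    return pixels_per_frame() * fps * channels


if __name__ == "__main__":
    print(f"pixels per frame:      {pixels_per_frame():,}")
    print(f"raw values per second: {raw_values_per_second():,}")
    ratio = raw_values_per_second() // ESSAY_TOKENS
    print(f"values per essay token: about {ratio:,}")
```

Raw values and tokens are not the same kind of unit, so the ratio is only a sense of scale, not a cost estimate.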

The Dimensionality Problem

Text models compress meaning into a manageable number of dimensions. A token embedding might have 4,096 or 8,192 dimensions. That sounds like a lot until you compare it to video, where the raw data for a single 10-second clip at 1080p and 30 frames per second comes to roughly 620 million pixels, or close to 1.9 billion individual color values. Current video models use techniques like latent diffusion (compressing video into a smaller mathematical space before generating it), but even compressed, the data is orders of magnitude larger than text.
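To see why compression helps but does not close the gap, here is a minimal sketch of the size of a 10-second 1080p clip before and after a latent encoder. The compression factors (8x per spatial axis, 4x in time, 16 latent channels) are hypothetical values chosen for illustration, not the settings of any particular model.

```python
# Sketch: raw vs. latent size of a video clip under ASSUMED compression factors.

def raw_values(width: int, height: int, fps: int, seconds: int,
               channels: int = 3) -> int:
    """Raw color values in an uncompressed clip (assumes RGB)."""
    return width * height * channels * fps * seconds


def latent_values(width: int, height: int, fps: int, seconds: int,
                  spatial_factor: int = 8,     # hypothetical downsampling per axis
                  temporal_factor: int = 4,    # hypothetical downsampling in time
                  latent_channels: int = 16    # hypothetical latent channel count
                  ) -> int:
    """Values in the compressed latent representation of the same clip."""
    latent_w = width // spatial_factor
    latent_h = height // spatial_factor
    latent_frames = (fps * seconds) // temporal_factor
    return latent_w * latent_h * latent_frames * latent_channels


if __name__ == "__main__":
    raw = raw_values(1920, 1080, fps=30, seconds=10)
    latent = latent_values(1920, 1080, fps=30, seconds=10)
    print(f"raw values:    {raw:,}")
    print(f"latent values: {latent:,}")
    print(f"compression:   {raw // latent}x")
```

Under these assumed factors the latent is dozens of times smaller than the raw clip, yet it still holds tens of millions of values for ten seconds of output.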

There is also a temporal coherence problem that text simply does not have. In text, if you write a contradictory sentence, it is annoying but readable. In video, if a character's shirt changes color between frames, the output is unwatchable. Maintaining that consistency across time requires the model to track spatial relationships across every frame, so the computation grows with both the resolution and the length of the clip.
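One way to picture that scaling is to assume the model relates every patch of every frame to every other patch through full self-attention. This is an assumption for illustration; real video models may use cheaper, factorized schemes. Under that assumption the number of pairwise comparisons grows with the square of the total patch count, so doubling the clip length roughly quadruples the work. The function names and patch counts below are hypothetical.

```python
# Sketch: pairwise comparisons under ASSUMED full spatiotemporal self-attention.

def video_tokens(frames: int, patches_per_frame: int) -> int:
    """Total patches ("tokens") the model must keep consistent."""
    return frames * patches_per_frame


def attention_pairs(n_tokens: int) -> int:
    """Token pairs compared by full self-attention (quadratic in n_tokens)."""
    return n_tokens * n_tokens


if __name__ == "__main__":
    patches = 1_000  # hypothetical patch count per latent frame
    for frames in (25, 50, 100):
        n = video_tokens(frames, patches)
        print(f"{frames:>3} frames -> {n:,} tokens -> {attention_pairs(n):,} pairs")
```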

What This Means for Pricing

Sora, Runway, Kling, and other video generation tools charge per second of output, and the prices reflect this compute load. Generating a few seconds of video can cost as much as generating thousands of words of text. Some services have tried subscription models to mask per-use costs, but the underlying compute does not get cheaper just because the billing changes.

The optimistic case is that new architectures will find better compression schemes for video, the same way transformers found a good abstraction for text. But text had a natural tokenization target (words and subwords). Video does not have an obvious equivalent. Researchers are experimenting with approaches like tokenizing motion separately from appearance, but nothing has produced the kind of efficiency jump that would bring video costs in line with text.

For anyone building workflows around AI video, the practical implication is clear: budget for video generation to remain a premium service. Text costs will keep dropping toward commodity pricing. Video costs will drop too, but more slowly, and from a much higher starting point. Plan your pipelines accordingly.