DeepSeek V4 Pro Matches GPT-5.2 on Agentic Tasks at 17x Lower Cost

May 5, 2026 2 min read

17 times cheaper. That's roughly the cost difference between running DeepSeek V4 Pro and GPT-5.2 on FoodTruck Bench, a benchmark built to test AI on agentic tasks - and the two models scored the same.

An agentic benchmark tests something different from standard AI evaluations. Rather than asking a model to answer a single question, agentic tasks require the model to plan and execute a sequence of steps to reach a goal - more like what an AI agent does in a real workflow than a multiple-choice quiz. These tests are harder to pass through memorization and more predictive of performance when the AI is actually doing work.

The results appeared 10 weeks after GPT-5.2's release. That gap is notable: ten weeks is not a long development cycle. DeepSeek, backed by Chinese quantitative trading firm High-Flyer, has consistently released models that match or approach frontier accuracy at significantly lower inference cost - the per-call price of running the model via API - and this result follows the same pattern.

What a 17x Cost Gap Means in Practice

When you run ChatGPT or GPT-5.2 for a single conversation, per-token pricing barely registers. When you run an AI agent that makes 50 API calls per task, thousands of times a day, cost becomes the primary constraint on whether a product is viable at all.

A 17x price difference changes the math on entire product categories. A developer who builds the same agentic workflow on GPT-5.2 instead of DeepSeek V4 Pro pays roughly 17 times as much for inference. That affects margins, what you can charge customers, and which use cases make sense to build.
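
To make the scaling concrete, here is a rough back-of-the-envelope sketch in Python. Only the 50 calls per task and the 17x ratio come from the figures above. The per-call price and the 1,000 tasks per day are hypothetical placeholders, not published rates from DeepSeek or OpenAI, and the function name is ours.

```python
# Back-of-the-envelope daily cost for an agentic workload.
# ASSUMPTIONS (illustrative only, not real pricing):
#   - HYPOTHETICAL_CHEAP_COST_PER_CALL is a made-up per-call price in USD
#   - TASKS_PER_DAY stands in for "thousands of times a day"
# From the article: 50 API calls per task and a roughly 17x cost ratio.

CALLS_PER_TASK = 50
TASKS_PER_DAY = 1_000                      # hypothetical volume
COST_RATIO = 17                            # approximate ratio reported above
HYPOTHETICAL_CHEAP_COST_PER_CALL = 0.001   # hypothetical USD per API call


def daily_agent_cost(cost_per_call: float,
                     calls_per_task: int = CALLS_PER_TASK,
                     tasks_per_day: int = TASKS_PER_DAY) -> float:
    """Return the total daily API spend in USD for an agent workload."""
    return cost_per_call * calls_per_task * tasks_per_day


cheap_daily = daily_agent_cost(HYPOTHETICAL_CHEAP_COST_PER_CALL)
expensive_daily = daily_agent_cost(HYPOTHETICAL_CHEAP_COST_PER_CALL * COST_RATIO)

if __name__ == "__main__":
    print(f"Cheaper model:  ${cheap_daily:,.2f} per day")
    print(f"Pricier model:  ${expensive_daily:,.2f} per day")
    print(f"Difference:     ${expensive_daily - cheap_daily:,.2f} per day")
```

Whatever per-call price you plug in, the ratio between the two daily totals stays at 17, while the absolute gap grows linearly with call volume. That is why the difference barely registers for a single chat session and dominates for a high-volume agent.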

One Benchmark, One Team

FoodTruck Bench was developed by the same team that published these results - it's not an industry-standard evaluation like MMLU or HumanEval. Independent practitioner benchmarks often capture real-world performance better than academic tests, but they reflect the specific tasks that team cares about. DeepSeek V4 Pro may perform differently on tasks outside that benchmark's scope.

What benchmark choice doesn't affect is the pricing. DeepSeek's API costs are public, as are OpenAI's, and the gap is real regardless of which evaluation you run. For developers who have been defaulting to frontier OpenAI models, results like this make the alternative worth a serious evaluation - not because one benchmark settles the question, but because the cost argument compounds across every project where performance is comparable.
