
Gemma 4 Beats Nearly Every Model in Independent Testing at $0.20/Run

April 5, 2026 3 min read

31 billion. That's how many parameters Gemma 4 runs on - roughly one-fifth the size of most frontier models - and it's finishing at the top of nearly every independent benchmark being run right now, losing only to Anthropic's Opus 4.6 and OpenAI's GPT-5.2.

Parameters are the internal numerical settings a model uses to process and generate text. More parameters typically mean more capability, but also more compute and cost. The conventional wisdom has been that you need 100B+ parameters to compete at the frontier. Gemma 4 is breaking that assumption.

The cost side of this is hard to ignore: $0.20 per run via API. That puts Gemma 4 in a completely different pricing bracket from the models it's matching in quality. Frontier model pricing typically lands between $3 and $15 per million tokens (tokens are chunks of text, roughly 3/4 of a word each), so the two figures are only directly comparable once you know how many tokens a run consumes. At $0.20/run, developers running high-volume workloads - content pipelines, customer support bots, document processing - are looking at costs an order of magnitude lower than what they'd pay for comparable output from GPT-4-class models.
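One way to put the per-run and per-token figures on the same footing is to ask how many tokens $0.20 would buy at frontier rates. The sketch below is illustrative arithmetic only. The $3 and $15 prices and the 3/4-word rule of thumb are the figures quoted above; the function names are hypothetical, and a single blended per-token price (no input/output split) is an assumption.

```python
# Illustrative arithmetic only: how many tokens does a fixed budget buy
# at a given per-million-token price? Function names are hypothetical.

def tokens_for_budget(budget_usd: float, price_per_million_usd: float) -> float:
    """Number of tokens a budget covers at a per-million-token price."""
    return budget_usd / price_per_million_usd * 1_000_000


def tokens_to_words(tokens: float, words_per_token: float = 0.75) -> float:
    """Rough word count, using the ~3/4 word per token rule of thumb."""
    return tokens * words_per_token


RUN_PRICE_USD = 0.20  # per-run price quoted for Gemma 4

# Low and high ends of the quoted frontier price range (USD per million tokens).
for frontier_price in (3.0, 15.0):
    tokens = tokens_for_budget(RUN_PRICE_USD, frontier_price)
    words = tokens_to_words(tokens)
    print(
        f"At ${frontier_price:.0f}/M tokens, ${RUN_PRICE_USD:.2f} buys "
        f"about {tokens:,.0f} tokens (~{words:,.0f} words)"
    )
```

At those rates, $0.20 covers roughly 13,000 to 67,000 tokens. A run that consumes more than that would cost more than $0.20 at per-token frontier pricing; a run that consumes far less would not.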

What It Can't Do

Gemma 4 isn't touching the top two. Opus 4.6 and GPT-5.2 remain ahead in the testing being shared across the community - particularly on complex reasoning tasks that require multi-step thinking and precise instruction following. If you're running a legal research tool or anything where accuracy on hard problems is non-negotiable, those models still justify their premium.

But the tier below those two is now Gemma 4's territory. It's beating everything else: Mistral's frontier offerings, Llama's latest releases, other mid-tier models from major labs. The spread in these benchmarks isn't marginal - it's decisive.

Why This Matters for Small Teams

Frontier performance from a small model is not a new goal - it's been the promise of every efficiency-focused release for two years. The difference is that Gemma 4 appears to actually deliver it. Earlier models claiming this positioning (Mistral 7B, Phi-3, Gemma 2) were genuinely impressive for their size but still noticeably weaker than frontier models on complex tasks. The benchmark data here suggests Gemma 4 closes that gap for most real-world use cases.

For a content team using AI for drafts, research summaries, and email copy, the quality difference between Gemma 4 and GPT-5.2 is probably invisible in daily use. Paying $0.20/run instead of $5-10/run is not invisible.

Google has been building toward this with the Gemma line for a couple of years. The open-weights strategy - where the model itself can be downloaded and run locally, not just accessed via API - gives developers who want full data control another reason to consider it. Gemma 4 at 31B is small enough to run on high-end consumer hardware with the right setup, which is a meaningful option for teams with privacy or compliance requirements.
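For a rough sense of what "the right setup" means, the memory needed just to hold the weights is the parameter count times the bytes stored per parameter. The sketch below is a back-of-the-envelope estimate, not a hardware recommendation. It assumes all 31 billion parameters are loaded at once, counts weights only (no KV cache, activations, or runtime overhead), and assumes the listed precisions are available for this model, which the testing described here does not confirm. The function and constant names are hypothetical.

```python
# Back-of-the-envelope estimate of weight memory at different precisions.
# Covers weights only; real usage is higher once the KV cache, activations,
# and runtime overhead are included.

# Bytes stored per parameter at common precisions.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate weight memory in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


GEMMA_4_PARAMS = 31e9  # 31 billion parameters, as reported above

for precision in BYTES_PER_PARAM:
    gb = weight_memory_gb(GEMMA_4_PARAMS, precision)
    print(f"{precision}: ~{gb:.1f} GB for weights alone")
```

Under these assumptions the weights take about 62 GB at 16-bit precision and about 15.5 GB at 4-bit, which is why quantization usually decides whether a model of this size fits on a single consumer GPU.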

This is a benchmark snapshot, not a controlled study, and performance varies significantly by task type. But when independent testing consistently puts a $0.20 model this close to the frontier, that result is worth paying attention to.
