31 billion. That's how many parameters Gemma 4 has - roughly one-fifth the size of most frontier models - and it's finishing near the top of nearly every independent benchmark being run right now, behind only Anthropic's Opus 4.6 and OpenAI's GPT-5.2.
Parameters are the internal numerical settings a model uses to process and generate text. More parameters typically mean more capability, but also more compute and cost. The conventional wisdom has been that you need 100B+ parameters to compete at the frontier. Gemma 4 is breaking that assumption.
The cost side of this is hard to ignore: $0.20 per run via API. That puts Gemma 4 in a completely different pricing bracket than the models it's matching in quality. Frontier model pricing typically lands between $3 and $15 per million tokens (tokens are chunks of text, roughly 3/4 of a word each). At $0.20/run, developers running high-volume workloads - content pipelines, customer support bots, document processing - are looking at costs an order of magnitude lower than what they'd pay for comparable output from GPT-4-class models.
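The per-run and per-million-token figures above are in different units, so comparing them needs an assumption about how many tokens a run consumes. Here is a minimal sketch of that conversion. The run sizes are hypothetical, and a single blended token price is a simplification (APIs commonly price input and output tokens separately); only the $0.20 and $3-$15 figures come from the text.

```python
def cost_per_run(price_per_million_tokens: float, tokens_per_run: int) -> float:
    """Convert a per-million-token price into a per-run cost in dollars."""
    return price_per_million_tokens * tokens_per_run / 1_000_000


def breakeven_tokens(flat_price_per_run: float, price_per_million_tokens: float) -> float:
    """Tokens per run at which token-based pricing equals a flat per-run price."""
    return flat_price_per_run / price_per_million_tokens * 1_000_000


if __name__ == "__main__":
    FLAT_PRICE = 0.20              # per-run price quoted in the article
    FRONTIER_PRICES = (3.0, 15.0)  # per-million-token range quoted in the article
    # Hypothetical run sizes, chosen only for illustration.
    for tokens in (10_000, 100_000, 1_000_000):
        low, high = (cost_per_run(p, tokens) for p in FRONTIER_PRICES)
        print(f"{tokens:>9,} tokens/run: ${low:.2f} to ${high:.2f} vs ${FLAT_PRICE:.2f} flat")
```

The ratio is sensitive to run size, so it is worth plugging in a workload's actual token counts before budgeting around either number.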
What It Can't Do
Gemma 4 isn't touching the top two. Opus 4.6 and GPT-5.2 remain ahead in the testing being shared across the community - particularly on complex reasoning tasks that require multi-step thinking and precise instruction following. If you're running a legal research tool or anything where accuracy on hard problems is non-negotiable, those models still justify their premium.
But the tier below those two is now Gemma 4's territory. It's beating everything else: Mistral's frontier offerings, Meta's latest Llama releases, other mid-tier models from major labs. The spread in these benchmarks isn't marginal - it's decisive.
Why This Matters for Small Teams
"Small model, frontier performance" is not a new goal - it's been the promise of every efficiency-focused release for two years. The difference is that Gemma 4 appears to actually deliver it. Earlier models claiming this positioning (Mistral 7B, Phi-3, Gemma 2) were genuinely impressive for their size but still noticeably weaker than frontier models on complex tasks. The benchmark data here suggests Gemma 4 closes that gap for most real-world use cases.
For a content team using AI for drafts, research summaries, and email copy, the quality difference between Gemma 4 and GPT-5.2 is probably invisible in daily use. Paying $0.20/run instead of $5-10/run, a 25x to 50x difference, is not invisible.
Google has been building toward this with the Gemma line for a couple of years. The open-weights strategy - where the model itself can be downloaded and run locally, not just accessed via API - gives developers who want full data control another reason to consider it. At 31B parameters, Gemma 4 is small enough to run on high-end consumer hardware with the right setup, which is a meaningful option for teams with privacy or compliance requirements.
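As a rough check on the local-hardware claim, the memory needed just to hold the weights is the parameter count times the bytes per parameter. The back-of-envelope sketch below assumes a few common weight precisions. The precision labels are illustrative, not a statement about which formats Gemma 4 actually ships in, and the estimate ignores activations, KV cache, and runtime overhead, all of which add to the total.

```python
PARAMS = 31_000_000_000  # parameter count from the article

# Bytes per parameter at common precisions (assumed for illustration).
BYTES_PER_PARAM = {
    "16-bit (fp16/bf16)": 2.0,
    "8-bit": 1.0,
    "4-bit": 0.5,
}


def weight_memory_gb(params: int, bytes_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 10**9 bytes)."""
    return params * bytes_per_param / 1e9


for label, width in BYTES_PER_PARAM.items():
    print(f"{label:<20} ~{weight_memory_gb(PARAMS, width):.1f} GB")
```

By this estimate the weights alone take about 62 GB at 16-bit, 31 GB at 8-bit, and 15.5 GB at 4-bit, which is why lower-precision weights are the usual route to fitting a model of this size on consumer hardware.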
This is a benchmark snapshot, not a controlled study, and performance varies significantly by task type. But when independent testing consistently puts a $0.20 model this close to the frontier, that result is worth paying attention to.