Three AI coding models score within about one percentage point of each other on test pass rates. But look at the code they actually write, and one of them matches what a human developer shipped nearly twice as often as the weakest of the three.
That's the central finding from a new analysis by benchmarking platform Stet, which ran 87 real-world coding tasks pulled from three open-source repositories (Zod, graphql-go-tools, and sqlparser-rs) against three GPT models. Each task came from an actual merged pull request, and the models had to reproduce the fix given only the pre-merge repo state and task instructions.
Nearly Identical Pass Rates, Wildly Different Code
The pass rates look like a dead heat:
- GPT-5.1-codex-mini: 77/87 (88.5%)
- GPT-5.3-codex: 78/87 (89.7%)
- GPT-5.4: 78/87 (89.7%)
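As a quick check on how close these figures are, the percentages can be recomputed from the raw counts. This is a minimal sketch: the counts come from the list above, while the variable and function names are illustrative and not part of Stet's tooling.

```python
# Pass counts as reported in the list above; 87 tasks in total.
TOTAL_TASKS = 87
passed = {
    "GPT-5.1-codex-mini": 77,
    "GPT-5.3-codex": 78,
    "GPT-5.4": 78,
}


def pass_rate(passed_count: int, total: int) -> float:
    """Return the pass rate as a percentage."""
    return 100.0 * passed_count / total


rates = {model: pass_rate(n, TOTAL_TASKS) for model, n in passed.items()}

# Gap between the best and worst model, in percentage points.
spread = max(rates.values()) - min(rates.values())

for model, rate in rates.items():
    print(f"{model}: {rate:.1f}%")
print(f"Spread: {spread:.2f} percentage points")
```

The whole spread comes down to a single task out of 87.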
If you stopped here, as most AI coding benchmarks like SWE-bench and Terminal-Bench do, you'd conclude these models are basically interchangeable. But Stet went further, scoring the 75 tasks where all three models passed on metrics that matter to working engineers: how closely the output matched the human-written solution, whether it would survive code review, and how much unnecessary code it touched.
| Metric | GPT-5.1-codex-mini | GPT-5.3-codex | GPT-5.4 |
|---|---|---|---|
| Equivalence to human PR | 24.0% | 38.7% | 45.3% |
| Code review pass rate | 9.3% | 8.0% | 16.0% |
| High-risk footprint | 12.0% | 9.3% | 8.0% |
| Cost per task | $1.98 | $5.23 | $1.34 |
GPT-5.4's code matched the human solution 45.3% of the time, nearly double the 24.0% from codex-mini. It also passed code review about 1.7 times as often (16.0% vs. 9.3%), had a smaller high-risk footprint (8.0% vs. 12.0%), and cost less per task ($1.34 vs. $1.98).
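The ratios behind these comparisons can be worked out directly from the table. The sketch below uses only the published percentages and costs. The implied task counts rest on an assumption, namely that each percentage is a simple share of the 75 tasks all three models passed. The names are illustrative.

```python
SCORED_TASKS = 75  # tasks that all three models passed

# Figures as reported in the table above.
metrics = {
    "equivalence": {"GPT-5.1-codex-mini": 24.0, "GPT-5.3-codex": 38.7, "GPT-5.4": 45.3},
    "review_pass": {"GPT-5.1-codex-mini": 9.3, "GPT-5.3-codex": 8.0, "GPT-5.4": 16.0},
    "high_risk": {"GPT-5.1-codex-mini": 12.0, "GPT-5.3-codex": 9.3, "GPT-5.4": 8.0},
    "cost_usd": {"GPT-5.1-codex-mini": 1.98, "GPT-5.3-codex": 5.23, "GPT-5.4": 1.34},
}


def ratio(metric: str, model_a: str, model_b: str) -> float:
    """Return model_a's value divided by model_b's value for one metric."""
    return metrics[metric][model_a] / metrics[metric][model_b]


def implied_count(pct: float, total: int = SCORED_TASKS) -> int:
    """Assumption: the percentage is a plain share of `total` tasks."""
    return round(pct / 100.0 * total)


best, small = "GPT-5.4", "GPT-5.1-codex-mini"
print(f"Equivalence ratio: {ratio('equivalence', best, small):.2f}x")
print(f"Review pass ratio: {ratio('review_pass', best, small):.2f}x")
print(f"Cost ratio:        {ratio('cost_usd', best, small):.2f}x")
print(f"Implied equivalent tasks for {best}: "
      f"{implied_count(metrics['equivalence'][best])} of {SCORED_TASKS}")
```

On a base of 75 tasks, each task is worth about 1.3 percentage points, so a gap of one or two tasks moves these metrics noticeably.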
What This Means for Teams Picking AI Coding Tools
This gap matters because most teams evaluating AI coding assistants lean heavily on pass/fail metrics. "Does the code work?" is the easy question to answer at scale. "Is the code good?" requires human judgment, or at least a more sophisticated evaluation framework.
The practical consequence: a team might choose a cheaper or faster model based on benchmark scores, then spend more time in code review cleaning up the output. The tests pass, but the pull request still needs significant rework before merging. That hidden cost doesn't show up in any leaderboard.
Stet's methodology isn't perfect. Eighty-seven tasks across three repos is a small sample, the quality metrics cover only the 75 tasks that all three models passed, and "equivalence to human PR" assumes the original developer's approach was optimal. But the directional finding is hard to argue with: pass rates alone give a dangerously incomplete picture of AI coding quality.
For anyone shopping for AI coding tools right now, the takeaway is straightforward: don't trust pass rate comparisons in isolation. Ask how the tool performs on code review acceptance, unnecessary changes, and solution equivalence. Those are the metrics that predict whether AI-generated code actually saves you time or just shifts the work from writing to reviewing.