
AI Coding Tools: 65% More Usage, Only 10% More Merged Code

March 14, 2026

What if AI coding tools are getting dramatically better at passing tests and only marginally better at real work?

A 15-month longitudinal study tracking 400 companies from November 2024 through February 2026 found that AI tool usage among developers increased 65%. Pull request throughput, the actual code that makes it into production, increased just 10%.
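To put the two headline figures side by side, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that usage and throughput can each be treated as an index starting at 1.0 at the beginning of the study; the study summary above does not say the two measures are directly comparable, and the function name is hypothetical.

```python
def usage_vs_throughput(usage_growth: float, throughput_growth: float) -> dict:
    """Compare growth in AI tool usage with growth in merged output.

    Both arguments are fractional increases (0.65 means +65%).
    Assumption: each measure is an index that started at 1.0.
    """
    usage_index = 1.0 + usage_growth
    throughput_index = 1.0 + throughput_growth
    return {
        # Merged output per unit of AI usage, relative to the starting point.
        "throughput_per_unit_usage": throughput_index / usage_index,
        # Points of throughput growth per point of usage growth.
        "marginal_ratio": throughput_growth / usage_growth,
    }


# Figures reported in the study: +65% usage, +10% pull request throughput.
gap = usage_vs_throughput(0.65, 0.10)
print(f"{gap['throughput_per_unit_usage']:.2f}")  # 0.67
print(f"{gap['marginal_ratio']:.2f}")             # 0.15
```

Under those assumptions, each point of extra usage came with about 0.15 points of extra merged code, and merged output per unit of usage fell by roughly a third.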

That's a massive gap. And it lines up with a growing body of evidence that AI coding benchmarks are measuring something real but not something that translates cleanly to productivity.

The Benchmark Problem

SWE-bench has become the standard leaderboard for AI coding agents. It measures whether AI-generated code passes automated tests. Scores have climbed steadily. Every major model release trumpets a new SWE-bench high.

But passing tests and shipping code are different activities. Research from METR (Model Evaluation and Threat Research) found that when actual project maintainers reviewed AI-generated pull requests that had already passed automated tests, roughly 50% were rejected. Not because the code was broken. Because it had wrong code style, inappropriate scope, poor architectural fit, or ignored project conventions.

The code worked. It just wasn't code anyone wanted to merge.
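A small sketch of why a test-pass score overstates mergeable output: if roughly half of test-passing pull requests are rejected on review, the share of attempts that end up mergeable is the pass rate multiplied by the acceptance rate. The 70% pass rate below is a made-up illustrative number, not a reported benchmark score, and the function name is hypothetical; only the roughly 50% acceptance figure comes from the METR finding described above.

```python
def merge_ready_rate(test_pass_rate: float, maintainer_acceptance_rate: float) -> float:
    """Fraction of attempts that pass tests AND would be accepted by maintainers.

    Assumption: acceptance is measured only among test-passing submissions,
    so the two rates multiply.
    """
    for rate in (test_pass_rate, maintainer_acceptance_rate):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rates must be between 0 and 1")
    return test_pass_rate * maintainer_acceptance_rate


# Hypothetical 70% test-pass rate; ~50% acceptance (about half rejected on review).
rate = merge_ready_rate(0.70, 0.50)
print(f"{rate:.2f}")  # 0.35
```

With these illustrative inputs, a headline score of 70% corresponds to about 35% of attempts producing code a maintainer would actually merge.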

Typing Was Never the Hard Part

This points to a fundamental misunderstanding about what makes software engineering difficult. Writing syntactically correct code that passes a test is largely a solved problem for modern AI. Understanding why you'd structure something one way instead of another, anticipating how a codebase will need to evolve, respecting the unwritten norms of a team: that's the actual job.

AI models pattern-match against training data. They can produce code that looks right and runs right. They struggle with the contextual judgment calls that experienced developers make automatically: "This works, but it'll create a maintenance nightmare in six months" or "This approach conflicts with the migration we're planning."

Amazon employees have reported internally that AI coding tools haven't freed them up. Instead, the tools created new work: reviewing, verifying, and correcting AI-generated output. The time saved writing code gets partially eaten by the time spent babysitting it.

What the Numbers Actually Mean

None of this means AI coding tools are useless. A 10% increase in merged code across 400 companies is real output. For individual developers on well-scoped tasks like writing boilerplate, generating tests, or exploring unfamiliar APIs, the productivity gains can be substantial.

But the industry narrative of "10x developer productivity" isn't showing up in the data. Not at the team level. Not at the company level. Not after 15 months of rapidly increasing adoption.

The tools are genuinely useful. The benchmarks are genuinely improving. The gap between those two things should make everyone, especially companies betting their engineering strategy on AI productivity multipliers, pay close attention to what they're actually measuring.
