
More Code, More Problems: Evidence Mounts That AI Agents Hurt Quality

March 23, 2026

Pull request counts are up. Outages are also up. That tension sits at the center of a growing body of evidence that AI coding agents may be making developers faster at producing code while making teams slower at shipping reliable software.

Gergely Orosz, writing in The Pragmatic Engineer, assembled data from several major companies that paints a complicated picture. The headline metrics look great. The results downstream do not.

The Numbers Look Good on Paper

Uber reports that "power users" who use AI tools 20 or more days per month generate 52% more pull requests. CEO Dara Khosrowshahi says 30% of engineers using AI at full speed show productivity changes he's "never ever seen before." Meta now factors AI token usage into performance reviews, flagging engineers with low AI usage and low output as underperformers.

These are the metrics that make investor decks sing. More PRs, more code, more activity.

The Quality Gap

Then there's what actually ships. Amazon's retail division has seen a "trend of incidents" with "high blast radius" tied to AI-assisted changes. One specific case: engineers let Amazon's Kiro AI tool delete and recreate an environment, triggering a 13-hour outage of the AWS cost calculator. Amazon's response was to require senior engineer sign-off on all AI-assisted changes from junior engineers.

Anthropic itself provides an ironic example. A persistent bug on Claude.ai caused the prompt textbox to reset mid-typing whenever subscription data loaded. This affected millions of paying customers daily. Anthropic generates roughly 80% of its production code with Claude. The bug went unnoticed for an extended period.

Dax Raad, CEO of OpenCode, argues that AI agents lower shipping standards, discourage refactoring (why clean up code when the AI can just generate more?), and don't actually accelerate team velocity once you account for time spent reviewing and fixing generated code.

Measuring the Wrong Things

The core problem is a measurement gap. Companies are tracking inputs (PRs merged, tokens consumed, lines written) while ignoring outputs (uptime, bug rates, customer-facing quality). When your performance review rewards AI usage regardless of what that usage produces, you've created an incentive to generate volume over value.

Sentry's CTO has made a similar observation: AI removes the initial barrier to writing code but produces bloated, hard-to-maintain output that compounds into long-term slowdowns.

None of this means AI coding tools are useless. The Uber numbers suggest real productivity gains, at least for experienced developers who know what to ask for and what to reject. The problem is organizational, not technological. When companies treat AI code generation as a pure accelerant without investing equally in review, testing, and quality gates, they trade today's velocity for tomorrow's incidents.

The 13-hour AWS outage is a $50,000 lesson in what happens when "ship faster" outpaces "ship correctly."
