Claude's Hidden Bugs: How AI-Written Code Passes All Tests Then Breaks at Scale

March 7, 2026

What Happened

A benchmark study from the Mycelium project tested how Claude handles increasingly complex coding tasks across multiple rounds of development. The results are a wake-up call for anyone relying on AI coding assistants for production systems.

Researchers had Claude build an order-processing system in four progressively complex stages, from a simple 3-subsystem checkout pipeline to a 15-subsystem architecture with cross-cutting features like bulk pricing, store credit, and tiered shipping.

The numbers tell the story:

Round 1 (V1, 6 subsystems): Claude introduced 2 latent bugs, including a shipping key mismatch where one module used :shipping-detail and another expected :shipping-groups. All tests passed.
Round 2 (V2, 11 subsystems): Those 2 bugs carried forward, plus 2 new ones appeared. All tests still passed - 235 out of 235 assertions green.
Round 3 (V3, 15 subsystems): The V1 shipping bug finally detonated. 17 test failures across 8 test cases. The system was silently calculating shipping refunds as $0.00 when the correct value was $6.39.

Why did tests miss it for two full rounds? Every defective-return test scenario happened to involve free shipping, so the correct shipping refund was $0.00 and the buggy result matched it by coincidence. It took the addition of tiered shipping in V3 to create execution paths with a non-zero shipping charge, which exposed the bug.
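The failure mode is easy to reproduce in a few lines. The following is a minimal sketch, not the benchmark's code: the function names, the order layout, and the refund rule are assumptions made for illustration. Only the two key names and the $6.39 figure come from the study as reported.

```python
from decimal import Decimal

def build_order(shipping_cost: str) -> dict:
    """Hypothetical producer module: stores shipping data under 'shipping-detail'."""
    return {
        "items": [{"sku": "A1", "price": Decimal("20.00")}],
        "shipping-detail": [{"sku": "A1", "cost": Decimal(shipping_cost)}],
    }

def shipping_refund(order: dict) -> Decimal:
    """Hypothetical consumer module: expects 'shipping-groups' instead.

    The [] default hides the mismatch: a missing key looks like
    'no shipping was charged' rather than an error.
    """
    groups = order.get("shipping-groups", [])
    return sum((g["cost"] for g in groups), Decimal("0.00"))

# Free-shipping scenario: the expected refund is 0.00, so this check passes
# even though the consumer never reads the producer's data.
free = build_order("0.00")
assert shipping_refund(free) == Decimal("0.00")

# Paid-shipping scenario: the correct refund is 6.39, but the consumer
# still returns 0.00 because it looks under the wrong key.
paid = build_order("6.39")
print(shipping_refund(paid))  # prints 0.00
```

A test suite built only from the first scenario stays green indefinitely. The defect becomes visible only when a test supplies a non-zero shipping cost.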

Why It Matters

This benchmark quantifies something many developers have felt but could not prove: AI coding assistants produce code that looks correct, passes tests, and then fails in ways that are extremely hard to predict.

The core problem is context. When Claude generated V1, separate AI agents built different subsystems without shared context. They used slightly different key names for the same data. No single test caught it because the mismatch only changes the result when a non-zero shipping charge flows through the return path, and no test scenario exercised that combination.

At 3 subsystems, this is manageable. At 15 subsystems with 48 possible cell interactions, it becomes impossible for any agent - or human - to hold the full dependency graph in working memory.

The study also tested a schema-enforced approach (Mycelium) where each module declares its inputs, outputs, and dependencies as explicit contracts. That approach caught every bug at build time. Zero latent bugs across all four rounds. The trade-off: roughly 70-75% more lines of code due to manifest declarations.
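To make the idea of declared contracts concrete, here is a rough sketch of a build-time check. It does not use Mycelium's actual manifest format or API, which the article does not show; the manifest layout, the `check_contracts` function, and the module names are all hypothetical.

```python
# Hypothetical manifests: each module declares the keys it requires and produces.
MANIFESTS = [
    {"name": "shipping", "requires": {"items"}, "produces": {"shipping-detail"}},
    {"name": "returns", "requires": {"items", "shipping-groups"}, "produces": {"refund"}},
]

def check_contracts(manifests: list[dict], initial_keys: set[str]) -> list[str]:
    """Walk the pipeline in order and report every required key that neither
    the initial input nor an earlier module provides."""
    available = set(initial_keys)
    errors = []
    for m in manifests:
        missing = m["requires"] - available
        for key in sorted(missing):
            errors.append(f"{m['name']}: required key '{key}' is never produced")
        available |= m["produces"]
    return errors

errors = check_contracts(MANIFESTS, initial_keys={"items"})
for e in errors:
    print(e)  # prints: returns: required key 'shipping-groups' is never produced
```

The check runs before any order is processed, so the key mismatch surfaces without a test scenario that happens to exercise it. The cost is the extra declaration per module, which is the kind of overhead the study measured.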

Our Take

This is not a reason to stop using Claude Code, Cursor, or any other AI coding tool. It is a reason to stop trusting green test suites as proof that AI-generated code is correct.

The real lesson here maps directly to what experienced developers already know: tests verify behavior you thought to check. Structural bugs - where two modules disagree on data shapes - live in the gaps between tests. AI assistants are particularly prone to this because they lack persistent memory of decisions made in earlier sessions.

If you are building anything beyond a prototype with AI coding tools, you need one of two things: either a schema layer that enforces contracts between modules, or a disciplined review process that specifically checks cross-module data contracts after each generation round. Type systems, interface definitions, and explicit schemas are not overhead. They are the safety net that AI-assisted development needs most.
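Even without a dedicated schema layer, a shared type definition gives some of the same protection. The sketch below assumes a hypothetical `Order` type that both modules import. None of these names come from the study.

```python
from dataclasses import dataclass
from decimal import Decimal

@dataclass(frozen=True)
class ShippingGroup:
    sku: str
    cost: Decimal

@dataclass(frozen=True)
class Order:
    # The single agreed-upon name for shipping data; both modules import this type.
    shipping_groups: tuple[ShippingGroup, ...]

def shipping_refund(order: Order) -> Decimal:
    """Hypothetical refund rule: refund the full shipping cost of every group."""
    return sum((g.cost for g in order.shipping_groups), Decimal("0.00"))

order = Order(shipping_groups=(ShippingGroup("A1", Decimal("6.39")),))
print(shipping_refund(order))  # prints 6.39

# A producer that uses the wrong field name fails immediately instead of
# silently yielding an empty result.
try:
    Order(shipping_detail=())  # type: ignore[call-arg]
except TypeError as exc:
    print("rejected:", exc)
```

With a static type checker in the loop, the misnamed field would be flagged before the code runs. Without one, it still fails loudly at construction time rather than producing a plausible-looking $0.00.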

Writing 440 lines of manifest to prevent 5 bugs and 17 test failures is a trade-off most teams should take.
