
AI Coding Agents Pass Every Test and Still Ship Broken Software

March 24, 2026 3 min read

Every test passed. The code was clean, well-documented, and covered by a comprehensive test suite. There was just one problem: the software didn't actually do what it was supposed to do.

Developer Eric Mann ran into this exact situation while using AI agents to build tss-ceremony, a terminal animation demonstrating threshold ECDSA signing ceremonies (a cryptographic method where multiple parties must cooperate to produce a single signature). The agents delivered polished key generation scenes, solid cryptographic primitives, and even bonus comparison features. But the core signing ceremony - scenes 5 through 11, the entire reason the project existed - was left as placeholder stubs.

Worse, the signing function used a standard single-key signature method, so one party could produce the signature alone. That defeats the purpose of threshold cryptography. The tests all passed because the agents wrote them against the broken behavior: they confirmed what the code did, not what it was supposed to do.
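To make the failure mode concrete, here is a minimal Python sketch. It is not code from tss-ceremony: every name is hypothetical, and an HMAC stands in for a real signature so the example needs only the standard library. It contrasts a test derived from the code's behavior with one derived from the requirement.

```python
import hashlib
import hmac


def sign_single_key(message: bytes, key: bytes) -> bytes:
    # Stand-in for an ordinary single-key signature (HMAC, NOT real ECDSA).
    return hmac.new(key, message, hashlib.sha256).digest()


def ceremony_sign(message: bytes, shares: list[bytes], threshold: int) -> bytes:
    # Hypothetical broken "threshold" signer: it ignores `threshold`
    # and signs with the first share alone.
    return sign_single_key(message, shares[0])


def mirrors_implementation() -> bool:
    # Test derived from the code's behavior: it passes, and locks in the bug.
    shares = [b"share-1", b"share-2", b"share-3"]
    signature = ceremony_sign(b"hello", shares, threshold=2)
    return signature == sign_single_key(b"hello", shares[0])


def enforces_threshold() -> bool:
    # Test derived from the requirement: one share must not be enough
    # when the threshold is two.
    try:
        ceremony_sign(b"hello", [b"share-1"], threshold=2)
    except ValueError:
        return True  # signer refused, as a threshold scheme should
    return False  # signer produced a signature from a single share
```

In this sketch `mirrors_implementation()` returns True and `enforces_threshold()` returns False. A suite containing only the first kind of check stays green while the requirement goes unmet.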

Locally Perfect, Globally Useless

Mann calls this the "agentic harness problem." AI coding agents optimize for what he terms "local completeness" - they do an excellent job on individual tasks, writing clean functions with good test coverage. But they miss "global coherence," the part where all those components need to connect into something that actually works as a whole.

Anyone who has used Cursor, Claude Code, or GitHub Copilot for more than toy projects has probably felt this. The agent crushes the function you asked for. It writes tests that pass. But it doesn't step back and ask whether this function actually fits into the larger system correctly. It can't tell the difference between a real implementation and a convincing-looking stub, because it's optimizing for the immediate task, not the project's purpose.

This is different from a typical human developer mistake. A junior developer might write buggy code, but they usually attempt the actual feature. Agents deliver professional-quality code for the wrong thing.

Five Guardrails That Actually Help

Mann proposes a set of process-level fixes that go beyond "just review the code more carefully":

Milestone completion gates - Before moving on, verify that every deliverable is a real implementation, not a stub or placeholder.
Explicit data contracts - Define which components produce data and which consume it, so disconnected pieces get caught early.
Integration assertions - Write acceptance criteria that test whether components actually talk to each other, not just whether they work in isolation.
Placeholder tracking - Flag every stub as a blocker, not a TODO to get back to later.
Wiring before polish - Connect the pieces first, then refine them. Agents tend to do the opposite, polishing individual components before confirming they fit together.
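As one possible way to automate part of the list above, the following Python sketch combines placeholder tracking and a data-contract check into a single milestone gate. It is an illustration under stated assumptions, not Mann's tooling: the stub markers, function names, and the producer/consumer dictionaries are all hypothetical and would need adapting to a real project.

```python
import re

# Hypothetical stub markers; adjust to the project's own conventions.
STUB_PATTERN = re.compile(r"TODO|FIXME|placeholder|not implemented", re.IGNORECASE)


def find_placeholders(sources: dict[str, str]) -> list[tuple[str, int, str]]:
    # Return (file name, line number, line text) for each line that looks like a stub.
    hits = []
    for name, text in sources.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            if STUB_PATTERN.search(line):
                hits.append((name, lineno, line.strip()))
    return hits


def unwired(produces: dict[str, set[str]], consumes: dict[str, set[str]]) -> dict[str, set[str]]:
    # Data-contract check: data that is consumed but never produced,
    # and data that is produced but never consumed.
    produced = set().union(*produces.values())
    consumed = set().union(*consumes.values())
    return {
        "missing_producers": consumed - produced,
        "unused_outputs": produced - consumed,
    }


def milestone_gate(
    sources: dict[str, str],
    produces: dict[str, set[str]],
    consumes: dict[str, set[str]],
) -> list[str]:
    # Collect blocking reasons; an empty list means the milestone may close.
    blockers = [
        f"{name}:{lineno}: stub marker: {text}"
        for name, lineno, text in find_placeholders(sources)
    ]
    gaps = unwired(produces, consumes)
    blockers += [f"no producer for '{item}'" for item in sorted(gaps["missing_producers"])]
    blockers += [f"nothing consumes '{item}'" for item in sorted(gaps["unused_outputs"])]
    return blockers


# Example with made-up file contents and contract names.
sources = {
    "scene_05.py": "def render():\n    pass  # TODO: signing round 1\n",
    "keygen.py": "def generate():\n    return 'shares'\n",
}
produces = {"keygen": {"key_shares"}}
consumes = {"signing": {"key_shares", "nonce_commitments"}}

for reason in milestone_gate(sources, produces, consumes):
    print("BLOCKER:", reason)
```

Run in CI, a gate like this would treat the stubbed scene and the missing producer as blockers rather than notes for later. It is a text-level check only, so it complements integration tests that exercise the wired-up components and does not replace them.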

The common thread: standard code review catches syntax issues, logic bugs, and style problems. It's not designed to catch "this entire subsystem is a convincing fake." That requires structural guardrails built into the development process itself.

For teams adopting AI coding agents in real workflows, this is a practical checklist worth stealing. The agents are genuinely good at writing code. The gap is in making sure that code adds up to working software.
