Every test passed. The code was clean, well-documented, and covered by a comprehensive test suite. There was just one problem: the software didn't actually do what it was supposed to do.
Developer Eric Mann ran into this exact situation while using AI agents to build tss-ceremony, a terminal animation demonstrating threshold ECDSA signing ceremonies (a cryptographic method where multiple parties must cooperate to produce a single signature). The agents delivered polished key generation scenes, solid cryptographic primitives, and even bonus comparison features. But the core signing ceremony (scenes 5 through 11, the entire reason the project existed) was left as placeholder stubs.
Worse, the signing function used a standard single-key signature method, meaning one key holder could sign alone, which completely defeats the purpose of threshold cryptography. The tests all passed because the agents wrote them against the broken behavior.
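To make that failure mode concrete, here is a minimal sketch of how a test written against existing behavior can pass while the requirement goes unchecked. Everything in it is hypothetical and is not taken from tss-ceremony: the `sign` and `verify` functions are invented for illustration, and HMAC stands in for a real signature scheme only to keep the example within the standard library.

```python
import hashlib
import hmac

# Hypothetical stand-in for the flawed signer: it ignores the parties'
# shares and signs with one full key, so no cooperation is required.
def sign(message: bytes, shares: list[bytes], full_key: bytes) -> bytes:
    return hmac.new(full_key, message, hashlib.sha256).digest()

def verify(message: bytes, signature: bytes, full_key: bytes) -> bool:
    expected = hmac.new(full_key, message, hashlib.sha256).digest()
    return hmac.compare_digest(expected, signature)

# A check written against what the code already does.
# It passes, but it says nothing about thresholds.
def check_signature_verifies() -> bool:
    key = b"k" * 32
    sig = sign(b"hello", shares=[b"a", b"b"], full_key=key)
    return verify(b"hello", sig, key)

# A check written against the requirement: signing with no
# participating shares must be refused.
def check_rejects_too_few_shares() -> bool:
    key = b"k" * 32
    try:
        sign(b"hello", shares=[], full_key=key)
    except ValueError:
        return True
    return False
```

Here `check_signature_verifies()` returns `True` and `check_rejects_too_few_shares()` returns `False`. Only the second check, which encodes what the project is for, exposes the problem.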
Locally Perfect, Globally Useless
Mann calls this the "agentic harness problem." AI coding agents optimize for what he terms "local completeness": they do an excellent job on individual tasks, writing clean functions with good test coverage. But they miss "global coherence," the part where all those components need to connect into something that actually works as a whole.
Anyone who has used Cursor, Claude Code, or GitHub Copilot for more than toy projects has probably felt this. The agent crushes the function you asked for. It writes tests that pass. But it doesn't step back and ask whether this function actually fits into the larger system correctly. It can't tell the difference between a real implementation and a convincing-looking stub, because it's optimizing for the immediate task, not the project's purpose.
This is different from a typical human developer mistake. A junior developer might write buggy code, but they usually attempt the actual feature. Agents deliver professional-quality code for the wrong thing.
Five Guardrails That Actually Help
Mann proposes a set of process-level fixes that go beyond "just review the code more carefully":
- Milestone completion gates: Before moving on, verify that every deliverable is a real implementation, not a stub or placeholder.
- Explicit data contracts: Define which components produce data and which consume it, so disconnected pieces get caught early.
- Integration assertions: Write acceptance criteria that test whether components actually talk to each other, not just whether they work in isolation.
- Placeholder tracking: Flag every stub as a blocker, not a TODO to get back to later.
- Wiring before polish: Connect the pieces first, then refine them. Agents tend to do the opposite, polishing individual components before confirming they fit together.
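As a rough illustration of the milestone gate and placeholder tracking items, the sketch below scans source text for stub markers and refuses to pass while any remain. The marker list, the function names, and the file names in the usage note are assumptions made for this example, not part of Mann's write-up.

```python
import re

# Assumed stub markers; adjust to match your codebase's conventions.
STUB_PATTERN = re.compile(
    r"\b(TODO|FIXME|placeholder|NotImplementedError)\b", re.IGNORECASE
)

def find_placeholders(sources: dict[str, str]) -> list[tuple[str, int, str]]:
    """Return (file name, line number, line text) for each line with a stub marker."""
    hits = []
    for name, text in sources.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            if STUB_PATTERN.search(line):
                hits.append((name, lineno, line.strip()))
    return hits

def milestone_gate(sources: dict[str, str]) -> None:
    """Raise if any placeholder remains, so a stub blocks the milestone."""
    hits = find_placeholders(sources)
    if hits:
        report = "\n".join(f"{name}:{lineno}: {text}" for name, lineno, text in hits)
        raise RuntimeError(
            f"milestone blocked by {len(hits)} placeholder(s):\n{report}"
        )
```

Calling `milestone_gate({"scene5.py": "def render():\n    pass  # TODO: signing ceremony\n"})` raises a `RuntimeError` naming `scene5.py:2`. A text scan like this only catches stubs that announce themselves, so it complements the integration checks rather than replacing them.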
The common thread: standard code review catches syntax issues, logic bugs, and style problems. It's not designed to catch "this entire subsystem is a convincing fake." That requires structural guardrails built into the development process itself.
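One such structural guardrail, the explicit data contract, could look something like the following sketch. Each component declares what it produces and consumes, and a check reports anything left unconnected. The `Component` type and the data names in the usage note are hypothetical, chosen only to echo the signing example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Component:
    """A unit of the system plus the named data it produces and consumes."""
    name: str
    produces: frozenset[str] = frozenset()
    consumes: frozenset[str] = frozenset()

def wiring_problems(components: list[Component]) -> list[str]:
    """List every consumed item with no producer and every produced item with no consumer."""
    produced = set().union(*(c.produces for c in components))
    consumed = set().union(*(c.consumes for c in components))
    problems = []
    for c in components:
        for item in sorted(c.consumes - produced):
            problems.append(f"{c.name} consumes '{item}' but nothing produces it")
        for item in sorted(c.produces - consumed):
            problems.append(f"{c.name} produces '{item}' but nothing consumes it")
    return problems
```

For example, if a `signing` component declares that it consumes `nonce_shares` and no component produces them, `wiring_problems` reports that gap before anyone polishes the scenes around it. Final outputs would need a declared consumer, such as a display component, to avoid being flagged.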
For teams adopting AI coding agents in real workflows, this is a practical checklist worth stealing. The agents are genuinely good at writing code. The gap is in making sure that code adds up to working software.