Adding Verification Steps to AI Coding Agents Made Them Worse in 29 Tests

March 30, 2026

A developer spent hundreds of hours and thousands of commits testing whether AI coding agents could check their own work. The answer: they cannot do it reliably, and asking them to try often makes things worse.

Krzysztof Dudek ran 29 controlled experiments across four software projects, including a SaaS application with 70 business processes and roughly 7,500 lines of specification documents. The goal was simple: could structured verification steps - mandatory self-audits, re-reading requirements, pattern-triggered reviews - help AI agents catch their own mistakes?

The Numbers Tell the Story

The results were bleak. Of the seven verification experiments, four scored below the baseline of 1.82 out of 10. The worst approach - forcing the agent to re-read code when certain patterns appeared - scored just 1.4 out of 10. Every structured self-audit attempt backfired.

Meanwhile, the unverified agent had its own problems. On one SaaS project, the agent marked all 10 development phases as "COMPLETED" while 32% of API endpoints lacked input validation, zero error boundaries existed, zero loading states were implemented, and 68% of planned end-to-end tests were missing. In 32,000 lines of code, there was exactly one error logging call.

Binary Rules Work, Judgment Calls Don't

The experiments revealed a pattern that matters for anyone building with AI agents. Simple yes/no rules - "does this table have row-level security?" - hit near-100% compliance. But anything requiring judgment - "does this endpoint need input validation?" - dropped to 30-70% compliance.
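To make the distinction concrete, here is a minimal sketch of a binary rule expressed as a script rather than as an instruction to the agent. It assumes PostgreSQL-style migrations. The function name tables_missing_rls and the regular expressions are illustrative and not taken from Dudek's experiments, and the parsing only handles simple statements.

```python
import re

# Illustrative patterns (an assumption): they cover plain PostgreSQL-style
# CREATE TABLE and ALTER TABLE ... ENABLE ROW LEVEL SECURITY statements only.
CREATE_TABLE = re.compile(
    r'create\s+table\s+(?:if\s+not\s+exists\s+)?([\w."]+)', re.IGNORECASE
)
ENABLE_RLS = re.compile(
    r'alter\s+table\s+(?:if\s+exists\s+)?(?:only\s+)?([\w."]+)'
    r'\s+enable\s+row\s+level\s+security',
    re.IGNORECASE,
)


def _normalize(name: str) -> str:
    """Strip quotes and lowercase so "Users" and users compare equal."""
    return name.replace('"', "").lower()


def tables_missing_rls(sql: str) -> list[str]:
    """Return tables that are created but never get row-level security enabled."""
    created = [_normalize(name) for name in CREATE_TABLE.findall(sql)]
    secured = {_normalize(name) for name in ENABLE_RLS.findall(sql)}
    return [table for table in created if table not in secured]
```

The question "does this table have row-level security?" has a yes/no answer that a script can compute, so the result does not depend on how the agent summarized the specification. No comparable script exists for "does this endpoint need input validation?", which may be part of why compliance drops there.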

Dudek's explanation is that agents perform "lossy compression" on requirements. The agent isn't lying when it marks a phase complete. It genuinely believes it finished, because it compressed the original specification into a simpler mental model and completed that instead. Adding verification steps just asks the same lossy process to review itself, which predictably fails.

What Actually Helped

The experiments used Claude Opus and Cursor. The only approaches that reliably caught missing work were external mechanical checks: Git hooks that reject commits missing certain patterns, code coverage gates, and automated validators that run outside the agent's reasoning loop.
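As a rough illustration of the first kind of check, the sketch below shows the core of a Git pre-commit hook that rejects commits missing a required pattern. The rule itself is hypothetical: the api/ path, the route decorator pattern, and the validate_input function are placeholders, not details from the experiments. Only the Git commands (git diff --cached and git show with a leading colon to read the staged version) are standard.

```python
import re
import subprocess

# Hypothetical rule: every staged file under api/ that declares a route must
# also call a validator. Path, decorator, and function names are placeholders.
RULES = [
    {
        "path": re.compile(r"^api/.*\.py$"),
        "trigger": re.compile(r"@\w+\.(route|get|post|put|delete)\("),
        "required": re.compile(r"\bvalidate_input\("),
        "message": "route handler without validate_input() call",
    },
]


def find_violations(files: dict[str, str]) -> list[str]:
    """Apply each rule to {path: content} and return readable failure lines."""
    failures = []
    for path, content in sorted(files.items()):
        for rule in RULES:
            if not rule["path"].search(path):
                continue
            if rule["trigger"].search(content) and not rule["required"].search(content):
                failures.append(f"{path}: {rule['message']}")
    return failures


def staged_files() -> dict[str, str]:
    """Read the staged (index) version of each added, copied, or modified file.

    Simplification: file names with unusual characters are not handled.
    """
    names = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        capture_output=True, encoding="utf-8", errors="replace", check=True,
    ).stdout.splitlines()
    files = {}
    for name in names:
        shown = subprocess.run(
            ["git", "show", f":{name}"],
            capture_output=True, encoding="utf-8", errors="replace", check=True,
        )
        files[name] = shown.stdout
    return files


def main() -> int:
    """Print each failure; a non-zero return value should abort the commit."""
    failures = find_violations(staged_files())
    for failure in failures:
        print(f"pre-commit: {failure}")
    return 1 if failures else 0
```

Saved as an executable .git/hooks/pre-commit script that ends with raise SystemExit(main()), this would make Git refuse the commit whenever a rule fails. The agent's opinion of its own work never enters the decision.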

This matches what a lot of teams are discovering the hard way. Telling an AI agent to "double-check your work" is like asking the person who made an error to proofread their own writing. The blind spots are the same both times. The fix isn't better prompting or more verification steps - it's building guardrails that don't depend on the agent's own judgment.
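A coverage gate is one example of such a guardrail. The sketch below assumes a JSON report with a totals.percent_covered field, which is the shape coverage.py's JSON report is expected to have. The function names and the threshold are illustrative. Tools such as coverage.py also offer a built-in --fail-under option that serves the same purpose.

```python
import json


def coverage_gate(report: dict, minimum_percent: float) -> tuple[bool, str]:
    """Compare the reported total coverage against a fixed minimum.

    Assumes the report looks like {"totals": {"percent_covered": <number>}}.
    """
    percent = report["totals"]["percent_covered"]
    passed = percent >= minimum_percent
    verdict = "PASS" if passed else "FAIL"
    message = f"{verdict}: coverage {percent:.1f}% (minimum {minimum_percent:.1f}%)"
    return passed, message


def gate_from_file(path: str, minimum_percent: float) -> int:
    """Load a JSON coverage report and return a process exit code (0 = pass)."""
    with open(path, encoding="utf-8") as handle:
        report = json.load(handle)
    passed, message = coverage_gate(report, minimum_percent)
    print(message)
    return 0 if passed else 1
```

Run in CI or in a hook, the exit code decides whether the work moves forward. A phase cannot be marked complete on the agent's say-so while the measured number sits below the threshold.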

For teams relying on AI agents for production code, the practical takeaway is clear: trust binary checks, automate validation externally, and stop expecting agents to reliably assess the quality of their own output. An 18-experiment prompt optimization effort produced only marginal gains. Mechanical enforcement produced real ones.
