What if the best way to use an AI coding agent isn't better prompts, but better tests?
Developer Gordon Burgett makes a compelling case for pairing Cucumber - a behavior-driven testing framework that uses plain English specifications - with Claude Code. Instead of describing what you want in a chat message and hoping the agent interprets it correctly, you write formal feature files with "Given... When... Then..." scenarios, then point Claude Code at them and say: implement these. Don't modify the specs.
The Core Idea
Cucumber scenarios break requirements into atomic steps, each of which takes roughly 50-100 lines of code to implement. That's a sweet spot for current AI coding agents - small enough to implement reliably, structured enough to verify automatically. When a test fails, the agent sees the business requirement again alongside the error, which prevents a common failure mode: the agent "fixing" a broken test by changing the test instead of fixing the code.
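To make the re-injection idea concrete, here is a minimal Python sketch. It is not Cucumber's actual runner and not Burgett's tooling: the feature text, the Scenario class, and the parse_scenarios and failure_report helpers are all hypothetical, invented for illustration. It only shows how a scenario splits into atomic steps and how a failure message can carry the plain-English requirement next to the error.

```python
from dataclasses import dataclass

# Keywords that open a step line in a Given/When/Then scenario.
STEP_KEYWORDS = ("Given", "When", "Then", "And", "But")

# Hypothetical feature file content, invented for this example.
FEATURE_TEXT = """\
Feature: Password reset
  Scenario: User requests a reset link
    Given a registered user with email "ada@example.com"
    When the user requests a password reset
    Then a reset email is sent to "ada@example.com"
"""


@dataclass
class Scenario:
    name: str
    steps: list[str]


def parse_scenarios(text: str) -> list[Scenario]:
    """Split feature text into scenarios, each holding its atomic steps."""
    scenarios: list[Scenario] = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Scenario:"):
            scenarios.append(Scenario(line[len("Scenario:"):].strip(), []))
        elif scenarios and line.split(" ", 1)[0] in STEP_KEYWORDS:
            scenarios[-1].steps.append(line)
    return scenarios


def failure_report(scenario: Scenario, failed_index: int, error: str) -> str:
    """Build a message that repeats the requirement alongside the error."""
    lines = [f"Scenario: {scenario.name}"]
    for i, step in enumerate(scenario.steps):
        prefix = ">> " if i == failed_index else "   "  # mark the failing step
        lines.append(prefix + step)
    lines.append(f"Error: {error}")
    return "\n".join(lines)


if __name__ == "__main__":
    scenario = parse_scenarios(FEATURE_TEXT)[0]
    print(failure_report(scenario, 2, "AssertionError: no email sent"))
```

The point of the sketch is the shape of the feedback: the agent never sees a bare stack trace, only an error framed by the requirement it is supposed to satisfy.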
This is a real problem. Anyone who has used Claude Code, Cursor, or similar tools has watched the agent silently weaken a test assertion to make it pass. Locking the specifications in read-only feature files removes that escape hatch.
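One way to check that the specs stayed untouched, sketched here as an assumption and not as the method Burgett describes, is to fingerprint every feature file before the agent runs and compare afterward. This complements read-only file permissions: it detects changes after the fact and does not block them. The function names and the directory layout are hypothetical.

```python
import hashlib
from pathlib import Path


def snapshot_specs(spec_dir: str) -> dict[str, str]:
    """Map each .feature file (path relative to spec_dir) to its SHA-256."""
    root = Path(spec_dir)
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*.feature"))
    }


def changed_specs(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Return the spec files that were added, removed, or edited."""
    names = before.keys() | after.keys()
    return sorted(n for n in names if before.get(n) != after.get(n))
```

A typical use would be to call snapshot_specs before handing control to the agent, call it again when the agent finishes, and reject the run if changed_specs returns anything.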
Real-World Results
Burgett ran two experiments. First, a Rails feature extraction: Claude Code completed the implementation, wrote integration tests, and shipped working code in roughly 2 hours of autonomous operation. Second, a camera firmware project involving WebRTC peer connections (real-time video/audio communication between devices) across 6 features and 16 scenarios. That took about 8 hours of autonomous work.
Neither project required constant babysitting. The developer's job shifted from writing code to writing specifications and reviewing architectural decisions afterward.
A Practical Middle Ground
The approach sits between two extremes that don't work well: giving the agent vague instructions and hoping for the best, and micromanaging every implementation detail through chat. Writing specs is "principal or staff-level thinking," as Burgett puts it, while the implementation is "senior engineer-level work" that the agent handles.
This pattern should generalize beyond Cucumber. Any testing framework that separates the "what" from the "how" - RSpec feature specs, Playwright test descriptions, even well-structured Jest tests - could serve the same purpose. The key insight is that AI agents do better work when requirements are machine-verifiable, not just human-readable.