One Developer Tested an AI Agent Team on 5 Real Projects - Here's What Happened

April 4, 2026 2 min read

41 out of 46 tasks completed overnight without human input. That's the standout result from developer Alexey Grigorev, who built a structured four-role AI agent team and ran it on five real software projects over several weeks.

The setup is not the usual single-model-does-everything approach. Grigorev assigned distinct roles: a Product Manager agent that converts raw task descriptions into formal specifications with user stories and acceptance criteria; a Software Engineer agent that writes code and tests; a QA agent that verifies acceptance criteria and reports pass/fail with evidence; and an On-Call Engineer that monitors CI/CD pipelines and fixes failing builds.

Tasks move through a mandatory pipeline - PM grooming before any code is written, QA sign-off before anything gets committed - enforced through file-based state tracking. Tasks live in folders that change as work progresses: .todo.md, .groomed.md, .in-progress.md, then done/. No database, no external tools, just the filesystem.
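The file-based state tracking described above can be sketched in a few lines of Python. This is a minimal illustration, not Grigorev's implementation: the article names the suffixes and the done/ folder, but the assumption that state lives in the filename suffix, the `advance` helper, and the way finished files are named inside done/ are all hypothetical.

```python
from pathlib import Path

# Ordered in-flight states, using the suffixes named in the article.
STATES = [".todo.md", ".groomed.md", ".in-progress.md"]


def current_state(task_file: Path) -> str:
    """Return the state suffix of a task file, or raise if it has none."""
    for suffix in STATES:
        if task_file.name.endswith(suffix):
            return suffix
    raise ValueError(f"{task_file.name} has no recognised state suffix")


def advance(task_file: Path) -> Path:
    """Move a task exactly one step forward by renaming its file.

    Hypothetical helper: because it only ever moves to the next state,
    a task cannot jump from .todo.md straight to .in-progress.md.
    """
    state = current_state(task_file)
    stem = task_file.name[: -len(state)]
    index = STATES.index(state)
    if index + 1 < len(STATES):
        target = task_file.with_name(stem + STATES[index + 1])
    else:
        # Last in-flight state: finished work lands in done/ (assumed layout).
        done_dir = task_file.parent / "done"
        done_dir.mkdir(exist_ok=True)
        target = done_dir / (stem + ".md")
    task_file.rename(target)
    return target
```

Calling `advance` three times on a file such as `login.todo.md` would walk it through `login.groomed.md` and `login.in-progress.md` into `done/login.md`. The appeal of the design is that the directory listing is the task board: any agent, or a human, can see the state of every task with `ls`.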

Five Projects, Concrete Results

The overnight 41/46 completion came from building a website for AI Shipping Labs - a well-scoped project with clearly defined tasks. The other four projects were DataTasks (a serverless AWS Lambda/DynamoDB task tracker), Merm (a pure Python library for rendering Mermaid diagrams, published as open-source), Rustkyll (a Jekyll static site generator rewritten in Rust, still in progress after three-plus weeks), and Codehive (a meta-project: a coding orchestrator that itself enforces the same methodology).

The results are uneven across projects, and the write-up is honest about that. A website with well-defined visual tasks is a very different problem from rewriting a static site generator in a language that models have seen less training data for. Grigorev doesn't claim otherwise.

Where the System Broke

Agents would stop mid-task and request permission rather than continue autonomously. Without enforcement mechanisms, they'd skip the PM grooming step and go straight to writing code - undermining the entire point of the role structure. The explicit pipeline helps, but agents treated it as optional when they could get away with it.
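One way to make the grooming step non-optional is a hard gate that runs before the engineer role is allowed to start. The sketch below is an assumption about how such a check could look, not a description of Grigorev's enforcement mechanism: `PipelineViolation`, `require_groomed`, the `<task_id>.groomed.md` naming, and the "acceptance criteria" text check are all hypothetical.

```python
from pathlib import Path


class PipelineViolation(Exception):
    """Raised when a role tries to act on a task that skipped a step."""


def require_groomed(task_dir: Path, task_id: str) -> Path:
    """Return the groomed spec for a task, or refuse to continue.

    Hypothetical gate: the orchestrator would call this before handing
    the task to the Software Engineer agent.
    """
    spec = task_dir / f"{task_id}.groomed.md"
    if not spec.exists():
        raise PipelineViolation(f"{task_id}: no groomed spec, PM step was skipped")
    text = spec.read_text(encoding="utf-8").lower()
    # Crude content check (assumed convention): the PM agent is expected
    # to write an acceptance-criteria section into the spec.
    if "acceptance criteria" not in text:
        raise PipelineViolation(f"{task_id}: spec has no acceptance criteria")
    return spec
```

The point of raising an exception rather than logging a warning is that the check lives outside the agent's prompt, so the agent cannot treat it as optional.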

The harder constraint: Claude Code usage limits ran out fast. Running four agents through a multi-step pipeline across dozens of tasks is heavy usage, and hitting limits stalls the entire system. Grigorev flags this as significant friction without detailing how often it happened.

Visibility into subagent progress was also poor. When something broke inside a subagent, diagnosing it required manually reading logs. Multi-agent systems without built-in observability are slow to debug.

The core finding holds up: assigning explicit roles and enforcing a defined process produces more complete work than a single free-form session. Agents that know their role and can't skip steps produce better output. But "better" still requires active human supervision - this is not set-and-forget automation.

The highest-value piece of the whole system is probably the PM agent creating formal specifications before any code gets written. That pattern is worth adopting even without the full multi-agent setup.
