What happens when you hand 15 different AI agents a real product design project and trust them to deliver? Chetandesh, a product designer, documented exactly this - running AI agents through the full process of designing a wearable device, from concept through engineering, and writing up where each one failed.
The short answer: they failed often, and in specific ways worth understanding before you build anything similar.
The Failure Patterns That Show Up Repeatedly
The failure modes in multi-agent design work aren't random. They cluster around a few predictable problems.
The first is context loss at handoff points. AI models hold information in a context window - the amount of text they can process at once, roughly comparable to short-term working memory. When one agent's output becomes another agent's input, the second agent sees only what the first one wrote down, usually in compressed form. Nuances in the original design brief that one agent understood clearly can arrive garbled or missing at the next agent in the chain. Engineering constraints established in step three of a workflow don't automatically carry forward to step nine.
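One way to reduce this kind of loss is to carry hard constraints in a structured field that is copied verbatim at every handoff, separate from the free-text summary that each agent rewrites. The sketch below is a minimal illustration of that idea, not something taken from the original write-up. The names `Handoff`, `pass_along`, and `render_prompt` are hypothetical, and so are the example constraint keys and values.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Handoff:
    """Payload passed from one agent to the next (hypothetical structure)."""
    summary: str  # free text written by the previous agent; may be lossy
    constraints: Dict[str, str] = field(default_factory=dict)  # pinned, copied verbatim


def pass_along(previous: Handoff, new_summary: str,
               new_constraints: Optional[Dict[str, str]] = None) -> Handoff:
    """Build the next handoff: the summary is replaced, pinned constraints are kept."""
    merged = dict(previous.constraints)
    for key, value in (new_constraints or {}).items():
        # A later step may add constraints but may not silently change one.
        if key in merged and merged[key] != value:
            raise ValueError(
                f"constraint {key!r} changed from {merged[key]!r} to {value!r}"
            )
        merged[key] = value
    return Handoff(summary=new_summary, constraints=merged)


def render_prompt(handoff: Handoff) -> str:
    """Put the pinned constraints ahead of the summary in the next agent's input."""
    lines = ["Hard constraints (do not violate):"]
    lines += [f"- {key}: {value}" for key, value in sorted(handoff.constraints.items())]
    lines += ["", "Previous step summary:", handoff.summary]
    return "\n".join(lines)
```

With this shape, a constraint such as a hypothetical `max_thickness_mm` set at step three still appears word for word in the prompt at step nine, however much the summary has been rewritten in between. An attempt to overwrite it raises an error instead of passing quietly.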
The second is confident wrong answers. AI models generate fluent, authoritative-sounding output regardless of accuracy. In a design workflow, this means an agent might produce detailed engineering specifications that are physically impossible, or describe material properties that don't match reality. A human engineer would usually catch this quickly. A downstream AI agent often doesn't, and builds on the bad assumption instead.
The third is capability mismatch. Current AI agents are strong at some design tasks - generating concepts, writing documentation, researching comparable products - and genuinely weak at others, including spatial reasoning, understanding manufacturing constraints, and iterating based on tactile or physical feedback. When workflows assign tasks without accounting for these limits, the outputs at the weak spots degrade everything that follows.
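A workflow can make these limits explicit by tagging each task with the kind of work it involves and routing on that tag. The following is a minimal sketch under that assumption. The task categories mirror the strengths and weaknesses listed above, while the tag strings, the route labels, and the `route` and `plan` functions are hypothetical names chosen for illustration.

```python
from typing import List, Tuple

# Task kinds taken from the strengths and weaknesses described above.
STRONG = {"concept_generation", "documentation", "product_research"}
WEAK = {"spatial_reasoning", "manufacturing_constraints", "physical_feedback"}


def route(task_kind: str) -> str:
    """Decide who handles a task, based on what kind of work it is."""
    if task_kind in STRONG:
        return "agent"
    if task_kind in WEAK:
        return "human"
    # Unknown kinds are not assumed to be safe: the agent drafts, a person reviews.
    return "agent_with_review"


def plan(tasks: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Map (task name, task kind) pairs to (task name, route) pairs."""
    return [(name, route(kind)) for name, kind in tasks]
```

Sending unknown task kinds to review rather than straight to an agent is a deliberate default here: a task nobody has classified yet is treated as a possible weak spot until someone says otherwise.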
What This Means for Practical Use
The real takeaway isn't that AI agents don't work. It's that they fail in patterned ways, which means you can design workflows around the failure points.
Put human review at handoff steps where context loss is most likely - particularly when specifications established early in a project need to carry through many stages. Verify engineering or technical outputs against known constraints before feeding them to the next agent in the chain. Don't assign a spatial or physical reasoning task to an agent if you wouldn't trust a text-only description of the 3D object involved.
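The verification step can be as simple as a range check that sits between two agents and refuses to forward a specification that breaks a known limit. The sketch below assumes the agent's output has already been parsed into numeric fields. `Limit`, `check_spec`, and `gate` are hypothetical names, and any limits passed in would come from your own project rather than from the original write-up.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Limit:
    """Allowed range for one numeric field (inclusive on both ends)."""
    minimum: float
    maximum: float


def check_spec(spec: Dict[str, float], limits: Dict[str, Limit]) -> List[str]:
    """Return a list of problems; an empty list means every known limit is met."""
    problems = []
    for name, limit in limits.items():
        if name not in spec:
            problems.append(f"{name}: missing from agent output")
        elif not (limit.minimum <= spec[name] <= limit.maximum):
            problems.append(
                f"{name}: {spec[name]} outside [{limit.minimum}, {limit.maximum}]"
            )
    return problems


def gate(spec: Dict[str, float], limits: Dict[str, Limit]) -> Dict[str, float]:
    """Forward the spec unchanged, or stop the chain so a person can look at it."""
    problems = check_spec(spec, limits)
    if problems:
        raise ValueError("hold for human review: " + "; ".join(problems))
    return spec
```

A check like this only catches violations of limits someone has written down. It does not judge whether a design is good, so it complements human review at handoff points rather than replacing it.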
This kind of practitioner write-up is more useful than most AI benchmark papers. Benchmarks test models on clean, standardized problems with measurable outputs. Real design projects involve ambiguous briefs, interdependent decisions, and constraints that emerge mid-process.
Knowing in advance where the agent chain breaks is the difference between building a useful system and building one that quietly produces wrong outputs until something fails downstream.