A 3% drop in success rate. That's what happened when researchers gave AI coding agents auto-generated context files like AGENTS.md, according to a new study from ETH Zurich.
The research team - Thibaud Gloaguen, Niels Mündler, Mark Müller, Veselin Raychev, and Martin Vechev - built AGENTbench, a dataset of 138 real-world Python tasks pulled from niche repositories. They tested how AI coding agents performed with three setups: no context file, an LLM-generated context file, and a human-written context file.
The results challenge a practice that's become standard in AI-assisted development.
The Numbers Don't Lie
LLM-generated context files (the kind you get when you ask an AI to write its own AGENTS.md) reduced task success rates by about 3% compared to having no context file at all. Human-written files did better, showing roughly a 4% improvement, but that gain came at a steep cost.
Both types of context files increased inference costs significantly. Auto-generated files pushed costs up over 20%, while human-written ones added up to 19%. The agents also took more steps to complete tasks regardless of which file type they received.
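To see why a small accuracy gain can still be a bad trade, here is a rough back-of-the-envelope sketch of cost per solved task. The baseline success rate (50%) and per-attempt cost ($1.00) are made-up numbers for illustration, not figures from the study. The sketch also assumes the success changes are absolute percentage points and the cost changes are relative increases; the function name and parameters are hypothetical.

```python
def cost_per_success(baseline_success: float, baseline_cost: float,
                     success_delta: float, cost_increase: float) -> float:
    """Expected spend per solved task.

    success_delta is an absolute change in success rate (0.04 = +4 points);
    cost_increase is a relative change in cost per attempt (0.20 = +20%).
    Both interpretations are assumptions, not statements from the study.
    """
    success = baseline_success + success_delta
    cost = baseline_cost * (1 + cost_increase)
    return cost / success


# Hypothetical baseline: half of all tasks solved at $1.00 per attempt.
BASELINE_SUCCESS = 0.50
BASELINE_COST = 1.00

# (success delta, cost increase) using the figures quoted above.
scenarios = {
    "no context file": (0.00, 0.00),
    "LLM-generated file": (-0.03, 0.20),
    "human-written file": (0.04, 0.19),
}

results = {
    name: cost_per_success(BASELINE_SUCCESS, BASELINE_COST, delta, increase)
    for name, (delta, increase) in scenarios.items()
}

for name, value in results.items():
    print(f"{name}: ${value:.2f} per solved task")
```

Under these assumed baselines, the three setups come out to roughly $2.00, $2.55, and $2.20 per solved task. A different baseline shifts the numbers, so plug in your own before drawing conclusions.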
Following Instructions Too Well
The root cause is almost funny: the agents were too obedient. When given a context file full of best practices and architectural guidance, agents diligently followed every instruction. They ran more tests, read more files, executed more searches, and performed more code-quality checks than the task actually required.
In other words, telling an AI agent "here's how this project works" caused it to do more busywork, not better work.
What Should Go in a Context File
The researchers recommend keeping context files minimal. Their specific advice: skip architectural overviews and repository structure explanations entirely. These didn't reduce the time agents spent discovering files on their own, and they bloated the prompt with information the agent would have figured out anyway.
What does belong in a context file? Only details the agent genuinely cannot infer: custom build commands, specialized tooling configurations, non-standard testing frameworks, or project-specific conventions that break from common patterns.
This tracks with practical experience. A line like "run uv sync before testing" saves an agent real confusion. A paragraph explaining your MVC architecture does not.
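If you want a quick check on an existing context file, a minimal sketch like the one below can flag sections that look like architecture or layout overviews. The keyword list, the function name, and the sample file are all assumptions for illustration. This is a crude heuristic that only looks at markdown-style headings, not anything the researchers published.

```python
import re

# Hypothetical keywords for headings that suggest an architecture or
# repository-layout overview. Tune this list for your own project.
OVERVIEW_PATTERN = re.compile(
    r"architecture|overview|repository structure|project structure|directory layout",
    re.IGNORECASE,
)


def flag_overview_sections(text: str) -> list[str]:
    """Return markdown headings that look like architecture or layout overviews."""
    flagged = []
    for line in text.splitlines():
        stripped = line.strip()
        # Only inspect heading lines (those starting with '#').
        if stripped.startswith("#") and OVERVIEW_PATTERN.search(stripped):
            flagged.append(stripped.lstrip("#").strip())
    return flagged


# Made-up sample context file for demonstration.
SAMPLE = """\
# AGENTS.md
## Setup
Run `uv sync` before testing.
## Architecture overview
The app follows MVC with controllers in app/controllers.
## Repository structure
src/ holds the code, tests/ holds the tests.
"""

flagged = flag_overview_sections(SAMPLE)
print(flagged)  # ['Architecture overview', 'Repository structure']
```

The "Setup" section survives because it holds a command the agent could not guess. The two flagged sections are candidates for deletion.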
There's a fair counterargument the study doesn't fully address: all 138 tasks came from public GitHub repositories. The training data for these models likely includes those repos or similar ones. In a proprietary codebase with domain-specific patterns, unusual naming conventions, or internal APIs, context files might prove far more valuable because the agent has no prior exposure to fall back on.
For now, the practical takeaway is clear. If you're maintaining a CLAUDE.md, AGENTS.md, or .cursorrules file, treat it like a .env file - only put in what the tool genuinely can't figure out on its own. Everything else is just running up your API bill.