AgentEval Brings Linting and CI to AI Coding Instruction Files

Every serious codebase has tests. APIs have contracts. But the instruction files that tell AI coding agents how to behave? Those have been running on vibes.

AgentEval, a new open-source tool from developer Lukas Metzler, applies static analysis to the growing pile of AI agent configuration files that developers maintain alongside their code. It supports the major formats: CLAUDE.md for Claude Code, .cursorrules and .cursor/rules/*.mdc for Cursor, .github/copilot-instructions.md for GitHub Copilot, and .claude/skills/*/SKILL.md for Anthropic's skills spec.

What It Actually Checks

The linter flags the kind of problems that silently degrade your AI assistant's output:

  • Dead file references: paths in your instructions that point to files that no longer exist on disk
  • Filler phrases that burn context tokens without adding information ("make sure to," "it is important that")
  • Contradictions within the same file, like "always use X" appearing alongside "never use X"
  • Content overlap between multiple instruction files
  • Token budget overruns where your instructions crowd out the actual code the agent needs to read
  • Vague instructions that lack specific, actionable detail
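To make a few of these checks concrete, here is a minimal Python sketch of three of them: filler phrases, "always use X" versus "never use X" contradictions, and dead file references. It is an illustration under stated assumptions, not AgentEval's actual implementation. The function names, the regexes, and the convention that paths appear in backticks are all hypothetical; only the two filler phrases come from the list above.

```python
import re
from pathlib import Path

# The two phrases quoted in the article; a real linter would likely use a longer list.
FILLER_PHRASES = ("make sure to", "it is important that")


def find_filler(text: str) -> list[tuple[int, str]]:
    """Return (line_number, phrase) for every filler phrase found."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        lowered = line.lower()
        for phrase in FILLER_PHRASES:
            if phrase in lowered:
                hits.append((number, phrase))
    return hits


def find_contradictions(text: str) -> list[str]:
    """Return subjects that follow both 'always use' and 'never use'."""
    pattern = re.compile(r"\b(always|never) use ([\w\-]+)", re.IGNORECASE)
    seen: dict[str, set[str]] = {}
    for verb, subject in pattern.findall(text):
        seen.setdefault(subject.lower(), set()).add(verb.lower())
    return sorted(s for s, verbs in seen.items() if verbs == {"always", "never"})


def find_dead_references(text: str, root: Path) -> list[str]:
    """Return backtick-quoted paths (assumed convention) that do not exist under root."""
    candidates = re.findall(r"`([^`\s]+/[^`\s]+)`", text)
    return [path for path in candidates if not (root / path).exists()]
```

Calling these on the contents of a CLAUDE.md, with the repository root as `root`, would yield a list of findings per check. Real contradiction detection is much harder than matching two fixed phrasings, so treat the regex as a stand-in for the idea.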

Running agenteval lint --explain shows the reasoning behind each flagged issue, so you learn why a rule matters and not just that it failed.

More Than a Linter

The tool goes beyond static checks. The harvest command builds benchmark tasks from your git history by identifying AI-assisted commits; the run command then executes agents against those tasks and scores the results. The compare command measures whether changes to your instruction files improved agent performance or made it worse. And the ci command gates regressions before merge, failing your build if scores drop.

This is the pipeline that instruction files have needed: lint, benchmark, compare, gate. The same feedback loop developers already use for code quality, applied to the meta-layer that shapes AI output.
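The compare-and-gate step boils down to a simple idea: score the same tasks before and after an instruction change, and fail if any score dropped. The sketch below shows that idea in Python. The per-task score dictionaries, the `tolerance` parameter, and the function name are assumptions made for illustration; the article does not describe AgentEval's scoring format or thresholds.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.0,
) -> list[str]:
    """Return the tasks whose score dropped by more than `tolerance`.

    Both arguments map a task id to a score (higher is better). A task that
    is missing from `candidate` is treated as scoring 0.0 in this sketch.
    """
    regressions = []
    for task, old_score in baseline.items():
        new_score = candidate.get(task, 0.0)
        if old_score - new_score > tolerance:
            regressions.append(task)
    return sorted(regressions)


def gate_exit_code(regressions: list[str]) -> int:
    """Map the result to a process exit code: nonzero fails the build."""
    return 1 if regressions else 0
```

A CI job would compute the two score sets, call `find_regressions`, and exit with `gate_exit_code(...)`. A small nonzero tolerance is a common way to keep noisy agent runs from failing builds spuriously.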

AgentEval ships as a self-contained binary built in TypeScript on Bun, requiring no Node.js runtime. It's MIT-licensed and currently at v0.7.3. The project is very early (67 commits, minimal community adoption so far), but the concept addresses a real gap. As AI coding instruction files grow longer and more complex, the question of whether they actually work has been mostly unanswered. This is a first serious attempt at making those files testable.