Kiln AI Argues Your AI System Needs a Requirements Layer Before Evals

Most AI teams build evals the same way: throw a dataset at a model, have a single LLM judge score the output on a 1-5 scale, and call it a day. A new blog post from Kiln AI argues this approach has a fundamental problem - not with the testing, but with the step that should come before it.

The argument is borrowed from traditional software engineering, where teams write requirements, acceptance criteria, and specifications before writing tests. AI development mostly skips this. Teams jump from a vague task description straight to vibes-based evaluation, hoping a single holistic judge can assess everything at once.

Quality Is Not One Number

Kiln's core point: quality is not a single metric. It is dozens of independent dimensions, including accuracy, tone, safety, brand voice, formatting, and refusal behavior. These dimensions are largely orthogonal: a response can be perfectly accurate but completely off-brand, or safe but unhelpful. Cramming all of them into one judge prompt means some criteria inevitably get deprioritized. The LLM judge develops its own implicit preferences, and important requirements quietly get ignored.

The fix, according to Kiln, is a specifications layer - individual, single-purpose requirements that each get their own definition, examples, and dedicated evaluator. Instead of asking one judge "is this response good?", you ask separate judges "does this response refuse competitor mentions?" and "does this response match our brand voice?" and "is the factual content accurate?"
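As a rough illustration of the idea (not Kiln's actual API), the sketch below models each specification as its own object with a definition, examples, and a dedicated judge. All names here (Spec, COMPETITORS, evaluate) are hypothetical, and the judges are simple keyword heuristics standing in for what would normally be a separate LLM judge prompt per specification.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Spec:
    """One single-purpose requirement with its own evaluator."""
    name: str
    definition: str
    examples: List[str]
    judge: Callable[[str], bool]  # returns True when the response passes


# Hypothetical competitor names, used only for this illustration.
COMPETITORS = ("acme", "globex")


def refuses_competitor_mentions(response: str) -> bool:
    # Stand-in judge: passes only if no competitor name appears.
    text = response.lower()
    return not any(name in text for name in COMPETITORS)


def matches_brand_voice(response: str) -> bool:
    # Stand-in judge: assumes the brand voice forbids exclamation marks.
    return "!" not in response


SPECS = [
    Spec(
        name="no_competitor_mentions",
        definition="The response must not mention competitors.",
        examples=["I can only speak to our own products."],
        judge=refuses_competitor_mentions,
    ),
    Spec(
        name="brand_voice",
        definition="The response is calm and avoids exclamation marks.",
        examples=["Happy to help with your order."],
        judge=matches_brand_voice,
    ),
]


def evaluate(response: str, specs: List[Spec]) -> Dict[str, bool]:
    # Each spec is judged independently, so one failing dimension
    # cannot hide behind another that passes.
    return {spec.name: spec.judge(response) for spec in specs}


result = evaluate("Acme is cheaper, but we are friendlier.", SPECS)
print(result)
```

The point of the shape is that the result is a per-specification report rather than one blended score, so a response that passes on voice but fails on competitor mentions shows up as exactly that.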

Better Synthetic Data, Better Debugging

This approach has a practical side effect for synthetic data generation (using AI to create test datasets). When each specification is independent, you only need to generate data that triggers one specific behavior at a time. Need to test whether your model correctly refuses requests about competitors? Generate conversations mentioning competitors. Need to test toxicity detection? Generate toxic inputs. The data only needs to exercise the single dimension you are validating, which makes it far cheaper and more reliable to produce.
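A minimal sketch of what targeted generation could look like, under the assumption that each specification gets its own small generator. In practice an LLM would usually write the inputs; fixed templates are used here so the example runs on its own. TEMPLATES, COMPETITORS, and generate_inputs are hypothetical names, not part of any Kiln interface.

```python
import itertools
from typing import Dict, List

# Hypothetical competitor names, used only for this illustration.
COMPETITORS = ["Acme", "Globex"]

# One template set per specification. Each set only needs to trigger
# the single behavior that its specification checks.
TEMPLATES: Dict[str, List[str]] = {
    "no_competitor_mentions": [
        "How do you compare to {competitor}?",
        "Is {competitor} cheaper than you?",
    ],
}


def generate_inputs(spec_name: str) -> List[str]:
    # Cross every template with every competitor name to get test inputs
    # that all exercise the same dimension.
    templates = TEMPLATES[spec_name]
    return [
        template.format(competitor=competitor)
        for template, competitor in itertools.product(templates, COMPETITORS)
    ]


for user_message in generate_inputs("no_competitor_mentions"):
    print(user_message)
```

Because every generated input mentions a competitor, the dataset exercises the refusal behavior and nothing else. A second specification would get its own entry in TEMPLATES rather than sharing this one.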

Debugging also gets simpler. When a holistic eval score drops from 4.2 to 3.8, the number alone does not tell you which dimension regressed. When your "brand voice" specification score drops while everything else holds steady, the problem is obvious.
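To make the debugging benefit concrete, here is a small sketch that compares per-specification scores between two eval runs and reports only the ones that dropped. The scores are assumed to be pass rates between 0 and 1, and the numbers, the tolerance value, and the find_regressions helper are all made up for illustration.

```python
from typing import Dict, Tuple


def find_regressions(
    before: Dict[str, float],
    after: Dict[str, float],
    tolerance: float = 0.05,
) -> Dict[str, Tuple[float, float]]:
    # Report every specification whose score dropped by more than the
    # tolerance, along with its old and new values.
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and before[name] - after[name] > tolerance
    }


# Hypothetical pass rates from two runs of the same eval suite.
before = {"accuracy": 0.92, "brand_voice": 0.90, "safety": 0.99}
after = {"accuracy": 0.91, "brand_voice": 0.62, "safety": 0.99}

print(find_regressions(before, after))
```

With these made-up numbers only brand_voice is reported, since accuracy moved by less than the tolerance and safety did not move at all.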

Kiln AI is an open-source tool for building, evaluating, and optimizing AI systems. Their specifications feature is built into the platform, so the blog post is partly a product pitch. But the underlying argument is sound regardless of which tooling you use. If you are building AI products and your eval process is "one big judge scores everything," breaking that into independent requirements with dedicated evaluators is a concrete improvement worth making.