Benchmarks Show Claude Skills Don't Work the Same Across Opus, Sonnet, and Haiku

"Most skill authors test their skill once, on one model, on a task they already had in mind when writing it. That's not a benchmark, it's confirmation bias."

That line from a new Tessl blog post captures a problem anyone building Claude Code skills should care about. The company ran structured benchmarks testing the same skills across Haiku, Sonnet, and Opus, and the results show just how misleading single-model testing can be.

The Numbers Tell the Story

Tessl tested two types of skills. A domain-specific skill called "fastify-best-practices" produced large gains on all three models, though the size of the improvement shrank as the models got more capable: Haiku jumped from 34% to 89% (a 55-point gain), Sonnet went from 49% to 97% (48 points), and Opus climbed from 58% to 100% (42 points). That looks like a win across the board.

But a general knowledge skill ("nodejs-core") told a different story. The gains were tiny: 4 points for Haiku, 5 for Sonnet, 3 for Opus. The models already knew most of what the skill was trying to teach.

The more troubling finding: some skills actually made performance worse on certain models. Without cross-model testing, you'd never catch it.

What This Means for Skill Builders

Skills in Claude Code are markdown files that give the model domain knowledge and instructions for specific tasks. They're powerful, but this research exposes a real gap in how most people build them.

Tessl recommends a three-tier analysis: measure baseline performance without the skill, then with it, then calculate the delta across all three Claude models. If every model fails even with the skill loaded, the content itself is the problem. If only one model struggles, you may need model-specific adjustments.
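That baseline / with-skill / delta comparison can be sketched in a few lines of Python. This is a minimal illustration, not Tessl's tooling: the `SkillResult` and `diagnose` names and the 80% pass threshold are assumptions made for the example, and the only real data used are the fastify-best-practices scores quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SkillResult:
    """One model's scores for one skill (hypothetical structure)."""
    model: str
    baseline: float    # score without the skill, in percent
    with_skill: float  # score with the skill loaded, in percent

    @property
    def delta(self) -> float:
        # Positive means the skill helped; negative means it hurt.
        return self.with_skill - self.baseline


def diagnose(results, pass_threshold=80.0):
    """Summarize a cross-model run.

    pass_threshold is an assumed cutoff for "good enough", not a
    number taken from the Tessl post.
    """
    regressions = [r.model for r in results if r.delta < 0]
    failing = [r.model for r in results if r.with_skill < pass_threshold]

    if failing and len(failing) == len(results):
        verdict = "content problem: every model fails even with the skill"
    elif failing:
        verdict = "model-specific adjustments may be needed for: " + ", ".join(failing)
    else:
        verdict = "skill clears the threshold on every model"

    return {
        "deltas": {r.model: r.delta for r in results},
        "regressions": regressions,
        "verdict": verdict,
    }


# Scores reported in the article for "fastify-best-practices".
fastify = [
    SkillResult("Haiku", 34, 89),
    SkillResult("Sonnet", 49, 97),
    SkillResult("Opus", 58, 100),
]

report = diagnose(fastify)
print(report["deltas"])
print(report["verdict"])
```

Any model that lands in `regressions` is the case single-model testing would miss: the skill lowered that model's score even if it helped the others.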

The practical takeaway is straightforward. If you're writing Claude Code skills for a team where people use different models (Haiku for speed, Opus for complex tasks), you need to benchmark across all of them. A skill optimized for Opus might actively hurt Haiku's output. Because Haiku costs a fraction of what Opus does, plenty of teams run the cheaper model for routine work, and those teams deserve skills that actually help.