
Codeset Claims Its $5 Knowledge Base Makes Claude Haiku Match Opus Performance

A $5-per-repo tool called Codeset claims it can make Claude's cheapest coding model perform as well as its most expensive one. The specific claim: Haiku 4.5 with Codeset scored 62% on the company's own benchmark, beating unassisted Opus 4.5 at 60.7%, at roughly one-tenth the inference cost (the per-query cost of running the AI model).

That's a bold headline number. Here's what's actually happening under the hood.

How It Works

Codeset analyzes your GitHub repository - mining commit history, running static analysis, and mapping test coverage - then generates structured knowledge files you commit to your project. When a coding agent like Claude Code opens files in that project, it automatically picks up context it wouldn't otherwise have:

  • Past bugs with their root causes and how they were fixed
  • Edit checklists listing which tests to run and constants to verify when touching specific files
  • Co-change relationships showing which files historically break together
  • Function dependencies and call chains

None of this is hand-written. Codeset extracts it from your Git history during a one-time analysis that takes under an hour.
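
To make the co-change idea concrete, here is a minimal sketch of how such relationships could be mined from commit history. It is not Codeset's implementation, which the company has not published in this detail; the function name `co_change_counts` and the sample file paths are hypothetical, and a real pipeline would read the file lists from Git rather than a hard-coded list.

```python
from collections import Counter
from itertools import combinations


def co_change_counts(commits):
    """Count how often each pair of files appears in the same commit."""
    pairs = Counter()
    for files in commits:
        # Deduplicate and sort so (a, b) and (b, a) share one key.
        for a, b in combinations(sorted(set(files)), 2):
            pairs[(a, b)] += 1
    return pairs


# Hypothetical history: each entry lists the files touched by one commit.
commits = [
    ["billing/invoice.py", "billing/tax.py", "tests/test_invoice.py"],
    ["billing/invoice.py", "billing/tax.py"],
    ["api/routes.py"],
    ["api/routes.py", "tests/test_routes.py"],
]

counts = co_change_counts(commits)
top_pair, top_count = counts.most_common(1)[0]
print(top_pair, top_count)
```

Pairs with high counts are candidates for an "if you edit this file, check that one" hint. Raw counts favor frequently edited files, so a more careful version would probably normalize by how often each file changes on its own.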

The Benchmark Numbers

On Codeset's own benchmark (codeset-gym-python, 150 tasks):

  • Haiku 4.5: 52% to 62% (+10 points)
  • Sonnet 4.5: 56% to 65.3% (+9.3 points)
  • Opus 4.5: 60.7% to 68% (+7.3 points)

On the more widely recognized SWE-Bench Pro (300 tasks), the gains were smaller: Sonnet went from 53% to 55.7%, while cost per task dropped 15.6%, from $2.70 to $2.28.
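
The arithmetic behind these figures can be checked with a few lines of Python. The scores and per-task costs are the ones reported above; the variable names are only illustrative.

```python
# (baseline %, with Codeset %) on codeset-gym-python, as reported above
results = {
    "Haiku 4.5": (52.0, 62.0),
    "Sonnet 4.5": (56.0, 65.3),
    "Opus 4.5": (60.7, 68.0),
}

# Gain in percentage points for each model
gains = {
    model: round(after - before, 1)
    for model, (before, after) in results.items()
}

# The headline comparison: Haiku with Codeset vs. Opus without it
haiku_with_codeset = results["Haiku 4.5"][1]
opus_baseline = results["Opus 4.5"][0]
haiku_beats_raw_opus = haiku_with_codeset > opus_baseline

# SWE-Bench Pro cost per task for Sonnet, in dollars
cost_before, cost_after = 2.70, 2.28
cost_drop_pct = round((cost_before - cost_after) / cost_before * 100, 1)

print(gains, haiku_beats_raw_opus, cost_drop_pct)
```

Note that the gains are percentage points, not relative improvements: Haiku's move from 52% to 62% is 10 points, which is a relative increase of about 19%.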

The diminishing returns pattern is telling. Bigger models improve less, plausibly because they already have stronger reasoning to compensate for missing context. Smaller models benefit more because they rely more heavily on the information handed to them.

The Catch

Codeset's headline benchmark is its own. The SWE-Bench Pro results, which are independently verifiable, show a more modest 2.7 percentage point improvement for Sonnet. That's still useful - especially paired with the cost reduction - but it's not "Haiku becomes Opus" territory on the industry-standard test.

The $5 one-time price per repo makes this low-risk to try. If you're running Claude Code on a large codebase and burning through API costs, the math could work out quickly. The real value might not be the model-tier leapfrogging but the cost savings: getting comparable results from a cheaper model, at a fraction of the inference cost, is a practical win for teams watching their AI spend.
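
As a rough break-even sketch, assume the per-task saving Codeset reported for Sonnet on SWE-Bench Pro ($2.70 down to $2.28) carries over to your own workload. That is an assumption rather than a measured result; real savings will depend on the repository and the kind of tasks.

```python
import math

price_per_repo = 5.00          # one-time Codeset price, in dollars
saving_per_task = 2.70 - 2.28  # assumed saving per task, from SWE-Bench Pro

# Number of tasks after which the one-time fee is recovered
break_even_tasks = math.ceil(price_per_repo / saving_per_task)

print(break_even_tasks)
```

Under that assumption the fee pays for itself after about a dozen agent tasks on the repository.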