Related ToolsClaudeClaude Code

Open-Source Claude Harness Achieves 87% Cost Savings With Prompt Caching

Claude by Anthropic
Image: Anthropic

87% cost savings and sub-3 second response times. Those are the numbers a developer is reporting after months of running a personal Claude agent that manages AWS infrastructure and codebases - and the harness that produces those numbers, named Galadriel, is now open source.

The core problem Galadriel addresses is what the developer calls the "goldfish problem": most Claude implementations treat every API call as if it starts from scratch. Every call resends the entire system prompt, conversation history, and context. That's expensive and slow.

Claude's prompt caching feature changes this equation. Prompt caching lets you mark static sections of your prompt - the system instructions, background documents, fixed context - so Claude stores them server-side between calls. Instead of processing 10,000 tokens from scratch each time, Claude processes only what's new. Cached tokens cost 90% less than regular input tokens.

Most developers don't use this feature at all. Galadriel is built specifically around it.

What the Harness Actually Does

The "warm cache" approach means Galadriel keeps the cached sections fresh between requests rather than letting them expire. Claude's prompt cache has a 5-minute TTL (time-to-live - content is stored for 5 minutes before being cleared), so the harness sends periodic lightweight requests to keep the cache alive. The result is sub-3 second response times even for complex agent tasks.

The architecture also structures prompts for maximum cache efficiency: static content at the top, dynamic content at the bottom. This matches how Claude's caching works - content that appears consistently at the start of the prompt caches best.

The developer ran this in production on Discord for months before releasing it publicly.

What Most Claude Integrations Skip

Any Claude integration with a consistent system prompt - customer support bots, document analysis pipelines, coding assistants - is paying unnecessary input token costs. The fix isn't switching models. It's restructuring prompts to take advantage of caching that's already available.

Claude Code users running long sessions get some of this automatically, but anyone building their own API integrations needs to implement it manually. Galadriel provides a working reference implementation for that pattern.

The 87% savings figure assumes you're currently doing zero caching, which describes most implementations. How much you actually save depends on the ratio of static to dynamic content in your prompts.