
Teams Are Wasting Thousands on Duplicate LLM API Calls - Here's How to Stop It

March 7, 2026

What Happened

A Hacker News discussion on March 7, 2026, surfaced a problem that most teams running LLM-heavy applications know too well but rarely talk about publicly: duplicate and redundant API calls burning through token budgets.

The core issue is straightforward. In production applications that make heavy use of OpenAI, Anthropic, or similar APIs, the same prompt often gets sent multiple times from different parts of the application. Sometimes it's the same user triggering identical requests. Sometimes it's different services in a microservice architecture independently calling the same model with the same input. The result is the same: wasted tokens and inflated API bills.

The discussion revealed several approaches teams are using in production. Caching sits at the top of the list: teams store model responses and return the stored result when an identical or near-identical prompt comes in. Some use exact-match caching, hashing the prompt and keeping the response in a simple key-value store like Redis. Others implement semantic caching, using embedding-based similarity matching to catch prompts that differ slightly in wording but are functionally identical.
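To make the exact-match variant concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not any team's actual setup: `cache_key`, `TTLCache`, and `cached_completion` are hypothetical names, the in-memory dictionary stands in for a store like Redis, and `call_model` is whatever function your application already uses to reach the API.

```python
import hashlib
import json
import time


def cache_key(model, system_prompt, user_prompt, temperature):
    """Hash every input that affects the response into one stable key."""
    payload = json.dumps(
        {
            "model": model,
            "system": system_prompt,
            "user": user_prompt,
            "temperature": temperature,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TTLCache:
    """In-memory stand-in for a key-value store such as Redis (assumption)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (expires_at, response)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, response = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: drop the entry and report a miss
            return None
        return response

    def set(self, key, response):
        self._store[key] = (self.clock() + self.ttl_seconds, response)


def cached_completion(cache, call_model, model, system_prompt, user_prompt,
                      temperature=0.0):
    """Return the stored response when the exact same request was seen recently."""
    key = cache_key(model, system_prompt, user_prompt, temperature)
    hit = cache.get(key)
    if hit is not None:
        return hit
    response = call_model(model, system_prompt, user_prompt, temperature)
    cache.set(key, response)
    return response
```

The key covers the model, system prompt, and temperature as well as the user prompt, so a change to any of them is treated as a different request rather than served a stale answer.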

Request deduplication at the API gateway level is another common pattern. If two identical requests arrive within a short window, the second one waits for the first to complete and gets the same response. Logging and observability tools like Helicone, LangSmith, and custom dashboards help teams spot patterns of waste after the fact.
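The "second request waits for the first" pattern can be sketched in a few lines of Python with `asyncio`. This is a simplified, single-process illustration rather than a real gateway: `InFlightDeduplicator` is a hypothetical name, and `make_request` is assumed to be a zero-argument coroutine function that performs the actual API call.

```python
import asyncio


class InFlightDeduplicator:
    """Coalesce concurrent identical requests into one upstream call."""

    def __init__(self):
        self._in_flight = {}  # request key -> asyncio.Task

    async def run(self, key, make_request):
        task = self._in_flight.get(key)
        if task is None:
            # First caller for this key: start the real request.
            task = asyncio.create_task(make_request())
            self._in_flight[key] = task
            # Forget the task once it finishes so later requests start fresh.
            task.add_done_callback(lambda _t: self._in_flight.pop(key, None))
        # Every caller, first or not, awaits the same task and gets its result.
        return await task
```

Unlike a cache, this keeps nothing after the request completes. It only covers the window in which identical requests overlap, which is why it pairs well with a cache rather than replacing one.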

Why It Matters

LLM API costs add up fast. GPT-4-class models still run $5-15 per million input tokens, and Claude Opus sits at $15 per million input tokens. A production app making 100,000 calls per day where even 10% are duplicates is throwing away real money.
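For a rough sense of scale, the arithmetic can be sketched as below. The call volume, duplicate rate, and $15 per million input tokens come from the figures above; the 1,000 input tokens per call is purely an assumption for illustration, and output tokens are ignored, so real waste would be higher.

```python
def daily_duplicate_cost(calls_per_day, duplicate_rate, input_tokens_per_call,
                         usd_per_million_input_tokens):
    """Estimate dollars per day spent on input tokens for duplicate calls."""
    duplicate_calls = calls_per_day * duplicate_rate
    wasted_tokens = duplicate_calls * input_tokens_per_call
    return wasted_tokens / 1_000_000 * usd_per_million_input_tokens


# 1_000 input tokens per call is an assumed average, not a measured figure.
daily = daily_duplicate_cost(100_000, 0.10, 1_000, 15.0)
print(f"${daily:,.2f} per day, ${daily * 30:,.2f} per 30 days")
```

Under those assumptions the duplicates cost about $150 per day, or roughly $4,500 over 30 days, on input tokens alone.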

For teams building AI-powered products, this is a direct hit to margins. But it's also relevant for individual practitioners and small teams using tools like Cursor or Claude Code in daily workflows. Every redundant call is a cost that provides zero additional value.

The broader point is that the AI tooling ecosystem is maturing past "just get it working" into "make it efficient." Cost optimization was an afterthought a year ago. Now it's a prerequisite for sustainable AI-powered products.

Our Take

This problem is embarrassingly common and underreported. Most teams discover they have a duplication problem only after they get a surprising API bill. The fix isn't complicated (response caching with a TTL is table stakes at this point), but it requires intentional architecture.

The more interesting challenge is semantic deduplication: catching prompts that aren't character-for-character identical but will produce functionally the same output. That's where embedding-based similarity caching comes in, and it's still an area where tooling is immature.
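A minimal sketch of the idea in Python follows. Everything here is an assumption for illustration: `toy_embed` is a bag-of-words stand-in for a real embedding model, `SemanticCache` is a hypothetical class, and the linear scan would be replaced by a vector index at any real scale.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(count * b.get(token, 0) for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Return a stored response when a new prompt is similar enough to an old one."""

    def __init__(self, embed, threshold):
        self.embed = embed
        self.threshold = threshold
        self._entries = []  # list of (embedding, response)

    def lookup(self, prompt):
        query = self.embed(prompt)
        best_score, best_response = 0.0, None
        for embedding, response in self._entries:
            score = cosine_similarity(query, embedding)
            if score > best_score:
                best_score, best_response = score, response
        # Only serve the best match if it clears the similarity threshold.
        return best_response if best_score >= self.threshold else None

    def store(self, prompt, response):
        self._entries.append((self.embed(prompt), response))
```

The threshold is the hard part. Set it too low and users get answers to questions they did not ask; set it too high and the cache behaves like exact matching with extra cost.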

If you're running any LLM-heavy application in production without a caching layer, you're almost certainly paying 15-30% more than you need to. Start with exact-match caching on requests that should return the same answer every time (temperature 0, same model, same system prompt). That alone will cut waste significantly. Move to semantic caching once you have the data to tune similarity thresholds.

Anthropic and OpenAI both now offer built-in prompt caching at the API level, which helps with repeated system prompts. But application-level caching for full request deduplication is still your responsibility.
