
Developers Are Still Struggling to Log and Audit LLM Calls in Production


What Happened

A Hacker News "Ask HN" post on March 7, 2026 posed a question many teams are quietly dealing with: how do you properly log prompts, responses, and model calls in AI applications?

The poster described a common situation. Traditional systems have mature observability stacks: Datadog, Grafana, structured logging. LLM-based apps introduce new challenges. Prompts can contain sensitive user data. Responses vary widely in length. Token costs need tracking. And if you are in a regulated industry, you need audit trails that prove what your AI said and which inputs led to it.

The thread surfaced a mix of approaches. Some teams build custom logging pipelines that capture prompt-response pairs with metadata like model version, temperature settings, and latency. Others use emerging tools like LangSmith, Helicone, or Braintrust for LLM-specific observability. A few teams just dump everything to object storage and deal with it later.

Why It Matters

This is not a niche concern anymore. As companies move AI features from prototypes to production, the "we will figure out logging later" approach stops working fast.

Three problems keep coming up. First, cost visibility. Without logging token usage per feature, per user, per model, you cannot optimize spend. Teams regularly discover that one poorly written prompt is burning through their API budget. Second, debugging. When an AI feature gives a bad answer, you need to replay the exact prompt and context to understand why. Third, compliance. Healthcare, finance, and legal applications need to show auditors exactly what the AI produced and what inputs it received.
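To make the cost-visibility point concrete, here is a minimal sketch of rolling up spend from logged calls. It assumes each log record is a dict with `feature`, `user`, `model`, `prompt_tokens`, and `completion_tokens` fields; those field names, the model names, and the prices are all hypothetical placeholders, not any provider's real schema or rates.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1,000 tokens. Real prices differ by
# provider and model and change over time; load them from config.
EXAMPLE_PRICES = {
    "model-small": {"prompt": 0.10, "completion": 0.40},
    "model-large": {"prompt": 1.00, "completion": 2.00},
}


def cost_breakdown(records, prices, keys=("feature", "model")):
    """Sum estimated spend per group, where a group is the tuple of `keys`.

    `records` is an iterable of dicts holding at least the fields named in
    `keys`, plus `model`, `prompt_tokens`, and `completion_tokens`.
    """
    totals = defaultdict(float)
    for record in records:
        price = prices[record["model"]]
        cost = (
            record["prompt_tokens"] / 1000 * price["prompt"]
            + record["completion_tokens"] / 1000 * price["completion"]
        )
        group = tuple(record[key] for key in keys)
        totals[group] += cost
    return dict(totals)
```

Passing `keys=("user",)` or `keys=("feature", "user", "model")` gives the other breakdowns mentioned above. This only works if those fields were logged at call time, which is the point of capturing them early.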

The lack of a standard approach means every team reinvents this. That is wasted engineering time that could go toward the actual product.

Our Take

The fact that this question is still being asked in March 2026 tells you something about how young the LLM tooling ecosystem remains. We have had structured logging standards for web apps for over a decade. For AI apps, we are still in the "everyone rolls their own" phase.

If you are building with LLM APIs today, here is the practical advice: log everything from day one. Capture the full prompt, the full response, the model identifier, token counts, latency, and any retrieval context. Store it somewhere queryable. You will thank yourself in three months when debugging a production issue.
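As a rough illustration of that advice, the sketch below wraps a model call and emits one JSON line per call with the fields listed above. It is not tied to any vendor SDK: `call_model` is a hypothetical function you would supply, assumed here to return the response text plus prompt and completion token counts, and `sink` is any callable that accepts a string.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Callable, Optional, Tuple


@dataclass
class LLMCallRecord:
    call_id: str
    timestamp: str
    model: str
    prompt: str
    response: str
    prompt_tokens: int
    completion_tokens: int
    latency_ms: float
    retrieval_context: list = field(default_factory=list)


def logged_call(
    call_model: Callable[[str, str], Tuple[str, int, int]],
    model: str,
    prompt: str,
    sink: Callable[[str], object],
    retrieval_context: Optional[list] = None,
) -> LLMCallRecord:
    """Call the model, time it, and write one JSON line describing the call.

    `call_model(model, prompt)` is an assumed adapter around your real
    client; it must return (response_text, prompt_tokens, completion_tokens).
    """
    start = time.perf_counter()
    response, prompt_tokens, completion_tokens = call_model(model, prompt)
    latency_ms = (time.perf_counter() - start) * 1000

    record = LLMCallRecord(
        call_id=uuid.uuid4().hex,
        timestamp=datetime.now(timezone.utc).isoformat(),
        model=model,
        prompt=prompt,
        response=response,
        prompt_tokens=prompt_tokens,
        completion_tokens=completion_tokens,
        latency_ms=latency_ms,
        retrieval_context=list(retrieval_context or []),
    )
    # One JSON object per line is easy to ship to most log pipelines.
    sink(json.dumps(asdict(record)) + "\n")
    return record
```

In practice `sink` could be an open file's `write` method or a function that forwards to your existing logging pipeline. If prompts may contain sensitive user data, a redaction step before the `sink` call is the natural place for it.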

The dedicated observability tools for LLM apps are maturing, but none has become the clear default yet. For most teams, a simple structured logging setup that writes to your existing data warehouse is good enough to start. Do not let the search for the perfect tool delay capturing data: calls you did not log cannot be reconstructed later.
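For a sense of what "queryable" can look like at its simplest, here is a sketch that uses Python's built-in `sqlite3` module as a stand-in for a real data warehouse. The `llm_calls` table and its columns are assumptions made for illustration; a production setup would use your warehouse's own client and schema.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a small SQLite store for LLM call records."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS llm_calls (
            call_id TEXT PRIMARY KEY,
            model TEXT NOT NULL,
            prompt TEXT NOT NULL,
            response TEXT NOT NULL,
            prompt_tokens INTEGER NOT NULL,
            completion_tokens INTEGER NOT NULL,
            latency_ms REAL NOT NULL
        )
        """
    )
    return conn


def insert_call(conn, row):
    """Insert one call record given as a dict keyed by column name."""
    conn.execute(
        "INSERT INTO llm_calls VALUES (:call_id, :model, :prompt, :response,"
        " :prompt_tokens, :completion_tokens, :latency_ms)",
        row,
    )
    conn.commit()


def tokens_by_model(conn):
    """Return (model, total_tokens) rows, sorted by model name."""
    cursor = conn.execute(
        "SELECT model, SUM(prompt_tokens + completion_tokens)"
        " FROM llm_calls GROUP BY model ORDER BY model"
    )
    return cursor.fetchall()
```

The same pattern carries over to any SQL-speaking store: once calls land in a table, questions about spend, latency, or a specific bad answer become ordinary queries.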