
AI Carb Counting Fails Consistency Test Across 27,000 Queries


27,000 queries. One question repeated thousands of times: how many carbohydrates are in this food? The answer from AI chatbots changed constantly - never the same number twice.

Tim Street, who runs diabettech.com and manages Type 1 diabetes, ran this experiment because carb counting isn't academic for him. People with Type 1 diabetes calculate insulin doses directly from carbohydrate estimates. Get the number wrong and you risk a dangerous blood sugar spike or crash. Whether AI can reliably count carbs is really a question about whether AI can be trusted with something that carries direct health consequences.

Based on 27,000 data points, the answer is: not really.

What the Numbers Actually Show

AI models - including tools like ChatGPT and Claude - don't produce deterministic answers by default. They're probabilistic systems that generate responses by sampling from a range of likely outputs rather than retrieving a fixed value. A parameter called "temperature" controls how varied those outputs are: lower values make responses more repeatable, higher values make them more varied. For most tasks, this variability is fine or even desirable. For carb counting, it creates a real problem.
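To make the role of temperature concrete, here is a minimal sketch of temperature-scaled sampling. It is a simplification: real models sample token by token, while this toy picks one whole answer from three candidates. The candidate values are the figures used in this article, and the scores (`LOGITS`) are made up for illustration; neither comes from Street's data.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; lower temperature sharpens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def sample_estimate(candidates, logits, temperature, rng):
    """Draw one candidate answer according to the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(candidates, weights=probs, k=1)[0]


# Hypothetical candidate answers (grams of carbs) and made-up model scores.
CANDIDATES = [38, 45, 52]
LOGITS = [1.0, 2.0, 1.0]

if __name__ == "__main__":
    rng = random.Random(0)  # seeded only so the demo is repeatable
    for temperature in (0.1, 1.0):
        draws = [sample_estimate(CANDIDATES, LOGITS, temperature, rng) for _ in range(10)]
        print(f"temperature={temperature}: {draws}")
```

At a low temperature almost all of the probability lands on the top-scoring answer, so repeated draws mostly agree. At a higher temperature the other candidates are drawn often enough that repeated queries visibly disagree.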

Street found that estimates for identical foods varied by meaningful amounts across repeated queries. A meal pegged at 45g of carbs on one query might come back as 38g or 52g the next time. For someone calculating an insulin dose, that spread isn't a rounding error - it's the difference between a good outcome and a medical event.
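A rough sketch of the arithmetic shows why that spread matters. The insulin-to-carb ratio below (1 unit per 10g) is a hypothetical round number chosen only to keep the math simple; real ratios are individual, and nothing here comes from Street's data.

```python
def bolus_units(carbs_g, grams_per_unit):
    """Insulin units needed to cover a carb estimate at a given ratio."""
    return carbs_g / grams_per_unit


# Hypothetical ratio for illustration only: 1 unit covers 10g of carbs.
GRAMS_PER_UNIT = 10.0

# The three example estimates from the paragraph above.
estimates_g = [38, 45, 52]
doses = [bolus_units(c, GRAMS_PER_UNIT) for c in estimates_g]
spread_units = max(doses) - min(doses)

if __name__ == "__main__":
    for carbs, dose in zip(estimates_g, doses):
        print(f"{carbs}g -> {dose:.1f} units")
    print(f"spread: {spread_units:.1f} units")
```

Under that assumed ratio, the 14g gap between the lowest and highest estimate translates into a 1.4-unit difference in the calculated dose for the same plate of food.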

This isn't a flaw specific to one model. It's a structural property of how large language models work. They learn statistical patterns from training data. When asked about the carb content of a specific meal, they generate a plausible-sounding estimate based on what similar foods looked like in that data - they're not querying a verified nutritional database with locked values.

The Confidence Problem

AI does many things well: summarizing documents, drafting copy, answering questions where roughly right is good enough. Numeric tasks requiring precise, repeatable outputs are a different category.

"AI can answer this question" and "AI can answer this question reliably" are not the same thing. The model delivers 38g with the same confident tone as 52g. There's no uncertainty flag. No caveat that this is a rough estimate. Just a number, stated as if it were a fact.

For anyone using AI in health, finance, or any context where a specific number matters: treat AI-generated figures as a starting point, not a source of truth. Cross-reference against verified databases. Run consistency checks if precision is critical.
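One way to run such a consistency check is to ask the same question several times and look at the spread before trusting any single number. The sketch below assumes you have already collected the repeated estimates as a list; the function name `consistency_report` and the tolerance value are illustrative choices, not something taken from Street's methodology.

```python
import statistics


def consistency_report(estimates_g, tolerance_g):
    """Summarize repeated estimates and flag whether they agree within tolerance_g."""
    if len(estimates_g) < 2:
        raise ValueError("need at least two estimates to measure consistency")
    spread = max(estimates_g) - min(estimates_g)
    return {
        "mean_g": statistics.mean(estimates_g),
        "stdev_g": statistics.stdev(estimates_g),  # sample standard deviation
        "range_g": spread,
        "consistent": spread <= tolerance_g,
    }


if __name__ == "__main__":
    # Example values from this article; the 5g tolerance is an arbitrary assumption.
    report = consistency_report([45, 38, 52], tolerance_g=5)
    print(report)
```

If the range exceeds whatever tolerance the use case can accept, that is a signal to fall back on a verified source rather than average the answers and hope.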

Street's experiment also makes a case for structured integrations over chat interfaces. A query to a verified nutritional database via an API returns the same answer every time. A chatbot does not.
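The contrast can be sketched in a few lines. Here a plain dictionary stands in for a verified nutritional database; the entry name and value are placeholders, not real nutritional data, and a real integration would call whatever API the chosen database provides.

```python
# Stand-in for a verified database. The key and value are placeholders.
NUTRITION_DB = {
    "example meal": 45.0,  # grams of carbohydrate, hypothetical entry
}


def lookup_carbs(food):
    """Return the stored carb value, or fail loudly instead of guessing."""
    key = food.strip().lower()
    if key not in NUTRITION_DB:
        raise KeyError(f"no verified entry for {food!r}")
    return NUTRITION_DB[key]


if __name__ == "__main__":
    answers = {lookup_carbs("Example Meal") for _ in range(1000)}
    print(answers)  # a single distinct value, however many times it is asked
```

Two properties matter here: the same input always yields the same stored value, and an unknown food produces an explicit error rather than a plausible-sounding number.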