Try running the same LLM API call 50 times and you will not get 50 successes. That is the uncomfortable finding from developer Andrew Wheeler, who stress-tested APIs from OpenAI, Anthropic, Google, and AWS Bedrock while compiling code examples for a technical book.
The setup was simple: a Quarto-based book that executes every code example live against real APIs during compilation. Each build cycle meant hundreds of API calls, and the failure patterns that emerged tell you a lot about where each provider's rough edges are.
What Breaks and Where
OpenAI had the most dramatic failure. An example combining web search with image analysis worked fine, then stopped working entirely on January 24th: the API consistently failed to download the images needed for analysis. It later resolved on its own. What makes this hard to predict is that other stochastic examples ran reliably throughout.
Anthropic's Claude has a subtle JSON bug. Structured output responses (where you ask the model to return data in a specific format) occasionally append an extra bracket at the end, breaking JSON parsing. Wheeler calls it "quite rare" but hit it multiple times across full book compilations. His production systems handle it with error-catching code, but it is the kind of bug that bites you at 2 AM.
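One way to catch the stray-bracket case is a lenient parser that accepts the first complete JSON value and tolerates only leftover closing brackets. This is a minimal sketch and not Wheeler's actual error-handling code; the function name parse_json_lenient is hypothetical, and it relies only on Python's standard json module.

```python
import json


def parse_json_lenient(text: str):
    """Parse JSON, tolerating stray closing brackets after the first value.

    Hypothetical helper for illustration; not taken from the original article.
    """
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass  # fall through to the lenient path

    stripped = text.lstrip()  # raw_decode does not skip leading whitespace
    # raw_decode parses the first JSON value and reports where it ended.
    value, end = json.JSONDecoder().raw_decode(stripped)
    trailing = stripped[end:]

    # Accept only whitespace and extra closing brackets; anything else is
    # a different kind of corruption and should still fail loudly.
    if not set(trailing) <= set("}] \t\r\n"):
        raise ValueError(f"unexpected trailing content: {trailing!r}")
    return value
```

Calling parse_json_lenient('{"a": [1, 2]}}') returns the dictionary despite the extra brace, while input followed by arbitrary text still raises. It is worth logging whenever the lenient path is taken, so you can see how often the bug actually occurs.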
Google's Gemini struggled hardest with its Maps grounding feature. Instead of returning an error when it could not fetch map data, it returned a friendly message saying it could not find data right now, which looks like a successful API call to your code. Silent failures like this are worse than loud crashes because your application keeps running with bad data.
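A silent failure like this can only be caught by inspecting the content of the response. The sketch below assumes you have already pulled the answer text and a list of grounding sources out of the response yourself; the function name, the marker phrases, and the shape of sources are all illustrative assumptions, not part of any provider's SDK.

```python
# Phrases that suggest a polite non-answer. Purely illustrative; tune them
# against the responses you actually observe.
APOLOGY_MARKERS = (
    "could not find",
    "couldn't find",
    "unable to",
    "try again later",
)


def check_grounded_answer(text: str, sources: list) -> list[str]:
    """Return a list of problems; an empty list means the answer looks usable.

    Hypothetical helper: `text` is the model's answer and `sources` is
    whatever grounding metadata you extracted from the response.
    """
    problems = []
    if not text or not text.strip():
        problems.append("empty text")
    if not sources:
        problems.append("no grounding sources")

    lowered = (text or "").lower()
    for marker in APOLOGY_MARKERS:
        if marker in lowered:
            problems.append(f"apology phrase: {marker!r}")
            break  # one match is enough to flag the response
    return problems
```

Phrase matching is a blunt heuristic and will produce false positives, so the check on missing grounding sources is likely the more dependable signal. Treat a non-empty result as a failed call: retry it or surface an error instead of passing the text downstream.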
AWS Bedrock running DeepSeek randomly returned completely empty response bodies. Other models on the same Bedrock infrastructure (including Anthropic and Mistral) worked fine, pointing to a DeepSeek-specific issue.
The Practical Takeaway
None of these failures is a show-stopper on its own. The problem is that every provider has different failure modes, and most of them are intermittent. You cannot reproduce them on demand, which makes debugging miserable.
If you are building anything production-grade on LLM APIs, the lesson is blunt: wrap every call in retry logic, validate response formats before parsing, and never trust that a 200 status code means you got good data back. The APIs are good enough to build on, but not reliable enough to trust blindly.
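As a rough illustration of that advice, the sketch below wraps any call in retries with exponential backoff and runs a validation step before accepting the result. Every name in it (call_with_retries, BadResponse, require_nonempty) is hypothetical, and the callable you pass in stands for whichever provider SDK call you actually use.

```python
import random
import time


class BadResponse(Exception):
    """Raised when a call 'succeeded' but the payload is unusable."""


def require_nonempty(result: str) -> None:
    # Example validator: reject empty bodies like the ones seen on Bedrock.
    if not result or not result.strip():
        raise BadResponse("empty response body")


def call_with_retries(call, validate, max_attempts=4, base_delay=1.0,
                      sleep=time.sleep):
    """Run `call()` until `validate(result)` passes or attempts run out.

    `call` is any zero-argument function that performs the API request.
    `sleep` is injectable so tests do not have to wait.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            result = call()
            validate(result)  # a clean return is not proof of good data
            return result
        except Exception as exc:  # in real code, catch narrower error types
            last_error = exc
            if attempt < max_attempts - 1:
                # Exponential backoff plus jitter to avoid synchronized retries.
                delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
                sleep(delay)
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

A call would look like call_with_retries(lambda: my_client_call(prompt), require_nonempty), where my_client_call is your own wrapper around the provider SDK. Retrying on every exception is a simplification: authentication errors and malformed requests will never succeed on retry and should fail immediately.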