"Claude goes down more often than any of us would like." That's not a critic talking. That's Alex Palcuie, an engineer on Anthropic's own AI reliability team, speaking at QCon London on March 19.
Palcuie, a former Google Cloud Platform SRE, gave a surprisingly candid talk about using Claude to help keep Claude running. The short version: it's genuinely useful for some tasks, clearly not ready for others, and the team isn't pretending otherwise.
What Claude Actually Does Well in Ops
The strongest use case is log analysis. During a New Year's Eve incident that triggered HTTP 500 errors, Claude Opus 4.5 identified an unhandled exception in an image processing class, then flagged 200 accounts that were all sending 22 images simultaneously. It also surfaced that 4,000 accounts had been created at the same time and were mostly dormant, a pattern consistent with coordinated abuse.
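To make that kind of pattern concrete, here is a minimal sketch of how "many accounts sending the same number of images at the same moment" could be flagged from request logs. It is an illustration, not Anthropic's tooling: the record fields (`account_id`, `ts`, `image_count`), the function name, and the thresholds are all assumptions.

```python
from collections import defaultdict

def find_lockstep_cohorts(records, bucket_seconds=60, min_accounts=3):
    """Group requests by (time bucket, image count) and keep only the
    groups that contain at least `min_accounts` distinct accounts."""
    cohorts = defaultdict(set)
    for rec in records:
        # `ts` is assumed to be epoch seconds; bucket it into fixed windows.
        bucket = rec["ts"] // bucket_seconds
        cohorts[(bucket, rec["image_count"])].add(rec["account_id"])
    return {
        key: accounts
        for key, accounts in cohorts.items()
        if len(accounts) >= min_accounts
    }

# Hypothetical sample data: three accounts send 22 images in the same minute.
sample_records = [
    {"account_id": "a1", "ts": 1000, "image_count": 22},
    {"account_id": "a2", "ts": 1010, "image_count": 22},
    {"account_id": "a3", "ts": 1015, "image_count": 22},
    {"account_id": "b1", "ts": 1012, "image_count": 1},
    {"account_id": "c1", "ts": 5000, "image_count": 22},
]
suspicious = find_lockstep_cohorts(sample_records)
```

A grouping like this only surfaces candidates. Whether a cohort is abuse or a legitimate batch job is still a judgment call.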
As Palcuie put it: "It reads the logs at the speed of I/O, it doesn't get bored." For anyone who's spent hours scrolling through log output at 2 AM during an outage, that alone is a real improvement. Claude can also generate SQL queries for diagnostics on the fly, which speeds up the early triage phase.
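A triage query of the sort described might look like the sketch below, which breaks down HTTP 500s by the number of images in the request. The `requests` table, its columns, and the sample rows are hypothetical, and SQLite stands in for whatever log store a team actually uses.

```python
import sqlite3

# Hypothetical diagnostic query: which request shapes produce the most 500s?
DIAGNOSTIC_SQL = """
SELECT image_count,
       COUNT(*) AS errors,
       COUNT(DISTINCT account_id) AS accounts
FROM requests
WHERE status = 500
GROUP BY image_count
ORDER BY errors DESC
"""

def errors_by_image_count(conn):
    """Return (image_count, error_count, distinct_accounts) rows."""
    return conn.execute(DIAGNOSTIC_SQL).fetchall()

# In-memory database with made-up rows, just to exercise the query.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE requests (account_id TEXT, status INTEGER, image_count INTEGER)"
)
conn.executemany(
    "INSERT INTO requests VALUES (?, ?, ?)",
    [
        ("a1", 500, 22), ("a2", 500, 22), ("a1", 500, 22),
        ("b1", 200, 1), ("b2", 500, 3),
    ],
)
rows = errors_by_image_count(conn)
```

Generated SQL is cheap to produce and cheap to check, which is part of why it suits early triage: a human can read the query before trusting its output.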
Where It Falls Apart
Root cause analysis is where things break down. Palcuie was blunt: Claude "will get wrong correlation versus causation." The model can spot that two things happened around the same time, but it struggles to determine which caused which, especially without knowledge of the system's history and architecture.
Postmortems, the detailed write-ups teams produce after incidents, land at about 80% accuracy. Palcuie called them "an 80 percent story that's pretty, it's readable and convincing," and that polish is exactly the problem. A plausible-sounding postmortem with wrong conclusions is worse than an obviously incomplete one, because people act on it.
The team breaks incident response into four phases: observe, orient, decide, act (the classic OODA loop). Claude is strong on observe (log scanning, pattern detection) and decent at orient (summarizing what's happening). But decide and act still need a human. The model doesn't understand that the KV cache, a memory system that can be gigabytes in size, is "fragile," as Palcuie put it, or why certain failure modes cascade the way they do.
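One way to picture that split is as a simple policy gate: the model may contribute freely in the first two phases, while anything in the last two is blocked until a human signs off. This is a sketch under that assumption only. The names (`Phase`, `owner_for`, `run_step`) are invented for illustration and do not describe how Anthropic's team actually enforces the boundary.

```python
from enum import Enum

class Phase(Enum):
    OBSERVE = "observe"
    ORIENT = "orient"
    DECIDE = "decide"
    ACT = "act"

# Phases where, per the talk, model output is useful without a sign-off.
MODEL_ASSISTED = {Phase.OBSERVE, Phase.ORIENT}

def owner_for(phase):
    """Say who owns a phase: the model as an assistant, or a human."""
    return "model-assisted" if phase in MODEL_ASSISTED else "human"

def run_step(phase, proposal, human_approved=False):
    """Return the proposal if it may proceed; raise if sign-off is missing."""
    if owner_for(phase) == "human" and not human_approved:
        raise PermissionError(
            f"{phase.value} requires human approval: {proposal}"
        )
    return proposal

# A log summary passes straight through; a rollback needs explicit approval.
summary = run_step(Phase.ORIENT, "500s cluster on 22-image requests")
approved = run_step(Phase.ACT, "roll back deploy", human_approved=True)
```

The gate fails closed: an unapproved decide or act step raises an error instead of quietly proceeding.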
The Honest Takeaway
"It would be hypocritical to say that Claude fixes everything," Palcuie said. That kind of honesty from inside a model provider is rare and worth paying attention to.
The practical lesson for teams considering AI-assisted operations: use it as a fast first-pass analyst, not a decision-maker. Let it chew through logs and surface anomalies. Don't let it write the postmortem without heavy human review. And definitely don't let it decide whether to add more servers or roll back a deployment.
Palcuie closed with "the models are the worst today that they'll ever be" - the standard optimist line. Maybe. But right now, the 80% accuracy problem on postmortems is a real gap, and knowing that gap exists is more useful than pretending it doesn't.