Accuracy is going up. Reliability is barely moving. That's the core finding from Princeton University's new AI Agent Reliability Tracker, which evaluates 14 AI models across four dimensions that matter far more than benchmark scores: consistency, predictability, robustness, and safety.
The research, from Princeton's Language & Intelligence group, asks a question most AI companies would rather skip: does a higher accuracy number actually mean the agent works better in practice?
The answer, backed by twelve distinct metrics across two benchmarks, is a qualified no.
Consistency Is Basically Flat
The most striking number in the dataset: the correlation between consistency and time is r=0.02, where 1.0 would mean a perfect correlation. That's statistical noise. Models are getting measurably more accurate, but they aren't getting measurably more consistent in how they complete tasks.
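To make the r value concrete, here is a minimal sketch of how a Pearson correlation between release time and a consistency score could be computed. The data points are hypothetical and chosen only to illustrate a near-zero r; they are not taken from the tracker, and the tracker's own methodology may differ.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (covariance, up to a constant factor)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values, NOT the tracker's data:
# years since the first model's release, and a consistency score per model.
release_years = [0.0, 0.5, 1.0, 1.5, 2.0]
consistency = [0.70, 0.74, 0.69, 0.73, 0.70]

r = pearson_r(release_years, consistency)
print(f"r = {r:.2f}")
```

With scores that bounce around without any trend, r lands close to zero, which is the pattern the article describes for consistency over time.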
This matters because real-world AI agent use depends on predictability. A coding assistant that gets the right answer 70% of the time but fails in unpredictable ways is harder to trust than one that gets 60% right but fails consistently in known patterns.
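The 70% versus 60% comparison above can be sketched in code. This is an illustrative toy, assuming each task is run several times and recorded as pass or fail; the function names and the all-runs-agree definition of consistency are assumptions for this example, not the tracker's actual metrics.

```python
def accuracy(runs):
    """Fraction of all runs, across all tasks, that succeeded."""
    total = sum(len(results) for results in runs.values())
    passed = sum(sum(results) for results in runs.values())
    return passed / total

def outcome_consistency(runs):
    """Fraction of tasks whose repeated runs all ended the same way."""
    same = sum(1 for results in runs.values() if len(set(results)) == 1)
    return same / len(runs)

# Hypothetical agent A: 70% accurate, but every task fails some of the time.
agent_a = {f"task{i}": [True] * 7 + [False] * 3 for i in range(5)}

# Hypothetical agent B: 60% accurate, but it always fails the same two tasks.
agent_b = {
    "task0": [True] * 10,
    "task1": [True] * 10,
    "task2": [True] * 10,
    "task3": [False] * 10,
    "task4": [False] * 10,
}

for name, runs in [("A", agent_a), ("B", agent_b)]:
    print(name, accuracy(runs), outcome_consistency(runs))
```

Agent B scores lower on accuracy, but its failures are confined to tasks you can identify in advance and route elsewhere. Agent A gives you no such handle.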
The accuracy trend over time sits at just 0.06 points per year. Models are improving, but slowly, and the reliability gap isn't closing.
The Scoreboard
Overall reliability scores across all benchmarks:
- Gemini 3.0 Pro: 85.2%
- Claude Opus 4.5: 84.6%
- Claude Sonnet 4.5: 82.6%
- GPT-5.2 (extra-high compute): 81.3%
- Gemini 2.5 Pro: 78.6%
Claude Opus 4.5 leads on predictability (84.2%) while Gemini 3.0 Pro takes the top overall spot. Safety scores are uniformly high across all models, ranging from 93% to 99.8%, which suggests the industry has made real progress on the "agent goes rogue" problem, at least on standard benchmarks.
The interesting gap shows up in the GAIA benchmark (general agent reasoning). Claude Opus 4.5 hits 71.5% accuracy with 81.6% reliability. GPT-4 Turbo scores just 20% accuracy but still manages 75.7% reliability. Its accuracy is far lower, but its reliability is only slightly lower, which means its failures are comparatively predictable.
What This Means for People Using AI Agents
If you're building workflows around AI agents, this research suggests you should care less about which model topped the latest benchmark and more about how consistently it handles your specific task type.
The strong correlation between accuracy and reliability (r=0.95) means better models are generally more reliable. But the correlation with time is weaker (r=0.72), so you shouldn't expect next quarter's model update to magically make your agent workflows more dependable. Reliability improvements are coming more slowly than accuracy improvements.
For anyone integrating AI agents into business processes, this data argues for building in verification layers and fallback logic rather than trusting that raw accuracy scores will keep climbing toward the point where checking becomes unnecessary. The tracker itself is worth bookmarking as an ongoing, model-agnostic resource for comparing agents on metrics that actually predict real-world performance.
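As a rough illustration of what a verification layer with fallback logic can look like, here is a minimal sketch. Everything in it is hypothetical: `run_with_verification`, the stub agents, and the `check_sum` verifier are invented for this example and stand in for whatever model calls and task-specific checks a real workflow would use.

```python
from typing import Callable, Optional

def run_with_verification(
    task: str,
    agents: list[Callable[[str], str]],
    verify: Callable[[str, str], bool],
    max_attempts_per_agent: int = 2,
) -> Optional[str]:
    """Try each agent in order; return the first output that passes verify.

    Returns None if no agent produces a verified output, so the caller
    can escalate (for example, to human review).
    """
    for agent in agents:
        for _ in range(max_attempts_per_agent):
            try:
                output = agent(task)
            except Exception:
                continue  # treat a crash as a failed attempt
            if verify(task, output):
                return output
    return None

# Hypothetical stand-ins for real model calls.
def flaky_primary(task: str) -> str:
    return "5"  # wrong answer for the demo task

def steady_fallback(task: str) -> str:
    return "4"

# Hypothetical task-specific check; real checks might run tests or validate a schema.
def check_sum(task: str, output: str) -> bool:
    return output.strip() == "4"

result = run_with_verification("2 + 2", [flaky_primary, steady_fallback], check_sum)
print(result)
```

The design choice here is that nothing leaves the wrapper unverified: a wrong answer from the primary agent costs a retry and a fallback call, and a total failure returns None instead of an unchecked guess.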