212 days. That was the average time it took for AI agent capabilities to double between 2019 and 2025, or about seven months. By early 2026, that figure had compressed to roughly 3.5 months, according to Ajeya Cotra, a researcher at METR, an AI evaluation organization that runs structured benchmarks on frontier models.
Cotra published a detailed revision of predictions she had made just one month earlier, admitting she had underestimated how fast software engineering agents would improve. Her data tells a specific story.
From 5-Hour Tasks to 12-Hour Tasks in 10 Weeks
Cotra's framework measures AI capability by "time horizon": the length of task, measured in the hours it would take a human, that a model can reliably complete without human intervention. Claude Opus 4.5 could handle roughly 5-hour tasks with a 50% success rate. Claude Opus 4.6, released about 2.5 months later, pushed that to approximately 12-hour tasks at the same success rate. Of 19 benchmark tasks requiring 8 or more human hours, the newer model solved 14.
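To put the jump from 5-hour to 12-hour tasks in doubling-time terms, here is a minimal back-of-the-envelope sketch. It assumes smooth exponential growth between just these two data points, which is a simplification and not METR's actual methodology; the function name is made up for illustration. A two-point estimate like this is noisy and need not match the roughly 3.5-month figure quoted above.

```python
import math

def implied_doubling_time(h_start, h_end, elapsed_months):
    """Months per doubling, assuming exponential growth between two points."""
    doublings = math.log2(h_end / h_start)  # how many times the horizon doubled
    return elapsed_months / doublings

# Figures from the article: ~5 hours -> ~12 hours in about 2.5 months.
months = implied_doubling_time(5.0, 12.0, 2.5)
days = months * 30.44  # assumption: average month length in days

print(f"{months:.2f} months (~{days:.0f} days) per doubling")
```

On these two points alone, the implied doubling time comes out at roughly two months, well under the 212-day historical average.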
The concrete examples she cites are striking: AI writing a web browser from scratch, building a C compiler, porting SimCity in four days without access to source code. These are not toy demos. They are multi-file, multi-system projects that would take skilled developers days or weeks.
Traditional Benchmarks Are Breaking
Cotra flags a measurement problem that matters for anyone tracking AI progress. Standard coding benchmarks like SWE-bench are saturating, meaning top models are hitting near-perfect scores. When every model aces the test, the test stops being useful for distinguishing real capability differences.
She predicts 100-plus-hour time horizons by the end of 2026, but adds a caveat: "the whole concept of time horizon starts to break down" at that scale. A 100-hour task is not just a longer version of a 5-hour task. It requires planning, error recovery, and the kind of judgment that current models still struggle with on open-ended research problems.
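As a rough sanity check on the 100-hour prediction, the sketch below extrapolates from a 12-hour horizon under a constant doubling time. Both the starting point and the assumption that the trend holds steady are simplifications made for illustration, and the helper name is hypothetical; as the caveat above suggests, the time-horizon metric itself may stop being meaningful at this scale.

```python
import math

def months_to_reach(current_hours, target_hours, doubling_months):
    """Months needed to grow from current to target horizon at a fixed doubling time."""
    return math.log2(target_hours / current_hours) * doubling_months

# Assumed starting point: the ~12-hour horizon cited for Claude Opus 4.6.
fast = months_to_reach(12.0, 100.0, 3.5)            # recent ~3.5-month doubling
slow = months_to_reach(12.0, 100.0, 212 / 30.44)    # historical 212-day doubling

print(f"at 3.5-month doubling: {fast:.1f} months")
print(f"at 212-day doubling:   {slow:.1f} months")
```

Under the faster doubling rate, 100 hours is a little under eleven months away from a 12-hour start; under the older 212-day rate it would take roughly twice as long.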
The Parallel Work Multiplier
The most practical insight in Cotra's analysis is about decomposition. A month-long software project is rarely one continuous 720-hour task. It is dozens of smaller tasks that can run in parallel. AI agents do not need to match human endurance on single tasks if they can coordinate through documentation and ticketing systems, the same way remote engineering teams already work.
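The decomposition argument can be made concrete with a small dependency graph. The sketch below uses an invented project (every task name and hour count is hypothetical) to compare total work with the critical path, meaning the longest chain of tasks that must run one after another, and to flag any single task that exceeds a given time horizon.

```python
from functools import lru_cache

# Hypothetical project: task -> (hours, prerequisites). All numbers are invented.
TASKS = {
    "spec":     (8,  []),
    "backend":  (12, ["spec"]),
    "frontend": (10, ["spec"]),
    "tests":    (6,  ["backend", "frontend"]),
    "docs":     (4,  ["spec"]),
    "release":  (2,  ["tests", "docs"]),
}

def total_work(tasks):
    """Hours needed if one worker did everything sequentially."""
    return sum(hours for hours, _ in tasks.values())

def critical_path(tasks):
    """Shortest possible wall-clock time with unlimited parallel workers."""
    @lru_cache(maxsize=None)
    def finish(name):
        hours, deps = tasks[name]
        # A task can start only once its slowest prerequisite has finished.
        return hours + max((finish(d) for d in deps), default=0)
    return max(finish(name) for name in tasks)

def oversized(tasks, horizon_hours):
    """Tasks too long for an agent with the given time horizon."""
    return [name for name, (hours, _) in tasks.items() if hours > horizon_hours]

print(total_work(TASKS), critical_path(TASKS), oversized(TASKS, 12))
```

In this toy graph, no single task exceeds a 12-hour horizon even though the project as a whole is several times larger, which is the point of the decomposition argument: what matters is the size of the largest indivisible piece, not the size of the project.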
This is where Cotra's prediction gets consequential for everyday AI users: she now considers full AI research automation in 2026 "genuinely possible," though she notes it would require breakthroughs in research judgment, not just raw coding ability. The gap between "can write code that passes tests" and "can decide what code to write" remains real, but it is closing faster than most forecasters expected even a few months ago.
Cotra's track record on these predictions is closely watched in AI safety circles. When she revises upward, it tends to signal that the underlying capability data is genuinely surprising, not just PR hype from model providers.