What happens when you stop letting AI models pattern-match their way through benchmarks and force them to actually learn something new on the fly? They fail completely.
ARC-AGI-3, launched today by the ARC Prize Foundation, is the first major format change to the ARC benchmark since it was introduced in 2019. The previous versions (ARC-AGI-1 and ARC-AGI-2) gave models static puzzles: here are some input-output grid pairs, figure out the transformation rule, apply it to a new input. Frontier models eventually got decent at those. Gemini 3.1 Pro recently hit 77% on ARC-AGI-2.
ARC-AGI-3 throws all of that out. Instead of static puzzles, it drops AI agents into interactive grid-world mini-games with hidden rules. The agent has to explore the environment, experiment with actions, form hypotheses about how the game works, and then solve it. Think of it like handing a toddler a toy they have never seen before. Humans figure these games out in minutes. Current AI agents score zero.
How It Actually Works
Each task is an interactive environment with hidden dynamics. The agent can take actions that change the state of the grid, observe what happens, and use that information to build a mental model of the rules. This tests a cluster of capabilities that static benchmarks completely miss:
- Memory across sequential steps (remembering what you tried and what happened)
- Exploration strategy (deliberate hypothesis testing, not random guessing)
- Credit assignment (figuring out which of your actions actually mattered over a long sequence)
- Tool use in some environments that require invoking specific procedures
The benchmark deliberately excludes trivia, cultural knowledge, and linguistic tricks. It is designed to measure raw learning ability and nothing else.
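To make the explore-then-solve loop concrete, here is a minimal sketch in Python. It is not the ARC-AGI-3 API or one of its games: the `HiddenRuleGrid` environment, the `explore_then_solve` agent, and the rule being hidden (which of four actions moves in which direction) are all hypothetical, invented for illustration. It also simplifies heavily by letting the agent see the goal, which the real games do not promise.

```python
import random

# Hypothetical toy game, NOT the real ARC-AGI-3 interface.
# Four actions move the agent on a grid, but which action maps to which
# direction is hidden and reshuffled for every game.
MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0)]


class HiddenRuleGrid:
    def __init__(self, size=5, seed=None):
        rng = random.Random(seed)
        self.size = size
        self._mapping = MOVES[:]      # hidden rule: the agent never reads this
        rng.shuffle(self._mapping)
        self.pos = (size // 2, size // 2)
        self.goal = (0, 0)            # simplification: the goal is visible

    def step(self, action):
        """Apply an action, clamp to the grid, return (position, solved)."""
        dr, dc = self._mapping[action]
        r = min(max(self.pos[0] + dr, 0), self.size - 1)
        c = min(max(self.pos[1] + dc, 0), self.size - 1)
        self.pos = (r, c)
        return self.pos, self.pos == self.goal


def explore_then_solve(env, max_steps=50):
    model = {}   # memory: action -> displacement observed when it was tried
    steps = 0

    # Exploration: try each action once and record what it did.
    for action in range(4):
        before = env.pos
        after, solved = env.step(action)
        steps += 1
        model[action] = (after[0] - before[0], after[1] - before[1])
        if solved:
            return True, steps, model

    # Exploitation: use the learned model to pick the action whose
    # predicted next cell is closest to the goal.
    while steps < max_steps:
        r, c = env.pos
        gr, gc = env.goal
        action = min(
            model,
            key=lambda a: abs(r + model[a][0] - gr) + abs(c + model[a][1] - gc),
        )
        _, solved = env.step(action)
        steps += 1
        if solved:
            return True, steps, model

    return False, steps, model
```

Calling `explore_then_solve(HiddenRuleGrid(seed=0))` returns a solved flag, the number of steps used, and the action model the agent inferred. The design point is that nothing about the action mapping can be known in advance, so the agent has to spend steps on experiments and remember their outcomes before it can plan. The real games hide far more than this toy does, which is where exploration strategy and credit assignment become hard.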
The Scores Are Brutal
In the developer preview of just three games, current frontier agents scored zero points. The one bright spot: an OpenAI researcher reported that a ChatGPT-based agent managed to solve the first preview game. That is progress, but one game is a long way from the roughly 100 in the full benchmark.
Hugging Face is sponsoring a four-week coding sprint with a $10,000 prize pool. Developers can submit custom agents through a public API. The full benchmark, split into public and private test sets, is rolling out now.
A Benchmark That Might Actually Measure Something Useful
Most AI benchmarks end up measuring memorization. Models train on data that overlaps with test sets, scores inflate, and everyone argues about contamination. ARC-AGI-3 sidesteps this entirely because there is nothing to memorize. Each environment has novel rules that the agent must discover through interaction.
This is closer to what people actually mean when they talk about "general intelligence": the ability to encounter something genuinely new and figure it out without being told the answer first. The fact that every current system fails at this basic task is a useful reality check on where AI actually stands, regardless of what the marketing copy says.