Karpathy's Autoresearch Bot Found 20 Real ML Improvements in 2 Days

Two days, roughly 700 autonomous code changes, and about 20 confirmed improvements to a language model's training efficiency. That's the result Andrej Karpathy shared on March 10 after letting his open-source "autoresearch" tool run unsupervised.

The setup: a small language model (nanochat) training on a single GPU, with an AI agent authorized to modify the training code, run 5-minute experiments, evaluate results, and loop. No human in the loop. About 12 experiments per hour, roughly 100 overnight.
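The throughput figures follow from the 5-minute budget. A quick sketch of the arithmetic, assuming an 8-hour night and no overhead between runs (both are assumptions for illustration, not figures from Karpathy):

```python
def experiments_per_hour(minutes_per_experiment: float) -> float:
    """Upper bound on experiments per hour, ignoring setup and evaluation overhead."""
    return 60 / minutes_per_experiment


def experiments_in(hours: float, minutes_per_experiment: float) -> int:
    """Whole experiments that fit in a window of the given length."""
    return int(hours * 60 // minutes_per_experiment)


rate = experiments_per_hour(5)    # 5-minute budget from the article
overnight = experiments_in(8, 5)  # 8 hours is an assumed "overnight" window

print(f"{rate:.0f} experiments per hour")     # 12 experiments per hour
print(f"{overnight} experiments in 8 hours")  # 96 experiments in 8 hours
```

That lands close to the "roughly 100 overnight" figure.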

The interesting part isn't that AI can tinker with code. It's that the improvements were real.

The Results Hold Up at Scale

Karpathy tested the 20 changes the agent found worthwhile. All of them were additive, meaning they stacked without canceling each other out. More importantly, improvements discovered on a small model (depth-12) transferred cleanly to a larger model (depth-24). That's the kind of result that separates useful automation from noise.

The concrete outcome: the "Time to GPT-2" benchmark (how quickly you can train a model to GPT-2-level performance) dropped from 2.02 hours to 1.80 hours. An 11% efficiency gain, found entirely by software.
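The 11% figure can be checked directly from the two benchmark times. A small sketch (the function name is ours, used only for illustration):

```python
def relative_reduction(before_hours: float, after_hours: float) -> float:
    """Fraction of wall-clock training time saved."""
    return (before_hours - after_hours) / before_hours


baseline_hours = 2.02  # "Time to GPT-2" before the agent's changes
tuned_hours = 1.80     # after the agent's changes

reduction = relative_reduction(baseline_hours, tuned_hours)
speedup = baseline_hours / tuned_hours

print(f"time saved: {reduction:.1%}")  # time saved: 10.9%
print(f"speedup: {speedup:.2f}x")      # speedup: 1.12x
```

Measured as time saved, the gain is 10.9%, which rounds to the 11% quoted. Measured as a speedup, it is about 1.12x.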

What did the agent actually find? Specific architectural issues that human researchers had missed. The attention mechanism lacked a scalar multiplier, making attention too diffuse. Value embeddings needed regularization. Banded attention (a technique that limits which tokens can attend to each other) was set too conservatively. AdamW optimizer parameters were misconfigured. These aren't random guesses. They're the kind of findings a careful ML researcher would make after weeks of manual ablation studies (systematically testing each component in isolation).
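To see why a missing scalar multiplier can leave attention "too diffuse", here is a toy illustration. This is not nanochat's code: the scores and the scale value of 4.0 are made up for the example. Multiplying attention scores by a larger scalar before the softmax concentrates the weights on the highest-scoring token, which shows up as lower entropy.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a more diffuse distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def attention_weights(scores, scale=1.0):
    """Softmax over scores multiplied by a scalar (hypothetical helper)."""
    return softmax([scale * s for s in scores])


scores = [0.2, 0.5, 0.1, 0.9]  # made-up query-key scores for four tokens

diffuse = attention_weights(scores, scale=1.0)
sharp = attention_weights(scores, scale=4.0)

print(f"entropy without extra scaling: {entropy(diffuse):.3f}")
print(f"entropy with scale=4.0:        {entropy(sharp):.3f}")
```

The second entropy is lower, meaning the scaled version puts more of its weight on the top-scoring token. Which scale is actually right for a given model is an empirical question, and that is the kind of question the agent's 5-minute experiments answer.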

How It Works

Autoresearch is stripped down to about 630 lines of Python. The human writes a Markdown file called program.md describing the research strategy. The AI agent reads it, modifies train.py, runs experiments within a strict 5-minute budget per test, evaluates validation loss (how well the model predicts unseen text), and keeps changes that improve the metric. Everything runs through git, so improvements ratchet forward and bad changes get reverted.
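The keep-or-revert loop can be sketched in a few lines. This is a simplified stand-in, not the autoresearch source: the `Ratchet` class and `simulated_experiment` function are hypothetical, the "experiment" is a random number rather than a 5-minute training run, and the git commit and revert steps are represented only by comments.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Ratchet:
    """Tracks the best validation loss seen so far and which changes were kept."""
    best_loss: float
    kept: list = field(default_factory=list)

    def consider(self, change_name: str, val_loss: float) -> bool:
        """Keep a change only if it lowers validation loss."""
        if val_loss < self.best_loss:
            # In the real tool, this is where the change would be committed to git.
            self.best_loss = val_loss
            self.kept.append(change_name)
            return True
        # In the real tool, this is where the change would be reverted.
        return False


def simulated_experiment(rng: random.Random, current_best: float) -> float:
    """Stand-in for a time-boxed training run: returns a made-up validation loss."""
    return current_best + rng.uniform(-0.01, 0.03)


def run_loop(n_experiments: int, start_loss: float = 3.0, seed: int = 0) -> Ratchet:
    rng = random.Random(seed)
    ratchet = Ratchet(best_loss=start_loss)
    for i in range(n_experiments):
        loss = simulated_experiment(rng, ratchet.best_loss)
        ratchet.consider(f"change-{i}", loss)
    return ratchet


result = run_loop(n_experiments=100)
print(f"kept {len(result.kept)} of 100 changes, best loss {result.best_loss:.3f}")
```

Because a change is kept only when it beats the current best, the tracked loss can never get worse, which is the ratchet behavior described above. Most simulated changes are rejected here, which loosely mirrors the article's ratio of about 20 keepers out of roughly 700 attempts.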

The constraint design matters more than the model powering it. Karpathy noted that GPT-5.4 actually failed at following the basic "LOOP FOREVER" instruction, while Claude Opus 4.6 ran 12+ hours straight and completed 118 experiments. The bottleneck isn't raw intelligence. It's the ability to follow structured instructions reliably over long periods.

Shopify CEO Tobi Lütke independently validated the approach, reporting a 19% improvement on a 0.8B-parameter model that ended up outperforming a manually configured model twice its size.

Smooth Recursion, Not Sudden Singularity

Karpathy has been building toward this argument since at least June 2025, when he described recursive self-improvement as something already happening in "smooth, incremental" form. The autoresearch results are the practical evidence.

The vision he outlined on March 8 goes further: a SETI@home-style distributed network where thousands of AI agents collaborate on research simultaneously. Not one PhD student, but an entire research community of agents working asynchronously.

For anyone training models, even at small scale, autoresearch is already usable. The code is open-source. The practical implication is clear: overnight agent runs can now surface real architectural improvements that humans miss. The human's job shifts from writing Python to writing good research briefs. That's a meaningful change in how ML research gets done, even if it looks nothing like science fiction.