
Karpathy's AutoResearch Ran 700 ML Experiments in 2 Days Without Human Input

March 23, 2026

700 experiments. Two days. Zero human involvement. That's the output of AutoResearch, a new open-source framework from former Tesla AI director and OpenAI co-founder Andrej Karpathy that lets AI coding agents autonomously run machine learning experiments in a continuous loop.

The concept is deceptively simple: give an AI agent a training script, lock down the evaluation function so it can't be gamed, and let the agent form hypotheses, modify code, train for exactly five minutes, check results, keep what works, revert what doesn't, and repeat. Forever. The entire framework is just 630 lines of Python.
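The keep-or-revert cycle described above can be sketched in a few lines. This is a minimal illustration, not AutoResearch's actual code: `run_loop`, `toy_propose`, and `toy_loss` are hypothetical names, the "agent" is a random perturbation, and the five-minute training run is replaced by a toy quadratic loss so the example finishes instantly.

```python
import random
from typing import Callable, Dict, List, Tuple

Config = Dict[str, float]


def run_loop(
    baseline: Config,
    propose: Callable[[Config, random.Random], Config],
    evaluate: Callable[[Config], float],
    n_experiments: int,
    seed: int = 0,
) -> Tuple[Config, float, List[Tuple[int, float]]]:
    """Greedy keep-or-revert loop; lower score is better (e.g. validation loss)."""
    rng = random.Random(seed)
    best = dict(baseline)
    best_score = evaluate(best)         # the fixed metric the agent may not edit
    kept: List[Tuple[int, float]] = []  # (experiment index, new best score)

    for i in range(n_experiments):
        candidate = propose(best, rng)  # form a hypothesis, modify the setup
        score = evaluate(candidate)     # stand-in for a fixed-budget training run
        if score < best_score:          # keep what works
            best, best_score = candidate, score
            kept.append((i, score))
        # otherwise revert: the candidate is simply discarded

    return best, best_score, kept


# Hypothetical stand-ins: a quadratic "loss" with its optimum at lr=0.3, wd=0.1.
def toy_loss(cfg: Config) -> float:
    return (cfg["lr"] - 0.3) ** 2 + (cfg["wd"] - 0.1) ** 2


def toy_propose(cfg: Config, rng: random.Random) -> Config:
    new = dict(cfg)
    key = rng.choice(sorted(new))         # pick one setting to change
    new[key] += rng.uniform(-0.05, 0.05)  # nudge it slightly
    return new


if __name__ == "__main__":
    start = {"lr": 1.0, "wd": 1.0}
    best, score, kept = run_loop(start, toy_propose, toy_loss, n_experiments=700)
    print(f"kept {len(kept)} of 700 changes; loss {toy_loss(start):.3f} -> {score:.3f}")
```

The design point is that `evaluate` sits outside anything `propose` can touch, which mirrors the article's "lock down the evaluation function" step. In the real framework the proposal step is an AI coding agent editing a training script rather than a random nudge.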

What the Agent Actually Found

Karpathy pointed AutoResearch at his "nanochat" project, a small language model training setup he considered already well-tuned. Over roughly two days on a single NVIDIA GPU, the agent churned through about 700 modifications and surfaced around 20 genuine improvements to validation loss.

The specific fixes are humbling for any ML practitioner. The agent discovered that a QKNorm layer (a normalization technique applied to attention queries and keys) was missing a scalar multiplier, making attention patterns too spread out. It found that value embeddings had no regularization applied. It caught that the banded attention window was set too conservatively - something Karpathy admitted he had "forgotten to tune." It corrected AdamW optimizer parameters and weight decay schedules.
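To see why a missing multiplier can flatten attention, consider a toy version. The sketch below assumes queries and keys are L2-normalized, so every query-key score is a cosine similarity between -1 and 1; nanochat's actual normalization and the value of its multiplier are not given here, and `attention_weights` and its `scale` argument are illustrative names only.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Scale a vector to (approximately) unit length."""
    norm = math.sqrt(sum(x * x for x in v)) + eps
    return [x / norm for x in v]


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(
    q: Sequence[float], keys: Sequence[Sequence[float]], scale: float = 1.0
) -> List[float]:
    """Attention of one query over several keys, normalizing q and k first.

    `scale` stands in for the scalar multiplier; scale=1.0 models it being absent.
    """
    qn = l2_normalize(q)
    logits = [scale * sum(a * b for a, b in zip(qn, l2_normalize(k))) for k in keys]
    return softmax(logits)


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy in nats; higher means a more spread-out distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


query = [2.0, 0.0]
keys = [[3.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, 5.0]]
flat = attention_weights(query, keys, scale=1.0)
sharp = attention_weights(query, keys, scale=8.0)
print(f"entropy without scale: {entropy(flat):.2f}, with scale=8: {entropy(sharp):.2f}")
```

With unit-length vectors the logits can never differ by more than 2, so even a perfectly matching key cannot dominate the softmax. A multiplier widens that range and lets the model attend sharply when it needs to.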

All 20 improvements transferred to larger models. Stacked together, they dropped the "Time to GPT-2" benchmark - how long it takes to train a model to GPT-2-level performance - from 2.02 hours to 1.80 hours. That's about 11% less training time on code that one of the most respected ML engineers in the world thought was already optimized.
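The 11% figure is the reduction in wall-clock time. Expressed as a throughput ratio, the same two numbers come out slightly higher, as a quick check shows (the variable names are illustrative):

```python
before_hours = 2.02  # "Time to GPT-2" before the agent's changes
after_hours = 1.80   # after stacking the improvements

time_saved = (before_hours - after_hours) / before_hours  # fraction of time removed
speedup = before_hours / after_hours                       # throughput ratio

print(f"{time_saved:.1%} less time, {speedup:.2f}x as fast")
# prints "10.9% less time, 1.12x as fast"
```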

The Infrastructure Matters More Than the Model

One of Karpathy's more interesting observations: the quality of the loop infrastructure mattered more than the raw intelligence of the AI model running inside it. Claude Opus 4.6 ran continuously for 12+ hours and 118 experiments. GPT-5 couldn't reliably follow the instruction to keep looping.

The framework is model-agnostic by design - any AI coding agent can plug in. But the practical finding that robust looping, clean interruption handling, and transparent session management outweigh model capability is a useful signal for anyone building agent systems.

SkyPilot, a cloud orchestration company, scaled AutoResearch across 16 GPUs (a mix of H100s and H200s) and hit about 910 experiments in 8 hours - a 9x throughput increase. Their total cost: roughly $9 for Claude Code API calls and $300 for GPU compute. That's a $309 bill for what would take a human researcher weeks of manual hyperparameter tuning.
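The SkyPilot numbers can be sanity-checked with simple arithmetic. The sketch below assumes the 9x baseline is a single agent running five-minute experiments back to back with no overhead; the article does not state the baseline, so treat that comparison as a guess.

```python
experiments, hours, gpus = 910, 8, 16
api_cost_usd, gpu_cost_usd = 9.0, 300.0
run_minutes = 5  # per-experiment training budget described earlier

parallel_rate = experiments / hours  # experiments per hour across the cluster
per_gpu_rate = parallel_rate / gpus  # experiments per hour per GPU
sequential_rate = 60 / run_minutes   # assumed baseline: one run after another
throughput_gain = parallel_rate / sequential_rate

total_cost_usd = api_cost_usd + gpu_cost_usd
cost_per_experiment = total_cost_usd / experiments

print(f"{parallel_rate:.2f} experiments/hour, {throughput_gain:.1f}x the assumed baseline")
print(f"${total_cost_usd:.0f} total, ${cost_per_experiment:.2f} per experiment")
```

Under that assumption the cluster ran about 114 experiments per hour, roughly 9.5 times the sequential ceiling of 12 per hour, at about $0.34 per experiment. Each GPU averaged about 7 experiments per hour rather than 12, which suggests some time went to agent calls and scheduling between runs.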

Shopify CEO Tobias Lütke ran it overnight on a query-expansion model and got a 19% performance gain, with a 0.8-billion-parameter model outscoring a previous 1.6-billion-parameter version. Smaller model, better results, found by an agent while everyone slept.

The "Loopy Era"

Karpathy is framing this moment as the start of what he calls the "Loopy Era" of AI development - where agents run continuous self-improvement loops on code and research without human direction. His ambition for AutoResearch extends beyond solo agents: he envisions distributed teams of agents collaborating asynchronously on research problems, essentially emulating an entire research community.

"All LLM frontier labs will do this," Karpathy wrote. "Any metric you care about that is reasonably efficient to evaluate can be autoresearched by an agent swarm."

That's a bold claim, but the results back it up for a specific class of problems: anything where you can define a clear evaluation metric and run experiments cheaply. Hyperparameter tuning and architecture search are obvious fits. Drug discovery assays, compiler optimizations, materials science simulations - anywhere the feedback loop is fast and measurable, this pattern applies.

The repo has already cleared 42,000 GitHub stars. For AI practitioners, the real takeaway isn't that agents can tune hyperparameters - AutoML has existed for years. It's that general-purpose coding agents, given nothing but a training script and instructions to "continue working indefinitely," can find meaningful optimizations that experienced humans miss. The gap between "useful coding assistant" and "autonomous researcher" just got noticeably smaller.
