50,800 GitHub stars in under three weeks. Andrej Karpathy's latest open-source project, AutoResearch, has clearly struck a nerve with the ML community.
The premise is simple but compelling: what if an AI agent could run your machine learning experiments overnight, try dozens of variations, and hand you a log of results by morning? That's exactly what AutoResearch does. You point it at a training setup, give it a single file to modify, and let it loop through 5-minute experiment cycles autonomously. At roughly 12 experiments per hour, an eight-hour overnight run can explore nearly 100 different configurations without any human intervention.
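The throughput numbers are easy to check. Here is a quick back-of-the-envelope sketch; the eight-hour run length is an assumption for "overnight", and the helper name is made up for illustration. It ignores any time the agent spends between cycles, so treat the result as an upper bound.

```python
def experiments_per_run(cycle_minutes: int, run_hours: int) -> int:
    """Complete cycles that fit in a run, ignoring overhead between cycles."""
    return (run_hours * 60) // cycle_minutes


# 5-minute cycles, as described in the article
print(experiments_per_run(cycle_minutes=5, run_hours=1))  # 12
# Assumed 8-hour overnight run
print(experiments_per_run(cycle_minutes=5, run_hours=8))  # 96
```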
One GPU, One File, One Metric
Karpathy designed AutoResearch around radical simplicity. The entire framework revolves around three files:
- prepare.py handles data prep and utilities. The agent never touches this.
- train.py is the one file the AI agent is allowed to modify. It contains the model architecture, optimizer settings, and training loop.
- program.md is a markdown file where you write instructions telling the agent what to explore.
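To make the one-file rule concrete, here is a minimal sketch of a guard that flags any change outside train.py. This is not code from the project: the function name and the idea of passing in a list of changed paths are assumptions for illustration only.

```python
# The only file the agent may modify, per the three-file layout above.
EDITABLE = {"train.py"}


def disallowed_changes(changed_paths: list[str]) -> list[str]:
    """Return the changed paths that fall outside the editable set (hypothetical helper)."""
    return [path for path in changed_paths if path not in EDITABLE]


# An experiment that only touched train.py passes the check.
print(disallowed_changes(["train.py"]))
# One that also edited prepare.py would be flagged.
print(disallowed_changes(["train.py", "prepare.py"]))
```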
Each experiment cycle trains for exactly 5 minutes on a single NVIDIA GPU, then the agent checks validation bits per byte: the model's cross-entropy on held-out text, measured in bits and normalized by the number of bytes of text, so lower is better. If the change improved the metric, the agent keeps the modification. If not, it rolls back and tries something different.
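The keep-or-roll-back cycle described above amounts to a greedy search. The sketch below shows that logic under heavy assumptions: `propose` stands in for the agent editing train.py, `evaluate` stands in for a 5-minute training run that returns validation bits per byte, and the toy functions are invented purely so the example runs. None of these names come from the AutoResearch codebase.

```python
import math
import random
from typing import Callable

Config = dict[str, float]


def greedy_search(
    initial: Config,
    propose: Callable[[Config, random.Random], Config],
    evaluate: Callable[[Config], float],
    n_experiments: int,
    seed: int = 0,
) -> tuple[Config, float, list[tuple[Config, float, bool]]]:
    """Keep a candidate only if it lowers the metric; otherwise stay on the current best."""
    rng = random.Random(seed)
    best = dict(initial)
    best_score = evaluate(best)
    log = []
    for _ in range(n_experiments):
        candidate = propose(best, rng)  # always branch from the current best
        score = evaluate(candidate)
        kept = score < best_score       # lower is better
        if kept:
            best, best_score = candidate, score
        # if not kept, "rolling back" just means best is left unchanged
        log.append((candidate, score, kept))
    return best, best_score, log


def toy_propose(config: Config, rng: random.Random) -> Config:
    """Hypothetical edit: halve or double the learning rate."""
    new = dict(config)
    new["lr"] = config["lr"] * rng.choice([0.5, 2.0])
    return new


def toy_evaluate(config: Config) -> float:
    """Hypothetical metric: pretend the score is lowest at lr = 0.01."""
    return 1.0 + abs(math.log10(config["lr"]) - math.log10(0.01))


best, best_score, log = greedy_search(
    {"lr": 0.08}, toy_propose, toy_evaluate, n_experiments=20
)
print(best, round(best_score, 3), sum(kept for _, _, kept in log), "changes kept")
```

Because every candidate branches from the current best, a rejected change leaves no trace in later experiments, which is the property the rollback provides.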
The training setup uses a simplified version of nanochat, a small GPT model, with the Muon and AdamW optimizers. It's deliberately minimal. No distributed training across multiple machines, no complex configuration files, no dashboard to set up.
What the Agent Actually Changes
This is the interesting part. The AI agent doesn't just tweak hyperparameters like learning rate or batch size, though it does that too. It can modify the model architecture itself, restructuring layers, changing attention mechanisms, and adjusting how the model processes data. The agent essentially plays the role of a junior ML researcher running ablation studies (systematic experiments where you change one thing at a time to measure its effect).
The fixed 5-minute training window is a smart constraint. It forces the agent to work with fast-feedback experiments rather than kicking off 8-hour training runs that might not pan out. You get breadth of exploration instead of depth on any single idea.
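A fixed training window is simple to express in code. The sketch below is one way to do it, not the project's implementation: `step_fn` is a placeholder for a single optimizer step, and the injectable `clock` argument is an assumption added so the loop can be exercised without waiting.

```python
import time
from typing import Callable


def train_for_budget(
    step_fn: Callable[[int], None],
    budget_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Call step_fn until the wall-clock budget is spent; return the steps completed."""
    deadline = clock() + budget_seconds
    steps = 0
    while clock() < deadline:
        step_fn(steps)  # placeholder for one training step
        steps += 1
    return steps


# Tiny demonstration: a do-nothing step and a 0.05-second budget.
completed = train_for_budget(lambda step: None, budget_seconds=0.05)
print(f"completed {completed} steps")
```

With a loop like this, every experiment gets the same wall-clock budget regardless of what the agent changed, so a slower variant simply completes fewer steps before it is scored.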
Community Has Already Forked It for Every Platform
Karpathy built this for NVIDIA GPUs (tested on H100s), but the community has already produced forks for macOS (including one built on Apple's MLX framework), Windows, and AMD GPUs. The project requires Python 3.10+ and the uv package manager, and it ships under the MIT license, which permits commercial use.
The 50.8k stars and 7.1k forks since the March 6 release tell you something about demand. Researchers have been manually doing this loop for years: change something, train, check results, repeat. Having an AI agent handle the tedious iteration while you focus on higher-level research direction is a genuine productivity gain.
That said, AutoResearch is very much a research prototype, not a polished product. There's no experiment tracking dashboard, no visualization of results over time, and you need to be comfortable reading Python code and working from the command line. Karpathy has been upfront that this is about demonstrating the concept of AI-in-the-loop research, not replacing tools like Weights & Biases or MLflow.
For ML practitioners with access to a GPU, it's worth cloning the repo and letting it run overnight on a problem you care about. The worst case is you lose a few dollars in compute. The best case is you wake up to a model configuration you wouldn't have tried yourself.