630 lines of Python. One GPU. An AI agent that runs 100 machine learning experiments overnight — and improves itself without any human in the loop.
On March 7, 2026, Andrej Karpathy — OpenAI co-founder, former Tesla AI director, and the man who coined "vibe coding" — pushed a new project to GitHub. No launch event. No press release. No 20-page technical report. Just a repo called autoresearch and a few words on X.
The post gathered 8.6 million views in two days and the repo hit 30,000 stars in a week — one of the fastest-growing repositories in GitHub history. Why? Because what Karpathy built is deceptively small and philosophically enormous: a system in which an AI agent conducts machine learning research autonomously — running experiments, evaluating outcomes, and continuously improving a model — indefinitely.
The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement.
— Andrej Karpathy, March 2026

The repo's README begins with a piece of speculative fiction from the future — a world where "frontier AI research used to be done by meat computers in between eating, sleeping, and synchronizing via sound wave interconnect in the ritual of group meeting." It's tongue-in-cheek. But it also lands like a quiet prophecy.
AutoResearch is built on a stripped-down version of nanochat, Karpathy's minimal LLM training framework, condensed to a single GPU and a single file. The entire system is deliberately tiny — roughly 630 lines of Python — small enough to fit inside an LLM's context window, which is part of the design.
There are only three files that matter: program.md, the research brief; the single ~630-line Python training script; and results.tsv, the flat log of every run.

The human writes a high-level research goal in program.md — a plain Markdown file that acts as a lightweight "skill" or research brief. Then an external AI agent (Claude, Codex, or any LLM with coding capabilities) takes over.
Here's the cycle the agent runs, again and again, without stopping: modify the training script, run a fixed-budget experiment, measure the result, keep or revert the change, repeat.
Every experiment runs for exactly 5 minutes, regardless of what the agent changes — model architecture, learning rate, batch size, optimizer type. This fixed time budget is a crucial design decision: it makes every experiment directly comparable, since a larger model and a smaller model both get the same compute and are judged by the same standard.
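The fixed wall-clock budget is easy to picture in code. A minimal sketch of how such a budget might be enforced — `train_step` and `evaluate` are hypothetical callables standing in for one optimizer step and a validation pass, not the repo's actual functions:

```python
import time

def run_experiment(train_step, evaluate, budget_s=5 * 60):
    """Train until the wall-clock budget expires, then evaluate once.

    Every experiment gets the same budget (five minutes in AutoResearch),
    so a small fast-iterating model and a large slow one are judged on
    equal compute.
    """
    deadline = time.monotonic() + budget_s
    steps = 0
    while time.monotonic() < deadline:
        train_step()   # one optimizer step of whatever the agent wrote
        steps += 1
    return steps, evaluate()
```

Note that the budget caps wall time, not step count: an architecture change that makes each step cheaper automatically buys more steps inside the same five minutes.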
The performance metric is val_bpb — validation bits per byte. This is vocabulary-size-independent, which means the agent can change tokenization strategies or model sizes between runs and still get a fair, apples-to-apples comparison.
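Bits per byte falls out of a simple unit conversion: sum the cross-entropy over the validation set in nats (whatever the tokenization), then divide by the byte count of the underlying text. A sketch:

```python
import math

def val_bpb(total_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy (nats) to bits per byte of raw text.

    Because the denominator is bytes of the original text rather than
    tokens, the metric stays comparable when the agent swaps tokenizers
    or vocabulary sizes between runs.
    """
    return total_nats / (math.log(2) * total_bytes)
```

For example, a model that accumulates 2000 nats of loss over 1000 bytes of validation text scores about 2.885 bpb, no matter how those bytes were tokenized.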
The fixed time budget solves the comparison problem elegantly. By making compute constant, you're always measuring what's optimal for your specific hardware in a fixed budget, not which model is globally better. A smaller model that trains more steps in 5 minutes can beat a larger model that takes longer to iterate — and the agent learns this pattern on its own.
Every successful change is committed to git. Every failed experiment is reverted. The result is a perfectly legible experiment journal — the git log becomes a scientific notebook that tracks every hypothesis, what worked, and what didn't. The agent also writes results to a flat results.tsv file after each run.
In an overnight run, Karpathy's agent completed 126 experiments autonomously, driving val_bpb down from 0.9979 to 0.9697 with zero human input. But then Karpathy did something more interesting: he let it run for two full days.
The 11% gain from the full two-day run might sound modest, but the context makes it remarkable. Karpathy himself noted the model was already heavily manually optimized over years of work. The agent caught oversights in attention scaling and regularization that had survived his two decades of experience in the field. "Seeing the agent do this entire workflow end-to-end and all by itself... is wild," he wrote.
These are real and substantial gains. All we're doing is optimizing performance per compute.
— Karpathy, responding to skeptics on X

Within days of release, the community began documenting their own runs. One user on X, running the loop on a Mac Mini M4 overnight, found that "the model got better by getting simpler" — 26 of 35 experiments failed or crashed, but the seven that succeeded consistently pointed toward leaner architectures. An insight reached with zero human intervention.
Harrison Chase, founder of LangChain, published autoresearch-agents within days — adapting the loop for agent optimization rather than model training. Varun Mathur of Hyperspace AI distributed the loop across a peer-to-peer network, running 333 experiments across 35 autonomous agents in a single night. The emergent patterns were striking: H100 GPUs used "brute force" strategies (aggressive learning rates), while CPU-only agents on laptops were forced to be clever — focusing on initialization strategies and normalization because they couldn't rely on raw throughput.
One of the most counterintuitive things about AutoResearch is that its minimalism is a feature, not a constraint. Karpathy deliberately kept the entire system small enough to fit inside an LLM context window — because the agent needs to understand everything it's working with.
This philosophy has three concrete engineering consequences:
| Constraint | Engineering Reason | Effect |
|---|---|---|
| Fixed 5-min time budget | Makes experiments platform-comparable | Any hypothesis can be fairly tested |
| Single editable file | Keeps search space interpretable | Every change is reviewable as a clean diff |
| Single scalar metric | Eliminates judgment ambiguity | Agent can't be fooled by proxy metrics |
| No external infra | Just PyTorch + 3 files | Reproducible by anyone with a GPU |
The single scalar metric point is particularly important. Karpathy chose val_bpb because it requires no human judgment to interpret — the number either goes down or it doesn't. Goodhart's Law (when a measure becomes a target, it ceases to be a good measure) applies with particular force to an agent running 100 experiments per night with no off switch. The metric must be clean.
The most important thing about AutoResearch is not that it improves LLM training. It's that it demonstrates a generalizable pattern for autonomous optimization in any domain where you can define a single scalar metric and a mutable artifact.
The loop is just: Modify → Run → Measure → Keep or Revert → Repeat. Applied universally, this pattern already has working forks in:
pi-autoresearch — optimizes test speed, bundle size, and Lighthouse scores for web apps.
autoresearch-agents (Harrison Chase) — optimizes AI agent behavior and routing strategies.
Customer support routing — editable asset is a routing config; metric is classification accuracy against a labelled holdout set.
Database query optimization — editable asset is a query plan; metric is execution time.
A/B testing — editable asset is a UI config; metric is conversion rate.
The key insight, articulated by The New Stack's analysis, is that the primary investment required is in document authorship rather than infrastructure. You're not writing code — you're writing the Markdown that tells the agent how to write code. The human becomes the meta-researcher, designing the experiment engine rather than running the experiments.
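To make that concrete, a hypothetical program.md brief might look like the following (the filename `train.py` and the specific rules are illustrative, not from the repo):

```
# Research goal
Minimize val_bpb of the model defined in train.py, within the fixed
5-minute wall-clock budget per experiment.

## Rules
- Edit only train.py; test one hypothesis per experiment.
- Commit on improvement; revert otherwise.
- Append every run, including failures, to results.tsv.
```

The brief defines what "better" means and where the agent may act; everything else is left to the loop.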
Karpathy is transparent about the system's limits. The repo notes that results are not comparable across compute platforms — what works on an H100 may not transfer to a Mac Mini M4, and vice versa. The community has begun treating this as a feature (hardware diversity as exploration diversity) rather than a bug.
One of the most remarkable things about AutoResearch is what grew around it in under a week. Because the repo is minimal, MIT-licensed, and designed around three simple files, forking and adapting it is trivial. The community obliged.
| Fork / Project | What It Does |
|---|---|
| autoresearch-mlx | Apple Silicon port using MLX instead of PyTorch/CUDA — runs natively on MacBooks and Mac Minis |
| n-autoresearch | Multi-GPU parallelism with structured experiment tracking, REST API for hypothesis registration |
| autoresearch-agents (LangChain) | Applies the loop to agent optimization — modify routing logic, test against holdout, keep or revert |
| Hyperspace distributed loop | 35 autonomous agents across a P2P network; 333 experiments in one night |
| autoexp (generic) | Fully generalized version: same loop, any domain, any quantifiable metric |
| pi-autoresearch | Applies the loop to web performance metrics: bundle size, test speed, Lighthouse scores |
AutoResearch is not the end state. It doesn't replace human researchers, it doesn't generate paradigm-shifting theoretical breakthroughs, and it won't autonomously solve alignment. But it does something subtler and arguably more important: it collapses the cost of "running experiments."
Before AutoResearch, running 100 experiments meant weeks of researcher time across multiple GPUs with careful manual bookkeeping. After AutoResearch, it means going to sleep and waking up. The compression of that timeline — from weeks to one night — changes what's tractable to explore. It changes which hypotheses are worth testing. It changes the economics of ML research entirely.
Karpathy himself frames it with characteristic wit in the repo's README: this is "the story of how it all began." The vision at the end of that sentence — swarms of AI agents running across compute cluster megastructures, 10,205 generations deep into a codebase no human could comprehend — reads as science fiction. But AutoResearch is the first commit.
Vibe coding has leveled up. This is vibe science.
— AI community reaction to AutoResearch's release

The real innovation is the paradigm: you are no longer the experimenter. You are the designer of the experiment engine. Your job is to write clear instructions in a Markdown file and define what "better" means. The rest happens while you sleep.
For anyone building in ML, systems engineering, or anywhere that optimization matters — the question is no longer whether to use this pattern. The question is what metric you're going to define, and how good your program.md is.