Andrej Karpathy · March 2026 · Open Source

AutoResearch: The AI That Does Science While You Sleep

630 lines of Python. One GPU. An AI agent that runs 100 machine learning experiments overnight — and improves itself without any human in the loop.

30K+ GitHub stars in 7 days
~100 experiments per night
11% efficiency gain over 2 days
~700 autonomous changes in 48 hrs
Introduction

What Just Happened?

On March 7, 2026, Andrej Karpathy — OpenAI co-founder, former Tesla AI director, and the man who coined "vibe coding" — pushed a new project to GitHub. No launch event. No press release. No 20-page technical report. Just a repo called autoresearch and a few words on X.

The post gathered 8.6 million views in two days and the repo hit 30,000 stars in a week — one of the fastest-growing repositories in GitHub history. Why? Because what Karpathy built is deceptively small and philosophically enormous: a system where an AI agent conducts machine learning research autonomously, without any human involvement, running experiments, evaluating outcomes, and continuously improving a model — indefinitely.

The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement.

— Andrej Karpathy, March 2026

The repo's README begins with a piece of speculative fiction from the future — a world where "frontier AI research used to be done by meat computers in between eating, sleeping, and synchronizing via sound wave interconnect in the ritual of group meeting." It's tongue-in-cheek. But it also lands like a quiet prophecy.

///
Architecture

Inside the Loop: How AutoResearch Actually Works

AutoResearch is built on a stripped-down version of nanochat, Karpathy's minimal LLM training framework, condensed to a single GPU and a single file. The entire system is deliberately tiny — roughly 630 lines of Python — small enough to fit inside an LLM's context window, which is part of the design.

There are only three files that matter:

# The three files that run the future of ML research
prepare.py — constants, data prep, runtime utilities (read-only)
train.py — model, optimizer, training loop (agent edits this)
program.md — agent instructions in plain Markdown (you write this)

The human writes a high-level research goal in program.md — a plain Markdown file that acts as a lightweight "skill" or research brief. Then an external AI agent (Claude, Codex, or any LLM with coding capabilities) takes over.
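What goes in that brief? The contents below are an invented illustration of what a program.md might look like, not the repo's actual file:

```markdown
# Research Goal
Minimize val_bpb on the held-out split within the fixed 5-minute budget.

## Rules
- Edit only train.py; prepare.py is read-only.
- One hypothesis per experiment; describe it in the commit message.
- Revert any change that does not improve val_bpb.

## Ideas worth exploring
- Learning-rate schedule and warmup length
- Attention scaling and normalization placement
- Smaller-but-faster model trade-offs under the time budget
```

The file is instructions, not code: its only job is to constrain and direct the agent's search.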

The Core Research Loop

Here's the cycle the agent runs, again and again, without stopping:

01. Read program.md
02. Form hypothesis
03. Edit train.py
04. Run 5-min GPU job
05. Measure val_bpb
06. Keep or revert
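The six steps reduce to a tiny keep-or-revert driver. The sketch below is illustrative rather than the repo's code: `evaluate` stands in for the 5-minute GPU job and `propose` stands in for the agent editing train.py.

```python
import random

def research_loop(evaluate, propose, params, n_experiments=10, seed=0):
    """Minimal keep-or-revert driver (illustrative sketch).

    evaluate(params) -> float plays the role of the fixed-budget GPU job,
    returning val_bpb (lower is better); propose(params, rng) plays the
    role of the agent's edit to train.py.
    """
    rng = random.Random(seed)
    best = evaluate(params)                  # baseline score
    journal = []                             # stands in for the git log
    for _ in range(n_experiments):
        candidate = propose(params, rng)     # form hypothesis, edit train.py
        score = evaluate(candidate)          # run the fixed-budget job
        kept = score < best                  # keep only on improvement
        if kept:
            params, best = candidate, score  # "commit"; otherwise "revert"
        journal.append((candidate, score, kept))
    return params, best, journal
```

With a real training job behind `evaluate`, these dozen lines are the whole research program; everything interesting lives in how `propose` forms hypotheses.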

Every experiment runs for exactly 5 minutes, regardless of what the agent changes — model architecture, learning rate, batch size, optimizer type. This fixed time budget is a crucial design decision: it makes every experiment directly comparable, since a larger model and a smaller model both get the same compute and are judged by the same standard.

The performance metric is val_bpb — validation bits per byte. This is vocabulary-size-independent, which means the agent can change tokenization strategies or model sizes between runs and still get a fair, apples-to-apples comparison.
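Assuming the trainer reports mean cross-entropy in nats per token, converting to bits per byte is a one-liner. This sketch follows the standard definition of the metric, not necessarily the repo's exact accounting:

```python
import math

def val_bpb(loss_nats_per_token: float, n_tokens: int, n_bytes: int) -> float:
    """Convert mean cross-entropy (nats/token) into bits per byte.

    Dividing total bits by total *bytes* rather than tokens is what makes
    the metric vocabulary-size-independent: a coarser tokenizer emits
    fewer tokens per byte but pays more loss per token, and the two cancel.
    """
    total_bits = loss_nats_per_token * n_tokens / math.log(2)
    return total_bits / n_bytes
```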

Engineering Insight

The fixed time budget solves the comparison problem elegantly. By making compute constant, you're always measuring what's optimal for your specific hardware in a fixed budget, not which model is globally better. A smaller model that trains more steps in 5 minutes can beat a larger model that takes longer to iterate — and the agent learns this pattern on its own.
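Enforcing the budget is a simple wall-clock guard around the step loop. The sketch below is an assumption about how such a guard could look, with `step_fn` as a stand-in for one optimizer step:

```python
import time

def train_for_budget(step_fn, budget_s: float = 300.0) -> int:
    """Run training steps until a wall-clock budget expires (sketch).

    Every candidate gets the same budget (300 s here), so a small model
    that fits more steps into the window competes fairly with a big one;
    the return value is how many steps the candidate managed.
    """
    start = time.monotonic()
    steps = 0
    while time.monotonic() - start < budget_s:
        step_fn()   # one optimizer step; assumed to be supplied by train.py
        steps += 1
    return steps
```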

Git as the Experiment Journal

Every successful change is committed to git. Every failed experiment is reverted. The result is a perfectly legible experiment journal — the git log becomes a scientific notebook that tracks every hypothesis, what worked, and what didn't. The agent also writes results to a flat results.tsv file after each run.
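That bookkeeping can be sketched as a single decision function. Everything here is illustrative: `record_and_decide` and its arguments are hypothetical, the git invocations assume the working directory is a repo with the experiment's edit already on disk, and `run` is injectable so the git calls can be faked.

```python
import csv
import subprocess

def record_and_decide(score, best, note, results_path="results.tsv",
                      run=subprocess.run):
    """Append a result row, then commit on improvement or revert otherwise.

    Illustrative sketch of the keep/revert bookkeeping: the git log becomes
    the experiment journal, results.tsv the flat record of every run.
    """
    with open(results_path, "a", newline="") as f:
        csv.writer(f, delimiter="\t").writerow([note, f"{score:.4f}"])
    improved = score < best
    if improved:
        run(["git", "add", "-A"], check=True)
        run(["git", "commit", "-m", f"{note}: val_bpb {score:.4f}"], check=True)
    else:
        run(["git", "checkout", "--", "train.py"], check=True)  # revert the edit
    return improved
```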

///
Results

What Happened When Karpathy Let It Run?

In an overnight run, Karpathy's agent completed 126 experiments autonomously, driving val_bpb from 0.9979 down to 0.9697. Then Karpathy did something more interesting: he let it run for two full days.

~700 autonomous code changes over 48 hours
~20 additive improvements that transferred to larger models
11% reduction in the "Time to GPT-2" metric (2.02 hrs → 1.80 hrs)

The 11% gain might sound modest, but the context makes it remarkable: Karpathy himself noted that the baseline had already been heavily hand-optimized over years of work. The agent caught oversights in attention scaling and regularization that he had missed across nearly two decades in the field. "Seeing the agent do this entire workflow end-to-end and all by itself... is wild," he wrote.

These are real and substantial gains. All we're doing is optimizing performance per compute.

— Karpathy, responding to skeptics on X

What the Community Found

Within days of release, the community began documenting their own runs. One user on X, running the loop on a Mac Mini M4 overnight, found that "the model got better by getting simpler": 26 of 35 experiments failed or crashed, but the runs that succeeded consistently pointed toward leaner architectures, an insight reached with zero human intervention.

Harrison Chase, founder of LangChain, published autoresearch-agents within days — adapting the loop for agent optimization rather than model training. Varun Mathur of Hyperspace AI distributed the loop across a peer-to-peer network, running 333 experiments across 35 autonomous agents in a single night. The emergent patterns were striking: H100 GPUs used "brute force" strategies (aggressive learning rates), while CPU-only agents on laptops were forced to be clever — focusing on initialization strategies and normalization because they couldn't rely on raw throughput.

///
Design Philosophy

Why Simplicity Is the Innovation

One of the most counterintuitive things about AutoResearch is that its minimalism is a feature, not a constraint. Karpathy deliberately kept the entire system small enough to fit inside an LLM context window — because the agent needs to understand everything it's working with.

This philosophy has three concrete engineering consequences:

| Constraint | Engineering Reason | Effect |
| --- | --- | --- |
| Fixed 5-min time budget | Makes experiments directly comparable | Any hypothesis can be fairly tested |
| Single editable file | Keeps the search space interpretable | Every change is reviewable as a clean diff |
| Single scalar metric | Eliminates judgment ambiguity | Agent can't be fooled by proxy metrics |
| No external infra | Just PyTorch + 3 files | Reproducible by anyone with a GPU |

The single scalar metric point is particularly important. Karpathy chose val_bpb because it requires no human judgment to interpret — the number either goes down or it doesn't. Goodhart's Law (when a measure becomes a target, it ceases to be a good measure) applies with particular force to an agent running 100 experiments per night with no off switch. The metric must be clean.

///
Broader Implications

Beyond ML: The "Loop" as a Universal Pattern

The most important thing about AutoResearch is not that it improves LLM training. It's that it demonstrates a generalizable pattern for autonomous optimization in any domain where you can define a single scalar metric and a mutable artifact.

The loop is just: Modify → Run → Measure → Keep or Revert → Repeat. Swap in a different artifact and a different metric, and the pattern transfers:

Domains Where the Loop Already Works

pi-autoresearch — optimizes test speed, bundle size, and Lighthouse scores for web apps.

autoresearch-agents (Harrison Chase) — optimizes AI agent behavior and routing strategies.

Customer support routing — editable asset is a routing config; metric is classification accuracy against a labelled holdout set.

Database query optimization — editable asset is a query plan; metric is execution time.

A/B testing — editable asset is a UI config; metric is conversion rate.
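Stripped of everything ML-specific, the pattern fits in a dozen lines. Every name in this sketch is illustrative: `mutate` proposes a candidate, `measure` returns the scalar, and `better` encodes the metric's direction (lower val_bpb, higher conversion rate).

```python
import random

def optimize(artifact, mutate, measure, better, rounds=50, rng_seed=0):
    """Domain-agnostic Modify → Run → Measure → Keep-or-Revert loop (sketch)."""
    rng = random.Random(rng_seed)
    best_score = measure(artifact)
    for _ in range(rounds):
        candidate = mutate(artifact, rng)            # modify
        score = measure(candidate)                   # run + measure
        if better(score, best_score):
            artifact, best_score = candidate, score  # keep
        # otherwise: revert, i.e. simply discard the candidate
    return artifact, best_score
```

Whether the artifact is a training script, a routing config, or a query plan changes nothing about the driver; only `mutate` and `measure` are domain-specific.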

The key insight, articulated by The New Stack's analysis, is that the primary investment is document authorship rather than infrastructure. You're not writing code; you're writing the Markdown that tells the agent how to write code. The human becomes the meta-researcher, designing the experiment engine rather than running the experiments.

///
Limitations

What AutoResearch Cannot Do

✓ Where It Excels

  • Hyperparameter and architecture search in a well-defined space
  • Finding non-obvious combinations that humans would overlook
  • Hardware-specific optimization (finds what's best for your GPU)
  • Validating intuitions overnight instead of over weeks
  • Any domain with a clean scalar metric and a mutable config

✗ Current Limitations

  • Cannot generate genuinely novel research ideas from scratch
  • Results are hardware-specific and may not generalize across platforms
  • Noisy results — small gains can be statistical noise
  • Can get trapped in local optima
  • Requires a good metric; bad metrics get ruthlessly exploited

Karpathy is transparent about these constraints. The repo notes that results become non-comparable between different compute platforms — what works on an H100 may not transfer to a Mac Mini M4, and vice versa. The community has begun treating this as a feature (hardware diversity as exploration diversity) rather than a bug.

///
Ecosystem

Seven Days, An Entire Ecosystem

One of the most remarkable things about AutoResearch is what grew around it in under a week. Because the repo is minimal, MIT-licensed, and designed around three simple files, forking and adapting it is trivial. The community obliged.

| Fork / Project | What It Does |
| --- | --- |
| autoresearch-mlx | Apple Silicon port using MLX instead of PyTorch/CUDA; runs natively on MacBooks and Mac Minis |
| n-autoresearch | Multi-GPU parallelism with structured experiment tracking and a REST API for hypothesis registration |
| autoresearch-agents (LangChain) | Applies the loop to agent optimization: modify routing logic, test against a holdout, keep or revert |
| Hyperspace distributed loop | 35 autonomous agents across a P2P network; 333 experiments in one night |
| autoexp (generic) | Fully generalized version: same loop, any domain, any quantifiable metric |
| pi-autoresearch | Applies the loop to web performance metrics: bundle size, test speed, Lighthouse scores |
///
Conclusion

The Beginning of Something Larger

AutoResearch is not the end state. It doesn't replace human researchers, it doesn't generate paradigm-shifting theoretical breakthroughs, and it won't autonomously solve alignment. But it does something subtler and arguably more important: it lowers the floor for what counts as "running experiments."

Before AutoResearch, running 100 experiments meant weeks of researcher time across multiple GPUs with careful manual bookkeeping. After AutoResearch, it means going to sleep and waking up. The compression of that timeline — from weeks to one night — changes what's tractable to explore. It changes which hypotheses are worth testing. It changes the economics of ML research entirely.

Karpathy himself frames it with characteristic wit in the repo's README: this is "the story of how it all began." The vision at the end of that sentence — swarms of AI agents running across compute cluster megastructures, 10,205 generations deep into a codebase no human could comprehend — reads as science fiction. But AutoResearch is the first commit.

Vibe coding has leveled up. This is vibe science.

— AI community reaction to AutoResearch's release

The real innovation is the paradigm: you are no longer the experimenter. You are the designer of the experiment engine. Your job is to write clear instructions in a Markdown file and define what "better" means. The rest happens while you sleep.

For anyone building in ML, systems engineering, or anywhere that optimization matters — the question is no longer whether to use this pattern. The question is what metric you're going to define, and how good your program.md is.