'The Karpathy Loop': 700 experiments, 2 days, and a glimpse of where AI is heading

Earlier this month, Andrej Karpathy, a well-known AI researcher who was one of the founding employees of OpenAI and later headed up AI for Tesla, went viral on X. This alone isn’t so unusual. Karpathy—who now works as an independent AI researcher and is also the founder of Eureka Labs, which says it is creating a new kind of school for the AI era—has 1.9 million followers on X and his reputation is such that almost anything he says about AI is treated as either gospel or prophecy.

But this post was about an experiment he’d run in which he put an AI coding agent to work running a series of experiments to figure out how to improve the training of a small language model. He let the AI agent run continuously for two days, during which time it conducted 700 different experiments. Over the course of those experiments, it discovered 20 optimizations that improved the training time.

Karpathy found that applying the same 20 tweaks to a larger, but still fairly small, language model resulted in an 11% speedup in training the model. Karpathy called the system he built for conducting this experiment “autoresearch.”

Tobias Lütke, the cofounder and CEO of Shopify, posted on X that he tried autoresearch to optimize an AI model on internal company data, giving the agent instructions to improve the model’s quality and speed. Lütke reported that autoresearch, left to run overnight, completed 37 experiments and delivered a 19% performance gain.

What caught many people’s attention was that autoresearch comes close to the idea of self-improving AI systems, an idea that was originally broached in science fiction and that some AI researchers fervently desire and others deeply fear. The concern is that “recursive self-improvement,” where an AI continually optimizes its own code and training in a kind of loop, could lead to what AI safety researchers sometimes call a “hard takeoff” or an “intelligence explosion.” In these scenarios, an AI system rapidly improves its own performance, leading it to surpass human cognitive abilities and escape human control.

Karpathy’s experiment wasn’t quite this. The AI agent at the heart of the autoresearch setup isn’t refining its own training setup; it’s adjusting the training code and initial neural network settings for a different, much smaller and less sophisticated AI model. But Karpathy rightly noted that his experiment had big implications for how AI labs will do research going forward, and this might accelerate their progress.

“All LLM frontier labs will do this. It’s the final boss battle,” Karpathy wrote on X. He acknowledged that “it’s a lot more complex at scale of course,” since his autoresearcher only had to worry about adjusting a model and training process that was contained in just 630 lines of Python code, whereas the training codebase of frontier AI models is orders of magnitude bigger. “But doing it is ‘just engineering’ and it’s going to work,” he continued. “You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales, and humans (optionally) contribute on the edges.”

He said that while the current autoresearch system he built was designed for a single agent to continually improve a piece of code along a single path, in the future he imagines multiple AI agents will be able to explore different optimizations and different experiments in parallel. “The next step for autoresearch is that it has to be asynchronously massively collaborative for agents,” he wrote. “The goal is not to emulate a single PhD student, it’s to emulate a research community of them.”

Karpathy also said something else about autoresearch which got many people excited. “*any* metric you care about that is reasonably efficient to evaluate (or that has more efficient proxy metrics such as training a smaller network) can be autoresearched by an agent swarm,” he wrote. “It’s worth thinking about whether your problem falls into this bucket too.”

Some commentators pointed out that the basic components of autoresearch could be used in many other agentic systems to optimize a process. Janakiram MSV, principal analyst at Janakiram & Associates, writing in tech publication The New Stack, called this “the Karpathy Loop.” It has three components: an agent with access to a single file that it can modify; a single, objectively testable metric that the agent can optimize for; and a fixed time limit for how long each experiment can run. He also highlighted that the instructions Karpathy gave the AI agent in autoresearch were good models for anyone interacting with any AI agent. The plain text file Karpathy used included clear instructions for what the agent should do; constraints, telling the agent what it should not do or change; and stopping criteria, indicating how long each loop should run and when the agent should stop looping and report its results.
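For readers who think in code, the three components can be sketched roughly as follows. This is a toy illustration, not Karpathy's autoresearch code: the "single file" is reduced to a small dictionary of settings, the fixed time limit is modeled as a fixed number of training steps, and the agent is replaced by a random proposer. All function names here (toy_train, random_proposer, run_loop) are hypothetical.

```python
import random
from typing import Callable, Dict, List, Tuple

Config = Dict[str, float]
History = List[Tuple[Config, float, bool]]


def toy_train(config: Config, budget_steps: int) -> float:
    """Hypothetical stand-in for a training run: gradient descent on f(w) = w^2.

    Every experiment gets the same fixed budget; the return value is the
    single metric the loop optimizes (final loss, lower is better).
    """
    w = 5.0
    for _ in range(budget_steps):
        w -= config["lr"] * 2.0 * w  # the gradient of w^2 is 2w
    return w * w


def random_proposer(rng: random.Random) -> Callable[[Config, History], Config]:
    """Hypothetical stand-in for the agent: nudges one setting at random.

    A real agent would read the history and reason about what to try next.
    """
    def propose(best: Config, history: History) -> Config:
        candidate = dict(best)
        candidate["lr"] = max(1e-4, best["lr"] * rng.uniform(0.5, 1.5))
        return candidate
    return propose


def run_loop(
    initial: Config,
    propose: Callable[[Config, History], Config],
    evaluate: Callable[[Config, int], float],
    budget_steps: int,
    max_experiments: int,
) -> Tuple[Config, float, History]:
    """Propose a change, evaluate it, and keep it only if the metric improves."""
    best = dict(initial)
    best_score = evaluate(best, budget_steps)
    history: History = []
    for _ in range(max_experiments):  # stopping criterion: experiment count
        candidate = propose(best, history)
        score = evaluate(candidate, budget_steps)
        kept = score < best_score
        if kept:
            best, best_score = candidate, score
        history.append((candidate, score, kept))
    return best, best_score, history


if __name__ == "__main__":
    start = {"lr": 0.01}
    best, score, log = run_loop(start, random_proposer(random.Random(0)),
                                toy_train, budget_steps=20, max_experiments=50)
    print(f"kept {sum(k for _, _, k in log)} of {len(log)} experiments")
    print(f"final loss {score:.6f} with settings {best}")
```

Because a change is kept only when the metric strictly improves, the best score can never get worse over the course of a run, which is part of what makes a loop like this safe to leave unattended.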

But some critics said that Karpathy had done little more than rediscover part of a process known as AutoML that researchers at Google, Microsoft, and other AI labs have already been using for years. AutoML also uses an optimization loop and a series of experiments to find the best data to use for AI, the best model architecture to use, and to tune that model architecture. But it doesn’t use an AI agent that can read AI research papers and develop hypotheses for which improvement to make. AutoML systems tend to depend on random variations or various evolutionary algorithms to decide which changes to try.

Karpathy replied to some of these comments, saying that some AutoML methods, such as neural architecture search, which is an automated way to optimize the design of an AI model, were not nearly as powerful as his autoresearch. “Neural architecture search as it existed then is such a weak version of this that it’s in its own category of totally useless by comparison,” he wrote. “This is an *actual* LLM writing arbitrary code, learning from previous experiments, with access to the internet. It’s not even close.”
