The next great leap in artificial intelligence will not come from better language models. It will come from machines that understand how the physical world works and how to control it.
I’ve spent years thinking about this, first as an immunologist at Oxford studying how immune networks learn through feedback rather than instruction, then as an investor leading Khosla Ventures’ largest seed investment since OpenAI, into a world-modeling lab called General Intuition.
The binding constraint on embodied AI isn’t compute or architecture. It’s a specific kind of data that barely exists.
Letting the Genie out
Earlier this year, Google shipped Project Genie and rattled the entire gaming market. The market read it as a threat to Unity, Take-Two Interactive, Roblox, the entire content-creation pipeline: AI coming for game developers. But reducing this to gaming disruption is like watching the first iPhone demo and concluding Apple was coming for Nokia. The real play is owning every spatial workload on the planet.
What tipped Google’s hand is not what Genie does well, but what it compromises on: environments that last only a few minutes, noticeable latency, physics that behaves strangely. For now, these are acceptable limitations, because the real purpose isn’t entertainment. Google told us explicitly that Genie 3 is “a key stepping stone on the path to AGI,” infrastructure for training SIMA, its generalist agent that needs endless diverse environments to learn navigation, object manipulation, and real-world physics. Spawning objects mid-session and changing environmental conditions on the fly isn’t a gaming feature. It’s a curriculum generator for reinforcement learning.
What Google has built is an environment factory, a system that collapses the months of hand-coding traditionally required to create training simulations into seconds of text prompting.
Going beyond glass screens
To understand why that distinction matters, zoom out. For all the upheaval of the digital revolution, remarkably little has changed about how we physically interact with reality. The leap from early desktop computing to the smartphone to the transformer architecture was enormous in terms of information flow. But we’re still mostly poking at glass screens.
Consider the squirrel outside your window, leaping branch to branch, adjusting mid-flight for wind and flex. It possesses an extraordinarily sophisticated internal model of gravity, momentum, and friction, and it can plan complex action sequences. Yet it has no language. It simply knows, in the way that knowing existed long before describing ever could.
AI has ignored this kind of knowing almost entirely. Today’s large language models can write sonnets and debug code. But ask one to fold a towel and you’ll discover the gulf between knowing about the world and knowing how to act within it. Language is but a compression of human experience. Text captures only a thin slice of what we know.
World models, neural networks trained to understand and predict physical reality, promise to change that equation. Yann LeCun grasps this; he proclaimed “LLMs basically are a dead end when it comes to superintelligence” before leaving Meta to launch his own world-model startup. Fei-Fei Li’s World Labs just released Marble, which generates 3D environments. Both understand that spatial intelligence is AI’s next frontier.
But neither has solved the binding constraint: they don’t have the data to build agents.
Training an agent requires action-conditioned data. Not just what the world looked like, but what someone did and what happened next: observation, decision, action, consequence. The complete loop. The pivot to agents requires millions of hours of human decision-making captured at the source, frame-aligned with resulting state changes, self-selected for edge cases.
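To make that loop concrete, here is a minimal sketch in Python of what a single action-conditioned record might look like. The `Transition` and `Trajectory` names and their fields are illustrative assumptions for this piece, not the schema used by Google, General Intuition, or any other lab.

```python
from dataclasses import dataclass
from typing import Sequence

# Illustrative only: hypothetical field names, not any lab's actual data format.

@dataclass
class Transition:
    """One frame-aligned step of the observation-action-consequence loop."""
    observation: bytes        # what the player saw (e.g., an encoded video frame)
    action: Sequence[float]   # what the player did (e.g., controller axes and buttons)
    next_observation: bytes   # what the world looked like one frame later
    timestamp_ms: int         # when the input landed, so frames and actions stay aligned

@dataclass
class Trajectory:
    """A full play session: the unit an agent would actually be trained on."""
    steps: Sequence[Transition]
```

A world model trained for control would consume long sequences of these steps, learning to predict the next observation from the current one plus the action taken.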
Hands as the final bottleneck
Games may be the unlikely answer. They provide complete records of human agency, every input logged and labeled, in environments that capture physics and decision-making under uncertainty. Millions of hours of human judgment, already digitized.
The deepest value isn’t physics. It’s human intuition. A physics engine models how a drone moves; it can’t model how a skilled operator reacts when surprised. In surgery, it’s the feel for how the tissue responds to the scalpel. Train on human decision-making and you capture expertise that can’t be described with words, only shown, felt.
Get this right and the consequences echo what software did to information.
When a machine can learn a manipulation task from hours of demonstration instead of months of programming, manufacturing economics flip. Small-batch production becomes viable. Custom goods cost what mass goods cost today. A master electrician’s lifetime of knowledge deploys in a thousand cities at once. The best surgeon’s judgment scales to rural hospitals that have no access today. The bottleneck was never scalpels. It was hands.
Agriculture, logistics, eldercare. Every domain where physical skill is scarce becomes a candidate for transformation. The common thread: expertise locked in individual bodies becomes transferable.
The digital revolution made information free. The world-model revolution will make capability free. I can’t think of a more consequential bet to make.
The opinions expressed in Fortune.com commentary pieces are solely the views of their authors and do not necessarily reflect the opinions and beliefs of Fortune.