Playing peekaboo isn’t just a game for babies. It’s also one of the first ways they learn that something is still there, even if they can’t see it. When you hide your face and suddenly reappear—peekaboo!—the delightful surprise leads to giggles and learning.
Developing that skill isn’t about memorization. Instead, through observation, babies begin building a simple internal model of how the world works. By about a year old, they can tell that a ball that rolls behind a couch hasn’t vanished—it still exists, even out of sight.
Today’s AI systems—for all their conversational and pattern-matching prowess—can’t do this reliably. They can describe what’s in front of them, but they struggle with concepts like what’s hidden, or what will happen next in a sequence of actions.
The solution, many of the field’s top researchers believe, lies in so-called world models: AI systems designed not just to recognize patterns in text or images, but also to simulate how the physical world behaves. By training on millions of hours of video, these models can build an accurate internal picture of how the world works, physics and all. That capability is crucial for a wide range of technologies: helping a self-driving car predict what happens if a child runs into the street, helping a home robot learn how to fold clothes, or simulating surgical procedures before a single incision is made.
Star power
The need to build systems for these “physical AI” use cases has pushed world models from a niche research idea to a central focus of the field. Google recently unveiled a research preview called Project Genie, which can generate interactive, photorealistic environments from simple prompts—then predict how those worlds evolve and respond to a user’s actions.
Meanwhile, “AI godmother” Fei-Fei Li and “AI godfather” Yann LeCun have each raised roughly $1 billion for separate startups developing world models. Stanford professor Li founded World Labs in 2024 to focus on giving AI a richer sense of 3D space and of how objects exist and interact within it. LeCun, Meta’s former chief AI scientist, launched AMI Labs in March with the mission of moving beyond the limitations of large language models.
“We have systems that can manipulate language, and they fool us into thinking they are smart because they manipulate language,” LeCun said in a recent lecture. “But in fact, they are completely helpless when it comes to the physical world.”
The momentum reflects an ongoing shift in how researchers are thinking about intelligence, but many of today’s world model efforts trace back to a 2018 paper by David Ha and Jürgen Schmidhuber. For decades, AI systems largely learned by reacting to data or through brute-force trial and error. The paper proposed a different approach: Before an AI can act intelligently, it needs to learn how the world works via a simulated version of its environment.
Ha, who has since cofounded Japanese AI R&D company Sakana, told Fortune these systems allow AI to train inside their own simulated versions of reality—what he described as “hallucinated dreams”—where an agent can practice and plan ahead before acting in the real world.
Ming-Yu Liu, a vice president at Nvidia’s Cosmos Lab, shared a more cinematic analogy, pointing to The Matrix, in which the main character learns kung fu inside a simulated world. Think of world models as a “generative training facility,” Liu said: “There’s feedback and guidance so that the AI can constantly enhance its skill.”
The fourth dimension
Beyond the most visible efforts from Li and LeCun, a number of startups are also building world models—each with its own approach.
Take Niantic Spatial, a startup spun out of the augmented reality and mapping technology behind Pokémon Go. Niantic Spatial is building what it calls large geospatial models that can interpret real-world environments in 3D. The San Francisco–based firm draws on more than 30 billion images and 3D scans, largely crowdsourced from years of Pokémon Go player activity and its Scaniverse app.
Tel Aviv–based Decart is pushing world models into real time, creating systems where environments aren’t just simulated but continuously generated and updated as users interact with them. By building its own optimized systems, it can generate video fast enough to match a user’s actions frame by frame.
“That was the main unlock,” said Kfir Aberman, a research scientist and one of Decart’s founding members. The challenge, he said, isn’t simply generating video—it’s generating it fast enough to react in the moment. That is already translating into applications like virtual try-on, where clothing moves realistically, as well as tools that allow streamers to change environments on the fly.
In some world models, the element of time is integral, constituting a fourth dimension. For example, Palo Alto–based Odyssey, founded by self-driving pioneers Oliver Cameron and Jeff Hawke, is focused on building world models that can predict how the world evolves over time—learning physics, human behavior, and cause-and-effect directly from video.
“We thought this was just magical, this idea that you could predict the future,” said Cameron. “What if we could make that possible for general applications—robotics, gaming, education, health care, defense?”
A ‘ChatGPT moment’
For all the excitement, building world models comes with steep technical and economic hurdles. Training systems that can simulate the real world requires far more computing power than today’s language models, because they need to process not just words, but high-resolution video.
For example, Cameron has said that Odyssey needs one of Nvidia’s powerful H200 AI chips for each user who accesses the company’s Odyssey 2 model through its application programming interface. Each H200 can cost up to $40,000.
Data is another constraint. Unlike LLMs, which are trained on data scraped from the vast corpus of the internet, world models rely heavily on video—which is far more complex and harder to collect, label, and train on at scale.
None of these challenges have dampened the enthusiasm of those building world models. As researchers push toward systems that can understand and interact with the physical world, the sense is that a breakthrough may be close.
“There’s a huge amount of excitement and investment in physical AI right now,” said Nvidia’s Liu, who added that a “ChatGPT moment” is near. His own holy grail? Teaching robots new skills from just a few examples, even skills they have never seen before, and then having them perform those skills consistently.
“I do believe that people are gradually figuring out the right recipe,” he said.
This article appears in the June/July 2026 issue of Fortune with the headline “AI’s next frontier moves from words to world models.”











