1. What Is Advanced Artificial Intelligence?
The Role of the World Model.
Before addressing the problem further, it is important to clarify a foundational concept: what constitutes artificial intelligence—particularly advanced AI?
By definition, artificial intelligence aims to replicate human cognitive abilities through computational systems. However, human intelligence is not monolithic; it can broadly be divided into basic intelligence and advanced intelligence.
Basic intelligence refers to the ability to perceive, understand, and respond to the physical world in real time. For example: avoiding obstacles while walking; catching a towel thrown toward us; untangling a knotted rope. These capabilities rely on a deep, experience-driven understanding of the physical world, developed through continuous interaction.
On top of this foundation, advanced intelligence encompasses language, abstract reasoning, planning, and emotional cognition.
This distinction is crucial in differentiating humans from animals. Basic intelligence originates from interaction with the physical world and is essential for survival. Even small organisms such as flies or mosquitoes demonstrate remarkable proficiency in this domain—rapidly sensing and reacting to threats. That’s why swatting a mosquito can be surprisingly difficult.
At this point, a natural question arises: today’s AI systems—such as large language models like DeepSeek—already demonstrate strong linguistic and reasoning abilities. Does this mean we are close to achieving highly intelligent, capable robots?
In fact, the opposite is true.
Most current AI models—whether for language, images, or video—are fundamentally designed for “generation” tasks. In simple terms, they learn how to produce outputs that resemble human behavior, such as generating text or imitating reasoning processes. However, whether these outputs conform to the physical laws of the real world is often beyond their ability to verify. This limitation is the root cause of the well-known hallucination problem.
In essence, modern AI systems largely bypass direct interaction with the physical world. Instead, they rely on massive datasets of text and images to construct a superficial intelligence—one that lacks a true physical grounding.
To achieve genuine advanced AI, we must start from the ground up by equipping machines with basic intelligence—the ability to perceive, understand, and predict the dynamics of the physical world.
This foundational capability is known as the World Model.
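At its core, a world model is anything that predicts how the world changes in response to an action. The toy sketch below makes this concrete under heavily simplified, illustrative assumptions: a hidden 1-D point-mass system, random interaction to gather transitions, and a least-squares fit of a one-step predictor. The names (`collect_transitions`, `fit_world_model`) and the dynamics are invented for illustration, not drawn from any real system.

```python
# Toy sketch of a "world model": a learned one-step dynamics predictor.
# Illustrative assumptions: a hidden 1-D system x' = x + 0.1*u; the model
# recovers that relation purely from observed (state, action, next_state) data.
import random

def collect_transitions(n=50):
    """Interact with the 'real' system using random actions to gather data."""
    rng = random.Random(0)
    data, x = [], 0.0
    for _ in range(n):
        u = rng.uniform(-1.0, 1.0)
        x_next = x + 0.1 * u          # ground-truth physics, hidden from the model
        data.append((x, u, x_next))
        x = x_next
    return data

def fit_world_model(data):
    """Least-squares fit of x' = x + b*u: estimate the effect b of an action."""
    num = sum(u * (x_next - x) for x, u, x_next in data)
    den = sum(u * u for x, u, _ in data)
    b = num / den
    return lambda x, u: x + b * u     # the learned predictor

model = fit_world_model(collect_transitions())
print(round(model(0.0, 1.0), 3))      # predicted next state for action u = 1.0
```

The point of the sketch is the interface, not the math: given a state and a candidate action, the model answers "what happens next?" without touching the real world.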
2. Training Intelligence: Reinforcement or Imitation?
With the objective clarified, the next question is: how do we train AI systems to acquire advanced intelligence?
To answer this, it is helpful to revisit how intelligence develops in humans and animals.
At a macro level, intelligence has evolved over hundreds of millions of years through natural selection. As Darwinian theory describes, the fittest survive. The environment provides continuous feedback, reinforcing advantageous behaviors while eliminating ineffective ones. Over time, biological intelligence has steadily advanced, becoming increasingly stable and efficient.
However, evolution is not the only mechanism. At the individual level, humans acquire most of their capabilities through learning from others—parents, peers, and accumulated knowledge in society. Learning and imitation are central to human development.
A historical anecdote dating back to the 7th century BC illustrates this point: the so-called “language deprivation experiment” attributed to Pharaoh Psamtik I suggests that children deprived of linguistic input do not spontaneously develop language. While the historical accuracy of this account is debated, it highlights a key insight: many advanced human abilities are not purely innate—they depend on learning and imitation. This learning approach is also common among animals.
These two mechanisms correspond directly to two core paradigms in AI: Reinforcement Learning (RL) and Imitation Learning (IL).
Reinforcement Learning trains agents through environmental feedback, using reward functions to guide behavior. Through continuous trial and error, the agent discovers optimal strategies. This process closely mirrors biological evolution: the environment determines what constitutes “good” behavior. The advantage of RL lies in its stability and potential for generalization, but it is often time-consuming and computationally expensive.
Imitation Learning focuses on learning directly from expert demonstrations. By replicating observed actions or decisions, an agent can quickly acquire complex skills. This approach dramatically improves learning efficiency and underpins humanity’s ability to accumulate and transmit knowledge across generations.
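By contrast, imitation learning is supervised: the agent copies a demonstrated state-to-action mapping and never sees a reward. The sketch below reduces behavior cloning to its simplest possible form, on the same kind of toy corridor; the hand-coded "expert" and the majority-vote policy are illustrative constructions, not a real method's API.

```python
# Minimal sketch of Imitation Learning (behavior cloning): copy an expert's
# state -> action choices from demonstrations, with no reward signal at all.
# The expert, task, and data are illustrative toy constructions.
from collections import Counter, defaultdict

def expert_policy(s):
    """A hand-coded 'expert' on a 5-state corridor: always move right toward state 4."""
    return +1

# Collect demonstration trajectories as (state, action) pairs
demos = []
for start in range(4):
    s = start
    while s != 4:
        a = expert_policy(s)
        demos.append((s, a))
        s += a

# Behavior cloning at its simplest: in each state, imitate the action
# the expert chose most often there.
votes = defaultdict(Counter)
for s, a in demos:
    votes[s][a] += 1
cloned = {s: c.most_common(1)[0][0] for s, c in votes.items()}

print(cloned)   # {0: 1, 1: 1, 2: 1, 3: 1}
```

The cloned policy is acquired in a single pass over the data, which captures the efficiency advantage; but it is undefined for any state the expert never visited, a preview of the deviation problem discussed later in this section.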
Both paradigms are essential to the development of AI, albeit with different emphases: Imitation Learning enables rapid acquisition of baseline competence, while Reinforcement Learning refines and adapts that competence through interaction.
Today’s robots, capable of dancing, climbing stairs, and executing complex locomotion, owe much of their progress to Reinforcement Learning. However, for robots to truly “work” in real-world scenarios—especially in tasks requiring dexterous manipulation—Imitation Learning will become indispensable. This holds true for humans as well: while most people learn to walk at around one year of age, the acquisition of complex skills and knowledge extends throughout a lifetime.
3. Data, Data, Data!
It is well understood that training AI requires large amounts of data. Within current technological frameworks, machines are still far less efficient than humans at utilizing data. As a result, enabling AI to acquire a given capability typically requires far more data than a human would need to learn the same task.
The rapid advancement of large language and image models over recent years can largely be attributed to the abundance of digital data. Text and images are scalable, readily available, and inexpensive to collect, enabling extensive model training and improvement.
In contrast, data required for understanding spatial structure, physical laws, and real-world interaction is far more scarce and difficult to obtain. This scarcity has become a major bottleneck in the development of embodied AI and robotics.
In Imitation Learning, data functions much like a “textbook”: expert demonstrations—captured as videos or motion trajectories—are used to teach AI systems specific tasks. For example, training a robot to fold clothes may require thousands of demonstrations, typically provided by human operators via teleoperation or manual guidance.
In Reinforcement Learning, by contrast, data resembles “experience”: robots learn by interacting with their environment, generating data through trial and error rather than relying on pre-collected datasets.
Whether IL or RL, both approaches face a common challenge: real-world data collection is inherently difficult to scale.
Let’s talk about Imitation Learning first. In practice, human-guided demonstration is typically more suitable for humanoid or near-humanoid robots. Even when focusing solely on such robots, the volume of data that can be collected within a given timeframe remains limited. Data collection is constrained by hardware, workspace, and operator skill. In extreme environments, such as underwater or in space, data acquisition becomes prohibitively expensive or impractical.
Moreover, demonstration data often captures only successful outcomes, leaving models unprepared for unexpected deviations. Once a robot deviates from the scenarios covered by demonstration data—due to perception errors, execution inaccuracies, or environmental changes—it often fails to determine how to proceed.
Now consider Reinforcement Learning, which requires robots to continuously explore through trial and error. In the real world, however, mistakes often come with tangible costs: robots may collide with objects, damage equipment, or even pose risks to human safety, particularly in scenarios such as autonomous driving.
Even under controlled conditions, resetting environments and conducting repeated experiments requires significant time and human involvement. Each trial may take several minutes or even hours, and any equipment damage introduces additional downtime for repairs.
It is therefore evident that both IL and RL face a common challenge in the real world: the difficulty of scaling data collection. This limitation does not arise from shortcomings in the algorithms, but rather stems from the intrinsic nature of the physical world: real-world data collection depends heavily on human labor, time, and environmental constraints, while AI training demands data at massive scale. It is this fundamental mismatch that creates a persistent bottleneck for the development of embodied intelligence and robotics.
4. Simulation: The Key to Scalable Data
Given the inherent limitations of real-world data, a natural question arises: where else can data come from?
The answer is clear: simulation.
Then, what is simulation?
In simple terms, simulation involves constructing a “virtual world” within a computer, where mathematical models and physical laws are used to describe the shape, motion, and interactions of objects. Within such an environment, objects respond to forces as they would in the real world, experiencing collisions, friction, motion, and deformation. Robots can carry out tasks such as walking, grasping, and manipulation, with the entire process taking place within the computer.
Put another way, simulation functions as a resettable “digital laboratory,” where various actions, strategies, and solutions can be tested without relying on physical robots or real-world environments.
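The essential loop of any such virtual world is the time step: apply forces, integrate motion, resolve contacts, repeat. The sketch below shows that loop in its most stripped-down form, a single ball dropped under gravity with a crude bounce; the constants and the restitution model are illustrative choices, far simpler than what a real physics engine does.

```python
# Minimal sketch of a simulation time step: a ball dropped under gravity,
# integrated in fixed steps, with a simple ground-collision response.
# Constants and the damping model are illustrative, not from a real engine.

G = -9.81            # gravity, m/s^2
DT = 0.001           # simulation time step, s
RESTITUTION = 0.8    # fraction of speed kept after a bounce

def simulate(height, steps):
    """Advance the 'virtual world' step by step; return the trajectory of heights."""
    y, vy = height, 0.0
    trajectory = []
    for _ in range(steps):
        vy += G * DT             # forces change velocity
        y += vy * DT             # velocity changes position
        if y < 0.0:              # collision with the ground
            y = 0.0
            vy = -vy * RESTITUTION   # bounce with energy loss
        trajectory.append(y)
    return trajectory

traj = simulate(height=1.0, steps=2000)
print(max(traj[1000:]))   # after bouncing, the ball never regains its original height
```

Everything a robot "experiences" in simulation is produced by loops like this one, which is why the fidelity of the step function bounds the usefulness of the data.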
Simulation is gaining increasing importance in artificial intelligence primarily because it offers advantages in data acquisition that the real world cannot match:
- Controllability
Initial conditions can be precisely defined, environmental parameters can be freely adjusted, and the same scenario can be tested repeatedly. Such control is often difficult, or even impossible, to achieve in the real world.
- Repeatability
After each experiment, the environment can be immediately reset, allowing the process to restart from exactly the same initial state. This is particularly important for Reinforcement Learning, which relies on extensive trial and error; in contrast, each real-world experiment inevitably alters the state of the environment.
- Safe exploration
Failures carry no physical risk. Robots can fall, collide, and retry without damaging equipment or endangering humans. This makes it possible to systematically collect large amounts of exploratory data and failure cases.
- Richer data
Beyond visual observations and actions, simulation can directly provide detailed physical information, such as contact forces, friction states, internal deformation, and energy changes. This information is crucial for understanding physical processes and training intelligent systems.
- Scalability and generalization
Fundamentally, the key difference between simulated and real-world data lies in cost structure: simulation data scales with computational power, and real-world data scales with human and physical resources. As computing power continues to grow, simulation can expand in tandem with AI model complexity. By contrast, real-world data collection is inherently constrained by human labor, time, and environmental conditions, making it difficult to achieve the same rate of scalability.
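The repeatability advantage in particular can be made concrete with a few lines of code. The toy environment below is illustrative (its class, dynamics, and noise model are invented for this sketch), but it captures the property that matters: `reset(seed)` restores exactly the same initial state and noise sequence, so two runs of the same policy produce identical trajectories, something no real-world experiment can guarantee.

```python
# Sketch of simulated repeatability: a toy environment whose reset(seed)
# restores the exact same state and randomness, so repeated runs of the
# same policy yield identical trajectories. The environment is illustrative.
import random

class ToyEnv:
    def reset(self, seed):
        """Return the environment to an exactly reproducible initial state."""
        self.rng = random.Random(seed)
        self.state = 0.0
        return self.state

    def step(self, action):
        # Noisy dynamics: same seed -> same noise sequence -> same outcome
        self.state += action + self.rng.gauss(0.0, 0.1)
        return self.state

def rollout(env, seed, n=10):
    env.reset(seed)
    return [env.step(1.0) for _ in range(n)]

env = ToyEnv()
run_a = rollout(env, seed=42)
run_b = rollout(env, seed=42)   # reset the "digital laboratory" and repeat
print(run_a == run_b)           # True: exact repetition of the experiment
```

For Reinforcement Learning especially, this means a failure case can be replayed bit-for-bit as many times as needed, whereas every real-world trial permanently alters the environment.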
If large models are the “brain” of AI, then simulation is its “training ground”. Before AI can operate effectively in the real world, it must first learn within a virtual one. Yet an even more crucial question remains: What makes a simulation truly useful?
We’ll decode this question in the next installment.