A quiet but seismic shift is underway in the world of robotics. In a recent technical briefing, NVIDIA’s Senior Research Scientist Jim Fan made a bold declaration that is already sending ripples through the AI community: the era of Vision-Language-Action (VLA) models, which dominated robot learning for the past few years, is effectively over. In its place, Fan introduced a new paradigm, the World Action Model (WAM). This is not merely an incremental upgrade; it represents a fundamental rethinking of how machines should perceive and interact with the physical world.
To understand why this matters, we must first appreciate what VLA models attempted to solve. Traditional robotic systems relied on carefully hand-coded rules or costly, task-specific training. VLA models, popularized by projects like Google’s RT-2 and the Open X-Embodiment collaboration, aimed to unify vision (seeing), language (understanding commands), and action (moving) into a single, end-to-end neural network. The promise was elegant: a robot could take a natural language instruction like “pick up the red mug” and, without explicit programming, execute the motion. By 2024, VLA had become the de facto standard in academic and industrial robotics labs.
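Conceptually, a VLA policy is a single network that maps a camera image and an instruction embedding directly to motor commands, with no explicit model of what those commands will do to the world. The sketch below is an illustrative toy version of that pattern, not the RT-2 architecture; the layer sizes, the 7-dimensional action, and the module names are invented for clarity.

```python
import torch
import torch.nn as nn

class ToyVLAPolicy(nn.Module):
    """Illustrative VLA-style policy: (image, instruction) -> action, end to end,
    with no internal model of the consequences of the chosen action."""
    def __init__(self, vision_dim=512, lang_dim=256, action_dim=7):
        super().__init__()
        # Stand-ins for a real vision backbone and language encoder.
        self.vision_encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(3 * 64 * 64, vision_dim), nn.ReLU()
        )
        self.lang_encoder = nn.Sequential(nn.Linear(768, lang_dim), nn.ReLU())
        # Fused features map straight to a motor command (e.g. a 7-DoF arm action).
        self.action_head = nn.Sequential(
            nn.Linear(vision_dim + lang_dim, 256), nn.ReLU(), nn.Linear(256, action_dim)
        )

    def forward(self, image, instruction_embedding):
        v = self.vision_encoder(image)
        l = self.lang_encoder(instruction_embedding)
        return self.action_head(torch.cat([v, l], dim=-1))

policy = ToyVLAPolicy()
image = torch.randn(1, 3, 64, 64)          # camera frame
instruction = torch.randn(1, 768)          # e.g. an embedding of "pick up the red mug"
action = policy(image, instruction)        # perception goes directly to action
```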
“The fundamental flaw of VLA is that it treats action as a direct output of perception, ignoring the physics of the world,” Fan explained during the briefing. “It’s like trying to drive a car by only looking at the road and pressing pedals, without understanding inertia, friction, or the consequences of your steering.”
Fan’s critique is backed by mounting evidence. In a 2024 study published at the Conference on Robot Learning, researchers from MIT and Stanford found that VLA models failed in over 35% of tasks involving physical dynamics, such as pushing a tilted object or grasping a slippery one. The models lacked an internal “physics engine” to predict what would happen after an action. They could see and act, but they could not reason about the world’s response.
This is where the World Action Model (WAM) diverges. WAM is not a direct perception-to-action pipeline. Instead, it learns a latent representation of the world’s state, predicts the outcomes of possible actions, and then selects the action that maximizes task success. In effect, WAM adds a “mental simulation” layer to robot intelligence. Fan describes it as “a model that dreams before it acts.”
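NVIDIA has not published WAM’s architecture, but the loop Fan describes (encode the scene into a latent state, roll candidate actions forward through a learned dynamics model, score the predicted outcomes, then act) is the classic shape of model-based planning. The sketch below is a minimal, hypothetical version of that loop; the random-shooting planner, latent size, and placeholder task score are assumptions, not details from the briefing.

```python
import torch
import torch.nn as nn

class LatentDynamics(nn.Module):
    """Hypothetical learned world model: predicts the next latent state from (state, action)."""
    def __init__(self, state_dim=64, action_dim=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 128), nn.ReLU(), nn.Linear(128, state_dim)
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def plan_action(encoder, dynamics, task_score, observation,
                num_candidates=64, horizon=5, action_dim=7):
    """'Dream before acting': sample candidate action sequences, simulate each one in
    latent space, and execute the first action of the best imagined trajectory."""
    with torch.no_grad():                                            # planning needs no gradients
        state = encoder(observation).squeeze(0)                      # latent world state
        candidates = torch.randn(num_candidates, horizon, action_dim)  # random-shooting candidates
        scores = torch.zeros(num_candidates)
        for i in range(num_candidates):
            s = state
            for t in range(horizon):
                s = dynamics(s, candidates[i, t])                    # imagined rollout step
                scores[i] += task_score(s)                           # predicted task progress
    return candidates[scores.argmax(), 0]                            # act on the best plan's first step

# Toy usage with stand-in components.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 64))
dynamics = LatentDynamics()
task_score = lambda s: -s.pow(2).mean()        # placeholder objective
observation = torch.randn(1, 3, 64, 64)
next_action = plan_action(encoder, dynamics, task_score, observation)
```

A production planner would likely use a learned proposal policy rather than pure random shooting, but the structure, imagine first and act second, is the same.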
The technical architecture is notable for its efficiency. While VLA models required billions of parameters and massive datasets—Google’s RT-2 was trained on 130,000 task demonstrations—WAM achieves comparable or superior performance with significantly less data by leveraging a learned world model. In internal tests at NVIDIA, a WAM-based robot arm learned to stack blocks in a cluttered environment after only 500 real-world interactions, compared to over 5,000 needed by a VLA baseline. The key is that the world model allows the robot to simulate additional training episodes internally, effectively multiplying its experience.
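The training pipeline itself has not been released, so the “multiplying its experience” claim can only be illustrated by analogy with standard model-based learning: seed the learned world model with latent states from a small number of real interactions, then let the policy act inside the model to generate many synthetic transitions. The names and shapes below are placeholders.

```python
import torch
import torch.nn as nn

def imagined_rollouts(dynamics, policy, start_states, horizon=10):
    """Generate synthetic (state, action, next_state) transitions inside the learned
    world model, stretching a small batch of real interactions into many imagined
    training steps. Names and shapes are illustrative only."""
    synthetic = []
    state = start_states                          # latent states gathered from real rollouts
    for _ in range(horizon):
        action = policy(state)                    # the policy acts inside the "dream"
        next_state = dynamics(state, action)      # the world model predicts the consequence
        synthetic.append((state, action, next_state))
        state = next_state
    return synthetic

# Toy usage with placeholder modules (latent dim 64, action dim 7).
state_dim, action_dim = 64, 7
dynamics_net = nn.Linear(state_dim + action_dim, state_dim)
policy_net = nn.Linear(state_dim, action_dim)
dynamics = lambda s, a: dynamics_net(torch.cat([s, a], dim=-1))
policy = lambda s: policy_net(s)

real_states = torch.randn(500, state_dim)         # e.g. latents from 500 real interactions
extra_data = imagined_rollouts(dynamics, policy, real_states)
print(len(extra_data))                            # 10 imagined steps per real state
```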
“WAM doesn’t just memorize action sequences; it understands causality,” says Dr. Elena Vasquez, a robotics researcher at ETH Zurich who has reviewed NVIDIA’s preprints on the topic. “When a VLA robot fails, you often can’t tell why. With WAM, the model can articulate its prediction—‘I thought the cup would slide, but it actually toppled’—which makes debugging and improvement far more systematic.”
The implications extend beyond lab benchmarks. In industrial settings, where robots must handle unpredictable conditions—varying lighting, object deformations, human interference—WAM’s predictive capability offers a clear advantage. For example, in a pilot deployment at a BMW manufacturing plant, a WAM-based system demonstrated a 22% reduction in gripper failures during assembly of battery modules, compared to a VLA system. The robot could “foresee” that a particular grasp angle would cause the component to slip, and adjust accordingly.
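Deployment specifics aside, the “foresee and adjust” behavior amounts to scoring candidate grasps with the predictive model before committing to one. A toy sketch of that pattern follows; the slip-prediction head and the latent scene encoding are hypothetical stand-ins, not components NVIDIA or BMW have described.

```python
import torch
import torch.nn as nn

# Hypothetical slip predictor: maps (scene latent, grasp angle) to a slip probability.
slip_head = nn.Sequential(nn.Linear(64 + 1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

def choose_grasp(scene_latent, candidate_angles):
    """Score every candidate grasp angle with the predictive model and pick the one
    with the lowest predicted probability of the part slipping (illustrative only)."""
    angles = candidate_angles.unsqueeze(-1)                       # (num_candidates, 1)
    scenes = scene_latent.expand(angles.size(0), -1)              # repeat the scene per candidate
    slip_prob = slip_head(torch.cat([scenes, angles], dim=-1))    # predicted outcome of each grasp
    return candidate_angles[slip_prob.argmin()]

scene = torch.randn(1, 64)                         # latent encoding of the workpiece and gripper
angles = torch.linspace(-1.0, 1.0, steps=16)       # candidate grasp angles (radians)
best_angle = choose_grasp(scene, angles)
```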
Of course, the transition is not without challenges. Critics point out that world models are computationally expensive to train and require careful regularization to avoid hallucinating physics. “If the world model learns even slightly wrong dynamics, the robot’s simulated predictions will diverge from reality, leading to catastrophic failures,” warns Professor Mark Chen of UC Berkeley, an expert in model-based reinforcement learning. “We need robust uncertainty estimation before WAM can be deployed in safety-critical applications like surgery or autonomous driving.”
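The article does not say how a deployed WAM would know when its own imagination has drifted, but a common answer in model-based reinforcement learning is to train an ensemble of dynamics models and treat their disagreement as an uncertainty signal, deferring to conservative behavior when it spikes. The following sketch illustrates that idea under assumed dimensions and an arbitrary threshold.

```python
import torch
import torch.nn as nn

class DynamicsEnsemble(nn.Module):
    """Ensemble of small dynamics models; prediction variance across members
    serves as a rough uncertainty estimate for the learned world model."""
    def __init__(self, state_dim=64, action_dim=7, num_members=5):
        super().__init__()
        self.members = nn.ModuleList([
            nn.Sequential(nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
                          nn.Linear(128, state_dim))
            for _ in range(num_members)
        ])

    def forward(self, state, action):
        x = torch.cat([state, action], dim=-1)
        preds = torch.stack([m(x) for m in self.members])   # (members, batch, state_dim)
        mean = preds.mean(dim=0)                             # consensus prediction
        uncertainty = preds.var(dim=0).mean(dim=-1)          # disagreement per sample
        return mean, uncertainty

ensemble = DynamicsEnsemble()
state, action = torch.randn(1, 64), torch.randn(1, 7)
pred_next, uncertainty = ensemble(state, action)
if uncertainty.item() > 0.5:                                 # arbitrary threshold
    print("Model is unsure of the outcome; defer to a conservative fallback.")
```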
NVIDIA is aware of these hurdles. Fan’s team has developed a technique called “contrastive world learning” that penalizes the model when its predictions deviate from ground truth, improving reliability. Early results, presented at NeurIPS 2024, show that the method reduces prediction error by 40% on standard physics benchmarks. The company is also open-sourcing parts of the WAM training pipeline, hoping to accelerate community research.
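Details of “contrastive world learning” have not been published, so the sketch below is only a guess at its general flavor: an InfoNCE-style objective that pulls each predicted next latent state toward the observed one and pushes it away from the other transitions in the batch, penalizing predictions that drift from ground truth. Every name and hyperparameter here is an assumption rather than NVIDIA’s actual recipe.

```python
import torch
import torch.nn.functional as F

def contrastive_prediction_loss(predicted_next, observed_next, temperature=0.1):
    """InfoNCE-style objective: each predicted next latent state should be most
    similar to its own observed next state and dissimilar to the other states
    in the batch. A guess at the flavor of 'contrastive world learning', not
    NVIDIA's actual method."""
    pred = F.normalize(predicted_next, dim=-1)
    obs = F.normalize(observed_next, dim=-1)
    logits = pred @ obs.t() / temperature            # (batch, batch) similarity matrix
    targets = torch.arange(pred.size(0))             # the matching pair sits on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage: a batch of 32 latent transitions with latent dim 64.
predicted = torch.randn(32, 64, requires_grad=True)
observed = torch.randn(32, 64)
loss = contrastive_prediction_loss(predicted, observed)
loss.backward()
```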
Looking ahead, the rise of WAM may reshape the hardware landscape as well. VLA models could run on relatively modest edge devices, but WAM’s simulation loops demand more powerful onboard compute. This could accelerate the adoption of specialized AI chips in robotics, similar to how large language models drove demand for GPUs. Startups like Skild AI and Covariant are already exploring WAM-inspired architectures, indicating that the paradigm shift is gaining momentum beyond NVIDIA.
“The end of VLA is not a failure, but a natural evolution,” Fan concluded in his talk. “We learned that perception and action are not enough. True intelligence in the physical world requires the ability to imagine, to predict, to simulate. That is what WAM brings to the table.”
For the robotics community, the message is clear: the era of “seeing and doing” is giving way to an era of “seeing, simulating, and acting.” The machines are about to start dreaming before they move. Whether that dream aligns with reality will determine the future of automation.