Beyond next-token: world models and the next paradigm

Large language models are, at heart, extraordinary next-token predictors. Trained on enough text, predicting the next word turns out to require a startling amount of competence — but it is still, fundamentally, a model of language. A growing camp of researchers argues that the path to the next leap is a model not of words, but of the world itself.

What a world model actually predicts

The difference is in what the model is trained to anticipate. An LLM, given “the cat sat on the,” predicts the next token, most plausibly “mat.” A world model, given a state of the world and an action, predicts the next state of the world: where the ball rolls if you push it, what the room looks like after you take a step, how quickly a car slows when its driver brakes. More precisely, world models are learned representations and simulators that maintain state, predict dynamics, and support counterfactual reasoning for planning and control.
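To make the "state plus action in, next state out" interface concrete, here is a minimal sketch in Python. It is a toy, not a real world model: the dynamics are hand-written rather than learned, and the names BallState, step, and rollout, along with the friction and time-step values, are assumptions invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BallState:
    position: float  # metres along a 1-D track
    velocity: float  # metres per second


def step(state: BallState, push: float, dt: float = 0.1, friction: float = 0.5) -> BallState:
    """Hand-written stand-in for a learned dynamics model: (state, action) -> next state."""
    acceleration = push - friction * state.velocity
    velocity = state.velocity + acceleration * dt
    position = state.position + velocity * dt
    return BallState(position, velocity)


def rollout(state: BallState, pushes: list[float]) -> list[BallState]:
    """Apply a sequence of actions, keeping every intermediate state."""
    trajectory = [state]
    for push in pushes:
        state = step(state, push)
        trajectory.append(state)
    return trajectory


start = BallState(position=0.0, velocity=0.0)
pushed = rollout(start, [1.0] * 5)      # what happens if we keep pushing?
left_alone = rollout(start, [0.0] * 5)  # what happens if we do nothing?
print(pushed[-1], left_alone[-1])
```

A learned world model would replace the body of step with a trained network, but the signature would look much the same. Running one start state through two different action sequences, as the last lines do, is counterfactual reasoning in miniature.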

Counterfactual reasoning is the prize. If a model can simulate “what happens if I do X,” an agent can plan by imagining consequences before acting, which is exactly what you need for robotics, autonomous driving, and any agent that has to act in a physical or interactive environment rather than just chat about one.
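Planning by imagination can be sketched in a few lines. The example below assumes a toy one-dimensional world with three possible pushes and a hand-written predict function standing in for a learned model. The planner simply imagines every short action sequence, scores where each one ends up, and takes the first action of the best plan. All names and constants here (ACTIONS, predict, imagined_cost, plan_next_action, the horizon, the velocity weight) are hypothetical, and real systems use far more efficient search than brute-force enumeration.

```python
from itertools import product

ACTIONS = (-1.0, 0.0, 1.0)  # push left, coast, push right


def predict(state, action, dt=0.1):
    """Toy stand-in for a world model: (position, velocity) + action -> next state."""
    position, velocity = state
    velocity = velocity + action * dt
    position = position + velocity * dt
    return (position, velocity)


def imagined_cost(state, plan, goal, velocity_weight=0.05):
    """Roll the plan forward inside the model and score the imagined end state."""
    for action in plan:
        state = predict(state, action)
    position, velocity = state
    return abs(position - goal) + velocity_weight * abs(velocity)


def plan_next_action(state, goal, horizon=4):
    """Imagine every action sequence of length `horizon`; return the best first move."""
    best_plan = min(
        product(ACTIONS, repeat=horizon),
        key=lambda plan: imagined_cost(state, plan, goal),
    )
    return best_plan[0]


state, goal = (0.0, 0.0), 1.0
for _ in range(3):
    action = plan_next_action(state, goal)
    state = predict(state, action)  # here the "real" world is the same toy model
    print(action, state)
```

Replanning after every step, rather than committing to the whole imagined sequence, is a common design choice because it lets the agent correct course when the real world diverges from the model's prediction.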

Two philosophies fighting it out

There is a genuine intellectual split over how to build one, and it is one of the more interesting debates in AI right now. One path renders the world to predict it; the other compresses the world to understand it:

Render the world (Sora, Genie). The bet is that if you can generate a realistic, controllable video of what happens next, that generative ability is a kind of understanding. Google DeepMind’s Genie 3 generates persistent, interactive 3D environments in real time at 24 frames per second — a world you can actually move around inside.
Compress the world (JEPA, Dreamer). The counter-bet is that pixel-perfect rendering is a distraction; what matters is predicting the abstract state and dynamics, not every visual detail. This is the camp Yann LeCun left Meta to pursue, founding a new lab to build AI that understands physics rather than predicting text.
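The gap between the two bets shows up in what each predictor has to output. The sketch below uses a deliberately tiny world, a single dot moving along a 16-pixel strip, and hand-written functions in place of learned networks. The names render, encode, predict_pixels, and predict_latent are hypothetical. In real systems of either kind the encoder and predictor are trained, and the latent state is a learned vector rather than a single integer.

```python
WIDTH = 16  # number of pixels in the toy 1-D "frame"


def render(position):
    """Draw the world: one bright pixel at the dot's position."""
    return [1.0 if i == position else 0.0 for i in range(WIDTH)]


def encode(frame):
    """Hand-written stand-in for a learned encoder: frame -> abstract state."""
    return frame.index(max(frame))


def predict_pixels(frame, velocity):
    """'Render' route: output every pixel of the next frame."""
    return [frame[(i - velocity) % WIDTH] for i in range(WIDTH)]


def predict_latent(position, velocity):
    """'Compress' route: output only the next abstract state."""
    return (position + velocity) % WIDTH


frame = render(3)
next_frame = predict_pixels(frame, velocity=2)           # WIDTH numbers to get right
next_state = predict_latent(encode(frame), velocity=2)   # one number to get right
print(len(next_frame), next_state, encode(next_frame) == next_state)
```

Both routes agree on where the dot ends up. The pixel route has to produce every value in the frame to say so, while the latent route predicts only the quantity that matters for the dynamics.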

Why this could matter more than another LLM

The honest framing is that world models are not a replacement for language models so much as a different substrate. Language is humanity’s compressed knowledge; the physical world is the thing language is about. An agent that only ever learned from text knows what people say happens when you drop a glass; an agent with a world model can simulate it. For anything embodied (robots, self-driving cars, agents acting in real or virtual environments) that difference is the whole game.

It is too early to crown a winner, and plenty of the hype will not survive contact with reality. But the underlying idea is one of the most exciting in AI: moving from models that predict our words to models that predict our world — and can therefore imagine, plan, and act within it. If the transformer was the architecture that defined the language era, the race now is to find the one that defines whatever comes after it.