Test-time compute: why thinking longer beats thinking bigger

For most of the deep-learning era, the recipe for a smarter model was blunt: make it bigger. More parameters, more data, more training compute, better benchmarks — a relationship so reliable it was written down as the scaling laws. The trouble is that this dial is brutally expensive. Doubling a frontier model’s capability could mean an order of magnitude more GPUs, months of training, and a power bill the size of a small town’s.

Then reasoning models arrived and showed there was a second dial — one you turn not during training, but at the moment you ask the question.

The second scaling axis

The idea is almost embarrassingly simple. Instead of answering immediately, the model first generates a long internal monologue — a chain of thought — working through the problem step by step, trying approaches, catching its own mistakes, and only then committing to a final answer. Every one of those intermediate tokens is extra computation. You are, quite literally, paying the model to think before it speaks.

This is test-time compute scaling (also called inference-time scaling): spending more compute after training, per query, to get a better answer — without changing a single model weight.

The headline result that made everyone pay attention: DeepSeek-R1 reported performance comparable to OpenAI’s o1 on hard reasoning benchmarks, with long reasoning traces per query doing much of the work rather than sheer model size. The same principle shows up as a rule of thumb in the inference-scaling literature: a small model given enough thinking budget can rival one many times its size. The “intelligence” was not all baked into the weights. Some of it was waiting to be unlocked at inference time.

In the diagram, the dashed line is the old world: a large model answers in a single forward pass, and its accuracy is whatever it is. The solid curve is the new one: a much smaller model whose accuracy climbs as you let it think longer until, somewhere around the crossover, it overtakes the bigger model entirely. Same problem, less hardware, more patience.

Why does thinking even help?

A model’s forward pass is fixed-depth: a fixed stack of layers, a fixed amount of computation per token. For an easy question that is plenty. For a hard one — a competition math problem, a tricky bug, a multi-step plan — there simply is not enough computation in one pass to get from the question to the answer.

Chain of thought turns that fixed-depth process into a variable-depth one. Each generated token feeds back in as context, so the model can use its own intermediate work as a scratchpad: lay out the problem, try a path, notice it leads nowhere, back up, try another. It is the difference between blurting out the first thing that comes to mind and working it out on paper. This is the same mechanism behind prompting a model to reason step by step — reasoning models just learned to do it natively, and at length, without being asked.
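To put rough numbers on "fixed compute per token, variable compute per answer", here is a minimal sketch. It assumes the common back-of-envelope estimate of about 2 FLOPs per parameter per generated token for a dense transformer, which ignores attention over the growing context. The model sizes and token counts are hypothetical, picked only to make the arithmetic easy to follow.

```python
# Rule-of-thumb assumption: one forward pass over one token costs about
# 2 * N FLOPs for a dense model with N parameters. Attention cost over a
# long context is ignored, so treat these figures as rough lower bounds.


def forward_flops_per_token(n_params: float) -> float:
    """Approximate FLOPs for a single forward pass over one token."""
    return 2.0 * n_params


def query_flops(n_params: float, generated_tokens: int) -> float:
    """Approximate FLOPs to generate a whole response, token by token."""
    return forward_flops_per_token(n_params) * generated_tokens


# Hypothetical model sizes and response lengths, for illustration only.
LARGE_PARAMS = 70e9  # a large model that answers directly
SMALL_PARAMS = 7e9   # a smaller model that reasons before answering

direct_answer = query_flops(LARGE_PARAMS, generated_tokens=200)
short_think = query_flops(SMALL_PARAMS, generated_tokens=200)
long_think = query_flops(SMALL_PARAMS, generated_tokens=2000)

print(f"large model, 200 tokens : {direct_answer:.2e} FLOPs")
print(f"small model, 200 tokens : {short_think:.2e} FLOPs")
print(f"small model, 2000 tokens: {long_think:.2e} FLOPs")
```

Under these assumptions, the small model generating ten times as many tokens spends about the same compute as the large model's short answer. The per-token cost never changes; only the number of tokens does. Whether that extra compute actually buys matching accuracy is an empirical question this arithmetic cannot answer.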

A second trick stacks on top: instead of one chain of thought, generate many in parallel and let them vote or be scored. Spreading the same compute budget across several independent attempts and keeping the best is often far more reliable than one long attempt — search beats a single guess.
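The voting variant can be sketched in a few lines. This is a minimal illustration of majority voting over independently sampled answers, not any particular lab's implementation. The names `sample_and_vote` and `make_toy_sampler` are hypothetical, and the toy sampler is a stand-in for a real model call that returns only the final answer of one chain of thought.

```python
import random
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Return the most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def sample_and_vote(sample_answer: Callable[[], str], n_samples: int) -> str:
    """Draw n_samples independent answers and keep the most frequent one."""
    answers = [sample_answer() for _ in range(n_samples)]
    return majority_vote(answers)


def make_toy_sampler(p_correct: float, seed: int) -> Callable[[], str]:
    """Hypothetical stand-in for a model: answers "42" with probability
    p_correct, otherwise picks one of a few wrong answers at random."""
    rng = random.Random(seed)
    wrong_answers = ["41", "43", "24"]

    def sample() -> str:
        if rng.random() < p_correct:
            return "42"
        return rng.choice(wrong_answers)

    return sample


if __name__ == "__main__":
    sampler = make_toy_sampler(p_correct=0.6, seed=0)
    print("single attempt:", sampler())
    print("vote over 15  :", sample_and_vote(sampler, n_samples=15))
```

Voting only helps when the correct answer is the single most likely one and wrong answers scatter across different values. If the model is consistently wrong in the same way, more samples just confirm the mistake. Scoring each attempt with a separate verifier is the other option the paragraph mentions, and it is not shown here.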

The infrastructure earthquake

This quietly rewired the economics of AI. Under the old scaling law, the eye-watering cost was training: spend a fortune once, then serve cheap forward passes forever. Test-time compute inverts that. Every single query now carries a variable, sometimes large, thinking cost — you pay for the reasoning every time someone asks.
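A toy cost model makes the inversion concrete. Everything in it is an assumption for illustration: the `Deployment` class, the cost units, and all the numbers are made up, and real serving costs depend on batching, hardware, and context length in ways this ignores.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Deployment:
    """Hypothetical lifetime cost model for one deployed model."""

    training_cost: float    # one-time cost, arbitrary currency units
    cost_per_token: float   # serving cost per generated token
    tokens_per_query: int   # answer tokens plus any reasoning tokens
    n_queries: int          # queries served over the model's lifetime

    def inference_cost(self) -> float:
        """Total serving cost across all queries."""
        return self.cost_per_token * self.tokens_per_query * self.n_queries

    def inference_share(self) -> float:
        """Fraction of lifetime cost spent on serving rather than training."""
        serving = self.inference_cost()
        return serving / (self.training_cost + serving)


# Made-up numbers: same model, same traffic, different response lengths.
direct = Deployment(
    training_cost=1_000_000.0,
    cost_per_token=1e-5,
    tokens_per_query=200,
    n_queries=100_000_000,
)
reasoning = replace(direct, tokens_per_query=5_000)

print(f"direct answers : {direct.inference_share():.0%} of cost is inference")
print(f"with reasoning : {reasoning.inference_share():.0%} of cost is inference")
```

With these invented figures, inference goes from about a sixth of lifetime cost to about five sixths just by lengthening each response, with training cost and traffic held fixed. The point is the direction of the shift, not the specific percentages.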

At scale the shift is dramatic. Some analysts project that inference compute demand will exceed training demand by roughly two orders of magnitude as reasoning models proliferate. The industry’s bottleneck is moving from “can we afford to train it?” to “can we afford to run it?”, which is exactly why so much 2026 engineering effort has gone into making each token cheaper to serve and into the KV-cache optimisations that decide how much it costs to think.

Where it quietly breaks

Test-time compute is not a free lunch, and two failure modes matter.

Overthinking. More reasoning is not monotonically better. Past some point, extra deliberation starts to hurt — researchers have documented an overthinking effect where extended chains lead models to abandon a correct answer they had already found and talk themselves into a wrong one. The accuracy curve in the diagram does not rise forever; it bends back down. Longer is better, until it is worse.

Knowledge, not reasoning. Thinking longer helps when the answer is derivable — math, code, logic, planning. It does much less when the answer is simply a fact the model either knows or does not. A 2026 study found that test-time scaling is not yet effective for knowledge-intensive tasks, where more reasoning fails to improve accuracy and can even increase hallucination — the model reasons elaborately toward a confidently wrong fact. No amount of deliberation recovers information that was never in the weights.

The practical rule that falls out: reach for heavy test-time compute on problems that reward working it out, and keep it cheap on problems that only reward knowing it. Pointing a giant thinking budget at a trivia question mostly buys you a more eloquent mistake.
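In code, that rule amounts to choosing a thinking budget from the kind of task before calling the model. The sketch below assumes the caller already knows which kind of task it has. The `TaskKind` categories, the token budgets, and the `hard_cap` parameter are all hypothetical; the cap is there as a nod to the overthinking effect described above.

```python
from enum import Enum


class TaskKind(Enum):
    """Hypothetical two-way split matching the rule in the text."""

    DERIVABLE = "derivable"  # math, code, logic, planning
    RECALL = "recall"        # a fact the model either knows or does not


# Illustrative budgets in reasoning tokens; real values need tuning.
DEFAULT_BUDGETS = {
    TaskKind.DERIVABLE: 8_000,
    TaskKind.RECALL: 256,
}


def thinking_budget(kind: TaskKind, hard_cap: int = 16_000) -> int:
    """Return the maximum number of reasoning tokens to allow.

    The hard cap bounds every task, since more deliberation is not
    always better and always costs more.
    """
    return min(DEFAULT_BUDGETS[kind], hard_cap)


if __name__ == "__main__":
    for kind in TaskKind:
        print(f"{kind.value:>9}: up to {thinking_budget(kind)} reasoning tokens")
```

Deciding which bucket a query belongs in is the hard part, and this sketch leaves it to the caller. A fixed table is the simplest possible policy; anything adaptive would need evidence about where extra thinking actually pays off for a given model.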

The takeaway

For a decade, “make it smarter” meant “make it bigger,” and the only knob was training. Reasoning models added a second knob — one you turn per query, at inference time — and showed that a modest model with room to think can stand toe to toe with a giant that answers on reflex. It is one of the most important shifts in how we get capability out of these systems since the transformer itself: not a better brain, but permission to use the one it has for longer.