datarekha

Reasoning models & test-time compute

o-series and R1-style models trade inference compute for accuracy by thinking longer before they answer. How test-time scaling works, when it's worth it, and the overthinking trap.

8 min read · Intermediate · Generative AI · Lesson 4 of 33

What you'll learn

  • What a reasoning model is and how test-time compute differs from training compute
  • Why accuracy rises with a thinking budget — and where overthinking kicks in
  • When a reasoning model is worth its cost and latency, and when it isn't

Before you start

For years, the only way to make a model smarter was to make it bigger — more parameters, more training compute. Then a different lever appeared: let the model think longer at inference time. Give it room to reason step by step, internally, before it commits to an answer. That’s a reasoning model — the o-series, DeepSeek-R1, and their descendants — and it reshaped how the whole field thinks about getting good answers. If your mental model of an LLM is still “one prompt in, one answer out,” this is the update.

Two kinds of compute

  • Training compute is spent once, up front, to bake knowledge into the weights. Bigger models, more data.
  • Test-time (inference) compute is spent per query, at the moment you ask. A reasoning model generates a long internal chain of thought — often thousands of hidden “thinking tokens” — exploring, checking, and backtracking before it writes the final answer.
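
To make the per-query nature of test-time compute concrete, here is a minimal cost sketch. The prices, token counts, and the `Pricing` and `query_cost` names are all hypothetical, and it assumes hidden thinking tokens are billed at the output-token rate, which you should verify against your provider's pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical per-token prices in USD; not any vendor's real rates."""
    input_per_million: float
    output_per_million: float


def query_cost(pricing: Pricing, prompt_tokens: int,
               thinking_tokens: int, answer_tokens: int) -> float:
    """Cost of one query, ASSUMING thinking tokens are billed like output tokens."""
    output_tokens = thinking_tokens + answer_tokens
    total_micro_usd = (prompt_tokens * pricing.input_per_million
                       + output_tokens * pricing.output_per_million)
    return total_micro_usd / 1_000_000


# Illustrative numbers only.
pricing = Pricing(input_per_million=1.0, output_per_million=4.0)
standard = query_cost(pricing, prompt_tokens=500, thinking_tokens=0, answer_tokens=300)
reasoning = query_cost(pricing, prompt_tokens=500, thinking_tokens=4000, answer_tokens=300)
print(f"standard:  ${standard:.4f}")
print(f"reasoning: ${reasoning:.4f}")
```

Under these made-up numbers the same question costs roughly ten times more once 4,000 thinking tokens are added, and that multiplier applies to every query rather than once up front.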

The striking finding: on hard problems (competition math, code, science), spending more test-time compute can beat spending more training compute. A modest model that’s allowed to think can outperform a much larger one that answers immediately. Reasoning models are trained with reinforcement learning to make that thinking productive; DeepSeek-R1 (trained with the GRPO algorithm) and OpenAI’s o-series are the canonical examples. As a result, inference rather than training is now the dominant compute cost for many deployments.

More thinking helps — until it doesn’t

Here’s the part people miss: the relationship between thinking budget and accuracy is not “more is always better.” Accuracy rises with diminishing returns, plateaus, and can actually decline as the model over-reasons and second-guesses correct answers, while latency and cost climb the entire time. The goal is to find the sweet spot before the curve turns down.

That decline at large budgets is real. It’s the overthinking regime, and it’s why the knob you tune on a reasoning model is a budget, not “max it out.”
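
One way to pick a budget is to measure accuracy at several budgets on your own eval set, then choose the point where accuracy minus a cost penalty is highest. The sketch below uses invented accuracy numbers purely to show the shape of the trade-off; `CURVE`, `best_budget`, and the linear penalty are assumptions for illustration, not measurements or a standard method.

```python
# Illustrative (budget_tokens, accuracy) pairs. These are INVENTED numbers,
# not measurements; replace them with results from your own evals.
CURVE = [
    (256, 0.42),
    (1024, 0.61),
    (4096, 0.74),
    (8192, 0.77),
    (16384, 0.78),
    (32768, 0.75),  # overthinking: accuracy dips while cost keeps rising
]


def best_budget(curve, penalty_per_1k_tokens):
    """Return the budget that maximizes accuracy minus a linear cost penalty."""
    def utility(point):
        budget, accuracy = point
        return accuracy - penalty_per_1k_tokens * budget / 1000
    return max(curve, key=utility)[0]


print(best_budget(CURVE, penalty_per_1k_tokens=0.0))    # cost ignored
print(best_budget(CURVE, penalty_per_1k_tokens=0.005))  # cost matters a little
print(best_budget(CURVE, penalty_per_1k_tokens=0.02))   # cost matters a lot
```

Even with the penalty set to zero, the largest budget is not the winner on this toy curve, because accuracy has already started to fall. As the penalty grows, the preferred budget moves further left.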

Chain-of-thought, but built in

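You may know chain-of-thought prompting: “think step by step.” A reasoning model has that behavior trained in and runs it in a hidden scratchpad, far more thoroughly than a prompt could elicit. One important consequence: you generally don’t need to add “think step by step” to your prompt, because the model already reasons that way on its own.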

When to reach for a reasoning model

It’s a tool with a cost, not a free upgrade. A simple decision:

Hard multi-step problem? (math, code, planning, hard analysis)
        │ yes                          │ no
        ▼                              ▼
  Reasoning model,              Standard model
  budget = high                 (faster, cheaper)

Latency-critical UX (autocomplete, chat-fast-path)?  → standard model
High-volume, easy classification/extraction?         → standard model (or a small one)
A few genuinely hard queries among many easy ones?   → route the hard ones to a reasoning model

The last line is the production pattern: don’t send everything to an expensive reasoning model — route only the queries that need it.
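
The routing rules above can be sketched as a small function. The model names, the keyword list, and the `route` function are hypothetical; a keyword regex is only a stand-in for whatever difficulty signal you actually use, such as a trained classifier or a small, cheap model.

```python
import re

# Hypothetical model names; substitute whatever your provider offers.
STANDARD_MODEL = "standard-model"
REASONING_MODEL = "reasoning-model"

# Crude keyword signals for multi-step work. This is an ASSUMPTION for
# illustration only, not a recommended way to detect hard queries.
HARD_SIGNALS = re.compile(
    r"\b(prove|derive|debug|optimi[sz]e|plan|algorithm)\b",
    re.IGNORECASE,
)


def route(query: str, latency_critical: bool = False) -> str:
    """Pick a model for one query, following the decision rules above."""
    if latency_critical:
        # Autocomplete and fast-path chat cannot wait for long reasoning.
        return STANDARD_MODEL
    if HARD_SIGNALS.search(query):
        return REASONING_MODEL
    return STANDARD_MODEL


print(route("Classify this review as positive or negative"))
print(route("Prove that the sum of two even numbers is even"))
print(route("Debug this race condition", latency_critical=True))
```

The latency check comes first on purpose: a hard query on a latency-critical path still goes to the standard model, which matches the rules listed above.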

Quick check

Q1. What is “test-time compute” in a reasoning model?
Q2. How does accuracy typically relate to the thinking budget on a hard problem?
Q3. You’re using a reasoning model. Should you add “think step by step” to your prompt?

Next

Reasoning models are powerful but expensive, which makes two other lessons essential: model routing to send only the hard queries their way, and LLM evals to actually measure whether the extra thinking is buying you correctness.

Practice this in an interview

What are reasoning models, and what is test-time compute?

Reasoning models are trained to produce an extended chain of thought before answering, often via reinforcement learning, so they spend more computation deliberating on hard problems. Test-time compute is the idea of improving answer quality by allocating more inference-time compute, for example longer reasoning chains, sampling multiple solutions, or self-verification, rather than only scaling parameters.
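
Sampling multiple solutions is the easiest of these to sketch: draw several answers and keep the most common one. In the minimal example below, `majority_vote` and `sample_answer` are hypothetical names, and the "model" is a stand-in that replays canned answers rather than making a real API call.

```python
from collections import Counter
from typing import Callable


def majority_vote(sample_answer: Callable[[], str], n_samples: int) -> str:
    """Draw n_samples answers and return the most common one."""
    votes = Counter(sample_answer() for _ in range(n_samples))
    return votes.most_common(1)[0][0]


# Stand-in for a model call: replays canned answers instead of sampling.
canned = iter(["42", "41", "42", "42", "17"])
answer = majority_vote(lambda: next(canned), n_samples=5)
print(answer)
```

Each extra sample is extra inference compute, so `n_samples` plays the same role here that the thinking budget plays for a single long chain of thought.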

What is chain-of-thought prompting and when does it help?

Chain-of-thought (CoT) prompting instructs the model to write out intermediate reasoning steps before producing a final answer, which improves accuracy on multi-step arithmetic, logic puzzles, and compositional questions. In the original chain-of-thought experiments the gains appeared mainly in large models, on the order of 100B parameters, and the technique helps most on tasks where the answer space is large enough that guessing is hard.

How do you choose between batch and real-time inference for a model?

Decide based on how fresh the prediction must be versus the cost and complexity of serving live. Use batch when results are needed every few hours or days, like daily churn lists, because it is cheap, simple, and can use spot or scheduled compute. Use real-time when a late or stale decision causes immediate loss, like fraud or ad auctions needing sub-100ms responses, accepting higher cost and complexity. Most production systems are hybrid: precompute heavy signals offline and do lightweight re-ranking online.

What is Chain-of-Thought prompting and how does it aid reasoning?

Chain-of-Thought prompting asks the model to generate intermediate reasoning steps before its final answer, either via examples or instructions like think step by step. Producing intermediate steps lets the model decompose multi-step problems and conditions the final answer on its own reasoning, improving accuracy on arithmetic, logic, and multi-hop tasks.
