What is Chain-of-Thought prompting and how does it aid reasoning?

Chain-of-Thought prompting asks the model to generate intermediate reasoning steps before its final answer, either via worked examples or via instructions like “think step by step.” Producing intermediate steps lets the model decompose multi-step problems and conditions the final answer on its own reasoning, improving accuracy on arithmetic, logic, and multi-hop tasks.
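The two ways of eliciting a chain of thought mentioned above (an instruction, or worked examples) can be sketched as plain prompt builders. This is a minimal illustration, not a canonical template: the function names `zero_shot_cot` and `few_shot_cot`, the `Q:`/`A:` layout, and the demo problem are all assumptions made for this example.

```python
from __future__ import annotations


def zero_shot_cot(question: str) -> str:
    """Instruction-style CoT: append a step-by-step cue after the question."""
    return f"Q: {question}\nA: Let's think step by step."


def few_shot_cot(question: str, examples: list[tuple[str, str, str]]) -> str:
    """Example-style CoT: each example is (question, reasoning, answer)."""
    blocks = [
        f"Q: {q}\nA: {reasoning} The answer is {answer}."
        for q, reasoning, answer in examples
    ]
    # The new question ends with an open "A:" so the model continues with reasoning.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


# Hypothetical worked example used as the in-context demonstration.
demo = [(
    "Tom has 3 apples and buys 2 more. How many does he have?",
    "He starts with 3 and adds 2, so 3 + 2 = 5.",
    "5",
)]

print(zero_shot_cot("What is 17 * 6?"))
print()
print(few_shot_cot("A train travels 60 km in 1.5 hours. What is its speed?", demo))
```

Either string is what you would send to a standard (non-reasoning) model; the few-shot form costs more prompt tokens but shows the model the reasoning format you expect.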

What are reasoning models, and what is test-time compute?

Reasoning models are trained to produce an extended chain of thought before answering, often via reinforcement learning, so they spend more computation deliberating on hard problems. Test-time compute is the idea of improving answer quality by allocating more inference-time compute, for example longer reasoning chains, sampling multiple solutions, or self-verification, rather than only scaling parameters.
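One of the test-time strategies named above, sampling multiple solutions, can be sketched as a majority vote over independent samples. The solver here is a stand-in, not a real model: `noisy_solver`, its 60% hit rate, and the answer values are assumptions chosen only to make the effect visible.

```python
import random
from collections import Counter


def noisy_solver(rng: random.Random) -> int:
    """Hypothetical stand-in for one sampled reasoning chain.

    Returns the correct answer (42) 60% of the time, otherwise a random wrong one.
    """
    if rng.random() < 0.6:
        return 42
    return rng.choice([40, 41, 43, 44])


def majority_vote(answers):
    """Return the most common answer among the samples."""
    return Counter(answers).most_common(1)[0][0]


def self_consistency(n_samples: int, rng: random.Random) -> int:
    """Spend more inference compute: sample n answers, keep the majority."""
    return majority_vote([noisy_solver(rng) for _ in range(n_samples)])


rng = random.Random(0)
trials = 2000
for n in (1, 5, 15):
    correct = sum(self_consistency(n, rng) == 42 for _ in range(trials))
    print(f"samples={n:2d}  accuracy={correct / trials:.2f}")
```

The weights never change between rows; only the number of samples per query does, which is the sense in which accuracy is bought with inference-time compute.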

What is chain-of-thought prompting and when does it help?

Chain-of-thought (CoT) prompting instructs the model to write out intermediate reasoning steps before producing a final answer, which improves accuracy on multi-step arithmetic, logic puzzles, and compositional questions. It is most impactful on models with at least ~10B parameters and on tasks where the answer space is large enough that guessing is hard.

How do you choose between batch and real-time inference for a model?

Decide based on how fresh the prediction must be versus the cost and complexity of serving live. Use batch when results are needed every few hours or days, like daily churn lists, because it is cheap, simple, and can use spot or scheduled compute. Use real-time when a late or stale decision causes immediate loss, like fraud or ad auctions needing sub-100ms responses, accepting higher cost and complexity. Most production systems are hybrid: precompute heavy signals offline and do lightweight re-ranking online.

Reasoning models & test-time compute — Generative AI

For years, the only way to make a model smarter was to make it bigger — more parameters, more training compute. Then a different lever appeared: let the model think longer at inference time. Give it room to reason step by step, internally, before it commits to an answer. That’s a reasoning model — the o-series, DeepSeek-R1, and their descendants — and it reshaped how the whole field thinks about getting good answers. If your mental model of an LLM is still “one prompt in, one answer out,” this is the update.

Two kinds of compute

Training compute is spent once, up front, to bake knowledge into the weights. Bigger models, more data.
Test-time (inference) compute is spent per query, at the moment you ask. A reasoning model generates a long internal chain of thought — often thousands of hidden “thinking tokens” — exploring, checking, and backtracking before it writes the final answer.
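Because test-time compute is paid on every query, it helps to see what hidden thinking tokens do to a per-query bill. The sketch below is illustrative arithmetic only: the `Pricing` values are made-up numbers, not any provider's rates, and it assumes thinking tokens are billed at the output-token rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    # Hypothetical prices in dollars per million tokens (not real provider rates).
    input_per_mtok: float = 1.0
    output_per_mtok: float = 4.0


def query_cost(input_tokens: int, thinking_tokens: int, answer_tokens: int,
               pricing: Pricing = Pricing()) -> float:
    """Dollar cost of one query.

    Assumption: hidden thinking tokens are billed like visible output tokens.
    """
    billed_output = thinking_tokens + answer_tokens
    total = (input_tokens * pricing.input_per_mtok
             + billed_output * pricing.output_per_mtok)
    return total / 1_000_000


instant = query_cost(input_tokens=500, thinking_tokens=0, answer_tokens=200)
reasoning = query_cost(input_tokens=500, thinking_tokens=4000, answer_tokens=200)
print(f"instant:   ${instant:.4f} per query")
print(f"reasoning: ${reasoning:.4f} per query ({reasoning / instant:.1f}x)")
```

With these assumed numbers, the same prompt and the same visible answer cost roughly 13 times more once 4000 thinking tokens are added, and that multiplier applies to every request.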

The striking finding: on hard problems (competition math, code, science), spending more test-time compute can beat spending more training compute. A modest model that’s allowed to think can outperform a much larger one that answers immediately. Reasoning models are trained with reinforcement learning to make that thinking productive (DeepSeek-R1, trained with GRPO, and OpenAI’s o-series are the canonical examples), and inference, not training, is now the dominant compute cost for many deployments.

More thinking helps — until it doesn’t

Here’s the part people miss: the relationship between thinking budget and accuracy is not “more is always better.” Accuracy rises with diminishing returns, plateaus, and can actually decline as the model over-reasons and second-guesses correct answers, while latency and cost climb the entire time.

That decline at high budgets is real. It’s the overthinking regime, and it’s why the knob you tune on a reasoning model is a budget, not “max it out.”

Chain-of-thought, but built in

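You may know chain-of-thought prompting: “think step by step.” A reasoning model has that behavior trained in and runs it in a hidden scratchpad, far more thoroughly than a prompt could elicit. One important consequence: you can skip “think step by step” when prompting a reasoning model, because the behavior is already built in and the extra instruction can interfere.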

When to reach for a reasoning model

It’s a tool with a cost, not a free upgrade. A simple decision:

Hard multi-step problem? (math, code, planning, hard analysis)
        │ yes                          │ no
        ▼                              ▼
  Reasoning model,              Standard model
  budget = high                 (faster, cheaper)

Latency-critical UX (autocomplete, chat-fast-path)?  → standard model
High-volume, easy classification/extraction?         → standard model (or a small one)
A few genuinely hard queries among many easy ones?   → route the hard ones to a reasoning model

The last line is the production pattern: don’t send everything to an expensive reasoning model — route only the queries that need it.
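The routing pattern can be sketched in a few lines. Treat this as a toy under loud assumptions: `estimate_difficulty` is a hypothetical keyword-and-length heuristic invented for this example (a production router would more likely use a trained classifier or a cheap model), and the threshold and hint words are arbitrary.

```python
import re

# Hypothetical hint words suggesting a multi-step problem.
HARD_HINTS = {"prove", "derive", "debug", "optimize", "plan"}


def estimate_difficulty(query: str) -> float:
    """Toy difficulty score in [0, 1]; stands in for a real classifier."""
    words = re.findall(r"[a-z]+", query.lower())
    hits = len(HARD_HINTS & set(words))
    length_bonus = min(len(words) / 200, 0.3)  # long queries lean harder
    return min(1.0, 0.35 * hits + length_bonus)


def route(query: str, latency_critical: bool = False,
          threshold: float = 0.35) -> str:
    """Return 'reasoning' only for hard, non-latency-critical queries."""
    if latency_critical:
        return "standard"  # autocomplete, chat fast path
    if estimate_difficulty(query) >= threshold:
        return "reasoning"
    return "standard"


queries = [
    ("Extract the invoice date from this email", False),
    ("Prove that the sum of two even integers is even", False),
    ("Debug this race condition in my worker pool", True),
]
for text, fast_path in queries:
    print(f"{route(text, latency_critical=fast_path):9s} <- {text}")
```

The design choice worth keeping from this sketch is the ordering: the latency check comes first, so a hard-looking query on a fast path still goes to the standard model.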

import numpy as np

# Illustrative: accuracy vs thinking budget on a hard task, with overthinking.
def accuracy(b):
    rise = 0.42 + 0.46 * (1 - np.exp(-b / 900))      # diminishing returns
    overthink = max(0.0, (b - 3000) / 5000) * 0.11    # decline past ~3000 tokens
    return min(0.9, rise - overthink)

print(f"{'budget':>7} {'accuracy':>9} {'cost(¢)':>8}")
for b in [0, 500, 1500, 3000, 5000, 8000]:
    print(f"{b:7d} {accuracy(b)*100:8.0f}% {b/1000:7.1f}")

# The cost-effective budget is near the accuracy peak, NOT the maximum.
budgets = np.arange(0, 8001, 100)
best = budgets[np.array([accuracy(b) for b in budgets]).argmax()]
print(f"\npeak accuracy at ~{best} thinking tokens — past that you pay more for less")

 budget  accuracy  cost(¢)
      0       42%     0.0
    500       62%     0.5
   1500       79%     1.5
   3000       86%     3.0
   5000       83%     5.0
   8000       77%     8.0

peak accuracy at ~3000 thinking tokens — past that you pay more for less

The table is the curve in numbers: accuracy climbs fast, crests at ~86% around 3000 thinking tokens, then falls to 77% by 8000 even as the cost keeps rising. The cost-effective budget sits at the peak, not the ceiling.

In one breath

A reasoning model spends extra test-time (inference) compute — a long hidden chain of thought — before answering.
On hard problems, more inference compute can beat more training compute; a thinking model can outscore a bigger instant one.
Accuracy rises with diminishing returns, plateaus, then declines (overthinking) — tune a budget, do not max it.
The new knob is reasoning effort (low/medium/high), set per task — every hidden thinking token is billed.
Skip “think step by step” for reasoning models (built in, can interfere); route only genuinely hard queries to them.

Quick check

Q1. What is 'test-time compute' in a reasoning model?

Q2. How does accuracy typically relate to the thinking budget on a hard problem?

Q3. You're using a reasoning model. Should you add 'think step by step' to your prompt?

Reasoning models are powerful but expensive, which makes two other lessons essential: model routing to send only the hard queries their way, and LLM evals to actually measure whether the extra thinking is buying you correctness.
