Temperature, top-p, top-k: the three knobs on an LLM

Ask an LLM to complete the sentence “The capital of France is” and it will almost always say “Paris.” Ask it to continue “In the forest at midnight, I heard” and you will get something different every time you press send. Both outputs come from the same type of computation: a softmax over a vocabulary of tens of thousands of tokens, producing a probability for each one. What differs between the two prompts is how peaked that distribution is. What changes when you drag the “temperature” slider in any playground is not the model’s knowledge; it is the rules of the lottery you run over that distribution.

Most practitioners know vaguely that high temperature means “creative” and low temperature means “focused.” That framing is true but thin. Understanding why it is true, and how top-k and top-p layer on top of it, changes how you reach for these settings in production — and saves you from the common mistake of blaming a model for randomness that is entirely your configuration’s fault.

What a forward pass actually produces

Before sampling even begins, the model runs its full forward pass: embeddings, attention layers, feed-forward layers, all of it. The final layer produces a vector of unnormalized scores — called logits — one per token in the vocabulary. A typical vocabulary is 32,000 to 128,000 tokens. Logit values are raw numbers on no particular scale; “Paris” might have a logit of 14.2 while “Lyon” has 8.6 and “cheese” has 2.1.

These logits get converted to probabilities via softmax, which exponentiates each logit and divides by the sum of all exponentiated logits. The result is a proper probability distribution: every token gets a value between 0 and 1, and they all sum to 1.
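The softmax step can be sketched in a few lines of plain Python. This is a minimal illustration, not how a production inference stack computes it; the three-token "vocabulary" reuses the example logits from above, and the function name `softmax` is chosen here for illustration.

```python
import math

def softmax(logits):
    """Convert a list of raw logits into probabilities that sum to 1."""
    # Subtracting the max logit before exponentiating avoids overflow
    # and does not change the result.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy three-token vocabulary using the example logits from the text.
logits = {"Paris": 14.2, "Lyon": 8.6, "cheese": 2.1}
probs = dict(zip(logits, softmax(list(logits.values()))))

for token, p in probs.items():
    print(f"{token:>7}: {p:.6f}")
```

With these logits, “Paris” ends up with roughly 99.6% of the mass, which shows how quickly exponentiation turns a gap of a few logit points into near-certainty.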

Temperature enters before the softmax, not after. And that placement is everything.

Temperature is a scalpel on the logits

The formula is simple: divide every logit by the temperature value T before applying softmax. That is the entire operation. A temperature of 1.0 leaves logits untouched, which is why it is the “default” — the distribution you get at T=1.0 is the distribution the model actually learned.

At T below 1.0, you are dividing by a number smaller than one — which means multiplying, which means amplifying the differences between logits. A gap of 5.6 between “Paris” (14.2) and “Lyon” (8.6) becomes a gap of 11.2 at T=0.5. After softmax, high-logit tokens receive dramatically more probability mass. The distribution sharpens into a spike. At T approaching zero, the process degenerates to argmax — always pick the single most likely token, which is what “greedy decoding” is.

At T above 1.0, you divide by a number larger than one, compressing the logits toward each other. The same gap of 5.6 between Paris and Lyon shrinks to 2.8 at T=2.0. After softmax, the distribution flattens. Tokens that were very unlikely — “moonrise,” “labyrinth,” “croissant” — become meaningfully probable. The model can now surprise you.
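To see both directions at once, here is a small sketch that applies temperature to the same three example logits. The helper name `softmax_with_temperature` is hypothetical, and the sketch assumes T is strictly positive (T=0 would need to be handled as argmax instead).

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide each logit by T, then apply softmax. Assumes temperature > 0."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; treat T=0 as argmax")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["Paris", "Lyon", "cheese"]
logits = [14.2, 8.6, 2.1]

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    gap = (logits[0] - logits[1]) / t  # Paris-Lyon gap after scaling
    row = ", ".join(f"{tok}={p:.4f}" for tok, p in zip(tokens, probs))
    print(f"T={t}: Paris-Lyon gap={gap:.1f} -> {row}")
```

The printed gap is 11.2 at T=0.5 and 2.8 at T=2.0, matching the arithmetic above, and the probability of “Lyon” rises as T grows.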

The important intuition: temperature does not change which tokens are possible. It does not add knowledge or remove hallucinations. It only changes the shape of the distribution over all tokens that already existed. A model that does not know the answer at T=1.0 will not know it at T=0.1 — it will just hallucinate with more conviction.

Figure: the same logits at two temperatures. Low T pours almost all mass onto the top token. High T spreads it across the vocabulary.

Top-k: a hard vocabulary fence

Now that the distribution exists (post-temperature), you could sample from it directly. In a 100,000-token vocabulary, even tokens with a probability of 0.000001 can occasionally win the lottery. Sometimes that is desirable; usually it just produces noise.

Top-k truncation solves this bluntly: keep only the k highest-probability tokens, set the rest to zero probability, renormalize to sum to 1, then sample. With k=50, you are always choosing among the 50 most plausible next tokens. With k=1, you get greedy decoding.
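A minimal sketch of that truncation, operating on a list of probabilities, might look like the following. The function name `top_k_filter` and the five-token distribution are made up for illustration; real implementations usually work on logit tensors rather than Python lists.

```python
def top_k_filter(probs, k):
    """Keep the k highest-probability entries, zero the rest, renormalize."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Indices of the k largest probabilities (ties broken by position).
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    kept_mass = sum(probs[i] for i in keep)
    return [probs[i] / kept_mass if i in keep else 0.0
            for i in range(len(probs))]

probs = [0.50, 0.20, 0.15, 0.10, 0.05]
print(top_k_filter(probs, 2))  # only the first two entries stay nonzero
print(top_k_filter(probs, 1))  # k=1 puts all mass on the top token (greedy)
```

Note that the surviving tokens keep their relative proportions; top-k only decides who is allowed into the lottery, not how the tickets are split among them.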

Top-k is easy to understand and fast to implement, but it has a structural weakness. The parameter k is fixed regardless of how the probability mass is actually distributed. Sometimes the top token holds 95% of the probability, and the next 49 tokens share 5%, in which case k=50 lets you sample from a lot of junk. Other times the model is genuinely uncertain: maybe 100 tokens each hold about 1% probability. Here k=50 is too restrictive; you are cutting off half of the tokens the model considered plausible.

The distribution’s shape varies wildly across token positions, and k is a constant that knows nothing about that shape. That observation is exactly what motivated nucleus sampling.

Top-p: the nucleus

Top-p, also called nucleus sampling, was introduced by Holtzman et al. in “The Curious Case of Neural Text Degeneration” (ICLR 2020). Their memorable argument was that the “unreliable tail” of the distribution, the low-probability tokens that make the output incoherent, should be cut dynamically, not with a fixed count.

The algorithm: sort tokens by probability in descending order, then accumulate them one by one until the running total crosses a threshold p. Everything in that prefix is the nucleus. Everything outside it gets zeroed. Renormalize. Sample.

If you set p=0.9, you are always sampling from the smallest set of tokens that together account for at least 90% of the probability mass. When the model is confident — “Paris” alone has 97% — the nucleus is just one token, and you effectively get greedy output. When the model is uncertain — maybe it has genuinely diverse valid continuations — the nucleus expands to include many tokens, and you get correspondingly diverse outputs.
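The nucleus construction can be sketched as follows. The function name `top_p_filter` and both toy distributions are assumptions made for illustration; the sketch follows the "at least p" reading used above, so the token that crosses the threshold is included.

```python
def top_p_filter(probs, p):
    """Keep the smallest prefix of sorted tokens whose mass reaches p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, running = [], 0.0
    for i in order:
        nucleus.append(i)
        running += probs[i]
        if running >= p:  # the token that crosses the threshold is kept
            break
    keep = set(nucleus)
    # Renormalize the nucleus so it sums to 1; everything else becomes 0.
    return [probs[i] / running if i in keep else 0.0
            for i in range(len(probs))]

confident = [0.97, 0.01, 0.01, 0.005, 0.005]
uncertain = [0.25, 0.22, 0.18, 0.15, 0.12, 0.08]

print(sum(x > 0 for x in top_p_filter(confident, 0.9)))  # nucleus size: 1
print(sum(x > 0 for x in top_p_filter(uncertain, 0.9)))  # nucleus size: 5
```

The same p=0.9 yields a one-token nucleus in the confident case and a five-token nucleus in the uncertain one, which is the adaptivity a fixed k cannot provide.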

This adaptivity is the point. Top-p is not a better version of top-k in all cases; it is a more distribution-aware version. It asks: “how much of the model’s belief do I want to respect?” rather than “how many options do I want to allow?”

Figure: tokens sorted by probability. The nucleus (colored) contains everything needed to reach cumulative p=0.90. The tail (muted) is discarded before sampling.

Why they interact — and why you usually set all three

In a typical implementation, these three settings are applied in a fixed order: temperature first (rescale logits), then top-k (hard truncate), then top-p (nucleus), then sample. They are filters in a pipeline, each operating on the output of the one before it. The exact order can differ between libraries, so check the one you use.
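Put together, the pipeline described above could be sketched like this. It is an illustrative stdlib-only version, not any particular library's sampler: the function name `sample_next_token`, the convention that `top_k=0` and `top_p=1.0` disable their filters, and the choice to treat T=0 as greedy decoding are all assumptions.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Return the index of one sampled token.

    Applies temperature, then top-k, then top-p, then samples.
    top_k=0 and top_p=1.0 disable their filters (assumed convention).
    """
    if temperature <= 0:
        # Treat T=0 as greedy decoding rather than dividing by zero.
        return max(range(len(logits)), key=lambda i: logits[i])

    # 1. Temperature: rescale logits, then softmax.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Candidate indices, most probable first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # 2. Top-k: keep at most k candidates.
    if top_k > 0:
        order = order[:top_k]

    # 3. Top-p: keep the smallest prefix whose renormalized mass reaches p.
    mass = sum(probs[i] for i in order)
    kept, running = [], 0.0
    for i in order:
        kept.append(i)
        running += probs[i] / mass
        if running >= top_p:
            break

    # 4. Sample; random.choices normalizes the weights itself.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

logits = [14.2, 8.6, 2.1]  # Paris, Lyon, cheese
print(sample_next_token(logits, temperature=0.7, top_k=50, top_p=0.9))
```

With these logits and settings the nucleus contains only index 0, so the call returns 0 every time regardless of the random seed.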

That ordering has consequences. If you set a very low temperature and then a top-p of 0.95, you will almost certainly never see the tail anyway — the temperature already collapsed the distribution into the top one or two tokens. The top-p filter becomes nearly inert. Conversely, if you use a high temperature with a very small top-k (say 10), you are first flattening the distribution and then immediately chopping most of it away. You end up with a roughly uniform sample from 10 tokens regardless of what the model actually believed — an odd combination that throws away the model’s nuance at both ends.
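Both effects are easy to check numerically. The sketch below uses a made-up set of 50 steadily decreasing logits (an assumption, not real model output) and reports, for several temperatures, how many tokens a p=0.95 nucleus would contain and how uneven the top 10 probabilities are.

```python
import math

def softmax_t(logits, temperature):
    """Softmax over logits divided by a positive temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus_size(probs, p):
    """Number of top tokens needed for cumulative mass to reach p."""
    running = 0.0
    for n, prob in enumerate(sorted(probs, reverse=True), start=1):
        running += prob
        if running >= p:
            return n
    return len(probs)

# Hypothetical logits: 50 tokens with steadily decreasing scores.
logits = [10.0 - 0.2 * i for i in range(50)]

for t in (0.1, 1.0, 3.0):
    probs = softmax_t(logits, t)
    top10 = sorted(probs, reverse=True)[:10]
    ratio = top10[0] / top10[-1]  # how non-uniform the top 10 are
    print(f"T={t}: nucleus(p=0.95) holds {nucleus_size(probs, 0.95)} tokens, "
          f"top-1/top-10 ratio = {ratio:.2f}")
```

At T=0.1 the nucleus shrinks to a couple of tokens, so top-p has almost nothing left to cut. At T=3.0 the top 10 tokens differ by less than a factor of two, so a top-k of 10 samples from something close to uniform.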

The practical implication: these parameters do not act independently, and treating them as three separate creativity sliders is how you get into trouble. Think of temperature as setting the raw shape of the distribution, and top-p/top-k as deciding how much of the tail you are willing to tolerate.

The actual decision: when to use what

Code generation and structured extraction reward low temperature. Here there is usually a single correct answer, or at least a very constrained set of them. You do not want the model to explore. Set temperature around 0.0 to 0.3, top-p around 0.9 (nucleus is still useful as a safety net for unusual token positions), and either leave top-k at a large value or disable it entirely.

Brainstorming, creative writing, and diverse output generation reward higher temperature. Set temperature around 0.8 to 1.2. Keep top-p at 0.9 to 0.95 — you still want to cut the true junk from the tail. Top-k around 50 to 100 is a reasonable safety net to prevent genuinely bizarre tokens from winning even at high temperature.

Factual question answering sits somewhere between. The model usually has a high-confidence answer, and temperature around 0.3 to 0.5 lets you get deterministic-ish behavior while still allowing some variation in phrasing. Top-p at 0.9 keeps the nucleus adaptive.
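One way to keep these recommendations from drifting is to encode them as named presets. The sketch below is purely illustrative: the preset names, the specific values picked from within the ranges above, and the use of `top_k=None` to mean "disabled" are all assumptions, and the right numbers depend on your model and task.

```python
# Hypothetical starting points taken from the ranges discussed above.
# top_k=None is used here to mean "top-k disabled" (an assumed convention).
SAMPLING_PRESETS = {
    "code_or_extraction": {"temperature": 0.2, "top_p": 0.9, "top_k": None},
    "factual_qa": {"temperature": 0.4, "top_p": 0.9, "top_k": None},
    "creative": {"temperature": 1.0, "top_p": 0.95, "top_k": 50},
}

def preset_for(task):
    """Return a copy of the preset so callers can tweak it safely."""
    if task not in SAMPLING_PRESETS:
        known = sorted(SAMPLING_PRESETS)
        raise KeyError(f"unknown task {task!r}; expected one of {known}")
    return dict(SAMPLING_PRESETS[task])

print(preset_for("code_or_extraction"))
```

Returning a copy means a caller can adjust one request's settings without silently changing the shared defaults.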

The most common mistake: leaving temperature at 1.0 (a common API default) for extraction tasks and then wondering why the model gives slightly different answers each time. A default of 1.0 is a neutral prior, not an endorsement. It means the sampling distribution matches what the model learned, which includes considerable uncertainty that may be appropriate for language modeling in general but is actively harmful for deterministic pipelines.

The thing temperature cannot fix

Here is the claim worth sitting with: no combination of temperature, top-k, and top-p will make a model factually accurate if it was not accurate at T=1.0. These parameters reshape how you sample from the model’s beliefs; they do not alter the beliefs.

If a model has calibrated uncertainty — it assigns high probability to correct answers and lower probability to wrong ones — then low temperature is extraordinarily useful: it collapses you onto the right answer. If a model is confidently wrong — high logit for an incorrect token — then low temperature makes things worse: you get deterministic incorrectness instead of random incorrectness.
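The confidently-wrong case can be illustrated with a toy example. The logits below are invented for a hypothetical prompt about the capital of Australia, where the top-scoring token is the wrong answer; they do not come from any real model.

```python
import math

def softmax_t(logits, temperature):
    """Softmax over logits divided by a positive temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits where the highest-scoring token is incorrect.
tokens = ["Sydney", "Canberra", "Melbourne"]  # correct answer: Canberra
logits = [6.0, 5.0, 3.0]

for t in (1.0, 0.5, 0.1):
    probs = softmax_t(logits, t)
    row = ", ".join(f"{tok}={p:.3f}" for tok, p in zip(tokens, probs))
    print(f"T={t}: {row}")
```

As T drops, the probability of the correct token shrinks toward zero: the wrong answer goes from likely to nearly certain.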

This is why sampling parameters can exploit a model’s confidence but cannot substitute for capability. When you see practitioners report that “lower temperature reduces hallucinations,” they are usually observing a model that is more correct at the top of its distribution than in its tail. The tail, for such a model, is where the noise and confabulation live. Lowering temperature suppresses the tail. So, mechanically, does top-p. Neither removes hallucinations from the model; they increase the fraction of samples that avoid the hallucinating part of the distribution.

The deeper lever is always the model itself — its training data, RLHF calibration, and knowledge. Sampling parameters tune which part of the model you see. Choose your temperature knowing that.

A rule of thumb that actually holds

Think of temperature as deciding how much of the model’s uncertainty you want expressed in the output. If you want the model’s best single guess, drive temperature low. If you want a sample of plausible outputs that reflects the model’s genuine uncertainty about what should come next, keep temperature closer to 1.0 or above.

Think of top-p as deciding how far into the probability tail you are willing to reach. 0.9 is a reasonable default almost everywhere. Below 0.7 you are aggressively cutting options; above 0.99 you are essentially sampling the whole distribution.

Think of top-k as a blunt guardrail that says “never sample outside the top k tokens, no matter what.” It is most useful at small values when you want hard vocabulary control, and has little effect at large values like k=200 or k=512, where top-p would usually cut the tail anyway.

The three knobs do not make a model smarter. They make it behave in a way that matches your task’s tolerance for uncertainty. That is a smaller claim than most playground sliders suggest — but it is the correct one, and building your intuition around it will save you a lot of confused debugging.