NLP & LLMs | Medium | Asked at OpenAI, Anthropic, Google, Meta

What is the difference between temperature, top-k, and top-p sampling in LLMs?

For: AI / LLM Engineer, ML Engineer, Data Scientist

The short answer

Temperature rescales the logits before softmax — lower values sharpen the distribution toward the most likely token, higher values flatten it. Top-k restricts sampling to the k highest-probability tokens; top-p (nucleus sampling) restricts it to the smallest set of tokens whose cumulative probability reaches p. In practice top-p adapts the candidate pool dynamically while top-k uses a fixed count.

How to think about it

All three parameters control which token the model picks at each step, but they operate at different points in the pipeline.

Temperature

The model produces raw logits z. Before applying softmax, each logit is divided by temperature T:

p_i = softmax(z / T)_i = exp(z_i / T) / Σ_j exp(z_j / T)

T = 1.0 — unchanged distribution (default).
T < 1.0 — logit differences are amplified; the model becomes more deterministic, favouring its top predictions.
T > 1.0 — logit differences shrink; the distribution flattens and lower-ranked tokens get more weight.
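The effect of T can be checked numerically. The following is a minimal sketch in plain Python; the function name softmax_with_temperature and the three example logits are illustrative assumptions, not values from a real model.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up logits for three tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

With these logits the top token's probability is roughly 0.87 at T = 0.5, 0.67 at T = 1.0, and 0.51 at T = 2.0, which matches the sharpening and flattening described above.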

Top-k

After applying temperature, only the k tokens with the highest probability are kept in the candidate pool. The probabilities of all other tokens are set to zero, and the k kept probabilities are renormalized so they sum to 1. If k = 1, this is greedy decoding.

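A fixed k is brittle: when the model is highly confident (one token at 95%), k = 50 still leaves 49 near-zero-probability tokens in the candidate pool. Conversely, when the distribution is flat, the same k can cut off tokens that are still plausible.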
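Top-k filtering can be sketched in a few lines of plain Python. The helper name top_k_filter and the example distribution are assumptions made for illustration; real libraries usually operate on logits tensors rather than probability lists.

```python
def top_k_filter(probs, k):
    """Keep the k largest probabilities, zero the rest, renormalize."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Indices of the k most probable tokens (ties keep their original order).
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(ranked[:k])
    kept_mass = sum(probs[i] for i in keep)
    return [p / kept_mass if i in keep else 0.0 for i, p in enumerate(probs)]

confident = [0.95, 0.02, 0.01, 0.01, 0.01]  # made-up distribution
print(top_k_filter(confident, 1))  # k = 1: all mass on the top token (greedy)
print(top_k_filter(confident, 3))  # two low-probability tokens stay in the pool
```

Note that k = 3 keeps two tokens at about 1–2% each even though the model is already 95% sure, which is the brittleness described above.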

Top-p (nucleus sampling)

Tokens are sorted by probability in descending order. The nucleus is the smallest prefix of that sorted list whose cumulative probability equals or exceeds p. Only tokens inside the nucleus are candidates.

When the distribution is sharp, the nucleus is small (few tokens). When it is flat, the nucleus is large. This self-adjusts with model confidence — the main advantage over top-k.
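The nucleus construction can be sketched as follows. The helper name top_p_filter and both example distributions are illustrative assumptions; the point is only to show how the nucleus size changes with the shape of the distribution.

```python
def top_p_filter(probs, p):
    """Keep the smallest top-ranked set whose cumulative probability >= p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # Token indices sorted by probability, highest first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break  # the token that crosses the threshold is included
    keep = set(nucleus)
    # Zero everything outside the nucleus and renormalize what is left.
    return [q / cumulative if i in keep else 0.0 for i, q in enumerate(probs)]

sharp = [0.95, 0.02, 0.01, 0.01, 0.01]  # made-up confident distribution
flat = [0.25, 0.22, 0.20, 0.18, 0.15]   # made-up uncertain distribution
for name, dist in (("sharp", sharp), ("flat", flat)):
    kept = sum(1 for q in top_p_filter(dist, 0.9) if q > 0)
    print(name, "nucleus size:", kept)
```

With p = 0.9 the sharp distribution yields a nucleus of one token, while the flat one needs all five tokens to reach the threshold.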

Typical production defaults

Setting	Common value	Effect
temperature	0.7–1.0	Balanced creativity
top-p	0.9–0.95	Dynamic nucleus
top-k	40–50 (or off)	Hard ceiling

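The order in which these steps are applied varies by implementation. Hugging Face Transformers, for example, applies temperature, then top-k, then top-p. Check the documentation of the API or library you use.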
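To show how the three settings can be chained, here is a minimal sketch of a single sampling step in plain Python. It assumes the order temperature, then top-k, then top-p, which is one possible ordering chosen for illustration. The function name sample_token, its defaults, and the rng parameter are hypothetical and do not correspond to any specific API.

```python
import math
import random

def sample_token(logits, temperature=0.8, top_k=50, top_p=0.95, rng=random):
    """Sample one token index: temperature -> top-k -> top-p -> draw."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # 1. Temperature-scaled softmax (max subtracted for stability).
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # 2. Rank tokens and apply the top-k cap (top_k=None or 0 disables it).
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k:
        order = order[:top_k]

    # 3. Top-p: shortest prefix whose share of the remaining mass reaches top_p.
    kept_mass = sum(probs[i] for i in order)
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i] / kept_mass
        if cumulative >= top_p:
            break

    # 4. Draw from the surviving tokens; choices() normalizes the weights.
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]

example_logits = [2.0, 1.0, 0.0, -1.0]  # made-up logits for four tokens
print(sample_token(example_logits, top_k=1))  # k = 1 always picks token 0
print(sample_token(example_logits, rng=random.Random(0)))  # seeded draw
```

Passing a seeded random.Random instance as rng makes draws reproducible, which is useful when comparing settings.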
