How do temperature, top-k, and top-p sampling control LLM generation?

For: AI / LLM Engineer, Data Scientist, ML Engineer

The short answer

Temperature divides the logits by T before softmax: values below 1 sharpen the distribution, approaching greedy (argmax) decoding as T approaches 0, while values above 1 flatten it and add randomness. Top-k restricts sampling to the k most likely tokens. Top-p (nucleus) sampling restricts it to the smallest set of tokens whose cumulative probability reaches at least p. Both trim the unlikely tail, and the surviving probabilities are renormalized before a token is sampled.
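To make the three knobs concrete, here is a minimal sketch in plain Python. The function names (softmax, filtered_probs, sample_token) are hypothetical and come from no particular library. The sketch assumes temperature is applied first, then top-k, then top-p, with the top-p cutoff measured on the probabilities from before the top-k cut. Real inference libraries may order or renormalize these steps differently.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def filtered_probs(logits, temperature=1.0, top_k=None, top_p=None):
    """Next-token distribution after temperature, top-k and top-p (hypothetical helper)."""
    if top_k is not None and top_k < 1:
        raise ValueError("top_k must be at least 1")
    probs = softmax(logits, temperature)

    # Token ids ordered from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    if top_k is not None:
        order = order[:top_k]  # keep only the k most likely tokens

    if top_p is not None:
        # Keep the smallest prefix whose cumulative probability reaches top_p.
        # Assumption: the cumulative sum uses the pre-top-k probabilities.
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        order = kept

    # Renormalize the survivors; every other token gets probability 0.
    total = sum(probs[i] for i in order)
    out = [0.0] * len(probs)
    for i in order:
        out[i] = probs[i] / total
    return out

def sample_token(logits, rng=random, **kwargs):
    """Draw one token id from the filtered distribution."""
    probs = filtered_probs(logits, **kwargs)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Toy logits whose softmax at T=1 is roughly (0.5, 0.3, 0.15, 0.05).
logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
print(filtered_probs(logits, top_k=2))    # two survivors, about 0.625 and 0.375
print(filtered_probs(logits, top_p=0.9))  # three survivors: 0.5 + 0.3 + 0.15 reaches 0.9
print(sample_token(logits, rng=random.Random(0), temperature=0.7, top_p=0.9))
```

With these toy logits, a temperature of 0.1 puts more than 99% of the mass on the first token, while a temperature of 10 leaves all four tokens close to uniform.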
