How does an LLM generate text — what is next-token prediction and autoregression?
An LLM generates text one token at a time by computing a probability distribution over its entire vocabulary for the next token, sampling from that distribution, appending the result, and repeating — a process called autoregression. Each new token is conditioned on all previously generated tokens, so the output at step N is only as good as the choices made at steps 1 through N-1.
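Written as a formula, this conditioning is the chain rule of probability applied to a token sequence. The symbols x_t (the token at position t) and T (the sequence length) are notation introduced here for illustration, not taken from the text above:

```latex
P(x_1, \dots, x_T) \;=\; \prod_{t=1}^{T} P\left(x_t \mid x_1, \dots, x_{t-1}\right)
```

Each factor on the right is one next-token prediction; generation samples the factors left to right.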
How to think about it
At its core, an LLM is a function that takes a sequence of tokens and outputs a vector of length |V|, one score (logit) per vocabulary entry, which a softmax turns into the probability of each token coming next. The highest-scoring token is not necessarily chosen: a sampling strategy selects from the distribution (greedy decoding, which always takes the top token, is just one option), the selected token is appended to the context, and the model runs again.
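A minimal sketch of a single decoding step may make the difference between "highest score" and "sampled" concrete. The four-entry vocabulary and the logit values are made up for illustration; a real model produces tens of thousands of logits per step.

```python
import math
import random


def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical vocabulary and made-up logits for one decoding step.
vocab = ["the", "cat", "sat", "<eos>"]
logits = [2.0, 1.0, 0.5, -1.0]

probs = softmax(logits)

# Greedy decoding: always take the highest-probability token.
greedy_id = max(range(len(probs)), key=probs.__getitem__)

# Sampling: draw a token in proportion to its probability.
rng = random.Random(0)  # fixed seed only so the sketch is repeatable
sampled_id = rng.choices(range(len(vocab)), weights=probs, k=1)[0]

print("probabilities:", [round(p, 3) for p in probs])
print("greedy pick:", vocab[greedy_id])
print("sampled pick:", vocab[sampled_id])
```

Greedy decoding returns the same token every time for these logits, while the sampled pick can change with the seed, which is where output diversity comes from.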
The autoregressive loop
1. Tokenize the prompt into integer IDs.
2. Pass the full token sequence through the transformer stack.
3. Project the final position’s hidden state through the output layer to get a logit vector, one logit per vocabulary entry.
4. Divide the logits by the temperature (optional), then apply softmax to get probabilities.
5. Sample one token ID from the distribution.
6. Append that ID and go back to step 2.
The loop terminates when a special end-of-sequence token is sampled or a maximum length is reached.
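The loop can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a real model: `toy_logits` is a hypothetical stand-in for the transformer forward pass that looks only at the last token (a real model conditions on the whole sequence), and `VOCAB`, `EOS_ID`, and `generate` are names invented for this example.

```python
import math
import random

# Hypothetical toy vocabulary; index 0 is the end-of-sequence token.
VOCAB = ["<eos>", "the", "cat", "sat", "on", "mat"]
EOS_ID = 0


def toy_logits(token_ids):
    """Stand-in for a transformer forward pass (not a real model).

    Returns one logit per vocabulary entry. It favours the entry that
    follows the last token in VOCAB, wrapping around to <eos>.
    """
    favoured = (token_ids[-1] + 1) % len(VOCAB)
    return [3.0 if i == favoured else 0.0 for i in range(len(VOCAB))]


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; temperature must be greater than 0."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def generate(prompt_ids, max_new_tokens=20, temperature=1.0, seed=0):
    """Autoregressive loop: predict, sample, append, repeat."""
    rng = random.Random(seed)
    ids = list(prompt_ids)  # the prompt is assumed to be tokenized already
    for _ in range(max_new_tokens):  # maximum-length stop condition
        logits = toy_logits(ids)  # forward pass over the current context
        probs = softmax(logits, temperature)  # logits -> probabilities
        next_id = rng.choices(range(len(VOCAB)), weights=probs, k=1)[0]
        ids.append(next_id)  # the sampled token becomes part of the context
        if next_id == EOS_ID:  # end-of-sequence stop condition
            break
    return ids


out = generate([1], max_new_tokens=10, temperature=0.7)
print(" ".join(VOCAB[i] for i in out))
```

Lowering `temperature` toward 0 sharpens the distribution so the favoured token is picked almost every time; raising it flattens the distribution and makes the output more varied.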
Why this matters for quality
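Errors compound. If an early token steers the context in the wrong direction, every subsequent token is conditioned on that mistake, and the model has no built-in way to go back and revise it. This is why beam search and careful sampling strategies exist: they try to improve accuracy or diversity within the autoregressive constraint. Speculative decoding works within the same constraint but targets speed rather than quality, since it is designed to leave the output distribution unchanged.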
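A back-of-the-envelope calculation gives a feel for how quickly small per-token error rates add up. It rests on a strong simplifying assumption that is not from the text above: each token independently has the same chance of being a "bad" choice. Real models do not behave this way exactly, and they can sometimes recover from an early slip, so treat the numbers as an intuition pump only. The function name and the 1% error rate are hypothetical.

```python
def p_all_correct(per_token_error, n_tokens):
    """Chance that all n tokens are acceptable, assuming independent errors."""
    return (1.0 - per_token_error) ** n_tokens


# Assumed 1% per-token error rate, chosen only for illustration.
for n in (10, 100, 1000):
    print(n, round(p_all_correct(0.01, n), 4))
```

Under this assumption, a sequence of 100 tokens comes out fully acceptable only about 37% of the time, even though each individual token is right 99% of the time.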