datarekha
NLP & LLMs | Easy | Asked at OpenAI, Anthropic, Google, Meta

How does an LLM generate text — what is next-token prediction and autoregression?

The short answer

An LLM generates text one token at a time by computing a probability distribution over its entire vocabulary for the next token, sampling from that distribution, appending the result, and repeating — a process called autoregression. Each new token is conditioned on all previously generated tokens, so the output at step N is only as good as the choices made at steps 1 through N-1.
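
Written as a formula, this is the chain rule of probability applied to a token sequence. The notation is an assumption for illustration: x_t is the token at position t and T is the sequence length.

```latex
P(x_1, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1})
```

Each factor on the right is one next-token prediction; generation evaluates them left to right.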

How to think about it

At its core an LLM is a function that takes a sequence of tokens and outputs a probability vector of length |V|, one entry per vocabulary item, representing how likely each token is to come next. The highest-scoring token is not necessarily chosen: a decoding strategy selects from the distribution (greedy decoding always takes the top token, while sampling may pick a lower-scoring one), the selected token is appended to the context, and the model runs again.
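
A minimal sketch of that last step, using only the Python standard library. The four-entry vocabulary and the logit values are made up for illustration; a real model produces one score for each of tens of thousands of entries.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and made-up next-token scores.
vocab = ["cat", "dog", "the", "<eos>"]
logits = [2.0, 1.0, 0.5, -1.0]

probs = softmax(logits)

# Greedy decoding: always take the highest-probability token.
greedy_token = vocab[probs.index(max(probs))]

# Sampling: draw in proportion to probability, so a lower-scoring token can win.
rng = random.Random(0)  # seeded only so the example is repeatable
sampled_token = rng.choices(vocab, weights=probs, k=1)[0]
```

Greedy decoding and sampling read the same distribution; they differ only in how they pick from it.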

The autoregressive loop

  1. Tokenize the prompt into integer IDs.
  2. Pass the full token sequence through the transformer stack.
  3. Project the final position’s hidden state through the output layer to get a logit vector, one score per vocabulary entry.
  4. Divide the logits by the temperature (optional), then apply softmax to get probabilities.
  5. Sample one token ID from the distribution.
  6. Append that ID and go back to step 2.

The loop terminates when a special end-of-sequence token is sampled or a maximum length is reached.
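
The loop can be sketched in a few lines. This is an illustration under stated assumptions, not a real model: `toy_logits` is a hypothetical stand-in for the transformer that simply favours the next entry in a five-token vocabulary, and the prompt is assumed to be tokenized already.

```python
import math
import random

VOCAB = ["<eos>", "the", "cat", "sat", "down"]  # hypothetical vocabulary
EOS_ID = 0

def toy_logits(token_ids):
    """Stand-in for the transformer stack: one logit per vocabulary entry.

    It favours the entry after the last token, which is enough to show the loop.
    """
    target = (token_ids[-1] + 1) % len(VOCAB)
    return [3.0 if i == target else 0.0 for i in range(len(VOCAB))]

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt_ids, max_new_tokens=10, temperature=1.0, seed=0):
    rng = random.Random(seed)
    ids = list(prompt_ids)                    # step 1: prompt as integer IDs
    for _ in range(max_new_tokens):           # stop condition: maximum length
        logits = toy_logits(ids)              # steps 2-3: forward pass, read logits
        probs = softmax(logits, temperature)  # step 4: temperature + softmax
        next_id = rng.choices(range(len(VOCAB)), weights=probs, k=1)[0]  # step 5
        ids.append(next_id)                   # step 6: append and repeat
        if next_id == EOS_ID:                 # stop condition: end-of-sequence
            break
    return ids

out = generate([1], max_new_tokens=10)
print([VOCAB[i] for i in out])
```

Note that the whole sequence `ids` is passed to the model on every iteration. Production systems cache intermediate results to avoid recomputing earlier positions, but the logical dependency is the same.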

[Diagram: Prompt tokens → Transformer stack → Softmax over vocab → Sample next token → append token and repeat]
Each generated token is fed back as input before the next forward pass.

Why this matters for quality

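Errors compound. If an early token steers the context in a wrong direction, every subsequent token is conditioned on that mistake, and the model cannot go back and revise it. This is why decoding strategies such as beam search and careful sampling exist: they trade off diversity against accuracy within the autoregressive constraint. Speculative decoding works within the same constraint but targets speed rather than quality, since it is designed to leave the output distribution unchanged.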
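
A rough back-of-the-envelope illustration of compounding, under a strong simplifying assumption that is not from the source: every token is independently "acceptable" with the same fixed probability. Real models do not behave this way (later tokens can sometimes recover from an earlier slip), so treat the numbers as intuition only. The function name and the 0.99 figure are hypothetical.

```python
def prob_all_correct(per_token_accuracy, num_tokens):
    """Chance that every token is acceptable, assuming independent per-token errors."""
    return per_token_accuracy ** num_tokens

# Even a 99% per-token rate decays quickly as the sequence grows.
for n in (10, 100, 1000):
    print(n, round(prob_all_correct(0.99, n), 4))
```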
