What an LLM is

You type “The capital of France is” into ChatGPT. It answers “Paris.” Underneath, no lookup happened — the model just computed which of its ~100,000 tokens was most likely to come next, and “Paris” won by a wide margin. Mechanically, a large language model is a function that takes a sequence of tokens and predicts the next one. That’s it. Everything else — the reasoning, the conversation, the code — emerges from doing this prediction extremely well, at scale, repeatedly.

Tokens, not words

LLMs don’t see characters or words — they see tokens. A token is typically a few characters: common words become one token, rare words get split into pieces.

For example, the phrase "the quick brown fox" might become:

['the', ' quick', ' brown', ' fox']      # 4 tokens

But "datarekha" might become:

['data', 'rekha']                         # 2 tokens

The model has a fixed vocabulary (usually 50k–200k tokens). Pricing, context windows, and speed are all measured in tokens — never characters or words. One token is not one word: common short words are one token, long or rare words are several, and whitespace often merges into the following token.
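To make the splitting concrete, here is a minimal sketch of a tokenizer. It is not how production tokenizers work: they learn their vocabulary from data (for example with byte-pair encoding), while this one uses a tiny hand-written vocabulary and a greedy longest-match rule. The names `TOY_VOCAB` and `toy_tokenize` are hypothetical and exist only for illustration.

```python
# Toy vocabulary (hypothetical): real vocabularies hold roughly 50k-200k
# entries and are learned from data, not written by hand.
TOY_VOCAB = ["the", " quick", " brown", " fox", "data", "rekha"]


def toy_tokenize(text: str, vocab: list[str]) -> list[str]:
    """Greedy longest-match split; unknown characters become one-character tokens."""
    pieces = sorted(vocab, key=len, reverse=True)  # try longer pieces first
    tokens = []
    i = 0
    while i < len(text):
        # Take the longest vocabulary piece that starts at position i,
        # falling back to the single character at i if nothing matches.
        match = next((p for p in pieces if text.startswith(p, i)), text[i])
        tokens.append(match)
        i += len(match)
    return tokens


print(toy_tokenize("the quick brown fox", TOY_VOCAB))  # ['the', ' quick', ' brown', ' fox']
print(toy_tokenize("datarekha", TOY_VOCAB))            # ['data', 'rekha']
```

Note how the leading space travels with the word that follows it, which is the whitespace-merging behaviour described above.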

An LLM is a next-token predictor. Text becomes IDs, IDs become vectors, vectors produce a probability over every token in the vocabulary, and one token is sampled.
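The last two stages of that pipeline (scores to probabilities, probabilities to one chosen token) can be sketched in a few lines. The four-token vocabulary and the scores below are made up for illustration; a real model produces one score per entry in its full vocabulary.

```python
import math
import random


def softmax(logits: list[float]) -> list[float]:
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical vocabulary and scores for the prompt "The capital of France is".
vocab = [" Paris", " Lyon", " London", " the"]
logits = [9.0, 4.0, 3.0, 2.0]

probs = softmax(logits)

# Two ways to choose: take the most likely token, or sample from the distribution.
greedy = vocab[probs.index(max(probs))]
sampled = random.choices(vocab, weights=probs, k=1)[0]

print(dict(zip(vocab, (round(p, 3) for p in probs))))
print("greedy:", greedy, "| sampled:", sampled)
```

Greedy selection always returns the top token; sampling usually does too when one token dominates, but can occasionally return a less likely one.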

The autoregressive loop

To generate text, an LLM does this in a loop:

1. Look at the input tokens so far.
2. Produce a probability distribution over the next token.
3. Sample (or pick the most likely) one.
4. Append it to the input.
5. Go to step 1.

Stop when the model produces an end-of-text token, hits a length limit, or hits a stop sequence you specified.

Prompt:   "The capital of France is"
Step 1:   → "Paris"     (highest probability)
Step 2:   → "."         (sentence ends naturally)
Step 3:   → <end>       (model stops)

That’s the whole thing. The model is only trained to predict the next token. The fact that this looks like reasoning, conversation, and code generation is an emergent property of having seen enormous amounts of text.

One token at a time: each sampled token is appended and fed back in, so every choice constrains the next.
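The loop can be sketched with a stand-in for the model. Everything here is hypothetical: `TOY_MODEL` is a hand-written lookup table keyed on the most recent token, whereas a real LLM computes its distribution from the whole sequence. The control flow (predict, choose, append, check stop conditions) is the part that carries over.

```python
import random

END = "<end>"

# Hypothetical stand-in for a trained model: maps the most recent token to a
# next-token distribution. A real LLM conditions on every token so far.
TOY_MODEL = {
    " is": {" Paris": 0.9, " Lyon": 0.1},
    " Paris": {".": 1.0},
    " Lyon": {".": 1.0},
    ".": {END: 1.0},
}


def next_token_distribution(tokens: list[str]) -> dict[str, float]:
    """Look at the tokens so far and return a probability per candidate token."""
    return TOY_MODEL.get(tokens[-1], {END: 1.0})


def generate(prompt_tokens, max_new_tokens=10, greedy=True, stop=None):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):  # length limit
        dist = next_token_distribution(tokens)
        if greedy:
            token = max(dist, key=dist.get)  # pick the most likely token
        else:
            token = random.choices(list(dist), weights=list(dist.values()))[0]
        if token == END:  # end-of-text token
            break
        tokens.append(token)  # feed the choice back in
        if stop and "".join(tokens).endswith(stop):  # stop sequence
            break
    return tokens[len(prompt_tokens):]


prompt = ["The", " capital", " of", " France", " is"]
print(generate(prompt))  # [' Paris', '.']
```

Passing `greedy=False` samples instead of taking the top token, so roughly one run in ten of this toy model would answer “ Lyon”.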

Why next-token prediction can do so much

Reading vast amounts of text, a model implicitly learns:

Spelling and grammar (which tokens follow which).
Facts (“the capital of France is” → “Paris”).
Reasoning chains (Q&A pairs, worked math, code).
Style (academic vs casual vs code).
Instruction-following (after RLHF — a second training stage).

What it does not learn:

Truth. It predicts plausible text. Plausible ≠ true.
Real-time information. It only knows what was in its training data.
Its own internals. It can’t introspect “why did I say that?”

The first two are why LLMs need evals (to catch hallucinations) and tool use (to look things up live).

The implications for building

Once you internalize “it’s predicting the next token”, a lot follows:

Prompting matters. The first few tokens of the response steer everything that follows. “Output JSON only” usually produces JSON; “Output JSON” alone often doesn’t.
Few-shot examples work. Showing 3 examples of Q: ... A: ... makes the model pattern-match a 4th. This is the basis of in-context learning.
Chain-of-thought helps. Saying “Let’s think step by step” expands the response and gives the model more tokens to use for intermediate reasoning — often improving accuracy.
Streaming is natural. Since the model emits one token at a time, you can show the response as it’s generated — that’s why ChatGPT feels live.
Cost = tokens. Both input (your prompt) and output (the response) cost money. Big context windows are expensive.
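The cost point is simple arithmetic, sketched below. The prices are hypothetical placeholders, not any provider's real rates; many providers quote separate input and output prices per million tokens, which is the shape assumed here.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Cost of one request, given per-million-token prices in the same currency."""
    total = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    )
    return total / 1_000_000


# Hypothetical prices: 1.00 per million input tokens, 4.00 per million output tokens.
cost = estimate_cost(
    input_tokens=2_000,
    output_tokens=500,
    input_price_per_million=1.00,
    output_price_per_million=4.00,
)
print(f"{cost:.4f}")  # 0.0040
```

Under these assumed prices, the 500-token response costs as much as the 2,000-token prompt, which is why long outputs and large contexts both deserve attention.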

In one breath

An LLM is one function: tokens in, a probability over the next token out.
Generation is the autoregressive loop — sample a token, append it, repeat; the first tokens steer everything after.
It learns spelling, facts, reasoning patterns, and (after RLHF) instruction-following — but not truth, live data, or self-knowledge.
Hallucination is just plausible text winning, not a failed lookup — so pair the model with tools, retrieval, and evals.
Everything downstream — prompting, few-shot, streaming, cost-in-tokens — follows from “it predicts one token at a time.”

Quick check

Q1. An LLM confidently answers 'the capital of Westeros is King's Landing'. What mechanism produced that confident-sounding wrong answer?

Q2. Why does 'Output JSON only.' usually produce JSON, while just 'Output JSON' often doesn't?

Q3. Your LLM keeps giving outdated stock prices. What's the right fix?

The next lessons go deeper: sampling parameters (temperature, top-p), structured outputs with Pydantic, and the prompt patterns that actually make a difference in production.

Questions about this lesson

How does an LLM actually generate text?

It predicts the next token from the preceding ones, appends it, and repeats — producing one token at a time. Sampling settings like temperature decide how it picks among the likely next tokens.

Do LLMs understand text or just predict it?

Mechanically, an LLM predicts statistically likely continuations; it has no beliefs or grounding in the world. It can produce remarkably coherent, useful output, but that's pattern completion at scale — which is also why it can state falsehoods confidently.

What is a token, and why does it matter?

A token is a chunk of text — often a word or word-piece — that the model reads and generates. Tokens matter because context limits and pricing are measured in them, and odd tokenisation can affect how the model handles numbers, spelling, and rare words.
