What an LLM is
A demystified, no-magic explanation of large language models — what they actually do, and the implications for how you build with them.
What you'll learn
- ✓ The single mechanism behind every LLM (it's surprisingly simple)
- ✓ Why "next token prediction" is enough to do everything they do
- ✓ What this implies for prompting and reliability
A large language model is, mechanically, a function that takes a sequence of tokens and predicts the next one. That’s it. Everything else — the reasoning, the conversation, the code — emerges from doing this prediction extremely well, at scale, repeatedly.
Tokens, not words
LLMs don’t see characters or words — they see tokens. A token is typically a few characters: common words become one token, rare words get split into pieces.
For example, the phrase "the quick brown fox" might become:
['the', ' quick', ' brown', ' fox'] # 4 tokens
But "datarekha" might become:
['data', 'rekha'] # 2 tokens
The model has a fixed vocabulary (usually 50k–200k tokens). Pricing, context windows, and speed are typically measured in tokens, not in characters or words.
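To make the splitting concrete, here is a minimal sketch of a tokenizer, assuming a tiny hand-picked vocabulary and a greedy longest-match rule. Both `TOY_VOCAB` and `toy_tokenize` are hypothetical names invented for this illustration. Production tokenizers learn their vocabulary from data (for example with byte-pair encoding) and may split text differently.

```python
# Toy vocabulary: hypothetical, chosen only to reproduce the two examples above.
# Real vocabularies are learned from data and hold tens of thousands of entries.
TOY_VOCAB = {"the", " quick", " brown", " fox", "data", "rekha"}


def toy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match split; an unknown character becomes its own token."""
    max_len = max(len(piece) for piece in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a vocabulary entry
        # matches. A single character is always accepted as a fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab or size == 1:
                tokens.append(piece)
                i += size
                break
    return tokens


print(toy_tokenize("the quick brown fox", TOY_VOCAB))  # ['the', ' quick', ' brown', ' fox']
print(toy_tokenize("datarekha", TOY_VOCAB))            # ['data', 'rekha']
```

Note that the leading space is part of the token: " quick" and "quick" are different vocabulary entries in this sketch, which mirrors how many real tokenizers treat whitespace.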
The autoregressive loop
To generate text, an LLM does this in a loop:
1. Look at the input tokens so far.
2. Produce a probability distribution over the next token.
3. Sample one (or pick the most likely).
4. Append it to the input.
5. Go to step 1.
Stop when the model produces an end-of-text token, hits a length limit, or hits a stop sequence you specified.
Prompt: "The capital of France is"
Step 1: → "Paris" (highest probability)
Step 2: → "." (sentence ends naturally)
Step 3: → <end> (model stops)
That’s the whole thing. During pretraining, the model is trained only to predict the next token. The fact that this looks like reasoning, conversation, and code generation is an emergent property of having seen enormous amounts of text.
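The loop can be sketched in a few lines of Python. This is an illustration under loud assumptions, not how any real model is implemented: `TOY_LOGITS` is a hypothetical lookup table that stands in for a trained network and only looks at the last token, whereas a real LLM conditions on the whole sequence and scores every token in its vocabulary. The stop check is also simplified to single tokens.

```python
import math
import random

# Hypothetical stand-in for a trained model: maps the last token to scores
# (logits) for a handful of candidate next tokens.
TOY_LOGITS = {
    " is": {" Paris": 5.0, " Lyon": 1.0, " big": 0.5},
    " Paris": {".": 4.0, ",": 1.0},
    ".": {"<end>": 6.0, " The": 0.5},
}
END = "<end>"


def next_token_distribution(tokens):
    """Softmax over the toy logits for the last token in the sequence."""
    logits = TOY_LOGITS.get(tokens[-1], {END: 0.0})
    total = sum(math.exp(score) for score in logits.values())
    return {tok: math.exp(score) / total for tok, score in logits.items()}


def generate(prompt, max_new_tokens=10, greedy=True, stop=(), rng=None):
    """Run the autoregressive loop and return only the newly generated tokens."""
    rng = rng or random.Random(0)
    tokens = list(prompt)
    generated = []
    for _ in range(max_new_tokens):              # length limit
        probs = next_token_distribution(tokens)  # look at input, get distribution
        if greedy:
            token = max(probs, key=probs.get)    # pick the most likely token
        else:
            token = rng.choices(list(probs), weights=list(probs.values()))[0]
        if token == END or token in stop:        # end-of-text or stop token
            break
        tokens.append(token)                     # append and go around again
        generated.append(token)
    return generated


prompt = ["The", " capital", " of", " France", " is"]
print(generate(prompt))  # [' Paris', '.']
```

Passing `greedy=False` samples from the distribution instead, so the toy model will occasionally answer " Lyon". That is the same mechanism that makes real model output vary from run to run.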
Why next-token prediction can do so much
Reading vast amounts of text, a model implicitly learns:
- Spelling and grammar (which tokens follow which).
- Facts (“the capital of France is” → “Paris”).
- Reasoning chains (Q&A pairs, worked math, code).
- Style (academic vs casual vs code).
- Instruction-following (after RLHF — a second training stage).
What it does not learn:
- Truth. It predicts plausible text. Plausible ≠ true.
- Real-time information. It only knows what was in its training data.
- Its own internals. It can’t introspect “why did I say that?”
The first two are why LLMs need evals (to catch hallucinations) and tool use (to look things up live).
The implications for building
Once you internalize “it’s predicting the next token”, a lot follows:
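- Prompting matters. Every generated token is conditioned on the prompt and on the tokens already produced, so the first few tokens of the response steer everything that follows. “Output JSON only” usually yields bare JSON; “Output JSON” alone often comes back wrapped in explanatory prose.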
- Few-shot examples work. Showing 3 examples of Q: ... A: ... makes the model pattern-match a 4th. This is the basis of in-context learning.
- Chain-of-thought helps. Saying “Let’s think step by step” expands the response and gives the model more tokens to use for intermediate reasoning, often improving accuracy.
- Streaming is natural. Since the model emits one token at a time, you can show the response as it’s generated — that’s why ChatGPT feels live.
- Cost = tokens. Both input (your prompt) and output (the response) are billed per token, so filling a big context window is expensive.
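Two of the points above, few-shot prompting and token-based cost, can be made concrete with a short sketch. The helper names `build_few_shot_prompt` and `estimate_cost_usd` are hypothetical, and the prices are invented for illustration only; real prices vary by provider and model.

```python
def build_few_shot_prompt(examples, question):
    """Format worked Q/A pairs, then leave the final answer open for the model."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


def estimate_cost_usd(input_tokens, output_tokens,
                      input_price_per_million, output_price_per_million):
    """Cost = tokens x price, with input and output billed separately."""
    return (input_tokens * input_price_per_million
            + output_tokens * output_price_per_million) / 1_000_000


examples = [
    ("Capital of France?", "Paris"),
    ("Capital of Japan?", "Tokyo"),
    ("Capital of Kenya?", "Nairobi"),
]
prompt = build_few_shot_prompt(examples, "Capital of Peru?")
print(prompt)

# Hypothetical prices, for illustration only: $1 per million input tokens
# and $4 per million output tokens.
print(estimate_cost_usd(2_000, 500, 1.0, 4.0))  # 0.004
```

The prompt ends on an open "A:" so that the most plausible continuation is a short answer in the same format as the three examples. Because every example is part of the input, few-shot prompts also raise the input token count, and with it the cost of each call.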
Next
The next lessons go deeper: sampling parameters (temperature, top-p), structured outputs with Pydantic, and the prompt patterns that actually make a difference in production.