Tokenization & BPE
A model never sees letters or words — it sees token IDs. How byte-pair encoding chops text into subwords, and why that explains cost, context limits, and why LLMs can't count the r's in strawberry.
What you'll learn
- Why models work on subword tokens, not characters or words
- How byte-pair encoding (BPE) builds its vocabulary by merging frequent pairs
- Why tokenization drives cost, context limits, and famous LLM failures
Before a transformer can do anything with your text, the text has to become
numbers. A model never sees the word “running” — it sees something like token
12973, looks that ID up in an embedding table, and gets a vector. The rule that
turns text into those IDs is tokenization, and it quietly explains a
surprising amount: why API calls cost what they cost, why your context window
fills faster in some languages, and why a frontier model can write an essay but
can still miscount the letters in “strawberry.”
Why not just words, or just characters?
Two obvious choices both fail:
- One token per word → the vocabulary is unbounded (typos, names, new words, every language) and the model has no idea that “run”, “running”, and “runner” are related.
- One token per character → tiny vocabulary, but sequences become enormous, and attention is quadratic in length — far too expensive.
The winning compromise is subwords: common words stay whole (“the”, “model”), rare words split into reusable pieces (“token” + “ization”). That keeps the vocabulary fixed (~50k–200k tokens) while handling any input. In real tokenizers, spaces, digits, and rare words fragment very differently from common words.
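To put rough numbers on the tradeoff above, here is a small sketch that compares sequence lengths for the two extremes. The sample sentence is made up for illustration, and whitespace splitting is only a stand-in for a real word-level tokenizer.

```python
# Illustrative sentence chosen for this sketch (an assumption, not from a corpus).
text = "Tokenization quietly explains a surprising amount about language models."

char_tokens = list(text)    # one token per character
word_tokens = text.split()  # one token per whitespace-separated word

for name, tokens in [("characters", char_tokens), ("words", word_tokens)]:
    n = len(tokens)
    # Self-attention scores every position against every other: n * n entries.
    print(f"{name:>10}: {n} tokens -> {n * n} attention scores per head")
```

Subword tokenizers aim for the middle: sequence lengths close to the word-level count, with a vocabulary that stays fixed.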
Byte-pair encoding, the algorithm
The dominant method, BPE, learns its vocabulary from data with one simple loop: start from individual characters (or bytes), then repeatedly find the most frequent adjacent pair and merge it into a new token. Repeat that tens of thousands of times (GPT-2 uses 50,000 merges) and frequent sequences naturally become single tokens.
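Starting from bytes rather than characters means the base vocabulary has at most 256 entries and can represent any string. The sketch below shows what that starting point looks like before any merges. The sample strings are illustrative choices, and `to_byte_ids` is a hypothetical helper name.

```python
def to_byte_ids(text: str) -> list[int]:
    """Return the UTF-8 bytes of `text` as integers in the range 0..255."""
    return list(text.encode("utf-8"))

# Illustrative samples chosen for this sketch, not taken from a real corpus.
samples = ["hello", "na\u00efve", "こんにちは"]
for s in samples:
    ids = to_byte_ids(s)
    print(f"{s!r}: {len(s)} characters -> {len(ids)} byte tokens")
```

Non-ASCII characters take several bytes each, so before merging, the same number of characters can cost more tokens in some scripts. Merges learned mostly from English text tend to compress English best, which is one reason context windows fill faster in other languages.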
Build a tiny BPE trainer and watch merges form on a toy corpus:
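A minimal sketch of that loop, assuming a toy corpus given as a list of words and character-level starting symbols. The function names `train_bpe` and `merge_pair` are hypothetical, and real trainers add details this version skips, such as byte-level fallback and pre-tokenization rules.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Fuse every adjacent occurrence of `pair` in a tuple of symbols."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a list of words."""
    # Each distinct word becomes a tuple of characters with its frequency.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # Rewrite the corpus with the winning pair fused into one symbol.
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[merge_pair(symbols, best)] += freq
        words = merged_words
    return merges, words

# Toy corpus: word frequencies are made up for illustration.
corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges, words = train_bpe(corpus, num_merges=4)
print(merges)
print(dict(words))
```

After four merges on this corpus, the frequent sequences "est" and "low" have each become a single symbol, while rarer words like "widest" are still mostly individual characters.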
That’s the core idea behind OpenAI’s tiktoken, SentencePiece’s BPE mode, and
similar libraries: real tokenizers run this merge process over billions of
characters and store the result as a vocabulary plus a merge table.
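To show how a stored merge table is used at inference time, here is a simplified sketch that encodes a new word by replaying merges in the order they were learned. The merge table and ID assignments are hypothetical, and `encode_word` is an illustrative name. Production tokenizers use faster rank-based lookups rather than this loop.

```python
import string

def encode_word(word, merges):
    """Split `word` into characters, then apply each merge in learned order."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Hypothetical merge table, as might be learned from a small toy corpus.
merges = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w")]

# Vocabulary: base characters first, then one new entry per merge.
vocab = {ch: i for i, ch in enumerate(string.ascii_lowercase)}
for a, b in merges:
    vocab[a + b] = len(vocab)

pieces = encode_word("lowest", merges)
ids = [vocab[p] for p in pieces]
print(pieces, ids)
```

The word "lowest" is not in the merge table as a whole, yet it encodes cleanly as two known pieces. That is how subword vocabularies avoid out-of-vocabulary failures.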
Token → embedding
Once text is a list of token IDs, the model’s embedding table (a big lookup
matrix of shape vocab_size × d_model) maps each ID to a vector. That vector,
combined with positional information, is what the transformer layers actually
operate on. The input and output embeddings are often tied (shared weights) to
save parameters.
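import torch
import torch.nn as nn
import tiktoken  # assumption: any GPT-2 BPE tokenizer would work here

tokenizer = tiktoken.get_encoding("gpt2")
embed = nn.Embedding(num_embeddings=50257, embedding_dim=768)  # GPT-2 sizes
ids = tokenizer.encode("Hello world")   # → [15496, 995]
vectors = embed(torch.tensor(ids))      # → shape (2, 768)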
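A quick bit of arithmetic shows why tying is attractive. The sketch below uses the GPT-2 sizes from the snippet above and assumes an untied output projection would have the same vocab_size × d_model shape. The tiny table at the end is made-up data to show that an embedding lookup is plain row indexing.

```python
vocab_size, d_model = 50257, 768  # GPT-2 sizes, as in the snippet above

embedding_params = vocab_size * d_model
print(f"embedding table: {embedding_params:,} parameters")

# Assumption: an untied output projection has the same shape as the table,
# so sharing one matrix for both roles saves a full copy.
untied_total = 2 * embedding_params
tied_total = embedding_params
print(f"saved by tying:  {untied_total - tied_total:,} parameters")

# Lookup is row indexing: token ID i selects row i of the table.
tiny_table = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # vocab_size=3, d_model=2
ids = [2, 0]
vectors = [tiny_table[i] for i in ids]
print(vectors)
```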
Next
Now that text is a sequence of embedded tokens, we can do the thing transformers are famous for: let every token look at every other token, with self-attention.
Practice this in an interview
Tokenization splits text into integer IDs the model can process; subword tokenizers like Byte-Pair Encoding start from characters or bytes and iteratively merge the most frequent adjacent pairs into a vocabulary. Subwords keep common words intact while decomposing rare or unseen words into known pieces, avoiding out-of-vocabulary problems and balancing vocabulary size against sequence length.
BPE starts with a character-level (or byte-level) vocabulary and iteratively merges the most frequent adjacent pair of symbols until a target vocabulary size is reached. The resulting subword units represent rare and unseen words as sequences of known pieces; with a byte-level base vocabulary, no input is ever out of vocabulary.
A token is the smallest unit a language model processes, typically a word, subword fragment, or punctuation mark produced by byte-pair encoding (BPE) or a similar algorithm. Pricing is per token because each token occupies one position in the sequence the model processes, so compute and memory scale with token count regardless of whether a token maps to a full word or a single letter.
Tokenization splits raw text into discrete units — words, subwords, or characters — that a model can process numerically. The strategy chosen controls vocabulary size, out-of-vocabulary rate, and how well the model handles rare or morphologically complex words.