datarekha

Tokenization & BPE

A model never sees letters or words — it sees token IDs. How byte-pair encoding chops text into subwords, and why that explains cost, context limits, and why LLMs can't count the r's in strawberry.

7 min read Beginner Deep Learning Lesson 17 of 27

What you'll learn

  • Why models work on subword tokens, not characters or words
  • How byte-pair encoding (BPE) builds its vocabulary by merging frequent pairs
  • Why tokenization drives cost, context limits, and famous LLM failures

Before a transformer can do anything with your text, the text has to become numbers. A model never sees the word “running” — it sees something like token 12973, looks that ID up in an embedding table, and gets a vector. The rule that turns text into those IDs is tokenization, and it quietly explains a surprising amount: why API calls cost what they cost, why your context window fills faster in some languages, and why a frontier model can write an essay but can’t count the letters in “strawberry.”

Why not just words, or just characters?

Two obvious choices both fail:

  • One token per word → the vocabulary is unbounded (typos, names, new words, every language) and the model has no idea that “run”, “running”, and “runner” are related.
  • One token per character → tiny vocabulary, but sequences become enormous, and attention is quadratic in length — far too expensive.
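To put rough numbers on that trade-off, here is a minimal sketch that tokenizes one short sentence both ways and counts the query-key pairs full self-attention would score. The sentence and the `attention_pairs` helper are illustrative assumptions, not part of any real tokenizer.

```python
# Toy comparison of the two "obvious" tokenization schemes.
text = "the runner was running"

word_tokens = text.split()   # one token per word
char_tokens = list(text)     # one token per character (spaces included)

def attention_pairs(n):
    """Query-key pairs scored by full self-attention over n tokens (n squared)."""
    return n * n

print(len(word_tokens), attention_pairs(len(word_tokens)))  # 4 16
print(len(char_tokens), attention_pairs(len(char_tokens)))  # 22 484

# Word-level tokens carry no shared pieces: "runner" and "running"
# are simply two unrelated vocabulary entries.
print(set(word_tokens))
```

Going from 4 word tokens to 22 character tokens makes the sequence 5.5 times longer but the attention work about 30 times larger, which is the quadratic cost the second bullet refers to.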

The winning compromise is subwords: common words stay whole (“the”, “model”), rare words split into reusable pieces (“token” + “ization”). That keeps the vocabulary fixed (~50k–200k tokens) while handling any input. In real text, spaces, digits, and rare words fragment very differently from common words.

Byte-pair encoding, the algorithm

The dominant method, BPE, learns its vocabulary from data with one beautifully simple loop: start from individual characters (or bytes), then repeatedly find the most frequent adjacent pair and merge it into a new token. Each merge adds one entry to the vocabulary, so real tokenizers repeat this tens of thousands of times (GPT-2 learned 50,000 merges), and frequent sequences naturally become single tokens.

start: characters → l o w e s t
merge most frequent pair (e+s → es) → l o w es t
merge again (es+t → est) → l o w est

“est” is now ONE token, reused across “lowest”, “fastest”, “newest”… Frequent sequences become single tokens; rare ones stay split into pieces.
BPE repeatedly merges the most frequent adjacent pair, growing a subword vocabulary from characters.

Build a tiny BPE trainer and watch merges form on a toy corpus:
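A minimal sketch of such a trainer, assuming a toy corpus given as a word-to-frequency dictionary. The function names (`get_pair_counts`, `merge_pair`, `train_bpe`) and the corpus are made up for illustration, and ties are broken by first occurrence, which real tokenizers may handle differently.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(words, pair):
    """Rewrite every word, replacing each occurrence of `pair` with one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Return the learned merge list and the corpus segmented with it."""
    words = {tuple(word): freq for word, freq in corpus.items()}
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        best = max(counts, key=counts.get)  # ties go to the pair seen first
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

# Toy corpus: word -> frequency (illustrative numbers only).
corpus = {"low": 7, "lowest": 2, "newest": 6, "widest": 3}
merges, words = train_bpe(corpus, num_merges=4)

print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
print(words)   # "lowest" is now ('low', 'est'); "newest" is ('n', 'e', 'w', 'est')
```

The first two merges reproduce the figure: e+s becomes “es”, then es+t becomes “est”. After four merges the frequent word “low” is a single token, while the rarer stems of “newest” and “widest” stay split into characters.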

That’s the whole idea behind OpenAI’s tiktoken, SentencePiece (in its BPE mode), and friends: real tokenizers run this merge process over billions of characters and store the result as a vocabulary plus a merge table.
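To make “a vocabulary plus a merge table” concrete, the sketch below replays a stored merge table on new words and maps the resulting pieces to IDs. The merge table, the a-to-z base alphabet, and the helper names `encode` and `to_ids` are all hypothetical; production tokenizers add byte-level fallback, pre-tokenization on spaces, and much faster data structures.

```python
# Hypothetical merge table, listed in the order the merges were learned.
merges = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w")]
ranks = {pair: rank for rank, pair in enumerate(merges)}

# Hypothetical vocabulary: base characters first, then one ID per merge.
vocab = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}
for left, right in merges:
    vocab[left + right] = len(vocab)

def encode(word):
    """Split a word into characters, then apply merges in learned order."""
    symbols = list(word)
    while len(symbols) > 1:
        pairs = list(zip(symbols, symbols[1:]))
        # Pick the adjacent pair that was learned earliest.
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no known merge applies any more
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

def to_ids(word):
    """Look each piece up in the vocabulary."""
    return [vocab[s] for s in encode(word)]

print(encode("lowest"), to_ids("lowest"))    # ['low', 'est'] [29, 27]
print(encode("fastest"), to_ids("fastest"))  # ['f', 'a', 's', 't', 'est'] [5, 0, 18, 19, 27]
```

Note what the model would actually receive for “lowest” here: the two integers 29 and 27. Nothing in those IDs says how many letters each piece contains, which is the root of the letter-counting failures mentioned earlier.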

Token → embedding

Once text is a list of token IDs, the model’s embedding table (a big lookup matrix of shape vocab_size × d_model) maps each ID to a vector. That vector is the actual input to self-attention. The input and output embeddings are often tied (shared weights) to save parameters.

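import tiktoken  # assumption: OpenAI's tiktoken supplies the GPT-2 tokenizer
import torch
import torch.nn as nn
tokenizer = tiktoken.get_encoding("gpt2")
embed = nn.Embedding(num_embeddings=50257, embedding_dim=768)  # GPT-2 sizes
ids = tokenizer.encode("Hello world")    # → [15496, 995]
vectors = embed(torch.tensor(ids))       # → shape (2, 768)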
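For intuition about what the embedding layer does, here is a dependency-free sketch that treats the table as a plain list of rows. The toy sizes, the random initialization, and the helper names `embed_ids` and `tied_logits` are assumptions for illustration; this is not how PyTorch implements `nn.Embedding` internally.

```python
import random

random.seed(0)
vocab_size, d_model = 6, 4  # toy sizes; the GPT-2 sizes above are 50257 and 768

# The embedding table: one row of d_model numbers per token ID.
E = [[random.gauss(0, 0.02) for _ in range(d_model)] for _ in range(vocab_size)]

def embed_ids(ids):
    """Embedding lookup is just row selection: ID i picks row i of the table."""
    return [E[i] for i in ids]

def tied_logits(hidden):
    """With tied weights, output scores reuse E: one dot product per vocab row."""
    return [sum(h * e for h, e in zip(hidden, row)) for row in E]

vectors = embed_ids([3, 1])        # two token IDs in, two vectors out
logits = tied_logits(vectors[-1])  # one score per vocabulary entry

print(len(vectors), len(vectors[0]))  # 2 4
print(len(logits))                    # 6

# Size of one table at the GPT-2 sizes, i.e. what tying avoids duplicating.
gpt2_table_params = 50257 * 768
print(gpt2_table_params)              # 38597376
```

At the GPT-2 sizes a single table holds about 38.6 million parameters, so sharing it between input and output avoids storing a second matrix of the same size.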

Quick check

Q1. Why do modern LLMs use subword tokens instead of whole words or single characters?
Q2. What single operation does byte-pair encoding (BPE) repeat to build its vocabulary?
Q3. Why can't an LLM reliably count the letters in 'strawberry'?

Next

Now that text is a sequence of embedded tokens, we can do the thing transformers are famous for: let every token look at every other token, with self-attention.

Practice this in an interview

How does tokenization work, and why do LLMs rely on subword tokenizers like BPE?

Tokenization splits text into integer IDs the model can process; subword tokenizers like Byte-Pair Encoding start from characters or bytes and iteratively merge the most frequent adjacent pairs into a vocabulary. Subwords keep common words intact while decomposing rare or unseen words into known pieces, avoiding out-of-vocabulary problems and balancing vocabulary size against sequence length.

How does Byte-Pair Encoding (BPE) tokenization work?

BPE starts with a character-level vocabulary and iteratively merges the most frequent adjacent pair of symbols until a target vocabulary size is reached. The resulting subword units handle rare and unseen words gracefully without any out-of-vocabulary tokens.

What are tokens in an LLM and why is API pricing per token rather than per word or character?

A token is the smallest unit a language model processes, typically a word, subword fragment, or punctuation mark produced by byte-pair encoding (BPE) or a similar algorithm. Pricing is per token because every token occupies one position in the sequence the model processes, so compute and memory cost scale with the token count whether a token maps to a full word or a single letter.

What is tokenization in NLP and why does it matter?

Tokenization splits raw text into discrete units — words, subwords, or characters — that a model can process numerically. The strategy chosen controls vocabulary size, out-of-vocabulary rate, and how well the model handles rare or morphologically complex words.
