How does tokenization work, and why do LLMs rely on subword tokenizers like BPE?

Tokenization splits text into integer IDs the model can process; subword tokenizers like Byte-Pair Encoding start from characters or bytes and iteratively merge the most frequent adjacent pairs into a vocabulary. Subwords keep common words intact while decomposing rare or unseen words into known pieces, avoiding out-of-vocabulary problems and balancing vocabulary size against sequence length.

How does Byte-Pair Encoding (BPE) tokenization work?

BPE starts with a character-level (or byte-level) vocabulary and iteratively merges the most frequent adjacent pair of symbols until a target vocabulary size is reached. Because any word can fall back to smaller known pieces, rare and unseen words remain representable; with a byte-level base vocabulary, no input is ever out of vocabulary.

What are tokens in an LLM and why is API pricing per token rather than per word or character?

A token is the smallest unit a language model processes: typically a word, subword fragment, or punctuation mark produced by byte-pair encoding (BPE) or a similar algorithm. Pricing is per token because each token occupies one position in the model's sequence. Every input and output token is pushed through the network and takes part in attention, so the token count, not the word or character count, drives compute and memory cost, whether a token maps to a full word or a single letter.
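As a rough illustration of per-token billing, the arithmetic can be sketched as below. The two prices are hypothetical placeholders, not any provider's actual rates, and `estimate_cost` is a name made up for this example.

```python
# Hypothetical prices in dollars per million tokens (not real provider rates).
PRICE_PER_M_INPUT = 2.00
PRICE_PER_M_OUTPUT = 8.00


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost scales with token counts, not with words or characters."""
    total = input_tokens * PRICE_PER_M_INPUT + output_tokens * PRICE_PER_M_OUTPUT
    return total / 1_000_000


cost = estimate_cost(input_tokens=1_500, output_tokens=500)
print(f"${cost:.4f}")
```

The same text can cost more or less depending on how many tokens the tokenizer splits it into, which is why the token count is the quantity to measure.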

What is tokenization in NLP and why does it matter?

Tokenization splits raw text into discrete units — words, subwords, or characters — that a model can process numerically. The strategy chosen controls vocabulary size, out-of-vocabulary rate, and how well the model handles rare or morphologically complex words.

Tokenization & BPE — Deep Learning

Before a transformer can do anything with your text, the text has to become numbers. A model never sees the word “running” — it sees something like token 12973, looks that ID up in an embedding table, and gets a vector. The rule that turns text into those IDs is tokenization, and it quietly explains a surprising amount: why API calls cost what they cost, why your context window fills faster in some languages, and why a frontier model can write an essay but can’t count the letters in “strawberry.”

Why not just words, or just characters?

Two obvious choices both fail:

One token per word → the vocabulary is unbounded (typos, names, new words, every language) and the model has no idea that “run”, “running”, and “runner” are related.
One token per character → tiny vocabulary, but sequences become enormous, and attention is quadratic in length — far too expensive.
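To make the trade-off concrete, here is a small sketch that tokenizes one sentence both ways. Whitespace splitting stands in for a word-level tokenizer, which is a simplifying assumption, and `attention_pairs` is a helper invented for this example.

```python
text = "the runner was running and the model was tokenizing"

word_tokens = text.split()   # one token per whitespace-separated word
char_tokens = list(text)     # one token per character, spaces included


def attention_pairs(n: int) -> int:
    """Self-attention compares every position with every position: n * n."""
    return n * n


print("words:", len(word_tokens), "pairs:", attention_pairs(len(word_tokens)))
print("chars:", len(char_tokens), "pairs:", attention_pairs(len(char_tokens)))

# A word-level vocabulary gives related words unrelated IDs,
# and has no entry at all for a word it never saw.
word_vocab = {w: i for i, w in enumerate(sorted(set(word_tokens)))}
print(word_vocab["runner"], word_vocab["running"])
print("jogging" in word_vocab)
```

Character tokens make the sequence several times longer, and the number of attention comparisons grows with the square of that length. Word tokens keep the sequence short but leave "jogging" with no ID at all.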

The winning compromise is subwords: common words stay whole (“the”, “model”), rare words split into reusable pieces (“token” + “ization”). That keeps the vocabulary fixed (~50k–200k tokens) while handling any input.

Byte-pair encoding, the algorithm

The dominant method, BPE, learns its vocabulary from data with one beautifully simple loop: start from individual characters (or bytes), then repeatedly find the most frequent adjacent pair and merge it into a new token. Each merge adds one entry to the vocabulary, so the loop runs until the target vocabulary size is reached, typically tens of thousands of merges, and frequent sequences naturally become single tokens.

Figure: BPE repeatedly merges the most frequent adjacent pair, growing a subword vocabulary from characters.

Build a tiny BPE trainer and watch merges form on a toy corpus:
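A minimal sketch of such a trainer follows. It makes several simplifying assumptions: it is character-level rather than byte-level, the corpus is already split into words, there are no end-of-word markers, and ties are broken by first occurrence. The function name `train_bpe` and the toy corpus are made up for illustration.

```python
from collections import Counter


def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` merges from a list of words (toy, character-level)."""
    # Each distinct word becomes a tuple of symbols, weighted by its frequency.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # on ties, the first pair seen wins
        merges.append(best)
        # Rewrite each word with the chosen pair fused into one symbol.
        new_words = Counter()
        for symbols, freq in words.items():
            merged, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges


corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = train_bpe(corpus, num_merges=4)
for a, b in merges:
    print(f"{a} + {b} -> {a + b}")
```

On this toy corpus the shared suffix of "newest" and "widest" is fused first ("es", then "est"), followed by "lo" and "low". Frequent character runs become single symbols without anyone telling the algorithm what a suffix or a word is.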

That’s the whole idea behind OpenAI’s tiktoken (used for the GPT models), SentencePiece’s BPE mode, and friends: real tokenizers run this merge process over billions of characters and store the result as a vocabulary plus an ordered merge table.
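To show how a stored vocabulary and merge table are used at encoding time, here is a sketch that replays merges in the order they were learned. The merge table, the ID assignment, and the name `bpe_encode` are hypothetical; production tokenizers use faster algorithms and byte-level base vocabularies.

```python
def bpe_encode(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Split `word` into characters, then replay the learned merges in order."""
    symbols = list(word)
    for a, b in merges:  # earlier merges take priority
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols


# Hypothetical merge table, in the order a toy trainer might have learned it.
merges = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w")]

# Vocabulary = base characters plus one new token per merge; IDs are arbitrary.
base = sorted(set("".join(["low", "lower", "newest", "widest"])))
vocab = {tok: i for i, tok in enumerate(base + [a + b for a, b in merges])}

pieces = bpe_encode("lowest", merges)
ids = [vocab[p] for p in pieces]
print(pieces, ids)
print(bpe_encode("slow", merges))  # a word that never appeared in training
```

Neither "lowest" nor "slow" was in the training words, yet both encode into known pieces. This is the property that removes out-of-vocabulary failures, as long as every base character is in the vocabulary.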

Token → embedding

Once text is a list of token IDs, the model’s embedding table (a big lookup matrix of shape vocab_size × d_model) maps each ID to a vector. That vector, combined with positional information, is what the self-attention layers actually operate on. The input and output embeddings are often tied (shared weights) to save parameters.

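import torch
import torch.nn as nn
import tiktoken  # assumption: OpenAI's tiktoken package supplies the GPT-2 tokenizer

tokenizer = tiktoken.get_encoding("gpt2")
embed = nn.Embedding(num_embeddings=50257, embedding_dim=768)  # GPT-2 sizes
ids = tokenizer.encode("Hello world")    # → [15496, 995]
vectors = embed(torch.tensor(ids))       # → shape (2, 768)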
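The lookup and the weight tying can also be shown without a framework. The sketch below uses toy sizes and random numbers. Using the raw embedding as the "hidden" vector is a stand-in assumption, since in a real model that vector would come out of the final transformer layer.

```python
import random

random.seed(0)
vocab_size, d_model = 6, 4  # toy sizes; GPT-2 uses 50257 x 768

# Embedding table: one row of d_model numbers per token ID.
table = [[random.gauss(0.0, 1.0) for _ in range(d_model)]
         for _ in range(vocab_size)]


def embed(ids):
    """Input side: an embedding lookup is plain row indexing."""
    return [table[i] for i in ids]


def logits(hidden):
    """Output side with tied weights: score every token by the dot product
    of the hidden vector with that token's row of the same table."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in table]


vectors = embed([3, 1])
print(len(vectors), len(vectors[0]))  # number of tokens, d_model

scores = logits(vectors[0])
print(len(scores))                    # one score per vocabulary entry
```

Because both directions read the same `table`, tying removes a second vocab_size × d_model matrix from the parameter count.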

Quick check


Q1. Why do modern LLMs use subword tokens instead of whole words or single characters?

Q2. What single operation does byte-pair encoding (BPE) repeat to build its vocabulary?

Q3. Why can't an LLM reliably count the letters in 'strawberry'?

Now that text is a sequence of embedded tokens, we can do the thing transformers are famous for: let every token look at every other token, with self-attention.
