datarekha
NLP & LLMs | Medium | Asked at OpenAI, Google, Meta

How does Byte-Pair Encoding (BPE) tokenization work?

The short answer

BPE starts with a character-level vocabulary and iteratively merges the most frequent adjacent pair of symbols until a target vocabulary size is reached. The resulting subword units let rare and unseen words be split into known pieces, so no out-of-vocabulary token is needed as long as every character of the input is in the base vocabulary.

How to think about it

BPE began as a data-compression algorithm (Gage, 1994). Sennrich et al. (2016) adapted it for neural machine translation to handle open vocabularies, and a byte-level variant has since become the tokenizer of choice for GPT-family models.

Training algorithm

  1. Initialize the vocabulary with every character in the corpus plus a special end-of-word symbol.
  2. Count all adjacent symbol pairs across the corpus, weighting each word by its frequency.
  3. Merge the most frequent pair into a new symbol, add it to the vocabulary, and record the merge rule.
  4. Repeat steps 2-3 until the vocabulary reaches the target size (e.g. 50,000) or no pairs remain.
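The four steps above can be sketched in plain Python. This is a minimal illustration, not a production trainer: the function name train_bpe, the "</w>" end-of-word marker, and the tie-breaking rule (first pair counted wins) are assumptions made for the example, and it stops after a fixed number of merges rather than at a target vocabulary size.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` BPE merge rules from a whitespace-split corpus."""
    # Step 1: each word becomes a tuple of characters plus an end-of-word marker.
    word_freqs = Counter(tuple(word) + ("</w>",) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in word_freqs.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Step 3: pick the most frequent pair (on ties, the first pair counted).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, replacing the chosen pair with the merged symbol.
        new_freqs = Counter()
        for symbols, freq in word_freqs.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_freqs[tuple(out)] += freq
        word_freqs = new_freqs  # Step 4: loop back and recount on the new symbols
    return merges, word_freqs

merges, segmented = train_bpe("low low low lower lowest", num_merges=4)
print(merges)
# [('l', 'o'), ('lo', 'w'), ('low', '</w>'), ('low', 'e')]
print(dict(segmented))
```

Recounting all pairs on every iteration keeps the sketch short. Real trainers update the pair counts incrementally, since a full recount per merge is too slow on large corpora.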

Example

Corpus: low low low lower lowest

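With the end-of-word symbol written as </w>, the first merges are l+o → lo, lo+w → low, low+</w> → low</w>, then low+e → lowe. The frequent word low becomes a single token, while the rarer lower is still split into lowe + r + </w> at this point.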

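# Using the Hugging Face tokenizers library (pip install tokenizers)
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# Split on whitespace first so merges never cross word boundaries.
tokenizer.pre_tokenizer = Whitespace()
# vocab_size is an upper bound; this tiny corpus runs out of pairs long before 500.
trainer = BpeTrainer(vocab_size=500, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(
    ["low low lower lowest newer new"],
    trainer=trainer,
)
# Characters never seen in training would map to [UNK].
print(tokenizer.encode("lowest").tokens)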

WordPiece vs BPE

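WordPiece (used in BERT) also builds its vocabulary by iterative merging, but it selects the merge that most increases the likelihood of the training data under a language model rather than the pair with the highest raw frequency. In the commonly cited formulation, a pair is scored as count(ab) / (count(a) × count(b)), which favors pairs whose parts rarely occur apart. In practice, both produce similar results.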
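To make the difference in selection criteria concrete, the sketch below picks the next merge under each rule for the same toy counts. The word frequencies are hypothetical, the score is the commonly cited count(ab) / (count(a) × count(b)) formulation rather than a confirmed reference implementation, and WordPiece's "##" continuation prefix is left out for brevity.

```python
from collections import Counter

# Hypothetical toy state: words already split into symbols, with frequencies.
word_freqs = {
    ("h", "u", "g"): 10,
    ("p", "u", "g"): 5,
    ("p", "u", "n"): 12,
    ("b", "u", "n"): 4,
    ("h", "u", "g", "s"): 5,
}

# Count individual symbols and adjacent pairs, weighted by word frequency.
symbol_counts = Counter()
pair_counts = Counter()
for symbols, freq in word_freqs.items():
    for s in symbols:
        symbol_counts[s] += freq
    for pair in zip(symbols, symbols[1:]):
        pair_counts[pair] += freq

def wordpiece_score(pair):
    """Pair frequency normalized by the frequencies of its two parts."""
    a, b = pair
    return pair_counts[pair] / (symbol_counts[a] * symbol_counts[b])

bpe_choice = max(pair_counts, key=pair_counts.get)        # highest raw count
wordpiece_choice = max(pair_counts, key=wordpiece_score)  # highest normalized score

print(bpe_choice)        # ('u', 'g')
print(wordpiece_choice)  # ('g', 's')
```

BPE picks u+g because it occurs 20 times. The normalized score prefers g+s: s appears only after g, whereas u appears in every word, so pairs containing u are penalized.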

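The key benefit over word-level tokenization: a vocabulary of roughly 30k-50k subword tokens can represent any word built from characters in the base alphabet, while keeping sequences much shorter than character-level tokenization. Byte-level variants, as used in GPT-2, start from all 256 byte values and so eliminate the OOV problem entirely.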
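A short sketch of how a fixed merge list segments a word that never appeared in training. The helper apply_merges and the merge list are hypothetical; the list is what a toy corpus such as "low low low lower lowest" would plausibly yield, with "</w>" as the end-of-word marker.

```python
def apply_merges(word, merges):
    """Segment `word` by replaying merge rules in the order they were learned."""
    symbols = list(word) + ["</w>"]
    for left, right in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(left + right)  # fuse the pair into one symbol
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Hypothetical merge list, in learned order.
merges = [("l", "o"), ("lo", "w"), ("low", "</w>"), ("low", "e")]

print(apply_merges("low", merges))     # ['low</w>']
print(apply_merges("slower", merges))  # ['s', 'lowe', 'r', '</w>']
```

The unseen word slower still gets a valid segmentation because every piece falls back to known symbols. A character missing from the base alphabet would not be covered by this sketch, which is the gap byte-level BPE closes.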
