How does tokenization work, and why do LLMs rely on subword tokenizers like BPE?
Tokenization splits text into integer IDs the model can process; subword tokenizers like Byte-Pair Encoding start from characters or bytes and iteratively merge the most frequent adjacent pairs into a vocabulary. Subwords keep common words intact while decomposing rare or unseen words into known pieces, avoiding out-of-vocabulary problems and balancing vocabulary size against sequence length.
How to think about it
Tokenization splits text into integer IDs the model can process; subword tokenizers like Byte-Pair Encoding start from characters or bytes and iteratively merge the most frequent adjacent pairs into a vocabulary. Subwords keep common words intact while decomposing rare or unseen words into known pieces, avoiding out-of-vocabulary problems and balancing vocabulary size against sequence length.