What is tokenization in NLP and why does it matter?
Tokenization splits raw text into discrete units — words, subwords, or characters — that a model can process numerically. The strategy chosen controls vocabulary size, out-of-vocabulary rate, and how well the model handles rare or morphologically complex words.
How to think about it
Tokenization is the first step in almost every NLP pipeline: it converts a string into a sequence of tokens that can be mapped to integer IDs and fed to a model.
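The string-to-IDs step can be illustrated with a minimal sketch. The vocabulary below and the `encode` helper are hypothetical, made up for illustration; real vocabularies are learned from a corpus and are far larger.

```python
from typing import List

# Hypothetical toy vocabulary; "[UNK]" stands in for any unseen token.
vocab = {"[UNK]": 0, "tokenization": 1, "matters": 2, "a": 3, "lot": 4}

def encode(text: str) -> List[int]:
    """Lowercase, split on whitespace, and look up each token's integer ID."""
    tokens = text.lower().split()
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

ids = encode("Tokenization matters a lot")
print(ids)          # [1, 2, 3, 4]

unknown_ids = encode("Tokenization rocks")
print(unknown_ids)  # [1, 0]  ("rocks" is not in the toy vocabulary)
```

The second call shows the out-of-vocabulary case: every unseen word maps to the same unknown ID, so the model cannot tell such words apart.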
Three common strategies
| Strategy | Example | Trade-off |
|---|---|---|
| Word-level | ["playing"] | Large vocab; OOV problem |
| Character-level | ["p","l","a","y",...] | No OOV; very long sequences |
| Subword (BPE/WordPiece) | ["play","##ing"] | Balances both |
Word tokenization splits on whitespace and punctuation. It is simple but brittle: “playing”, “plays”, and “played” become three unrelated IDs, and any word missing from the vocabulary collapses to a single unknown token.
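A rough sketch of the word-level and character-level strategies, using only the standard library. The function names and the regular expression are illustrative assumptions, not a standard tokenizer.

```python
import re

def word_tokenize(text):
    # Runs of word characters, or any single punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def char_tokenize(text):
    # Every character, including spaces, becomes its own token.
    return list(text.lower())

sentence = "She plays, he played."
word_tokens = word_tokenize(sentence)
char_tokens = char_tokenize(sentence)

print(word_tokens)                         # ['she', 'plays', ',', 'he', 'played', '.']
print(len(word_tokens), len(char_tokens))  # 6 21
```

The word-level output keeps "plays" and "played" as separate, unrelated entries, while the character-level output never hits an unknown symbol but is several times longer.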
Subword tokenization learns a vocabulary of frequent character sequences from a corpus, for instance by repeatedly merging the most frequent adjacent symbol pair (BPE). Common words stay intact while rare words decompose into known pieces, for example "unhappiness" → ["un","happiness"]. BERT uses WordPiece and the GPT models use BPE.
```python
from transformers import AutoTokenizer

# Load the WordPiece tokenizer that ships with bert-base-uncased
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("Tokenization matters a lot")
print(tokens)
# ['token', '##ization', 'matters', 'a', 'lot']
```
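The merge-learning idea behind BPE can be sketched in a few lines. This is a toy version under stated assumptions: the corpus `word_freqs` is invented, ties are broken by first-seen order, and there are no end-of-word markers or byte-level handling as production tokenizers typically have. WordPiece follows a similar loop but scores candidate merges differently, so this is not how BERT's vocabulary was built.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn up to num_merges merge rules from a {word: frequency} dict."""
    # Start with every word split into single characters.
    splits = {word: list(word) for word in word_freqs}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for word, symbols in splits.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += word_freqs[word]
        if not pair_counts:
            break
        # Most frequent pair; ties go to the pair seen first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Fuse every occurrence of the chosen pair into one symbol.
        for word, symbols in splits.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            splits[word] = merged
    return merges, splits

# Hypothetical toy corpus: word -> count
word_freqs = {"play": 5, "playing": 3, "played": 2, "sing": 4}
merges, splits = learn_bpe(word_freqs, num_merges=5)
print(splits["playing"])  # ['play', 'ing']
print(splits["played"])   # ['play', 'e', 'd']
```

After five merges the frequent stem "play" and the suffix "ing" have each become single symbols, while the rarer "ed" ending is still spelled out character by character.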
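Why it matters for models: the tokenization choice determines both sequence length and vocabulary size. Vocabulary size sets the size of the embedding table, and sequence length drives compute and memory, so the choice directly influences memory, speed, and the model’s ability to generalize across inflections and compound words.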
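A back-of-the-envelope sketch of those costs. The helper names are illustrative; the figures 30522 and 768 are the vocabulary size and hidden size of bert-base-uncased, and the quadratic term assumes standard self-attention, where each token attends to every other token.

```python
def embedding_params(vocab_size, hidden_dim):
    # One learned vector of length hidden_dim per vocabulary entry.
    return vocab_size * hidden_dim

def attention_score_entries(seq_len):
    # Self-attention compares every token with every token.
    return seq_len * seq_len

text = "Tokenization matters a lot"
n_word = len(text.split())  # word-level sequence length
n_char = len(text)          # character-level sequence length

print(n_word, attention_score_entries(n_word))  # 4 16
print(n_char, attention_score_entries(n_char))  # 26 676
print(embedding_params(30522, 768))             # 23440896
```

On this one sentence, character-level input is 6.5 times longer than word-level input and needs about 42 times as many attention scores, while a larger vocabulary grows the embedding table linearly.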