datarekha
NLP & LLMs · Easy · Asked at Google, Amazon, Meta

What is tokenization in NLP and why does it matter?

The short answer

Tokenization splits raw text into discrete units — words, subwords, or characters — that a model can process numerically. The strategy chosen controls vocabulary size, out-of-vocabulary rate, and how well the model handles rare or morphologically complex words.

How to think about it

Tokenization is the first step in almost every NLP pipeline: it converts a string into a sequence of tokens that can be mapped to integer IDs and fed to a model.
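As a minimal sketch of the token-to-ID step, assuming a toy whitespace tokenizer and a hypothetical `[UNK]` entry for unseen tokens (the helper names `build_vocab` and `encode` are illustrative, not from any library):

```python
def build_vocab(tokens, specials=("[UNK]",)):
    """Assign each distinct token an integer ID, reserving the first IDs for special tokens."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for tok in tokens:
        # setdefault only inserts when the token is not already present
        vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab):
    """Map tokens to IDs, falling back to the [UNK] ID for tokens outside the vocabulary."""
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]


train_tokens = "the cat sat on the mat".split()
vocab = build_vocab(train_tokens)
ids = encode("the dog sat".split(), vocab)

print(vocab)  # {'[UNK]': 0, 'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5}
print(ids)    # [1, 0, 3]  ("dog" was never seen, so it maps to [UNK])
```

Production tokenizers ship a fixed, pre-built vocabulary, but the lookup at inference time follows the same pattern.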

Three common strategies

Strategy | Example | Trade-off
Word-level | ["playing"] | Large vocabulary; OOV problem
Character-level | ["p","l","a","y",...] | No OOV; very long sequences
Subword (BPE/WordPiece) | ["play","##ing"] | Balances both

Word tokenization splits on whitespace and punctuation. Simple but brittle — “playing”, “plays”, and “played” become three unrelated IDs.
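The brittleness is easy to see in a short sketch. The regex below is one simple way to split on whitespace and punctuation, an assumption for illustration rather than what any particular library does:

```python
import re


def word_tokenize(text):
    # \w+ matches a run of letters/digits/underscores; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text.lower())


tokens = word_tokenize("She plays, he played; they are playing.")
# Assign IDs in order of first appearance
vocab = {tok: i for i, tok in enumerate(dict.fromkeys(tokens))}

print(tokens)
# ['she', 'plays', ',', 'he', 'played', ';', 'they', 'are', 'playing', '.']
print([vocab[w] for w in ("plays", "played", "playing")])  # [1, 4, 8]
print("play" in vocab)  # False: the base form itself is out of vocabulary
```

The three inflections get unrelated IDs, and the stem "play" is unknown unless it also appears in the training text.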

Subword tokenization learns merges of frequent byte pairs or character sequences from a corpus, so common words stay intact while rare words decompose into known pieces, e.g. (schematically) "unhappiness" → ["un","happiness"]. This is the approach behind BERT (WordPiece) and the GPT models (BPE).

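from transformers import AutoTokenizer

# Load the WordPiece tokenizer that ships with BERT (downloads the vocabulary on first use)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# tokenize() returns subword strings; "##" marks a piece that continues the previous one
tokens = tokenizer.tokenize("Tokenization matters a lot")
print(tokens)
# ['token', '##ization', 'matters', 'a', 'lot']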
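To make the merge-learning idea behind BPE concrete, here is a simplified sketch in plain Python. It assumes a toy word-frequency corpus and omits details that real implementations add (end-of-word markers, byte-level handling, explicit tie-breaking rules), so treat it as an illustration rather than the algorithm any specific model uses. The names `get_pair_counts`, `merge_pair`, and `learn_bpe` are hypothetical helpers.

```python
from collections import Counter


def get_pair_counts(word_freqs):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, word_freqs):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus_freqs, num_merges):
    """Start from characters and greedily merge the most frequent adjacent pair."""
    word_freqs = {tuple(word): freq for word, freq in corpus_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(word_freqs)
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        word_freqs = merge_pair(best, word_freqs)
    return merges, word_freqs


# Toy corpus: word -> frequency (made-up numbers for illustration)
corpus = {"play": 5, "playing": 3, "played": 2, "sing": 4, "singing": 2}
merges, segmented = learn_bpe(corpus, num_merges=6)

print(merges)           # learned merge rules, most frequent first
print(list(segmented))  # each word as a tuple of subword symbols
```

Each merge adds one symbol to the vocabulary, so `num_merges` is the knob that trades vocabulary size against sequence length.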

Why it matters for models: the tokenization choice determines the vocabulary size and the sequence length for a given text, which directly affects memory use, speed, and the model’s ability to generalize across inflections and compound words.
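A rough way to see the trade-off is to tokenize the same text at two granularities and compare sequence length with the number of distinct tokens. The sample sentence and the `stats` helper are assumptions for illustration only:

```python
def stats(tokens):
    """Return (sequence length, number of distinct tokens)."""
    return len(tokens), len(set(tokens))


text = "tokenization matters a lot because tokenization choices matter"

char_tokens = list(text)    # character-level: every character, spaces included
word_tokens = text.split()  # word-level: split on whitespace

char_len, char_vocab = stats(char_tokens)
word_len, word_vocab = stats(word_tokens)

print(f"char-level: length={char_len}, distinct={char_vocab}")
print(f"word-level: length={word_len}, distinct={word_vocab}")
```

On a real corpus the gap widens in both directions: character vocabularies stay small while sequences grow long, and word vocabularies keep growing as new words appear. Subword schemes sit between the two.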
