NLP & LLMs · Easy · Asked at Google, Amazon, Meta

What is tokenization in NLP and why does it matter?

For: Data Scientist, ML Engineer, AI / LLM Engineer

The short answer

Tokenization splits raw text into discrete units — words, subwords, or characters — that a model can process numerically. The strategy chosen controls vocabulary size, out-of-vocabulary rate, and how well the model handles rare or morphologically complex words.

How to think about it

Tokenization is the first step in almost every NLP pipeline: it converts a string into a sequence of tokens that can be mapped to integer IDs and fed to a model.
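To make the string-to-IDs step concrete, here is a minimal sketch. It assumes a toy whitespace tokenizer and a vocabulary built on the fly; the helper names `build_vocab` and `encode` and the `[UNK]` placeholder are illustrative choices, not part of any particular library.

```python
def build_vocab(texts):
    """Assign each distinct token an integer ID; ID 0 is reserved for unknowns."""
    vocab = {"[UNK]": 0}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(text, vocab):
    """Map a string to a list of integer IDs, falling back to [UNK]."""
    return [vocab.get(token, vocab["[UNK]"]) for token in text.lower().split()]


vocab = build_vocab(["the cat sat", "the dog sat"])
ids = encode("the bird sat", vocab)
print(ids)  # [1, 0, 3] -> "bird" was never seen, so it maps to [UNK]
```

The unseen word "bird" collapsing to ID 0 is the out-of-vocabulary problem in miniature.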

Three common strategies

Strategy	Example	Trade-off
Word-level	`["playing"]`	Large vocab; OOV problem
Character-level	`["p","l","a","y",...]`	No OOV; very long sequences
Subword (BPE/WordPiece)	`["play","##ing"]`	Balances both

Word tokenization splits on whitespace and punctuation. Simple but brittle — “playing”, “plays”, and “played” become three unrelated IDs.
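The brittleness is easy to see in a small sketch. The regex below is one simple way to split on whitespace and punctuation; the function name `word_tokenize` and the sample sentence are assumptions made for illustration.

```python
import re


def word_tokenize(text):
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text.lower())


tokens = word_tokenize("She plays, he played, we are playing.")

# Assign IDs in order of first appearance.
word_to_id = {tok: i for i, tok in enumerate(dict.fromkeys(tokens))}

for word in ("plays", "played", "playing"):
    print(word, word_to_id[word])
# plays 1
# played 4
# playing 7
```

Nothing in the IDs tells the model that the three forms share the stem "play".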

Subword tokenization learns frequent merges of adjacent characters or bytes from a corpus, so common words stay intact while rare words decompose into known pieces. For example, "unhappiness" might become ["un","happiness"], although the actual split depends on the learned vocabulary. BERT uses WordPiece and GPT models use BPE.
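The merge-learning idea can be sketched in a few lines. This is a simplified, character-level version of BPE on a made-up word-frequency table; production tokenizers typically add word-boundary markers or operate on bytes, which this sketch leaves out. The names `get_pair_counts`, `merge_pair`, and `learn_bpe` are hypothetical.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, words):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Greedily merge the most frequent adjacent pair, num_merges times."""
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        words = merge_pair(best, words)
        merges.append(best)
    return merges, words


# Toy corpus statistics (assumed for illustration).
word_freqs = {"play": 5, "playing": 3, "played": 2, "sing": 2}
merges, segmented = learn_bpe(word_freqs, num_merges=5)

for symbols, freq in segmented.items():
    print(list(symbols), freq)
# ['play'] 5
# ['play', 'ing'] 3
# ['play', 'e', 'd'] 2
# ['s', 'ing'] 2
```

After five merges the frequent stem "play" and the suffix "ing" are single symbols, while the rarer "ed" is still split into characters.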

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("Tokenization matters a lot")
print(tokens)
# ['token', '##ization', 'matters', 'a', 'lot']
```

Why it matters for models: the tokenization choice fixes the sequence length and vocabulary size, directly influencing memory, speed, and the model’s ability to generalize across inflections and compound words.
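A rough way to see the trade-off is to measure vocabulary size and average sequence length for two strategies on the same text. The three-sentence corpus and the helper `corpus_stats` below are assumptions for illustration; real corpora show the same pattern at a much larger scale.

```python
corpus = ["the player is playing", "she played well", "they play daily"]


def word_level(text):
    return text.split()


def char_level(text):
    return list(text)  # spaces count as tokens here


def corpus_stats(tokenize):
    """Return vocabulary size and mean tokens per sentence for a tokenizer."""
    tokenized = [tokenize(text) for text in corpus]
    vocab = {tok for toks in tokenized for tok in toks}
    avg_len = sum(len(toks) for toks in tokenized) / len(tokenized)
    return {"vocab_size": len(vocab), "avg_len": avg_len}


stats = {"word": corpus_stats(word_level), "char": corpus_stats(char_level)}
for name, s in stats.items():
    print(f"{name}: vocab={s['vocab_size']}, avg_len={s['avg_len']:.1f}")
```

On this toy corpus the character-level sequences are several times longer than the word-level ones. The word-level vocabulary, in contrast, gains an entry for almost every new word, whereas the character set stops growing after a few sentences.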
