datarekha
NLP & LLMs | Medium | Asked at Google, Meta, Amazon

What are out-of-vocabulary (OOV) words and how do modern NLP systems handle them?

The short answer

OOV words are tokens that were not seen when the vocabulary was built, so a model cannot look them up in its embedding table. Classical word-level models replace them with a generic UNK token and lose all information about the word. Subword tokenizers (BPE, WordPiece) largely eliminate the problem by decomposing words into known subunits, and byte-level BPE eliminates it entirely because any string can be expressed as bytes.

How to think about it

The OOV problem

A vocabulary is fixed at training time. At inference, user input may contain proper nouns, neologisms, domain jargon, or typos that were never seen. A word-level model maps all of these to [UNK], collapsing distinct signals into a single meaningless token.

Strategy 1: UNK token (word-level)

Simple but lossy. “COVID-19”, “ChatGPT”, and “xyzfoo” all become [UNK]. The model cannot distinguish between them.
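A minimal sketch of this collapse, assuming a toy vocabulary and whitespace tokenization (the words, ids, and the `encode` helper are made up for illustration):

```python
# Toy word-level vocabulary; the entries and ids are hypothetical.
vocab = {"[UNK]": 0, "the": 1, "model": 2, "answers": 3, "questions": 4}

def encode(text, vocab):
    """Map each whitespace-separated token to its id, falling back to [UNK]."""
    unk_id = vocab["[UNK]"]
    return [vocab.get(tok, unk_id) for tok in text.lower().split()]

ids_a = encode("ChatGPT answers questions", vocab)
ids_b = encode("xyzfoo answers questions", vocab)

print(ids_a)           # [0, 3, 4]
print(ids_b)           # [0, 3, 4]
print(ids_a == ids_b)  # True: the two inputs are indistinguishable to the model
```

Both sentences produce the same id sequence, so nothing downstream can recover which unknown word was actually typed.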

Strategy 2: Character n-grams (fastText)

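FastText represents each word as the sum of its character n-gram vectors (n-grams of length 3 to 6 by default), after wrapping the word in the boundary markers < and >. For “unhappiness”, the trigrams are <un, unh, nha, …, ess, ss>. A new word at inference time still gets a meaningful vector because it shares substrings with words seen in training.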

import fasttext

model = fasttext.train_unsupervised("corpus.txt", model="skipgram")

# Even if "unhappiness" never appears in corpus.txt, it still gets a vector
# built from its character n-grams
vec = model.get_word_vector("unhappiness")
print(vec.shape)  # (100,) with the default dim=100

Strategy 3: Subword tokenization (BPE / WordPiece)

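The vocabulary contains subword units rather than whole words. A novel surface form decomposes into known pieces, so a word-level OOV no longer occurs. The one remaining gap is a character that is missing from the vocabulary altogether (a rare script or symbol, for example), which WordPiece still maps to [UNK].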

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
# A novel word that is unlikely to be in BERT's vocabulary as a single token
print(tok.tokenize("zooomorphic"))
# e.g. ['zoo', '##omo', '##rph', '##ic'] (decomposed into known pieces, no [UNK])
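The decomposition above comes from greedy longest-match-first lookup. The sketch below illustrates that idea with a made-up six-entry vocabulary. The `wordpiece` function is a simplified stand-in, not the Hugging Face implementation, and it skips normalization and pre-tokenization.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation; continuation pieces carry '##'."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece covers this position, so the whole word becomes [UNK].
            return [unk]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary for illustration only.
toy_vocab = {"token", "##ize", "##r", "##s", "un", "##known"}

print(wordpiece("tokenizers", toy_vocab))       # ['token', '##ize', '##r', '##s']
print(wordpiece("tokenizer\u2603", toy_vocab))  # ['[UNK]'] (snowman is not in the vocabulary)
```

The second call shows why subword vocabularies still need an unknown token: a single uncovered character is enough to trigger the fallback.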

Strategy 4: Byte-level BPE (GPT-2 and later)

The base vocabulary is the 256 possible byte values, with BPE merges learned on top. Every UTF-8 string can therefore be represented without OOV, which makes the tokenizer robust to any language, emoji, or special character.
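The reason no fallback token is needed can be seen with the standard library alone. This sketch shows only the base byte alphabet and leaves out the learned merges that a real byte-level BPE tokenizer applies on top. The `to_byte_ids` helper is hypothetical.

```python
def to_byte_ids(text):
    """Encode text as UTF-8 and return one integer id (0-255) per byte."""
    return list(text.encode("utf-8"))

samples = ["cat", "\u00e9", "\U0001F642"]  # ASCII word, accented e, emoji
for s in samples:
    ids = to_byte_ids(s)
    # Every id falls in the fixed 256-symbol alphabet, so nothing is unknown.
    assert all(0 <= b < 256 for b in ids)
    # The mapping is lossless: decoding the bytes restores the original text.
    assert bytes(ids).decode("utf-8") == s

print(to_byte_ids("cat"))         # [99, 97, 116]
print(to_byte_ids("\u00e9"))      # [195, 169]
print(to_byte_ids("\U0001F642"))  # [240, 159, 153, 130]
```

A single emoji costs four base symbols here, which is the trade-off: coverage is total, but rare characters expand into longer sequences unless merges exist for them.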

Comparison

| Approach | OOV rate | Semantic quality | Inference speed |
|---|---|---|---|
| Word-level + UNK | High | Poor for OOV | Fast |
| fastText n-grams | Zero | Good | Fast |
| BPE / WordPiece | Near zero | Good | Moderate |
| Byte-level BPE | Zero | Excellent | Moderate |
