What are out-of-vocabulary (OOV) words and how do modern NLP systems handle them?
OOV words are tokens absent from the vocabulary built at training time, so a model cannot look them up in its embedding table. Classical word-level models replace them with a generic UNK token and lose all information about the word, while subword tokenizers (BPE, WordPiece) largely avoid the problem by decomposing unseen words into known subunits. Byte-level variants remove it entirely.
How to think about it
The OOV problem
A vocabulary is fixed at training time. At inference, user input may contain proper nouns, neologisms, domain jargon, or typos that were never seen. A word-level model maps all of these to [UNK], collapsing distinct signals into a single meaningless token.
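To make the problem concrete, the share of unseen tokens can be measured directly. The following is a minimal sketch, assuming whitespace tokenization and a toy corpus; `build_vocab` and `oov_rate` are hypothetical helper names, not part of any library.

```python
from collections import Counter

def build_vocab(train_tokens, min_count=1):
    """Fix the vocabulary from training tokens only (hypothetical helper)."""
    counts = Counter(train_tokens)
    return {tok for tok, c in counts.items() if c >= min_count}

def oov_rate(tokens, vocab):
    """Fraction of tokens that are missing from the vocabulary."""
    if not tokens:
        return 0.0
    missing = sum(1 for tok in tokens if tok not in vocab)
    return missing / len(tokens)

# Toy data: "dog" and "sofa" never appear in training
train = "the cat sat on the mat".split()
test = "the dog sat on the sofa".split()

vocab = build_vocab(train)
rate = oov_rate(test, vocab)
print(f"OOV rate: {rate:.2f}")  # OOV rate: 0.33
```

Raising `min_count` shrinks the vocabulary and pushes the rate up, which is the usual trade-off when rare words are pruned.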
Strategy 1: UNK token (word-level)
Simple but lossy. “COVID-19”, “ChatGPT”, and “xyzfoo” all become [UNK]. The model cannot distinguish between them.
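The collapse can be seen in a few lines. This is an illustrative sketch, assuming a toy training sentence and reserving id 0 for the unknown token; `build_index` and `encode` are hypothetical names.

```python
UNK = "[UNK]"

def build_index(train_tokens):
    """Map each training token to an integer id; id 0 is reserved for UNK."""
    index = {UNK: 0}
    for tok in train_tokens:
        index.setdefault(tok, len(index))
    return index

def encode(tokens, index):
    """Look up each token, falling back to the UNK id when it is missing."""
    return [index.get(tok, index[UNK]) for tok in tokens]

index = build_index("the model reads the text".split())
ids = encode(["the", "COVID-19", "ChatGPT", "xyzfoo"], index)
print(ids)  # [1, 0, 0, 0]
```

All three unseen words share id 0, so they also share one embedding row and are indistinguishable to the model.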
Strategy 2: Character n-grams (fastText)
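FastText represents each word as the sum of its character n-gram vectors (lengths 3 to 6 by default), computed over the word wrapped in the boundary markers < and >. For “unhappiness”, the overlapping trigrams are <un, unh, nha, …, ess, ss>. A word unseen in training still gets a meaningful vector because it shares n-grams with known words.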
import fasttext
model = fasttext.train_unsupervised("corpus.txt", model="skipgram")
# Any word, seen in training or not, gets a vector built from its character n-grams
vec = model.get_word_vector("unhappiness")
print(vec.shape) # (100,) with the default dim=100
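The n-gram decomposition itself can be sketched in plain Python. This is a simplified illustration, not fastText's implementation: the real library also hashes n-grams into a fixed number of buckets and keeps a separate vector for in-vocabulary words. `char_ngrams` is a hypothetical helper, and the `minn`/`maxn` defaults mirror fastText's parameter names.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return all character n-grams of the boundary-marked word."""
    marked = f"<{word}>"
    grams = []
    for n in range(minn, maxn + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams

# Trigrams only, to match the example in the text
trigrams = char_ngrams("unhappiness", minn=3, maxn=3)
print(trigrams)

# N-grams shared with a related word explain why unseen words get useful vectors
shared = set(char_ngrams("unhappiness")) & set(char_ngrams("happiness"))
print(sorted(shared))
```

The overlap with “happiness” includes pieces such as `happi` and `ness>`, so an unseen “unhappiness” would inherit part of its representation from the known word.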
Strategy 3: Subword tokenization (BPE / WordPiece)
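The vocabulary contains subword units learned from the training corpus. An unseen word decomposes into known pieces, so OOV becomes rare rather than routine. It is not strictly impossible: BERT's WordPiece tokenizer still emits [UNK] for a word containing a character that is missing from its vocabulary.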
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
# A made-up word that is not a single entry in BERT's vocabulary
print(tok.tokenize("zooomorphic"))
# e.g. ['zoo', '##omo', '##rph', '##ic']: split into known pieces, no [UNK]; the exact pieces depend on the vocabulary
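The splitting step can be approximated with a greedy longest-match-first loop, which is the strategy WordPiece inference is generally described as using. The sketch below assumes a tiny hand-written vocabulary; `wordpiece` and `toy_vocab` are hypothetical, and production tokenizers add normalization, pre-tokenization, and a maximum word length.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece covers this position: the whole word becomes UNK
            return [unk]
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"zoo", "##morph", "##ic", "un", "##happi", "##ness"}

split_a = wordpiece("zoomorphic", toy_vocab)
split_b = wordpiece("unhappiness", toy_vocab)
split_c = wordpiece("zoo\U0001F642", toy_vocab)  # ends with an emoji not in the vocabulary

print(split_a)  # ['zoo', '##morph', '##ic']
print(split_b)  # ['un', '##happi', '##ness']
print(split_c)  # ['[UNK]']
```

The last call shows the residual failure case: one uncovered character is enough to turn the whole word into [UNK].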
Strategy 4: Byte-level BPE (GPT-2 and later)
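The base vocabulary is the 256 possible byte values, and BPE merges are learned on top of them. Every UTF-8 string can therefore be encoded without OOV, which makes the tokenizer robust to any language, emoji, or special character. The cost is that text in scripts that were rare in the training data tends to split into more tokens.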
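The guarantee follows from the byte representation alone, before any merges are applied. This is a minimal sketch of that base layer only, not of GPT-2's tokenizer; `to_byte_ids` and `from_byte_ids` are hypothetical helpers.

```python
def to_byte_ids(text):
    """Encode text as UTF-8 and return one integer id (0-255) per byte."""
    return list(text.encode("utf-8"))

def from_byte_ids(ids):
    """Invert to_byte_ids by decoding the bytes back to a string."""
    return bytes(ids).decode("utf-8")

# "naïve" followed by a space and an emoji, written with escapes for clarity
text = "na\u00efve \U0001F642"
ids = to_byte_ids(text)

print(len(text), len(ids))         # 7 11
print(all(i < 256 for i in ids))   # True
print(from_byte_ids(ids) == text)  # True
```

Seven characters become eleven byte ids because the accented letter takes two bytes and the emoji takes four. Learned merges then compress frequent byte sequences back into single tokens.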
Comparison
| Approach | OOV rate | Semantic quality | Inference speed |
|---|---|---|---|
| Word-level + UNK | High | Poor for OOV | Fast |
| fastText n-grams | Zero | Good | Fast |
| BPE / WordPiece | Near zero ([UNK] only for unseen characters) | Good | Moderate |
| Byte-level BPE | Zero | Excellent | Moderate |