What are n-grams and when should you use them in NLP?
An n-gram is a contiguous sequence of n tokens from text — bigrams capture two-word phrases, trigrams capture three. They add local word-order context to bag-of-words models, improving tasks like language modelling, spell-checking, and text classification where short phrases are discriminative.
How to think about it
A unigram model treats each word independently. N-grams extend this by considering sequences, capturing limited local context without requiring a neural network.
Notation
- Unigram (n=1): ["new", "york", "city"]
- Bigram (n=2): ["new york", "york city"]
- Trigram (n=3): ["new york city"]
The phrase “New York” means something quite different from “New” and “York” taken separately. A bigram feature keeps the two words together, so a model can treat the phrase as a single unit.
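The notation above can be reproduced with a sliding window over a token list. The following is a minimal sketch; the helper name `ngrams` is an illustrative choice, not part of any library, and tokenization is assumed to be plain whitespace splitting.

```python
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return every contiguous window of n tokens, left to right."""
    if n < 1:
        raise ValueError("n must be >= 1")
    # A sequence of length L has L - n + 1 windows (none if L < n).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "new york city".split()  # assumption: whitespace tokenization
for n in (1, 2, 3):
    print(n, [" ".join(gram) for gram in ngrams(tokens, n)])
# 1 ['new', 'york', 'city']
# 2 ['new york', 'york city']
# 3 ['new york city']
```

A sequence shorter than n simply yields an empty list, which is usually the behaviour you want for very short documents.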
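N-gram language models estimate the probability of the next word given the previous n-1 words using maximum-likelihood counts from a corpus: P(word | context) = count(context, word) / count(context). Larger n gives more context, but the number of possible n-grams grows exponentially with n (V^n for a vocabulary of V words), so most of them never appear in a finite corpus. This is the sparsity problem.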
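To make the maximum-likelihood estimate concrete, here is a small bigram (n=2) sketch built on `collections.Counter`. The toy corpus, the `<s>` sentence-start marker, and the function name `bigram_prob` are assumptions made for illustration; the sketch applies no smoothing, so unseen bigrams get probability zero.

```python
from collections import Counter

corpus = [
    "new york city is large",
    "new york is expensive",
    "los angeles is sunny",
]

bigram_counts = Counter()   # count(prev, word)
context_counts = Counter()  # count(prev), counted only where a next word exists

for sentence in corpus:
    # Assumption: prepend a start marker so the first word has a context.
    tokens = ["<s>"] + sentence.split()
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[(prev, word)] += 1
        context_counts[prev] += 1


def bigram_prob(prev: str, word: str) -> float:
    """MLE estimate: P(word | prev) = count(prev, word) / count(prev)."""
    if context_counts[prev] == 0:
        return 0.0  # context never seen; unsmoothed MLE has no estimate
    return bigram_counts[(prev, word)] / context_counts[prev]


print(bigram_prob("new", "york"))    # 1.0  ("new" is always followed by "york")
print(bigram_prob("york", "city"))   # 0.5  ("york" is followed by "city" or "is")
print(bigram_prob("york", "sunny"))  # 0.0  (never observed)
```

The zero for an unseen bigram is the sparsity problem in miniature: the pair is perfectly plausible, but the counts cannot say so.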
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
"new york city is large",
"new york is expensive",
"los angeles is sunny",
]
# Extract unigrams and bigrams together
vec = CountVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(corpus)
print(vec.get_feature_names_out())
# includes 'new york', 'york city', 'los angeles', ...
Combining n-grams with TF-IDF weighting is common in production text classifiers: TfidfVectorizer accepts the same ngram_range argument as CountVectorizer, so TfidfVectorizer(ngram_range=(1, 2)) yields TF-IDF-weighted unigram and bigram features.
Trade-offs
| n | Context | Data needed | Sparsity |
|---|---|---|---|
| 1 | None | Low | Low |
| 2 | Local pair | Moderate | Moderate |
| 3 | Short phrase | High | High |
| 4+ | Longer phrase (rarely practical) | Very high | Severe |
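The sparsity column can be illustrated by comparing how many distinct n-grams a corpus actually contains with how many are possible for its vocabulary. This is a toy sketch on the three-sentence corpus used earlier; the function name `ngram_coverage` is hypothetical, and the absolute numbers mean little beyond showing the trend.

```python
from typing import List, Tuple

corpus = [
    "new york city is large",
    "new york is expensive",
    "los angeles is sunny",
]
docs = [sentence.split() for sentence in corpus]


def ngram_coverage(docs: List[List[str]], n: int) -> Tuple[int, int]:
    """Return (distinct n-grams observed, n-grams possible = V ** n)."""
    vocab_size = len({tok for doc in docs for tok in doc})
    observed = {
        tuple(doc[i:i + n])
        for doc in docs
        for i in range(len(doc) - n + 1)
    }
    return len(observed), vocab_size ** n


for n in range(1, 5):
    seen, possible = ngram_coverage(docs, n)
    print(f"n={n}: {seen} observed of {possible} possible")
# n=1: 9 observed of 9 possible
# n=2: 9 observed of 81 possible
# n=3: 7 observed of 729 possible
# n=4: 4 observed of 6561 possible
```

The observed count stays roughly flat while the space of possible n-grams grows as V^n, which is why higher-order models need so much more data.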