datarekha
NLP & LLMs | Easy | Asked at Google | Asked at Amazon

What are n-grams and when should you use them in NLP?

The short answer

An n-gram is a contiguous sequence of n tokens from text — bigrams capture two-word phrases, trigrams capture three. They add local word-order context to bag-of-words models, improving tasks like language modelling, spell-checking, and text classification where short phrases are discriminative.

How to think about it

A unigram model treats each word independently. N-grams extend this by considering sequences, capturing limited local context without requiring a neural network.

Notation

  • Unigram (n=1): ["new", "york", "city"]
  • Bigram (n=2): ["new york", "york city"]
  • Trigram (n=3): ["new york city"]

The phrase “New York” means something quite different from “New” and “York” taken separately, and bigrams capture this.
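The notation above can be reproduced with a few lines of plain Python. This is a minimal sketch that assumes whitespace tokenisation; the helper name `ngrams` is illustrative and not taken from any library.

```python
def ngrams(tokens, n):
    """Return every contiguous window of n tokens as a space-joined string."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence of k tokens has k - n + 1 windows of length n (none if n > k).
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "new york city".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(unigrams)  # ['new', 'york', 'city']
print(bigrams)   # ['new york', 'york city']
print(trigrams)  # ['new york city']
```

Asking for an n larger than the number of tokens simply returns an empty list, which is usually the most convenient behaviour for short inputs.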

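N-gram language models estimate the probability of the next word given the previous n-1 words using maximum-likelihood counts from a corpus: the count of the full n-gram divided by the count of its (n-1)-word context. Larger n gives more context, but the number of possible n-grams grows exponentially with n (V^n for a vocabulary of V words), so most of them never appear in any realistic corpus (the sparsity problem).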
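To make the maximum-likelihood estimate concrete, here is a small sketch of a bigram model (n = 2) built from raw counts. It assumes whitespace tokenisation and no smoothing; the names `follow_counts` and `next_word_prob` are illustrative only.

```python
from collections import Counter, defaultdict

corpus = [
    "new york city is large",
    "new york is expensive",
    "los angeles is sunny",
]

# For each one-word context, count which words follow it.
follow_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        follow_counts[prev][nxt] += 1


def next_word_prob(prev, nxt):
    """MLE estimate: count(prev, nxt) / count(prev followed by anything)."""
    followers = follow_counts.get(prev, Counter())
    total = sum(followers.values())
    if total == 0:
        return 0.0  # context never seen with a following word
    return followers[nxt] / total


p_york = next_word_prob("new", "york")      # "new" is always followed by "york"
p_city = next_word_prob("york", "city")     # "york" is followed by "city" or "is"
p_unseen = next_word_prob("york", "sunny")  # pair never observed

print(p_york, p_city, p_unseen)  # 1.0 0.5 0.0
```

The zero for an unseen pair is the sparsity problem in miniature: a perfectly plausible continuation gets no probability mass simply because the corpus never contained it.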

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "new york city is large",
    "new york is expensive",
    "los angeles is sunny",
]

# Extract unigrams and bigrams together
vec = CountVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(corpus)
print(vec.get_feature_names_out())
# includes 'new york', 'york city', 'los angeles', ...

Combining n-grams with TF-IDF weighting is common in production text classifiers: TfidfVectorizer(ngram_range=(1, 2)) adds bigram features on top of unigrams through a single argument.

Trade-offs

n    Context           Data needed   Sparsity
1    None              Low           Low
2    Local pair        Moderate      Moderate
3    Short phrase      High          High
4+   Rarely practical  Very high     Severe
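The sparsity trend in the table can be observed even on a toy corpus. The sketch below, which assumes whitespace tokenisation, counts the distinct n-grams for n = 1 to 3 and reports what share of them occur exactly once; `ngram_counts` and `singleton_share` are illustrative names.

```python
from collections import Counter

corpus = [
    "new york city is large",
    "new york is expensive",
    "los angeles is sunny",
]


def ngram_counts(sentences, n):
    """Count every contiguous n-token window, without crossing sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


singleton_share = {}
for n in (1, 2, 3):
    counts = ngram_counts(corpus, n)
    singletons = sum(1 for c in counts.values() if c == 1)
    singleton_share[n] = singletons / len(counts)
    print(n, len(counts), round(singleton_share[n], 2))
# 1 9 0.67
# 2 9 0.89
# 3 7 1.0
```

As n grows, a larger share of the observed n-grams appears only once, so the counts behind each feature or probability estimate become less reliable unless the corpus grows with it.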
