datarekha
NLP & LLMs | Easy | Asked at Google, Amazon, and Microsoft

What is TF-IDF and how does it improve on raw bag-of-words counts?

The short answer

TF-IDF weights each term by how often it appears in a document (TF) scaled down by how common it is across the whole corpus (IDF), so words that are frequent everywhere — like 'the' — get low scores while distinctive terms get high scores. This makes document vectors more informative than raw counts for retrieval and classification.

How to think about it

Bag-of-words (BoW) represents a document as a vector of raw term counts. It ignores word order but captures term frequency, which is good enough for many classification tasks. The drawback is that the largest counts usually belong to stop words, which say little about the topic.
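As a rough illustration of the raw-count representation, here is a minimal sketch using only the standard library. The helper name bag_of_words and the whitespace tokenizer are assumptions made for this example; real pipelines usually tokenize more carefully.

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Count raw term occurrences; word order is discarded."""
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    return Counter(text.lower().split())

bow = bag_of_words("the cat sat on the mat")
print(bow["the"])  # 2
print(bow["mat"])  # 1
```

The stop word "the" already has the largest count in this one sentence, even though "cat" and "mat" carry the meaning.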

TF-IDF applies a two-factor weight to each term:

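  • TF (term frequency): count(t, d) / total_terms(d), the share of document d's tokens that are term t.
  • IDF (inverse document frequency): log(N / df(t)), where N is the number of documents in the corpus and df(t) is the number of documents that contain t. A term that appears in every document gets log(1) = 0; rare, specific terms get larger values.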

Final score: TF-IDF(t, d) = TF(t, d) * IDF(t)
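To make the formula concrete, the sketch below computes it directly with the standard library, using the natural logarithm and a plain whitespace tokenizer. The function name tf_idf and both of those choices are assumptions for illustration; libraries typically use smoothed variants that give different numbers.

```python
import math
from collections import Counter

def tf_idf(corpus: list[str]) -> list[dict[str, float]]:
    """Return one {term: TF-IDF score} dict per document (unsmoothed formula)."""
    docs = [doc.lower().split() for doc in corpus]
    n_docs = len(docs)
    # df(t): number of documents that contain t at least once
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

corpus = [
    "the cat sat on the mat",
    "the dog ran across the yard",
    "cats and dogs are common pets",
]
weights = tf_idf(corpus)
print(round(weights[0]["the"], 3))  # 0.135 = (2/6) * ln(3/2)
print(round(weights[0]["mat"], 3))  # 0.183 = (1/6) * ln(3/1)
```

Although "the" occurs twice as often as "mat" in the first document, it scores lower because it also appears in a second document. Had it appeared in all three, its IDF and therefore its score would be exactly 0.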

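from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog ran across the yard",
    "cats and dogs are common pets",
]

vec = TfidfVectorizer()
X = vec.fit_transform(corpus)  # sparse matrix, one L2-normalized row per document

# Print the non-zero term weights for the first document
for term, weight in zip(vec.get_feature_names_out(), X[0].toarray().ravel()):
    if weight > 0:
        print(f"{term}: {weight:.3f}")
# Caveat: by default scikit-learn uses raw counts for TF and a smoothed IDF,
# ln((1 + N) / (1 + df)) + 1, which never reaches zero. In this tiny corpus
# 'the' occurs twice per document and in only 2 of 3 documents, so it still
# outweighs 'mat'. Pass stop_words="english" or use a larger corpus to see
# common words suppressed.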

Why it beats raw BoW for retrieval: with raw counts, a query containing “the” rewards whichever document repeats “the” most, regardless of topic. TF-IDF drives the weight of such ubiquitous terms toward zero, so the ranking is decided by the distinctive query terms and surfaces documents that genuinely discuss them.
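The retrieval argument can be sketched with a toy ranking. Everything here is an assumption for illustration: the three documents, the query, and the scoring rule (summing per-term scores over the query terms, using the unsmoothed TF-IDF formula from above). Production search engines use more refined scoring.

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog chased the ball and the stick across the yard",
    "birds sing in the morning",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)
# df(t): number of documents that contain t
df = Counter(t for doc in tokenized for t in set(doc))

def raw_score(query: list[str], doc: list[str]) -> int:
    """Sum of raw counts of the query terms in the document."""
    counts = Counter(doc)
    return sum(counts[t] for t in query)

def tfidf_score(query: list[str], doc: list[str]) -> float:
    """Sum of TF-IDF scores of the query terms in the document."""
    counts = Counter(doc)
    return sum(
        (counts[t] / len(doc)) * math.log(n_docs / df[t])
        for t in query
        if t in df  # skip terms the corpus has never seen
    )

query = "the mat".split()
best_raw = max(range(n_docs), key=lambda i: raw_score(query, tokenized[i]))
best_tfidf = max(range(n_docs), key=lambda i: tfidf_score(query, tokenized[i]))
print(docs[best_raw])    # the dog chased the ball and the stick across the yard
print(docs[best_tfidf])  # the cat sat on the mat
```

Raw counts pick the dog document because it repeats "the" four times, even though it never mentions a mat. Since "the" appears in all three documents, its IDF is 0 and only "mat" contributes to the TF-IDF ranking.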

Typical use cases: document clustering, keyword extraction, feature engineering for text classifiers, and classical information retrieval.
