What is TF-IDF and how does it improve on raw bag-of-words counts?
TF-IDF weights each term by how often it appears in a document (TF) scaled down by how common it is across the whole corpus (IDF), so words that are frequent everywhere — like 'the' — get low scores while distinctive terms get high scores. This makes document vectors more informative than raw counts for retrieval and classification.
How to think about it
Bag-of-words (BoW) represents a document as a vector of raw term counts. It ignores word order but captures term frequency — good enough for many classification tasks, yet dominated by stop words.
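To make the BoW representation concrete, here is a minimal sketch using only the standard library. The whitespace tokenizer and the variable names are illustrative assumptions, not part of any particular library.

```python
from collections import Counter

doc = "the cat sat on the mat"
tokens = doc.split()            # naive whitespace tokenization (assumption)
vocab = sorted(set(tokens))     # fixed term order for the vector
counts = Counter(tokens)
vector = [counts[t] for t in vocab]

print(vocab)   # ['cat', 'mat', 'on', 'sat', 'the']
print(vector)  # [1, 1, 1, 1, 2]

# Word order is lost: a reordered sentence yields the same counts.
reordered = Counter("the mat sat on the cat".split())
print(counts == reordered)  # True
```

Even in this one-sentence document, the largest count belongs to the stop word 'the'.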
TF-IDF applies a two-factor weight to each term:
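- TF (term frequency): count(t, d) / total_terms(d), which measures how dominant this term is in document d.
- IDF (inverse document frequency): log(N / df(t)), where N is the number of documents in the corpus and df(t) is the number of documents containing t. It penalizes terms that appear in almost every document and rewards rare, specific terms.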
Final score: TF-IDF(t, d) = TF(t, d) * IDF(t)
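The formulas above can be worked through by hand. The following is a minimal sketch in plain Python, assuming whitespace tokenization and the natural logarithm; the helper names `tf`, `idf`, and `tf_idf` are illustrative and do not come from a library.

```python
import math
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog ran across the yard",
    "cats and dogs are common pets",
]
docs = [text.split() for text in corpus]  # naive whitespace tokenization
N = len(docs)

# df(t): number of documents that contain term t at least once
df = Counter(term for doc in docs for term in set(doc))

def tf(term, doc):
    """count(t, d) / total_terms(d)"""
    return doc.count(term) / len(doc)

def idf(term):
    """log(N / df(t)); only defined for terms that occur in the corpus"""
    return math.log(N / df[term])

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

first = docs[0]
for term in ("the", "mat"):
    print(term, round(tf_idf(term, first), 3))
# the 0.135
# mat 0.183
```

Although 'the' occurs twice in the first sentence and 'mat' only once, 'mat' ends up with the higher score because it appears in one document out of three rather than two. With this unsmoothed formula, a term present in every document would get an IDF of exactly 0.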
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
"the cat sat on the mat",
"the dog ran across the yard",
"cats and dogs are common pets",
]
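vec = TfidfVectorizer()  # defaults: smoothed IDF, L2-normalized rows
X = vec.fit_transform(corpus)  # sparse matrix, one row per document
print(vec.get_feature_names_out())
print(X.toarray().round(2))
# On this tiny corpus 'the' is not near zero: it occurs twice in each of the first two
# sentences, and scikit-learn's smoothed IDF, ln((1 + N) / (1 + df)) + 1, never reaches 0.
# 'the' still has a lower IDF than 'mat' or 'yard'; the gap typically widens on larger corpora.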
Why it beats raw BoW for retrieval: in a search engine, if “the” scores high, every document looks equally relevant to every query. TF-IDF suppresses such noise and surfaces documents that genuinely discuss the query term.
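The retrieval argument can be illustrated with a toy ranking. This is a sketch under stated assumptions: the three-sentence corpus, the query, and the scoring helpers `raw_score` and `tfidf_score` are made up for illustration, and real search engines use more refined scoring.

```python
import math
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog chased the ball across the yard",
    "the weather in the city is the same as the coast",
]
docs = [text.split() for text in corpus]  # naive whitespace tokenization
N = len(docs)
df = Counter(term for doc in docs for term in set(doc))

def raw_score(query, doc):
    """Sum of raw counts of the query terms in the document."""
    return sum(doc.count(t) for t in query)

def tfidf_score(query, doc):
    """Sum of TF(t, d) * log(N / df(t)) over the query terms."""
    score = 0.0
    for t in query:
        if df[t] == 0:
            continue  # term never seen in the corpus: contributes nothing
        score += (doc.count(t) / len(doc)) * math.log(N / df[t])
    return score

query = "the cat".split()
raw_best = max(range(N), key=lambda i: raw_score(query, docs[i]))
tfidf_best = max(range(N), key=lambda i: tfidf_score(query, docs[i]))

print("raw counts pick:", corpus[raw_best])
print("TF-IDF picks:   ", corpus[tfidf_best])
```

Here raw counts favor the third sentence, which never mentions a cat but repeats 'the' four times. Because 'the' occurs in all three documents, its IDF is log(3/3) = 0, so the TF-IDF score depends only on 'cat' and selects the first sentence.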
Typical use cases: document clustering, keyword extraction, feature engineering for text classifiers, and classical information retrieval.