Content-based filtering
Recommend new items nobody has rated yet — by representing every item as a feature vector and matching it to what the user has already loved.
What you'll learn
- How to represent items and users as feature vectors using TF-IDF
- Why cosine similarity: comparing directions, not magnitudes
- Strengths (solves item cold-start, explainable) and weaknesses (filter bubble, feature dependency)
Why content-based filtering exists
Collaborative filtering (CF) is powerful, but it has a fundamental dependency: it needs overlap — users who have rated the same items. A new item has no overlap with anything. This is the item cold-start problem.
Content-based filtering sidesteps it entirely by ignoring other users. Instead, it asks: “What are the properties of this item, and which users have consistently liked items with those same properties?” No ratings on the new item are needed — only its description, genre tags, or metadata.
Building blocks
Item profiles
An item profile is a numeric vector that encodes the properties of one item. For a movie, the features might be its genre tags (action, sci-fi, thriller). For a product, they might be category, price tier, and keywords from the description. For a document, they might be word frequencies.
The most practical way to turn free-text descriptions into item profiles is TF-IDF (Term Frequency-Inverse Document Frequency). TF-IDF assigns a high weight to words that appear often in this document but rarely across all documents — words that actually distinguish the item. Common words like “the” or “movie” get near-zero weight; distinctive words like “cyberpunk” or “heist” score high.
User profiles
A user profile is built by aggregating the item profiles of everything the user has liked. In the simplest form: average the feature vectors of all positively-rated items. A user who liked three sci-fi films will have a profile with a strong sci-fi signal. A user who liked romantic comedies will have a strong romance-comedy signal. The profile encodes taste without ever consulting other users.
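The averaging step can be sketched with a few hand-made genre vectors. The film names, the four feature columns, and the values are all hypothetical; a real system would use whatever item profiles it has already built.

```python
import numpy as np

# Hypothetical genre features, one column per tag: [sci-fi, action, romance, comedy]
item_profiles = {
    "Film A": np.array([1.0, 1.0, 0.0, 0.0]),
    "Film B": np.array([1.0, 0.0, 0.0, 0.0]),
    "Film C": np.array([1.0, 0.0, 1.0, 0.0]),
    "Film D": np.array([0.0, 0.0, 1.0, 1.0]),
}

liked = ["Film A", "Film B", "Film C"]   # the user's positively rated items

# Simplest aggregation: the element-wise mean of the liked item vectors.
user_profile = np.mean([item_profiles[name] for name in liked], axis=0)
print(user_profile)
```

All three liked films carry the sci-fi tag, so the sci-fi component of the profile is the largest; action and romance each appear once and get a weaker signal.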
Cosine similarity
Once both items and users live in the same feature space, recommendation is a nearest-neighbor search. The similarity measure of choice is cosine similarity: the cosine of the angle between two vectors. In general it ranges from -1 (opposite directions) to 1 (identical direction); for TF-IDF vectors, whose entries are never negative, it stays between 0 and 1. Two vectors pointing in the same direction are similar even if one is much longer than the other. That matters because a profile accumulated from 200 liked films can have a much larger magnitude than one built from 5, yet their taste direction is what we want to compare.
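To make the "direction, not magnitude" point concrete, here is a small sketch using NumPy. The helper name cosine_sim and the vectors are assumptions chosen for illustration; the heavy user's profile is simply the light user's profile scaled by 40.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / norm) if norm > 0 else 0.0

taste = np.array([3.0, 1.0, 0.0])   # hypothetical taste direction
light_user = taste                   # short vector
heavy_user = 40 * taste              # much longer vector, same direction
item = np.array([1.0, 0.5, 0.2])

# Cosine similarity is identical for both users.
print(cosine_sim(light_user, item), cosine_sim(heavy_user, item))

# Euclidean distance is not: it penalizes the heavy user for vector length alone.
print(np.linalg.norm(light_user - item), np.linalg.norm(heavy_user - item))
```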
Vector geometry — item profiles and the user profile
The diagram below shows five item vectors in a two-dimensional feature space (imagine “sci-fi intensity” on the horizontal axis and “action intensity” on the vertical). The user profile is the average of the items the user liked (filled dots). The recommendation algorithm picks the item whose angle to the user profile vector is smallest — closest by cosine.
Items A and B were liked; their average becomes the user profile vector (amber dashed). Item C is the next-closest by angle — the top recommendation. D and E are far away.
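The same geometry can be reproduced numerically. The coordinates below are hypothetical, picked only so that the layout matches the description (A and B liked, C close to their average, D and E far away); they are not read off the original diagram.

```python
import numpy as np

# Hypothetical coordinates: (sci-fi intensity, action intensity).
items = {
    "A": np.array([0.9, 0.3]),
    "B": np.array([0.7, 0.6]),
    "C": np.array([0.5, 0.3]),
    "D": np.array([0.1, 0.9]),
    "E": np.array([0.0, 0.5]),
}
liked = ["A", "B"]

# User profile: mean of the liked item vectors.
user_profile = np.mean([items[k] for k in liked], axis=0)

def cosine_sim(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Score only the items the user has not already liked, best first.
scores = {k: cosine_sim(user_profile, v) for k, v in items.items() if k not in liked}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)   # ['C', 'D', 'E']
```

With these numbers, C is within about two degrees of the user profile, while D and E are more than fifty degrees away.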
Strengths
Handles item cold-start. A brand-new item with zero ratings can be recommended as soon as it has metadata. No other users needed.
Explainable recommendations. Because the system knows why an item was recommended — it matched specific features — you can tell the user: “Because you liked sci-fi films with ensemble casts.” Collaborative filtering rarely offers this.
No popularity bias. Every item is judged by its features, not how many people have rated it. Niche items get a fair shot.
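Weaknesses
Filter bubble. Because every recommendation resembles what the user already liked, the system rarely surfaces anything outside their established tastes.
Needs good features. If the item metadata is sparse, wrong, or missing, the system can’t do its job. Garbage features in, garbage recommendations out.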
Code: TF-IDF item profiles and cosine similarity
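A minimal sketch of this step, assuming scikit-learn is available: build TF-IDF item profiles from short descriptions, then rank every other film by cosine similarity to "Gravity". The film titles are real, but the one-line descriptions were written for this sketch and the variable names (films, tfidf_matrix, sims) are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# One-line descriptions written for this sketch (not taken from any dataset).
films = {
    "Gravity": "an astronaut stranded in space struggles to survive after debris destroys her shuttle",
    "Interstellar": "astronauts travel through a wormhole in space to find a new home for humanity",
    "The Martian": "an astronaut stranded on mars must survive until a rescue mission arrives",
    "Notting Hill": "a london bookseller falls in love with a famous actress",
    "Ocean's Eleven": "a crew of thieves plans a casino heist in las vegas",
}
titles = list(films)

# Item profiles: one L2-normalized TF-IDF row per film.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(list(films.values()))

# Use a single liked item as the query and score the whole catalog against it.
query = titles.index("Gravity")
sims = cosine_similarity(tfidf_matrix[query], tfidf_matrix).flatten()

for i in sims.argsort()[::-1]:
    if i != query:
        print(f"{titles[i]:15s} {sims[i]:.3f}")
```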
The scores should confirm your intuition: films that share distinctive vocabulary with “Gravity” (words like “astronaut,” “space,” “survive”) score high; films that share no relevant terms score near zero.
From item similarity to user profiles
In the snippet above, a single liked item serves as the query. In a real system, the user profile is the mean (or weighted mean) of TF-IDF vectors for all items the user has positively rated. This aggregated vector is then compared against every candidate item in the catalog, and the top-K by cosine similarity are surfaced as recommendations.
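# Conceptual sketch: assumes tfidf_matrix is the sparse item-by-term TF-IDF matrix
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

liked_indices = [0, 1]  # user liked Gravity and Interstellar
# .mean() on a SciPy sparse matrix returns np.matrix; convert it to a plain 2-D array
user_profile = np.asarray(tfidf_matrix[liked_indices].mean(axis=0))
sims = cosine_similarity(user_profile, tfidf_matrix).flatten()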
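The aggregation and top-K ranking described above can be sketched as one self-contained function. The name recommend, its weights parameter, and the small dense feature matrix are hypothetical, used here so the example runs on its own; this is one plausible arrangement, not a definitive implementation.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def recommend(item_matrix, liked_indices, k=3, weights=None):
    """Return (index, score) pairs for the top-k items the user has not liked yet.

    item_matrix: (n_items, n_features) dense array or SciPy sparse matrix.
    weights: optional per-liked-item weights (e.g. ratings); plain mean if None.
    """
    liked = item_matrix[liked_indices]
    if weights is None:
        profile = np.asarray(liked.mean(axis=0))
    else:
        w = np.asarray(weights, dtype=float)
        profile = np.asarray(liked.T @ w) / w.sum()   # weighted mean
    profile = profile.reshape(1, -1)

    sims = cosine_similarity(profile, item_matrix).ravel()
    liked_set = set(liked_indices)
    order = np.argsort(sims)[::-1]                    # best first
    top = [i for i in order if i not in liked_set][:k]
    return [(int(i), float(sims[i])) for i in top]

# Hypothetical feature rows: [sci-fi, action, romance]
items = np.array([
    [1.0, 0.2, 0.0],   # 0: liked
    [0.9, 0.4, 0.0],   # 1: liked
    [0.8, 0.3, 0.1],   # 2
    [0.0, 0.1, 1.0],   # 3
    [0.1, 1.0, 0.0],   # 4
])
recs = recommend(items, liked_indices=[0, 1], k=2)
print(recs)
```

Already-liked items are filtered out before the cut to K, so the user is not shown what they have already rated. A sparse TF-IDF matrix can be passed as item_matrix in place of the dense array.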
Summary
Content-based filtering builds item profiles from features (TF-IDF over descriptions is a strong default), aggregates liked-item vectors into a user profile, and ranks candidates by cosine similarity to that profile. It solves the item cold-start problem and produces explainable recommendations, but risks creating a filter bubble and depends heavily on having rich, accurate item metadata.