datarekha

Content-based filtering

Recommend new items nobody has rated yet — by representing every item as a feature vector and matching it to what the user has already loved.

9 min read · Intermediate · Recommender Systems · Lesson 3 of 11

What you'll learn

  • How to represent items and users as feature vectors using TF-IDF
  • Why cosine similarity: comparing directions, not magnitudes
  • Strengths (solves item cold-start, explainable) and weaknesses (filter bubble, feature dependency)

Why content-based filtering exists

Collaborative filtering (CF) is powerful, but it has a fundamental dependency: it needs overlap — users who have rated the same items. A new item has no overlap with anything. This is the item cold-start problem.

Content-based filtering sidesteps it entirely by ignoring other users. Instead, it asks: “What are the properties of this item, and which users have consistently liked items with those same properties?” No ratings on the new item are needed — only its description, genre tags, or metadata.

Building blocks

Item profiles

An item profile is a numeric vector that encodes the properties of one item. For a movie, the features might be its genre tags (action, sci-fi, thriller). For a product, they might be category, price tier, and keywords from the description. For a document, they might be word frequencies.

The most practical way to turn free-text descriptions into item profiles is TF-IDF (Term Frequency-Inverse Document Frequency). TF-IDF assigns a high weight to words that appear often in this document but rarely across all documents — words that actually distinguish the item. Common words like “the” or “movie” get near-zero weight; distinctive words like “cyberpunk” or “heist” score high.
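To make the weighting concrete, here is a minimal sketch that computes textbook TF-IDF (term frequency times log(N / document frequency)) by hand. The three one-line descriptions and the helper name `tfidf_weights` are assumptions made up for illustration. Library implementations such as scikit-learn's add smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical mini-corpus: three invented movie descriptions.
docs = [
    "a cyberpunk movie about a hacker",
    "a heist movie about a bank crew",
    "a romance movie about a summer trip",
]

def tfidf_weights(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

weights = tfidf_weights(docs)
# Show the first description's terms, highest weight first.
for term, w in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {w:.3f}")
```

In this toy corpus, "a", "movie", and "about" appear in every description, so their weight is exactly zero, while "cyberpunk" and "hacker" appear in only one and come out at about 0.183.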

User profiles

A user profile is built by aggregating the item profiles of everything the user has liked. In the simplest form: average the feature vectors of all positively-rated items. A user who liked three sci-fi films will have a profile with a strong sci-fi signal. A user who liked romantic comedies will have a strong romance-comedy signal. The profile encodes taste without ever consulting other users.
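The averaging step can be sketched in a few lines. The feature columns and the three liked films' vectors below are invented for illustration, and `build_user_profile` is a hypothetical helper name.

```python
# Feature order (assumed for this example): action, sci-fi, romance, comedy
liked_item_vectors = [
    [1.0, 1.0, 0.0, 0.0],  # a sci-fi action film
    [0.0, 1.0, 0.0, 0.0],  # a pure sci-fi film
    [0.0, 1.0, 1.0, 0.0],  # a sci-fi romance
]

def build_user_profile(item_vectors):
    """Average the liked items' feature vectors, one feature at a time."""
    n_items = len(item_vectors)
    return [sum(column) / n_items for column in zip(*item_vectors)]

user_profile = build_user_profile(liked_item_vectors)
print(user_profile)
```

The sci-fi component ends up at 1.0 because all three liked films share it, while action and romance each settle at one third and comedy stays at zero.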

Cosine similarity

Once both items and users live in the same feature space, recommendation is a nearest-neighbor search. The similarity measure of choice is cosine similarity: the cosine of the angle between two vectors. In general it ranges from -1 (opposite directions) to 1 (identical direction); for non-negative features such as TF-IDF weights it stays between 0 and 1. Two vectors pointing in the same direction are similar even if one is much longer than the other. This matters because a profile accumulated from 200 rated films can have a very different magnitude from one built from 5, yet the direction of their taste is what we want to compare.
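A small sketch of the magnitude-invariance property, written in plain Python. The three profile vectors are hypothetical, and `cosine_sim` is a helper defined here, not a library function.

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (norm_u * norm_v)

light_user = [1.0, 2.0, 0.0]     # hypothetical profile
heavy_user = [40.0, 80.0, 0.0]   # same direction, 40x the magnitude
other_user = [0.0, 0.0, 3.0]     # no features in common

print(cosine_sim(light_user, heavy_user))  # approximately 1.0
print(cosine_sim(light_user, other_user))  # 0.0
```

Scaling a vector changes its length but not its angle, so the first two profiles are treated as the same taste.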

Vector geometry — item profiles and the user profile

The diagram below shows five item vectors in a two-dimensional feature space (imagine “sci-fi intensity” on the horizontal axis and “action intensity” on the vertical). The user profile is the average of the items the user liked (filled dots). The recommendation algorithm picks the item whose angle to the user profile vector is smallest — closest by cosine.

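[Figure: item vectors A (The Matrix), B (Interstellar), C (Arrival), D (Die Hard), and E (Titanic), plus the user profile vector, plotted with sci-fi intensity on the horizontal axis and action intensity on the vertical axis. Filled dots mark the items the user liked (A and B, used to build the profile). Closest angle wins.]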

Items A and B were liked; their average becomes the user profile vector (amber dashed). Item C is the next-closest by angle — the top recommendation. D and E are far away.

Strengths

Handles item cold-start. A brand-new item with zero ratings can be recommended as soon as it has metadata. No other users needed.

Explainable recommendations. Because the system knows why an item was recommended — it matched specific features — you can tell the user: “Because you liked sci-fi films with ensemble casts.” Collaborative filtering rarely offers this.
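One possible way to produce such an explanation is to look at which features contribute most to the match. The sketch below ranks features by their contribution to the dot product between the user profile and the item vector. The feature names, the vectors, and the `top_matching_features` helper are illustrative assumptions, not a standard API.

```python
def top_matching_features(user_profile, item_vector, feature_names, k=2):
    """Return the k features contributing most to the user-item dot product."""
    contributions = [
        (name, u * x)
        for name, u, x in zip(feature_names, user_profile, item_vector)
    ]
    # Keep only features that both the user and the item actually have.
    matched = [(name, c) for name, c in contributions if c > 0]
    matched.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in matched[:k]]

feature_names = ["sci-fi", "ensemble cast", "romance", "heist"]
user_profile = [0.9, 0.6, 0.1, 0.0]   # hypothetical taste weights
item_vector = [1.0, 1.0, 0.0, 1.0]    # hypothetical new film

reasons = top_matching_features(user_profile, item_vector, feature_names)
print("Because you liked films with: " + ", ".join(reasons))
```

Features the user cares about but the item lacks (romance here), or that the item has but the user has never shown interest in (heist), contribute nothing and are left out of the explanation.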

No popularity bias. Every item is judged by its features, not how many people have rated it. Niche items get a fair shot.

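Weaknesses

Filter bubble. Because every recommendation is chosen for its similarity to what the user already liked, the system tends to keep suggesting more of the same and rarely surfaces anything outside the user's established tastes.

Needs good features. If the item metadata is sparse, wrong, or missing, the system can’t do its job. Garbage features in, garbage recommendations out.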

Code: TF-IDF item profiles and cosine similarity
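A minimal sketch of this step, assuming scikit-learn is installed. The film titles are real, but the one-line descriptions are written for this example and stand in for whatever catalog text a real system would have.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Titles are real films; the descriptions are invented for this example.
titles = ["Gravity", "Interstellar", "The Martian", "Die Hard", "Titanic"]
descriptions = [
    "Two astronauts struggle to survive in space after debris destroys their shuttle",
    "Astronauts travel through a wormhole in space to find a new home for humanity",
    "An astronaut stranded on Mars must survive alone until a rescue mission arrives",
    "A New York cop battles terrorists who seize a Los Angeles skyscraper",
    "A young couple fall in love aboard a doomed ocean liner",
]

# Each row of tfidf_matrix is one item profile.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(descriptions)

# Use the liked item (Gravity, row 0) as the query.
query_index = 0
sims = cosine_similarity(tfidf_matrix[query_index], tfidf_matrix).flatten()

# Print the other films from most to least similar.
ranked = sorted(
    (i for i in range(len(titles)) if i != query_index),
    key=lambda i: sims[i],
    reverse=True,
)
for i in ranked:
    print(f"{titles[i]:15s} {sims[i]:.3f}")
```

Note that the vectorizer does no stemming, so "astronaut" and "astronauts" count as different terms. In this toy catalog The Martian therefore matches Gravity only through "survive", which is one reason real systems often add stemming, lemmatization, or structured tags.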

The scores should confirm your intuition: films that share distinctive vocabulary with “Gravity” (words like “astronaut,” “space,” “survive”) score high; films that share no relevant terms score near zero.

From item similarity to user profiles

In the snippet above, a single liked item serves as the query. In a real system, the user profile is the mean (or weighted mean) of TF-IDF vectors for all items the user has positively rated. This aggregated vector is then compared against every candidate item in the catalog, and the top-K by cosine similarity are surfaced as recommendations.

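# Conceptual sketch: assumes tfidf_matrix is the item-by-term TF-IDF matrix built in the snippet above
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

liked_indices = [0, 1]   # user liked Gravity and Interstellar
# .mean() on a sparse matrix returns np.matrix; convert it to a plain (1, n_terms) array
user_profile  = np.asarray(tfidf_matrix[liked_indices].mean(axis=0)).reshape(1, -1)
sims          = cosine_similarity(user_profile, tfidf_matrix).flatten()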
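
To go from similarity scores to a recommendation list, the already-liked items are typically masked out and the top-K remaining items are kept. The following is a sketch under stated assumptions: NumPy is installed, and the small dense `item_matrix`, the ratings used as weights, and the `recommend_top_k` helper are all hypothetical.

```python
import numpy as np

def recommend_top_k(item_matrix, liked_indices, weights, k=2):
    """Rank unseen items by cosine similarity to a weighted-mean user profile."""
    # Weighted mean of the liked items' vectors (weights could be ratings).
    user_profile = np.average(item_matrix[liked_indices], axis=0, weights=weights)
    # Cosine similarity = dot product divided by the product of the norms.
    norms = np.linalg.norm(item_matrix, axis=1) * np.linalg.norm(user_profile)
    sims = item_matrix @ user_profile / norms
    # Never recommend something the user has already rated.
    sims[liked_indices] = -np.inf
    return np.argsort(-sims, kind="stable")[:k].tolist()

# Hypothetical item profiles; columns: space, survival, action, romance
item_matrix = np.array([
    [1.0, 1.0, 0.0, 0.0],  # 0: liked, rated 5
    [1.0, 0.0, 0.0, 0.0],  # 1: liked, rated 4
    [1.0, 1.0, 0.2, 0.0],  # 2: candidate
    [0.5, 0.0, 1.0, 0.0],  # 3: candidate
    [0.0, 0.0, 0.0, 1.0],  # 4: candidate
])

top = recommend_top_k(item_matrix, liked_indices=[0, 1], weights=[5, 4], k=2)
print(top)  # [2, 3]
```

The sketch assumes no item has an all-zero vector; a production version would guard against division by zero and would usually work on sparse matrices instead of a dense array.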

Summary

Content-based filtering builds item profiles from features (TF-IDF over descriptions is a strong default), aggregates liked-item vectors into a user profile, and ranks candidates by cosine similarity to that profile. It solves the item cold-start problem and produces explainable recommendations, but risks creating a filter bubble and depends heavily on having rich, accurate item metadata.

Quick check

Q1. A streaming platform launches 50 new documentaries overnight. None has a single user rating yet. Which recommendation approach can immediately recommend relevant ones to the right viewers?
Q2. A user has rated 40 horror films highly and nothing else. After months of use, the recommender keeps suggesting only horror. Which weakness of content-based filtering does this illustrate?
Q3. Two movies share no words in their TF-IDF descriptions, but both belong to the 'sci-fi' genre encoded as a binary feature. If the TF-IDF vectorizer only sees the description text and ignores the genre field, what happens to their cosine similarity?

Practice this in an interview

How would you design a metric to evaluate the relevance of a content recommendation feed?

Feed relevance has no single ground-truth label, so it requires a tiered metric system: an implicit behavioural signal (long dwell time, saves, shares) as the online primary metric; an explicit user-satisfaction signal (thumbs-up/down, survey) as the periodic validation; and an offline ranking metric (NDCG computed from historical high-engagement items) for fast model iteration. The three tiers must converge to be trusted.

How would you design a metric to measure the quality of a search feature inside an e-commerce app?

Search quality has two sides: relevance (did results match intent?) and utility (did the user accomplish their goal?). A good metric system combines an offline relevance signal — such as NDCG computed against human-labelled queries — with an online behavioural signal — such as click-through rate at rank 1 and zero-result rate — tied to a downstream business outcome like add-to-cart rate.

What are filter, wrapper, and embedded feature selection methods, and when do you use each?

Filter methods score features independently of the model using statistics like mutual information or correlation; they are fast but ignore feature interactions. Wrapper methods search subsets by actually training the model, finding better subsets at high computational cost. Embedded methods perform selection during training — LASSO and tree-based feature importances are the most common — offering a balance of quality and speed.

What is hybrid search and when should you use semantic vs keyword retrieval?

Keyword search (BM25) excels at exact term matching: product codes, proper nouns, rare abbreviations. Semantic search (dense embeddings) captures meaning and handles paraphrases. Hybrid search runs both in parallel and merges the two ranked lists, commonly with Reciprocal Rank Fusion, giving the best of both worlds for most production RAG systems.
