datarekha

Item-based collaborative filtering

User-based CF breaks at scale. Item-based CF flips the axis — compare items instead of users — and lets you precompute similarities offline. This is the engine behind Amazon's 'customers who bought this also bought.'

9 min read Intermediate Recommender Systems Lesson 5 of 11

What you'll learn

  • Why item-item similarity is more stable and precomputable than user-user similarity
  • How to score recommendations: weighted sum of a user's own ratings
  • The cold-start and popularity-bias limits that still apply

Before you start

The axis flip: from user-user to item-item

Recall the utility matrix — rows are users, columns are items, cells are ratings (or implicit signals like clicks).

User-based CF reads across rows: find rows (users) whose rating patterns look like the target user’s row.

Item-based CF reads down columns: find columns (items) whose rating patterns are similar to each other. Two items are similar if the same users tend to rate them in the same direction — both get high marks together, or both get low marks together.

The key insight: once you have item-item similarities, you don’t need to find similar users at all. You look at what the target user has already rated, find the items most similar to those, and surface them as recommendations. The user’s own ratings become the scoring weights.

Why this scales and stays stable

Precomputable offline. Item-item similarities depend only on the rating matrix — not on who the current user is. You compute the full similarity table once per day (or per week), store it, and at request time you just do a lookup plus a short weighted sum. A user-based system must compute similarities on the fly for each incoming user, which does not scale.

Fewer items than users. Most platforms have millions of users but thousands (or low tens of thousands) of items. The similarity matrix you need to store and compute is item-count squared, not user-count squared — orders of magnitude smaller.

Stable relationships. Whether a thriller novel is similar to another thriller changes slowly as new ratings arrive. Whether two randomly chosen users are similar can flip overnight as their tastes diverge. Item-item similarities converge faster and stay reliable longer.

Denser co-rating. For two items to be compared, at least some users must have rated both. Popular items have been rated by thousands of users, giving a rich, low-variance similarity estimate. For two users to be compared, they must have rated the same items — which becomes sparse as the catalog grows.

How recommendations are generated

Given a target user u and a candidate item j (that u has not yet rated), the predicted score is:

score(u, j) = Σ_i [ sim(i, j) × rating(u, i) ]  /  Σ_i |sim(i, j)|

where the sum is over items i that u has rated and that are similar to j. In plain English: the predicted rating for j is a weighted average of u’s ratings on items similar to j, weighted by how similar each is to j.

The item-item similarity (first use of jargon: the scalar between 0 and 1, or -1 and 1, measuring how alike two items are in rating space) is typically cosine similarity computed from the rating columns.

Precompute offline (first use) means you run this computation in a batch job — not during the live request — so the result is cached and served instantly.

Diagram: item similarity graph and recommendation

Item A(liked, rated 5)Item Bsim = 0.91Item Csim = 0.78Item Dsim = 0.550.910.780.55Recommendations1. Item B (highest sim)2. Item C 3. Item DUser liked Item A → nearest neighbours ranked by similarity → recommend
Item A is the target user’s liked item. Dashed edges show precomputed cosine similarities. The top-k neighbours become the recommendation list.

Contrast with user-based CF

AxisUser-based CFItem-based CF
Similarity computed betweenUsersItems
Computed whenAt request time (or expensively cached)Offline, in batch
Scales to millions of usersPoorlyWell
Signal stabilityNoisy; user tastes shiftStable; item relationships persist
Cold start problemNew user with no ratingsNew item with no ratings
Classic exampleSocial recommendations (“your friends liked”)“Customers who bought this also bought”

The code: cosine item-item similarity from scratch

The output will show that items rated highly by the same cluster of users cluster together in similarity space — exactly the stable, precomputable signal item-based CF relies on.

What “Amazon also-bought” actually does

The paper “Amazon.com Recommendations: Item-to-Item Collaborative Filtering (Linden, Smith, York — IEEE Internet Computing, 2003) describes this pipeline almost exactly:

  1. Offline nightly job: scan all purchase histories, compute item-item co-occurrence and similarity.
  2. Store a similarity table: for each item, keep the top-N most similar items.
  3. At request time: look up the items in the current user’s history, union their top-N neighbour lists, score by similarity weighted by the user’s own purchase frequency, return ranked list.

The entire online step is a handful of table lookups and a sort — trivially fast even under enormous traffic. That is why item-based CF replaced memory-based user-CF as the default approach at scale.

Quick check

Quick check

0/3
Q1Why can item-item similarity be precomputed offline, but user-user similarity typically cannot?
Q2In item-based CF, the predicted score for item j is a weighted average of the target user's ratings on items similar to j. What are the weights?
Q3A startup launches a new product on its platform. The item-based CF system recommends it to almost no one for the first week. What is the most likely cause, and what is a practical mitigation?

Practice this in an interview

All questions
How would you design a metric to evaluate the relevance of a content recommendation feed?

Feed relevance has no single ground-truth label, so it requires a tiered metric system: an implicit behavioural signal (long dwell time, saves, shares) as the online primary metric; an explicit user-satisfaction signal (thumbs-up/down, survey) as the periodic validation; and an offline ranking metric (NDCG computed from historical high-engagement items) for fast model iteration. The three tiers must converge to be trusted.

How would you design a metric to measure the quality of a search feature inside an e-commerce app?

Search quality has two sides: relevance (did results match intent?) and utility (did the user accomplish their goal?). A good metric system combines an offline relevance signal — such as NDCG computed against human-labelled queries — with an online behavioural signal — such as click-through rate at rank 1 and zero-result rate — tied to a downstream business outcome like add-to-cart rate.

What is hybrid search and when should you use semantic vs keyword retrieval?

Keyword search (BM25) excels at exact term matching — product codes, proper nouns, rare abbreviations. Semantic search (dense embeddings) captures meaning and handles paraphrases. Hybrid search runs both in parallel and merges scores with Reciprocal Rank Fusion, giving the best of both worlds for most production RAG systems.

What are filter, wrapper, and embedded feature selection methods, and when do you use each?

Filter methods score features independently of the model using statistics like mutual information or correlation; they are fast but ignore feature interactions. Wrapper methods search subsets by actually training the model, finding better subsets at high computational cost. Embedded methods perform selection during training — LASSO and tree-based feature importances are the most common — offering a balance of quality and speed.

Sign in to track your progress

Completed lessons, your XP, level, and streak save to your account — it's free and takes a few seconds.

Explore further

Related lessons

Skip to content