Item-based collaborative filtering
User-based CF breaks at scale. Item-based CF flips the axis — compare items instead of users — and lets you precompute similarities offline. This is the engine behind Amazon's 'customers who bought this also bought.'
What you'll learn
- Why item-item similarity is more stable and precomputable than user-user similarity
- How to score recommendations: weighted sum of a user's own ratings
- The cold-start and popularity-bias limits that still apply
Before you start
The axis flip: from user-user to item-item
Recall the utility matrix — rows are users, columns are items, cells are ratings (or implicit signals like clicks).
User-based CF reads across rows: find rows (users) whose rating patterns look like the target user’s row.
Item-based CF reads down columns: find columns (items) whose rating patterns are similar to each other. Two items are similar if the same users tend to rate them in the same direction — both get high marks together, or both get low marks together.
The key insight: once you have item-item similarities, you don’t need to find similar users at all. You look at what the target user has already rated, find the items most similar to those, and surface them as recommendations. The user’s own ratings become the scoring weights.
Why this scales and stays stable
Precomputable offline. Item-item similarities depend only on the rating matrix — not on who the current user is. You compute the full similarity table once per day (or per week), store it, and at request time you just do a lookup plus a short weighted sum. A user-based system must compute similarities on the fly for each incoming user, which does not scale.
Fewer items than users. Most platforms have millions of users but thousands (or low tens of thousands) of items. The similarity matrix you need to store and compute is item-count squared, not user-count squared — orders of magnitude smaller.
Stable relationships. Whether a thriller novel is similar to another thriller changes slowly as new ratings arrive. Whether two randomly chosen users are similar can flip overnight as their tastes diverge. Item-item similarities converge faster and stay reliable longer.
Denser co-rating. For two items to be compared, at least some users must have rated both. Popular items have been rated by thousands of users, giving a rich, low-variance similarity estimate. For two users to be compared, they must have rated the same items — which becomes sparse as the catalog grows.
How recommendations are generated
Given a target user u and a candidate item j (that u has not yet rated), the predicted score is:
score(u, j) = Σ_i [ sim(i, j) × rating(u, i) ] / Σ_i |sim(i, j)|
where the sum is over items i that u has rated and that are similar to j. In plain English: the predicted rating for j is a weighted average of u’s ratings on items similar to j, weighted by how similar each is to j.
The item-item similarity (first use of jargon: the scalar between 0 and 1, or -1 and 1, measuring how alike two items are in rating space) is typically cosine similarity computed from the rating columns.
Precompute offline (first use) means you run this computation in a batch job — not during the live request — so the result is cached and served instantly.
Diagram: item similarity graph and recommendation
Contrast with user-based CF
| Axis | User-based CF | Item-based CF |
|---|---|---|
| Similarity computed between | Users | Items |
| Computed when | At request time (or expensively cached) | Offline, in batch |
| Scales to millions of users | Poorly | Well |
| Signal stability | Noisy; user tastes shift | Stable; item relationships persist |
| Cold start problem | New user with no ratings | New item with no ratings |
| Classic example | Social recommendations (“your friends liked”) | “Customers who bought this also bought” |
The code: cosine item-item similarity from scratch
The output will show that items rated highly by the same cluster of users cluster together in similarity space — exactly the stable, precomputable signal item-based CF relies on.
What “Amazon also-bought” actually does
The paper “Amazon.com Recommendations: Item-to-Item Collaborative Filtering” (Linden, Smith, York — IEEE Internet Computing, 2003) describes this pipeline almost exactly:
- Offline nightly job: scan all purchase histories, compute item-item co-occurrence and similarity.
- Store a similarity table: for each item, keep the top-N most similar items.
- At request time: look up the items in the current user’s history, union their top-N neighbour lists, score by similarity weighted by the user’s own purchase frequency, return ranked list.
The entire online step is a handful of table lookups and a sort — trivially fast even under enormous traffic. That is why item-based CF replaced memory-based user-CF as the default approach at scale.
Quick check
Quick check
Practice this in an interview
All questionsFeed relevance has no single ground-truth label, so it requires a tiered metric system: an implicit behavioural signal (long dwell time, saves, shares) as the online primary metric; an explicit user-satisfaction signal (thumbs-up/down, survey) as the periodic validation; and an offline ranking metric (NDCG computed from historical high-engagement items) for fast model iteration. The three tiers must converge to be trusted.
Search quality has two sides: relevance (did results match intent?) and utility (did the user accomplish their goal?). A good metric system combines an offline relevance signal — such as NDCG computed against human-labelled queries — with an online behavioural signal — such as click-through rate at rank 1 and zero-result rate — tied to a downstream business outcome like add-to-cart rate.
Keyword search (BM25) excels at exact term matching — product codes, proper nouns, rare abbreviations. Semantic search (dense embeddings) captures meaning and handles paraphrases. Hybrid search runs both in parallel and merges scores with Reciprocal Rank Fusion, giving the best of both worlds for most production RAG systems.
Filter methods score features independently of the model using statistics like mutual information or correlation; they are fast but ignore feature interactions. Wrapper methods search subsets by actually training the model, finding better subsets at high computational cost. Embedded methods perform selection during training — LASSO and tree-based feature importances are the most common — offering a balance of quality and speed.