User-based collaborative filtering
How 'people similar to you liked X' works — finding a neighborhood of like-minded users and using their ratings to predict yours.
What you'll learn
- How to measure user–user similarity from a ratings matrix using cosine or Pearson correlation
- How to predict a missing rating as a similarity-weighted average of neighbor ratings (with mean-centering)
- Strengths and weaknesses: no item features needed, but sparsity and scalability are real limits
Why behavior alone is enough
Every time a user rates an item, clicks on something, or adds it to a playlist, they are revealing something about their taste. When two users have left very similar trails of behavior — liking the same obscure things, disliking the same popular ones — their future preferences are likely to overlap too.
Collaborative filtering (CF) is the family of techniques that exploits this idea: use the collective behavior of many users to filter items for any one user, without ever looking at what those items actually are. No genre labels, no plot summaries, no product descriptions — pure signal from human choices.
User-based CF is the most direct form: find the users most similar to the target user, then let their ratings speak for the items the target has not yet seen.
Step 1 — Build the ratings matrix
Start with a utility matrix (rows = users, columns = items, cells = ratings or NaN when unobserved). This is the input to every step that follows.
         Item A   Item B   Item C   Item D
Alice    5        3        NaN      1
Bob      4        NaN      4        1
Carol    NaN      2        5        2
Dave     5        4        NaN      NaN
In this toy example 5 of the 16 cells are empty; in a real system the overwhelming majority are. This sparsity is the defining challenge of the whole problem.
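As a minimal sketch, the utility matrix above can be assembled from raw (user, item, rating) triples. The triples mirror the table; the names `ratings`, `matrix`, and `sparsity` are illustrative choices, not part of any library.

```python
import math

# (user, item, rating) triples mirroring the toy table above.
ratings = [
    ("Alice", "A", 5), ("Alice", "B", 3), ("Alice", "D", 1),
    ("Bob", "A", 4), ("Bob", "C", 4), ("Bob", "D", 1),
    ("Carol", "B", 2), ("Carol", "C", 5), ("Carol", "D", 2),
    ("Dave", "A", 5), ("Dave", "B", 4),
]

users = sorted({u for u, _, _ in ratings})
items = sorted({i for _, i, _ in ratings})

# One row per user, one column per item, NaN where nothing was observed.
matrix = {u: {i: math.nan for i in items} for u in users}
for user, item, rating in ratings:
    matrix[user][item] = float(rating)

# Fraction of cells with no observed rating.
sparsity = 1 - len(ratings) / (len(users) * len(items))

for u in users:
    print(u, [matrix[u][i] for i in items])
print(f"sparsity = {sparsity:.2%}")  # sparsity = 31.25%
```

At real scale a dense structure like this wastes memory on empty cells, so a sparse representation (for example, storing only the observed triples) is the usual choice.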
Step 2 — Measure user–user similarity
For every pair of users, compute how similar their rating vectors are. Two popular measures:
Cosine similarity treats each user’s ratings as a vector in item-space and measures the angle between them. Only items both users have rated contribute to the dot product (co-rated items).
Pearson correlation does the same after subtracting each user’s mean rating first, which corrects for the fact that some people rate everything 4–5 and others use the full 1–5 scale. In practice, Pearson often performs better in CF because it removes this per-user rating bias.
The output is a user–user similarity matrix where entry (u, v) is a score in [-1, 1]. Pearson uses the full range; raw cosine on non-negative ratings stays within [0, 1].
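The two measures can be sketched as follows. This is one possible reading of the definitions above, with two assumptions worth flagging: the vector norms are computed over co-rated items only, and the Pearson variant centers on each user's mean over all items they rated (some implementations use the mean over co-rated items instead). The function names are illustrative.

```python
import math

# Each user is a dict of item -> rating; a missing key means "not rated".
alice = {"A": 5, "B": 3, "D": 1}
bob = {"A": 4, "C": 4, "D": 1}
carol = {"B": 2, "C": 5, "D": 2}

def _cosine_on(u, v, shared):
    """Cosine of two rating dicts restricted to the shared items."""
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in shared))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: treat an undefined similarity as 0
    return dot / (norm_u * norm_v)

def cosine(u, v):
    shared = u.keys() & v.keys()  # co-rated items
    return _cosine_on(u, v, shared) if shared else 0.0

def pearson(u, v):
    shared = u.keys() & v.keys()
    if not shared:
        return 0.0
    mean_u = sum(u.values()) / len(u)
    mean_v = sum(v.values()) / len(v)
    # Subtract each user's mean, then reuse the cosine computation.
    cu = {i: u[i] - mean_u for i in shared}
    cv = {i: v[i] - mean_v for i in shared}
    return _cosine_on(cu, cv, shared)

print(round(cosine(alice, bob), 3))   # 0.999
print(round(pearson(alice, bob), 3))  # 0.949
```

Alice and Bob share only two items, so both scores rest on very little evidence; this is the sparsity problem showing up in miniature.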
Step 3 — Select the neighborhood
For a target user u, sort all other users by their similarity to u and keep the top-k most similar. This group is called the neighborhood — or k nearest neighbors (kNN).
Choosing k involves a trade-off: too small and there is not enough signal; too large and dissimilar users start to dilute the prediction.
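Neighborhood selection can be sketched in a few lines. The similarity scores below are made-up values for illustration (including a hypothetical user Eve), and dropping non-positive similarities via `min_sim` is an extra assumption that the text does not require.

```python
import heapq

def top_k_neighbors(similarities, k, min_sim=0.0):
    """Return the k most similar users as (user, similarity) pairs.

    similarities: dict of user -> similarity to the target user.
    min_sim: assumption, users at or below this score are ignored.
    """
    candidates = [(v, s) for v, s in similarities.items() if s > min_sim]
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])

# Hypothetical similarity scores to a target user.
sims_to_target = {"Bob": 0.95, "Carol": 0.71, "Dave": 0.60, "Eve": -0.20}

neighbors = top_k_neighbors(sims_to_target, k=2)
print(neighbors)  # [('Bob', 0.95), ('Carol', 0.71)]
```

Raising k to 3 would admit Dave; Eve is excluded at any k because her similarity is negative.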
Step 4 — Predict the missing rating
For item i that the target user u has not rated, collect all neighbors who have rated i. Then compute a similarity-weighted average:
predicted(u, i) = mean(u) + ( sum_v [ sim(u,v) * (rating(v,i) - mean(v)) ] ) / ( sum_v [ |sim(u,v)| ] )
where v ranges over the neighbors of u who have rated item i.
The subtraction of each neighbor’s mean rating — mean-centering — is the key detail. If User B generously rates everything 4 or 5, their raw rating of 4 for Item X is actually lukewarm relative to their usual. Mean-centering converts their rating into a deviation (above or below their baseline), which is a more honest signal. We add the target’s own mean back at the end to put the prediction in their personal scale.
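To see the effect of mean-centering, here is a small worked example with made-up numbers: a generous neighbor (mean 4.5) and a stricter one (mean 2.5) both give the item a raw 4. Falling back to the target's own mean when no neighbor has rated the item is an assumption of this sketch.

```python
def predict(target_mean, neighbors):
    """Mean-centered, similarity-weighted prediction.

    neighbors: list of (similarity, rating_for_item, neighbor_mean).
    """
    numerator = sum(sim * (rating - mean) for sim, rating, mean in neighbors)
    denominator = sum(abs(sim) for sim, _, _ in neighbors)
    if denominator == 0:
        return target_mean  # assumption: no usable neighbors, use baseline
    return target_mean + numerator / denominator

# Hypothetical neighbors: (similarity, rating for the item, own mean).
neighbors = [
    (0.9, 4, 4.5),  # generous rater: a 4 is 0.5 BELOW their usual
    (0.6, 4, 2.5),  # strict rater: a 4 is 1.5 ABOVE their usual
]

prediction = predict(target_mean=3.0, neighbors=neighbors)
print(f"{prediction:.2f}")  # 3.30
```

A plain weighted average of the raw ratings would return 4.0 here. Mean-centering instead yields 3.0 + (0.9 * -0.5 + 0.6 * 1.5) / 1.5 = 3.3, because the generous neighbor's 4 counts as a mild negative signal.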
Strengths of user-based CF
- No item features needed. The algorithm works on any domain — movies, songs, products, research papers — without a single word of description about the items themselves.
- Discovers non-obvious connections. Two users might both love a niche documentary and a pop album; a content-based system is unlikely to link those items, but CF surfaces the connection directly from co-occurring ratings.
- Transparent rationale. “Because users similar to you liked it” is an explanation users intuitively understand.
Weaknesses
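- Sparsity. When two users share only a few co-rated items, their similarity score rests on very little evidence and becomes unreliable.
- Scalability. Comparing every pair of users costs O(U²) similarity computations for U users, which becomes a wall as the user base grows.
- User cold-start. A brand-new user has no rating history, so their neighborhood is empty and no prediction is possible.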
Code — user-user cosine similarity and prediction
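A minimal end-to-end sketch of such a cell, assuming the toy ratings from Step 1, cosine similarity over co-rated items, and a fallback to the user's own mean when no neighbor has rated the item. The names `R`, `cosine`, and `predict` are illustrative.

```python
import math

# Toy ratings from Step 1; a missing key means "not rated" (NaN).
R = {
    "Alice": {"A": 5, "B": 3, "D": 1},
    "Bob":   {"A": 4, "C": 4, "D": 1},
    "Carol": {"B": 2, "C": 5, "D": 2},
    "Dave":  {"A": 5, "B": 4},
}

def mean(u):
    return sum(R[u].values()) / len(R[u])

def cosine(u, v):
    """Cosine similarity over the items both users have rated."""
    shared = R[u].keys() & R[v].keys()
    if not shared:
        return 0.0
    dot = sum(R[u][i] * R[v][i] for i in shared)
    norm_u = math.sqrt(sum(R[u][i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(R[v][i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def predict(u, item, k):
    # Rank all other users by similarity and keep the top k.
    ranked = sorted(((cosine(u, v), v) for v in R if v != u), reverse=True)
    neighborhood = ranked[:k]
    # Only neighbors who actually rated the item can contribute.
    raters = [(s, v) for s, v in neighborhood if item in R[v]]
    denominator = sum(abs(s) for s, _ in raters)
    if denominator == 0:
        return mean(u)  # assumption: fall back to the user's baseline
    numerator = sum(s * (R[v][item] - mean(v)) for s, v in raters)
    return mean(u) + numerator / denominator

for v in ("Bob", "Carol", "Dave"):
    print(f"sim(Alice, {v}) = {cosine('Alice', v):.3f}")
for k in (1, 2, 3):
    print(f"k={k}: predicted(Alice, C) = {predict('Alice', 'C', k):.3f}")
```

With k=1 or k=2 only Bob contributes (Dave ranks second but has not rated Item C), so the prediction is 4.0. With k=3 Carol joins the neighborhood and the prediction rises to about 4.47.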
Run the cell and observe:
- The similarity scores reflect how much each neighbor’s taste overlaps with the target user’s.
- The prediction lands near the ratings of the most similar neighbors (weighted by closeness, adjusted for each neighbor’s baseline generosity).
- Changing k — commenting out lower-similarity neighbors — shifts the prediction.
Summary
User-based collaborative filtering works by:
- Treating each user’s ratings as a vector.
- Computing pairwise similarity (cosine or Pearson) across co-rated items.
- Selecting a neighborhood of the top-k most similar users.
- Predicting a missing rating as a mean-centered, similarity-weighted average of the neighbors’ ratings.
Its elegance is that it needs zero knowledge about what the items actually are. Its weak points are sparsity (unreliable similarities with few co-rated items) and the O(U²) cost of comparing all U users pairwise, the two forces that push production systems toward item-based CF and latent-factor models.