datarekha

User-based collaborative filtering

How 'people similar to you liked X' works — finding a neighborhood of like-minded users and using their ratings to predict yours.

9 min read Intermediate Recommender Systems Lesson 4 of 11

What you'll learn

  • How to measure user–user similarity from a ratings matrix using cosine or Pearson correlation
  • How to predict a missing rating as a similarity-weighted average of neighbor ratings (with mean-centering)
  • Strengths and weaknesses: no item features needed, but sparsity and scalability are real limits

Why behavior alone is enough

Every time a user rates an item, clicks on something, or adds it to a playlist, they are revealing something about their taste. When two users have left very similar trails of behavior — liking the same obscure things, disliking the same popular ones — their future preferences are likely to overlap too.

Collaborative filtering (CF) is the family of techniques that exploits this idea: use the collective behavior of many users to filter items for any one user, without ever looking at what those items actually are. No genre labels, no plot summaries, no product descriptions — pure signal from human choices.

User-based CF is the most direct form: find the users most similar to the target user, then let their ratings speak for the items the target has not yet seen.


Step 1 — Build the ratings matrix

Start with a utility matrix (rows = users, columns = items, cells = ratings or NaN when unobserved). This is the input to every step that follows.

         Item A  Item B  Item C  Item D
Alice       5       3      NaN     1
Bob         4       NaN     4      1
Carol       NaN     2       5      2
Dave        5       4      NaN    NaN

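In a real system most cells are empty; this toy matrix, with 5 of its 16 cells missing, is far denser than anything seen in production. This sparsity is the defining challenge of the whole problem.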
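As a minimal sketch, the toy matrix can be stored sparsely as a dictionary of dictionaries, keeping only the observed ratings so that a missing key stands in for NaN. The variable names (ratings, sparsity) are illustrative choices, not part of the lesson.

```python
# Toy utility matrix from the table above, stored sparsely:
# only observed ratings are kept, so a missing key plays the role of NaN.
ratings = {
    "Alice": {"Item A": 5, "Item B": 3, "Item D": 1},
    "Bob":   {"Item A": 4, "Item C": 4, "Item D": 1},
    "Carol": {"Item B": 2, "Item C": 5, "Item D": 2},
    "Dave":  {"Item A": 5, "Item B": 4},
}

# Collect every item that at least one user has rated.
items = sorted({item for user_ratings in ratings.values() for item in user_ratings})

n_cells = len(ratings) * len(items)            # users x items
n_observed = sum(len(r) for r in ratings.values())
sparsity = 1 - n_observed / n_cells            # fraction of empty cells

print(f"{n_observed} of {n_cells} cells observed, sparsity = {sparsity:.4f}")
# prints: 11 of 16 cells observed, sparsity = 0.3125
```

The sparse layout is a design choice that mirrors real data: memory grows with the number of ratings, not with users times items.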


Step 2 — Measure user–user similarity

For every pair of users, compute how similar their rating vectors are. Two popular measures:

Cosine similarity treats each user’s ratings as a vector in item-space and measures the angle between them. Only items both users have rated contribute to the dot product (co-rated items).

Pearson correlation does the same after subtracting each user’s mean rating first, which corrects for the fact that some people rate everything 4–5 and others use the full 1–5 scale. In practice, Pearson often performs better in CF because it removes this per-user rating bias.

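The output is a user–user similarity matrix where entry (u, v) is the similarity between users u and v. Pearson scores lie in [-1, 1]; cosine computed on raw, non-negative ratings lies in [0, 1].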
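Both measures can be sketched in a few lines on the toy matrix from Step 1. Two details here are assumptions rather than something the lesson fixes: the cosine norms are restricted to co-rated items, and the Pearson-style score centers each user on the mean of all their ratings (some formulations use the mean over co-rated items only). The function names are illustrative.

```python
from math import sqrt

# Toy ratings from Step 1 (missing key = unrated).
ratings = {
    "Alice": {"Item A": 5, "Item B": 3, "Item D": 1},
    "Bob":   {"Item A": 4, "Item C": 4, "Item D": 1},
    "Carol": {"Item B": 2, "Item C": 5, "Item D": 2},
    "Dave":  {"Item A": 5, "Item B": 4},
}

def mean_rating(user):
    """Mean of all ratings the user has given."""
    values = ratings[user].values()
    return sum(values) / len(values)

def co_rated(u, v):
    """Items rated by both users."""
    return sorted(ratings[u].keys() & ratings[v].keys())

def cosine_sim(u, v):
    """Cosine of the angle between the two users' co-rated rating vectors."""
    common = co_rated(u, v)
    if not common:
        return 0.0  # assumption: no overlap means no evidence of similarity
    dot = sum(ratings[u][i] * ratings[v][i] for i in common)
    norm_u = sqrt(sum(ratings[u][i] ** 2 for i in common))
    norm_v = sqrt(sum(ratings[v][i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def pearson_sim(u, v):
    """Cosine of mean-centered ratings over co-rated items."""
    common = co_rated(u, v)
    mu_u, mu_v = mean_rating(u), mean_rating(v)
    dev_u = [ratings[u][i] - mu_u for i in common]
    dev_v = [ratings[v][i] - mu_v for i in common]
    denom = sqrt(sum(d * d for d in dev_u)) * sqrt(sum(d * d for d in dev_v))
    if denom == 0:
        return 0.0  # assumption: undefined correlation is treated as 0
    return sum(a * b for a, b in zip(dev_u, dev_v)) / denom

print(round(cosine_sim("Alice", "Bob"), 3))   # 0.999
print(round(pearson_sim("Alice", "Bob"), 3))  # 0.949
```

Alice and Bob share only two items, so both scores rest on very little evidence. Raw cosine is also high for almost any pair here because all ratings are positive, which is one reason mean-centering helps.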


Step 3 — Select the neighborhood

For a target user u, sort all other users by their similarity to u and keep the top-k most similar. This group is called the neighborhood — or k nearest neighbors (kNN).

Choosing k involves a trade-off: too small and there is not enough signal; too large and dissimilar users start to dilute the prediction.
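Neighborhood selection is a top-k query over one row of the similarity matrix. The sketch below assumes the similarities to the target user are already computed; the user names and scores are illustrative, and the min_sim threshold (dropping non-positive similarities) is an added assumption, not something the lesson prescribes.

```python
import heapq

def top_k_neighbors(similarities, k, min_sim=0.0):
    """Return the k (user, similarity) pairs with the highest similarity.

    similarities: dict mapping other users to their similarity with the target.
    min_sim: hypothetical cutoff; users at or below it are never selected.
    """
    candidates = [(user, sim) for user, sim in similarities.items() if sim > min_sim]
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])

# Illustrative similarity scores between the target user and five others.
sims_to_target = {
    "User A": 0.91,
    "User B": 0.78,
    "User C": 0.65,
    "User D": 0.12,
    "User E": -0.30,
}

neighborhood = top_k_neighbors(sims_to_target, k=3)
print(neighborhood)
# prints: [('User A', 0.91), ('User B', 0.78), ('User C', 0.65)]
```

Raising k to 5 would still return only four users, because User E falls below the cutoff.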

[Figure: three neighbors (kNN), User A (sim = 0.91), User B (sim = 0.78), and User C (sim = 0.65), with their ratings for Item X (labeled 4, 5, and 3 in the diagram), flow into a weighted-average prediction for the target user.]

Step 4 — Predict the missing rating

For item i that the target user u has not rated, collect all neighbors who have rated i. Then compute a similarity-weighted average:

predicted(u, i) = mean(u) + sum_v [ sim(u,v) * (rating(v,i) - mean(v)) ]
                             / sum_v [ |sim(u,v)| ]

The subtraction of each neighbor’s mean rating — mean-centering — is the key detail. If User B generously rates everything 4 or 5, their raw rating of 4 for Item X is actually lukewarm relative to their usual. Mean-centering converts their rating into a deviation (above or below their baseline), which is a more honest signal. We add the target’s own mean back at the end to put the prediction in their personal scale.
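The formula can be checked by hand with a small sketch. All numbers below (similarities, ratings, and means) are hypothetical, chosen so that the second neighbor is a generous rater whose 4 is below their own baseline. The fallback to the target's mean when no neighbor rated the item is also an assumption.

```python
def predict_rating(target_mean, neighbors):
    """Mean-centered, similarity-weighted prediction.

    neighbors: list of (similarity, neighbor_rating, neighbor_mean) tuples,
    one per neighbor who has rated the item.
    """
    numerator = sum(sim * (rating - mean) for sim, rating, mean in neighbors)
    denominator = sum(abs(sim) for sim, _, _ in neighbors)
    if denominator == 0:
        return target_mean  # assumption: fall back to the target's own mean
    return target_mean + numerator / denominator

# Hypothetical neighbors: (similarity, rating for the item, neighbor's mean rating)
neighbors = [
    (0.9, 4, 3.0),  # rates 1.0 above their baseline
    (0.8, 4, 4.5),  # generous rater: the same 4 is 0.5 below their baseline
    (0.5, 3, 2.5),  # rates 0.5 above their baseline
]

prediction = predict_rating(target_mean=3.5, neighbors=neighbors)
print(round(prediction, 2))  # 3.84
```

The weighted deviations are 0.9, -0.4, and 0.25, which sum to 0.75. Dividing by the total weight 2.2 gives about 0.34, and adding the target's mean of 3.5 gives about 3.84.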


Strengths of user-based CF

  • No item features needed. The algorithm works on any domain — movies, songs, products, research papers — without a single word of description about the items themselves.
  • Discovers non-obvious connections. Two users might both love a niche documentary and a pop album; a content-based system is unlikely to link those items because their features have little in common, but CF surfaces the connection from the shared ratings.
  • Transparent rationale. “Because users similar to you liked it” is an explanation users intuitively understand.

Weaknesses

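  • Sparsity. When two users share only a few co-rated items, their similarity score rests on very little evidence and becomes unreliable.
  • Scalability. Comparing every user with every other user takes O(U²) similarity computations for U users, which becomes impractical as the user base grows.
  • User cold-start. A brand-new user has no rating history, so no similarities can be computed, their neighborhood is empty, and no prediction is possible.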


Code — user-user cosine similarity and prediction
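What follows is a minimal end-to-end sketch, not a production implementation. It assumes the toy matrix from Step 1, cosine similarity with norms restricted to co-rated items, and a fallback to the target's own mean when no neighbor has rated the item. The function names and the choice of predicting Alice's rating for Item C are illustrative.

```python
from math import sqrt

# Toy ratings from Step 1 (missing key = unrated).
ratings = {
    "Alice": {"Item A": 5, "Item B": 3, "Item D": 1},
    "Bob":   {"Item A": 4, "Item C": 4, "Item D": 1},
    "Carol": {"Item B": 2, "Item C": 5, "Item D": 2},
    "Dave":  {"Item A": 5, "Item B": 4},
}

def mean_rating(user):
    """Mean of all ratings the user has given."""
    values = ratings[user].values()
    return sum(values) / len(values)

def cosine_sim(u, v):
    """Cosine similarity over the items both users have rated."""
    common = ratings[u].keys() & ratings[v].keys()
    if not common:
        return 0.0  # assumption: no overlap means similarity 0
    dot = sum(ratings[u][i] * ratings[v][i] for i in common)
    norm_u = sqrt(sum(ratings[u][i] ** 2 for i in common))
    norm_v = sqrt(sum(ratings[v][i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def predict(target, item, k):
    """Predict target's rating for item from the top-k most similar users."""
    # Step 2: similarity of the target to every other user.
    sims = {v: cosine_sim(target, v) for v in ratings if v != target}
    # Step 3: keep the k most similar users as the neighborhood.
    neighborhood = sorted(sims, key=sims.get, reverse=True)[:k]
    # Step 4: only neighbors who actually rated the item can contribute.
    raters = [v for v in neighborhood if item in ratings[v]]
    denominator = sum(abs(sims[v]) for v in raters)
    if denominator == 0:
        return mean_rating(target)  # assumption: fall back to the target's mean
    numerator = sum(sims[v] * (ratings[v][item] - mean_rating(v)) for v in raters)
    return mean_rating(target) + numerator / denominator

for k in (2, 3):
    print(k, round(predict("Alice", "Item C", k), 2))
# prints:
# 2 4.0
# 3 4.47
```

With k = 2 the neighborhood is Bob and Dave, and only Bob has rated Item C, so the prediction follows Bob's deviation alone. With k = 3 Carol joins and her higher rating pulls the prediction up.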

Run the cell and observe:

  • The similarity scores reflect how much each neighbor’s taste overlaps with the target user’s.
  • The prediction lands near the ratings of the most similar neighbors (weighted by closeness, adjusted for each neighbor’s baseline generosity).
  • Changing k — commenting out lower-similarity neighbors — shifts the prediction.

Summary

User-based collaborative filtering works by:

  1. Treating each user’s ratings as a vector.
  2. Computing pairwise similarity (cosine or Pearson) across co-rated items.
  3. Selecting a neighborhood of the top-k most similar users.
  4. Predicting a missing rating as a mean-centered, similarity-weighted average of the neighbors’ ratings.

Its elegance is that it needs zero knowledge about what the items actually are. Its main weaknesses are sparsity (similarities are unreliable when users share few co-rated items) and the O(U²) cost of comparing every pair of U users, the two pressures that push production systems toward item-based CF and latent-factor models.


Quick check

Q1. Mean-centering each neighbor's rating before computing the weighted average is done to:
Q2. Why does sparsity make user–user similarity scores unreliable?
Q3. A music streaming service has 50 million users and 10 million tracks. A new user signs up and immediately rates 3 songs. Which statement best describes the limitations of user-based CF in this scenario?
