How would you design a metric to evaluate the relevance of a content recommendation feed?

Feed relevance has no single ground-truth label, so it requires a tiered metric system: an implicit behavioural signal (long dwell time, saves, shares) as the online primary metric; an explicit user-satisfaction signal (thumbs-up/down, survey) as the periodic validation; and an offline ranking metric (NDCG computed from historical high-engagement items) for fast model iteration. A change should be trusted only when all three tiers move in the same direction.

What are embeddings, and how do you measure similarity between them for vector search?

Embeddings are dense vectors that map text or other data into a geometric space where semantically similar items are close together. Vector search ranks candidates by similarity, most commonly cosine similarity or dot product and sometimes Euclidean distance, retrieving the nearest vectors to a query embedding.
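
A minimal sketch of the three similarity measures and a brute-force nearest-vector lookup. The 3-dimensional vectors and the helper name `nearest` are made up for illustration; a real system would use embeddings from a model and usually an approximate index rather than a full scan.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (ignores vector length)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    """Straight-line distance; smaller means more similar."""
    return float(np.linalg.norm(a - b))

def nearest(query, corpus, k=2):
    """Brute-force top-k by cosine similarity (hypothetical helper)."""
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q                      # dot product of unit vectors = cosine
    order = np.argsort(-scores)[:k]     # highest score first
    return [(int(i), float(scores[i])) for i in order]

# Toy "embeddings", invented for this example
corpus = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([1.0, 0.2, 0.0])

hits = nearest(query, corpus, k=2)
print(hits)
```

Once vectors are normalized to unit length, cosine similarity, dot product, and Euclidean distance all produce the same ranking, which is why many vector stores normalize at write time and then use the cheaper dot product.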

How would you design a metric to measure the quality of a search feature inside an e-commerce app?

Search quality has two sides: relevance (did results match intent?) and utility (did the user accomplish their goal?). A good metric system combines an offline relevance signal — such as NDCG computed against human-labelled queries — with an online behavioural signal — such as click-through rate at rank 1 and zero-result rate — tied to a downstream business outcome like add-to-cart rate.
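
As a sketch of the offline half, NDCG can be computed from graded human labels like this. It assumes the linear-gain form of DCG (some teams use 2^rel - 1 instead) and assumes every labelled result for the query appears in the list, so the ideal ordering is just the same labels sorted.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain for relevance grades given in ranked order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG of the actual ranking divided by DCG of the ideal ranking."""
    ideal = sorted(relevances, reverse=True)
    ideal_dcg = dcg(ideal[:k])
    return dcg(relevances[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical labels for one query: 3 = perfect match, 0 = irrelevant
ranked_labels = [3, 2, 0, 1]
print(f"NDCG@4 = {ndcg_at_k(ranked_labels, 4):.3f}")
```

A score of 1.0 means the engine returned results in the best possible order for that query; averaging over a labelled query set gives the offline number to track alongside the online signals.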

What is hybrid search and why is it often better than pure vector search?

Hybrid search combines dense vector similarity with sparse keyword search such as BM25, then fuses the rankings. Dense retrieval captures semantic meaning while keyword search nails exact terms, identifiers, and rare tokens, so combining them improves recall and precision over either alone.
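
One common way to fuse the two rankings is reciprocal rank fusion (RRF). The sketch below is an assumption about the fusion step, not the only option (weighted score blending is another); the document IDs are invented, and k=60 is the conventional default for RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list.

    Each document earns 1 / (k + rank) from every list it appears in.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism
    return sorted(scores, key=lambda d: (-scores[d], d))

dense_ranking = ["d3", "d1", "d7"]    # hypothetical vector-search results
sparse_ranking = ["d1", "d9", "d3"]   # hypothetical BM25 results

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
```

Because RRF uses only ranks, it sidesteps the problem that BM25 scores and cosine similarities live on different scales.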

User-based collaborative filtering — Recommender Systems

We ended the last lesson with a question about magic. Content-based filtering, we said, can only ever recommend more of what you already like — same words, same tags, same genre. It can never tell you that a hundred people whose taste is uncannily like yours all adored one strange film that shares nothing with your history. So we asked: where does that surprise come from?

The answer is other people. Not the items — the crowd. If we throw away every word of description about what the items are, and instead look only at who liked what, a different kind of signal appears. That signal is the subject of this lesson.

A first picture

Imagine the whole world of users as a crowd in a room. You are standing somewhere in it. Most people are strangers — their taste has nothing to do with yours. But a handful, scattered here and there, turn out to have rated the last fifty films almost exactly as you did. They are your taste-neighbors.

Now suppose one of those neighbors raves about a film you have never seen. Should you trust the recommendation? Of course you should — that is the whole bet. People who agreed with you fifty times running are likely to agree with you the fifty-first time too. User-based collaborative filtering is nothing more than this bet, written down precisely enough for a computer to make it.

The word collaborative is the heart of it. We are filtering items for you using the collective behavior of everyone else — their ratings collaborate to predict yours. And filtering is the goal: out of a catalog of millions, surface the few you are most likely to love. No genre labels, no plot summaries, no descriptions. Pure signal from human choices.

Start with the ratings matrix

Everything begins with the utility matrix from the previous chapter — rows are users, columns are items, and a cell holds a rating when we have one and NaN when we do not.

         Item A  Item B  Item C  Item D
Alice       5       3      NaN     1
Bob         4       NaN     4      1
Carol       NaN     2       5      2
Dave        5       4      NaN    NaN

Look at how empty it is. Alice never rated Item C; Bob skipped Item B; nobody has a full row. This emptiness has a name — sparsity — and it is not a minor inconvenience. It is the central difficulty of the entire problem, and it will come back to bite us at the end of the lesson. Hold onto it.
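
To put a number on that emptiness, here is a small sketch that loads the four-user table above into NumPy and measures the share of missing cells. The variable names are illustrative only.

```python
import numpy as np

# Rows = Alice, Bob, Carol, Dave; columns = Items A-D; np.nan = not rated
R = np.array([
    [5,      3,      np.nan, 1     ],   # Alice
    [4,      np.nan, 4,      1     ],   # Bob
    [np.nan, 2,      5,      2     ],   # Carol
    [5,      4,      np.nan, np.nan],   # Dave
])

observed = ~np.isnan(R)                  # True where a rating exists
sparsity = 1.0 - observed.mean()         # fraction of empty cells
ratings_per_user = observed.sum(axis=1)  # how many items each user rated

print(f"sparsity = {sparsity:.2%}")      # sparsity = 31.25%
print("ratings per user:", ratings_per_user.tolist())
```

Five of sixteen cells are empty here. A toy table like this is unusually full; the same calculation on a real catalog would typically come out much closer to 100%.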

Measuring how alike two users are

To find your neighbors, we need a number that says how alike any two users are. Two measures dominate.

Cosine similarity treats each user’s row as an arrow in item-space and measures the angle between two arrows. Point the same way, and the cosine is near 1; point at right angles, and it is near 0. Crucially, only the items both users rated — the co-rated items — enter the calculation. Everything else is NaN and sits out.

Pearson correlation does the same thing, but first subtracts each user’s own average rating. Why bother? Because people use the scale differently. Some rate everything a generous 4 or 5; others spread across the full 1-to-5 range. Subtracting each user’s mean strips out that personal bias, leaving only the shape of their opinions — what they liked more than their own average, and what they liked less. In practice this correction matters, which is why Pearson often beats raw cosine in collaborative filtering.

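Either way, the output is a user–user similarity table: for every pair (u, v), a score between −1 and 1. One caveat: on raw ratings, which are all positive, cosine can only land between 0 and 1. Negative scores appear only with Pearson, or with cosine computed on mean-centered ratings.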
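
The difference between the two measures is easiest to see on a pair of users whose tastes are opposite. The sketch below assumes the convention of centering on the mean of the co-rated items (some formulations use each user’s overall mean instead), and the two rating rows are invented for the example.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity over co-rated items only."""
    mask = ~np.isnan(a) & ~np.isnan(b)
    if mask.sum() == 0:
        return 0.0
    a_c, b_c = a[mask], b[mask]
    denom = np.linalg.norm(a_c) * np.linalg.norm(b_c)
    return float(np.dot(a_c, b_c) / denom) if denom > 0 else 0.0

def pearson_sim(a, b):
    """Pearson correlation over co-rated items (mean taken over those items)."""
    mask = ~np.isnan(a) & ~np.isnan(b)
    if mask.sum() < 2:           # correlation is undefined on a single point
        return 0.0
    a_c = a[mask] - a[mask].mean()
    b_c = b[mask] - b[mask].mean()
    denom = np.linalg.norm(a_c) * np.linalg.norm(b_c)
    return float(np.dot(a_c, b_c) / denom) if denom > 0 else 0.0

generous = np.array([5.0, 4.0, 5.0, 4.0])   # likes items 0 and 2 best
harsh    = np.array([1.0, 2.0, 1.0, 2.0])   # likes items 1 and 3 best

print(f"cosine  = {cosine_sim(generous, harsh):.3f}")
print(f"pearson = {pearson_sim(generous, harsh):.3f}")
```

Raw cosine reports these two as highly similar (about 0.91) simply because every rating is positive, while Pearson correctly returns −1: wherever one user goes above their own average, the other goes below.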

Choosing the neighborhood

Now pick a target user. Sort everyone else by similarity to that user, and keep the top few. That short list of the most-similar users is the neighborhood — the same idea as the k nearest neighbors you may have met in classification, only here the “points” are people.

How many neighbors? That is a genuine trade-off. Keep too few and a single quirky neighbor swings the whole prediction. Keep too many and lukewarm, barely-similar users dilute the signal until it tastes of nothing.

[Figure: three nearest neighbors (kNN), each with a similarity score and a rating for Item X, feed a weighted-average prediction for the target user.]
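
Picking the neighborhood is just a sort and a slice. In this sketch the user names, similarity values, and the helper `top_k_neighbors` are all hypothetical; the value of k is a tuning choice, not something the method fixes.

```python
def top_k_neighbors(similarities, k):
    """Return the k most similar users as (user, similarity) pairs.

    similarities: dict mapping user id -> similarity to the target user.
    """
    ranked = sorted(similarities.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Made-up similarities to one target user
sims_to_target = {"bob": 0.92, "carol": 0.15, "dave": 0.88, "erin": -0.40, "frank": 0.61}

neighborhood = top_k_neighbors(sims_to_target, k=3)
print(neighborhood)
```

With k=1 the prediction would rest entirely on bob; with k=5 it would also listen to carol and erin, who barely resemble the target at all.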

Predicting the missing rating

Here is the payoff. Pick an item the target user has not rated. Gather every neighbor who has rated it. Then blend their ratings — but not as a plain average. Weight each neighbor by how similar they are to the target, so close neighbors speak loudly and distant ones whisper:

predicted(u, i) = mean(u) + [ sum_v sim(u,v) * (rating(v,i) - mean(v)) ] / [ sum_v |sim(u,v)| ]

The piece that looks fussy — subtracting each neighbor’s mean, rating(v,i) - mean(v) — is mean-centering, and it is the single most important detail in the formula. Suppose a neighbor showers every film with a 4 or 5. Their 4 for this item is not praise; for them, a 4 is a shrug. Mean-centering converts each raw rating into a deviation — how far above or below that neighbor’s own baseline it sits — which is a far more honest signal than the raw number. We add the target’s own mean back at the very end to return the answer to their personal scale.

Watch one prediction unfold

Let us run the whole pipeline on a small five-user, five-item matrix and trace the result number by number. User 0 is our target; they have rated items 0, 1, and 3, but never item 2. Our job is to predict item 2.

import numpy as np

# Rows = users (0-4), Cols = items (0-4); np.nan = not rated
R = np.array([
    [5,  3,  np.nan, 1,  np.nan],  # User 0 (target)
    [4,  np.nan, 4, 1,  np.nan],   # User 1
    [np.nan, 2, 5, 2,  4     ],    # User 2
    [5,  4,  np.nan, np.nan, 3],   # User 3
    [1,  1,  2, np.nan, 2     ],   # User 4
])

def cosine_sim(a, b):
    """Cosine similarity using only co-rated items."""
    mask = ~np.isnan(a) & ~np.isnan(b)
    if mask.sum() == 0:
        return 0.0
    a_c, b_c = a[mask], b[mask]
    denom = np.linalg.norm(a_c) * np.linalg.norm(b_c)
    return float(np.dot(a_c, b_c) / denom) if denom > 0 else 0.0

target_user = 0
target_item = 2  # User 0 has not rated item 2

# Similarity of the target to every user who rated item 2
sims = {}
for u in range(R.shape[0]):
    if u == target_user:
        continue
    s = cosine_sim(R[target_user], R[u])
    if not np.isnan(R[u, target_item]):
        sims[u] = s
        print(f"  sim(User 0, User {u}) = {s:.3f}  |  rating for item {target_item}: {R[u, target_item]}")

# Mean-centered, similarity-weighted average
u_mean = np.nanmean(R[target_user])
numerator = sum(sims[v] * (R[v, target_item] - np.nanmean(R[v])) for v in sims)
denominator = sum(abs(sims[v]) for v in sims)
prediction = u_mean + (numerator / denominator if denominator > 0 else 0)

print(f"\nUser 0 mean rating: {u_mean:.2f}")
print(f"Predicted rating for item {target_item}: {prediction:.2f}")

  sim(User 0, User 1) = 0.999  |  rating for item 2: 4.0
  sim(User 0, User 2) = 0.894  |  rating for item 2: 5.0
  sim(User 0, User 4) = 0.970  |  rating for item 2: 2.0

User 0 mean rating: 3.00
Predicted rating for item 2: 4.06

Read those numbers slowly, because the whole story is in them. User 1 comes back at 0.999, almost a perfect twin of User 0 across their co-rated items, and rated item 2 a solid 4, one point above their own average of 3.00. User 2 (similarity 0.894) rated it a 5, which is 1.75 above their average of 3.25. User 4 is also quite similar (0.970) and gave item 2 only a 2, but User 4 is a harsh rater with an average of 1.50, so that 2 is still a mild thumbs-up (+0.50). It is the weakest endorsement of the three, so it pulls the blend down toward User 0’s baseline without dragging it below. Blend all three, weighted by closeness and corrected for each person’s baseline generosity, and the prediction settles at 4.06, comfortably above User 0’s own average of 3.00. The crowd is telling us: you will probably like item 2. And notice we never once looked at what item 2 actually is.
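
To see where the 4.06 comes from, it helps to print each neighbor’s mean-centered deviation and weighted contribution. This sketch hard-codes the rounded similarities from the output above and each neighbor’s mean as computed from their row of the matrix, so the last digits may differ slightly from the full-precision run.

```python
# Rounded similarities and ratings from the worked example;
# means are each neighbor's average over the items they rated.
neighbors = {
    1: {"sim": 0.999, "rating": 4.0, "mean": 3.00},   # rated 4, 4, 1
    2: {"sim": 0.894, "rating": 5.0, "mean": 3.25},   # rated 2, 5, 2, 4
    4: {"sim": 0.970, "rating": 2.0, "mean": 1.50},   # rated 1, 1, 2, 2
}
target_mean = 3.00  # User 0 rated 5, 3, 1

contributions = {}
for user, n in neighbors.items():
    deviation = n["rating"] - n["mean"]          # how far above their own baseline
    contributions[user] = n["sim"] * deviation   # weighted by similarity
    print(f"User {user}: deviation {deviation:+.2f}, contribution {contributions[user]:+.3f}")

total_weight = sum(abs(n["sim"]) for n in neighbors.values())
prediction = target_mean + sum(contributions.values()) / total_weight
print(f"prediction = {prediction:.2f}")
```

Every deviation is positive, including User 4’s: the raw 2 looks low on a 1-to-5 scale, but it sits above that user’s own average.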

Why this is worth doing

No item features required. The algorithm runs on movies, songs, books, products, research papers — any domain at all — without a single word describing the items. It needs only the matrix of who-liked-what.
It finds connections nothing else can. Two users both adore one obscure documentary and one mainstream pop album. No content-based system would ever link those two items — they share nothing. Collaborative filtering links them instantly, through the people who loved both.
The reason is human-readable. “Because people with taste like yours liked it” is an explanation anyone understands without a lecture on vectors.

Where it breaks

Watch out

Sparsity and scale are the two walls this method runs into.

Remember the emptiness we flagged at the start. Most users have rated a vanishing fraction of the catalog, so any two users share only a handful of co-rated items — sometimes just one. A similarity computed from a single shared rating is not evidence; it is a coincidence dressed up as a number. And it gets worse as the catalog grows, because each user samples an ever-smaller slice of it.
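
The worked example already shows the problem in miniature. This sketch counts how many co-rated items stand behind each similarity score in the five-user matrix; the counting trick (multiplying the observed-mask by its transpose) is just one convenient way to do it.

```python
import numpy as np

# Same five-user, five-item matrix as the worked example
R = np.array([
    [5,      3,      np.nan, 1,      np.nan],  # User 0
    [4,      np.nan, 4,      1,      np.nan],  # User 1
    [np.nan, 2,      5,      2,      4     ],  # User 2
    [5,      4,      np.nan, np.nan, 3     ],  # User 3
    [1,      1,      2,      np.nan, 2     ],  # User 4
])

observed = (~np.isnan(R)).astype(int)
overlap = observed @ observed.T   # overlap[u, v] = number of items both rated

for v in range(1, R.shape[0]):
    print(f"User 0 and User {v} share {overlap[0, v]} co-rated items")
```

Every one of User 0’s similarities, including the near-perfect 0.999 with User 1, rests on just two shared ratings. In a catalog of millions the overlap is often no better than that.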

Scale is the second wall. Comparing every pair of users takes on the order of U² similarity computations, where U is the number of users. At a few thousand users that is fine. At fifty million it is hopeless, and every new rating threatens to make the whole similarity table stale.
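
The arithmetic behind that claim is short enough to run. This sketch counts the distinct user pairs, U(U−1)/2, for the two population sizes mentioned; the function name is illustrative.

```python
def user_pairs(num_users):
    """Number of distinct user-user pairs: U * (U - 1) / 2."""
    return num_users * (num_users - 1) // 2

for u in (5_000, 50_000_000):
    print(f"{u:>12,} users -> {user_pairs(u):,} pairs")
# 5,000 users      -> 12,497,500 pairs
# 50,000,000 users -> 1,249,999,975,000,000 pairs
```

Roughly twelve million pairs is a coffee-break job. More than a quadrillion is not, and that is before counting the cost of each similarity computation itself.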

These two pressures, sparsity and scale, are exactly what push real systems toward item-based CF and matrix-factorization methods.

There is a third crack worth naming: user cold-start. A brand-new user has rated nothing, so their neighborhood is empty and no prediction is possible. The crowd can only help once you have told it something about yourself.

In one breath

User-based collaborative filtering finds the people whose ratings most resemble yours, then predicts your opinion of an unseen item as a similarity-weighted, mean-centered blend of what those neighbors thought — needing no knowledge of the items themselves, but limited by sparse overlap between users and by the U² cost of comparing everyone to everyone.

Practice

Before the quiz, try tracing the worked example by hand. User 1 had similarity 0.999 and rated item 2 a 4; User 2 had 0.894 and rated it 5; User 4 had 0.970 and rated it 2. Which neighbor pulls the prediction up, and which pulls it down? If you dropped User 4 from the neighborhood entirely, would the predicted 4.06 rise or fall? Reason it out before checking the explanations below.

Quick check

Q1. Mean-centering each neighbor's rating before computing the weighted average is done to:

Q2. Why does sparsity make user–user similarity scores unreliable?

Q3. A music streaming service has 50 million users and 10 million tracks. A new user signs up and immediately rates 3 songs. Which statement best describes the limitations of user-based CF in this scenario?

A question to carry forward

We just hit two walls, and they share a single root. Sparsity bites because any two users share so few co-rated items. The U² cost bites because there are so many users — tens of millions of them, each one a moving target whose taste can shift overnight. Both walls are walls of the user axis.

So here is the question to carry into the next lesson. What if we turned the matrix on its side? There are usually far fewer items than users, and the relationship between two items — whether a thriller resembles another thriller — barely changes from one month to the next. If we compared items to each other instead of people to each other, could we precompute the whole similarity table once, overnight, and then serve a recommendation with nothing more than a lookup? That single flip of the axis is the idea behind item-based collaborative filtering, and it is what made recommendation practical at internet scale.
