Similarity metrics
Cosine, Pearson, and Jaccard: which similarity measure to use in collaborative filtering, and why choosing the wrong one gives you wrong neighbors.
What you'll learn
- Cosine similarity measures the angle between rating vectors and ignores magnitude.
- Pearson correlation is cosine on mean-centered ratings, removing each user's rating bias.
- Jaccard similarity is the right tool for binary implicit data like clicks or purchases.
Why the metric choice matters
A similarity metric is an encoding of what similar means in your domain. Two users who both rated action films highly are similar in one sense. Two users who gave the same relative ratings — one generous, one harsh — may be similar in a different and often more useful sense. Two users who clicked the same products are similar in yet another sense.
Picking the wrong metric does not produce an error. It silently produces bad neighbors, and therefore bad recommendations. The sections below build each metric from scratch so you can match it to your data.
Cosine similarity
Cosine similarity between two vectors measures the cosine of the angle between them. Formally, for rating vectors u and v:
cosine(u, v) = (u · v) / (||u|| * ||v||)
The dot product in the numerator rewards items that both users rated highly; dividing by the product of the magnitudes normalizes away scale. The result lies in [-1, 1] for signed vectors and in [0, 1] for non-negative ratings.
The key property: cosine ignores magnitude and cares only about direction. Two users who gave ratings [2, 4, 6] and [1, 2, 3] are perfectly cosine-similar (cosine = 1) even though their raw numbers differ. This is exactly right when you care only about the shape of taste, not the absolute level.
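As a minimal sketch of the formula above, using only the Python standard library: the function name `cosine` and the choice to return 0.0 for an all-zero vector are assumptions made for illustration, not part of the definition.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here: a zero vector has no direction
    return dot / (norm_u * norm_v)

same_shape = cosine([2, 4, 6], [1, 2, 3])       # one vector is a multiple of the other
different_shape = cosine([2, 4, 6], [6, 4, 2])  # same magnitude, reversed preferences

print(round(same_shape, 6))       # 1.0
print(round(different_shape, 6))  # 0.714286
```

Doubling every rating leaves the score at 1, while reversing the order of preferences lowers it even though the two vectors have the same length.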
Pearson correlation (mean-centered cosine)
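Raw cosine has a silent flaw with explicit ratings. A generous user who rates everything 4 or 5 and a harsh user who rates everything 1 or 2 may have nearly identical preferences while producing very different raw vectors. Cosine removes differences in scale, but not the constant offset each user adds. Because every rating is positive, all raw vectors point in roughly the same direction, so cosine comes out high for almost any pair of users and barely separates those who agree on relative rankings from those who disagree.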
Rating bias is the systematic offset each user adds to their ratings. Pearson correlation removes it by subtracting each user’s mean rating before computing cosine:
pearson(u, v) = cosine(u - mean(u), v - mean(v))
After mean-centering, a 5 from a user who rates everything 4 or 5 and a 2 from a user who rates everything 1 or 2 both become positive deviations. The correlation now captures agreement in relative preference, which is what collaborative filtering actually needs.
Pearson correlation also lives in [-1, 1]. A value near 1 means the users move together (both rate action films above their personal average and both rate romances below). A value near -1 means they consistently disagree.
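The centering step can be sketched in a few lines of standard-library Python. The helper names `center` and `pearson`, the toy ratings, and the 0.0 fallback for a user whose ratings are all identical are illustrative assumptions.

```python
import math

def center(ratings):
    """Subtract the user's own mean rating from each rating."""
    mean = sum(ratings) / len(ratings)
    return [r - mean for r in ratings]

def pearson(u, v):
    """Pearson correlation computed as cosine on mean-centered ratings."""
    cu, cv = center(u), center(v)
    dot = sum(a * b for a, b in zip(cu, cv))
    norm = math.sqrt(sum(a * a for a in cu) * sum(b * b for b in cv))
    return dot / norm if norm > 0 else 0.0  # flat raters: correlation undefined, 0.0 by convention

# Items, in order: two action films, then two romances (made-up data).
generous = [5, 5, 4, 4]
harsh = [2, 2, 1, 1]

print(center(generous))          # [0.5, 0.5, -0.5, -0.5]
print(center(harsh))             # [0.5, 0.5, -0.5, -0.5]
print(pearson(generous, harsh))  # 1.0
```

Once each user's mean is removed, the two users have identical deviation vectors, so the correlation is exactly 1 even though no raw rating matches.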
Jaccard similarity
Sometimes you have no ratings at all — only a signal that a user interacted with an item. Clicks, purchases, streams, and page views are implicit, binary signals: either the interaction happened or it did not.
Treating binary data as if it were a rating vector and running cosine or Pearson on it is a category error. A user who clicked 200 items and a user who clicked 5 items might look cosine-dissimilar for the wrong reasons.
Jaccard similarity is built for sets. For two users with item sets A and B:
jaccard(A, B) = |A ∩ B| / |A ∪ B|
It is the fraction of items either user touched that both touched. It lives in [0, 1]. A user who bought {milk, eggs, bread} and one who bought {eggs, bread, butter} share 2 items out of 4 distinct ones: Jaccard = 0.5.
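Jaccard is symmetric, bounded, and easy to interpret. Because the union in the denominator grows with every extra interaction, a heavy user does not look similar to everyone simply by having touched more items. The flip side is that two users with very different activity levels can never score close to 1, since the score is capped by the ratio of the smaller set to the larger one.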
Adjusted cosine (item-based CF)
When computing item-item similarity, a closely related variant appears: adjusted cosine. Each item is represented by its column of ratings, and those ratings come from different users with different rating scales. Plain item-item Pearson would center each rating by the item’s mean; adjusted cosine instead subtracts the mean rating of the user who gave it, which removes per-user bias before items are compared. It is the standard choice in item-based CF. See the item-based CF lesson for details.
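A rough sketch of adjusted cosine in its usual formulation, where the value subtracted from each rating is the mean of the user who gave it, taken over all of that user's ratings. The nested-dict layout, the user and item names, and the 0.0 fallback are hypothetical choices for illustration.

```python
import math

# Hypothetical toy data: ratings[user][item] = stars.
ratings = {
    "ana":  {"alien": 5, "heat": 4, "amelie": 2},
    "ben":  {"alien": 2, "heat": 2, "amelie": 1},  # harsh rater
    "cleo": {"alien": 4, "heat": 5, "amelie": 5},  # generous rater
}

def adjusted_cosine(ratings, item_i, item_j):
    """Item-item similarity over co-rating users, centering by each user's mean."""
    dot = sq_i = sq_j = 0.0
    for user_ratings in ratings.values():
        if item_i in user_ratings and item_j in user_ratings:
            user_mean = sum(user_ratings.values()) / len(user_ratings)
            di = user_ratings[item_i] - user_mean
            dj = user_ratings[item_j] - user_mean
            dot += di * dj
            sq_i += di * di
            sq_j += dj * dj
    norm = math.sqrt(sq_i * sq_j)
    return dot / norm if norm > 0 else 0.0  # no co-raters or no variance: 0.0 by convention

alien_heat = adjusted_cosine(ratings, "alien", "heat")
alien_amelie = adjusted_cosine(ratings, "alien", "amelie")
print(round(alien_heat, 3), round(alien_amelie, 3))
```

On this toy data the two action titles come out positively related and the action/romance pair strongly negative, even though one user's raw ratings are all low and another's are all high.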
Euclidean distance: a note of caution
Euclidean distance (L2 norm of the difference vector) is intuitive but poorly suited to sparse rating matrices. With millions of items and each user rating only dozens, most dimensions are zero. When missing ratings are filled with zeros, the distance is dominated by which items each user happened to rate rather than by how they rated them, so a user with no overlapping ratings can look as far away as, or farther than, one with diametrically opposite tastes. Prefer the metrics above for recommendation tasks.
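A small illustration of the problem, assuming unrated items are stored as 0 (the users and ratings are made up). It uses `math.dist` from the standard library, available from Python 3.8.

```python
import math

# Hypothetical users over six items; 0 marks "not rated".
target        = [5, 4, 0, 0, 0, 0]
opposite      = [1, 2, 0, 0, 0, 0]  # rated the same two items, disagrees on both
agrees_active = [5, 4, 5, 5, 5, 5]  # agrees on the shared items, rated four more

d_opposite = math.dist(target, opposite)
d_agrees = math.dist(target, agrees_active)

print(round(d_opposite, 2))  # 4.47
print(round(d_agrees, 2))    # 10.0
```

The user who disagrees on every shared item ends up closer than the user who agrees perfectly but has simply rated more items, which is the opposite of what a neighbor search should return.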
Decision guide
| Data type | Recommended metric | Why |
|---|---|---|
| Explicit ratings (1–5 stars) | Pearson correlation | Removes per-user rating bias |
| Binary / implicit (clicks, purchases) | Jaccard similarity | Designed for set overlap |
| Dense embeddings from a model | Cosine similarity | Embeddings are already calibrated; angle captures direction of meaning |
| Item-item similarity (explicit ratings) | Adjusted cosine | Centers each rating by the rater's user mean, not the item mean |
Compute cosine vs. Pearson — see the difference
Take two rating vectors over the same items, one from a systematically generous user and one from a systematically harsh user. Raw cosine is close to 1 whether their tastes agree or conflict, because all the ratings are positive. Centering clarifies the verdict.
A user with the same relative taste but a different rating scale will show a high Pearson after centering. A user with genuinely different relative preferences will show a low or negative Pearson regardless of scale.
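A minimal standard-library sketch of this comparison follows. The rating vectors are invented for illustration: `harsh_same_taste` is the generous user's vector shifted down by two stars, and `harsh_other_taste` reverses the relative preferences.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def pearson(u, v):
    # Pearson is cosine applied after subtracting each user's own mean.
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    return cosine([a - mean_u for a in u], [b - mean_v for b in v])

generous = [5, 4, 5, 3, 4]
harsh_same_taste = [3, 2, 3, 1, 2]   # generous shifted down by 2
harsh_other_taste = [1, 2, 1, 3, 2]  # opposite relative preferences

for label, other in [("same taste", harsh_same_taste),
                     ("other taste", harsh_other_taste)]:
    print(f"{label}: cosine={cosine(generous, other):.3f} "
          f"pearson={pearson(generous, other):.3f}")
```

Raw cosine stays above 0.8 for both pairs, while Pearson comes out at about +1 for the shared taste and about -1 for the reversed one.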
Jaccard in one minute
def jaccard(set_a, set_b):
    intersection = len(set_a & set_b)
    union = len(set_a | set_b)
    return intersection / union if union > 0 else 0.0

user_x = {"item_1", "item_3", "item_7", "item_9"}
user_y = {"item_3", "item_7", "item_11"}
print(jaccard(user_x, user_y))  # 2 shared out of 5 distinct = 0.4
No numpy required. No ratings required. Just two sets of item IDs.