Why is cosine similarity preferred over Euclidean distance for comparing text vectors?

Cosine similarity measures the angle between two vectors, making it invariant to vector magnitude — so a short document and a long document on the same topic score high regardless of length differences. Euclidean distance conflates directional difference with scale difference, which is misleading for sparse or length-varying text.

What are embeddings, and how do you measure similarity between them for vector search?

Embeddings are dense vectors that map text or other data into a geometric space where semantically similar items are close together. Vector search ranks candidates by similarity, most commonly cosine similarity or dot product and sometimes Euclidean distance, retrieving the nearest vectors to a query embedding.

How would you design a metric to evaluate the relevance of a content recommendation feed?

Feed relevance has no single ground-truth label, so it requires a tiered metric system: an implicit behavioural signal (long dwell time, saves, shares) as the online primary metric; an explicit user-satisfaction signal (thumbs-up/down, survey) as the periodic validation; and an offline ranking metric (NDCG computed from historical high-engagement items) for fast model iteration. The three tiers must converge to be trusted.

How would you design a metric to measure the quality of a search feature inside an e-commerce app?

Search quality has two sides: relevance (did results match intent?) and utility (did the user accomplish their goal?). A good metric system combines an offline relevance signal — such as NDCG computed against human-labelled queries — with an online behavioural signal — such as click-through rate at rank 1 and zero-result rate — tied to a downstream business outcome like add-to-cart rate.

Similarity metrics — Recommender Systems

For three lessons now we have reached for cosine similarity without once stopping to justify it. It found our user neighborhoods, it powered the item-item table, and then — at the end of the last lesson — it embarrassed us, handing back a flawless 1.00 for two items that shared a single rater. We promised to open the drawer and ask the question we had been dodging: is cosine even the right ruler?

This lesson opens that drawer fully. There is more than one ruler in there, and the difference between them is not academic. Your similarity function is the single biggest lever you have on recommendation quality. Choose the wrong ruler and nothing downstream can save you — you will find the wrong neighbors, every time, silently.

A metric is a definition of “similar”

Pause on what a similarity score actually is. It is a number that encodes what you mean by similar — and that meaning is a choice, not a fact about the data.

Two users who both rated action films highly are similar in one sense. Two users who gave the same relative rankings — one generous, one harsh — are similar in a second, often more useful sense. Two users who clicked the same products, with no ratings at all, are similar in a third sense entirely. No single formula captures all three. So the work of this lesson is to match each ruler to the kind of “similar” you actually care about.

And here is the trap that makes this dangerous: picking the wrong metric never throws an error. It quietly produces bad neighbors, and therefore bad recommendations, while every line of code runs green. The only defense is to understand each metric from the inside. So let us build them.

Cosine: the angle between two tastes

Cosine similarity measures the cosine of the angle between two rating vectors. For users u and v:

cosine(u, v) = (u · v) / (||u|| * ||v||)

The dot product on top rewards ratings that are high in the same places. Dividing by the two magnitudes strips out scale, so only direction survives. The result sits in [−1, 1] for signed vectors, and [0, 1] when ratings are non-negative.

That last property is the whole personality of cosine: it is blind to magnitude and sees only direction. Two users who rated three films [2, 4, 6] and [1, 2, 3] are perfectly cosine-similar — cosine exactly 1 — even though one rated twice as high across the board. When you care about the shape of taste and not its loudness, that blindness is exactly what you want.

Two rating vectors and the angle θ between them. Cosine similarity equals cos(θ).

Pearson: cosine after removing each user’s bias

Cosine has a silent flaw the moment ratings are explicit stars. Think of a generous user who marks everything 4 or 5, and a harsh user who marks everything 1 or 2. They might agree perfectly on what is better than what — and yet their raw vectors point in noticeably different directions, so cosine reports them as only middling-similar.

The culprit is rating bias: the systematic offset each person adds to every score. Pearson correlation removes it by subtracting each user’s mean rating before taking the cosine:

pearson(u, v) = cosine(u - mean(u), v - mean(v))

That subtraction is mean-centering again, the same move from user-based CF. After centering, a “high for me” rating becomes a positive deviation no matter whether “high” meant 5 or 2, and a “low for me” rating becomes negative. What remains is pure relative preference — agreement on the ranking, which is precisely what collaborative filtering needs. Pearson, too, lives in [−1, 1]: near 1 the users rise and fall together; near −1 they reliably disagree.

Watch the two rulers disagree

Words make the difference sound subtle. The numbers make it stark. Take User A, a generous rater, and User B, a harsh one, who rank the same four films in the same order. Then bring in User C, whose taste is genuinely different. Watch what each ruler says.

import numpy as np

# User A: generous rater.  User B: harsh rater, same relative order.
a = np.array([5.0, 4.0, 5.0, 3.0])
b = np.array([2.0, 1.0, 2.0, 0.0])

def cosine(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

cos_raw = cosine(a, b)

# Pearson = cosine after mean-centering each vector
a_centered = a - a.mean()
b_centered = b - b.mean()
pearson = cosine(a_centered, b_centered)

print(f"A mean: {a.mean():.2f}   B mean: {b.mean():.2f}")
print(f"Cosine (raw):         {cos_raw:.4f}")
print(f"Pearson (centered):   {pearson:.4f}")
print("A centered:", a_centered)
print("B centered:", b_centered)

# Now a genuinely different taste profile
c = np.array([5.0, 1.0, 2.0, 4.0])
print(f"\nCosine A vs C (raw):  {cosine(a, c):.4f}")
print(f"Pearson A vs C:       {cosine(a_centered, c - c.mean()):.4f}")

A mean: 4.25   B mean: 1.25
Cosine (raw):         0.9238
Pearson (centered):   1.0000
A centered: [ 0.75 -0.25  0.75 -1.25]
B centered: [ 0.75 -0.25  0.75 -1.25]

Cosine A vs C (raw):  0.8683
Pearson A vs C:       0.0000

Look at what just happened, because both halves of this output are a lesson. For A and B — the same taste at different volumes — raw cosine says 0.9238: high, but it docks them for the scale gap. Pearson says 1.0000, a perfect match, and the centered vectors show why: both collapse to the identical deviation profile [0.75, -0.25, 0.75, -1.25]. Pearson saw past the generosity to the taste underneath.

Now the more alarming half. User C genuinely disagrees with A — yet raw cosine reports 0.8683, which would have made C look like a close neighbor and earned A a stream of recommendations C would hate. Pearson reports 0.0000: no relative agreement whatsoever. The wrong ruler did not error out. It just quietly pointed at the wrong neighbor, exactly as warned.

Jaccard: when there are no ratings at all

Sometimes there is nothing to center, because there are no ratings. You know only that a user interacted with an item — clicked it, bought it, streamed it, viewed the page. These are implicit, binary signals: the event either happened or it did not.

Running cosine or Pearson on binary data is a category error. A user who clicked 200 items and one who clicked 5 will look dissimilar for reasons that have nothing to do with taste. The right tool treats each user as a set of touched items. Jaccard similarity does exactly that:

jaccard(A, B) = |A ∩ B| / |A ∪ B|

It is the fraction of all items either user touched that both touched — the overlap divided by the union, a number in [0, 1]. Crucially, because set size sits in both the numerator and the denominator, Jaccard is immune to one user being far more active than another. Here it is end to end:

def jaccard(set_a, set_b):
    intersection = len(set_a & set_b)
    union = len(set_a | set_b)
    return intersection / union if union > 0 else 0.0

user_x = {"item_1", "item_3", "item_7", "item_9"}
user_y = {"item_3", "item_7", "item_11"}

print(jaccard(user_x, user_y))

0.4

Two shared items (item_3, item_7) out of five distinct ones gives 0.4. No numpy, no ratings, no mean-centering — just two sets of IDs.

Two more rulers, briefly

Adjusted cosine is the variant that belongs in item-based CF. Instead of centering by each user’s mean (as Pearson does), it centers each rating by the item’s mean — correcting for the fact that some items attract systematically higher scores than others. It is the standard choice when you compute item-item similarity on explicit ratings.

Euclidean distance — the straight-line length of the difference vector — is intuitive but a poor fit for sparse rating matrices. With millions of items and each user rating only a few dozen, almost every dimension is zero. Two users with no items in common end up the same distance apart as two users with diametrically opposite tastes. Reach for the angle-based and set-based rulers above instead.

Which ruler for which data

Data type	Recommended metric	Why
Explicit ratings (1–5 stars)	Pearson correlation	Removes per-user rating bias
Binary / implicit (clicks, purchases)	Jaccard similarity	Designed for set overlap
Dense embeddings from a model	Cosine similarity	Embeddings are already calibrated; angle captures direction of meaning
Item-item similarity (explicit ratings)	Adjusted cosine	Centers by item mean, not user mean

In one breath

A similarity metric is a chosen definition of “similar,” and the choice is the biggest lever on recommendation quality: cosine compares the angle of two rating vectors and ignores scale, Pearson is cosine after mean-centering away each user’s generosity bias (so it sees relative taste — A and B scored 1.00 while raw cosine docked them to 0.92), and Jaccard measures set overlap for binary implicit data where there is nothing to center.

Practice

Before the quiz, reason through the worked numbers one more time. User A and User C came back at raw cosine 0.8683 — high enough to look like neighbors — but Pearson 0.0000. In your own words, what did mean-centering reveal about A and C that the raw angle hid? And if you were building a recommender on play-counts rather than stars, why would reaching for either of those two rulers be the wrong move?

Quick check

0/3

Q1A user who always rates 5 stars and a user who always rates 1 star happen to give the same relative rankings across films. Which similarity metric correctly identifies them as highly similar?

Q2You are building a recommendation system for a music streaming app. Users have no ratings — you only know which songs each user has played at least once. Which metric should you use for user-user similarity?

Q3Your team uses cosine similarity on 128-dimensional user embeddings produced by a neural model. A colleague suggests switching to Pearson. What is the most likely outcome?

A question to carry forward

Step back and notice what every ruler in this lesson had in common. Cosine, Pearson, Jaccard, adjusted cosine — each one compares two rows (or two columns) of the raw matrix directly, cell against cell. That is the entire family of neighborhood methods, and we have now seen its ceiling: it can only ever compare what is literally there, so it stays shallow and it stays noisy on a matrix that is 99% empty.

So here is the question to carry forward. What if, instead of comparing users to users in the raw rating space, we learned a small set of hidden taste dimensions — action-intensity, indie-ness, prestige-drama — and placed every user and every item as a short vector in that learned space? Then similarity would not be measured on the sparse, noisy surface; it would emerge from a compressed model of taste itself. That leap — from comparing rows to learning latent factors — is matrix factorization, the idea that won the Netflix Prize, and it is the next lesson.

Similarity metrics

What you'll learn

Before you start