Why is cosine similarity preferred over Euclidean distance for comparing text vectors?
Cosine similarity measures the cosine of the angle between two vectors, which makes it invariant to vector magnitude: a short document and a long document on the same topic can score high despite the difference in length. Euclidean distance conflates directional difference with scale difference, which is misleading for sparse or length-varying text vectors.
How to think about it
Given two vectors a and b:
cosine_similarity(a, b) = (a · b) / (||a|| * ||b||)
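The result ranges from -1 (opposite directions) to 1 (same direction), with 0 meaning the vectors are orthogonal (no shared signal). For vectors with only non-negative components, such as TF-IDF weights or raw term counts, the score stays between 0 and 1.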
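The formula can be written out directly with NumPy. This is a minimal sketch: the helper name `cosine_sim` and the toy vectors are illustrative assumptions, not part of any library.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between a and b (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        # The angle is undefined when either vector has zero length.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(a, b) / denom)

same_direction = cosine_sim([1, 2, 3], [2, 4, 6])   # parallel vectors
orthogonal = cosine_sim([1, 0], [0, 1])             # no shared components
opposite = cosine_sim([1, 2, 3], [-1, -2, -3])      # antiparallel vectors

print(same_direction, orthogonal, opposite)
```

Up to floating-point rounding, the three values come out as 1, 0 and -1, matching the ends and midpoint of the range described above.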
Why magnitude-invariance matters for text
Without normalisation, a TF-IDF vector for a 1,000-word article about “machine learning” has a much larger L2 norm than the vector for a 50-word tweet on the same topic. Euclidean distance would then flag them as dissimilar largely because of document length, even though they discuss the same concepts. Cosine similarity divides by both norms, so the length difference drops out.
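from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Docs 0 and 1 share the terms "machine", "learning" and "AI"; doc 2 shares none.
docs = [
    "machine learning is a subset of AI",
    "AI and machine learning are closely related",
    "the cat sat on the mat",
]
vec = TfidfVectorizer()  # default norm="l2" already scales each row to unit length
X = vec.fit_transform(docs)  # sparse matrix with one row per document
sim = cosine_similarity(X)  # 3 x 3 matrix of pairwise similarities
print(sim[0, 1].round(3))  # clearly above 0: docs 0 and 1 share several terms
print(sim[0, 2].round(3))  # 0.0: docs 0 and 2 have no terms in common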
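To isolate the length effect, the following sketch compares a short text with a copy of itself repeated 20 times. It uses raw term counts rather than TF-IDF so that nothing is normalised behind the scenes; the texts and the helper `count_vector` are made up for illustration.

```python
from collections import Counter

import numpy as np

tweet = "machine learning models learn patterns from data"
# Toy stand-in for a long document: same wording, 20 times longer.
article = " ".join([tweet] * 20)

vocab = sorted(set(tweet.split()))

def count_vector(text):
    """Raw term-count vector over the shared vocabulary (hypothetical helper)."""
    counts = Counter(text.split())
    return np.array([counts[word] for word in vocab], dtype=float)

a = count_vector(tweet)
b = count_vector(article)

euclidean = np.linalg.norm(a - b)
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"Euclidean distance: {euclidean:.2f}")  # large, driven purely by length
print(f"Cosine similarity:  {cosine:.3f}")     # 1.000, same direction
```

The two vectors point in exactly the same direction, so cosine similarity reports a perfect match while Euclidean distance grows with the number of repetitions.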
With dense embeddings the same formula applies. Libraries such as Faiss and Annoy support inner-product search, which is equivalent to cosine similarity when the vectors are L2-normalised, for approximate nearest-neighbour retrieval at scale.
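The equivalence can be checked with a brute-force search in NumPy. This sketch does not use the Faiss or Annoy APIs; it only illustrates the normalise-then-inner-product idea on synthetic vectors, and the names `l2_normalise`, `corpus` and `query` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 64))  # synthetic stand-in for document embeddings
query = rng.normal(size=64)           # synthetic stand-in for a query embedding

def l2_normalise(x):
    """Scale vectors to unit L2 norm along the last axis (hypothetical helper)."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

corpus_unit = l2_normalise(corpus)
query_unit = l2_normalise(query)

# On unit vectors the inner product is the cosine similarity.
scores = corpus_unit @ query_unit
top_k = np.argsort(-scores)[:5]  # indices of the 5 best matches

# Cross-check against the explicit cosine formula on the raw vectors.
cosine = (corpus @ query) / (np.linalg.norm(corpus, axis=1) * np.linalg.norm(query))
print(np.allclose(scores, cosine))
print(top_k)
```

Normalising once at indexing time means each query needs only a matrix-vector product, which is the kind of operation an inner-product index is built to accelerate.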
When Euclidean distance is acceptable: if all vectors have been L2-normalised (unit sphere), cosine similarity and Euclidean distance produce the same ranking because ||a - b||^2 = 2 - 2 * cosine(a, b) for unit vectors.
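The identity follows from expanding the square: ||a - b||^2 = ||a||^2 + ||b||^2 - 2 (a · b), and for unit vectors both norms are 1 and a · b equals cosine(a, b). A quick numerical check on random unit vectors (synthetic data, chosen only for illustration) might look like this:

```python
import numpy as np

rng = np.random.default_rng(42)
vectors = rng.normal(size=(200, 16))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # project onto the unit sphere

query = vectors[0]
others = vectors[1:]

cos = others @ query                             # cosine similarity to the query
sq_dist = np.sum((others - query) ** 2, axis=1)  # squared Euclidean distance

identity_holds = np.allclose(sq_dist, 2 - 2 * cos)
# Highest cosine first should match smallest distance first.
same_ranking = np.array_equal(np.argsort(-cos), np.argsort(sq_dist))

print(identity_holds, same_ranking)
```

Because squared distance is a decreasing function of cosine similarity on the unit sphere, sorting by either measure returns neighbours in the same order.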