datarekha
Machine Learning Medium Asked at GoogleAsked at MetaAsked at Microsoft

How does the curse of dimensionality affect KNN?

The short answer

In high-dimensional spaces all pairwise distances concentrate around the same value, so the concept of a 'nearest' neighbour breaks down — the k-th nearest neighbour is almost as far as every other point. KNN's accuracy degrades sharply as dimensionality increases unless the data has much lower intrinsic dimensionality.

How to think about it

The curse of dimensionality is the core reason KNN (and any distance-based method) degrades in high-dimensional feature spaces. Explaining it precisely separates candidates who understand why from those who only know the rule.

What happens geometrically

In d dimensions, the volume of a unit hypercube grows as 1, but the volume of a sphere of radius r shrinks relative to it. To capture a fixed fraction of the data volume, a neighbourhood must expand to cover nearly the entire space — at which point “nearest” neighbours are no longer near.

More concretely: consider n points uniform in [0,1]^d. The expected distance from a query point to its nearest neighbour scales as n^(-1/d). In 2D with 1,000 points, nearest-neighbour distance ≈ 0.03. In 100D you need 10^200 points to achieve the same density. Real datasets never provide that.

Distance concentration

For many distributions, as d grows:

max_dist / min_dist → 1

All points become equidistant. The signal-to-noise ratio of “nearest neighbour” collapses to zero. A k-d tree’s O(d log n) guarantee degrades to O(dn) brute-force because axis-aligned splits stop partitioning usefully.

Practical consequences

  • KNN accuracy on raw high-dimensional data (e.g., raw pixels, raw text TF-IDF) is often poor.
  • Approximate nearest neighbour methods (FAISS, HNSW) help at scale but do not fix the statistical problem.
  • The real fix is dimensionality reduction (PCA, UMAP) or feature selection before computing distances.
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("pca", PCA(n_components=50, whiten=True)),   # reduce before KNN
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
pipe.fit(X_train, y_train)
Learn it properly Curse of Dimensionality

Keep practising

All Machine Learning questions

Explore further

Skip to content