How does the curse of dimensionality affect KNN?
In high-dimensional spaces all pairwise distances concentrate around the same value, so the concept of a 'nearest' neighbour breaks down — the k-th nearest neighbour is almost as far as every other point. KNN's accuracy degrades sharply as dimensionality increases unless the data has much lower intrinsic dimensionality.
How to think about it
The curse of dimensionality is the core reason KNN (and any distance-based method) degrades in high-dimensional feature spaces. Explaining it precisely separates candidates who understand why from those who only know the rule.
What happens geometrically
In d dimensions, the volume of a unit hypercube grows as 1, but the volume of a sphere of radius r shrinks relative to it. To capture a fixed fraction of the data volume, a neighbourhood must expand to cover nearly the entire space — at which point “nearest” neighbours are no longer near.
More concretely: consider n points uniform in [0,1]^d. The expected distance from a query point to its nearest neighbour scales as n^(-1/d). In 2D with 1,000 points, nearest-neighbour distance ≈ 0.03. In 100D you need 10^200 points to achieve the same density. Real datasets never provide that.
Distance concentration
For many distributions, as d grows:
max_dist / min_dist → 1
All points become equidistant. The signal-to-noise ratio of “nearest neighbour” collapses to zero. A k-d tree’s O(d log n) guarantee degrades to O(dn) brute-force because axis-aligned splits stop partitioning usefully.
Practical consequences
- KNN accuracy on raw high-dimensional data (e.g., raw pixels, raw text TF-IDF) is often poor.
- Approximate nearest neighbour methods (FAISS, HNSW) help at scale but do not fix the statistical problem.
- The real fix is dimensionality reduction (PCA, UMAP) or feature selection before computing distances.
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
pipe = Pipeline([
("pca", PCA(n_components=50, whiten=True)), # reduce before KNN
("knn", KNeighborsClassifier(n_neighbors=5)),
])
pipe.fit(X_train, y_train)