The Curse of Dimensionality
Add features and your data gets exponentially lonelier. In high dimensions everything is far away, distances stop meaning anything, and intuition breaks.
What you'll learn
- Why a neighborhood that holds 10% of the data must span 80% of every axis in 10 dimensions
- Why nearly all volume in a high-dimensional hypercube lives near the surface, not the center
- Why nearest-neighbor distances become meaningless as dimensions grow, and what to do about it
Before you start
A team built a KNN classifier with 2 features and got 91% accuracy. They added 18 more features — customer age, tenure, click counts, session lengths — and accuracy fell to 74%. More information made the model worse. That is the curse of dimensionality in production.
This lesson explains why it happens and what you can do about it.
The problem: data gets exponentially lonely
Imagine you have 1000 training points spread uniformly along a line (1 dimension). If you want a local neighborhood that covers 10% of the range, you grab 100 nearby points — plenty to work with.
Now spread those same 1000 points across a 10-dimensional unit hypercube. To capture 10% of the total volume, how wide must each axis of your neighborhood box be?
You need each axis to span r^(1/d) of the full range, where r is the target volume fraction and d is the number of dimensions. Plugging in:
r = 0.10, d = 10
axis span = 0.10^(1/10) = 0.794 = 79%
To grab 10% of the data in 10 dimensions you must reach across 79% of every axis. That neighborhood is not local at all. The table below shows how fast locality collapses:
| Dimensions | Axis span to capture 10% of data |
|---|---|
| 1 | 10% |
| 2 | 32% |
| 5 | 63% |
| 10 | 79% |
| 20 | 89% |
Every extra feature exponentially dilutes the density of your data. With a fixed dataset size the points are not removed — they simply become vanishingly sparse relative to the space they inhabit.
Intuition 1: the empty hypercube
Picture 8 points, one at each corner of a unit cube (3 dimensions). The cube has a center at (0.5, 0.5, 0.5). No point is near the center — all 8 are at the boundary. That pattern gets much worse as dimensions grow.
Intuition 2: volume hides near the boundary
Take a unit hypercube in d dimensions. Peel off a thin shell of thickness 0.05 from every face. What fraction of the total volume is in that shell?
shell fraction = 1 - (1 - 2 * 0.05)^d = 1 - 0.90^d
| Dimensions | Volume in the outer 5% shell |
|---|---|
| 1 | 10% |
| 2 | 19% |
| 10 | 65% |
| 100 | ~100% |
In 10 dimensions, 65% of the volume lives within 5% of the boundary. In 100 dimensions essentially everything is near the surface. The center of the space is essentially empty. Points sampled uniformly are almost all boundary points — they are extreme in at least one dimension — and any distance metric that cares about the “typical” region will be measuring noise.
Intuition 3: nearest and farthest neighbors converge
If all distances are about the same number, ranking them means nothing. This is distance concentration: as d grows, the ratio max_distance / min_distance approaches 1.
You should see output close to:
d=2 ratio max/min = 123.94
d=10 ratio max/min = 4.60
d=50 ratio max/min = 1.88
d=200 ratio max/min = 1.35
At d=2 the farthest pair is 124 times further than the nearest — distances are highly informative. At d=200 the farthest pair is only 35% further than the nearest. Every point is almost equally far from every other point. Asking “who is my nearest neighbor?” becomes nearly meaningless.
Why this breaks specific algorithms
KNN finds the k nearest points and votes. If distances are all nearly equal, “nearest” is arbitrary — the vote is noise.
k-means assigns points to their closest centroid. Centroids in high dimensions sit in empty regions; the notion of a tight cluster dissolves.
Kernel density estimation builds a density model from local point concentrations. With empty interiors and boundary-hugging points, the density estimate becomes flat and uninformative.
Linear regression and tree-based models are largely immune: linear regression cares about global coefficient weights, not local neighborhoods; decision trees split one feature at a time, so they do not suffer from the same exponential volume growth.
The cures
-
Dimensionality reduction. PCA projects data onto the directions of maximum variance, discarding dimensions that add noise. UMAP/t-SNE are nonlinear alternatives used for visualization.
-
Feature selection. Remove features with low variance, low correlation to the target, or high redundancy. Fewer features = denser, more meaningful neighborhoods.
-
Switch to distance-free models. Gradient-boosted trees and random forests split on individual features; they do not compute pairwise distances and scale well to hundreds of features.
-
Collect more data. Sparsity is a density problem. More points help, but the required sample size grows exponentially with
d— this rarely fully solves the problem. -
Regularize aggressively. Ridge and lasso regression add a penalty that implicitly performs feature selection or shrinkage, limiting the damage from irrelevant features.
Next
Dimensionality reduction in practice with PCA.
Practice this in an interview
All questionsIn high-dimensional spaces all pairwise distances concentrate around the same value, so the concept of a 'nearest' neighbour breaks down — the k-th nearest neighbour is almost as far as every other point. KNN's accuracy degrades sharply as dimensionality increases unless the data has much lower intrinsic dimensionality.
As the number of features grows, the volume of the feature space increases exponentially, so training data becomes exponentially sparse. Distance-based algorithms degrade because points become approximately equidistant; density estimation requires data that grows exponentially; and overfitting risk rises for any fixed training set size.
Depth enables hierarchical feature composition — each layer can re-use and recombine features from the layer below, representing exponentially more functions than a single wide layer of the same parameter count. Width alone increases capacity linearly, while depth increases it exponentially in the class of computable functions.
One-hot encoding becomes impractical when a categorical feature has hundreds or thousands of unique levels, producing a sparse matrix that slows training and causes overfitting on rare categories. Better approaches include target encoding with smoothing, frequency encoding, hashing, learned embeddings, or grouping rare categories into an 'Other' bucket, each with different tradeoffs on leakage risk and information retention.