What is the curse of dimensionality, and how does it affect machine learning models?
As the number of features grows, the volume of the feature space increases exponentially, so training data becomes exponentially sparse. Distance-based algorithms degrade because points become approximately equidistant; density estimation requires data that grows exponentially; and overfitting risk rises for any fixed training set size.
How to think about it
Coined by Bellman (1957), the curse of dimensionality refers to a cluster of related phenomena that emerge as feature dimensionality d grows.
Exponential volume growth. A unit hypercube in d dimensions has volume 1 regardless of d, but a ball of radius r inside it has volume proportional to r^d. To cover 10% of the input space with a ball, the ball radius must be r = 0.1^(1/d). At d = 10, r ≈ 0.79 — the “local” neighborhood is nearly the entire space.
Distance concentration. In high dimensions, the ratio (max distance - min distance) / min distance approaches 0. All pairwise distances become nearly equal, so k-NN’s notion of “nearest” neighbor loses discriminative power. This affects any kernel method or metric-learning approach.
Data sparsity. To maintain the same sampling density as a 1D problem with 10 points, a 10D problem needs 10^10 points. With a fixed training set of size n, the effective data density drops exponentially with d.
Overfitting. With many features and few samples, models can find spurious correlations. A linear model with d > n is under-determined; tree-based models can construct splits that perfectly separate training noise.
Practical consequences and mitigations:
| Problem | Mitigation |
|---|---|
| Sparse neighborhoods | PCA, t-SNE, autoencoders for dimensionality reduction |
| Spurious correlations | Regularization (L1 drops irrelevant features), feature selection |
| Slow distance computation | Approximate nearest neighbors (FAISS, HNSW) |
| Overfitting | Cross-validation, regularization, more data |
L1 regularization (Lasso) is particularly powerful: it produces sparse solutions that zero out irrelevant dimensions, effectively performing implicit feature selection.