datarekha

The Curse of Dimensionality

Add features and your data gets exponentially lonelier. In high dimensions everything is far away, distances stop meaning anything, and intuition breaks.

8 min read Intermediate Machine Learning Lesson 4 of 17

What you'll learn

  • Why a neighborhood that holds 10% of the data must span 80% of every axis in 10 dimensions
  • Why nearly all volume in a high-dimensional hypercube lives near the surface, not the center
  • Why nearest-neighbor distances become meaningless as dimensions grow, and what to do about it

Before you start

A team built a KNN classifier with 2 features and got 91% accuracy. They added 18 more features — customer age, tenure, click counts, session lengths — and accuracy fell to 74%. More information made the model worse. That is the curse of dimensionality in production.

This lesson explains why it happens and what you can do about it.

The problem: data gets exponentially lonely

Imagine you have 1000 training points spread uniformly along a line (1 dimension). If you want a local neighborhood that covers 10% of the range, you grab 100 nearby points — plenty to work with.

Now spread those same 1000 points across a 10-dimensional unit hypercube. To capture 10% of the total volume, how wide must each axis of your neighborhood box be?

You need each axis to span r^(1/d) of the full range, where r is the target volume fraction and d is the number of dimensions. Plugging in:

r = 0.10, d = 10
axis span = 0.10^(1/10) = 0.794 = 79%

To grab 10% of the data in 10 dimensions you must reach across 79% of every axis. That neighborhood is not local at all. The table below shows how fast locality collapses:

DimensionsAxis span to capture 10% of data
110%
232%
563%
1079%
2089%

Every extra feature exponentially dilutes the density of your data. With a fixed dataset size the points are not removed — they simply become vanishingly sparse relative to the space they inhabit.

Intuition 1: the empty hypercube

Picture 8 points, one at each corner of a unit cube (3 dimensions). The cube has a center at (0.5, 0.5, 0.5). No point is near the center — all 8 are at the boundary. That pattern gets much worse as dimensions grow.

8 points — same count, growing space1Ddense2Dsparser3D (projected)very sparse
Same 8 points, growing space. Adding one dimension doubles the volume; the points become isolated.

Intuition 2: volume hides near the boundary

Take a unit hypercube in d dimensions. Peel off a thin shell of thickness 0.05 from every face. What fraction of the total volume is in that shell?

shell fraction = 1 - (1 - 2 * 0.05)^d = 1 - 0.90^d
DimensionsVolume in the outer 5% shell
110%
219%
1065%
100~100%

In 10 dimensions, 65% of the volume lives within 5% of the boundary. In 100 dimensions essentially everything is near the surface. The center of the space is essentially empty. Points sampled uniformly are almost all boundary points — they are extreme in at least one dimension — and any distance metric that cares about the “typical” region will be measuring noise.

Intuition 3: nearest and farthest neighbors converge

If all distances are about the same number, ranking them means nothing. This is distance concentration: as d grows, the ratio max_distance / min_distance approaches 1.

You should see output close to:

d=2    ratio max/min = 123.94
d=10   ratio max/min = 4.60
d=50   ratio max/min = 1.88
d=200  ratio max/min = 1.35

At d=2 the farthest pair is 124 times further than the nearest — distances are highly informative. At d=200 the farthest pair is only 35% further than the nearest. Every point is almost equally far from every other point. Asking “who is my nearest neighbor?” becomes nearly meaningless.

Why this breaks specific algorithms

KNN finds the k nearest points and votes. If distances are all nearly equal, “nearest” is arbitrary — the vote is noise.

k-means assigns points to their closest centroid. Centroids in high dimensions sit in empty regions; the notion of a tight cluster dissolves.

Kernel density estimation builds a density model from local point concentrations. With empty interiors and boundary-hugging points, the density estimate becomes flat and uninformative.

Linear regression and tree-based models are largely immune: linear regression cares about global coefficient weights, not local neighborhoods; decision trees split one feature at a time, so they do not suffer from the same exponential volume growth.

The cures

  1. Dimensionality reduction. PCA projects data onto the directions of maximum variance, discarding dimensions that add noise. UMAP/t-SNE are nonlinear alternatives used for visualization.

  2. Feature selection. Remove features with low variance, low correlation to the target, or high redundancy. Fewer features = denser, more meaningful neighborhoods.

  3. Switch to distance-free models. Gradient-boosted trees and random forests split on individual features; they do not compute pairwise distances and scale well to hundreds of features.

  4. Collect more data. Sparsity is a density problem. More points help, but the required sample size grows exponentially with d — this rarely fully solves the problem.

  5. Regularize aggressively. Ridge and lasso regression add a penalty that implicitly performs feature selection or shrinkage, limiting the damage from irrelevant features.

Next

Dimensionality reduction in practice with PCA.


Practice this in an interview

All questions
How does the curse of dimensionality affect KNN?

In high-dimensional spaces all pairwise distances concentrate around the same value, so the concept of a 'nearest' neighbour breaks down — the k-th nearest neighbour is almost as far as every other point. KNN's accuracy degrades sharply as dimensionality increases unless the data has much lower intrinsic dimensionality.

What is the curse of dimensionality, and how does it affect machine learning models?

As the number of features grows, the volume of the feature space increases exponentially, so training data becomes exponentially sparse. Distance-based algorithms degrade because points become approximately equidistant; density estimation requires data that grows exponentially; and overfitting risk rises for any fixed training set size.

Why does depth help more than width for learning complex functions?

Depth enables hierarchical feature composition — each layer can re-use and recombine features from the layer below, representing exponentially more functions than a single wide layer of the same parameter count. Width alone increases capacity linearly, while depth increases it exponentially in the class of computable functions.

How do you handle high-cardinality categorical features in machine learning?

One-hot encoding becomes impractical when a categorical feature has hundreds or thousands of unique levels, producing a sparse matrix that slows training and causes overfitting on rare categories. Better approaches include target encoding with smoothing, frequency encoding, hashing, learned embeddings, or grouping rare categories into an 'Other' bucket, each with different tradeoffs on leakage risk and information retention.

Sign in to track your progress

Completed lessons, your XP, level, and streak save to your account — it's free and takes a few seconds.

Explore further

Related lessons

Skip to content