What is the curse of dimensionality, and how does it affect machine learning models?

As the number of features grows, the volume of the feature space increases exponentially, so training data becomes exponentially sparse. Distance-based algorithms degrade because points become approximately equidistant; density estimation requires data that grows exponentially; and overfitting risk rises for any fixed training set size.

How does the curse of dimensionality affect KNN?

In high-dimensional spaces all pairwise distances concentrate around the same value, so the concept of a 'nearest' neighbour breaks down — the k-th nearest neighbour is almost as far as every other point. KNN's accuracy degrades sharply as dimensionality increases unless the data has much lower intrinsic dimensionality.

How does k-nearest neighbours work, and why is it called a lazy learner?

KNN stores the entire training set and defers all computation to prediction time: for a new point it finds the k closest training examples by distance, then returns the majority class (classification) or mean value (regression). It is called lazy because there is no training phase — the model is the data itself.
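The prediction step described above can be sketched in a few lines of NumPy. This is a minimal illustration under assumed names (`knn_predict`, the toy arrays, and the `task` switch are all hypothetical), not a production implementation:

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Predict for one point; 'training' was only storing X_train and y_train."""
    # All the work happens at prediction time: distance to every stored point.
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]          # indices of the k closest points
    neighbors = y_train[nearest]
    if task == "classification":
        # Majority vote among the k neighbors.
        return Counter(neighbors.tolist()).most_common(1)[0][0]
    # Regression: average the neighbors' target values.
    return float(np.mean(neighbors))


# Toy data (made up for illustration): two well-separated groups.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_value = np.array([1.0, 1.2, 1.1, 5.0, 5.2, 4.8])

print(knn_predict(X_train, y_class, np.array([0.15, 0.15])))
print(knn_predict(X_train, y_value, np.array([0.95, 1.0]), task="regression"))
```

Note that there is no `fit` step at all: the stored arrays are the model, which is what "lazy" means here.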

What are t-SNE and UMAP, how do they differ from PCA, and what are their limitations for ML workflows?

t-SNE and UMAP are nonlinear dimensionality reduction algorithms designed primarily for 2D/3D visualization of high-dimensional data. Unlike PCA, they preserve local neighborhood structure rather than global variance, producing cleaner cluster separations in plots. Neither is a good default preprocessing step for training a supervised model: t-SNE is transductive (it has no built-in way to embed unseen points), and both produce embeddings that change across runs unless the random seed is fixed.

The Curse of Dimensionality — Machine Learning

A team built a KNN classifier with 2 features and got 91% accuracy. They added 18 more features — customer age, tenure, click counts, session lengths — and accuracy fell to 74%. More information made the model worse. That is the curse of dimensionality in production.

This lesson explains why it happens and what you can do about it.

The problem: data gets exponentially lonely

Imagine you have 1000 training points spread uniformly along a line (1 dimension). If you want a local neighborhood that covers 10% of the range, you grab 100 nearby points — plenty to work with.

Now spread those same 1000 points across a 10-dimensional unit hypercube. To capture 10% of the total volume, how wide must each axis of your neighborhood box be?

You need each axis to span r^(1/d) of the full range, where r is the target volume fraction and d is the number of dimensions. Plugging in:

r = 0.10, d = 10
axis span = 0.10^(1/10) ≈ 0.794 ≈ 79%

To grab 10% of the data in 10 dimensions you must reach across 79% of every axis. That neighborhood is not local at all. The table below shows how fast locality collapses:

Dimensions	Axis span to capture 10% of data
1	10%
2	32%
5	63%
10	79%
20	89%

Every extra feature exponentially dilutes the density of your data. With a fixed dataset size the points are not removed — they simply become vanishingly sparse relative to the space they inhabit.
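The axis-span figures can be reproduced directly from the formula r^(1/d). The helper name `axis_span` is an assumption made for this sketch:

```python
def axis_span(volume_fraction: float, dims: int) -> float:
    """Side length of a sub-cube that holds `volume_fraction` of a unit hypercube."""
    return volume_fraction ** (1.0 / dims)


# Capture 10% of the volume in 1, 2, 5, 10 and 20 dimensions.
for d in [1, 2, 5, 10, 20]:
    print(f"d={d:>2}  axis span = {axis_span(0.10, d):.0%}")
```

Changing the first argument shows how quickly even a much smaller target fraction forces a wide neighborhood once d is large.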

Intuition 1: the empty hypercube

Picture 8 points, one at each corner of a unit cube (3 dimensions). The cube has a center at (0.5, 0.5, 0.5). No point is near the center — all 8 are at the boundary. That pattern gets much worse as dimensions grow.

Same 8 points, growing space. Each added dimension doubles the number of corners of the hypercube (2^d in total), so the same 8 points occupy a shrinking fraction of the space and become isolated.
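A short sketch makes the corner picture concrete. It assumes the unit hypercube [0, 1]^d, where every corner lies sqrt(d)/2 from the center and there are 2^d corners; the function names are hypothetical:

```python
import math


def corner_to_center(dims: int) -> float:
    """Distance from any corner of the unit hypercube to its center."""
    return math.sqrt(dims) / 2.0


def occupied_corner_fraction(n_points: int, dims: int) -> float:
    """Largest share of the 2**dims corners that n_points can occupy."""
    return min(1.0, n_points / 2 ** dims)


for d in [3, 5, 10, 20]:
    print(f"d={d:>2}  corner-to-center = {corner_to_center(d):.2f}  "
          f"corners covered by 8 points = {occupied_corner_fraction(8, d):.6f}")
```

Both numbers move in the wrong direction together: the corners drift away from the center while 8 points cover fewer and fewer of them.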

Intuition 2: volume hides near the boundary

Take a unit hypercube in d dimensions. Peel off a thin shell of thickness 0.05 from every face. What fraction of the total volume is in that shell?

shell fraction = 1 - (1 - 2 * 0.05)^d = 1 - 0.90^d

Dimensions	Volume in the outer 5% shell
1	10%
2	19%
10	65%
100	~100%

In 10 dimensions, 65% of the volume lives within 5% of the boundary. In 100 dimensions essentially everything is near the surface and the center of the space is almost empty. Points sampled uniformly are almost all boundary points (each is extreme in at least one dimension), so any method that relies on distances to a “typical” interior region ends up measuring mostly noise.
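The shell formula can be checked against a simulation. The sketch below computes 1 - 0.90^d in closed form and compares it with the share of uniformly sampled points that have at least one coordinate within 0.05 of a face; the function names and the sample size are assumptions for illustration:

```python
import numpy as np


def shell_fraction(dims: int, thickness: float = 0.05) -> float:
    """Closed form: volume outside the inner cube of side 1 - 2 * thickness."""
    return 1.0 - (1.0 - 2.0 * thickness) ** dims


def empirical_shell_fraction(dims, thickness=0.05, n=20_000, seed=0):
    """Share of uniform samples with at least one coordinate near a face."""
    rng = np.random.default_rng(seed)
    pts = rng.uniform(0, 1, size=(n, dims))
    near_face = (pts < thickness) | (pts > 1.0 - thickness)
    return float(near_face.any(axis=1).mean())


for d in [1, 2, 10, 100]:
    print(f"d={d:>3}  formula = {shell_fraction(d):.3f}  "
          f"simulated = {empirical_shell_fraction(d):.3f}")
```

The simulated column should track the formula closely, which is the "extreme in at least one dimension" statement expressed as a count.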

Intuition 3: nearest and farthest neighbors converge

If all distances are about the same number, ranking them means nothing. This is distance concentration: as d grows, the ratio max_distance / min_distance approaches 1.

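import numpy as np

rng = np.random.default_rng(0)
n = 100  # number of random points per experiment

for d in [2, 10, 50, 200]:
    # Sample n points uniformly from the d-dimensional unit hypercube.
    pts = rng.uniform(0, 1, size=(n, d))
    # Collect the Euclidean distance for every unordered pair (i, j).
    dists = []
    for i in range(n):
        for j in range(i + 1, n):
            diff = pts[i] - pts[j]
            dists.append(float(np.sqrt(np.dot(diff, diff))))
    lo = min(dists)
    hi = max(dists)
    # A ratio near 1 means the farthest pair is barely farther than the closest.
    print(f"d={d}  ratio max/min = {round(hi / lo, 2)}")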

d=2  ratio max/min = 123.94
d=10  ratio max/min = 4.6
d=50  ratio max/min = 1.88
d=200  ratio max/min = 1.35

At d=2 the farthest pair is about 124 times as far apart as the closest pair, so distances are highly informative. At d=200 the farthest pair is only 35% farther apart than the closest pair. Every point is almost equally far from every other point, and asking “who is my nearest neighbor?” becomes nearly meaningless.
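Another way to see the concentration is to compare the spread of the distances with their mean. For points uniform on the unit hypercube each coordinate contributes 1/6 to the expected squared distance, so the mean distance is roughly sqrt(d/6), while the standard deviation stays roughly constant. The sketch below is a vectorized illustration with hypothetical helper names, not the lesson's original experiment:

```python
import numpy as np


def pairwise_distances(pts: np.ndarray) -> np.ndarray:
    """Euclidean distances for all unordered pairs, without Python loops."""
    sq = np.sum(pts ** 2, axis=1)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    d2 = sq[:, None] + sq[None, :] - 2.0 * (pts @ pts.T)
    d2 = np.maximum(d2, 0.0)                 # clip tiny negative round-off
    upper = np.triu_indices(len(pts), k=1)   # each pair once, no diagonal
    return np.sqrt(d2[upper])


def distance_stats(dims, n=200, seed=0):
    """Mean and standard deviation of pairwise distances for uniform points."""
    rng = np.random.default_rng(seed)
    pts = rng.uniform(0, 1, size=(n, dims))
    dists = pairwise_distances(pts)
    return float(dists.mean()), float(dists.std())


for d in [2, 10, 50, 200]:
    mean, std = distance_stats(d)
    print(f"d={d:>3}  mean = {mean:.2f}  sqrt(d/6) = {np.sqrt(d / 6):.2f}  "
          f"std/mean = {std / mean:.3f}")
```

A shrinking std/mean column says the same thing as the max/min ratio, but it is less sensitive to a single unusually close pair.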

Why this breaks specific algorithms

KNN finds the k nearest points and votes. If distances are all nearly equal, “nearest” is arbitrary — the vote is noise.
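The effect on KNN can be illustrated with synthetic data: two informative features that separate the classes, plus a growing number of pure-noise features. Everything here (the data generator, the `loo_knn_accuracy` helper, the choice of k=5 and of 0, 18 and 200 noise features) is an assumption made for the sketch, and the exact accuracies depend on the random seed:

```python
import numpy as np


def loo_knn_accuracy(X, y, k=5):
    """Leave-one-out accuracy of a k-NN majority vote for binary labels 0/1."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)   # squared distances
    np.fill_diagonal(d2, np.inf)                       # a point cannot vote for itself
    nearest = np.argsort(d2, axis=1)[:, :k]
    votes = y[nearest].mean(axis=1)                    # share of neighbors in class 1
    pred = (votes > 0.5).astype(int)                   # k is odd, so no ties
    return float((pred == y).mean())


rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)
# Two informative features: class 1 is shifted by 2 standard deviations.
signal = rng.normal(size=(n, 2)) + 2.0 * y[:, None]

results = {}
for n_noise in [0, 18, 200]:
    noise = rng.normal(size=(n, n_noise))              # features unrelated to y
    X = np.hstack([signal, noise])
    results[n_noise] = loo_knn_accuracy(X, y)
    print(f"noise features = {n_noise:>3}  accuracy = {results[n_noise]:.2f}")
```

The informative features never change; only the noise dimensions are added, so any drop in accuracy comes from the distances being swamped by irrelevant coordinates.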

k-means assigns points to their closest centroid. Centroids in high dimensions sit in empty regions; the notion of a tight cluster dissolves.

Kernel density estimation builds a density model from local point concentrations. With empty interiors and boundary-hugging points, the density estimate becomes flat and uninformative.

Linear regression and tree-based models are largely immune to distance concentration: linear regression fits global coefficient weights rather than local neighborhoods, and decision trees split on one feature at a time, so neither depends on pairwise distances across all dimensions. They can still overfit when features are numerous relative to the training set size.

The cures

Dimensionality reduction. PCA projects data onto the directions of maximum variance and discards the low-variance directions, which often carry mostly noise. UMAP and t-SNE are nonlinear alternatives used mainly for visualization.
Feature selection. Remove features with low variance, low correlation to the target, or high redundancy. Fewer features = denser, more meaningful neighborhoods.
Switch to distance-free models. Gradient-boosted trees and random forests split on individual features; they do not compute pairwise distances and scale well to hundreds of features.
Collect more data. Sparsity is a density problem. More points help, but the required sample size grows exponentially with d — this rarely fully solves the problem.
Regularize aggressively. Ridge and lasso regression add a penalty on coefficient size: ridge shrinks coefficients toward zero, while lasso can set some exactly to zero and so performs implicit feature selection. Both limit the damage from irrelevant features.
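As an illustration of the first cure, PCA can be sketched with a singular value decomposition in plain NumPy. The synthetic data below (20 observed features driven by 2 hidden factors) and the helper name `pca_project` are assumptions made for the example; real data will rarely be this clean:

```python
import numpy as np


def pca_project(X, n_components):
    """Project X onto its top principal components via SVD."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_ratio = S ** 2 / np.sum(S ** 2)   # variance share per component
    return X_centered @ components.T, explained_ratio[:n_components]


rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))              # 2 hidden factors
mixing = rng.normal(size=(2, 20))               # spread across 20 observed features
X = latent @ mixing + 0.05 * rng.normal(size=(300, 20))

Z, ratio = pca_project(X, n_components=2)
print("reduced shape:", Z.shape)
print("variance kept by 2 components:", round(float(ratio.sum()), 3))
```

When the intrinsic dimensionality is low, as it is by construction here, two components retain nearly all the variance, and a distance-based model can work in the reduced space instead of the original 20 dimensions.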

In one breath

Adding features makes data exponentially sparse: to capture 10% of the volume in 10-D, a neighborhood must span 79% of every axis — “local” stops being local.
In high dimensions almost all volume hugs the boundary (65% within 5% of the surface at 10-D); the interior is empty, so uniform points are all “extreme.”
Distances concentrate — the max/min pairwise distance ratio → 1 as d grows, so “nearest neighbor” becomes nearly meaningless.
This breaks distance-based methods (KNN, k-means, KDE); linear models and trees are largely immune (global weights / one-feature splits, no pairwise distances).
Cures: dimensionality reduction (PCA, UMAP), feature selection, switch to tree models, regularize — and more data helps, but the required sample grows exponentially with d.

Quick check

Q1. You want a local neighborhood that contains 1% of the data volume in a 20-dimensional space. Approximately what fraction of each axis must the neighborhood span?

Q2. Why does KNN accuracy often drop when you add many weakly predictive features?

Q3. A company trains a fraud-detection model on 150 transaction features. KNN achieves only 61% F1. A colleague suggests switching to a gradient-boosted tree. The real reason this switch is likely to help is:

The first cure, in practice: PCA & dimensionality reduction — projecting data onto the directions of maximum variance to shed the noisy dimensions.
