How does PCA work, and how do you choose the number of components?

PCA finds orthogonal directions (principal components) of maximum variance by computing the eigenvectors of the covariance matrix of the mean-centered data, then projects the data onto the top components. Choose the number of components by the cumulative explained variance ratio (e.g. enough to retain 95%), a scree-plot elbow, or downstream task performance. Standardize features first whenever they are on different scales: PCA is variance-driven, so an unscaled column with large numeric values would otherwise dominate the components.
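A minimal sketch of why scaling matters, assuming two made-up, independent features (`income` and `age` are hypothetical names chosen only for their very different numeric ranges). It compares the explained-variance ratio before and after standardizing:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Two independent, hypothetical features on very different numeric scales.
income = rng.normal(50_000, 15_000, n)   # large numbers
age = rng.normal(40, 10, n)              # small numbers
X = np.column_stack([income, age])


def explained_variance_ratio(X):
    """Eigenvalues of the covariance matrix as fractions of the total variance."""
    Xc = X - X.mean(axis=0)
    eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]  # descending
    return eigvals / eigvals.sum()


raw_ratio = explained_variance_ratio(X)

# Standardize: zero mean and unit variance per column.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_ratio = explained_variance_ratio(Xs)

print("raw:   ", np.round(raw_ratio, 4))
print("scaled:", np.round(scaled_ratio, 4))
```

On the raw data the first component should absorb nearly all the variance simply because `income` has the bigger numbers. After standardizing, the two independent features should share the variance roughly evenly.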

What are the assumptions and limitations of PCA, and when would it hurt your model?

PCA assumes linear relationships, that variance equals importance, and that components should be orthogonal. It can hurt when the predictive signal lives in low-variance directions, when relationships are nonlinear, or when interpretability matters, since components mix original features. It's also sensitive to scaling and outliers and is unsupervised, so it ignores the target.
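To make the "signal in a low-variance direction" failure concrete, here is a small sketch on synthetic data (the data and the helper `gap` are assumptions for illustration). One axis is loud but unrelated to the target; the other is quiet but encodes it:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

y = rng.integers(0, 2, n)                             # binary target
noise_axis = rng.normal(0, 10.0, n)                   # high variance, unrelated to y
signal_axis = (2 * y - 1) + rng.normal(0, 0.3, n)     # low variance, encodes y
X = np.column_stack([noise_axis, signal_axis])

# PCA by hand: eigenvectors of the covariance matrix, sorted by variance.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]
z1 = Xc @ eigvecs[:, order[0]]   # scores on PC1 (the component you would keep)
z2 = Xc @ eigvecs[:, order[1]]   # scores on PC2 (the component you would drop)


def gap(z, y):
    """Absolute difference between class means, in units of the score's std."""
    return abs(z[y == 1].mean() - z[y == 0].mean()) / z.std()


print(f"class separation on PC1: {gap(z1, y):.2f}")
print(f"class separation on PC2: {gap(z2, y):.2f}")
```

Keeping only PC1 here would keep the noise and discard almost everything that distinguishes the classes, because PCA never looks at `y`.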

What is PCA, when should you use it, and what are its key limitations?

PCA finds the orthogonal directions of maximum variance in the data and projects onto a lower-dimensional subspace, reducing features while retaining most information. It is most useful before distance-based models or when training is bottlenecked by dimensionality. Its main limits are loss of interpretability, sensitivity to scale, and an assumption of linear structure.
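As a rough illustration of PCA in front of a distance-based model, the sketch below chains scaling, PCA, and k-nearest neighbors in one scikit-learn pipeline. The dataset (digits), the 20 components, and `n_neighbors=5` are arbitrary choices for the example, not recommendations:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scaler and PCA are fitted on the training split only, then reused on the test split.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=20, random_state=0),
    KNeighborsClassifier(n_neighbors=5),
)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"test accuracy with 20 components: {accuracy:.3f}")
```

Putting PCA inside the pipeline keeps the test data out of the fitted components, which matters as soon as you cross-validate the number of components.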

What's the difference between feature selection and dimensionality reduction like PCA?

Feature selection keeps a subset of the original features and discards the rest, so the surviving features stay interpretable. Dimensionality reduction like PCA creates new features that are combinations of the originals, compressing information but losing direct interpretability. Choose feature selection when you need to explain which inputs matter, and PCA when you mainly need a compact representation and don't need named features.
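The contrast can be sketched side by side. This example assumes scikit-learn's wine dataset (chosen only because its 13 columns have names) and uses `SelectKBest` with `f_classif` as one possible selection method among many:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

data = load_wine()
X, y = data.data, data.target
names = np.array(data.feature_names)
Xs = StandardScaler().fit_transform(X)

# Feature selection: keep 3 of the original columns, untouched.
selector = SelectKBest(f_classif, k=3).fit(Xs, y)
kept = names[selector.get_support()]
X_sel = selector.transform(Xs)

# PCA: build 3 new columns, each a weighted mix of the originals.
pca = PCA(n_components=3).fit(Xs)
X_pca = pca.transform(Xs)
features_per_component = (np.abs(pca.components_) > 1e-12).sum(axis=1)

print("selected columns:", list(kept))
print("original features feeding each component:", features_per_component)
```

Both outputs have three columns, but the selected ones still carry their original names, while each principal component draws on every input feature.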

PCA & dimensionality reduction — Machine Learning

The curse of dimensionality says too many features hurt: as dimensions grow, the data becomes sparse and distances between points become less informative. PCA (principal component analysis) is the classic cure: it finds the handful of directions your data actually varies along and lets you throw the rest away, keeping most of the variance in far fewer dimensions. It’s one of the most widely used unsupervised techniques, and a common interview topic.

[Interactive demo: drag points and watch the principal axes re-fit live. Example readout of variance explained: PC1 91.7%, PC2 8.3%, with the data mean at (0.4, 0.3).]

PC1 (solid) points along maximum variance. PC2 (dashed) is perpendicular. Toggling "project onto PC1" collapses the cloud to 1D — that's the 2D-to-1D reduction in action.

The idea: rotate to where the variance is

Real features are usually correlated, so the data doesn’t fill its space evenly — it stretches along certain directions. PCA finds those directions, called principal components:

PC1 is the direction of maximum variance (the long axis of the cloud).
PC2 is the direction of most remaining variance, orthogonal to PC1.
…and so on, each capturing less.

Mathematically, the principal components are the eigenvectors of the covariance matrix, and each one’s eigenvalue is the variance along it. PCA is just a rotation of the mean-centered data onto those axes.

Once you’re on the principal axes, the low-variance components carry little information, so you keep the top k and drop the rest. That’s dimensionality reduction.
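The rotate-then-drop recipe above can be sketched in plain NumPy. The toy 2D cloud and its covariance are made up for illustration; in practice you would reach for `sklearn.decomposition.PCA`, which uses an SVD rather than this explicit eigendecomposition:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical correlated 2D cloud.
cov_true = np.array([[3.0, 1.5],
                     [1.5, 1.0]])
X = rng.multivariate_normal(mean=[0.4, 0.3], cov=cov_true, size=500)

# 1. Center the data.
Xc = X - X.mean(axis=0)

# 2. Eigendecompose the covariance matrix (eigh returns ascending eigenvalues).
C = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)

# 3. Sort components by variance, largest first.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Rotate onto the principal axes, then keep only the top k = 1.
Z = Xc @ eigvecs
Z1 = Z[:, :1]

ratio = eigvals / eigvals.sum()
print("explained-variance ratio:", np.round(ratio, 3))
print("variance along each axis:", np.round(Z.var(axis=0, ddof=1), 3))
```

After the rotation, the variance of each column of `Z` equals the matching eigenvalue and the columns are uncorrelated, which is what makes dropping the trailing ones a clean operation.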

How many components to keep

You don’t guess — you read the explained-variance ratio. Each component explains some fraction of the total variance; sum them up and keep enough components to retain, say, 95%.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_digits

X, _ = load_digits(return_X_y=True)   # 64 features (8x8 pixel images)
Xs = StandardScaler().fit_transform(X)

pca = PCA().fit(Xs)
cum = np.cumsum(pca.explained_variance_ratio_)
for k in [2, 10, 20, 30, 40]:
    print(f"  {k:2d} components keep {cum[k-1]*100:5.1f}% of the variance")

# How many components to reach 95%?
k95 = int(np.argmax(cum >= 0.95)) + 1
print(f"\n{k95} of 64 components retain 95% of the information.")

   2 components keep  21.6% of the variance
  10 components keep  58.9% of the variance
  20 components keep  79.3% of the variance
  30 components keep  89.3% of the variance
  40 components keep  95.1% of the variance

40 of 64 components retain 95% of the information.

Two components already capture 21.6% — a usable thumbnail — but you need 40 of the 64 to preserve 95%. The digits genuinely live in many dimensions; the explained-variance ratio tells you exactly how many you can afford to drop.
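If you only need the reduced data, scikit-learn can pick the count for you: passing a float between 0 and 1 as `n_components` asks `PCA` to keep just enough components to reach that share of the variance. A short sketch on the same digits data:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)
Xs = StandardScaler().fit_transform(X)

# A float n_components is a variance target, not a component count.
pca95 = PCA(n_components=0.95, svd_solver="full").fit(Xs)
X_reduced = pca95.transform(Xs)

print("components kept:", pca95.n_components_)
print("reduced shape:  ", X_reduced.shape)
print("variance kept:  ", round(pca95.explained_variance_ratio_.sum(), 3))
```

The number of components kept should agree with the manual cumulative-sum calculation, so this is mostly a convenience once you have settled on a threshold.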

What PCA is good (and bad) at

Good: speeding up downstream models (fewer features), denoising (dropping low-variance components removes noise), decorrelating features, and compression.
Visualization: projecting to 2 components to eyeball structure — though for seeing clusters, the nonlinear methods t-SNE and UMAP usually reveal more.
The catch: PCA is linear — it can only find straight-line directions. If your structure is curved (a spiral), PCA misses it. And the components are combinations of all original features, so they’re less interpretable than the raw columns.
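To see the linearity limit in action, here is a sketch on two concentric rings (synthetic data, invented for this example). The structure is obvious to the eye, but it is a function of radius, which no straight-line projection can recover:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400

# Two concentric rings: label 0 = inner (radius 0.3), label 1 = outer (radius 1.0).
theta = rng.uniform(0, 2 * np.pi, n)
label = rng.integers(0, 2, n)
radius = np.where(label == 1, 1.0, 0.3) + rng.normal(0, 0.03, n)
X = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])

# PCA by hand, then project onto PC1 only.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]
ratio = eigvals[order] / eigvals.sum()
z1 = Xc @ eigvecs[:, order[0]]

# On PC1 the inner ring sits entirely inside the span of the outer ring.
inner, outer = z1[label == 0], z1[label == 1]
overlap = (inner.min() > outer.min()) and (inner.max() < outer.max())

# A nonlinear feature (distance from the origin) separates the rings cleanly.
r = np.linalg.norm(X, axis=1)

print("explained-variance ratio:", np.round(ratio, 3))
print("rings overlap on PC1:", bool(overlap))
print("rings separable by radius:", bool(r[label == 0].max() < r[label == 1].min()))
```

Neither component stands out because the rings vary about equally in every direction, so PCA has no preferred axis to find. That is the situation where nonlinear methods earn their keep.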

In one breath

Correlated features make data stretch along a few directions; PCA finds them — the principal components, eigenvectors of the covariance matrix, ordered by variance.
PC1 is the direction of maximum spread, PC2 the most remaining variance orthogonal to it, and so on.
Keep the top k by reading the cumulative explained-variance ratio (e.g. enough for 95%) and drop the rest — that is the dimensionality reduction.
Standardize first whenever features are on different scales, or PCA just chases your largest-scaled column.
It’s linear and its components aren’t nameable — for curved structure use t-SNE/UMAP, and to know which feature matters use feature selection or SHAP.

Quick check

Q1. What is the first principal component (PC1)?

Q2. How do you decide how many principal components to keep?

Q3. Why must you standardize features before PCA?

For seeing clusters in high-dimensional data, the nonlinear projections t-SNE & UMAP often beat PCA. And to group what you’ve reduced, revisit k-means and DBSCAN.
