datarekha

PCA & dimensionality reduction

Find the few directions your data actually varies along, and drop the rest. How principal component analysis compresses high-dimensional data while keeping most of the information.

8 min read · Intermediate · Machine Learning · Lesson 28 of 33

What you'll learn

  • PCA as a rotation to the axes of maximum variance (the eigenvectors of the covariance matrix)
  • Reading explained-variance to choose how many components to keep
  • When PCA helps — visualization, denoising, speeding up models — and its limits

The curse of dimensionality says that too many features hurt. PCA (principal component analysis) is the classic remedy: it finds the handful of directions your data actually varies along and lets you discard the rest, keeping most of the variance in far fewer dimensions. It is one of the most widely used unsupervised techniques, alongside clustering, and a common interview topic.

The idea: rotate to where the variance is

Real features are usually correlated, so the data doesn’t fill its space evenly — it stretches along certain directions. PCA finds those directions, called principal components:

  • PC1 is the direction of maximum variance (the long axis of the cloud).
  • PC2 is the direction of most remaining variance, orthogonal to PC1.
  • …and so on, each capturing less.

Mathematically, the principal components are the eigenvectors of the covariance matrix of the mean-centered data, and each one’s eigenvalue is the variance along it. PCA is just a rotation onto those axes.
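Written out, the relationship looks roughly like this. The notation is an assumption for illustration and does not come from the lesson: X is the n × d data matrix with mean-centered columns, C its sample covariance, v_i the i-th principal component, λ_i its eigenvalue, and V_k the matrix whose columns are the top k components.

```latex
C = \frac{1}{n-1} X^{\top} X, \qquad
C\, v_i = \lambda_i\, v_i, \qquad
\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d \ge 0

Z = X V_k, \qquad
\operatorname{Var}(Z_{\cdot i}) = v_i^{\top} C\, v_i = \lambda_i
```

The last identity is why the eigenvalue can be read as "the variance along" a component: projecting the centered data onto v_i gives a coordinate whose sample variance is exactly λ_i.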

Once you’re on the principal axes, the low-variance components carry little information, so you keep the top k and drop the rest. That’s dimensionality reduction.
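The steps described so far (center, take the covariance, find its eigenvectors, keep the top k) can be sketched with NumPy. This is a minimal illustration, not a production implementation: the function name `pca_eig` and the synthetic two-feature data are assumptions made up for the example.

```python
import numpy as np

def pca_eig(X, k):
    """Project X (n_samples, n_features) onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                  # center each feature
    cov = np.cov(Xc, rowvar=False)           # feature-by-feature covariance
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # reorder to descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :k]              # columns are PC1..PCk
    return Xc @ components, components, eigvals

# Hypothetical data: two strongly correlated features
rng = np.random.default_rng(0)
t = rng.normal(size=500)
X = np.column_stack([t, 0.5 * t + 0.1 * rng.normal(size=500)])

Z, components, eigvals = pca_eig(X, k=1)     # 2-D data reduced to 1-D
```

Here `Z` holds each point's coordinate along PC1, and its sample variance equals the largest eigenvalue. Libraries typically compute the same result through an SVD of the centered data, which is numerically more stable than forming the covariance matrix explicitly.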

How many components to keep

You don’t guess: you read the explained-variance ratio. Each component explains some fraction of the total variance (its eigenvalue divided by the sum of all eigenvalues). Take the cumulative sum and keep the smallest number of components that retains, say, 95%.
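As a rough sketch of that rule, the helper below picks the smallest k whose cumulative explained-variance ratio reaches a target. The function name `n_components_for` and the eigenvalues are made up for illustration.

```python
from itertools import accumulate

def n_components_for(variances, target=0.95):
    """Return (k, cumulative ratios) for the smallest k reaching `target`."""
    ordered = sorted(variances, reverse=True)      # largest variance first
    total = sum(ordered)
    cumulative = [c / total for c in accumulate(ordered)]
    for k, frac in enumerate(cumulative, start=1):
        if frac >= target:
            return k, cumulative
    return len(ordered), cumulative                # target above 1.0: keep all

# Hypothetical per-component variances (eigenvalues)
eigvals = [5.0, 3.0, 1.2, 0.5, 0.3]
k, cumulative = n_components_for(eigvals, target=0.95)
```

With these values the first three components retain about 92% of the variance, so a 95% target needs four. In scikit-learn the same per-component fractions are exposed as the `explained_variance_ratio_` attribute of a fitted `PCA`.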

What PCA is good (and bad) at

  • Good: speeding up downstream models (fewer features), denoising (dropping low-variance components removes noise), decorrelating features, and compression.
  • Visualization: projecting to 2 components to eyeball structure — though for seeing clusters, the nonlinear methods t-SNE and UMAP usually reveal more.
  • The catch: PCA is linear, so it can only find straight-line directions. If your structure is curved (a spiral), PCA misses it. The components are linear combinations of all original features, so they’re less interpretable than the raw columns. And because PCA is variance-driven, features measured on larger scales dominate unless you standardize first.
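To make the compression and denoising points concrete, here is a small sketch that projects noisy 3-D data onto one component and maps it back. The data is synthetic and assumed for illustration: a single underlying direction plus isotropic Gaussian noise, which is the favorable case for PCA.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400

# Hypothetical clean signal: points along one direction in 3-D
signal = rng.normal(size=(n, 1)) @ np.array([[2.0, 1.0, -1.0]])
X = signal + 0.1 * rng.normal(size=(n, 3))     # add small isotropic noise

mean = X.mean(axis=0)
Xc = X - mean
# Rows of Vt are the principal components, ordered by decreasing variance
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 1
Z = Xc @ Vt[:k].T              # compress: 3 numbers per point -> 1
X_hat = Z @ Vt[:k] + mean      # reconstruct back in the original 3-D space

err_noisy = np.mean((X - signal) ** 2)         # error of the raw noisy data
err_denoised = np.mean((X_hat - signal) ** 2)  # error after keeping only PC1
```

Because the noise is spread across all three directions while the signal lives on one, dropping the two low-variance components discards most of the noise and little of the signal, so `err_denoised` comes out below `err_noisy`. If the useful signal sat in a low-variance direction instead, the same step would throw it away.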

Quick check

Q1. What is the first principal component (PC1)?
Q2. How do you decide how many principal components to keep?
Q3. Why must you standardize features before PCA?

Next

For seeing clusters in high-dimensional data, the nonlinear projections t-SNE & UMAP often beat PCA. And to group what you’ve reduced, revisit k-means and DBSCAN.

Practice this in an interview

What is PCA, when should you use it, and what are its key limitations?

PCA finds the orthogonal directions of maximum variance in the data and projects onto a lower-dimensional subspace, reducing features while retaining most information. It is most useful before distance-based models or when training is bottlenecked by dimensionality. Its main limits are loss of interpretability, sensitivity to scale, and an assumption of linear structure.

How does PCA work, and how do you choose the number of components?

PCA finds orthogonal directions (principal components) of maximum variance by computing the eigenvectors of the covariance matrix, then projects data onto the top components. Choose the number of components by the cumulative explained variance ratio (e.g. enough to retain 95%), a scree-plot elbow, or downstream task performance. Always standardize features first, since PCA is variance-driven.
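The standardization advice is easy to see numerically. The sketch below uses two made-up, independent features on very different scales (the names `height_m` and `income` and the helper `explained_ratio` are assumptions for illustration) and compares the explained-variance ratios before and after scaling.

```python
import numpy as np

def explained_ratio(X):
    """Explained-variance ratio per component, largest first."""
    Xc = X - X.mean(axis=0)
    eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]  # descending
    return eigvals / eigvals.sum()

rng = np.random.default_rng(2)
height_m = rng.normal(1.7, 0.1, size=300)         # metres, tiny variance
income = rng.normal(50_000, 10_000, size=300)     # currency units, huge variance
X = np.column_stack([height_m, income])

raw = explained_ratio(X)                          # PCA on raw units

# Standardize: zero mean, unit variance per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
scaled = explained_ratio(X_std)                   # PCA on standardized data
```

On the raw data PC1 is essentially the income axis and explains almost all the variance, purely because of units. After standardizing, the two independent features share the variance roughly evenly, which reflects the actual structure.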

What's the difference between feature selection and dimensionality reduction like PCA?

Feature selection keeps a subset of the original features and discards the rest, so the surviving features stay interpretable. Dimensionality reduction like PCA creates new features that are combinations of the originals, compressing information but losing direct interpretability. Choose feature selection when you need to explain which inputs matter, and PCA when you mainly need a compact representation and don't need named features.

What are the assumptions and limitations of PCA, and when would it hurt your model?

PCA assumes linear relationships, that variance equals importance, and that components should be orthogonal. It can hurt when the predictive signal lives in low-variance directions, when relationships are nonlinear, or when interpretability matters, since components mix original features. It's also sensitive to scaling and outliers and is unsupervised, so it ignores the target.
