What is PCA, when should you use it, and what are its key limitations?

PCA finds the orthogonal directions of maximum variance in the data and projects onto a lower-dimensional subspace, reducing the number of features while retaining most of the variance. It is most useful before distance-based models or when training is bottlenecked by dimensionality. Its main limits are loss of interpretability, sensitivity to scale, and an assumption of linear structure.

What are the assumptions and limitations of PCA, and when would it hurt your model?

PCA assumes linear relationships, that variance equals importance, and that components should be orthogonal. It can hurt when the predictive signal lives in low-variance directions, when relationships are nonlinear, or when interpretability matters, since components mix original features. It's also sensitive to scaling and outliers and is unsupervised, so it ignores the target.

How does PCA work, and how do you choose the number of components?

PCA finds orthogonal directions (principal components) of maximum variance by computing the eigenvectors of the covariance matrix, then projects the data onto the top components. Choose the number of components by the cumulative explained variance ratio (e.g. enough to retain 95%), a scree-plot elbow, or downstream task performance. Standardize features first when they are measured on different scales, since PCA is variance-driven and large-scale features would otherwise dominate.
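As a rough illustration of the recipe in this answer, here is a minimal NumPy sketch: standardize, eigendecompose the covariance matrix, and keep the smallest number of components whose cumulative explained variance reaches 95%. The synthetic data, the 95% threshold, and all variable names are assumptions made for the example, not part of the original answer.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples, 6 features driven by 2 latent factors plus noise
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 6))

# Standardize: zero mean and unit variance per feature
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Eigendecompose the covariance matrix (eigh returns ascending eigenvalues)
cov = np.cov(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]            # re-sort to descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Explained variance ratio and its running total
ratio = eigvals / eigvals.sum()
cumulative = np.cumsum(ratio)

# Smallest k whose cumulative ratio reaches the (assumed) 95% target
k = int(np.searchsorted(cumulative, 0.95) + 1)

# Project onto the top-k principal components
scores = Z @ eigvecs[:, :k]

print("explained variance ratio:", ratio.round(3))
print("components kept for 95%:", k)
print("projected shape:", scores.shape)
```

In practice the threshold is a judgment call; a scree plot of `ratio` or a downstream validation score can be read off the same arrays.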

What is the kernel trick in SVM, and why does it work?

The kernel trick lets an SVM find a nonlinear decision boundary by implicitly mapping data into a higher-dimensional space where it becomes linearly separable, without ever computing that mapping explicitly. It works because the SVM's dual formulation depends only on dot products between points, and a kernel function computes that dot product directly in the high-dimensional space. Common kernels are linear, polynomial, and RBF.
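One way to see the "dot product in the high-dimensional space" claim concretely is with a degree-2 polynomial kernel on 2-D inputs, where the explicit feature map is small enough to write out. The sketch below is illustrative only; the function names `poly2_kernel` and `poly2_features` are hypothetical helpers, not a library API.

```python
import numpy as np

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: k(x, z) = (x . z)^2."""
    return float(np.dot(x, z) ** 2)

def poly2_features(x):
    """Explicit 3-D feature map for a 2-D input.

    Chosen so that poly2_features(x) . poly2_features(z) == (x . z)^2.
    """
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# Kernel route: stays in the original 2-D space
implicit = poly2_kernel(x, z)

# Explicit route: map both points to 3-D, then take the dot product
explicit = float(poly2_features(x) @ poly2_features(z))

print("kernel value:         ", implicit)
print("explicit dot product: ", explicit)
print("match:", np.isclose(implicit, explicit))
```

For the RBF kernel the corresponding feature space is infinite-dimensional, so only the kernel route is available; that is where the trick earns its name.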

SVD: the decomposition behind PCA, compression & LoRA — Math for ML

What you'll learn

The factorization A = U Σ Vᵀ and what each piece means geometrically

Singular values as the importance ranking of directions in your data

Why the top-k truncation is the *best possible* low-rank approximation (Eckart–Young)

How SVD computes PCA more stably than eigendecomposing the covariance

Where SVD hides: compression, recommenders, the pseudo-inverse, LoRA

The last lesson ended in a frustration: eigenvectors are a beautiful idea that only speaks to square matrices, and your data is almost never square — it is n rows by d columns, a rectangle. The singular value decomposition is the answer we were promised — the generalization that works for any matrix at all, and arguably the most important factorization in all of machine learning.

Its claim is bold and exact: every matrix A, whatever its shape or contents, can be written as

A = U Σ Vᵀ

a rotation, a stretch along axes, and another rotation. What eigendecomposition could not give a rectangle appears here in full: orthogonal input directions in V, orthogonal output directions in U, and one set of stretch factors, the singular values, in Σ.

The three pieces

Any linear map is a rotation, an axis-aligned scaling, and a final rotation.

U (columns = left singular vectors) and V (columns = right singular vectors) are orthonormal — pure rotations/reflections.
Σ is diagonal, holding the singular values σ₁ ≥ σ₂ ≥ … ≥ 0. Each σ says how much the map stretches along that direction.

Because the σs are sorted, the first few directions carry most of what the matrix does. That ordering is the whole reason SVD is so useful.
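The properties listed above can be checked numerically. The following is a small sketch on an arbitrary 5×3 random matrix (the matrix and the variable names are assumptions for illustration); it uses the reduced SVD, so U is 5×3 and Vᵀ is 3×3.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 3))              # any rectangular matrix will do

U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Columns of U and columns of V (rows of Vt) are orthonormal
u_orthonormal = np.allclose(U.T @ U, np.eye(3))
v_orthonormal = np.allclose(Vt @ Vt.T, np.eye(3))

# Singular values are non-negative and sorted in descending order
sorted_desc = bool(np.all(S[:-1] >= S[1:]) and np.all(S >= 0))

# The three factors multiply back to A
rebuilt = np.allclose(U @ np.diag(S) @ Vt, A)

# A sends each right singular vector v_i to sigma_i * u_i
maps_v_to_u = np.allclose(A @ Vt.T, U * S)

print("U orthonormal:", u_orthonormal)
print("V orthonormal:", v_orthonormal)
print("sigma sorted, non-negative:", sorted_desc)
print("U @ diag(S) @ Vt == A:", rebuilt)
print("A v_i == sigma_i u_i:", maps_v_to_u)
```

The last check is the geometric statement in one line: input direction v_i comes out pointing along u_i, stretched by σ_i.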

See it: low-rank compression

A grayscale image is just a matrix of pixel values. Reconstruct it from only the top-k singular components, A ≈ Σ_{i=1}^{k} σ_i u_i v_iᵀ, and watch how few you actually need.

[Interactive demo: rank-k reconstruction of the image alongside its singular values (energy per component). At rank k = 1: energy kept 94.5%, numbers stored 41 vs 400, compression 9.8×.]

At k=3 the face is essentially back — three components carry 99% of the image.

That’s the punchline of the Eckart–Young theorem: truncating to the top k singular values gives the mathematically best rank-k approximation of the matrix, measured in either the Frobenius or the spectral norm. Nothing else with the same rank gets closer.
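A quick way to build intuition for Eckart–Young is to pit the truncated SVD against other rank-k candidates and confirm none of them wins. The sketch below does that on a random 20×20 matrix, a hypothetical stand-in for the image (the 20×20 size is an assumption, chosen because it is consistent with the demo's "41 vs 400"). The helpers `truncate` and `numbers_stored` are illustrative names, not library functions.

```python
import numpy as np

def truncate(A, k):
    """Rank-k truncation of the SVD of A: sum of the top-k sigma_i u_i v_i^T."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * S[:k]) @ Vt[:k]

def numbers_stored(m, n, k):
    """Values needed for a rank-k factorization: k columns of U, k rows of Vt, k sigmas."""
    return k * (m + n + 1)

rng = np.random.default_rng(2)
m, n, k = 20, 20, 3
A = rng.normal(size=(m, n))              # hypothetical stand-in for an image

best_err = np.linalg.norm(A - truncate(A, k))    # Frobenius norm by default

# Competitors: a random m x k factor P, with Q fitted by least squares so that
# P @ Q is the best rank-k approximation available for that particular P.
smallest_gap = np.inf
for _ in range(200):
    P = rng.normal(size=(m, k))
    Q, *_ = np.linalg.lstsq(P, A, rcond=None)
    err = np.linalg.norm(A - P @ Q)
    smallest_gap = min(smallest_gap, err - best_err)

print("truncated-SVD error:", round(float(best_err), 3))
print("closest competitor is worse by:", round(float(smallest_gap), 3))
print("rank-1 storage for 20x20:", numbers_stored(20, 20, 1), "vs", 20 * 20)
```

Random competitors are of course not a proof, only a sanity check; the theorem guarantees the gap is never negative for any rank-k matrix.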

In code

import numpy as np

rng = np.random.default_rng(0)
# A "data" matrix with hidden low-rank structure + a little noise
true = np.outer([3, 1, 4, 1, 5], [2, 7, 1, 8]).astype(float)
A = true + rng.normal(0, 0.3, true.shape)

U, S, Vt = np.linalg.svd(A, full_matrices=False)
print("singular values:", S.round(2))      # first one dominates -> near rank-1

# Best rank-1 approximation (Eckart-Young)
A1 = S[0] * np.outer(U[:, 0], Vt[0])
print("\nrank-1 reconstruction error:", np.linalg.norm(A - A1).round(3))
print("energy in top-1:", (S[0]**2 / (S**2).sum()).round(3))

# Pseudo-inverse for least squares, straight from SVD
b = rng.normal(size=A.shape[0])
x = Vt.T @ np.diag(1/S) @ U.T @ b           # = np.linalg.pinv(A) @ b
print("\nleast-squares solution norm:", np.linalg.norm(x).round(3))

singular values: [78.28  0.75  0.43  0.18]

rank-1 reconstruction error: 0.886
energy in top-1: 1.0

least-squares solution norm: 3.863

The data was built as a single outer product (rank 1) plus a little noise, and the singular values confess it instantly: 78.28 towers over the rest, and energy in top-1 rounds to 1.0 — one direction holds essentially all of the matrix. That is Eckart–Young in action: the rank-1 reconstruction is already within 0.886 (just the noise) of the original.
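The reconstruction error is not a coincidence of this dataset: for a rank-k truncation, the Frobenius error equals the square root of the sum of the squared discarded singular values, and the spectral-norm error equals the first discarded singular value. A short check, reusing the same matrix construction as the code above (the variable names `frob_err`, `from_tail`, and `spectral_err` are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
true = np.outer([3, 1, 4, 1, 5], [2, 7, 1, 8]).astype(float)
A = true + rng.normal(0, 0.3, true.shape)

U, S, Vt = np.linalg.svd(A, full_matrices=False)
A1 = S[0] * np.outer(U[:, 0], Vt[0])         # rank-1 truncation

# Frobenius error measured directly on the residual
frob_err = np.linalg.norm(A - A1)

# The same number predicted from the discarded singular values alone
from_tail = np.sqrt((S[1:] ** 2).sum())

# Spectral-norm error is the largest singular value left behind
spectral_err = np.linalg.norm(A - A1, 2)

print("Frobenius error:          ", frob_err.round(3))
print("sqrt(sum of dropped s^2): ", from_tail.round(3))
print("spectral error vs sigma_2:", spectral_err.round(3), S[1].round(3))
```

This is why the singular values alone are enough to decide where to truncate: the error of every possible cut can be read off S without reconstructing anything.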

Where SVD lives in ML

PCA, done right. PCA is the SVD of the centered data matrix. Taking svd(X) of the centered X is more numerically stable than eigendecomposing XᵀX, because forming XᵀX squares the condition number. An SVD of the centered data is also how scikit-learn’s PCA is documented to work.
Recommender systems. Factor the user×item ratings matrix; the top singular components are latent “taste” factors. This is the heart of the Netflix-Prize era of collaborative filtering.
The pseudo-inverse & least squares. np.linalg.lstsq uses SVD to solve Ax = b even when A is rank-deficient — no normal equations blowing up.
Denoising & latent semantics. Dropping tiny singular values throws away noise; LSA applied this to word–document matrices long before embeddings.
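To make the first item concrete, the sketch below runs PCA both ways on a small synthetic dataset and confirms they agree: the right singular vectors of the centered X match the covariance eigenvectors up to sign, and the squared singular values divided by n − 1 match the eigenvalues. The data and variable names are assumptions for illustration; on a well-conditioned example like this both routes agree, and the stability advantage of SVD only shows on ill-conditioned data.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical correlated data: 100 samples, 4 features
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
Xc = X - X.mean(axis=0)                  # PCA requires centered data
n = Xc.shape[0]

# Route 1: SVD of the centered data (never forms X^T X)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
var_svd = S ** 2 / (n - 1)               # variance along each component
scores_svd = U * S                       # projected data

# Route 2: eigendecomposition of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / (n - 1))
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # descending order

same_variances = np.allclose(var_svd, eigvals)

# Principal directions agree up to a sign flip per component
alignment = np.abs(np.sum(Vt * eigvecs.T, axis=1))
same_directions = np.allclose(alignment, 1.0, atol=1e-6)

# Scores from the SVD equal the projection of Xc onto the right singular vectors
same_scores = np.allclose(scores_svd, Xc @ Vt.T)

print("variances match: ", same_variances)
print("directions match:", same_directions)
print("scores match:    ", same_scores)
```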

In one breath

Every matrix, of any shape, factors as A = U Σ Vᵀ: a rotation Vᵀ, an axis-aligned stretch by the singular values σ₁ ≥ σ₂ ≥ … ≥ 0 in Σ, and a rotation U, with U and V orthonormal. Because the singular values come sorted, the first few directions carry most of what the matrix does, and Eckart–Young proves that truncating to the top k gives the best rank-k approximation (the engine of image compression and denoising, and the motivation for low-rank methods such as LoRA). The same factorization is the numerically stable route to PCA (work on the centered X directly, never form XᵀX), the basis of the pseudo-inverse and lstsq, and the latent-factor model behind recommender systems.

Practice

Quick check

Q1. Your data matrix has singular values [50, 48, 2, 0.1]. What does that tell you?

Q2. Why prefer SVD of X over eigendecomposing XᵀX for PCA?

Q3. Eckart–Young says the top-k truncation of the SVD is…

A question to carry forward

We have now met both halves of one idea: eigendecomposition for square matrices, SVD for any matrix — each finding the directions that matter and ranking them by importance. And twice the same application has flashed past without our stopping on it: “this is basically PCA,” “PCA is the SVD of the centred data,” “sklearn’s PCA runs an SVD inside.” We keep arriving at the doorstep of one specific algorithm and walking on by.

So here is the thread onward, and it closes this chapter: gather everything — covariance, eigenvectors, singular values, variance-along-a-direction, the best-rank-k truncation — and assemble it into the single most-used dimensionality-reduction recipe there is. What are the exact four steps of PCA from scratch, how do you read “variance explained” to decide how many dimensions to keep, and how does collapsing fifty features down to two finally let you see your data as points on a flat plot?
