datarekha

SVD: the decomposition behind PCA, compression & LoRA

Every matrix — any shape, any contents — factors into a rotation, a stretch, and another rotation. That single fact powers PCA, recommender systems, denoising, the pseudo-inverse, and low-rank fine-tuning.

9 min read Advanced Math for ML Lesson 9 of 30

What you'll learn

  • The factorization A = U Σ Vᵀ and what each piece means geometrically
  • Singular values as the importance ranking of directions in your data
  • Why the top-k truncation is the *best possible* low-rank approximation (Eckart–Young)
  • How SVD computes PCA more stably than eigendecomposing the covariance
  • Where SVD hides: compression, recommenders, the pseudo-inverse, LoRA

Before you start

Eigenvalues only work for square matrices. But your data is almost never square — it’s n rows by d columns. The Singular Value Decomposition is the generalization that works for any matrix, and it’s arguably the most important factorization in all of machine learning.

The claim is bold and exact: every matrix A can be written as

A = U Σ Vᵀ

a rotation, a stretch along axes, and another rotation.

The three pieces

Ayour matrix=UrotateΣstretchVᵀrotateΣ = diag(σ₁ ≥ σ₂ ≥ …)singular values,sorted, all ≥ 0
Any linear map is a rotation, an axis-aligned scaling, and a final rotation.
  • U (columns = left singular vectors) and V (columns = right singular vectors) are orthonormal — pure rotations/reflections.
  • Σ is diagonal, holding the singular values σ₁ ≥ σ₂ ≥ … ≥ 0. Each σ says how much the map stretches along that direction.

Because the σs are sorted, the first few directions carry the most of what the matrix does. That ordering is the whole reason SVD is so useful.

See it: low-rank compression

A grayscale image is just a matrix of pixel values. Reconstruct it from only the top-k singular components — A ≈ Σ_{i<k} σ_i u_i v_iᵀ — and watch how few you actually need.

That’s the punchline of the Eckart–Young theorem: truncating to the top k singular values gives the mathematically best rank-k approximation of the matrix. Nothing else with the same rank gets closer.

In code

Where SVD lives in ML

  • PCA, done right. PCA is the SVD of the centered data matrix. Doing svd(X) is more numerically stable than eigendecomposing XᵀX — which is exactly what scikit-learn’s PCA does internally.
  • Recommender systems. Factor the user×item ratings matrix; the top singular components are latent “taste” factors. This is the heart of the Netflix-Prize era of collaborative filtering.
  • The pseudo-inverse & least squares. np.linalg.lstsq uses SVD to solve Ax = b even when A is rank-deficient — no normal equations blowing up.
  • Denoising & latent semantics. Dropping tiny singular values throws away noise; LSA applied this to word–document matrices long before embeddings.

Quick check

Quick check

0/3
Q1Your data matrix has singular values [50, 48, 2, 0.1]. What does that tell you?
Q2Why prefer SVD of X over eigendecomposing XᵀX for PCA?
Q3Eckart–Young says the top-k truncation of the SVD is…

Sign in to track your progress

Completed lessons, your XP, level, and streak save to your account — it's free and takes a few seconds.

Practice this in an interview

All questions
What is PCA, when should you use it, and what are its key limitations?

PCA finds the orthogonal directions of maximum variance in the data and projects onto a lower-dimensional subspace, reducing features while retaining most information. It is most useful before distance-based models or when training is bottlenecked by dimensionality. Its main limits are loss of interpretability, sensitivity to scale, and an assumption of linear structure.

What are the assumptions and limitations of PCA, and when would it hurt your model?

PCA assumes linear relationships, that variance equals importance, and that components should be orthogonal. It can hurt when the predictive signal lives in low-variance directions, when relationships are nonlinear, or when interpretability matters, since components mix original features. It's also sensitive to scaling and outliers and is unsupervised, so it ignores the target.

How does PCA work, and how do you choose the number of components?

PCA finds orthogonal directions (principal components) of maximum variance by computing the eigenvectors of the covariance matrix, then projects data onto the top components. Choose the number of components by the cumulative explained variance ratio (e.g. enough to retain 95%), a scree-plot elbow, or downstream task performance. Always standardize features first, since PCA is variance-driven.

What is the kernel trick in SVM, and why does it work?

The kernel trick lets an SVM find a nonlinear decision boundary by implicitly mapping data into a higher-dimensional space where it becomes linearly separable, without ever computing that mapping explicitly. It works because the SVM's dual formulation depends only on dot products between points, and a kernel function computes that dot product directly in the high-dimensional space. Common kernels are linear, polynomial, and RBF.

Related lessons

Explore further

Skip to content