SVD: the decomposition behind PCA, compression & LoRA
Every matrix — any shape, any contents — factors into a rotation, a stretch, and another rotation. That single fact powers PCA, recommender systems, denoising, the pseudo-inverse, and low-rank fine-tuning.
What you'll learn
- The factorization A = U Σ Vᵀ and what each piece means geometrically
- Singular values as the importance ranking of directions in your data
- Why the top-k truncation is the *best possible* low-rank approximation (Eckart–Young)
- How SVD computes PCA more stably than eigendecomposing the covariance
- Where SVD hides: compression, recommenders, the pseudo-inverse, LoRA
Before you start
Eigenvalues only work for square matrices. But your data is almost never
square — it’s n rows by d columns. The Singular Value Decomposition
is the generalization that works for any matrix, and it’s arguably the
most important factorization in all of machine learning.
The claim is bold and exact: every matrix A can be written as
A = U Σ Vᵀ
a rotation, a stretch along axes, and another rotation.
The three pieces
U(columns = left singular vectors) andV(columns = right singular vectors) are orthonormal — pure rotations/reflections.Σis diagonal, holding the singular valuesσ₁ ≥ σ₂ ≥ … ≥ 0. Eachσsays how much the map stretches along that direction.
Because the σs are sorted, the first few directions carry the most of
what the matrix does. That ordering is the whole reason SVD is so useful.
See it: low-rank compression
A grayscale image is just a matrix of pixel values. Reconstruct it from
only the top-k singular components — A ≈ Σ_{i<k} σ_i u_i v_iᵀ — and
watch how few you actually need.
That’s the punchline of the Eckart–Young theorem: truncating to the top
k singular values gives the mathematically best rank-k approximation
of the matrix. Nothing else with the same rank gets closer.
In code
Where SVD lives in ML
- PCA, done right. PCA is the SVD of the centered data matrix. Doing
svd(X)is more numerically stable than eigendecomposingXᵀX— which is exactly what scikit-learn’sPCAdoes internally. - Recommender systems. Factor the user×item ratings matrix; the top singular components are latent “taste” factors. This is the heart of the Netflix-Prize era of collaborative filtering.
- The pseudo-inverse & least squares.
np.linalg.lstsquses SVD to solveAx = beven whenAis rank-deficient — no normal equations blowing up. - Denoising & latent semantics. Dropping tiny singular values throws away noise; LSA applied this to word–document matrices long before embeddings.
Quick check
Quick check
Practice this in an interview
All questionsPCA finds the orthogonal directions of maximum variance in the data and projects onto a lower-dimensional subspace, reducing features while retaining most information. It is most useful before distance-based models or when training is bottlenecked by dimensionality. Its main limits are loss of interpretability, sensitivity to scale, and an assumption of linear structure.
PCA assumes linear relationships, that variance equals importance, and that components should be orthogonal. It can hurt when the predictive signal lives in low-variance directions, when relationships are nonlinear, or when interpretability matters, since components mix original features. It's also sensitive to scaling and outliers and is unsupervised, so it ignores the target.
PCA finds orthogonal directions (principal components) of maximum variance by computing the eigenvectors of the covariance matrix, then projects data onto the top components. Choose the number of components by the cumulative explained variance ratio (e.g. enough to retain 95%), a scree-plot elbow, or downstream task performance. Always standardize features first, since PCA is variance-driven.
The kernel trick lets an SVM find a nonlinear decision boundary by implicitly mapping data into a higher-dimensional space where it becomes linearly separable, without ever computing that mapping explicitly. It works because the SVM's dual formulation depends only on dot products between points, and a kernel function computes that dot product directly in the high-dimensional space. Common kernels are linear, polynomial, and RBF.