Orthogonality, projections & least squares
You can't fit noisy data exactly — so you get as close as possible. "As close as possible" is a projection, and the magic is that the leftover error always comes out perpendicular. That single fact is what linear regression is.
What you'll learn
- Orthogonality (dot product = 0) and why an orthonormal basis is a free coordinate system
- Projecting a vector onto a line and onto a subspace
- Why "best approximation" forces the residual to be orthogonal to the model space
- How that derives the normal equations — i.e. how regression is a projection
- QR / Gram–Schmidt as the numerically stable way to do it
Before you start
Real data has more constraints than knobs: a hundred points, two parameters. You can’t pass a line through all of them. So you stop asking for exact and start asking for closest — and “closest” has a beautiful, exact answer.
Orthogonality: the cleanest relationship
Two vectors are orthogonal when their dot product is zero — they share nothing, they’re perpendicular. A basis of mutually orthogonal unit vectors is orthonormal, and it’s the nicest coordinate system there is: to find a point’s coordinate along an axis, you just take a dot product. No matrix inverse, no solving — orthogonality makes the bookkeeping vanish.
Projection: the closest reachable point
Your model can only produce points in a certain space (a line, a plane, the
column space of X). The target b usually sits outside it. The
projection p is the point inside the model space closest to b:
project b onto the line through a: p = (aᵀb / aᵀa) · a
Here’s the whole secret — drag it and watch:
The error b − p is always perpendicular to the model space. That’s not
a coincidence — it’s why p is closest. If the error had any component
along the model space, you could slide p and get closer. So at the
minimum, the residual is orthogonal to everything the model can represent.
That orthogonality is the normal equations
For regression, the model space is the column space of X, and we want x
so that Xx is the projection of y. “Residual orthogonal to every
column” means:
Xᵀ (y − X x) = 0 ⟹ (XᵀX) x = Xᵀ y
The normal equations — the same linear system from the RREF lesson —
fall straight out of the orthogonality condition. Linear regression is
literally the projection of y onto the span of your features.
The Xᵀ·residual ≈ 0 line is the orthogonality condition holding numerically.
Gram–Schmidt & QR: doing it without blowing up
You could form XᵀX and solve — but that squares the condition number and
loses precision. Instead, Gram–Schmidt orthonormalizes the columns of
X into Q (orthonormal) times R (upper-triangular): X = QR. Then the
least-squares solution is a clean back-substitution, R x = Qᵀ y. It’s what
np.linalg.lstsq and every serious solver actually do.
Quick check
Quick check
Practice this in an interview
All questionsOLS minimizes the sum of squared residuals. Setting the gradient of the loss to zero yields the normal equations, whose unique solution is the projection of y onto the column space of X. The closed-form is the hat matrix formula β = (XᵀX)⁻¹Xᵀy.
OLS linear regression rests on five assumptions: linearity, independence of errors, homoscedasticity, normality of residuals, and no perfect multicollinearity. Violating any one of them degrades coefficient estimates, standard errors, or the validity of hypothesis tests.
PCA finds the orthogonal directions of maximum variance in the data and projects onto a lower-dimensional subspace, reducing features while retaining most information. It is most useful before distance-based models or when training is bottlenecked by dimensionality. Its main limits are loss of interpretability, sensitivity to scale, and an assumption of linear structure.
L1 adds the sum of absolute coefficient values to the loss, which drives some coefficients to exactly zero and performs implicit feature selection. L2 adds the sum of squared coefficients, which shrinks all weights proportionally but rarely zeroes any out. Lasso is preferred when you suspect only a few features matter; Ridge is preferred when most features contribute small effects.