datarekha

Orthogonality, projections & least squares

You can't fit noisy data exactly — so you get as close as possible. "As close as possible" is a projection, and the magic is that the leftover error always comes out perpendicular. That single fact is what linear regression is.

8 min read · Intermediate · Math for ML · Lesson 7 of 30

What you'll learn

  • Orthogonality (dot product = 0) and why an orthonormal basis is a free coordinate system
  • Projecting a vector onto a line and onto a subspace
  • Why "best approximation" forces the residual to be orthogonal to the model space
  • How that derives the normal equations — i.e. how regression is a projection
  • QR / Gram–Schmidt as the numerically stable way to do it

Before you start

Real data has more constraints than knobs: a hundred points, two parameters. You can’t pass a line through all of them. So you stop asking for exact and start asking for closest — and “closest” has a beautiful, exact answer.

Orthogonality: the cleanest relationship

Two vectors are orthogonal when their dot product is zero — they share nothing, they’re perpendicular. A basis of mutually orthogonal unit vectors is orthonormal, and it’s the nicest coordinate system there is: to find a point’s coordinate along an axis, you just take a dot product. No matrix inverse, no solving — orthogonality makes the bookkeeping vanish.
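As a small illustration of "coordinates are just dot products", here is a sketch using NumPy. The basis (the standard axes rotated by 30 degrees) and the vector v are hypothetical values chosen for the example, not part of the lesson.

```python
import numpy as np

# Hypothetical orthonormal basis: the standard axes rotated by 30 degrees.
theta = np.pi / 6
q1 = np.array([np.cos(theta), np.sin(theta)])
q2 = np.array([-np.sin(theta), np.cos(theta)])

# Orthonormal means unit length and zero dot product between different vectors.
assert np.isclose(q1 @ q1, 1.0) and np.isclose(q2 @ q2, 1.0)
assert np.isclose(q1 @ q2, 0.0)

v = np.array([3.0, 1.0])

# The coordinate of v along each axis is a plain dot product.
c1, c2 = q1 @ v, q2 @ v

# Rebuilding v from those coordinates recovers the original vector.
v_rebuilt = c1 * q1 + c2 * q2

# For comparison: with a general basis you would need a linear solve.
Q = np.column_stack([q1, q2])
coords_by_solve = np.linalg.solve(Q, v)
```

Both routes give the same coordinates here; the difference is that the dot-product route only works because the basis is orthonormal.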

Projection: the closest reachable point

Your model can only produce points in a certain space (a line, a plane, the column space of X). The target b usually sits outside it. The projection p is the point inside the model space closest to b:

project b onto the line through a:   p = (aᵀb / aᵀa) · a
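The formula above translates almost directly into code. This is a minimal sketch; the function name project_onto_line and the sample vectors are assumptions made for the example, and a must be nonzero.

```python
import numpy as np

def project_onto_line(b, a):
    """Project b onto the line through the origin spanned by a (a must be nonzero)."""
    return (a @ b) / (a @ a) * a

a = np.array([2.0, 1.0])   # direction of the line
b = np.array([1.0, 3.0])   # target vector, not on the line

p = project_onto_line(b, a)  # closest point on the line: [2. 1.]
e = b - p                    # leftover error: [-1. 2.]

print(p, e, a @ e)           # a @ e is 0.0: the error is perpendicular to a
```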

Here’s the whole secret:

The error b − p is always perpendicular to the model space. That’s not a coincidence — it’s why p is closest. If the error had any component along the model space, you could slide p and get closer. So at the minimum, the residual is orthogonal to everything the model can represent.
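The "slide p and get closer" argument can be written as one expansion. In this sketch, v stands for any direction inside the model space and t for the size of the slide; both symbols are introduced here for illustration.

```latex
\|b - (p + t\,v)\|^2 \;=\; \|e\|^2 \;-\; 2t\, v^\top e \;+\; t^2 \|v\|^2,
\qquad e = b - p
```

If vᵀe were nonzero, a small t with the same sign as vᵀe would make the right-hand side smaller than ‖e‖², so p would not be the closest point. The minimum therefore requires vᵀe = 0 for every v in the model space.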

That orthogonality is the normal equations

For regression, the model space is the column space of X, and we want x so that Xx is the projection of y. “Residual orthogonal to every column” means:

Xᵀ (y − X x) = 0     ⟹     (XᵀX) x = Xᵀ y

The normal equations — the same linear system from the RREF lesson — fall straight out of the orthogonality condition. Linear regression is literally the projection of y onto the span of your features.

Computed numerically, Xᵀ times the residual comes out as zero up to floating-point rounding: that is the orthogonality condition holding in practice.
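A minimal sketch of that numerical check, assuming NumPy. The data is synthetic (100 noisy points near a hypothetical line y = 2 + 3t), and the normal equations are solved directly only to mirror the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 100 points near the line y = 2 + 3t, plus noise.
t = np.linspace(0.0, 1.0, 100)
y = 2.0 + 3.0 * t + 0.1 * rng.standard_normal(t.size)

# Design matrix: a column of ones (intercept) and a column of t (slope).
X = np.column_stack([np.ones_like(t), t])

# Solve the normal equations (X^T X) x = X^T y.
x = np.linalg.solve(X.T @ X, X.T @ y)

residual = y - X @ x
print("coefficients:", x)
print("X^T residual:", X.T @ residual)  # both entries are zero up to rounding
```

A hundred points and two parameters: the residual itself is not zero, but its dot product with each column of X is.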

Gram–Schmidt & QR: doing it without blowing up

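You could form XᵀX and solve, but that squares the condition number and loses precision. Instead, factor X = QR, where Q has orthonormal columns and R is upper-triangular. Gram–Schmidt is the textbook way to build this factorization; libraries usually compute it with Householder reflections, which are more stable than classical Gram–Schmidt. The least-squares solution is then a clean back-substitution, R x = Qᵀ y. Serious solvers rely on orthogonal factorizations like this rather than on XᵀX; np.linalg.lstsq, for instance, uses an SVD-based LAPACK routine.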
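Here is a sketch of the QR route on synthetic data, assuming NumPy. The helper back_substitute is written out for illustration and is not a library function; np.linalg.qr computes the factorization through LAPACK rather than by Gram–Schmidt.

```python
import numpy as np

def back_substitute(R, c):
    """Solve R x = c for upper-triangular R, working from the last row up."""
    n = R.shape[0]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (c[i] - R[i, i + 1:] @ x[i + 1:]) / R[i, i]
    return x

rng = np.random.default_rng(0)

# Hypothetical data: 100 noisy points near y = 2 + 3t.
t = np.linspace(0.0, 1.0, 100)
y = 2.0 + 3.0 * t + 0.1 * rng.standard_normal(t.size)
X = np.column_stack([np.ones_like(t), t])

# Reduced QR: Q is 100x2 with orthonormal columns, R is 2x2 upper-triangular.
Q, R = np.linalg.qr(X)
x_qr = back_substitute(R, Q.T @ y)

# Reference answer from NumPy's least-squares solver.
x_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

# Forming X^T X squares the (2-norm) condition number.
print(np.linalg.cond(X), np.linalg.cond(X.T @ X))
```

On a well-conditioned problem like this one, both approaches agree; the gap only shows up when the columns of X are nearly dependent.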

Quick check

Q1. Why is the least-squares residual b − p perpendicular to the model space?
Q2. The normal equations (XᵀX)β = Xᵀy come from which condition?
Q3. Why do solvers use QR instead of forming XᵀX directly?


Practice this in an interview

How does Ordinary Least Squares derive the coefficient vector, and what is the closed-form solution?

OLS minimizes the sum of squared residuals. Setting the gradient of the loss to zero yields the normal equations. When X has full column rank, they have the unique solution β = (XᵀX)⁻¹Xᵀy, and the fitted values Xβ are the projection of y onto the column space of X. The matrix X(XᵀX)⁻¹Xᵀ that performs this projection is the hat matrix.

What are the core assumptions of linear regression, and what breaks when each is violated?

OLS linear regression rests on five assumptions: linearity, independence of errors, homoscedasticity, normality of residuals, and no perfect multicollinearity. Violating any one of them degrades coefficient estimates, standard errors, or the validity of hypothesis tests.

What is PCA, when should you use it, and what are its key limitations?

PCA finds the orthogonal directions of maximum variance in the data and projects onto a lower-dimensional subspace, reducing features while retaining most information. It is most useful before distance-based models or when training is bottlenecked by dimensionality. Its main limits are loss of interpretability, sensitivity to scale, and an assumption of linear structure.

What is the fundamental difference between L1 (Lasso) and L2 (Ridge) regularization, and when do you choose each?

L1 adds the sum of absolute coefficient values to the loss, which drives some coefficients to exactly zero and performs implicit feature selection. L2 adds the sum of squared coefficients, which shrinks all weights toward zero but rarely zeroes any out. Lasso is preferred when you suspect only a few features matter; Ridge is preferred when most features contribute small effects.
