How does Ordinary Least Squares derive the coefficient vector, and what is the closed-form solution?

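OLS minimizes the sum of squared residuals ‖y − Xβ‖². Setting the gradient of the loss to zero yields the normal equations (XᵀX)β = Xᵀy. When X has full column rank, XᵀX is invertible and the unique solution is the closed form β = (XᵀX)⁻¹Xᵀy. The fitted values Xβ are the projection of y onto the column space of X, and the matrix that performs that projection, H = X(XᵀX)⁻¹Xᵀ, is the hat matrix.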
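The gradient step can be written out explicitly. This is a short derivation sketch, assuming X has full column rank so that XᵀX is invertible:

```latex
\begin{aligned}
L(\beta) &= \lVert y - X\beta \rVert^2 = (y - X\beta)^\top (y - X\beta) \\
\nabla_\beta L &= -2\,X^\top (y - X\beta) = 0 \\
X^\top X\,\beta &= X^\top y \\
\hat\beta &= (X^\top X)^{-1} X^\top y
\end{aligned}
```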

What are the core assumptions of linear regression, and what breaks when each is violated?

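OLS linear regression rests on five assumptions: linearity, independence of errors, homoscedasticity, normality of residuals, and no perfect multicollinearity. Violating linearity biases the coefficient estimates; correlated or heteroscedastic errors leave the estimates unbiased but make the usual standard errors wrong; non-normal residuals undermine exact small-sample t- and F-tests; and perfect multicollinearity makes XᵀX singular, so the coefficients are not uniquely determined.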
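Of the five, perfect multicollinearity is the easiest to see numerically. The sketch below uses made-up data (the names x1, x2 and the factor 2.0 are illustrative assumptions) in which one feature is an exact multiple of another, and checks that the design matrix loses a rank:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = 2.0 * x1                                # exact linear copy of x1
X = np.column_stack([np.ones(50), x1, x2])   # intercept + two collinear features

# rank(X^T X) equals rank(X), so a rank-deficient X means X^T X has no inverse
rank = np.linalg.matrix_rank(X)
print("columns:", X.shape[1], "rank:", rank)
```

With fewer independent columns than coefficients, infinitely many coefficient vectors give the same fit, which is why the closed form breaks down.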

What is PCA, when should you use it, and what are its key limitations?

PCA finds the orthogonal directions of maximum variance in the data and projects onto a lower-dimensional subspace, reducing features while retaining most information. It is most useful before distance-based models or when training is bottlenecked by dimensionality. Its main limits are loss of interpretability, sensitivity to scale, and an assumption of linear structure.
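One common way to compute PCA is through the SVD of the mean-centered data. The sketch below assumes synthetic data invented for illustration (200 samples, 3 features, one dominant direction); it is not tied to any particular library's PCA class:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: one latent factor spread over 3 features, plus small noise
latent = rng.normal(size=(200, 1))
data = latent @ np.array([[3.0, 2.0, 1.0]]) + 0.1 * rng.normal(size=(200, 3))

centered = data - data.mean(axis=0)            # PCA works on mean-centered columns
U, s, Vt = np.linalg.svd(centered, full_matrices=False)

explained = s**2 / np.sum(s**2)                # fraction of variance per component
k = 1                                          # number of components to keep
scores = centered @ Vt[:k].T                   # project onto the top-k directions

print("explained variance ratio:", explained.round(3))
print("reduced shape:", scores.shape)
```

Because the directions are chosen by variance, features measured in different units would usually be standardized first; that is the scale sensitivity mentioned above.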

What is the fundamental difference between L1 (Lasso) and L2 (Ridge) regularization, and when do you choose each?

L1 adds the sum of absolute coefficient values to the loss, which drives some coefficients to exactly zero and performs implicit feature selection. L2 adds the sum of squared coefficients, which shrinks all weights toward zero (proportionally, when the features are orthonormal) but rarely zeroes any out. Lasso is preferred when you suspect only a few features matter; Ridge is preferred when most features contribute small effects.
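The contrast is clearest in the special case where the features are orthonormal (XᵀX = I), because both penalized problems then have simple closed forms. The sketch below assumes that special case; the function names, the starting coefficients, and the penalty value are illustrative choices:

```python
import numpy as np

def ridge_orthonormal(beta_ols, lam):
    # Minimizer of 0.5*||y - X b||^2 + 0.5*lam*||b||^2 when X^T X = I
    return beta_ols / (1.0 + lam)

def lasso_orthonormal(beta_ols, lam):
    # Minimizer of 0.5*||y - X b||^2 + lam*||b||_1 when X^T X = I (soft-thresholding)
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

beta_ols = np.array([3.0, -1.5, 0.4, -0.2])   # hypothetical unpenalized coefficients
lam = 0.5

ridge = ridge_orthonormal(beta_ols, lam)
lasso = lasso_orthonormal(beta_ols, lam)
print("ridge:", ridge)
print("lasso:", lasso)
```

Ridge scales every coefficient by the same factor and leaves all of them nonzero, while the soft-threshold sets any coefficient smaller in magnitude than the penalty to zero.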

Orthogonality, projections & least squares — Math for ML

What you'll learn

Orthogonality (dot product = 0) and why an orthonormal basis is a free coordinate system

Projecting a vector onto a line and onto a subspace

Why "best approximation" forces the residual to be orthogonal to the model space

How that derives the normal equations — i.e. how regression is a projection

QR / Gram–Schmidt as the numerically stable way to do it

The last lesson left us stranded in the most ordinary situation in all of data science: Ax = b with no exact solution — a hundred noisy points, two knobs, and no line that threads them all. You cannot solve it. So you stop demanding exact and start asking for closest — and, as promised, “closest” has a clean, exact, almost magical answer, built entirely out of a single idea: the right angle.

Orthogonality: the cleanest relationship

Two vectors are orthogonal when their dot product is zero — they share nothing, they’re perpendicular. A basis of mutually orthogonal unit vectors is orthonormal, and it’s the nicest coordinate system there is: to find a point’s coordinate along an axis, you just take a dot product. No matrix inverse, no solving — orthogonality makes the bookkeeping vanish.
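To see the "just take a dot product" claim in action, here is a small sketch with an orthonormal basis of the plane. The 30-degree rotation and the vector v are arbitrary values picked for illustration:

```python
import numpy as np

# An orthonormal basis of R^2: the standard axes rotated by 30 degrees
theta = np.pi / 6
q1 = np.array([np.cos(theta), np.sin(theta)])
q2 = np.array([-np.sin(theta), np.cos(theta)])

v = np.array([2.0, 1.0])
c1, c2 = q1 @ v, q2 @ v          # coordinates along each axis: plain dot products
rebuilt = c1 * q1 + c2 * q2      # reassemble v from those coordinates

print("q1 . q2 =", q1 @ q2)
print("rebuilt equals v:", np.allclose(rebuilt, v))
```

No linear system was solved to get c1 and c2; with a non-orthogonal basis, recovering the coordinates would require one.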

Projection: the closest reachable point

Your model can only produce points in a certain space (a line, a plane, the column space of X). The target b usually sits outside it. The projection p is the point inside the model space closest to b:

project b onto the line through a:   p = (aᵀb / aᵀa) · a

Interactive figure: projecting b onto the line through a. p is the closest point on the line to b, and the error b − p is perpendicular to the line; that is the least-squares condition. However the target b or the model's direction a is moved, (b − p)·a stays at 0: the residual remains orthogonal to the model space.

The error b − p is always perpendicular to the model space. That’s not a coincidence — it’s why p is closest. If the error had any component along the model space, you could slide p and get closer. So at the minimum, the residual is orthogonal to everything the model can represent.
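The line-projection formula and the "slide p and get closer" argument can both be checked numerically. This is a minimal sketch with arbitrary example vectors for a and b:

```python
import numpy as np

a = np.array([3.0, 1.0])     # direction of the line (hypothetical values)
b = np.array([2.0, 4.0])     # target that is not on the line

p = (a @ b) / (a @ a) * a    # p = (a^T b / a^T a) a
error = b - p

# Slide along the line away from p: the distance to b never gets smaller
others = [np.linalg.norm(b - (p + t * a)) for t in np.linspace(-1.0, 1.0, 9)]

print("p =", p)
print("(b - p) . a =", error @ a)
print("closest distance:", np.linalg.norm(error), "<= min over other points:", min(others))
```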

That orthogonality is the normal equations

For regression, the model space is the column space of X, and we want x so that Xx is the projection of y. “Residual orthogonal to every column” means:

Xᵀ (y − X x) = 0     ⟹     (XᵀX) x = Xᵀ y

The normal equations — the same linear system from the RREF lesson — fall straight out of the orthogonality condition. Linear regression is literally the projection of y onto the span of your features.

import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(20), np.linspace(0, 5, 20)])   # intercept + slope
y = 2 + 1.3 * X[:,1] + rng.normal(0, 0.4, 20)               # noisy line

# Least squares = project y onto the column space of X
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
print("fitted [intercept, slope]:", beta.round(3))

# The residual is orthogonal to every column of X (that's the whole point)
print("X^T . residual (~ 0):", (X.T @ resid).round(6))

fitted [intercept, slope]: [2.146 1.247]
X^T . residual (~ 0): [0. 0.]

We built the data from the line y = 2 + 1.3x plus noise, and least squares recovered [2.146, 1.247] — close, but bent slightly by the noise. The second line is the orthogonality condition holding numerically: the residual has zero dot product with every column of X.
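As a cross-check, the normal equations can also be solved directly as a linear system. The sketch below rebuilds the same X and y so it runs on its own; solving (XᵀX)β = Xᵀy this way is reasonable here only because this small problem is well-conditioned:

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(20), np.linspace(0, 5, 20)])
y = 2 + 1.3 * X[:, 1] + rng.normal(0, 0.4, 20)

beta_normal = np.linalg.solve(X.T @ X, X.T @ y)      # solve (X^T X) beta = X^T y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # library least squares
y_hat = X @ beta_normal                              # projection of y onto col(X)

print("same coefficients:", np.allclose(beta_normal, beta_lstsq))
print("X^T (y - y_hat) ~ 0:", np.allclose(X.T @ (y - y_hat), 0.0, atol=1e-8))
```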

Gram–Schmidt & QR: doing it without blowing up

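You could form XᵀX and solve, but that squares the condition number and loses precision. Instead, factor X = QR, where Q has orthonormal columns and R is upper-triangular. Gram–Schmidt is the textbook way to build this factorization, though library routines typically use Householder reflections, because classical Gram–Schmidt itself loses orthogonality in floating point. With X = QR, the least-squares solution is a clean back-substitution, R x = Qᵀ y. Serious solvers all avoid forming XᵀX; np.linalg.lstsq, for instance, gets there with an SVD-based LAPACK routine rather than QR.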
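Here is a sketch of the QR route using np.linalg.qr on the same synthetic data as the earlier example, rebuilt so the block runs on its own. For brevity it calls a general solver on the triangular system where a dedicated back-substitution routine would normally go:

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(20), np.linspace(0, 5, 20)])
y = 2 + 1.3 * X[:, 1] + rng.normal(0, 0.4, 20)

Q, R = np.linalg.qr(X)                   # reduced QR: Q is 20x2, R is 2x2 upper-triangular
beta_qr = np.linalg.solve(R, Q.T @ y)    # solves R beta = Q^T y (stand-in for back-substitution)

print("beta via QR:", beta_qr.round(3))
print("cond(X):     ", np.linalg.cond(X))
print("cond(X^T X): ", np.linalg.cond(X.T @ X))   # roughly cond(X) squared
```

The two condition numbers make the "squares the condition number" claim concrete: whatever ill-conditioning X has, XᵀX has it twice over.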

In one breath

You cannot fit noisy data exactly, so you fit it closest — and “closest” is a projection: the point p in the model’s reachable space nearest the target b. The defining fact is that the leftover error b − p always comes out perpendicular to that space (if it had any component inside it, you could slide p and get closer). Write “residual orthogonal to every column of X” as Xᵀ(y − Xβ) = 0 and out drop the normal equations (XᵀX)β = Xᵀy — so linear regression is the projection of y onto the span of your features. Serious solvers reach for QR / Gram–Schmidt rather than forming XᵀX, because orthonormalising X directly is far more numerically stable.

Practice

Quick check

Q1. Why is the least-squares residual b − p perpendicular to the model space?

Q2. The normal equations (XᵀX)β = Xᵀy come from which condition?

Q3. Why do solvers use QR instead of forming XᵀX directly?

A question to carry forward

Orthogonality bought us the best fit: project the target onto what the model can reach, and the perpendicular residual guarantees nothing closer exists. Between this lesson and the last several, we have bent vectors, stretched them, projected them — done very nearly everything one can do to a vector with a matrix.

But there is one question we have never asked from the matrix’s own point of view. A transform sends most vectors off to a new heading. Are there special directions it refuses to turn — vectors that come out of A pointing exactly where they went in, only longer or shorter? Here is the thread onward: do such unrotated directions always exist, what would it mean to find the handful of directions a matrix merely stretches, and why do those directions — the eigenvectors — turn out to be the hidden skeleton of a covariance matrix, of PageRank, and of the stability of every system that feeds its own output back in as the next input?
