How does Ordinary Least Squares derive the coefficient vector, and what is the closed-form solution?
The short answer
OLS minimizes the sum of squared residuals. Setting the gradient of the loss to zero yields the normal equations. When X has full column rank, they have a unique solution, β = (XᵀX)⁻¹Xᵀy, and the fitted values Xβ are the orthogonal projection of y onto the column space of X.
How to think about it
OLS solves: minimize over β the loss L = ||y - Xβ||².
Derivation in four steps:
- Expand: L = (y - Xβ)ᵀ(y - Xβ) = yᵀy - 2βᵀXᵀy + βᵀXᵀXβ
- Differentiate with respect to β and set to zero: ∂L/∂β = -2Xᵀy + 2XᵀXβ = 0
- Rearrange to the normal equations: XᵀXβ = Xᵀy
- Solve (when XᵀX is invertible): β = (XᵀX)⁻¹Xᵀy
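The four steps above can be checked numerically. The following is a minimal sketch on synthetic data; the data, the seed, and the names `true_beta`, `beta_hat`, and `grad_norm` are assumptions made for illustration. It solves the normal equations as a linear system and confirms that the gradient vanishes at the solution.

```python
import numpy as np

# Synthetic data, purely illustrative.
rng = np.random.default_rng(0)
n, p = 200, 4
X = rng.normal(size=(n, p))
true_beta = np.array([1.5, -2.0, 0.5, 3.0])
y = X @ true_beta + 0.1 * rng.normal(size=n)

# Normal equations: XᵀX β = Xᵀy.
XtX = X.T @ X
Xty = X.T @ y

# Solve the linear system rather than forming (XᵀX)⁻¹ explicitly.
beta_hat = np.linalg.solve(XtX, Xty)

# The gradient -2Xᵀy + 2XᵀXβ should be (numerically) zero at beta_hat.
gradient = -2.0 * Xty + 2.0 * (XtX @ beta_hat)
grad_norm = np.linalg.norm(gradient)

# Cross-check against NumPy's least-squares routine.
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

print("beta_hat:", beta_hat)
print("gradient norm:", grad_norm)
```

Using `np.linalg.solve` instead of `np.linalg.inv` is a design choice in this sketch: it gives the same β in exact arithmetic while avoiding an explicit inverse.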
The matrix H = X(XᵀX)⁻¹Xᵀ is the hat matrix — it projects y onto the column space of X. Fitted values are ŷ = Hy.
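The projection properties of H can be verified directly. This is a small sketch on random data (the data and variable names are assumptions for illustration); it checks that H is symmetric and idempotent, that the residual is orthogonal to the columns of X, and that the trace of H equals the number of columns when X has full column rank.

```python
import numpy as np

# Random data, purely illustrative.
rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# H = X (XᵀX)⁻¹ Xᵀ, built with solve() rather than an explicit inverse.
H = X @ np.linalg.solve(X.T @ X, X.T)

y_hat = H @ y          # fitted values
residual = y - y_hat   # component of y outside the column space of X

is_symmetric = np.allclose(H, H.T)
is_idempotent = np.allclose(H @ H, H)                    # projecting twice changes nothing
residual_orthogonal = np.allclose(X.T @ residual, 0.0)   # Xᵀ(y - ŷ) = 0
trace_H = np.trace(H)                                    # equals p for full column rank

print(is_symmetric, is_idempotent, residual_orthogonal, trace_H)
```

Forming the full n × n matrix H is only sensible for small n; for fitting alone, computing Xβ is enough.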
When to use the normal equation vs gradient descent:
| | Normal Equation | Gradient Descent |
|---|---|---|
| Complexity | O(p³ + np²) | O(np) per step |
| n, p regime | small p (≤ ~10k) | large p or sparse |
| Requires tuning | No | Yes (learning rate) |
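To make the comparison concrete, here is a sketch of batch gradient descent on the same loss, compared against the direct solution. The helper `ols_gradient_descent`, its default step size, and the synthetic data are all assumptions for illustration, not a tuned implementation.

```python
import numpy as np

def ols_gradient_descent(X, y, learning_rate=None, n_steps=500):
    """Batch gradient descent on L = ||y - Xβ||² (hypothetical helper)."""
    n, p = X.shape
    if learning_rate is None:
        # Assumed default: 1 / (largest eigenvalue of the Hessian 2XᵀX).
        learning_rate = 1.0 / (2.0 * np.linalg.norm(X, ord=2) ** 2)
    beta = np.zeros(p)
    for _ in range(n_steps):
        # Gradient of the loss; costs O(np) per step for dense X.
        gradient = -2.0 * (X.T @ (y - X @ beta))
        beta -= learning_rate * gradient
    return beta

# Synthetic, well-conditioned data, purely illustrative.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
y = X @ np.array([2.0, -1.0, 0.0, 0.5, 4.0]) + 0.05 * rng.normal(size=300)

beta_gd = ols_gradient_descent(X, y)
beta_exact = np.linalg.solve(X.T @ X, X.T @ y)
max_gap = np.max(np.abs(beta_gd - beta_exact))

print("largest coefficient gap:", max_gap)
```

On well-conditioned data like this the two answers agree closely. On ill-conditioned data gradient descent may need many more steps or a different step size, which is the tuning cost listed in the table.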
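```python
import numpy as np

# Placeholder data (assumed for illustration): X is (n, p), y is (n,)
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)

# Least-squares fit; solved via SVD rather than by forming the normal equations
beta = np.linalg.lstsq(X, y, rcond=None)[0]
```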
np.linalg.lstsq uses SVD internally rather than explicitly inverting XᵀX, which is numerically safer.
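One way to see why avoiding XᵀX helps: in the 2-norm, the condition number of XᵀX is the square of the condition number of X, so forming the normal equations amplifies any near-collinearity. The sketch below illustrates this with two nearly collinear columns; the data and the perturbation size `1e-5` are assumptions chosen for illustration.

```python
import numpy as np

# Two nearly collinear columns, purely illustrative.
rng = np.random.default_rng(3)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 1e-5 * rng.normal(size=n)
X = np.column_stack([x1, x2])

cond_X = np.linalg.cond(X)            # 2-norm condition number of X
cond_XtX = np.linalg.cond(X.T @ X)    # roughly cond_X squared
ratio = cond_XtX / cond_X ** 2

print("cond(X):", cond_X)
print("cond(XᵀX):", cond_XtX)
print("cond(XᵀX) / cond(X)²:", ratio)
```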