When should you use gradient descent over the normal equation to fit a linear regression?
The normal equation gives an exact closed-form solution in O(p³) time but becomes impractical when the number of features p is large (typically above ~10,000) because matrix inversion is cubic. Gradient descent scales as O(np) per iteration, making it the only viable option for large feature spaces or online learning.
How to think about it
Normal equation: β = (XᵀX)⁻¹Xᵀy
Exact solution, no learning rate, no iterations. The cost is computing (XᵀX)⁻¹, which is O(p³). For p = 1,000 that is trivial; for p = 100,000 it is already 10¹⁵ operations — infeasible.
Gradient descent: iteratively updates β ← β - α * (2/n) Xᵀ(Xβ - y)
Each step costs O(np). Batch gradient descent sees all n samples per step; stochastic (SGD) processes one sample; mini-batch is the practical default.
Head-to-head comparison:
| Criterion | Normal Equation | Gradient Descent |
|---|---|---|
| Time complexity | O(p³ + np²) | O(np) per step |
| Convergence | Exact, single pass | Iterative, needs tuning |
| Hyperparameters | None | Learning rate, epochs |
| Works with regularization | Yes (Ridge adds λI) | Yes |
| Online / streaming | No | Yes (SGD) |
| Large p (features) | Impractical | Preferred |
| Large n (samples) | Bottleneck in np² | Efficient (mini-batch) |
Regularized normal equation (Ridge):
β = (XᵀX + λI)⁻¹Xᵀy
Adding λI also regularizes the inversion — the matrix is always positive definite, so no rank-deficiency issues even under multicollinearity.
from sklearn.linear_model import LinearRegression, SGDRegressor
# Normal equation path (via LAPACK, used for small p)
lr = LinearRegression().fit(X_train, y_train)
# Gradient descent (explicit, scales to large p)
sgd = SGDRegressor(max_iter=1000, learning_rate="adaptive", eta0=0.01)
sgd.fit(X_train, y_train)