datarekha
Machine Learning Medium Asked at GoogleAsked at AmazonAsked at Apple

When should you use gradient descent over the normal equation to fit a linear regression?

The short answer

The normal equation gives an exact closed-form solution in O(p³) time but becomes impractical when the number of features p is large (typically above ~10,000) because matrix inversion is cubic. Gradient descent scales as O(np) per iteration, making it the only viable option for large feature spaces or online learning.

How to think about it

Normal equation: β = (XᵀX)⁻¹Xᵀy

Exact solution, no learning rate, no iterations. The cost is computing (XᵀX)⁻¹, which is O(p³). For p = 1,000 that is trivial; for p = 100,000 it is already 10¹⁵ operations — infeasible.

Gradient descent: iteratively updates β ← β - α * (2/n) Xᵀ(Xβ - y)

Each step costs O(np). Batch gradient descent sees all n samples per step; stochastic (SGD) processes one sample; mini-batch is the practical default.

Head-to-head comparison:

CriterionNormal EquationGradient Descent
Time complexityO(p³ + np²)O(np) per step
ConvergenceExact, single passIterative, needs tuning
HyperparametersNoneLearning rate, epochs
Works with regularizationYes (Ridge adds λI)Yes
Online / streamingNoYes (SGD)
Large p (features)ImpracticalPreferred
Large n (samples)Bottleneck in np²Efficient (mini-batch)

Regularized normal equation (Ridge):

β = (XᵀX + λI)⁻¹Xᵀy

Adding λI also regularizes the inversion — the matrix is always positive definite, so no rank-deficiency issues even under multicollinearity.

from sklearn.linear_model import LinearRegression, SGDRegressor

# Normal equation path (via LAPACK, used for small p)
lr = LinearRegression().fit(X_train, y_train)

# Gradient descent (explicit, scales to large p)
sgd = SGDRegressor(max_iter=1000, learning_rate="adaptive", eta0=0.01)
sgd.fit(X_train, y_train)
Learn it properly Gradient descent

Keep practising

All Machine Learning questions

Explore further

Skip to content