Solving linear systems: row echelon & RREF
Almost every closed-form model fit is a linear system in disguise. Gaussian elimination is the algorithm that solves it — and tells you whether a unique answer even exists.
What you'll learn
- The three elementary row operations and why they never change the solution set
- Row echelon form vs reduced row echelon form (RREF) — pivots and the staircase
- Gaussian elimination as a step-by-step algorithm you can run by hand
- How RREF reveals the three outcomes: one solution, infinitely many, or none
- Why the normal equations make linear regression a linear system you must solve
Before you start
Fitting a linear regression, balancing a chemical equation, solving for the weights that make a layer reproduce a target — under the hood these are all the same question:
A x = b
Given a matrix A and a vector b, find the x that works. Three things
can happen: there’s exactly one x, there are infinitely many, or
there are none. Gaussian elimination finds the answer and tells you
which case you’re in. It’s the most-used algorithm in all of applied math,
and it’s just bookkeeping you can do by hand.
The one rule: operations that preserve the solution
Write the system as an augmented matrix [ A | b ] — the coefficients
on the left, the right-hand side after the bar. You’re allowed exactly
three moves, the elementary row operations:
- Swap two rows.
- Scale a row by a non-zero number.
- Add a multiple of one row to another.
The magic: none of these change the set of solutions. Each one is just re-stating the same equations. So we can hammer the matrix into a simple shape and read the answer straight off.
The target shape: echelon, then reduced echelon
We drive the matrix toward a staircase.
A pivot is the leading non-zero entry of a row. Row echelon form
(REF) has the pivots stepping down-and-right with zeros below them.
Reduced row echelon form (RREF) goes further: every pivot is 1, and
its whole column is zero except for that 1. In RREF the answer is
literally written in the last column.
Run it yourself
Switch presets to feel the three outcomes. “Infinitely many” leaves a
column with no pivot — a free variable you can set to anything.
“No solution” ends with a row that says 0 = (nonzero), which is a
contradiction.
Reading the verdict — and meeting rank
Count the pivots after reduction. That count is the rank of A (the
star of the next lesson). For a system with n unknowns:
- rank = n → every variable is pinned down → one solution.
- rank < n, consistent → free variables → infinitely many.
- a
0 = nonzerorow → inconsistent → no solution.
In code
You’ll almost never reduce by hand in practice — but writing the algorithm
once cements it, and numpy does the rest.
The RREF’s last column is the solution; np.linalg.solve gives the same
answer far faster.
Where this lives in ML: the normal equations
Linear regression looks for weights w minimizing ‖Xw − y‖². Setting the
gradient to zero gives the normal equations:
(Xᵀ X) w = Xᵀ y
That’s A w = b with A = XᵀX and b = Xᵀy — a linear system, solved
by exactly this machinery. And the existence question matters: if your
features are linearly dependent (redundant columns), XᵀX is rank
deficient (singular), the system has no unique solution, and the fit is
undefined. That’s the precise reason we reach for regularization — it
nudges XᵀX back to full rank so a unique w exists again.
Quick check
Quick check
Practice this in an interview
All questionsOLS minimizes the sum of squared residuals. Setting the gradient of the loss to zero yields the normal equations, whose unique solution is the projection of y onto the column space of X. The closed-form is the hat matrix formula β = (XᵀX)⁻¹Xᵀy.
EM fits a GMM by alternating two steps: the E-step computes each point's responsibility (posterior probability) under each Gaussian using current parameters, and the M-step updates the means, covariances, and mixing weights to maximize the expected log-likelihood given those responsibilities. It iterates until the likelihood converges. Because the objective is non-convex, EM only reaches a local optimum, so initialization and multiple restarts matter.
The normal equation gives an exact closed-form solution in O(p³) time but becomes impractical when the number of features p is large (typically above ~10,000) because matrix inversion is cubic. Gradient descent scales as O(np) per iteration, making it the only viable option for large feature spaces or online learning.
OLS linear regression rests on five assumptions: linearity, independence of errors, homoscedasticity, normality of residuals, and no perfect multicollinearity. Violating any one of them degrades coefficient estimates, standard errors, or the validity of hypothesis tests.