datarekha

Solving linear systems: row echelon & RREF

Almost every closed-form model fit is a linear system in disguise. Gaussian elimination is the algorithm that solves it — and tells you whether a unique answer even exists.

9 min read Intermediate Math for ML Lesson 5 of 30

What you'll learn

  • The three elementary row operations and why they never change the solution set
  • Row echelon form vs reduced row echelon form (RREF) — pivots and the staircase
  • Gaussian elimination as a step-by-step algorithm you can run by hand
  • How RREF reveals the three outcomes: one solution, infinitely many, or none
  • Why the normal equations make linear regression a linear system you must solve

Before you start

Fitting a linear regression, balancing a chemical equation, solving for the weights that make a layer reproduce a target — under the hood these are all the same question:

A x = b

Given a matrix A and a vector b, find the x that works. Three things can happen: there’s exactly one x, there are infinitely many, or there are none. Gaussian elimination finds the answer and tells you which case you’re in. It’s the most-used algorithm in all of applied math, and it’s just bookkeeping you can do by hand.

The one rule: operations that preserve the solution

Write the system as an augmented matrix [ A | b ] — the coefficients on the left, the right-hand side after the bar. You’re allowed exactly three moves, the elementary row operations:

  1. Swap two rows.
  2. Scale a row by a non-zero number.
  3. Add a multiple of one row to another.

The magic: none of these change the set of solutions. Each one is just re-stating the same equations. So we can hammer the matrix into a simple shape and read the answer straight off.

The target shape: echelon, then reduced echelon

We drive the matrix toward a staircase.

Row echelon form (REF)101001leading 1s step right; zeros belowReduced (RREF)100010001also zero ABOVE each pivot
REF gives a staircase of pivots; RREF cleans the columns above them too — so each variable is isolated.

A pivot is the leading non-zero entry of a row. Row echelon form (REF) has the pivots stepping down-and-right with zeros below them. Reduced row echelon form (RREF) goes further: every pivot is 1, and its whole column is zero except for that 1. In RREF the answer is literally written in the last column.

Run it yourself

Switch presets to feel the three outcomes. “Infinitely many” leaves a column with no pivot — a free variable you can set to anything. “No solution” ends with a row that says 0 = (nonzero), which is a contradiction.

Reading the verdict — and meeting rank

Count the pivots after reduction. That count is the rank of A (the star of the next lesson). For a system with n unknowns:

  • rank = n → every variable is pinned down → one solution.
  • rank < n, consistent → free variablesinfinitely many.
  • a 0 = nonzero row → inconsistentno solution.

In code

You’ll almost never reduce by hand in practice — but writing the algorithm once cements it, and numpy does the rest.

The RREF’s last column is the solution; np.linalg.solve gives the same answer far faster.

Where this lives in ML: the normal equations

Linear regression looks for weights w minimizing ‖Xw − y‖². Setting the gradient to zero gives the normal equations:

(Xᵀ X) w = Xᵀ y

That’s A w = b with A = XᵀX and b = Xᵀy — a linear system, solved by exactly this machinery. And the existence question matters: if your features are linearly dependent (redundant columns), XᵀX is rank deficient (singular), the system has no unique solution, and the fit is undefined. That’s the precise reason we reach for regularization — it nudges XᵀX back to full rank so a unique w exists again.

Quick check

Quick check

0/3
Q1After reducing [A | b] to RREF you get a row that reads `0 0 0 | 5`. What does that mean?
Q2A 3-unknown system reduces to RREF with only 2 pivots and no contradiction row. How many solutions?
Q3Why does linear regression with two perfectly correlated features break?

Sign in to track your progress

Completed lessons, your XP, level, and streak save to your account — it's free and takes a few seconds.

Practice this in an interview

All questions
How does Ordinary Least Squares derive the coefficient vector, and what is the closed-form solution?

OLS minimizes the sum of squared residuals. Setting the gradient of the loss to zero yields the normal equations, whose unique solution is the projection of y onto the column space of X. The closed-form is the hat matrix formula β = (XᵀX)⁻¹Xᵀy.

Explain the EM algorithm in the context of fitting a Gaussian Mixture Model.

EM fits a GMM by alternating two steps: the E-step computes each point's responsibility (posterior probability) under each Gaussian using current parameters, and the M-step updates the means, covariances, and mixing weights to maximize the expected log-likelihood given those responsibilities. It iterates until the likelihood converges. Because the objective is non-convex, EM only reaches a local optimum, so initialization and multiple restarts matter.

When should you use gradient descent over the normal equation to fit a linear regression?

The normal equation gives an exact closed-form solution in O(p³) time but becomes impractical when the number of features p is large (typically above ~10,000) because matrix inversion is cubic. Gradient descent scales as O(np) per iteration, making it the only viable option for large feature spaces or online learning.

What are the core assumptions of linear regression, and what breaks when each is violated?

OLS linear regression rests on five assumptions: linearity, independence of errors, homoscedasticity, normality of residuals, and no perfect multicollinearity. Violating any one of them degrades coefficient estimates, standard errors, or the validity of hypothesis tests.

Related lessons

Explore further

Skip to content