How does Ordinary Least Squares derive the coefficient vector, and what is the closed-form solution?

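OLS minimizes the sum of squared residuals ‖y − Xβ‖². Setting the gradient of the loss to zero yields the normal equations XᵀXβ = Xᵀy. When X has full column rank, their unique solution is the closed form β = (XᵀX)⁻¹Xᵀy, and the fitted values Xβ are the orthogonal projection of y onto the column space of X. The hat matrix is the projector H = X(XᵀX)⁻¹Xᵀ that maps y to those fitted values; it is not the formula for β itself.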
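A minimal sketch of the closed form on synthetic data. The data, the seed, and the names `true_beta`, `beta_normal`, and `beta_lstsq` are assumptions made for illustration. It solves the normal equations without forming an explicit inverse and checks the result against NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)                      # fixed seed, illustrative data only
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])  # intercept + 2 features
true_beta = np.array([1.0, 2.0, -3.0])              # hypothetical coefficients
y = X @ true_beta + 0.1 * rng.normal(size=50)

# Normal equations: (X^T X) beta = X^T y, solved as a linear system.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# General least-squares solver, for comparison.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# The residual of the fit is orthogonal to every column of X,
# which is what "projection onto the column space" means.
residual = y - X @ beta_normal

print("same coefficients:", np.allclose(beta_normal, beta_lstsq))
print("residual orthogonal to columns:", np.allclose(X.T @ residual, 0.0, atol=1e-8))
```

Solving the system rather than computing (XᵀX)⁻¹ gives the same β with less work and usually better numerical behavior.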

Explain the EM algorithm in the context of fitting a Gaussian Mixture Model.

EM fits a GMM by alternating two steps: the E-step computes each point's responsibility (posterior probability) under each Gaussian using current parameters, and the M-step updates the means, covariances, and mixing weights to maximize the expected log-likelihood given those responsibilities. It iterates until the likelihood converges. Because the objective is non-convex, EM only reaches a local optimum, so initialization and multiple restarts matter.
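The two steps can be sketched for a one-dimensional mixture. This is an illustrative sketch, not a production implementation: the function name `em_gmm_1d`, the quantile-based initialization, and the small variance floor are all assumptions chosen to keep the example short and stable.

```python
import numpy as np

def em_gmm_1d(x, k=2, n_iter=200, tol=1e-8):
    """Fit a 1-D Gaussian mixture with EM (illustrative sketch)."""
    n = x.size
    # Assumed initialization: spread the means over quantiles of the data.
    means = np.quantile(x, (np.arange(k) + 0.5) / k)
    variances = np.full(k, x.var())
    weights = np.full(k, 1.0 / k)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        dens = np.exp(-0.5 * (x[:, None] - means) ** 2 / variances)
        dens /= np.sqrt(2.0 * np.pi * variances)
        weighted = dens * weights
        total = weighted.sum(axis=1, keepdims=True)
        resp = weighted / total
        # M-step: re-estimate parameters from the responsibilities.
        nk = resp.sum(axis=0)
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk + 1e-6
        weights = nk / n
        # Log-likelihood under the parameters used in this E-step.
        ll = np.log(total).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return means, variances, weights

rng = np.random.default_rng(1)                      # synthetic, well-separated clusters
x = np.concatenate([rng.normal(-4, 1, 200), rng.normal(4, 1, 200)])
means, variances, weights = em_gmm_1d(x)
print("means:", np.sort(means), "weights:", weights)
```

On harder data the result depends on the starting point, which is why restarts from several initializations are commonly used.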

When should you use gradient descent over the normal equation to fit a linear regression?

The normal equation gives an exact closed-form solution, but it costs O(np²) to form XᵀX plus O(p³) to solve the resulting p × p system, so it becomes impractical when the number of features p is large (a common rule of thumb is above roughly 10,000). Gradient descent costs O(np) per iteration, which makes it the usual choice for large feature spaces, and its stochastic variants suit online learning where data arrives as a stream.
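A small sketch of batch gradient descent on the squared-error loss, compared with the normal-equation answer. The function name `fit_gd`, the learning rate, the iteration count, and the synthetic data are assumptions; a real problem would need the step size tuned to the scale of the features.

```python
import numpy as np

def fit_gd(X, y, lr=0.1, n_iter=2000):
    """Minimize mean squared error by batch gradient descent (sketch)."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        grad = (2.0 / n) * X.T @ (X @ w - y)   # two matrix-vector products: O(np)
        w -= lr * grad
    return w

rng = np.random.default_rng(0)                 # illustrative data only
X = np.column_stack([np.ones(200), rng.normal(size=(200, 3))])
y = X @ np.array([0.5, 1.0, -2.0, 3.0]) + 0.1 * rng.normal(size=200)

w_gd = fit_gd(X, y)
w_exact = np.linalg.solve(X.T @ X, X.T @ y)    # normal equations
print("max difference:", np.max(np.abs(w_gd - w_exact)))
```

Each iteration never forms the p × p matrix XᵀX, which is where the saving comes from when p is large.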

What are the core assumptions of linear regression, and what breaks when each is violated?

OLS linear regression rests on five assumptions: linearity, independence of errors, homoscedasticity, normality of residuals, and no perfect multicollinearity. Violating any one of them degrades coefficient estimates, standard errors, or the validity of hypothesis tests.

Solving linear systems: row echelon & RREF — Math for ML

What you'll learn

The three elementary row operations and why they never change the solution set

Row echelon form vs reduced row echelon form (RREF) — pivots and the staircase

Gaussian elimination as a step-by-step algorithm you can run by hand

How RREF reveals the three outcomes: one solution, infinitely many, or none

Why the normal equations make linear regression a linear system you must solve

The last lesson turned the arrow around. Forward, a matrix takes x and produces Ax; the question now is the reverse — you are handed the transform A and its output b, and asked for the input x that produced it. Fitting a linear regression, balancing a chemical equation, solving for the weights that make a layer reproduce a target — under the hood these are all the same question:

A x = b

Given a matrix A and a vector b, find the x that works. And, as the last lesson warned, the answer is not always a single tidy value: there is exactly one x, or infinitely many, or none at all. Gaussian elimination finds the answer and tells you which of the three you are in. It is one of the most widely used algorithms in applied mathematics, and it is just bookkeeping you can do by hand.

The one rule: operations that preserve the solution

Write the system as an augmented matrix [ A | b ] — the coefficients on the left, the right-hand side after the bar. You’re allowed exactly three moves, the elementary row operations:

Swap two rows.
Scale a row by a non-zero number.
Add a multiple of one row to another.

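The magic: none of these change the set of solutions. Each move is reversible (swap back, divide by the same number, subtract the same multiple), so any x that satisfies the old equations satisfies the new ones, and vice versa. So we can hammer the matrix into a simple shape and read the answer straight off.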
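A quick numerical check of that claim, using the augmented matrix from this lesson's example and its known solution x = (2, 3, −1). The helper name `satisfies` and the particular rows and multipliers are arbitrary choices for illustration.

```python
import numpy as np

aug = np.array([[ 2,  1, -1,   8],
                [-3, -1,  2, -11],
                [-2,  1,  2,  -3]], float)
x = np.array([2.0, 3.0, -1.0])          # solution of the original system

def satisfies(M, x):
    """True if x solves the system stored as the augmented matrix M = [A | b]."""
    return np.allclose(M[:, :-1] @ x, M[:, -1])

swapped = aug.copy()
swapped[[0, 2]] = swapped[[2, 0]]       # swap rows 0 and 2

scaled = aug.copy()
scaled[1] *= -4.0                       # scale row 1 by a non-zero number

added = aug.copy()
added[2] += 1.5 * added[0]              # add 1.5 * row 0 to row 2

for name, M in [("original", aug), ("swap", swapped), ("scale", scaled), ("add", added)]:
    print(name, satisfies(M, x))
```

This only shows that the solution survives each move; the fact that no new solutions appear follows from each move being undoable by another move of the same kind.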

The target shape: echelon, then reduced echelon

We drive the matrix toward a staircase.

REF gives a staircase of pivots; RREF cleans the columns above them too — so each variable is isolated.

A pivot is the leading non-zero entry of a row. Row echelon form (REF) has the pivots stepping down-and-right with zeros below them. Reduced row echelon form (RREF) goes further: every pivot is 1, and its whole column is zero except for that 1. In RREF the answer is literally written in the last column.
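To make the difference concrete, here is a sketch that stops at REF and then finishes with back-substitution, which is the work RREF would otherwise do inside the matrix. The function names `ref` and `back_substitute` are hypothetical, and the back-substitution step assumes a square system with a pivot in every column.

```python
import numpy as np

def ref(M, tol=1e-9):
    """Forward elimination only: zeros below each pivot (row echelon form)."""
    M = M.astype(float).copy()
    rows, cols = M.shape
    r = 0
    for c in range(cols):
        if r == rows:
            break
        piv = r + np.argmax(np.abs(M[r:, c]))      # largest entry as pivot
        if abs(M[piv, c]) < tol:
            continue                               # no pivot in this column
        M[[r, piv]] = M[[piv, r]]                  # swap it into place
        for i in range(r + 1, rows):               # clear entries BELOW the pivot only
            M[i] -= (M[i, c] / M[r, c]) * M[r]
        r += 1
    return M

def back_substitute(U):
    """Solve from the bottom row up; assumes n pivots for n unknowns."""
    n = U.shape[0]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (U[i, -1] - U[i, i + 1:n] @ x[i + 1:]) / U[i, i]
    return x

aug = np.array([[2, 1, -1, 8], [-3, -1, 2, -11], [-2, 1, 2, -3]], float)
U = ref(aug)
x = back_substitute(U)
print("REF:\n", U.round(3))
print("x =", x)
```

In REF the pivots are not scaled to 1 and entries above them are left alone, so the answer still has to be unwound row by row; RREF pays for that extra clearing up front.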

Run it yourself

2	1	-1	8
-3	-1	2	-11
-2	1	2	-3

The augmented matrix [ A | b ]. Goal: make the left block as close to the identity as possible.


The three outcomes look different once elimination finishes. “Infinitely many” leaves a column with no pivot: a free variable you can set to anything. “No solution” ends with a row that says 0 = (nonzero), which is a contradiction.

Reading the verdict — and meeting rank

Count the pivots after reduction. That count is the rank of A (the star of the next lesson). For a system with n unknowns:

rank = n, consistent → every variable is pinned down → one solution.
rank < n, consistent → free variables → infinitely many.
a 0 = nonzero row → inconsistent → no solution.
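The same verdict can be reached in code by comparing rank(A) with the rank of the augmented matrix [A | b]: a 0 = nonzero row appears exactly when appending b raises the rank. A minimal sketch, with `classify_system` as a hypothetical helper name and small made-up systems for the two degenerate cases:

```python
import numpy as np

def classify_system(A, b):
    """Return which of the three outcomes A x = b falls into."""
    A = np.asarray(A, float)
    aug = np.column_stack([A, b])
    rank_A = np.linalg.matrix_rank(A)
    rank_aug = np.linalg.matrix_rank(aug)
    n_unknowns = A.shape[1]
    if rank_aug > rank_A:
        return "no solution"            # b adds a pivot: contradiction row
    if rank_A == n_unknowns:
        return "one solution"           # a pivot in every variable column
    return "infinitely many"            # consistent, with free variables

A = [[2, 1, -1], [-3, -1, 2], [-2, 1, 2]]
print(classify_system(A, [8, -11, -3]))            # the lesson's example
print(classify_system([[1, 1], [2, 2]], [2, 4]))   # second row is twice the first
print(classify_system([[1, 1], [2, 2]], [2, 5]))   # same left side, clashing right side
```

`np.linalg.matrix_rank` decides rank with a numerical tolerance, so on noisy or badly scaled data the classification is a judgment call rather than an exact fact.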

In code

You’ll almost never reduce by hand in practice — but writing the algorithm once cements it, and numpy does the rest.

import numpy as np

def rref(M, tol=1e-9):
    M = M.astype(float).copy()
    rows, cols = M.shape
    r = 0
    for c in range(cols):
        piv = np.argmax(np.abs(M[r:, c])) + r       # partial pivoting
        if abs(M[piv, c]) < tol:
            continue
        M[[r, piv]] = M[[piv, r]]                    # swap
        M[r] = M[r] / M[r, c]                        # scale pivot to 1
        for i in range(rows):                        # clear the column
            if i != r:
                M[i] = M[i] - M[i, c] * M[r]
        r += 1
        if r == rows:
            break
    return M

A = np.array([[2, 1, -1], [-3, -1, 2], [-2, 1, 2]], float)
b = np.array([8, -11, -3], float)

aug = np.column_stack([A, b])
print("RREF of [A | b]:\n", rref(aug).round(3))
print("\nrank(A) =", np.linalg.matrix_rank(A), " unknowns =", A.shape[1])
print("solution x =", np.linalg.solve(A, b))   # the workhorse you'll actually call

RREF of [A | b]:
 [[ 1.  0.  0.  2.]
 [ 0.  1.  0.  3.]
 [ 0.  0.  1. -1.]]

rank(A) = 3  unknowns = 3
solution x = [ 2.  3. -1.]

The RREF’s last column is the solution; np.linalg.solve gives the same answer far faster.

Where this lives in ML: the normal equations

Linear regression looks for weights w minimizing ‖Xw − y‖². Setting the gradient to zero gives the normal equations:

(Xᵀ X) w = Xᵀ y

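That’s A w = b with A = XᵀX and b = Xᵀy, a linear system solved by exactly this machinery. And the uniqueness question matters: if your features are linearly dependent (redundant columns), XᵀX is rank deficient (singular). The normal equations are still consistent, because Xᵀy always lies in the column space of XᵀX, so they have infinitely many solutions rather than none: the fitted values are still determined, but the weights w are not. That’s one key reason we reach for regularization: ridge regression solves (XᵀX + λI) w = Xᵀy with λ > 0, and XᵀX + λI is always full rank, so a unique w exists again.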
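A small sketch of the redundant-feature case, using a "price in dollars" column next to a "price in cents" column. The data, the seed, and the penalty `lam = 1.0` are assumptions, and ridge (L2) regularization is used here as one concrete choice of regularizer.

```python
import numpy as np

rng = np.random.default_rng(0)                 # illustrative data only
dollars = rng.uniform(1, 100, size=30)
cents = 100 * dollars                          # exact copy in different units
X = np.column_stack([dollars, cents])
y = 3.0 * dollars + rng.normal(size=30)

# Two columns but only one independent direction, so X^T X (which has
# the same rank as X) is singular.
print("rank of X:", np.linalg.matrix_rank(X), "of", X.shape[1], "columns")

# Ridge: adding lam * I makes the matrix positive definite, so the
# system has exactly one solution.
lam = 1.0                                      # assumed penalty strength
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

# lstsq picks one of the infinitely many least-squares solutions
# (the one with the smallest norm) without any penalty.
w_min_norm, *_ = np.linalg.lstsq(X, y, rcond=None)

print("ridge weights:   ", w_ridge)
print("min-norm weights:", w_min_norm)
print("predictions agree:", np.allclose(X @ w_ridge, X @ w_min_norm, atol=1e-2))
```

The weights on the two columns are not individually meaningful here: many different pairs give the same predictions, and the penalty is what singles one of them out.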

In one breath

Solving Ax = b runs a transform backward, and Gaussian elimination does it with three solution-preserving moves (swap, scale, add-a-multiple), driving the augmented matrix [A | b] toward RREF, where each variable is isolated and the answer sits in the last column. Count the pivots: that count is the rank, and together with consistency it decides everything. Rank = unknowns (and consistent) gives one solution, rank < unknowns (and consistent) gives infinitely many via a free variable, and a 0 = nonzero row means none. In ML this is the normal equations (XᵀX)w = Xᵀy, which lose their unique solution exactly when features are linearly dependent, one key reason regularization exists.

Practice

Quick check


Q1. After reducing [A | b] to RREF you get a row that reads `0 0 0 | 5`. What does that mean?

Q2. A 3-unknown system reduces to RREF with only 2 pivots and no contradiction row. How many solutions?

Q3. Why does linear regression with two perfectly correlated features break?

A question to carry forward

Notice what did the deciding just now. We barely cared about the numbers inside x; the verdict — one solution, infinitely many, or none — came down entirely to a single integer, the number of pivots, which we quietly named the rank of A. That one number knew, before we finished solving, whether an answer existed and whether it was unique.

That is far too powerful to leave as a by-product of elimination. So here is the thread onward: what is rank really — not “pivots left over” but a property of the columns themselves? When does a new feature column add a genuinely new direction, and when is it secretly a copy of one you already have — a “price in cents” sitting beside “price in dollars”? And why do the four ideas that pin this down — span, linear independence, basis, and rank — turn out to be the deepest reason a model can, or cannot, be fit at all?
