What are the core assumptions of linear regression, and what breaks when each is violated?

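OLS linear regression rests on five assumptions: linearity, independence of errors, homoscedasticity, normality of residuals, and no perfect multicollinearity. Each violation breaks something different: a nonlinear relationship biases the coefficient estimates; correlated or heteroscedastic errors leave the coefficients unbiased but make the standard errors unreliable; non-normal residuals undermine small-sample hypothesis tests and confidence intervals; and perfect multicollinearity leaves the coefficients with no unique solution.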
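One way to see a linearity violation is to fit a straight line to data that is actually curved and look at what is left over. The sketch below is illustrative only: the quadratic data and the variable names are assumptions, not part of the original answer.

```python
import numpy as np

# Hypothetical data: the true relationship is quadratic, not linear.
x = np.linspace(-3.0, 3.0, 61)
y = x ** 2

# Fit y ~ b0 + b1 * x by ordinary least squares.
A = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - A @ coef

# A well-specified model leaves residuals without structure.
# Here the residuals track x**2 almost perfectly, which signals a missing term.
corr = np.corrcoef(residuals, x ** 2)[0, 1]
print("fitted slope:", coef[1])
print("corr(residuals, x^2):", corr)
```

The fitted slope is essentially zero because x is symmetric around 0, so the line misses the relationship entirely, while the residuals carry all of the curvature.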

Which models require feature scaling and which don't, and why?

Distance-based, variance-based, and gradient-based models (KNN, K-means, SVM, PCA, regularized linear/logistic regression, neural networks) need scaling because they're sensitive to feature magnitudes: a feature measured in large units dominates distances, variances, and penalty terms. Tree-based models (decision trees, random forests, gradient boosting) are scale-invariant because each split compares a single feature against a threshold. Standardization and min-max scaling are the usual choices; fit the scaler on training data only and reuse its statistics on validation and test data.
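A small sketch of why magnitude matters for a distance-based model. The rows, feature names, and the "training" mean and standard deviation below are made-up values chosen for illustration.

```python
import numpy as np

# Hypothetical rows: [age in years, income in dollars]
query = np.array([30.0, 50_000.0])
similar_age = np.array([31.0, 52_000.0])     # 1 year apart, $2,000 apart
similar_income = np.array([65.0, 50_500.0])  # 35 years apart, $500 apart

def euclid(a, b):
    """Plain Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

def nearest(q, candidates):
    """Return the name of the candidate closest to q."""
    return min(candidates, key=lambda name: euclid(q, candidates[name]))

raw = {"similar_age": similar_age, "similar_income": similar_income}
raw_nearest = nearest(query, raw)

# Assumed training-set statistics used for standardization.
mean = np.array([40.0, 50_000.0])
std = np.array([12.0, 15_000.0])

def standardize(v):
    return (v - mean) / std

scaled = {name: standardize(v) for name, v in raw.items()}
scaled_nearest = nearest(standardize(query), scaled)

print("nearest on raw features:   ", raw_nearest)
print("nearest on scaled features:", scaled_nearest)
```

On raw features the income column decides the neighbor almost by itself; after standardization the 35-year age gap counts, and the neighbor flips.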

What are the assumptions and limitations of PCA, and when would it hurt your model?

PCA assumes linear relationships, that variance equals importance, and that components should be orthogonal. It can hurt when the predictive signal lives in low-variance directions, when relationships are nonlinear, or when interpretability matters, since components mix original features. It's also sensitive to scaling and outliers and is unsupervised, so it ignores the target.
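The "signal in a low-variance direction" failure can be sketched with synthetic data. Everything here is an assumption made for illustration: the two features, their scales, and the target. PCA is computed directly from the SVD of the centered data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Hypothetical features: one loud but irrelevant, one quiet but predictive.
loud = rng.normal(0.0, 10.0, size=n)   # high variance, unrelated to y
quiet = rng.normal(0.0, 0.1, size=n)   # low variance, drives y
X = np.column_stack([loud, quiet])
y = 5.0 * quiet

# PCA via SVD of the centered data; rows of Vt are the principal directions.
Xc = X - X.mean(axis=0)
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s ** 2 / np.sum(s ** 2)

pc1 = Xc @ Vt[0]
pc2 = Xc @ Vt[1]
r1 = abs(np.corrcoef(pc1, y)[0, 1])
r2 = abs(np.corrcoef(pc2, y)[0, 1])

print("variance explained by PC1:", explained[0])
print("|corr(PC1, y)|:", r1)
print("|corr(PC2, y)|:", r2)
```

Keeping only the first component would retain nearly all of the variance and discard nearly all of the predictive signal, because PCA never looks at y.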

What is the difference between standardization and normalization, and which models require feature scaling?

Standardization rescales features to zero mean and unit variance; normalization squashes values into a fixed range, usually [0, 1]. Distance-based and gradient-based models are sensitive to scale and require one of these; tree-based models split on rank order and are scale-invariant.
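The two rescalings side by side, as a minimal sketch with made-up numbers. The statistics are computed on the training values only and then reused on the test values.

```python
import numpy as np

# Hypothetical single feature, split into train and test values.
train = np.array([10.0, 20.0, 30.0, 40.0])
test = np.array([25.0, 50.0])

# Statistics come from the training data only.
mu, sigma = train.mean(), train.std()
lo, hi = train.min(), train.max()

# Standardization: zero mean, unit variance on the training data.
train_std = (train - mu) / sigma
test_std = (test - mu) / sigma

# Min-max normalization: training values land in [0, 1].
train_mm = (train - lo) / (hi - lo)
test_mm = (test - lo) / (hi - lo)

print("standardized train:", train_std)
print("min-max train:     ", train_mm)
print("min-max test:      ", test_mm)
```

Note that a test value above the training maximum (50 here) maps outside [0, 1] under min-max scaling; that is expected when the scaler is fit on training data only.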

Rank, linear independence, span & basis — Math for ML

What you'll learn

Span — the set of everything you can build from a group of vectors

Linear independence — when no vector is redundant

Basis & dimension — a minimal coordinate system for a space

Rank — the number of independent directions, and how it ties RREF, SVD, and regression together

Why rank deficiency (multicollinearity) breaks models and how to spot it

The last lesson left rank as a cliffhanger: a single integer — the pivot count — that silently decided whether Ax = b had one solution, many, or none. But “pivots left after elimination” describes how to compute rank, not what it is. Here is what it is, told through the smallest example there is. You add a “price in dollars” column and a “price in cents” column to your dataset; the second carries zero new information — it is just the first times 100. Linear algebra has exact words for that redundancy, and those words decide whether your model can be fit at all.

Span: what you can reach

The span of a set of vectors is every point you can reach by scaling and adding them: all combinations a·v₁ + b·v₂ + …. Two vectors that are not multiples of each other span the whole 2D plane. Two vectors on the same line through the origin span only that line.
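One way to test whether a target vector lies in a span is to check that appending it as an extra column does not raise the rank. The helper name `in_span` and the example vectors are assumptions for illustration.

```python
import numpy as np

def in_span(vectors, target):
    """True if target is a linear combination of the given vectors.

    Appending a vector that is already reachable cannot add a new
    direction, so the rank stays the same.
    """
    A = np.column_stack(vectors)
    augmented = np.column_stack([A, target])
    return bool(np.linalg.matrix_rank(augmented) == np.linalg.matrix_rank(A))

# Two vectors that are not multiples of each other reach any point in the plane.
e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(in_span([e1, e2], np.array([3.0, -2.0])))   # True

# Two vectors on the same line reach only that line.
u1, u2 = np.array([1.0, 3.0]), np.array([2.0, 6.0])
print(in_span([u1, u2], np.array([3.0, 9.0])))    # True: on the line
print(in_span([u1, u2], np.array([1.0, 0.0])))    # False: off the line
```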

Linear independence: is anyone redundant?

A set is linearly independent if no vector can be built from the others — equivalently, the only way to combine them into the zero vector is to multiply them all by zero. The moment one vector is a combination of the rest, the set is dependent, and that vector adds nothing to the span.
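The definition translates into a short check: stack the vectors as columns and compare the rank with the number of vectors. The helper `is_independent` and the vectors are hypothetical examples.

```python
import numpy as np

def is_independent(vectors):
    """True if no vector in the list is a combination of the others."""
    A = np.column_stack(vectors)
    return bool(np.linalg.matrix_rank(A) == A.shape[1])

a = np.array([2.0, 1.0])
b = np.array([1.0, 3.0])
c = np.array([4.0, 2.0])   # c = 2 * a, so it is redundant next to a

print(is_independent([a, b]))      # True
print(is_independent([a, c]))      # False
print(is_independent([a, b, c]))   # False: three vectors cannot be independent in 2D

# The dependence shows up as a non-trivial combination that hits zero.
print(2 * a - 1 * c)               # [0. 0.]

# For two vectors in 2D, a non-zero determinant says the same thing.
det_ab = np.linalg.det(np.column_stack([a, b]))
print(det_ab)                      # approximately 5
```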

[Interactive figure: two draggable vectors in the plane. Default state: rank = 2, det = 8.00 (≠ 0 → independent); the vectors' combinations fill the whole plane, so together they are a basis for ℝ². Rank = the number of independent directions = the dimension of the span.]

Basis & dimension

A basis is the sweet spot: a set that is independent and spans the space — a minimal, non-redundant coordinate system. The number of vectors in a basis is the dimension of the space. ℝ² needs exactly two; any third vector is necessarily dependent.
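Because a basis is both independent and spanning, every vector has exactly one set of coordinates in it. A minimal sketch, assuming a made-up basis for ℝ², finds those coordinates by solving a small linear system.

```python
import numpy as np

# Hypothetical basis for the plane (not the standard one).
b1 = np.array([1.0, 1.0])
b2 = np.array([1.0, -1.0])
B = np.column_stack([b1, b2])

v = np.array([3.0, 1.0])

# Solve B @ coords = v. A unique solution exists because B has full rank.
coords = np.linalg.solve(B, v)
print("coordinates of v in the basis:", coords)   # approximately [2. 1.]

# Rebuilding v from its coordinates: 2 * b1 + 1 * b2
rebuilt = coords[0] * b1 + coords[1] * b2
print("rebuilt:", rebuilt)
```

With a dependent set in place of `B`, `np.linalg.solve` would raise an error for a singular matrix, which is the same redundancy seen from the computational side.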

Rank: the number that ties it all together

The rank of a matrix is the number of linearly independent columns — equivalently, the dimension of the space its columns span. And it shows up everywhere you’ve already been:

It’s the number of pivots in the RREF (last lesson).
It’s the number of non-zero singular values in the SVD (next idea).
A square matrix is invertible iff it has full rank.

import numpy as np

# price_dollars, price_cents (= 100x), and a genuinely new feature
dollars = np.array([3., 5., 2., 9.])
X = np.column_stack([dollars, dollars * 100, [1., 0., 1., 1.]])

print("shape:", X.shape, " columns:", X.shape[1])
print("rank :", np.linalg.matrix_rank(X))   # 2, not 3 -- one column is redundant

# Rank deficiency => XtX is singular => no unique least-squares solution
XtX = X.T @ X
print("det(XᵀX):", round(np.linalg.det(XtX), 6))   # ~0 -> singular

shape: (4, 3)  columns: 3
rank : 2
det(XᵀX): 0.0

Three columns, but rank 2: the “cents” column is dependent, so it adds no direction the data did not already have. XᵀX becomes singular and the normal equations lose their unique solution.
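The same matrix can illustrate two of the claims above: rank as the count of non-zero singular values, and the loss of a unique solution. This is a sketch, not the lesson's own code; the tolerance is a common choice for deciding which singular values count as zero, and the coefficient vectors `beta_a` and `beta_b` are made up.

```python
import numpy as np

dollars = np.array([3., 5., 2., 9.])
X = np.column_stack([dollars, dollars * 100, [1., 0., 1., 1.]])

# Rank as the number of singular values that are not numerically zero.
s = np.linalg.svd(X, compute_uv=False)
tol = s.max() * max(X.shape) * np.finfo(X.dtype).eps   # assumed tolerance
n_nonzero = int((s > tol).sum())
print("singular values:", s)
print("non-zero count :", n_nonzero)

# Two different coefficient vectors that make identical predictions:
# credit can move freely between the dollars and cents columns.
beta_a = np.array([1.0, 0.00, 2.0])
beta_b = np.array([0.0, 0.01, 2.0])
same_predictions = bool(np.allclose(X @ beta_a, X @ beta_b))
print("same predictions:", same_predictions)

# lstsq still returns an answer: the minimum-norm one among all solutions.
y = X @ beta_a
beta_min, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print("lstsq rank:", rank)
print("min-norm coefficients:", beta_min)
```

The fit itself is fine, since every solution reproduces y; what is lost is any meaning in the individual coefficients.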

Where this lives in ML

Multicollinearity. Strongly correlated features make the feature matrix nearly rank-deficient. Coefficients become unstable and hard to interpret, because the model can’t decide how to split credit between the near-copies.
The dummy-variable trap. One-hot encode a category into k columns and include an intercept, and the k columns sum to the all-ones intercept column: instant linear dependence. That’s why you drop one level.
Effective dimensionality. Your data may live in 300 columns but have rank 12 — only 12 directions of real variation. That’s the gap PCA and SVD exploit.
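Two of these cases are easy to reproduce. The sketch below uses made-up data: a three-level category for the dummy-variable trap, and a 300-column matrix built from 12 hidden directions for effective dimensionality.

```python
import numpy as np

# --- Dummy-variable trap -------------------------------------------------
categories = np.array([0, 1, 2, 0, 1, 2, 0])    # hypothetical category, k = 3
one_hot = np.eye(3)[categories]                  # shape (7, 3)
intercept = np.ones((len(categories), 1))

full = np.hstack([intercept, one_hot])           # intercept + all k levels
dropped = np.hstack([intercept, one_hot[:, 1:]])  # intercept + k - 1 levels

rank_full = np.linalg.matrix_rank(full)
rank_dropped = np.linalg.matrix_rank(dropped)
print("full encoding:   ", full.shape[1], "columns, rank", rank_full)
print("one level dropped:", dropped.shape[1], "columns, rank", rank_dropped)

# --- Effective dimensionality --------------------------------------------
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 12))    # 12 real directions of variation
mixing = rng.normal(size=(12, 300))    # spread across 300 observed columns
wide = latent @ mixing

rank_wide = np.linalg.matrix_rank(wide)
print("wide matrix:", wide.shape[1], "columns, rank", rank_wide)
```

With the full encoding, four columns carry only three independent directions; dropping one level leaves three columns with rank three. The wide matrix has 300 columns but only 12 independent ones, since every column is a mix of the same 12 latent directions.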

In one breath

Four ideas pin down “redundant information.” The span of a set of vectors is everything you can build from them by scaling and adding; a set is linearly independent when no vector is a combination of the others (no one is redundant); a basis is the sweet spot — independent and spanning, a minimal coordinate system whose size is the space’s dimension; and the rank of a matrix is the number of independent columns — the same number as the pivots in its RREF and the non-zero singular values in its SVD. When rank < number of columns (multicollinearity, the dummy-variable trap), XᵀX is singular and the fit has no unique answer — and that same gap is what PCA and SVD exploit when 300 columns really carry only 12 directions.

Practice

Quick check


Q1. Vectors [1,2], [2,4], [0,1] in ℝ². What is the rank of the matrix with these as columns?

Q2. A dataset has 500 feature columns but matrix_rank(X) returns 40. What does that mean?

Q3. Why does adding an intercept plus a full one-hot encoding (all k categories) break a linear model?

A question to carry forward

Rank told us how many independent directions a set of columns truly has. But it stayed silent on a softer, more useful question: not merely whether two directions are independent, but whether they are independent in the cleanest way — at right angles, each casting no shadow on the other at all. When the columns of a matrix are mutually perpendicular, a surprising number of hard problems collapse into easy ones.

And that matters most exactly where rank just stranded us. When Ax = b has no exact solution — too many noisy equations, too few unknowns, the everyday reality of regression — we cannot solve it, but we can ask for the next best thing: the x that comes closest. Here is the thread onward: what does “closest” mean geometrically — a projection, a perpendicular dropped onto the space the columns span — why does orthogonality make that closest answer almost effortless to compute, and how is this single idea, least squares, the exact engine that fits a straight line through a cloud of scattered points?
