Rank, linear independence, span & basis
When does a feature add real information, and when is it just a copy of the others? Span, independence, basis, and rank are the four ideas that answer it — and they decide whether your model can even be fit.
What you'll learn
- Span — the set of everything you can build from a group of vectors
- Linear independence — when no vector is redundant
- Basis & dimension — a minimal coordinate system for a space
- Rank — the number of independent directions, and how it ties RREF, SVD, and regression together
- Why rank deficiency (multicollinearity) breaks models and how to spot it
Before you start
You add a “price in dollars” column and a “price in cents” column to your dataset. The second one carries zero new information — it’s just the first times 100. Linear algebra has precise words for this, and they decide whether your model can be fit at all.
Span: what you can reach
The span of a set of vectors is every point you can reach by scaling
and adding them: all combinations a·v₁ + b·v₂ + …. Two vectors that do
not lie on the same line through the origin span the whole 2D plane.
Two vectors on the same line span only that line.
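One way to test span membership numerically is to ask least squares for the closest reachable point and see whether it hits the target. This is a minimal sketch using NumPy; the helper name `in_span`, the tolerance, and the example vectors are illustrative assumptions.

```python
import numpy as np

def in_span(vectors, target, tol=1e-10):
    """Return True if `target` is (numerically) a linear combination of `vectors`."""
    A = np.column_stack(vectors)
    # Least squares finds coefficients whose combination A @ coeffs is as close
    # to `target` as the span allows; a zero residual means target is reachable.
    coeffs, *_ = np.linalg.lstsq(A, target, rcond=None)
    return bool(np.linalg.norm(A @ coeffs - target) < tol)

v1 = np.array([1.0, 0.0])
v2 = np.array([1.0, 1.0])      # not on the same line as v1
w = np.array([2.0, 0.0])       # on the same line as v1
target = np.array([3.0, 5.0])

reach_plane = in_span([v1, v2], target)  # v1, v2 span the whole plane
reach_line = in_span([v1, w], target)    # v1, w span only the x-axis
print(reach_plane, reach_line)
```

With these vectors the first check succeeds and the second fails, because the target has a non-zero second coordinate that no combination of `v1` and `w` can produce.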
Linear independence: is anyone redundant?
A set is linearly independent if no vector can be built from the others — equivalently, the only way to combine them into the zero vector is to multiply them all by zero. The moment one vector is a combination of the rest, the set is dependent, and that vector adds nothing to the span.
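A quick numerical check for independence is to stack the vectors as columns and compare the rank with the number of vectors. The sketch below assumes NumPy; `is_independent` and the three example vectors are hypothetical names chosen for illustration.

```python
import numpy as np

def is_independent(vectors):
    """True if no vector in the list is a combination of the others."""
    A = np.column_stack(vectors)
    # Independent columns <=> rank equals the number of columns.
    return bool(np.linalg.matrix_rank(A) == A.shape[1])

a = np.array([1.0, 0.0, 2.0])
b = np.array([0.0, 1.0, 1.0])
c = 2 * a - 3 * b              # built from a and b, so it is redundant

pair_ok = is_independent([a, b])
triple_ok = is_independent([a, b, c])
print(pair_ok, triple_ok)

# The redundancy shows up as a non-trivial combination that gives zero:
# 2*a - 3*b - 1*c == 0 with coefficients that are not all zero.
zero_combo = 2 * a - 3 * b - c
```

Floating-point rank checks rely on a tolerance, so for nearly dependent vectors the answer can depend on scaling.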
Basis & dimension
A basis is the sweet spot: a set that is independent and spans
the space — a minimal, non-redundant coordinate system. The number of
vectors in a basis is the dimension of the space. ℝ² needs exactly
two; any third vector is necessarily dependent.
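Because a basis is a coordinate system, every vector in the space has exactly one set of coordinates in it. The following sketch, assuming NumPy and an arbitrarily chosen basis of ℝ², finds those coordinates by solving a small linear system, then confirms that a third vector cannot raise the dimension.

```python
import numpy as np

# Two independent vectors in R^2, stored as the columns of B (an assumed example basis).
b1 = np.array([1.0, 1.0])
b2 = np.array([1.0, -1.0])
B = np.column_stack([b1, b2])

x = np.array([3.0, 1.0])

# Coordinates of x in the basis {b1, b2}: solve B @ coords = x.
coords = np.linalg.solve(B, x)
rebuilt = coords[0] * b1 + coords[1] * b2

# Adding any third vector cannot push the rank past the dimension of R^2.
b3 = np.array([5.0, -2.0])
rank_with_third = np.linalg.matrix_rank(np.column_stack([b1, b2, b3]))
print(coords, rank_with_third)
```

Here `x` equals 2·b1 + 1·b2, so the coordinates come out as 2 and 1, and the rank stays at 2 with the third vector included.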
Rank: the number that ties it all together
The rank of a matrix is the number of linearly independent columns — equivalently, the dimension of the space its columns span. And it shows up everywhere you’ve already been:
- It’s the number of pivots in the RREF (last lesson).
- It’s the number of non-zero singular values in the SVD (next idea).
- A square matrix is invertible iff it has full rank.
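The singular-value view of rank can be checked directly. This is a sketch assuming NumPy; the matrices are made-up examples, and the tolerance mirrors the default rule documented for `np.linalg.matrix_rank` (largest singular value times the larger dimension times machine epsilon).

```python
import numpy as np

# Third row equals 2*row2 - row1, so only two directions are independent.
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])

# Rank = number of singular values that are non-zero up to rounding error.
s = np.linalg.svd(A, compute_uv=False)
tol = s.max() * max(A.shape) * np.finfo(A.dtype).eps
rank_from_svd = int((s > tol).sum())
rank_numpy = int(np.linalg.matrix_rank(A))

# A square matrix with full rank is invertible.
M = np.array([[2.0, 1.0],
              [1.0, 1.0]])
full_rank = np.linalg.matrix_rank(M) == M.shape[0]
M_inv = np.linalg.inv(M)
print(rank_from_svd, rank_numpy, full_rank)
```

In floating point, "non-zero" has to mean "above a tolerance", which is why the count uses `tol` instead of comparing with exactly zero.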
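Suppose the feature matrix X holds the dollars column, the cents column,
and one other column independent of them. That is three columns, but
rank 2: the “cents” column is dependent, so it adds only a direction the
data already had. XᵀX becomes singular and the normal equations
XᵀXβ = Xᵀy no longer have a unique solution.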
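The dollars-and-cents situation can be reproduced in a few lines. This is a sketch with made-up numbers; the `size` column and all values are assumptions for illustration, not data from the lesson.

```python
import numpy as np

size = np.array([1.0, 2.0, 3.0, 4.0])         # hypothetical extra feature
dollars = np.array([10.0, 20.0, 35.0, 50.0])  # hypothetical prices
cents = 100 * dollars                         # exact copy, rescaled

X = np.column_stack([size, dollars, cents])
rank_X = int(np.linalg.matrix_rank(X))        # fewer than the 3 columns

# A non-zero coefficient vector that X maps to zero proves the dependence:
# 0*size + 100*dollars - 1*cents = 0.
v = np.array([0.0, 100.0, -1.0])
null_check = X @ v

# The same vector is sent to zero by X^T X, so X^T X is singular and the
# normal equations cannot pin down a single coefficient vector.
gram = X.T @ X
gram_null_check = gram @ v
print(rank_X, null_check, gram_null_check)
```

Any solution β of the normal equations stays a solution after adding a multiple of `v`, which is the concrete meaning of "no unique solution".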
Where this lives in ML
- Multicollinearity. Highly correlated features make the feature matrix nearly rank-deficient. Coefficients become unstable and hard to interpret, because the model can’t decide how to split credit between the near-copies.
- The dummy-variable trap. One-hot encode a category into k columns and include an intercept, and the k columns sum to the all-ones column, which is exactly the intercept column: instant linear dependence. That’s why you drop one level.
- Effective dimensionality. Your data may live in 300 columns but have rank 12 — only 12 directions of real variation. That’s the gap PCA and SVD exploit.
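Two of the cases above are easy to see numerically. The sketch below assumes NumPy and uses synthetic data: the category codes, matrix sizes, and random seed are all illustrative choices.

```python
import numpy as np

# --- Dummy-variable trap: intercept + all k one-hot columns ---
levels = np.array([0, 1, 2, 1, 0, 2])     # hypothetical category codes, k = 3
one_hot = np.eye(3)[levels]               # 6 rows, one column per level
intercept = np.ones((6, 1))

X_trap = np.hstack([intercept, one_hot])         # 4 columns, one is redundant
X_ok = np.hstack([intercept, one_hot[:, 1:]])    # drop the first level

rank_trap = int(np.linalg.matrix_rank(X_trap))
rank_ok = int(np.linalg.matrix_rank(X_ok))

# --- Effective dimensionality: many columns, few independent directions ---
rng = np.random.default_rng(0)
latent = rng.standard_normal((500, 12))   # 12 underlying directions
mixing = rng.standard_normal((12, 300))   # spread across 300 columns
data = latent @ mixing                    # 500 x 300, built from 12 directions

rank_data = int(np.linalg.matrix_rank(data))
print(rank_trap, rank_ok, rank_data)
```

Dropping one level leaves the rank unchanged while removing the redundant column, so `X_ok` has full column rank. Real data is rarely exactly low-rank; with noise, the small singular values are merely tiny, not zero, and choosing a cutoff becomes a judgment call.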
Practice this in an interview
OLS linear regression rests on five assumptions: linearity, independence of errors, homoscedasticity, normality of residuals, and no perfect multicollinearity. Violating any one of them degrades coefficient estimates, standard errors, or the validity of hypothesis tests.
Distance-based and gradient-based models (KNN, K-means, SVM, PCA, linear/logistic regression with regularization, neural networks) need scaling because they're sensitive to feature magnitudes. Tree-based models (decision trees, random forests, gradient boosting) are scale-invariant because they split on thresholds per feature. Standardization and min-max scaling are the usual choices, fit on training data only.
PCA assumes linear relationships, that variance equals importance, and that components should be orthogonal. It can hurt when the predictive signal lives in low-variance directions, when relationships are nonlinear, or when interpretability matters, since components mix original features. It's also sensitive to scaling and outliers and is unsupervised, so it ignores the target.
Standardization rescales features to zero mean and unit variance; normalization squashes values into a fixed range, usually [0, 1]. Distance-based and gradient-based models are sensitive to scale and require one of these; tree-based models split on rank order and are scale-invariant.