Variance, covariance & the covariance matrix
Variance measures how one feature spreads; covariance measures how two move together; correlation puts it on a clean −1 to 1 scale. Pack them into the covariance matrix and you have the object PCA, Gaussians, and Mahalanobis distance all run on.
What you'll learn
- Variance and standard deviation — the spread of a single feature
- Covariance — the sign and size of how two features move together
- Correlation — covariance normalized to a unit-free −1 to 1 scale
- The covariance matrix — variances on the diagonal, covariances off it
- Two traps: correlation isn't causation, and covariance only sees linear relationships
Before you start
You already know expected value — the center of a distribution. The next question is how data spreads around that center, and how features move together. Those two questions — spread and co-movement — are the whole content of the covariance matrix, and they power half of classical ML.
Variance: spread of one feature
Variance is the average squared distance from the mean:
Var(X) = E[(X − μ)²]
Squaring makes every term non-negative, so deviations above and below the
mean cannot cancel, and it penalizes far-out points heavily. Its square
root, the standard deviation σ, is back in the original units and is
what you usually report.
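As a minimal sketch of the definition above, the snippet below computes variance and standard deviation by hand with NumPy and compares them to the built-ins. The `heights_cm` values are made up for illustration, and the `ddof=1` line shows the sample estimate (divide by n − 1), a convention the text above does not cover.

```python
import numpy as np

# Hypothetical sample: heights in centimetres (made-up values for illustration).
heights_cm = np.array([160.0, 165.0, 170.0, 175.0, 180.0])

mu = heights_cm.mean()
variance = np.mean((heights_cm - mu) ** 2)   # average squared deviation, in cm^2
std_dev = np.sqrt(variance)                  # back in cm

# np.var and np.std divide by n by default (ddof=0), matching the formula above.
# Pass ddof=1 for the sample estimate, which divides by n - 1 instead.
sample_variance = heights_cm.var(ddof=1)

print(variance, std_dev, sample_variance)
```

Note that the variance is in squared centimetres, which is why the standard deviation is usually the number to report.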
Covariance: do two features move together?
Covariance asks whether two features deviate from their means in sync:
Cov(X, Y) = E[(X − μₓ)(Y − μᵧ)]
When X and Y are both above their means, or both below, the product is
positive; when they sit on opposite sides, it is negative. Consistently
positive → they rise together. Consistently negative → one rises as the
other falls. Near zero → no linear relationship.
The catch: covariance’s units are units of X × units of Y — meaningless to
compare across pairs. So we normalize.
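To make the units problem concrete, here is a small sketch that computes covariance straight from the formula and then re-expresses one feature in different units. The `height_cm` and `weight_kg` values and the `covariance` helper are hypothetical, introduced only for this example.

```python
import numpy as np

# Hypothetical paired measurements (made-up values).
height_cm = np.array([160.0, 165.0, 170.0, 175.0, 180.0])
weight_kg = np.array([55.0, 60.0, 66.0, 70.0, 79.0])

def covariance(x, y):
    """Population covariance: mean of the products of deviations from the means."""
    return np.mean((x - x.mean()) * (y - y.mean()))

cov_cm_kg = covariance(height_cm, weight_kg)          # units: cm * kg
cov_m_kg = covariance(height_cm / 100.0, weight_kg)   # units: m * kg

print(cov_cm_kg, cov_m_kg)
```

The relationship between the two features is identical in both calls, yet the second number is 100 times smaller purely because height was converted to metres.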
Correlation: covariance, made comparable
corr(X, Y) = Cov(X, Y) / (σₓ · σᵧ) ∈ [−1, 1]
+1 is a perfect upward line, −1 a perfect downward line, 0 no linear
relationship. Now you can compare relationships across totally different
scales.
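The sketch below illustrates two properties of correlation: rescaling a feature leaves it unchanged, and a value of zero only rules out a linear relationship. The data and the `correlation` helper are hypothetical; `np.corrcoef` is NumPy's built-in equivalent.

```python
import numpy as np

def correlation(x, y):
    """Pearson correlation: covariance divided by both standard deviations."""
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / (x.std() * y.std())

# Hypothetical paired measurements (made-up values).
height_cm = np.array([160.0, 165.0, 170.0, 175.0, 180.0])
weight_kg = np.array([55.0, 60.0, 66.0, 70.0, 79.0])

r_cm = correlation(height_cm, weight_kg)
r_m = correlation(height_cm / 100.0, weight_kg)   # rescaling X leaves r unchanged

# y is fully determined by x, but the relationship is a parabola, not a line,
# so the positive and negative products of deviations cancel out.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2
r_parabola = correlation(x, y)

print(r_cm, r_m, r_parabola)
```

The parabola case is the second trap in practice: the correlation is zero even though y depends on x completely.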
The covariance matrix
For d features, collect every pairwise covariance into a d × d matrix
Σ: variances on the diagonal, covariances off it. It’s symmetric
(Cov(X,Y) = Cov(Y,X)) and positive semi-definite. Geometrically, Σ
describes an ellipse around the data: its eigenvectors are the axes, and
its eigenvalues are the variances along them.
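A minimal sketch of building Σ and checking the properties just described, assuming data laid out with one row per sample and one column per feature. The synthetic data and variable names are illustrative only; note that `np.cov` uses the sample convention (divide by n − 1) by default.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 500 samples of 3 features, with feature 1 built to track feature 0.
X = rng.normal(size=(500, 3))
X[:, 1] = 0.8 * X[:, 0] + 0.2 * X[:, 1]

# rowvar=False tells np.cov that rows are samples and columns are features.
Sigma = np.cov(X, rowvar=False)            # shape (3, 3)

# eigh is the eigensolver for symmetric matrices; eigenvalues come back ascending.
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)

is_symmetric = np.allclose(Sigma, Sigma.T)
is_psd = bool(np.all(eigenvalues >= -1e-12))   # small tolerance for round-off

print(Sigma)
print(is_symmetric, is_psd)
```

The diagonal of `Sigma` holds each feature's variance, and the large entry linking features 0 and 1 reflects how the data was constructed.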
Where this lives in ML
- PCA eigendecomposes Σ: the directions of maximum spread are its top eigenvectors (you’ve seen this twice now).
- The multivariate Gaussian is defined by a mean vector and Σ.
- Mahalanobis distance uses Σ⁻¹ to measure distance “in standard deviations,” accounting for correlated, differently-scaled features.
- Multicollinearity is just high off-diagonal correlation: the rank problem from a statistical angle.
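As one illustration of the list above, here is a sketch of Mahalanobis distance under an assumed, hand-picked Σ. The `mahalanobis` helper and the example points are hypothetical. It solves a linear system rather than forming Σ⁻¹ explicitly, which is a common numerical choice but not the only one.

```python
import numpy as np

def mahalanobis(x, mean, Sigma):
    """Distance from x to mean, scaled by the spread that Sigma describes."""
    diff = x - mean
    # Solve Sigma @ z = diff instead of computing the inverse of Sigma.
    z = np.linalg.solve(Sigma, diff)
    return np.sqrt(diff @ z)

mean = np.array([0.0, 0.0])
# Assumed covariance: standard deviation 2 along feature 0, 1 along feature 1.
Sigma = np.array([[4.0, 0.0],
                  [0.0, 1.0]])

d_a = mahalanobis(np.array([2.0, 0.0]), mean, Sigma)   # one std dev along feature 0
d_b = mahalanobis(np.array([0.0, 2.0]), mean, Sigma)   # two std devs along feature 1

print(d_a, d_b)
```

Both points are the same Euclidean distance from the mean, but the second is twice as far in Mahalanobis terms because feature 1 has less spread.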
Quick check
Practice this in an interview
Covariance measures the direction of the linear relationship between two variables and is expressed in the product of their units, making it scale-dependent and hard to interpret across different variable pairs. Correlation normalises covariance by both standard deviations to produce a dimensionless measure bounded between -1 and 1, enabling comparison across pairs.
Variance is the average squared deviation from the mean; standard deviation is its square root and lives in the same units as the data. Variance is mathematically tractable — variances of independent variables add — while standard deviation is interpretable as a typical distance from the mean.
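Expected value is the probability-weighted average outcome of a random variable; variance measures average squared deviation from that mean. Expectation is linear: E[aX + b] = a·E[X] + b, and E[X + Y] = E[X] + E[Y] even when X and Y are dependent. Variance is not: Var(aX + b) = a²·Var(X), and Var(X + Y) = Var(X) + Var(Y) only when X and Y are uncorrelated. Knowing these rules prevents algebraic mistakes under interview pressure.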
PCA finds orthogonal directions (principal components) of maximum variance by computing the eigenvectors of the covariance matrix, then projects data onto the top components. Choose the number of components by the cumulative explained variance ratio (e.g. enough to retain 95%), a scree-plot elbow, or downstream task performance. Standardize features first when they are on different scales, since PCA is variance-driven and would otherwise favor whichever feature has the largest numeric range.
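The PCA recipe in that answer can be sketched in plain NumPy as follows. This is an illustrative outline rather than a production implementation: the synthetic data, the 95% threshold, and all variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 4 features on very different scales.
X = rng.normal(size=(200, 4)) * np.array([10.0, 5.0, 1.0, 0.1])

# Standardize so no feature dominates purely because of its units.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

Sigma = np.cov(Z, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)

# eigh returns ascending order; reorder so the largest-variance direction is first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Cumulative explained variance ratio, and the smallest k that reaches 95%.
explained = np.cumsum(eigenvalues) / eigenvalues.sum()
k = int(np.searchsorted(explained, 0.95) + 1)

projected = Z @ eigenvectors[:, :k]   # shape (200, k)

print(k, explained)
```

The variance of each projected column equals the corresponding eigenvalue, which is what "explained variance" refers to.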