datarekha

Variance, covariance & the covariance matrix

Variance measures how one feature spreads; covariance measures how two move together; correlation puts it on a clean −1 to 1 scale. Pack them into the covariance matrix and you have the object PCA, Gaussians, and Mahalanobis distance all run on.

8 min read Intermediate Math for ML Lesson 21 of 30

What you'll learn

  • Variance and standard deviation — the spread of a single feature
  • Covariance — the sign and size of how two features move together
  • Correlation — covariance normalized to a unit-free −1 to 1 scale
  • The covariance matrix — variances on the diagonal, covariances off it
  • Two traps: correlation isn't causation, and covariance only sees linear relationships

Before you start

You already know expected value — the center of a distribution. The next question is how data spreads around that center, and how features move together. Those two questions — spread and co-movement — are the whole content of the covariance matrix, and they power half of classical ML.

Variance: spread of one feature

Variance is the average squared distance from the mean:

Var(X) = E[(X − μ)²]

Squaring keeps everything positive and punishes far-out points. Its square root, the standard deviation σ, is back in the original units and is what you usually report.

Covariance: do two features move together?

Covariance asks whether two features deviate from their means in sync:

Cov(X, Y) = E[(X − μₓ)(Y − μᵧ)]

When X is above its mean and Y is above its mean, the product is positive. Consistently positive → they rise together. Consistently negative → one rises as the other falls. Near zero → no linear relationship.

The catch: covariance’s units are units of X × units of Y — meaningless to compare across pairs. So we normalize.

Correlation: covariance, made comparable

corr(X, Y) = Cov(X, Y) / (σₓ · σᵧ)     ∈ [−1, 1]

+1 is a perfect upward line, −1 a perfect downward line, 0 no linear relationship. Now you can compare relationships across totally different scales.

The covariance matrix

For d features, collect every pairwise covariance into a d × d matrix Σ: variances on the diagonal, covariances off it. It’s symmetric (Cov(X,Y) = Cov(Y,X)) and positive semi-definite. The ellipse you just rotated is Σ — its eigenvectors are the axes, its eigenvalues the spreads along them.

Where this lives in ML

  • PCA eigendecomposes Σ — the directions of maximum spread are its top eigenvectors (you’ve seen this twice now).
  • The multivariate Gaussian is defined by a mean vector and Σ.
  • Mahalanobis distance uses Σ⁻¹ to measure distance “in standard deviations,” accounting for correlated, differently-scaled features.
  • Multicollinearity is just high off-diagonal correlation — the rank problem from a statistical angle.

Quick check

Quick check

0/3
Q1Cov(X, Y) = 0. Are X and Y independent?
Q2What sits on the diagonal of a covariance matrix Σ?
Q3Why divide covariance by σₓσᵧ to get correlation?

Sign in to track your progress

Completed lessons, your XP, level, and streak save to your account — it's free and takes a few seconds.

Practice this in an interview

All questions
What is the difference between covariance and correlation, and when does each matter?

Covariance measures the direction of the linear relationship between two variables and is expressed in the product of their units, making it scale-dependent and hard to interpret across different variable pairs. Correlation normalises covariance by both standard deviations to produce a dimensionless measure bounded between -1 and 1, enabling comparison across pairs.

What is the difference between variance and standard deviation, and why do we need both?

Variance is the average squared deviation from the mean; standard deviation is its square root and lives in the same units as the data. Variance is mathematically tractable — variances of independent variables add — while standard deviation is interpretable as a typical distance from the mean.

Define expected value and variance. What are their key properties?

Expected value is the probability-weighted average outcome of a random variable; variance measures average squared deviation from that mean. Both are linear/additive in specific ways — knowing these rules prevents algebraic mistakes under interview pressure.

How does PCA work, and how do you choose the number of components?

PCA finds orthogonal directions (principal components) of maximum variance by computing the eigenvectors of the covariance matrix, then projects data onto the top components. Choose the number of components by the cumulative explained variance ratio (e.g. enough to retain 95%), a scree-plot elbow, or downstream task performance. Always standardize features first, since PCA is variance-driven.

Related lessons

Explore further

Skip to content