What is the difference between covariance and correlation, and when does each matter?

Covariance measures the direction of the linear relationship between two variables and is expressed in the product of their units, making it scale-dependent and hard to interpret across different variable pairs. Correlation normalises covariance by both standard deviations to produce a dimensionless measure bounded between -1 and 1, enabling comparison across pairs.

What is the difference between variance and standard deviation, and why do we need both?

Variance is the average squared deviation from the mean; standard deviation is its square root and lives in the same units as the data. Variance is mathematically tractable — variances of independent variables add — while standard deviation is interpretable as a typical distance from the mean.

Define expected value and variance. What are their key properties?

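Expected value is the probability-weighted average outcome of a random variable; variance measures average squared deviation from that mean. Expectation is linear: E[aX + b] = aE[X] + b and E[X + Y] = E[X] + E[Y], with or without independence. Variance ignores shifts and scales quadratically, Var(aX + b) = a²Var(X), and is additive only when the variables are uncorrelated (in particular, when they are independent). Knowing these rules prevents algebraic mistakes under interview pressure.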
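The standard rules for expectation and variance can be checked numerically. The following is a minimal sketch on simulated samples; the distributions, the constants a and b, and the sample size are illustrative assumptions, not part of the original answer.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(2.0, 3.0, n)     # illustrative X
y = rng.exponential(1.5, n)     # illustrative Y, drawn independently of X
a, b = 4.0, -7.0                # arbitrary constants

# Linearity of expectation: E[aX + b] = a E[X] + b
lin_lhs = np.mean(a * x + b)
lin_rhs = a * np.mean(x) + b

# Variance ignores the shift b and scales with a squared: Var(aX + b) = a^2 Var(X)
scale_lhs = np.var(a * x + b)
scale_rhs = a**2 * np.var(x)

# Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y).
# For independent draws the covariance term is close to zero, so variances add.
cov_xy = np.cov(x, y, ddof=0)[0, 1]
sum_lhs = np.var(x + y)
sum_rhs = np.var(x) + np.var(y) + 2 * cov_xy

print(lin_lhs, lin_rhs)
print(scale_lhs, scale_rhs)
print(sum_lhs, sum_rhs, cov_xy)
```

The first two identities hold exactly on any sample; the "variances add" shortcut is only as good as the covariance term is small.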

How does PCA work, and how do you choose the number of components?

PCA finds orthogonal directions (principal components) of maximum variance by computing the eigenvectors of the covariance matrix, then projects data onto the top components. Choose the number of components by the cumulative explained variance ratio (e.g. enough to retain 95%), a scree-plot elbow, or downstream task performance. Standardize features first when they are measured on different scales: PCA is variance-driven, so an unscaled large-range feature would otherwise dominate the leading components.
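The recipe above can be sketched directly with NumPy. This is a minimal illustration, not a production PCA: the synthetic data (two hidden factors mixed into five features) and the 95% threshold are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 300 samples, 5 features driven by 2 latent factors plus noise
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.1 * rng.normal(size=(300, 5))

# Standardize each feature to zero mean and unit variance
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Covariance of the standardized data (rows are samples, hence rowvar=False)
S = np.cov(Z, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()
cumulative = np.cumsum(explained)
k = int(np.searchsorted(cumulative, 0.95) + 1)   # smallest k that reaches 95%

scores = Z @ eigvecs[:, :k]   # data projected onto the top-k components
print("explained variance ratio:", explained.round(3))
print("components kept:", k)
```

Because the projected columns are uncorrelated, the covariance matrix of `scores` is diagonal, with the retained eigenvalues on the diagonal.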

Variance, covariance & the covariance matrix — Math for ML


Variance measures how one feature spreads; covariance measures how two move together; correlation puts it on a clean −1 to 1 scale. Pack them into the covariance matrix and you have the object PCA, Gaussians, and Mahalanobis distance all run on.

8 min read Intermediate Math for ML Lesson 25 of 37

What you'll learn

Variance and standard deviation — the spread of a single feature

Covariance — the sign and size of how two features move together

Correlation — covariance normalized to a unit-free −1 to 1 scale

The covariance matrix — variances on the diagonal, covariances off it

Two traps: correlation isn't causation, and covariance only sees linear relationships

The last lesson left a general question hanging: a Markov chain linked the state now to the state next, so how do any two random quantities relate? Here is the answer, and it has a number. You already know expected value — the centre of a distribution; the next two questions are how data spreads around that centre, and how two features move together. Spread and co-movement: that is the whole content of the covariance matrix, and it powers half of classical ML.

Variance: spread of one feature

Variance is the average squared distance from the mean:

Var(X) = E[(X − μ)²]

Squaring makes every term non-negative, so deviations above and below the mean cannot cancel, and it penalizes far-out points heavily. Its square root, the standard deviation σ, is back in the original units and is what you usually report.
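A small worked example makes the definition concrete. The eight values below are illustrative, picked so the arithmetic comes out in round numbers.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # illustrative values

mu = x.mean()                      # 5.0
var_pop = np.mean((x - mu) ** 2)   # squared deviations sum to 32, so 32 / 8 = 4.0
std_pop = np.sqrt(var_pop)         # 2.0, in the same units as x

# np.var and np.std default to this population formula (ddof=0).
# ddof=1 divides by n - 1 instead, the convention np.cov uses by default.
var_sample = x.var(ddof=1)         # 32 / 7

print(mu, var_pop, std_pop, var_sample)
```

The two divisors matter mostly for small samples; for the 500-sample arrays used later in this lesson the difference is negligible.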

Covariance: do two features move together?

Covariance asks whether two features deviate from their means in sync:

Cov(X, Y) = E[(X − μₓ)(Y − μᵧ)]

When X and Y are both above their means, or both below, the product is positive; when one is above and the other below, it is negative. Consistently positive products → they rise together. Consistently negative → one rises as the other falls. Products that cancel out to near zero → no linear relationship.

The catch: covariance’s units are units of X × units of Y — meaningless to compare across pairs. So we normalize.
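The unit problem is easy to see in code. The sketch below computes covariance by hand on simulated height and weight data (an illustrative assumption) and then re-expresses height in metres.

```python
import numpy as np

rng = np.random.default_rng(1)
height_cm = rng.normal(170, 8, 1000)                        # illustrative data
weight_kg = 0.9 * height_cm - 100 + rng.normal(0, 5, 1000)

def covariance(a, b):
    """Sample covariance: summed product of deviations over n - 1."""
    return np.sum((a - a.mean()) * (b - b.mean())) / (len(a) - 1)

cov_cm = covariance(height_cm, weight_kg)        # units: cm * kg
cov_m = covariance(height_cm / 100, weight_kg)   # units: m * kg

# Same data, same relationship, but the number shrinks by a factor of 100
print(cov_cm, cov_m)
```

Nothing about the relationship changed between the two calls; only the unit of one variable did.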

Correlation: covariance, made comparable

corr(X, Y) = Cov(X, Y) / (σₓ · σᵧ)     ∈ [−1, 1]

+1 is a perfect upward line, −1 a perfect downward line, 0 no linear relationship. Now you can compare relationships across totally different scales.
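Dividing by both standard deviations is what removes the units. A minimal sketch, again on simulated height and weight data (an assumption for illustration), shows the correlation surviving a change of units untouched.

```python
import numpy as np

rng = np.random.default_rng(1)
height_cm = rng.normal(170, 8, 1000)                        # illustrative data
weight_kg = 0.9 * height_cm - 100 + rng.normal(0, 5, 1000)

def correlation(a, b):
    """Pearson correlation: covariance divided by both standard deviations."""
    cov = np.cov(a, b)[0, 1]                      # np.cov uses n - 1 by default
    return cov / (a.std(ddof=1) * b.std(ddof=1))  # match the n - 1 convention

r_cm_kg = correlation(height_cm, weight_kg)
r_m_g = correlation(height_cm / 100, weight_kg * 1000)   # metres and grams

# An exact decreasing line gives correlation -1 (up to round-off)
r_line = correlation(height_cm, -2 * height_cm + 3)

print(r_cm_kg, r_m_g, r_line)
```

The hand-rolled function agrees with `np.corrcoef`, which is what you would normally call.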

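[Interactive figure: covariance ellipse. Slider settings ρ = 0.70, σx = 2.40, σy = 1.40; displayed covariance matrix Σ = [[7.01, 2.68], [2.68, 1.97]]; displayed correlation = 0.72.]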

Diagonal = each feature's variance. Off-diagonal = how they move together. Normalize it by the spreads and you get correlation, always between −1 and 1.

The covariance matrix

For d features, collect every pairwise covariance into a d × d matrix Σ: variances on the diagonal, covariances off it. It’s symmetric (Cov(X,Y) = Cov(Y,X)) and positive semi-definite. The ellipse you just rotated is Σ: its eigenvectors are the axes, and its eigenvalues are the variances along them (their square roots are the spreads).

import numpy as np
rng = np.random.default_rng(0)

height = rng.normal(170, 8, 500)
weight = 0.9 * height - 100 + rng.normal(0, 5, 500)   # correlated with height
noise  = rng.normal(0, 10, 500)                        # unrelated

X = np.vstack([height, weight, noise])   # 3 features x 500 samples
print("covariance matrix:\n", np.cov(X).round(1))
print("\ncorrelation matrix:\n", np.corrcoef(X).round(2))
# height-weight correlation is high; everything with 'noise' is ~0

covariance matrix:
 [[ 65.9  59.5   2.1]
 [ 59.5  75.8   0.7]
 [  2.1   0.7 102.2]]

correlation matrix:
 [[1.   0.84 0.03]
 [0.84 1.   0.01]
 [0.03 0.01 1.  ]]

The correlation matrix reads the relationships off at a glance: height and weight sit at 0.84 (tightly linked, as built), while the pairs involving noise sit at 0.03 and 0.01, indicating essentially no linear relationship. Note that the raw covariance between height and weight (59.5) is hard to interpret on its own; only after dividing by the two standard deviations (√65.9 ≈ 8.1 and √75.8 ≈ 8.7) does it become the clean, comparable 0.84.
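The structural claims about Σ (symmetric, positive semi-definite, eigenvectors as ellipse axes) can be checked on the same kind of data. The sketch below uses only the height and weight features; the tolerance of 1e-10 is an arbitrary allowance for floating-point round-off.

```python
import numpy as np

rng = np.random.default_rng(0)
height = rng.normal(170, 8, 500)
weight = 0.9 * height - 100 + rng.normal(0, 5, 500)
X = np.vstack([height, weight])   # 2 features x 500 samples

sigma = np.cov(X)

# Symmetric: Cov(X, Y) equals Cov(Y, X)
is_symmetric = np.allclose(sigma, sigma.T)

# eigh handles symmetric matrices; eigenvalues come back in ascending order
eigvals, eigvecs = np.linalg.eigh(sigma)
is_psd = bool(np.all(eigvals >= -1e-10))   # no meaningfully negative eigenvalue

# Each eigenvalue is the variance along its eigenvector; the root is the spread
axis_spreads = np.sqrt(eigvals)

# Check: project centred data onto the major axis and measure its variance
centred = X - X.mean(axis=1, keepdims=True)
var_along_major = np.var(eigvecs[:, -1] @ centred, ddof=1)

print(is_symmetric, is_psd)
print(eigvals[-1], var_along_major)
```

The variance of the projected data matches the largest eigenvalue, which is the sense in which the eigenvectors are the axes of the point cloud.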

Where this lives in ML

PCA eigendecomposes Σ — the directions of maximum spread are its top eigenvectors (you’ve seen this twice now).
The multivariate Gaussian is defined by a mean vector and Σ.
Mahalanobis distance uses Σ⁻¹ to measure distance “in standard deviations,” accounting for correlated, differently-scaled features.
Multicollinearity is just high off-diagonal correlation — the rank problem from a statistical angle.
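Of the uses listed above, Mahalanobis distance is the quickest to sketch. The example below is an illustration under stated assumptions: the covariance matrix, the sample size, and the two test points are made up, and `mahalanobis` is a hypothetical helper written for this sketch rather than a library function.

```python
import numpy as np

rng = np.random.default_rng(0)
mean = np.array([0.0, 0.0])
sigma = np.array([[4.0, 1.8],
                  [1.8, 1.0]])   # illustrative covariance, correlation 0.9
data = rng.multivariate_normal(mean, sigma, size=2000)

mu_hat = data.mean(axis=0)
sigma_hat = np.cov(data, rowvar=False)   # rows are samples here

def mahalanobis(x, mu, cov):
    """sqrt((x - mu)^T cov^-1 (x - mu)), using solve instead of an explicit inverse."""
    diff = x - mu
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

along = np.array([2.0, 0.9])     # roughly along the cloud's long axis
across = np.array([0.9, -2.0])   # same Euclidean length, perpendicular to it

d_along = mahalanobis(along, mu_hat, sigma_hat)
d_across = mahalanobis(across, mu_hat, sigma_hat)
print(d_along, d_across)
```

Both points are equally far from the origin in Euclidean terms, but the one that cuts against the correlation is many more standard deviations away. With the identity matrix as `cov`, the function reduces to ordinary Euclidean distance.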

In one breath

Variance E[(X−μ)²] measures how one feature spreads (its root, the standard deviation σ, is back in original units). Covariance E[(X−μₓ)(Y−μᵧ)] measures whether two features deviate from their means in sync: positive means they rise together, negative means one rises as the other falls, near-zero means no linear link. Because covariance carries the product of both units, you normalise it to correlation Cov/(σₓσᵧ) ∈ [−1, 1], a unit-free strength of linear association. Stack all pairwise covariances into the symmetric, positive-semidefinite covariance matrix Σ (variances on the diagonal, covariances off it), the same Σ that PCA eigendecomposes, that defines the multivariate Gaussian, and that Mahalanobis distance inverts. Two traps: correlation isn’t causation, and zero correlation rules out only linear dependence (a perfect parabola y = x², with x distributed symmetrically about 0, has correlation 0).
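The parabola trap takes a few lines to reproduce. In this sketch the grid of x values is an illustrative choice; what matters is that it is symmetric about 0.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)   # symmetric about 0, which is what makes r vanish
y = x ** 2                        # y is completely determined by x

# Over the full range the rising and falling halves cancel: correlation ~ 0
r_full = np.corrcoef(x, y)[0, 1]

# Restrict to x >= 0 and the same curve looks strongly, positively correlated
right = x >= 0
r_half = np.corrcoef(x[right], y[right])[0, 1]

print(r_full, r_half)
```

So a correlation near zero says nothing about dependence in general, only that no straight line captures it over the range sampled.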

Practice

Quick check


Q1. Cov(X, Y) = 0. Are X and Y independent?

Q2. What sits on the diagonal of a covariance matrix Σ?

Q3. Why divide covariance by σₓσᵧ to get correlation?

A question to carry forward

Look at what the covariance matrix handed us: an ellipse. Σ said the cloud of points is wide here, narrow there, tilted just so, with its eigenvectors as the axes and its eigenvalues as the variances along them. But Σ is only a description of the shape; it is not yet a distribution. It cannot tell you how likely any particular point is, only how the cloud is stretched.

So here is the thread onward: what is the full probability distribution that wears this shape — centred at a mean vector μ, stretched and tilted by exactly this Σ? It is the multivariate Gaussian, the bell curve lifted into many dimensions and the one rich distribution we can actually compute with. What is its density, why does that same elliptical “Mahalanobis distance” sit in the exponent, and what closure magic — marginals, conditionals, and linear maps all staying Gaussian — makes it the backbone of Gaussian processes, Kalman filters, VAEs, and anomaly detection?
