datarekha

The multivariate Gaussian

The bell curve, generalized to many dimensions, and arguably the most widely used distribution in ML. A mean vector and a covariance matrix describe it completely, and that single object underlies GMMs, Gaussian processes, VAEs, Kalman filters, and anomaly detection.

8 min read · Advanced Math for ML · Lesson 22 of 30

What you'll learn

  • The density formula and its heart: the squared Mahalanobis distance (x−μ)ᵀΣ⁻¹(x−μ)
  • How the covariance matrix Σ controls the shape — round, axis-aligned, or tilted
  • Why the Gaussian is special: marginals, conditionals, and linear maps all stay Gaussian
  • Mahalanobis distance as "distance in standard deviations" for correlated data
  • Where it shows up — GMMs, Gaussian processes, VAEs, Kalman filters, anomaly detection

Before you start

The 1D normal distribution is everywhere because of the Central Limit Theorem. Its multi-dimensional cousin, the multivariate Gaussian, is even more central to ML, because it is one of the few rich distributions we can compute with in closed form. And it is described by just two things: a mean vector μ (where the blob sits) and a covariance matrix Σ (its shape and tilt).

The density

p(x) = (1 / Z) · exp( −½ (x − μ)ᵀ Σ⁻¹ (x − μ) )

Set the normalizer aside: Z = (2π)^(d/2) |Σ|^(1/2), for x in d dimensions, only makes the density integrate to 1. The action is in the exponent: the quadratic form (x − μ)ᵀ Σ⁻¹ (x − μ) is the squared Mahalanobis distance, which measures how far x is from the mean in the distribution's own stretched, tilted units. Points at equal Mahalanobis distance form an ellipse (an ellipsoid in more than two dimensions), and those ellipses are the contours of the bell.
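The density can be evaluated directly from the formula. The sketch below assumes NumPy and the standard normalizer Z = (2π)^(d/2) |Σ|^(1/2); the function name `gaussian_logpdf` and the example numbers are illustrative and not part of the lesson. It works in log space, which is the usual way to avoid underflow.

```python
import numpy as np

def gaussian_logpdf(x, mu, Sigma):
    """Log-density of N(mu, Sigma) at one point x (1-D arrays of length d)."""
    d = mu.shape[0]
    diff = x - mu
    # Solve Sigma @ z = diff rather than forming the inverse explicitly.
    z = np.linalg.solve(Sigma, diff)
    maha_sq = diff @ z  # squared Mahalanobis distance
    # log Z = (d/2) * log(2*pi) + (1/2) * log det(Sigma)
    _, logdet = np.linalg.slogdet(Sigma)
    log_Z = 0.5 * (d * np.log(2.0 * np.pi) + logdet)
    return -0.5 * maha_sq - log_Z

# Illustrative parameters (assumed for this example).
mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

log_p_at_mean = gaussian_logpdf(mu, mu, Sigma)
log_p_away = gaussian_logpdf(np.array([1.0, -1.0]), mu, Sigma)
```

At the mean the Mahalanobis term is zero, so the log-density there equals −log Z; every other point scores lower.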

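Σ does all the shaping. Σ = σ²I gives a perfectly round bell (isotropic). A diagonal Σ gives an axis-aligned ellipse. A full Σ, with nonzero off-diagonal entries, tilts the ellipse; those entries are the covariances between dimensions, so the tilt reflects their correlation.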
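One way to read the shape off Σ is its eigendecomposition: the eigenvectors give the ellipse's axes and the square roots of the eigenvalues give the standard deviation along each axis. The helper `ellipse_axes` and the three matrices below are assumptions made for this 2D illustration, sketched with NumPy.

```python
import numpy as np

def ellipse_axes(Sigma):
    """For a 2x2 covariance, return (std devs along the principal axes in
    ascending order, tilt of the major axis in degrees, in [0, 180))."""
    eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigenvalues in ascending order
    major = eigvecs[:, -1]                    # direction of largest variance
    angle = np.degrees(np.arctan2(major[1], major[0])) % 180.0
    return np.sqrt(eigvals), angle

isotropic = 2.0 * np.eye(2)           # round: equal spread in every direction
diagonal = np.diag([4.0, 1.0])        # axis-aligned: wider along the first axis
full = np.array([[2.0, 1.5],
                 [1.5, 2.0]])         # tilted: positive correlation

for name, S in [("isotropic", isotropic), ("diagonal", diagonal), ("full", full)]:
    stds, angle = ellipse_axes(S)
    print(name, stds, angle)
```

For the isotropic case the two standard deviations are equal, so the reported angle is arbitrary: a circle has no preferred axis.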

Why the Gaussian is the distribution we use

It has almost magical closure properties:

  • Marginals are Gaussian. Ignore some dimensions — what’s left is still Gaussian.
  • Conditionals are Gaussian. Fix some variables and the distribution of the rest is Gaussian. This is the engine of Gaussian processes and Kalman filters — “given what I’ve seen, predict the rest.”
  • Linear maps preserve it. If x ~ N(μ, Σ), then Ax + b ~ N(Aμ + b, AΣAᵀ). The family stays closed under the linear (affine) operations that signal processing and the linear layers of neural nets rely on.
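Each of the three closure properties above comes down to a few lines of linear algebra on μ and Σ. The sketch below uses the standard block formulas for Gaussian marginals and conditionals; the function names and example numbers are hypothetical, chosen only for illustration.

```python
import numpy as np

def marginal(mu, Sigma, keep):
    """Marginal over the coordinates in `keep`: select the matching blocks."""
    keep = np.asarray(keep)
    return mu[keep], Sigma[np.ix_(keep, keep)]

def condition(mu, Sigma, obs_idx, obs_val):
    """Distribution of the unobserved coordinates given x[obs_idx] = obs_val."""
    obs_idx = np.asarray(obs_idx)
    rest = np.setdiff1d(np.arange(mu.shape[0]), obs_idx)
    S_rr = Sigma[np.ix_(rest, rest)]
    S_ro = Sigma[np.ix_(rest, obs_idx)]
    S_oo = Sigma[np.ix_(obs_idx, obs_idx)]
    # gain = S_ro @ inv(S_oo), computed with a solve (S_oo is symmetric).
    gain = np.linalg.solve(S_oo, S_ro.T).T
    cond_mu = mu[rest] + gain @ (obs_val - mu[obs_idx])
    cond_Sigma = S_rr - gain @ S_ro.T
    return cond_mu, cond_Sigma

def linear_map(mu, Sigma, A, b):
    """Parameters of A x + b when x ~ N(mu, Sigma)."""
    return A @ mu + b, A @ Sigma @ A.T

# Illustrative 2D Gaussian (assumed values).
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

marg_mu, marg_Sigma = marginal(mu, Sigma, [0])
cond_mu, cond_Sigma = condition(mu, Sigma, [1], np.array([3.0]))
sum_mu, sum_Sigma = linear_map(mu, Sigma, np.array([[1.0, 1.0]]), np.array([0.0]))
```

Observing the second coordinate one unit above its mean shifts the first coordinate's mean from 1.0 to 1.8 and shrinks its variance from 2.0 to 1.36. The conditional covariance does not depend on the observed value, only on which coordinates were observed.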

Few other multivariate distributions are this tractable, which is why we reach for the Gaussian constantly, even as an approximation.

Two points can sit at the same plain (Euclidean) distance from the mean, say 3 units, and still have different Mahalanobis distances, because Σ says the data spreads more easily in one direction than the other.
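A small numeric case makes the contrast concrete. The covariance and the two test points below are assumed for illustration: the standard deviation is 3 along the first axis and 1 along the second.

```python
import numpy as np

def mahalanobis(x, mu, Sigma):
    """Mahalanobis distance of x from mu under covariance Sigma."""
    diff = x - mu
    return float(np.sqrt(diff @ np.linalg.solve(Sigma, diff)))

mu = np.zeros(2)
Sigma = np.diag([9.0, 1.0])      # std dev 3 along axis 0, std dev 1 along axis 1

a = np.array([3.0, 0.0])         # 3 units out along the wide direction
b = np.array([0.0, 3.0])         # 3 units out along the narrow direction

euclid_a, euclid_b = np.linalg.norm(a - mu), np.linalg.norm(b - mu)  # both 3.0
d_a = mahalanobis(a, mu, Sigma)  # 1.0: one standard deviation out
d_b = mahalanobis(b, mu, Sigma)  # 3.0: three standard deviations out
```

Point `a` is unremarkable and point `b` is unusual, even though a ruler cannot tell them apart.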

Where this lives in ML

  • Gaussian Mixture Models model data as several Gaussians — soft clustering and density estimation.
  • Gaussian processes put a Gaussian over functions; prediction is just a Gaussian conditional.
  • VAEs use a Gaussian latent prior and a Gaussian encoder.
  • Anomaly detection flags points with large Mahalanobis distance.
  • Diffusion models literally add and remove multivariate Gaussian noise.
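As one example from the list above, the anomaly-detection use can be sketched in a few lines: fit μ and Σ to reference data, then flag points whose squared Mahalanobis distance exceeds a threshold. The synthetic data, function names, and the 99% cutoff are all assumptions for this sketch. The threshold uses the fact that, for 2D Gaussian data, the squared Mahalanobis distance follows a chi-square distribution with 2 degrees of freedom, whose quantile has the closed form −2 ln(1 − p).

```python
import numpy as np

def fit_gaussian(X):
    """Sample mean and covariance of the rows of X."""
    return X.mean(axis=0), np.cov(X, rowvar=False)

def squared_mahalanobis(X, mu, Sigma):
    """Row-wise squared Mahalanobis distance for X of shape (n, d)."""
    diff = X - mu
    return np.einsum("ij,ij->i", diff, np.linalg.solve(Sigma, diff.T).T)

# Synthetic, positively correlated reference data (assumed for illustration).
rng = np.random.default_rng(0)
true_Sigma = np.array([[2.0, 1.2],
                       [1.2, 1.0]])
X_train = rng.multivariate_normal([0.0, 0.0], true_Sigma, size=2000)
mu_hat, Sigma_hat = fit_gaussian(X_train)

# 99% chi-square quantile for d = 2 (this closed form is valid only for d = 2).
threshold = -2.0 * np.log(1.0 - 0.99)

queries = np.array([[1.0, 0.6],     # lies along the correlation direction
                    [2.0, -2.0]])   # cuts against it
flags = squared_mahalanobis(queries, mu_hat, Sigma_hat) > threshold
train_flag_rate = np.mean(squared_mahalanobis(X_train, mu_hat, Sigma_hat) > threshold)
```

The second query is flagged because it moves against the correlation in the data, even though it is not especially far from the mean in Euclidean terms. In higher dimensions the threshold would come from a chi-square quantile with d degrees of freedom rather than this 2D shortcut.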

Quick check

Q1. What two quantities completely specify a multivariate Gaussian?
Q2. Two points are the same Euclidean distance from the mean but have different Mahalanobis distances. Why?
Q3. Which property makes Gaussians the backbone of Gaussian processes and Kalman filters?

Practice this in an interview

How does a Gaussian Mixture Model differ from k-means, and when would you prefer it?

A GMM models data as a mixture of Gaussian distributions and assigns soft probabilities of cluster membership, fitting clusters that can be elliptical and different sizes via the EM algorithm. K-means does hard assignment to the nearest centroid and implicitly assumes spherical, equal-size clusters. Prefer a GMM when clusters overlap, have different shapes or covariances, or when you need probabilistic (soft) assignments.

Explain the EM algorithm in the context of fitting a Gaussian Mixture Model.

EM fits a GMM by alternating two steps: the E-step computes each point's responsibility (posterior probability) under each Gaussian using current parameters, and the M-step updates the means, covariances, and mixing weights to maximize the expected log-likelihood given those responsibilities. It iterates until the likelihood converges. Because the objective is non-convex, EM only reaches a local optimum, so initialization and multiple restarts matter.
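The two steps can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a production implementation: the names `log_gaussian` and `em_gmm`, the random-data-point initialization, the small `reg` term added to each covariance for numerical stability, and the synthetic two-cluster data are all choices made for this example.

```python
import numpy as np

def log_gaussian(X, mu, Sigma):
    """Row-wise log N(x | mu, Sigma) for X of shape (n, d)."""
    d = X.shape[1]
    diff = X - mu
    maha_sq = np.einsum("ij,ij->i", diff, np.linalg.solve(Sigma, diff.T).T)
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (maha_sq + d * np.log(2.0 * np.pi) + logdet)

def em_gmm(X, k, n_iter=50, seed=0, reg=1e-6):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    means = X[rng.choice(n, size=k, replace=False)]      # init at random data points
    covs = np.stack([np.cov(X, rowvar=False) + reg * np.eye(d)] * k)
    weights = np.full(k, 1.0 / k)
    log_liks = []
    for _ in range(n_iter):
        # E-step: responsibilities = posterior probability of each component.
        log_joint = np.stack(
            [np.log(weights[j]) + log_gaussian(X, means[j], covs[j]) for j in range(k)],
            axis=1,
        )
        m = log_joint.max(axis=1, keepdims=True)          # log-sum-exp for stability
        log_norm = m + np.log(np.exp(log_joint - m).sum(axis=1, keepdims=True))
        resp = np.exp(log_joint - log_norm)
        log_liks.append(float(log_norm.sum()))            # log-likelihood before update
        # M-step: re-estimate weights, means, covariances from responsibilities.
        Nk = resp.sum(axis=0)
        weights = Nk / n
        means = (resp.T @ X) / Nk[:, None]
        for j in range(k):
            diff = X - means[j]
            covs[j] = (resp[:, j, None] * diff).T @ diff / Nk[j] + reg * np.eye(d)
    return weights, means, covs, resp, log_liks

# Synthetic data: two well-separated 2D clusters (assumed for illustration).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([-3.0, 0.0], 1.0, size=(300, 2)),
               rng.normal([3.0, 0.0], 1.0, size=(300, 2))])
weights, means, covs, resp, log_liks = em_gmm(X, k=2)
```

The recorded `log_liks` should never decrease from one iteration to the next, which is a useful sanity check when debugging an EM loop. Because the result depends on `seed`, running several seeds and keeping the fit with the highest final log-likelihood is the restart strategy the answer above refers to.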

What makes the Normal distribution so central in statistics, and when does it fail?

The Normal distribution is justified by the Central Limit Theorem: suitably standardized averages of large i.i.d. samples converge to a Normal whenever the underlying distribution has finite variance, whatever its shape. It is fully characterized by mean and variance, enabling closed-form inference. It fails for heavy-tailed data (where the finite-variance assumption can break), skewed outcomes, bounded quantities, and rare extreme events.

When does each common distribution arise — Bernoulli, Binomial, Poisson, Normal, Exponential, Uniform?

Each distribution has a natural generative story: Bernoulli is a single coin flip; Binomial counts successes in a fixed number of independent Bernoulli trials; Poisson counts arrivals that occur independently at a constant rate over a fixed interval; Normal emerges from sums of many small effects; Exponential models waiting times between Poisson events; Uniform assigns equal probability across a range. Choosing correctly comes from matching that story to the data-generating process.
