The multivariate Gaussian
The bell curve, generalized to many dimensions — and the most important distribution in all of ML. A mean vector and a covariance matrix describe it completely, and that single object underlies GMMs, Gaussian processes, VAEs, Kalman filters, and anomaly detection.
What you'll learn
- The density formula and its heart: the squared Mahalanobis distance (x−μ)ᵀΣ⁻¹(x−μ)
- How the covariance matrix Σ controls the shape — round, axis-aligned, or tilted
- Why the Gaussian is special: marginals, conditionals, and linear maps all stay Gaussian
- Mahalanobis distance as "distance in standard deviations" for correlated data
- Where it shows up — GMMs, Gaussian processes, VAEs, Kalman filters, anomaly detection
Before you start
The 1D normal distribution is everywhere because of the Central Limit
Theorem. Its multi-dimensional cousin — the multivariate Gaussian — is
even more central to ML, because it’s the one rich distribution we can
actually compute with. And it’s described by just two things: a mean
vector μ (where the blob sits) and a covariance matrix Σ (its shape
and tilt).
The density
p(x) = (1 / Z) · exp( −½ (x − μ)ᵀ Σ⁻¹ (x − μ) )
Ignore the normalizer Z for now; it is a constant that makes the density
integrate to 1. The action is in the exponent: the quadratic form
(x − μ)ᵀ Σ⁻¹ (x − μ) is the squared Mahalanobis distance, which measures
how far x is from the mean in the distribution’s own stretched, tilted
units. Points at equal Mahalanobis distance form an ellipse (an ellipsoid
in more than two dimensions), and those ellipses are the contours of the
bell.
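To make the formula concrete, here is a minimal NumPy sketch of the log-density. It assumes the standard normalizer Z = (2π)^(d/2) · |Σ|^(1/2) for a d-dimensional Gaussian; the function name `gaussian_logpdf` and the example numbers are illustrative choices, not something taken from the text above.

```python
import numpy as np

def gaussian_logpdf(x, mu, Sigma):
    """Log-density of N(mu, Sigma) at a single point x (illustrative helper)."""
    d = mu.shape[0]
    diff = x - mu
    # Squared Mahalanobis distance. solve() avoids forming Sigma^{-1} explicitly.
    maha_sq = diff @ np.linalg.solve(Sigma, diff)
    # log Z = (d/2) * log(2*pi) + (1/2) * log det(Sigma)
    _, logdet = np.linalg.slogdet(Sigma)
    log_Z = 0.5 * (d * np.log(2.0 * np.pi) + logdet)
    return -0.5 * maha_sq - log_Z

# Example values chosen only for illustration.
mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

log_p_at_mean = gaussian_logpdf(mu, mu, Sigma)                   # exponent is 0 here
log_p_away = gaussian_logpdf(np.array([1.0, -1.0]), mu, Sigma)   # lower than at the mean
```

Working in log space is a design choice: the exponent and the normalizer simply add, and tiny densities in high dimensions do not underflow.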
Σ does all the shaping. Σ = σ²I gives a perfectly round bell (isotropic).
A diagonal Σ gives an axis-aligned ellipse. A full Σ, with nonzero
off-diagonal entries, tilts it: the off-diagonals encode the correlation
between dimensions.
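One way to see the three shapes numerically is to eigendecompose Σ: the eigenvectors give the ellipse's axes and the square roots of the eigenvalues give the standard deviation along each axis. The sketch below assumes 2D covariances; `ellipse_axes` and the three example matrices are hypothetical names and values picked for illustration.

```python
import numpy as np

def ellipse_axes(Sigma):
    """Return (std devs along the principal axes, tilt of the major axis in degrees)."""
    eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
    major = eigvecs[:, -1]                     # direction of largest variance
    # Eigenvector sign is arbitrary, so fold the angle into [0, 180).
    angle = np.degrees(np.arctan2(major[1], major[0])) % 180.0
    return np.sqrt(eigvals), angle

isotropic = 2.0 * np.eye(2)                    # sigma^2 * I: round
diagonal = np.diag([4.0, 1.0])                 # axis-aligned ellipse
full = np.array([[2.0, 1.5],
                 [1.5, 2.0]])                  # correlated: tilted ellipse

iso_std, _ = ellipse_axes(isotropic)           # equal std devs; the angle is meaningless
diag_std, diag_angle = ellipse_axes(diagonal)  # major axis lies along the x-axis
full_std, full_angle = ellipse_axes(full)      # major axis at 45 degrees
```

For the `full` matrix the eigenvalues are 0.5 and 3.5, so the blob is roughly 2.6 times wider along the 45-degree diagonal than across it.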
Why the Gaussian is the distribution we use
It has almost magical closure properties:
- Marginals are Gaussian. Ignore some dimensions — what’s left is still Gaussian.
- Conditionals are Gaussian. Fix some variables and the distribution of the rest is Gaussian. This is the engine of Gaussian processes and Kalman filters — “given what I’ve seen, predict the rest.”
- Linear maps preserve it. If x is Gaussian with mean μ and covariance Σ, then Ax + b is Gaussian with mean Aμ + b and covariance AΣAᵀ. The family stays closed under the linear operations that neural nets and signal processing use.
Few other multivariate distributions are this tractable, which is why we reach for the Gaussian constantly, even as an approximation.
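The three closure properties each reduce to a few lines of linear algebra. The sketch below uses the standard block formulas: marginalizing selects sub-blocks of μ and Σ, and conditioning on x_g = v gives mean μ_f + Σ_fg Σ_gg⁻¹ (v − μ_g) and covariance Σ_ff − Σ_fg Σ_gg⁻¹ Σ_gf. The function names and the example numbers are assumptions made for illustration.

```python
import numpy as np

def marginal(mu, Sigma, keep):
    """Marginal over the dimensions in `keep`: select the matching sub-blocks."""
    keep = np.asarray(keep)
    return mu[keep], Sigma[np.ix_(keep, keep)]

def conditional(mu, Sigma, free, given, values):
    """Distribution of x[free] once x[given] is fixed to `values`."""
    free, given = np.asarray(free), np.asarray(given)
    S_ff = Sigma[np.ix_(free, free)]
    S_fg = Sigma[np.ix_(free, given)]
    S_gg = Sigma[np.ix_(given, given)]
    # gain = S_fg @ inv(S_gg), computed with a solve instead of an explicit inverse
    gain = np.linalg.solve(S_gg, S_fg.T).T
    mu_c = mu[free] + gain @ (values - mu[given])
    Sigma_c = S_ff - gain @ S_fg.T
    return mu_c, Sigma_c

def linear_map(mu, Sigma, A, b):
    """Mean and covariance of A x + b when x ~ N(mu, Sigma)."""
    return A @ mu + b, A @ Sigma @ A.T

# Illustrative 2D Gaussian.
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

m_mu, m_Sigma = marginal(mu, Sigma, keep=[0])
c_mu, c_Sigma = conditional(mu, Sigma, free=[0], given=[1], values=np.array([3.0]))
y_mu, y_Sigma = linear_map(mu, Sigma, A=np.array([[1.0, 1.0]]), b=np.array([0.0]))
```

Note that the conditional covariance does not depend on the observed value, only on which dimensions were observed. Observing x₂ = 3 shifts the mean of x₁ from 1 to 1.8 and shrinks its variance from 2 to 1.36.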
Take two test points that are both 3 units from the mean in plain
(Euclidean) distance. Their Mahalanobis distances can still differ, because
Σ says the data spreads more easily in one direction than the other.
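A small numeric sketch of that comparison, assuming a covariance that spreads three times further along x than along y (the matrix and the two points are made-up illustration values):

```python
import numpy as np

def mahalanobis(x, mu, Sigma):
    """Mahalanobis distance of x from mu under covariance Sigma."""
    diff = x - mu
    return np.sqrt(diff @ np.linalg.solve(Sigma, diff))

mu = np.array([0.0, 0.0])
Sigma = np.diag([9.0, 1.0])        # std dev 3 along x, std dev 1 along y

along_wide = np.array([3.0, 0.0])  # 3 units out along the wide direction
along_narrow = np.array([0.0, 3.0])  # 3 units out along the narrow direction

euclid_wide = np.linalg.norm(along_wide - mu)      # 3.0
euclid_narrow = np.linalg.norm(along_narrow - mu)  # 3.0
maha_wide = mahalanobis(along_wide, mu, Sigma)     # about 1: one std dev out
maha_narrow = mahalanobis(along_narrow, mu, Sigma) # about 3: three std devs out
```

Same Euclidean distance, but the second point is three standard deviations out and therefore far less typical of the distribution.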
Where this lives in ML
- Gaussian Mixture Models model data as several Gaussians — soft clustering and density estimation.
- Gaussian processes put a Gaussian over functions; prediction is just a Gaussian conditional.
- VAEs use a Gaussian latent prior and a Gaussian encoder.
- Anomaly detection flags points with large Mahalanobis distance.
- Diffusion models literally add and remove multivariate Gaussian noise.
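As one concrete instance from this list, anomaly detection with Mahalanobis distance can be sketched in a few lines. The data here is synthetic, and the threshold rests on an assumption: if the data really is 2D Gaussian, the squared Mahalanobis distance follows a chi-square distribution with 2 degrees of freedom, whose 99% quantile is −2·ln(0.01) ≈ 9.21. All names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "normal" data from a correlated 2D Gaussian (illustrative values).
true_mu = np.array([0.0, 0.0])
true_Sigma = np.array([[2.0, 1.5],
                       [1.5, 2.0]])
L = np.linalg.cholesky(true_Sigma)
X = true_mu + rng.standard_normal((5000, 2)) @ L.T   # rows have covariance L @ L.T

# Fit the Gaussian: sample mean and sample covariance.
mu_hat = X.mean(axis=0)
Sigma_hat = np.cov(X, rowvar=False)

def mahalanobis_sq(points, mu, Sigma):
    """Squared Mahalanobis distance for each row of `points`."""
    diff = points - mu
    return np.einsum("ij,ij->i", diff, np.linalg.solve(Sigma, diff.T).T)

# Assumed threshold: 99% quantile of chi-square with 2 degrees of freedom.
threshold = -2.0 * np.log(0.01)

candidates = np.array([[2.0, 2.0],     # along the correlation: plausible
                       [2.0, -2.0]])   # against the correlation: suspicious
scores = mahalanobis_sq(candidates, mu_hat, Sigma_hat)
is_anomaly = scores > threshold
```

Both candidates are equally far from the mean in Euclidean terms, but only the one that cuts against the correlation crosses the threshold.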
Practice this in an interview
A GMM models data as a mixture of Gaussian distributions and assigns soft probabilities of cluster membership, fitting clusters that can be elliptical and different sizes via the EM algorithm. K-means does hard assignment to the nearest centroid and implicitly assumes spherical, equal-size clusters. Prefer a GMM when clusters overlap, have different shapes or covariances, or when you need probabilistic (soft) assignments.
EM fits a GMM by alternating two steps: the E-step computes each point's responsibility (posterior probability) under each Gaussian using current parameters, and the M-step updates the means, covariances, and mixing weights to maximize the expected log-likelihood given those responsibilities. It iterates until the likelihood converges. Because the objective is non-convex, EM only reaches a local optimum, so initialization and multiple restarts matter.
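The two steps can be sketched for the simplest case: a 1D mixture of two Gaussians, where each covariance is just a scalar variance. This is a minimal illustration under stated assumptions (synthetic data, a crude single initialization, a fixed 100 iterations instead of a convergence test), not a production implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 30% from N(-2, 0.5^2), 70% from N(3, 1^2) (illustrative values).
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 700)])

def normal_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

# Crude initialization (an assumption; real code would try several restarts).
weights = np.array([0.5, 0.5])
means = np.array([x.min(), x.max()])
variances = np.array([x.var(), x.var()])

log_liks = []
for _ in range(100):
    # E-step: responsibility of each component for each point, shape (n, 2).
    weighted = weights * normal_pdf(x[:, None], means, variances)
    total = weighted.sum(axis=1, keepdims=True)
    resp = weighted / total
    log_liks.append(np.log(total).sum())   # log-likelihood under current parameters

    # M-step: responsibility-weighted updates of weights, means, variances.
    Nk = resp.sum(axis=0)
    weights = Nk / x.size
    means = (resp * x[:, None]).sum(axis=0) / Nk
    variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / Nk
```

Tracking `log_liks` is a cheap sanity check: EM should never decrease the log-likelihood, so a drop signals a bug rather than bad luck.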
The Normal distribution is justified by the Central Limit Theorem — averages of large i.i.d. samples converge to Normal regardless of the underlying distribution. It is fully characterized by mean and variance, enabling closed-form inference. It fails for heavy-tailed data, skewed outcomes, bounded quantities, and rare extreme events.
Each distribution has a natural generative story: Bernoulli is a single coin flip; Binomial sums Bernoullis; Poisson counts rare arrivals; Normal emerges from sums of many small effects; Exponential models waiting times between Poisson events; Uniform assigns equal probability across a range. Choosing correctly comes from matching that story to the data-generating process.