How does a Gaussian Mixture Model differ from k-means, and when would you prefer it?

A GMM models data as a mixture of Gaussian distributions and assigns soft probabilities of cluster membership, fitting clusters that can be elliptical and of different sizes via the EM algorithm. K-means makes a hard assignment of each point to its nearest centroid and implicitly assumes spherical, equal-size clusters. Prefer a GMM when clusters overlap, have different shapes or covariances, or when you need probabilistic (soft) assignments.
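As a rough illustration of hard versus soft assignment, the sketch below scores one point against two hand-picked clusters. The cluster parameters and the helper `gaussian_pdf` are hypothetical, chosen only to show that the nearest centroid and the most probable Gaussian component can disagree.

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Density of a multivariate Gaussian at a single point x."""
    d = x - mu
    k = len(mu)
    norm = np.sqrt((2 * np.pi) ** k * np.linalg.det(Sigma))
    return float(np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)) / norm)

# Two illustrative clusters: one tight and round, one wide along x.
means = [np.array([0.0, 0.0]), np.array([4.0, 0.0])]
covs = [np.eye(2) * 0.25, np.array([[4.0, 0.0], [0.0, 0.25]])]
weights = np.array([0.5, 0.5])

x = np.array([1.5, 0.0])

# k-means style: hard assignment to the nearest centroid.
hard = int(np.argmin([np.linalg.norm(x - m) for m in means]))

# GMM style: posterior probability (responsibility) of each component.
joint = np.array([w * gaussian_pdf(x, m, S)
                  for w, m, S in zip(weights, means, covs)])
soft = joint / joint.sum()

print("nearest centroid:", hard)
print("GMM responsibilities:", soft.round(3))
```

The point is closer to the first centroid, yet the wide second component explains it far better, so the soft assignment leans the other way.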

Explain the EM algorithm in the context of fitting a Gaussian Mixture Model.

EM fits a GMM by alternating two steps: the E-step computes each point's responsibility (posterior probability) under each Gaussian using current parameters, and the M-step updates the means, covariances, and mixing weights to maximize the expected log-likelihood given those responsibilities. It iterates until the likelihood converges. Because the objective is non-convex, EM only reaches a local optimum, so initialization and multiple restarts matter.
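The two steps can be sketched for the simplest case, a two-component mixture in one dimension. This is a minimal NumPy sketch on synthetic data, with an arbitrary initialization and stopping tolerance, not a production implementation (it has no restarts and no safeguards against collapsing variances).

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data from two Gaussians (illustrative values).
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 700)])

# Crude initialization; in practice try several restarts.
mu = np.array([x.min(), x.max()])
var = np.array([x.var(), x.var()])
pi = np.array([0.5, 0.5])

def weighted_densities(x, mu, var, pi):
    """pi_k * N(x_i | mu_k, var_k) for every point i and component k."""
    return pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

prev = -np.inf
for _ in range(200):
    # E-step: responsibility of each component for each point.
    dens = weighted_densities(x, mu, var, pi)
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step: responsibility-weighted maximum-likelihood updates.
    nk = resp.sum(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    pi = nk / len(x)

    # Stop when the log-likelihood no longer improves.
    ll = float(np.log(weighted_densities(x, mu, var, pi).sum(axis=1)).sum())
    if ll - prev < 1e-8:
        break
    prev = ll

print("means:", mu.round(2), " variances:", var.round(2), " weights:", pi.round(2))
```

Starting the same loop from a poor initialization (for example both means at the same value) is a quick way to see the local-optimum problem.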

What makes the Normal distribution so central in statistics, and when does it fail?

The Normal distribution owes its central role to the Central Limit Theorem: the average of a large i.i.d. sample is approximately Normal regardless of the underlying distribution, provided that distribution has finite variance. It is fully characterized by its mean and variance, which enables closed-form inference. It fails for heavy-tailed data, skewed outcomes, bounded quantities, and rare extreme events.
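A small simulation makes the Central Limit Theorem claim concrete. The sketch below assumes Exponential(1) draws as the skewed starting distribution and averages of 50 draws; both choices, and the `skewness` helper, are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def skewness(a):
    """Sample skewness: third central moment over the cubed std-dev."""
    a = np.asarray(a, dtype=float)
    return float(np.mean((a - a.mean()) ** 3) / a.std() ** 3)

# Exponential(1) is strongly right-skewed (theoretical skewness 2).
raw = rng.exponential(1.0, size=50_000)

# Averages of n = 50 draws: skewness shrinks roughly like 2 / sqrt(n),
# and the spread shrinks like 1 / sqrt(n).
n = 50
means = rng.exponential(1.0, size=(50_000, n)).mean(axis=1)

print("skewness of raw draws:", round(skewness(raw), 2))
print("skewness of averages :", round(skewness(means), 2))
print("std of averages      :", round(float(means.std()), 3))
```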

When does each common distribution arise — Bernoulli, Binomial, Poisson, Normal, Exponential, Uniform?

Each distribution has a natural generative story: Bernoulli is a single coin flip; Binomial counts successes in a fixed number of independent Bernoulli trials; Poisson counts rare arrivals in a fixed interval; Normal emerges from sums of many small independent effects; Exponential models waiting times between Poisson events; Uniform assigns equal probability across a range. Choosing correctly comes from matching that story to the data-generating process.
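The link between two of these stories can be checked by simulation: if waiting times between arrivals are Exponential, the number of arrivals per unit interval should look Poisson, with mean and variance both equal to the rate. The rate and the observation horizon below are assumed values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
rate = 3.0          # assumed arrival rate per unit time
horizon = 10_000    # number of unit-length intervals to observe

# Exponential waiting times between arrivals; generate more than enough
# gaps to cover the whole horizon.
gaps = rng.exponential(1.0 / rate, size=int(rate * horizon * 1.2))
arrivals = np.cumsum(gaps)
arrivals = arrivals[arrivals < horizon]

# Count arrivals falling in each unit interval [k, k + 1).
counts = np.bincount(arrivals.astype(int), minlength=horizon)

print("mean count:", round(float(counts.mean()), 3),
      " variance:", round(float(counts.var()), 3))
```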

The multivariate Gaussian — Math for ML

The bell curve, generalized to many dimensions — and the most important distribution in all of ML. A mean vector and a covariance matrix describe it completely, and that single object underlies GMMs, Gaussian processes, VAEs, Kalman filters, and anomaly detection.

8 min read · Advanced · Math for ML · Lesson 26 of 37

What you'll learn

The density formula and its heart: the squared Mahalanobis distance (x−μ)ᵀΣ⁻¹(x−μ)

How the covariance matrix Σ controls the shape — round, axis-aligned, or tilted

Why the Gaussian is special: marginals, conditionals, and linear maps all stay Gaussian

Mahalanobis distance as "distance in standard deviations" for correlated data

Where it shows up — GMMs, Gaussian processes, VAEs, Kalman filters, anomaly detection

The last lesson ended with a shape but no distribution: the covariance matrix Σ drew an ellipse, yet could not say how likely any single point was. Here is the distribution that fills that ellipse in. The 1-D normal is everywhere because of the Central Limit Theorem; its multi-dimensional cousin — the multivariate Gaussian — is even more central to ML, because it is the one rich distribution we can actually compute with. And it is described by exactly the two things you already hold: a mean vector μ (where the blob sits) and a covariance matrix Σ (its shape and tilt).

The density

p(x) = (1 / Z) · exp( −½ (x − μ)ᵀ Σ⁻¹ (x − μ) )

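The normalizer Z = (2π)^(d/2) |Σ|^(1/2), where d is the number of dimensions, only makes the density integrate to 1, so you can ignore it for now. The action is in the exponent: the quadratic form (x − μ)ᵀ Σ⁻¹ (x − μ) is the squared Mahalanobis distance, which measures how far x is from the mean in the distribution’s own stretched, tilted units. Points at equal Mahalanobis distance form an ellipse (an ellipsoid in more than two dimensions), and those ellipses are the contours of the bell.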
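To make the formula concrete, here is a minimal sketch of the log-density in NumPy, including the normalizer Z = (2π)^(d/2) |Σ|^(1/2). The function name `gaussian_logpdf` is made up for this example, and the μ and Σ values are the ones used in this lesson's code.

```python
import numpy as np

def gaussian_logpdf(x, mu, Sigma):
    """log p(x) = -log Z - 0.5 * (x - mu)^T Sigma^{-1} (x - mu)."""
    x, mu = np.asarray(x, dtype=float), np.asarray(mu, dtype=float)
    d = x - mu
    k = mu.size
    # Squared Mahalanobis distance, via a linear solve instead of an inverse.
    maha_sq = d @ np.linalg.solve(Sigma, d)
    # log Z = 0.5 * (k * log(2 pi) + log det Sigma)
    _, logdet = np.linalg.slogdet(Sigma)
    log_Z = 0.5 * (k * np.log(2 * np.pi) + logdet)
    return float(-log_Z - 0.5 * maha_sq)

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.5, 1.0], [1.0, 1.0]])

# At the mean the exponent vanishes, so the density is just 1 / Z.
peak = np.exp(gaussian_logpdf(mu, mu, Sigma))
print("density at the mean:", round(float(peak), 4))
print("log-density at [3, 0]:", round(gaussian_logpdf([3.0, 0.0], mu, Sigma), 4))
```

Working in log space avoids underflow when a point is many Mahalanobis units from the mean.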

Interactive figure: covariance ellipse at σx = 1.40, σy = 1.00, ρ = 0.60

Positive ρ: the blob tilts up — high x tends to mean high y.

One mean vector, one covariance matrix Σ — that's the entire description. Σ rotates and stretches a circular bell into this ellipse.

Σ does all the shaping. Σ = σ²I gives a perfectly round bell (isotropic). A diagonal Σ gives an axis-aligned ellipse. A full Σ, with nonzero off-diagonal entries, tilts it: the off-diagonal term ρσxσy carries exactly the correlation you tuned.
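One way to read the shape off Σ is through its eigendecomposition: the eigenvectors give the ellipse's axes and the square roots of the eigenvalues give the spread along them. The sketch below builds Σ from the figure's values (σx = 1.40, σy = 1.00, ρ = 0.60); the variable names are illustrative.

```python
import numpy as np

sx, sy, rho = 1.40, 1.00, 0.60

# Full covariance: variances on the diagonal, rho * sx * sy off it.
Sigma = np.array([[sx**2,         rho * sx * sy],
                  [rho * sx * sy, sy**2        ]])

# eigh returns eigenvalues in ascending order for symmetric matrices.
evals, evecs = np.linalg.eigh(Sigma)
major = evecs[:, -1]                      # direction of greatest spread

# Tilt of the major axis; the modulo removes the eigenvector sign ambiguity.
angle = float(np.degrees(np.arctan2(major[1], major[0])) % 180.0)

print("std-dev along axes (minor, major):", np.sqrt(evals).round(3))
print("tilt of major axis (degrees):", round(angle, 2))
```

Setting `rho = 0` makes the off-diagonal terms vanish and the axes line up with x and y; setting `sx = sy` as well gives the round, isotropic case.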

Why the Gaussian is the distribution we use

It has almost magical closure properties:

Marginals are Gaussian. Ignore some dimensions — what’s left is still Gaussian.
Conditionals are Gaussian. Fix some variables and the distribution of the rest is Gaussian. This is the engine of Gaussian processes and Kalman filters — “given what I’ve seen, predict the rest.”
Linear maps preserve it. If x is Gaussian, so is Ax + b, with mean Aμ + b and covariance AΣAᵀ. The family stays closed under the linear operations that neural nets and signal processing use.

Few other multivariate distributions are this tractable, which is why we reach for the Gaussian constantly, even as an approximation.
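The three closure properties can be written out for a 2-D example. This sketch uses the standard formulas (conditional mean μ₁ + Σ₁₂Σ₂₂⁻¹(x₂ − μ₂), conditional variance Σ₁₁ − Σ₁₂Σ₂₂⁻¹Σ₂₁, and mean Aμ + b with covariance AΣAᵀ for a linear map) with this lesson's μ and Σ; the matrix A, the offset b, and the observed value of x₂ are arbitrary choices for illustration.

```python
import numpy as np

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.5, 1.0], [1.0, 1.0]])

# Marginal of x1: read off the matching entries of mu and Sigma.
marg_mean, marg_var = mu[0], Sigma[0, 0]

# Conditional of x1 given x2 = 1.0.
x2 = 1.0
cond_mean = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x2 - mu[1])
cond_var = Sigma[0, 0] - Sigma[0, 1] * Sigma[1, 0] / Sigma[1, 1]

# Linear map y = A x + b.
A = np.array([[1.0, 1.0], [1.0, -1.0]])
b = np.array([2.0, 0.0])
y_mean = A @ mu + b
y_cov = A @ Sigma @ A.T

# Monte Carlo check of the linear-map formulas.
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mu, Sigma, size=200_000)
Y = X @ A.T + b

print("x1 marginal    : mean", marg_mean, " var", marg_var)
print("x1 | x2 = 1.0  : mean", cond_mean, " var", cond_var)
print("y covariance (formula):\n", y_cov)
print("y covariance (sampled):\n", np.cov(Y, rowvar=False).round(2))
```

Observing x₂ shrinks the variance of x₁ from 1.5 to 0.5: that reduction is what a Gaussian process or Kalman filter exploits when it conditions on data.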

import numpy as np
rng = np.random.default_rng(0)

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.5, 1.0], [1.0, 1.0]])     # correlated

X = rng.multivariate_normal(mu, Sigma, size=5)
print("samples:\n", X.round(2))

# Mahalanobis distance: how many "stretched std-devs" from the mean?
Si = np.linalg.inv(Sigma)

def maha(x):
    d = np.asarray(x, dtype=float) - mu   # accept plain lists as well as arrays
    return float(np.sqrt(d @ Si @ d))

print("\npoint [3,0]  Euclidean:", round(np.hypot(3,0),2), " Mahalanobis:", round(maha([3,0]),2))
print("point [0,3]  Euclidean:", round(np.hypot(0,3),2), " Mahalanobis:", round(maha([0,3]),2))
# same Euclidean distance, very different Mahalanobis -- Sigma rescales space

samples:
 [[-0.11 -0.17]
 [-0.79 -0.56]
 [ 0.53  0.63]
 [-1.83 -0.86]
 [ 1.2   0.19]]

point [3,0]  Euclidean: 3.0  Mahalanobis: 4.24
point [0,3]  Euclidean: 3.0  Mahalanobis: 5.2

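Both test points are 3 units from the mean in plain distance, but their Mahalanobis distances differ, because Σ says the data spreads more easily in one direction than the other: the variance is 1.5 along x but only 1.0 along y, so 3 units along y is the more surprising move. Both values also exceed 3 because of the positive correlation: a point far out on one axis while sitting at zero on the other goes against the tilt of the blob.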
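Another way to see how Σ rescales space is to whiten the data: with the Cholesky factor L (so that Σ = L Lᵀ), the transformed point z = L⁻¹(x − μ) lives in a space where the distribution is a round unit Gaussian, and the ordinary Euclidean length of z equals the Mahalanobis distance of x. The function name below is made up for this sketch.

```python
import numpy as np

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.5, 1.0], [1.0, 1.0]])

# Cholesky factor: Sigma = L @ L.T, with L lower-triangular.
L = np.linalg.cholesky(Sigma)

def maha_whitened(x):
    """Mahalanobis distance as the Euclidean norm of the whitened point."""
    z = np.linalg.solve(L, np.asarray(x, dtype=float) - mu)
    return float(np.linalg.norm(z))

print("point [3,0]:", round(maha_whitened([3, 0]), 2))
print("point [0,3]:", round(maha_whitened([0, 3]), 2))
```

This route never forms Σ⁻¹ explicitly, which tends to be the numerically safer habit when Σ is close to singular.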

Where this lives in ML

Gaussian Mixture Models model data as several Gaussians — soft clustering and density estimation.
Gaussian processes put a Gaussian over functions; prediction is just a Gaussian conditional.
VAEs use a Gaussian latent prior and a Gaussian encoder.
Anomaly detection flags points with large Mahalanobis distance.
Diffusion models literally add and remove multivariate Gaussian noise.
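The anomaly-detection item is the easiest to sketch end to end. Assuming the data really is 2-D Gaussian, the squared Mahalanobis distance follows a chi-square distribution with 2 degrees of freedom, whose tail is P(d² > t) = exp(−t/2), so a threshold for a chosen false-alarm rate has a closed form. The training data, the 1% rate, and the test points below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "normal" data (illustrative parameters).
X = rng.multivariate_normal([0.0, 0.0], [[1.5, 1.0], [1.0, 1.0]], size=5_000)

# Fit the Gaussian: sample mean and sample covariance.
mu_hat = X.mean(axis=0)
Sigma_hat = np.cov(X, rowvar=False)

def maha_sq(points):
    """Squared Mahalanobis distance of each row of `points`."""
    d = np.atleast_2d(points) - mu_hat
    return np.einsum("ij,ij->i", d, np.linalg.solve(Sigma_hat, d.T).T)

# 2-D case only: chi-square(2) tail gives threshold = -2 * log(alpha).
alpha = 0.01
threshold = -2.0 * np.log(alpha)

flag_rate = float(np.mean(maha_sq(X) > threshold))
test_points = np.array([[1.0, 1.0], [3.0, -3.0]])
flags = maha_sq(test_points) > threshold

print("threshold on squared distance:", round(float(threshold), 2))
print("fraction of training data flagged:", round(flag_rate, 4))
print("flags for test points:", flags.tolist())
```

The point [3, −3] is flagged because it runs against the positive correlation, even though each coordinate on its own is within three marginal standard deviations. In other dimensions the threshold would come from the chi-square quantile for that number of degrees of freedom.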

In one breath

The multivariate Gaussian is the bell curve in many dimensions, fixed by just a mean vector μ (centre) and a covariance matrix Σ (shape and tilt). Its density p(x) ∝ exp(−½ (x−μ)ᵀ Σ⁻¹ (x−μ)) carries the squared Mahalanobis distance in the exponent, that is, distance from the mean in the distribution’s own stretched, tilted units, so equal-probability contours are ellipses (round when Σ = σ²I, axis-aligned when Σ is diagonal, tilted when it’s full). It dominates ML for one reason: closure. The marginals, conditionals, and linear maps of a Gaussian are all still Gaussian, the exact property behind Gaussian processes and Kalman filters, and a large part of why GMMs, VAEs, and Mahalanobis-based anomaly detection stay tractable. Separately, assuming Gaussian errors is why least-squares MSE is the default loss.
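The least-squares remark can be unpacked in two lines. As a sketch, assume each target is a model prediction plus independent Gaussian noise with fixed variance σ² (the symbols f, y_i, x_i, and n are introduced here for illustration):

```latex
y_i = f(x_i) + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)

-\log p(y_1, \dots, y_n \mid x_1, \dots, x_n)
  = \frac{1}{2\sigma^2} \sum_{i=1}^{n} \bigl(y_i - f(x_i)\bigr)^2
  + \frac{n}{2} \log\!\bigl(2\pi\sigma^2\bigr)
```

The second term does not depend on f, so maximizing the likelihood over f is the same as minimizing the sum of squared errors, and hence the MSE.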

Practice

Quick check

Q1. What two quantities completely specify a multivariate Gaussian?

Q2. Two points are the same Euclidean distance from the mean but have different Mahalanobis distances. Why?

Q3. Which property makes Gaussians the backbone of Gaussian processes and Kalman filters?

A question to carry forward

Twice now we have leaned on the same unexplained claim: the Gaussian is everywhere “because of the Central Limit Theorem.” We invoked it to justify reaching for the bell curve constantly — but a justification you keep citing without proving is a debt, and it is time to pay it.

It is a genuinely astonishing fact, so it deserves to be stared at directly. Take almost any distribution (a lopsided die, a skewed income, a single coin flip), draw many independent samples and average them. The distribution of that average is not lopsided, skewed, or two-valued; it is close to a smooth, symmetric bell, and it grows more bell-shaped the more you average. The shape of the originals barely matters, as long as their variance is finite. Here is the thread onward: what exactly is the Central Limit Theorem, why does averaging launder almost any distribution into a Gaussian, and why is that, not coincidence, the reason measurement errors, sample means, and half the quantities in statistics turn out normal?
