What is maximum likelihood estimation, and what is the intuition behind it?
Maximum likelihood estimation finds the parameter values that make the observed data most probable under the assumed model. Intuitively, you ask: of all the parameter settings the model allows, which one would have made this data most likely?
How to think about it
MLE is the workhorse of parametric statistics and is the implicit objective behind logistic regression, Gaussian mixture models, and many deep learning loss functions.
Core idea
Given data x₁, x₂, …, xₙ assumed i.i.d. from a distribution with parameter θ, the likelihood is:
L(θ) = P(x₁, x₂, ..., xₙ | θ) = ∏ᵢ P(xᵢ | θ)
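MLE picks the θ that maximises L(θ), or equivalently the log-likelihood ℓ(θ) = ∑ᵢ log P(xᵢ | θ). Because the logarithm is strictly increasing, both have the same maximiser; the log form converts the product to a sum, which is easier to differentiate and numerically more stable.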
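To make the numerical-stability point concrete, here is a minimal sketch using a Bernoulli model. The sample (`flips`, 2000 flips with 70% heads) and the function names are hypothetical, chosen only for illustration.

```python
import math

def likelihood(data, p):
    """Raw product of Bernoulli probabilities; shrinks towards zero as data grows."""
    result = 1.0
    for x in data:
        result *= p if x == 1 else (1.0 - p)
    return result

def log_likelihood(data, p):
    """Sum of log-probabilities; same maximiser as the product, but no underflow."""
    return sum(math.log(p if x == 1 else 1.0 - p) for x in data)

# Hypothetical sample: 2000 flips, 1400 heads (1) and 600 tails (0).
flips = [1] * 1400 + [0] * 600

raw = likelihood(flips, 0.7)        # underflows to 0.0 in double precision
logged = log_likelihood(flips, 0.7)  # a finite negative number, roughly -1221.7

print(raw)
print(logged)
```

With the raw product, every candidate p would evaluate to 0.0 on a sample this size, so candidates could not be compared. The log-likelihood stays finite and can still rank them.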
Worked example — estimating a coin’s bias
You flip a coin 10 times and observe 7 heads. Model: each flip is Bernoulli(p).
L(p) = p⁷ · (1-p)³
Take the log, differentiate with respect to p, set to zero:
d/dp [7 log p + 3 log(1-p)] = 7/p - 3/(1-p) = 0
⟹ 7(1-p) = 3p ⟹ p̂ = 7/10 = 0.7
The MLE is simply the observed proportion — the value of p that would have made 7 heads out of 10 most probable.
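The closed-form answer can be checked numerically. The sketch below evaluates the coin's log-likelihood on a grid of candidate values and picks the best one. The grid resolution and the name `coin_log_likelihood` are assumptions made for this illustration; in practice you would use the closed form or a proper optimiser.

```python
import math

def coin_log_likelihood(p, heads=7, tails=3):
    """Log-likelihood of a Bernoulli(p) coin for 7 heads and 3 tails."""
    return heads * math.log(p) + tails * math.log(1.0 - p)

# Candidate values on the open interval (0, 1); the endpoints are excluded
# because log(0) is undefined.
grid = [i / 1000 for i in range(1, 1000)]

# The grid point with the highest log-likelihood approximates the MLE.
p_hat = max(grid, key=coin_log_likelihood)
print(p_hat)  # 0.7
```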
Connection to common loss functions
| Model | Assumed distribution | MLE is equivalent to |
|---|---|---|
| Linear regression | Gaussian noise | Minimising MSE |
| Logistic regression | Bernoulli | Minimising cross-entropy |
| Poisson regression | Poisson | Minimising Poisson deviance |
In each case, the loss you minimise during training is the negative log-likelihood, up to additive constants and positive scale factors that do not change the minimiser.
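As a rough check of the first table row, the sketch below fits a one-parameter model y ≈ w·x twice: once by minimising MSE and once by minimising the Gaussian negative log-likelihood. The toy data, the fixed noise level `SIGMA`, and the grid of candidate slopes are all assumptions made for illustration.

```python
import math

# Hypothetical toy data for a one-parameter model y ≈ w * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
SIGMA = 1.0  # assumed known noise standard deviation

def mse(w):
    """Mean squared error of the predictions w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gaussian_nll(w, sigma=SIGMA):
    """Negative log-likelihood under y = w * x + Gaussian noise."""
    n = len(xs)
    const = 0.5 * n * math.log(2 * math.pi * sigma ** 2)  # does not depend on w
    sse = sum((y - w * x) ** 2 for x, y in zip(xs, ys))
    return const + sse / (2 * sigma ** 2)

# Candidate slopes from 1.000 to 3.000 in steps of 0.001.
candidates = [i / 1000 for i in range(1000, 3001)]
w_mse = min(candidates, key=mse)
w_nll = min(candidates, key=gaussian_nll)

print(w_mse, w_nll)  # both objectives select the same slope
```

The two objectives differ only by the constant term and the factor 1/(2σ²), neither of which depends on w, so they agree on the best slope.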
Properties of MLE
- Consistent: under standard regularity conditions and a correctly specified model, converges to the true parameter as n → ∞.
- Asymptotically efficient: for large n, its variance approaches the Cramér-Rao lower bound.
- Invariant: if θ̂ is the MLE of θ, then g(θ̂) is the MLE of g(θ) for any function g.
- Can overfit with small samples and complex models — no regularisation is built in.
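A small simulation can illustrate three of these properties for the Bernoulli case. This is a sketch under stated assumptions: the true bias `TRUE_P`, the seed, and the sample sizes are hypothetical choices, not values from the text above.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable
TRUE_P = 0.3    # hypothetical true bias of the coin

def bernoulli_mle(flips):
    """MLE of p for Bernoulli data: the observed proportion of ones."""
    return sum(flips) / len(flips)

# Consistency: the estimate tends to settle near TRUE_P as n grows.
estimates = {}
for n in (10, 1000, 100000):
    sample = [1 if random.random() < TRUE_P else 0 for _ in range(n)]
    estimates[n] = bernoulli_mle(sample)
    print(n, estimates[n])

# Invariance: the MLE of the odds p / (1 - p) is obtained by plugging in p_hat.
p_hat = bernoulli_mle([1] * 7 + [0] * 3)
odds_hat = p_hat / (1 - p_hat)
print(odds_hat)

# Small-sample overfitting: three heads in three flips.
tiny_hat = bernoulli_mle([1, 1, 1])
print(tiny_hat)  # 1.0: the fitted model treats tails as impossible
```

The last case shows why regularisation or a prior is often added when data is scarce: the unregularised estimate commits fully to whatever the handful of observations happened to show.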