datarekha
Statistics & Probability · Medium · Asked at Google, Meta, Amazon, OpenAI

What is maximum likelihood estimation, and what is the intuition behind it?

The short answer

Maximum likelihood estimation finds the parameter values that make the observed data most probable under the assumed model. Intuitively, you ask: given this data, which world would have been most likely to generate it?

How to think about it

MLE is the workhorse of parametric statistics. It is the implicit objective behind logistic regression and Gaussian mixture models, and many deep learning loss functions are negative log-likelihoods under an assumed output distribution.

Core idea

Given data x₁, x₂, …, xₙ assumed i.i.d. from a distribution with parameter θ, the likelihood is:

L(θ) = P(x₁, x₂, ..., xₙ | θ) = ∏ᵢ P(xᵢ | θ)

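MLE picks the θ that maximises L(θ). Equivalently, it maximises the log-likelihood ℓ(θ) = ∑ᵢ log P(xᵢ | θ): because log is strictly increasing, L(θ) and ℓ(θ) peak at the same θ, and a sum is easier to differentiate and numerically more stable than a product of many small probabilities.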
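To see why the log matters in practice, here is a minimal sketch that compares the raw likelihood with the log-likelihood on a long sequence of coin flips. The data (2000 alternating flips) and the function names are made up for illustration.

```python
import math

def likelihood(p, flips):
    """Raw likelihood: product of Bernoulli probabilities (1 = heads, 0 = tails)."""
    result = 1.0
    for x in flips:
        result *= p if x == 1 else (1.0 - p)
    return result

def log_likelihood(p, flips):
    """Log-likelihood: sum of log Bernoulli probabilities."""
    return sum(math.log(p) if x == 1 else math.log(1.0 - p) for x in flips)

# Hypothetical data: 2000 flips, half heads and half tails.
flips = [1, 0] * 1000

raw = likelihood(0.5, flips)         # 0.5 ** 2000 is too small for a 64-bit float
log_l = log_likelihood(0.5, flips)   # 2000 * log(0.5), roughly -1386.29

print(raw)    # underflows to 0.0
print(log_l)  # still a perfectly usable finite number
```

With the product, every candidate p would score 0.0 and the maximiser would be lost; the sum keeps the values comparable.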

Worked example — estimating a coin’s bias

You flip a coin 10 times and observe 7 heads. Model: each flip is Bernoulli(p).

L(p) = p⁷ · (1-p)³

Take the log, differentiate with respect to p, set to zero:

d/dp [7 log p + 3 log(1-p)] = 7/p - 3/(1-p) = 0
⟹ 7(1-p) = 3p
⟹ p̂ = 7/10 = 0.7

The MLE is simply the observed proportion — the value of p that would have made 7 heads out of 10 most probable.
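The calculus result can be checked numerically. The sketch below evaluates the log-likelihood on a grid of candidate values of p and picks the largest; the grid resolution and the helper name are arbitrary choices for illustration, not part of the derivation.

```python
import math

def bernoulli_log_likelihood(p, heads, tails):
    """Log-likelihood of observing `heads` heads and `tails` tails with bias p."""
    return heads * math.log(p) + tails * math.log(1.0 - p)

heads, tails = 7, 3

# Candidate values 0.001, 0.002, ..., 0.999 (endpoints excluded: log(0) is undefined).
grid = [i / 1000 for i in range(1, 1000)]

# The MLE on this grid is the candidate with the highest log-likelihood.
p_hat = max(grid, key=lambda p: bernoulli_log_likelihood(p, heads, tails))
print(p_hat)
```

The grid search lands on the same value as the closed-form answer, heads / (heads + tails). For models without a closed form, the grid would typically be replaced by a numerical optimiser.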

Connection to common loss functions

Model               | Assumed distribution | MLE is equivalent to
Linear regression   | Gaussian noise       | Minimising MSE
Logistic regression | Bernoulli            | Minimising cross-entropy
Poisson regression  | Poisson              | Minimising Poisson deviance

In each case, the loss function you minimise during training is the negative log-likelihood, up to constants and scale factors that do not depend on the parameters being fitted.
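As a rough illustration of the linear regression case, the sketch below fits a single slope (no intercept) by two routes: minimising MSE and minimising the Gaussian negative log-likelihood with a fixed noise standard deviation. The data points, the fixed sigma = 1.0, and the slope grid are all assumptions made up for this example.

```python
import math

# Hypothetical data, roughly following y = 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 1.9, 4.2, 5.8, 8.1]
sigma = 1.0  # assumed known and fixed noise standard deviation

def squared_errors(slope):
    return sum((y - slope * x) ** 2 for x, y in zip(xs, ys))

def mse(slope):
    return squared_errors(slope) / len(xs)

def gaussian_nll(slope):
    """Negative log-likelihood of y ~ Normal(slope * x, sigma^2)."""
    n = len(xs)
    constant = 0.5 * n * math.log(2 * math.pi * sigma ** 2)  # does not depend on slope
    return constant + squared_errors(slope) / (2 * sigma ** 2)

# Candidate slopes 0.00, 0.01, ..., 4.00.
slopes = [i / 100 for i in range(0, 401)]
best_by_mse = min(slopes, key=mse)
best_by_nll = min(slopes, key=gaussian_nll)
print(best_by_mse, best_by_nll)
```

The negative log-likelihood is the squared-error sum rescaled by 1 / (2σ²) plus a constant, so both criteria select the same slope.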

Properties of MLE

  • Consistent: under standard regularity conditions, the estimate converges to the true parameter as n → ∞.
  • Asymptotically efficient: under the same conditions, its variance approaches the Cramér-Rao lower bound as n → ∞.
  • Invariant: if θ̂ is the MLE of θ, then g(θ̂) is the MLE of g(θ) for any function g.
  • Can overfit with small samples and complex models, because no regularisation is built in.
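The invariance property can be illustrated with the coin example (7 heads in 10 flips). Suppose the quantity of interest is q = p², the probability of two heads in a row. The sketch below, with an arbitrary grid and made-up helper names, maximises the likelihood directly over q and compares the result with simply squaring p̂ = 0.7.

```python
import math

def log_likelihood(p, heads=7, tails=3):
    """Bernoulli log-likelihood for the coin example."""
    return heads * math.log(p) + tails * math.log(1.0 - p)

# Route 1: apply g(p) = p^2 to the MLE of p.
p_hat = 0.7
via_invariance = p_hat ** 2

# Route 2: reparameterise by q = p^2 (so p = sqrt(q)) and maximise over q.
q_grid = [i / 100 for i in range(1, 100)]  # 0.01, 0.02, ..., 0.99
q_hat = max(q_grid, key=lambda q: log_likelihood(math.sqrt(q)))

print(via_invariance, q_hat)
```

Both routes give approximately 0.49. Here g is one-to-one on (0, 1), which is the simplest setting for the property.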
