Statistics & Probability | Medium | Asked at Google, Meta, Amazon, OpenAI

What is maximum likelihood estimation, and what is the intuition behind it?

For: Data Scientist, ML Engineer, AI / LLM Engineer

The short answer

Maximum likelihood estimation (MLE) finds the parameter values that make the observed data most probable under the assumed model. Intuitively, you ask: of all the parameter settings the model allows, which one would have been most likely to generate this data?

How to think about it

MLE is the workhorse of parametric statistics and is the implicit objective behind logistic regression, Gaussian mixture models, and many deep learning loss functions.

Core idea

Given data x₁, x₂, …, xₙ assumed i.i.d. from a distribution with parameter θ, the likelihood is:

L(θ) = P(x₁, x₂, ..., xₙ | θ) = ∏ᵢ P(xᵢ | θ)

MLE picks the θ that maximises L(θ). Because the logarithm is strictly increasing, the same θ maximises the log-likelihood ℓ(θ) = ∑ᵢ log P(xᵢ | θ), which turns the product into a sum and is numerically more stable.
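To make the numerical-stability point concrete, here is a minimal sketch. The data are hypothetical (2,000 observations, each given probability 0.5 by the model), and the helper names `likelihood` and `log_likelihood` are illustrative only.

```python
import math

def likelihood(probs):
    """Raw likelihood: the product of per-observation probabilities."""
    result = 1.0
    for p in probs:
        result *= p
    return result

def log_likelihood(probs):
    """Log-likelihood: the sum of per-observation log-probabilities."""
    return sum(math.log(p) for p in probs)

# Hypothetical data: the model assigns probability 0.5 to each of 2000 observations.
probs = [0.5] * 2000

raw = likelihood(probs)        # 0.5**2000 is too small for a 64-bit float
log_l = log_likelihood(probs)  # stays an ordinary finite number

print(raw)    # the product has underflowed to zero
print(log_l)  # roughly 2000 * log(0.5)
```

Once the raw product underflows, every candidate θ looks equally bad, so there is nothing left to maximise. The sum of logs keeps the ranking between candidates intact.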

Worked example — estimating a coin’s bias

You flip a coin 10 times and observe 7 heads. Model: each flip is Bernoulli(p).

L(p) = p⁷ · (1-p)³

Take the log, differentiate with respect to p, set to zero:

d/dp [7 log p + 3 log(1-p)] = 7/p - 3/(1-p) = 0
⟹ p̂ = 7/10 = 0.7

The MLE is simply the observed proportion — the value of p that would have made 7 heads out of 10 most probable.
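The closed-form answer can be sanity-checked numerically. The sketch below assumes a simple grid search over candidate values of p, chosen for illustration rather than efficiency; the function name `bernoulli_log_likelihood` is hypothetical.

```python
import math

def bernoulli_log_likelihood(p, heads, tails):
    """Log-likelihood of a coin with bias p: heads*log(p) + tails*log(1-p)."""
    return heads * math.log(p) + tails * math.log(1 - p)

heads, tails = 7, 3

# Candidate biases 0.001, 0.002, ..., 0.999 (0 and 1 are excluded to avoid log(0)).
grid = [i / 1000 for i in range(1, 1000)]

# Pick the candidate with the highest log-likelihood.
p_hat = max(grid, key=lambda p: bernoulli_log_likelihood(p, heads, tails))

print(p_hat)  # matches the analytic result heads / (heads + tails)
```

In practice you would use the closed form, or a gradient-based optimiser when no closed form exists. The grid is only there to show that the calculus and the brute-force search agree.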

Connection to common loss functions

Model	Assumed distribution	MLE is equivalent to
Linear regression	Gaussian noise	Minimising MSE
Logistic regression	Bernoulli	Minimising cross-entropy
Poisson regression	Poisson	Minimising Poisson deviance

In each case, the loss you minimise during training is the negative log-likelihood under the assumed distribution, up to constants that do not depend on the parameters.
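As an illustration of the first table row, the sketch below fits a single mean parameter to a few hypothetical data points and checks that minimising MSE and minimising the Gaussian negative log-likelihood select the same value. A fixed noise scale `sigma=1.0` is an assumption made here for simplicity, and the function names are illustrative.

```python
import math

def mse(mu, ys):
    """Mean squared error of predicting the constant mu for every y."""
    return sum((y - mu) ** 2 for y in ys) / len(ys)

def gaussian_nll(mu, ys, sigma=1.0):
    """Negative log-likelihood of ys under Normal(mu, sigma^2), sigma held fixed."""
    n = len(ys)
    const = 0.5 * n * math.log(2 * math.pi * sigma ** 2)  # does not depend on mu
    return const + sum((y - mu) ** 2 for y in ys) / (2 * sigma ** 2)

ys = [2.1, 1.9, 3.2, 2.8, 2.5]  # hypothetical observations

# Candidate means 0.00, 0.01, ..., 5.00.
grid = [i / 100 for i in range(0, 501)]

mu_mse = min(grid, key=lambda m: mse(m, ys))
mu_nll = min(grid, key=lambda m: gaussian_nll(m, ys))

print(mu_mse, mu_nll)  # both land on the sample mean
```

The two objectives differ only by an additive constant and a positive scale factor, neither of which depends on mu, so they share a minimiser.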

Properties of MLE

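Consistent: under standard regularity conditions and a correctly specified model, the estimate converges to the true parameter as n → ∞.
Asymptotically efficient: under the same conditions, its variance approaches the Cramér-Rao lower bound as n grows.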
Invariant: if θ̂ is the MLE of θ, then g(θ̂) is the MLE of g(θ) for any function g.
Can overfit with small samples and complex models — no regularisation is built in.
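Two of the properties above are easy to see on the coin model. The sketch below is illustrative only: `bernoulli_mle` and `odds_log_likelihood` are hypothetical helper names, and the grid search is there just to check invariance by brute force.

```python
import math

def bernoulli_mle(heads, n):
    """Closed-form MLE of a coin's bias: the observed proportion of heads."""
    return heads / n

def odds_log_likelihood(odds, heads, tails):
    """Log-likelihood written in terms of the odds o = p / (1 - p)."""
    p = odds / (1 + odds)
    return heads * math.log(p) + tails * math.log(1 - p)

# Invariance: maximise directly over the odds and compare with g(p_hat).
heads, tails = 7, 3
p_hat = bernoulli_mle(heads, heads + tails)
odds_from_p_hat = p_hat / (1 - p_hat)

odds_grid = [i / 1000 for i in range(1, 10001)]  # candidate odds 0.001 ... 10.0
odds_hat = max(odds_grid, key=lambda o: odds_log_likelihood(o, heads, tails))

print(odds_from_p_hat, odds_hat)  # agree up to the grid resolution

# Small-sample overfitting: three flips, all heads.
p_small = bernoulli_mle(3, 3)
print(p_small)  # the estimate leaves no probability for tails
```

In the small-sample case the MLE claims the coin can never land tails. That is the kind of overconfident fit that a prior or a regularisation term would temper.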
