Maximum likelihood & MAP: where loss functions come from
Why is the loss for regression squared error, and for classification cross-entropy? Both fall out of one principle — maximum likelihood — and adding a prior turns it into MAP, which is exactly what regularization is. This is the bridge from probability to training.
What you'll learn
- Likelihood vs probability, and the MLE principle: pick θ that makes the data most probable
- Why Gaussian errors make maximum likelihood equal to minimizing squared error (MSE)
- Why classification's cross-entropy loss is just the negative log-likelihood
- MAP = MLE + a prior, and how that prior is exactly L2 / L1 regularization
- That almost every loss you train with is a disguised negative log-likelihood
Before you start
You’ve trained with mean squared error and cross-entropy without asking the obvious question: why those? Why not mean absolute error, or something else? The answer is one of the most unifying ideas in ML — maximum likelihood — and once you see it, every loss function stops being a recipe and becomes a consequence.
The likelihood principle
Flip the usual question around. Instead of “given the model, how probable is the data,” ask: “which model parameters make the data I actually observed most probable?” That’s the likelihood, which for independent samples factorizes into a product:
L(θ) = P(data | θ) = ∏ᵢ p(xᵢ | θ)
Products of tiny probabilities underflow, so we maximize the log-likelihood (sums are nicer and the maximizer is the same):
θ* = argmax Σᵢ log p(xᵢ | θ)
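The underflow problem is easy to reproduce. The sketch below assumes 2,000 samples from a standard normal (synthetic data, chosen only for illustration) and compares the raw product of densities with the sum of their logs; `gaussian_pdf` is a helper written for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=2000)  # synthetic data (assumption)

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

densities = gaussian_pdf(x, 0.0, 1.0)

# Each factor is below 0.4, so 2000 of them multiply to a number far
# smaller than the smallest positive float64, and the product is 0.0.
likelihood = np.prod(densities)

# The sum of logs carries the same information and stays finite.
log_likelihood = np.sum(np.log(densities))

print(likelihood)       # 0.0
print(log_likelihood)   # a finite negative number
```

Because log is monotonic, whatever θ maximizes the sum of logs also maximizes the product, so nothing is lost by switching.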
For example, fit a Gaussian N(μ, σ²) to a set of samples: the maximum lands at
μ = the sample mean and σ = the sample standard deviation (dividing by n, not
n − 1). The “obvious” estimators are MLE estimators.
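To see this numerically, here is a minimal sketch assuming synthetic Gaussian data. It evaluates the log-likelihood at the sample mean and the sample standard deviation (NumPy's default `ddof=0`, which divides by n), then at a few nearby parameter values. The function name and the perturbation sizes are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=3.0, scale=2.0, size=500)  # synthetic data (assumption)

def gaussian_log_likelihood(data, mu, sigma):
    """Log-likelihood of i.i.d. samples under N(mu, sigma^2)."""
    n = data.size
    return (-n * np.log(sigma)
            - 0.5 * n * np.log(2.0 * np.pi)
            - np.sum((data - mu) ** 2) / (2.0 * sigma ** 2))

# Closed-form maximum likelihood estimates.
mu_hat = data.mean()
sigma_hat = data.std()  # ddof=0: divides by n, which is the MLE

best = gaussian_log_likelihood(data, mu_hat, sigma_hat)

# Nudging either parameter in either direction lowers the log-likelihood.
perturbations = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]
nearby = [gaussian_log_likelihood(data, mu_hat + d_mu, sigma_hat + d_sigma)
          for d_mu, d_sigma in perturbations]

print(best, max(nearby))
```

Every value in `nearby` comes out below `best`, which is what you expect at a maximum.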
Reveal #1: squared error is Gaussian likelihood
Assume your regression targets are the true function plus Gaussian noise:
y = f(x) + ε, ε ~ N(0, σ²). Write out the log-likelihood, and the exponent of
the Gaussian gives:
log L = −(1/(2σ²)) Σᵢ (yᵢ − f(xᵢ))² + const
Maximizing that is identical to minimizing Σ(yᵢ − f(xᵢ))² — the sum of
squared errors. MSE is not a choice; it’s what you get from assuming
Gaussian noise.
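The relationship can be checked directly. This sketch assumes a linear f(x) = a·x + b, a known noise level `sigma`, and synthetic data; all three are illustrative choices. It confirms that the Gaussian log-likelihood equals −SSE/(2σ²) plus a constant that does not depend on the parameters, so the least-squares fit is also the maximum likelihood fit.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 0.5                                   # assumed known noise std
x = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * x + 1.0 + rng.normal(0.0, sigma, size=x.size)

def sse(a, b):
    """Sum of squared errors of the line a*x + b."""
    return np.sum((y - (a * x + b)) ** 2)

def log_likelihood(a, b):
    """Gaussian log-likelihood of the data given the line a*x + b."""
    resid = y - (a * x + b)
    return np.sum(-0.5 * np.log(2.0 * np.pi * sigma ** 2)
                  - resid ** 2 / (2.0 * sigma ** 2))

# The "const" term: the same for every choice of (a, b).
const = -0.5 * x.size * np.log(2.0 * np.pi * sigma ** 2)

candidates = [(2.0, 1.0), (0.0, 0.0), (-3.0, 5.0)]
identity_holds = all(
    np.isclose(log_likelihood(a, b), -sse(a, b) / (2.0 * sigma ** 2) + const)
    for a, b in candidates
)

# Ordinary least squares (np.polyfit returns slope first, then intercept).
a_ls, b_ls = np.polyfit(x, y, deg=1)

print(identity_holds)
print(log_likelihood(a_ls, b_ls) >= log_likelihood(2.0, 1.0))
```

The least-squares line scores at least as high a likelihood as the parameters that generated the data, because it has the smallest SSE on this sample.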
Reveal #2: cross-entropy is classification likelihood
For a binary label with predicted probability p, each example contributes
p if the label is 1 and 1−p if it’s 0 — a Bernoulli likelihood. Its
negative log is exactly:
−[ y log p + (1−y) log(1−p) ]
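the binary cross-entropy loss. The multi-class version is the same construction with a categorical likelihood, where each example contributes −log of the probability assigned to its true class. Cross-entropy isn’t a separate invention; it’s the negative log-likelihood of a classifier.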
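A small numerical check of the binary case, using made-up labels and predicted probabilities: the negative log of the Bernoulli likelihood and the cross-entropy formula produce the same number.

```python
import numpy as np

# Hypothetical labels and predicted probabilities, for illustration only.
y = np.array([1, 0, 1, 1, 0])
p = np.array([0.9, 0.2, 0.6, 0.3, 0.5])

# Bernoulli likelihood: each example contributes p if y == 1, else 1 - p.
per_example_likelihood = np.where(y == 1, p, 1.0 - p)
nll = -np.sum(np.log(per_example_likelihood))

# Binary cross-entropy, summed over examples.
bce = -np.sum(y * np.log(p) + (1 - y) * np.log(1.0 - p))

print(nll, bce)
```

The two values agree because y acts as a switch: it selects log p when the label is 1 and log(1 − p) when it is 0.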
MAP: maximum likelihood meets a prior
What if you have a belief about the parameters before seeing data, say,
“weights should be small”? Encode it as a prior P(θ) and maximize the
posterior, which by Bayes’ rule is proportional to likelihood × prior. Taking logs:
θ_MAP = argmax [ Σᵢ log p(xᵢ | θ) + log P(θ) ]
(first term: log-likelihood; second term: log-prior)
Adding a prior pulls the estimate toward it. And the punchline:
- A Gaussian prior on the weights adds λ‖w‖² → L2 / ridge.
- A Laplace prior adds λ‖w‖₁ → L1 / lasso.
Regularization is MAP estimation. The penalty you bolt on is literally a
prior belief, and λ is its strength.
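Here is a sketch of the Gaussian-prior case for linear regression, assuming Gaussian noise, a zero-mean Gaussian prior on the weights, and synthetic data. Under those assumptions the negative log-posterior is, up to scaling and constants, the sum of squared errors plus λ‖w‖². The value of `lam`, the data sizes, and `w_true` are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 50, 5
X = rng.normal(size=(n, d))                      # synthetic features (assumption)
w_true = np.array([1.5, -2.0, 0.0, 0.5, 3.0])    # hypothetical true weights
y = X @ w_true + rng.normal(0.0, 1.0, size=n)

lam = 10.0  # prior strength (arbitrary value for the demo)

# MLE under Gaussian noise: ordinary least squares.
w_mle = np.linalg.solve(X.T @ X, X.T @ y)

# MAP with a zero-mean Gaussian prior: the ridge closed form.
w_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Gradient of  sum((y - Xw)^2) + lam * ||w||^2  evaluated at w_map.
grad_at_map = -2.0 * X.T @ (y - X @ w_map) + 2.0 * lam * w_map

print(np.linalg.norm(w_mle), np.linalg.norm(w_map))
print(np.abs(grad_at_map).max())
```

The gradient vanishes at `w_map`, so the ridge solution is the minimizer of the penalized loss, and its norm is smaller than that of `w_mle`: the prior has pulled the weights toward zero.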
Practice this in an interview
MLE maximises the likelihood of the data alone; MAP (Maximum A Posteriori) adds a prior over parameters and maximises the posterior, making it equivalent to regularised MLE. Frequentists treat parameters as fixed unknowns; Bayesians treat them as random variables with a prior distribution.
Ridge regression is equivalent to maximum a posteriori (MAP) estimation with a zero-mean Gaussian prior on the coefficients. The regularization strength λ corresponds to the ratio of the noise variance to the prior variance — stronger regularization means you believe coefficients are drawn from a tighter distribution around zero.
MSE treats class probabilities as continuous values and produces tiny, saturating gradients when a sigmoid output is near 0 or 1, stalling learning. Cross-entropy is the proper log-likelihood loss for categorical distributions; it keeps gradients large and informative even when the network is very wrong, and its minimum aligns with the true class probabilities.
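The gradient difference can be made concrete. This sketch assumes a single sigmoid output with true label y = 1 and compares the gradient with respect to the logit z for the two losses. For squared error (p − y)² the gradient is 2(p − y)·p(1 − p); for cross-entropy it simplifies to p − y. The logit values are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = 1.0                                     # true label
logits = np.array([-10.0, -5.0, 0.0, 5.0])  # from very wrong to mostly right
p = sigmoid(logits)

# d/dz of (p - y)^2: the sigmoid derivative p(1 - p) shrinks the signal.
grad_mse = 2.0 * (p - y) * p * (1.0 - p)

# d/dz of -[y log p + (1 - y) log(1 - p)]: the p(1 - p) factor cancels.
grad_ce = p - y

for z, g_mse, g_ce in zip(logits, grad_mse, grad_ce):
    print(f"z={z:6.1f}  mse_grad={g_mse: .6f}  ce_grad={g_ce: .6f}")
```

At z = −10 the model is confidently wrong, yet the squared-error gradient is close to zero while the cross-entropy gradient is close to −1.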
Log loss (cross-entropy loss) measures how well a model's predicted probabilities match the true labels: it is the negative log-likelihood of the correct class. It penalises confident wrong predictions severely because −log(p) grows without bound as the probability p assigned to the correct class approaches zero. In the binary case, predicting 0.99 for the wrong class (loss ≈ 4.6) incurs roughly 5x the penalty of predicting 0.6 for the wrong class (loss ≈ 0.92). A perfect model achieves 0; a binary classifier that always predicts 0.5 achieves ln(2) ≈ 0.693.
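A minimal sketch for computing these per-example penalties directly, assuming the binary case and natural logarithms. Predicting 0.99 for the wrong class leaves 0.01 for the correct one, and predicting 0.6 for the wrong class leaves 0.4. The helper name is an illustrative choice.

```python
import numpy as np

def log_loss_single(p_correct):
    """Log loss of one example, given the probability on the correct class."""
    return -np.log(p_correct)

loss_confident_wrong = log_loss_single(0.01)  # 0.99 placed on the wrong class
loss_mild_wrong = log_loss_single(0.4)        # 0.6 placed on the wrong class
loss_coin_flip = log_loss_single(0.5)         # always predicting 0.5
loss_perfect = log_loss_single(1.0)           # all mass on the correct class

ratio = loss_confident_wrong / loss_mild_wrong

print(loss_confident_wrong)  # about 4.605
print(loss_mild_wrong)       # about 0.916
print(ratio)                 # about 5.03
print(loss_coin_flip)        # about 0.693, i.e. ln(2)
```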