datarekha

Maximum likelihood & MAP: where loss functions come from

Why is the loss for regression squared error, and for classification cross-entropy? Both fall out of one principle — maximum likelihood — and adding a prior turns it into MAP, which is exactly what regularization is. This is the bridge from probability to training.

9 min read Advanced Math for ML Lesson 27 of 30

What you'll learn

  • Likelihood vs probability, and the MLE principle: pick θ that makes the data most probable
  • Why Gaussian errors make maximum likelihood equal to minimizing squared error (MSE)
  • Why classification's cross-entropy loss is just the negative log-likelihood
  • MAP = MLE + a prior, and how that prior is exactly L2 / L1 regularization
  • That almost every loss you train with is a disguised negative log-likelihood

Before you start

You’ve trained with mean squared error and cross-entropy without asking the obvious question: why those? Why not mean absolute error, or something else? The answer is one of the most unifying ideas in ML — maximum likelihood — and once you see it, every loss function stops being a recipe and becomes a consequence.

The likelihood principle

Flip the usual question around. Instead of “given the model, how probable is the data,” ask: “which model parameters make the data I actually observed most probable?” That’s the likelihood, which for independent observations is a product of per-example probabilities:

L(θ) = P(data | θ) = ∏ᵢ p(xᵢ | θ)

Products of tiny probabilities underflow, so we maximize the log-likelihood (sums are nicer and the maximizer is the same):

θ* = argmax  Σᵢ log p(xᵢ | θ)

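For example, fit a Gaussian N(μ, σ²) to a sample: the log-likelihood peaks at μ = the sample mean and σ = the sample standard deviation (the 1/n version). The “obvious” estimators are the maximum likelihood estimators.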
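As a sanity check on the Gaussian case, here is a minimal sketch that compares the closed-form estimates with a brute-force search over the log-likelihood. The synthetic sample, the seed, and the grid ranges are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=2.0, size=1000)  # synthetic sample (assumed)

def gaussian_log_likelihood(x, mu, sigma):
    """Sum over the sample of log N(x_i | mu, sigma^2)."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu) ** 2 / (2 * sigma**2))

# Closed-form MLE: sample mean and the 1/n standard deviation (ddof=0).
mu_hat = data.mean()
sigma_hat = data.std(ddof=0)

# Brute force: evaluate the log-likelihood on a grid around those estimates.
mus = np.linspace(mu_hat - 1.0, mu_hat + 1.0, 101)
sigmas = np.linspace(sigma_hat - 0.5, sigma_hat + 0.5, 101)
ll = np.array([[gaussian_log_likelihood(data, m, s) for s in sigmas]
               for m in mus])
i, j = np.unravel_index(np.argmax(ll), ll.shape)
grid_mu, grid_sigma = mus[i], sigmas[j]

print("closed form:", mu_hat, sigma_hat)
print("grid search:", grid_mu, grid_sigma)
```

The grid maximum should agree with the closed-form values up to the grid spacing, which is the point: the familiar estimators are what maximizing the likelihood returns.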

Reveal #1: squared error is Gaussian likelihood

Assume your regression targets are the true function plus Gaussian noise: y = f(x) + ε, ε ~ N(0, σ²). Each yᵢ is then Gaussian with mean f(xᵢ), so taking the log of its density leaves the exponent plus terms that don’t depend on f:

log L = −(1/(2σ²)) Σᵢ (yᵢ − f(xᵢ))²  + const

Maximizing that is identical to minimizing Σ(yᵢ − f(xᵢ))² — the sum of squared errors. MSE is not a choice; it’s what you get from assuming Gaussian noise.
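A small numerical sketch of that equivalence, assuming a linear f, a known noise level σ, and made-up data: the Gaussian negative log-likelihood is a constant plus SSE/(2σ²), so both criteria rank any set of candidate parameters identically and the least-squares fit is also the maximum likelihood fit.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.5                                   # noise std, assumed known
x = rng.uniform(-3.0, 3.0, size=200)
y = 2.0 * x + 1.0 + rng.normal(0.0, sigma, size=200)  # line + Gaussian noise
X = np.column_stack([x, np.ones_like(x)])     # columns: slope, intercept

def sse(w):
    """Sum of squared errors for parameters w = (slope, intercept)."""
    return np.sum((y - X @ w) ** 2)

def neg_log_likelihood(w):
    """Gaussian negative log-likelihood: a constant plus SSE / (2 sigma^2)."""
    n = len(y)
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + sse(w) / (2 * sigma**2)

# Least-squares solution.
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Random candidate parameters: both criteria should order them the same way.
candidates = rng.normal(0.0, 2.0, size=(50, 2))
sse_vals = np.array([sse(w) for w in candidates])
nll_vals = np.array([neg_log_likelihood(w) for w in candidates])
same_ranking = np.array_equal(np.argsort(sse_vals), np.argsort(nll_vals))

print("least-squares fit:", w_ls)
print("same ranking under SSE and NLL:", same_ranking)
```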

Reveal #2: cross-entropy is classification likelihood

For a binary label with predicted probability p, each example contributes p if the label is 1 and 1−p if it’s 0 — a Bernoulli likelihood. Its negative log is exactly:

−[ y log p + (1−y) log(1−p) ]

the binary cross-entropy loss. The multi-class version comes from a categorical likelihood in the same way: −Σₖ yₖ log pₖ for a one-hot label y. Cross-entropy isn’t a separate invention; it’s the negative log-likelihood of a classifier.
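To make the identity concrete, here is a minimal sketch with made-up labels and predicted probabilities. It builds the Bernoulli likelihood directly, takes its negative log, and compares that with the summed binary cross-entropy formula.

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0])               # labels (made up)
p = np.array([0.9, 0.2, 0.6, 0.99, 0.4])    # predicted P(y = 1) (made up)

# Bernoulli likelihood per example: p if the label is 1, otherwise 1 - p.
per_example = np.where(y == 1, p, 1 - p)
nll = -np.log(np.prod(per_example))          # negative log of the product

# Binary cross-entropy, summed over examples.
bce = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

print(nll, bce)
```

The two numbers agree up to floating-point error. Frameworks typically report the mean rather than the sum, which rescales the loss but does not move the minimizer.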

MAP: maximum likelihood meets a prior

What if you have a belief about the parameters before seeing data, say “weights should be small”? Encode it as a prior P(θ) and maximize the posterior, which by Bayes’ rule is proportional to likelihood × prior. Taking logs turns that product into a sum:

θ_MAP = argmax  [ Σ log p(xᵢ | θ)  +  log P(θ) ]
                  └ likelihood ┘      └ prior ┘

Adding the prior pulls the estimate toward it. And the punchline:

  • A Gaussian prior on the weights adds λ‖w‖² → L2 / ridge.
  • A Laplace prior adds λ‖w‖₁ → L1 / lasso.

Regularization is MAP estimation. The penalty you bolt on is literally a prior belief, and λ is its strength.
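The Gaussian case can be checked directly. This is a sketch under assumptions: a linear model with Gaussian noise of standard deviation σ, a zero-mean Gaussian prior with standard deviation τ on each weight, and synthetic data. Under those assumptions the negative log-posterior is ‖y − Xw‖²/(2σ²) + ‖w‖²/(2τ²) plus a constant, so its minimizer is the ridge solution with λ = σ²/τ².

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 3
sigma, tau = 1.0, 0.5                  # noise std and prior std (assumed)
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])    # made-up ground truth
y = X @ w_true + rng.normal(0.0, sigma, size=n)

def neg_log_posterior(w):
    """Negative log-likelihood plus negative log-prior, constants dropped."""
    nll = np.sum((y - X @ w) ** 2) / (2 * sigma**2)
    neg_log_prior = np.sum(w**2) / (2 * tau**2)
    return nll + neg_log_prior

lam = sigma**2 / tau**2                # ridge strength implied by the prior

# MLE (ordinary least squares) and MAP (ridge) in closed form.
w_mle = np.linalg.solve(X.T @ X, X.T @ y)
w_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# The gradient of the negative log-posterior should vanish at the ridge solution.
grad_at_map = -(X.T @ (y - X @ w_map)) / sigma**2 + w_map / tau**2

print("MLE:", w_mle)
print("MAP:", w_map)
print("gradient at MAP:", grad_at_map)
```

A tighter prior (smaller τ) means a larger λ and weights pulled harder toward zero; as τ grows the prior flattens and the MAP estimate approaches the MLE.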

Quick check

Q1. Minimizing mean squared error is equivalent to maximum likelihood under what assumption?
Q2. How does L2 regularization relate to MAP estimation?
Q3. Why is cross-entropy the 'natural' loss for classification?

Practice this in an interview

How does MLE differ from MAP estimation, and what is the frequentist vs Bayesian divide?

MLE maximises the likelihood of the data alone; MAP (Maximum A Posteriori) adds a prior over parameters and maximises the posterior, making it equivalent to regularised MLE. Frequentists treat parameters as fixed unknowns; Bayesians treat them as random variables with a prior distribution.

What is the Bayesian interpretation of Ridge regression, and what prior does it correspond to?

Ridge regression is equivalent to maximum a posteriori (MAP) estimation with a zero-mean Gaussian prior on the coefficients. The regularization strength λ corresponds to the ratio of the noise variance to the prior variance — stronger regularization means you believe coefficients are drawn from a tighter distribution around zero.

Why use cross-entropy loss instead of MSE for classification?

MSE treats class probabilities as continuous values and produces tiny, saturating gradients when a sigmoid output is near 0 or 1, stalling learning. Cross-entropy is the proper log-likelihood loss for categorical distributions; it keeps gradients large and informative even when the network is very wrong, and its minimum aligns with the true class probabilities.
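The gradient claim can be checked numerically. This sketch assumes a single sigmoid output with true label y = 1 and compares the derivative of each loss with respect to the logit z; the logit values are arbitrary, with large negative z meaning "confidently wrong".

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = 1.0
z = np.array([-8.0, -4.0, 0.0, 4.0])   # logits (arbitrary test points)
p = sigmoid(z)

# d/dz of (p - y)^2 is 2 (p - y) p (1 - p): the p (1 - p) factor vanishes
# when the sigmoid saturates.
grad_mse = 2 * (p - y) * p * (1 - p)

# d/dz of -[y log p + (1 - y) log(1 - p)] simplifies to p - y.
grad_ce = p - y

for zi, gm, gc in zip(z, grad_mse, grad_ce):
    print(f"z={zi:+.0f}  dMSE/dz={gm:+.6f}  dCE/dz={gc:+.6f}")
```

At z = −8 the model is almost certain of the wrong class, yet the MSE gradient is close to zero while the cross-entropy gradient stays near −1, so only the latter gives a strong corrective signal.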

What is log loss and why does it penalise confident wrong predictions more than uncertain ones?

Log loss (cross-entropy loss) measures how well a model's predicted probabilities match the true labels: it is the negative log of the probability assigned to the correct class. It penalises confident wrong predictions severely because −log(p) grows without bound as p approaches zero. In the binary case, predicting 0.99 for the wrong class leaves 0.01 for the correct one and costs −ln(0.01) ≈ 4.61, about 5x the −ln(0.4) ≈ 0.92 cost of predicting 0.6 for the wrong class, and the gap keeps widening as confidence rises. A perfect model achieves 0; a binary classifier that always predicts 0.5 achieves ln(2) ≈ 0.693.
