Loss Functions: MSE, Cross-Entropy & Friends
The loss is the only thing your network actually optimizes. Pick the wrong one and training chases the wrong goal — here is how to choose.
What you'll learn
- Why MSE punishes large errors quadratically and when that hurts you
- How cross-entropy loss derives from maximum likelihood and why it blows up on confident wrong predictions
- Which loss to pair with which output layer — and why MSE + sigmoid is a trap
Before you start
A team once trained a fraud-detection model for weeks. Accuracy on the test set hit 99.8 %. They shipped it. The fraud rate did not budge.
The model had learned one trick: predict “not fraud” for every transaction. With 0.2 % fraud in the data, that gives 99.8 % accuracy and zero useful signal. The problem was not the architecture or the learning rate. It was the loss function (the single number the optimizer minimizes each step). Accuracy was the number the team watched, but accuracy is not differentiable, so it cannot be optimized directly. The surrogate loss they trained on instead rewarded the trivial solution.
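The arithmetic behind that trap is easy to check. The sketch below uses a hypothetical dataset of 1000 transactions with 2 fraud cases (matching the 0.2 % rate in the story) and a model that always answers "not fraud"; the dataset size and variable names are illustrative assumptions, not details from the original incident.

```python
# Hypothetical dataset: 1000 transactions, 2 of them fraud (0.2 %).
labels = [1] * 2 + [0] * 998           # 1 = fraud, 0 = not fraud
predictions = [0] * len(labels)        # trivial model: always "not fraud"

# Accuracy: fraction of predictions that match the label.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall: fraction of actual fraud cases the model caught.
caught = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = caught / sum(labels)

print(f"accuracy = {accuracy:.3f}")    # accuracy = 0.998
print(f"recall   = {recall:.3f}")      # recall   = 0.000
```

Accuracy looks excellent while recall on the class that matters is zero, which is why accuracy alone says little on heavily imbalanced data.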
The loss is the only thing your network actually optimizes. Everything else — architecture, data augmentation, regularization — only matters insofar as it shapes what the loss landscape looks like. Get the loss wrong and you are optimizing the wrong goal, no matter how clever the rest of your setup is.
Regression losses
Regression tasks predict a continuous number — house price, temperature, pixel intensity. You have three main options.
Mean Squared Error (MSE)
MSE averages the squared difference between predictions and targets:
MSE = (1/n) * sum( (pred_i - target_i)^2 )
Squaring does two things. First, it makes every term non-negative, so errors in opposite directions do not cancel. Second, it penalizes large errors quadratically (the cost grows with the square of the error): an error of 10 costs 100 times more than an error of 1. That amplification is useful when large errors really are disproportionately costly, and it gives a smooth, differentiable landscape that gradient descent loves.
The same amplification is the weakness. One outlier — a mislabeled target or an extreme value — dominates the loss and pulls the whole model toward it.
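A minimal sketch of that outlier effect, using made-up numbers: four predictions that are each off by 1, then the same predictions with one target corrupted so that a single error becomes 10. The data and the `mse` helper are illustrative assumptions.

```python
def mse(preds, targets):
    """Mean squared error over paired predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

preds = [11.0, 11.0, 12.0, 12.0]
targets = [10.0, 12.0, 11.0, 13.0]         # every prediction is off by 1

clean_loss = mse(preds, targets)

# Corrupt one target (say, a mislabeled value) so one error becomes 10.
noisy_targets = [10.0, 12.0, 11.0, 22.0]
noisy_loss = mse(preds, noisy_targets)

print(clean_loss)    # 1.0
print(noisy_loss)    # 25.75
```

The single bad sample contributes 100 of the 103 units of summed squared error, so almost all of the gradient comes from it.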
Mean Absolute Error (MAE)
MAE averages the absolute difference:
MAE = (1/n) * sum( |pred_i - target_i| )
Every unit of error costs the same regardless of magnitude, so outliers have far less influence. The catch: the absolute value function has a kink (a non-differentiable point) at zero, and its gradient has the same magnitude for every nonzero error. Updates therefore do not shrink as predictions approach the target, so with a fixed learning rate the model can oscillate around the optimum instead of settling.
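To make the contrast with MSE concrete, here is a small sketch with hypothetical data containing one outlier. The helper names (`mae`, `mae_grad`, `mse_grad`) are illustrative, and returning 0 at the kink is a common convention assumed here, not the only valid choice.

```python
def mae(preds, targets):
    """Mean absolute error over paired predictions and targets."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)

def mae_grad(pred, target):
    """Subgradient of |pred - target| with respect to pred."""
    error = pred - target
    if error > 0:
        return 1.0
    if error < 0:
        return -1.0
    return 0.0    # assumed convention at the kink, where no derivative exists

def mse_grad(pred, target):
    """Gradient of (pred - target)^2 with respect to pred."""
    return 2.0 * (pred - target)

preds = [11.0, 11.0, 12.0, 12.0]
targets = [10.0, 12.0, 11.0, 22.0]          # last target is an outlier
print(mae(preds, targets))                  # 3.25

# The MAE gradient has the same size for a large error and a tiny one.
print(mae_grad(15.0, 10.0), mae_grad(10.001, 10.0))    # 1.0 1.0
# The MSE gradient shrinks with the error (10.0, then roughly 0.002).
print(mse_grad(15.0, 10.0), mse_grad(10.001, 10.0))
```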
Huber loss
Huber loss (parameterized by a threshold delta) blends both: it is MSE-like for small errors and MAE-like for large ones. With e = pred_i - target_i:
Huber(e) = 0.5 * e^2                     if |e| <= delta
         = delta * (|e| - 0.5 * delta)   otherwise
You get smooth gradients near zero and robustness to outliers. The trade-off is one extra hyperparameter to tune.
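The piecewise definition translates directly into code. This is a minimal per-sample sketch; the default `delta=1.0` is an assumption chosen for illustration, not a recommended setting.

```python
def huber(error, delta=1.0):
    """Huber loss for a single error value."""
    abs_error = abs(error)
    if abs_error <= delta:
        return 0.5 * error ** 2                  # quadratic (MSE-like) region
    return delta * (abs_error - 0.5 * delta)     # linear (MAE-like) region

# Compare against the purely quadratic 0.5 * e^2 as the error grows.
for e in [0.5, 1.0, 3.0, 10.0]:
    print(e, 0.5 * e ** 2, huber(e))
# 0.5 0.125 0.125
# 1.0 0.5 0.5
# 3.0 4.5 2.5
# 10.0 50.0 9.5
```

The two branches agree at |e| = delta, so the loss is continuous there, and beyond the threshold the cost grows linearly instead of quadratically.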
Figure: (left) MSE (parabola) punishes large errors quadratically; MAE (V) is linear; Huber transitions between them. (Right) Cross-entropy explodes as the true-class probability approaches 0, which is exactly the gradient signal you need.
Classification losses
Cross-entropy (log loss)
For a single sample, cross-entropy loss is:
CE = -log( p_true )
where p_true is the probability your model assigned to the correct
class. That is the entire formula for the single-label case.
Why the negative log? Two reasons.
Mathematically: this falls directly out of maximum likelihood
estimation (MLE) — the principle of finding parameters that make the
observed data most probable. Maximizing log(p_true) is equivalent to
minimizing -log(p_true), so cross-entropy is just MLE with a sign flip
to turn it into a minimization problem.
Practically: the gradient signal is exactly right. When your model is
confidently wrong (it gives the true class only p = 0.1), the loss is
-log(0.1) = 2.303, using the natural log. When it is confidently right at
p = 0.9, the loss is -log(0.9) = 0.105. That is roughly a 22x difference in
penalty between two predictions that differ only in how much probability
they put on the true class. A softly wrong model gets a gentle nudge; a
confidently wrong model gets slammed, which is exactly what drives learning.
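The playground output:
- MSE = 0.1667, MAE = 0.3333, Huber = 0.0833 for the same three errors. MSE comes out below MAE here because none of these errors exceeds 1 in magnitude: squaring shrinks errors smaller than 1 and only amplifies errors larger than 1.
- Cross-entropy at p = 0.90 is 0.1054; at p = 0.10 it is 2.3026. The confident-wrong penalty is 21.9x larger.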
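The cross-entropy figures quoted in this section can be reproduced with a few lines of standard-library Python. The function name `cross_entropy` is an illustrative choice, and the log is the natural log.

```python
import math

def cross_entropy(p_true):
    """Single-sample cross-entropy: negative log of the true-class probability."""
    return -math.log(p_true)

confident_right = cross_entropy(0.90)
confident_wrong = cross_entropy(0.10)

print(f"{confident_right:.4f}")                       # 0.1054
print(f"{confident_wrong:.4f}")                       # 2.3026
print(f"{confident_wrong / confident_right:.1f}x")    # 21.9x
```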
Hinge loss
Hinge loss is the SVM family’s native loss:
Hinge = max(0, 1 - y * score)
where y is +1 or -1 and score is the raw output (no probability).
It does not require a probability estimate at all. You will encounter it
when reading about SVMs or when a paper uses a margin-based objective, but
it is rare in modern deep learning.
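A minimal sketch of the hinge formula for a single sample, with hypothetical scores chosen to show the three cases that matter: safely correct, correct but inside the margin, and wrong.

```python
def hinge(y, score):
    """Hinge loss for one sample; y is +1 or -1, score is the raw model output."""
    return max(0.0, 1.0 - y * score)

print(hinge(+1, 2.5))     # 0.0   correct and outside the margin: no penalty
print(hinge(+1, 0.25))    # 0.75  correct side, but inside the margin
print(hinge(-1, 0.25))    # 1.25  wrong side of the boundary
```

Once a sample is classified correctly with a margin of at least 1, its loss is exactly zero and it stops contributing gradient, unlike cross-entropy, which keeps rewarding higher confidence.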
The output layer must match the loss
Each loss assumes a specific form of output.
| Task | Output layer | Loss |
|---|---|---|
| Binary classification | Sigmoid (output in (0, 1)) | Binary cross-entropy |
| Multi-class (one correct) | Softmax (outputs sum to 1) | Categorical cross-entropy |
| Regression | Linear (no activation) | MSE, MAE, or Huber |
| Multi-label (many correct) | Sigmoid per class | Binary cross-entropy per class |
Softmax + cross-entropy is the canonical pairing because softmax turns
raw scores (called logits — the unnormalized pre-softmax values) into a
proper probability distribution, and cross-entropy measures how far that
distribution is from a one-hot truth. PyTorch’s nn.CrossEntropyLoss
fuses the softmax and the log internally for numerical stability — you feed
it raw logits, not softmax outputs.
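To see why fusing the softmax and the log helps, here is a pure-Python sketch of the log-sum-exp trick. It illustrates the general idea only and is not PyTorch's actual implementation; the function name `cross_entropy_from_logits` and the extreme logits are assumptions for the demonstration.

```python
import math

def cross_entropy_from_logits(logits, true_index):
    """Cross-entropy from raw logits, using the log-sum-exp trick."""
    max_logit = max(logits)
    # Subtract the max before exponentiating so exp() cannot overflow.
    log_sum_exp = max_logit + math.log(
        sum(math.exp(z - max_logit) for z in logits)
    )
    # -log(softmax(logits)[true_index]) = log_sum_exp - logits[true_index]
    return log_sum_exp - logits[true_index]

# Extreme logits: a naive math.exp(1000.0) would raise OverflowError.
extreme = [1000.0, 0.0, -1000.0]
print(cross_entropy_from_logits(extreme, 0))    # 0.0
print(cross_entropy_from_logits(extreme, 1))    # 1000.0

# For moderate logits the result matches the naive softmax-then-log route.
moderate = [2.0, 1.0, 0.1]
naive = -math.log(math.exp(moderate[0]) / sum(math.exp(z) for z in moderate))
print(math.isclose(cross_entropy_from_logits(moderate, 0), naive))    # True
```

Computing softmax first and taking the log afterwards would overflow or hit log(0) on the extreme case, which is the failure the fused form avoids.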
Why cross-entropy falls out of maximum likelihood
If you want to understand why cross-entropy is the right choice, not just that it is, here is the one-paragraph derivation.
Suppose your model assigns probability p_i to the true class of sample i. You
observe n independent samples. The likelihood of the whole dataset is the product
p_1 * p_2 * ... * p_n. Taking the log turns the product into a sum:
log(p_1) + log(p_2) + ... + log(p_n). Maximizing this (MLE) is the same
as minimizing -(1/n) * sum(log(p_i)) — which is exactly mean cross-entropy.
There is no hand-waving: cross-entropy is MLE for a categorical distribution.
That is why it is the principled default for any task where the output is
a probability over classes.
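The derivation can be checked numerically. The sketch below uses hypothetical true-class probabilities for four samples; the values and variable names are assumptions for illustration.

```python
import math

# Hypothetical true-class probabilities for n = 4 independent samples.
p = [0.9, 0.6, 0.8, 0.3]

likelihood = math.prod(p)                           # p_1 * p_2 * ... * p_n
log_likelihood = sum(math.log(p_i) for p_i in p)    # log(p_1) + ... + log(p_n)
mean_cross_entropy = -log_likelihood / len(p)

# The log of the product equals the sum of the logs (up to float rounding).
print(math.isclose(math.log(likelihood), log_likelihood))    # True

# A model that puts more probability on the true classes has a higher
# likelihood and therefore a lower mean cross-entropy.
better_p = [0.95, 0.7, 0.9, 0.5]
better_ce = -sum(math.log(p_i) for p_i in better_p) / len(better_p)
print(better_ce < mean_cross_entropy)                        # True
```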
Next
Optimizers: SGD, Adam, and when to switch — now that you know what the loss measures, learn how the optimizer steps through it.
Practice this in an interview
Used for classification, MSE treats class probabilities as continuous values and produces tiny, saturating gradients when a sigmoid output is near 0 or 1, stalling learning. Cross-entropy is the proper log-likelihood loss for categorical distributions; it keeps gradients large and informative even when the network is very wrong, and its minimum aligns with the true class probabilities.
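The saturation is visible in the gradients with respect to the pre-sigmoid value z. In this sketch, the squared-error loss is taken as (sigmoid(z) - y)^2 for a single sample, and z = -10 with label y = 1 is a hypothetical confidently wrong prediction; both are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_mse_sigmoid(z, y):
    """d/dz of (sigmoid(z) - y)^2, by the chain rule."""
    s = sigmoid(z)
    return 2.0 * (s - y) * s * (1.0 - s)

def grad_bce_sigmoid(z, y):
    """d/dz of binary cross-entropy on sigmoid(z); it simplifies to s - y."""
    return sigmoid(z) - y

z, y = -10.0, 1.0    # confidently wrong: sigmoid(z) is tiny, label is 1
grad_mse = grad_mse_sigmoid(z, y)
grad_bce = grad_bce_sigmoid(z, y)

print(f"MSE gradient: {grad_mse:.2e}")    # on the order of -1e-4
print(f"BCE gradient: {grad_bce:.4f}")    # close to -1
```

The s * (1 - s) factor from the sigmoid derivative crushes the MSE gradient exactly when the model is most wrong, while in the cross-entropy gradient that factor cancels out.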
Log loss (cross-entropy loss) measures how well a model's predicted probabilities match the true labels: it is the negative log-likelihood of the correct class. It penalises confident wrong predictions severely because log(p) approaches negative infinity as p approaches zero. In a binary problem, predicting 0.99 for the wrong class leaves 0.01 for the true class and costs -ln(0.01) ≈ 4.61, roughly 5x the -ln(0.4) ≈ 0.92 incurred by predicting 0.6 for the wrong class. A perfect model achieves 0; a binary classifier that always predicts 0.5 achieves ln(2) ≈ 0.693.
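Logistic regression minimizes binary cross-entropy (log-loss), which is the negative log-likelihood of the Bernoulli distribution given the sigmoid-transformed linear predictions. The Hessian of log-loss with respect to the weights is positive semi-definite everywhere, so the loss surface is convex and any local minimum is a global one. The minimum is unique only when the surface is strictly convex, and on perfectly separable data no finite minimizer exists without regularization.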
When training loss keeps falling while validation loss rises, that divergence is the signature of overfitting: the model has enough capacity to memorise training-set specifics (noise, label errors, dataset-specific patterns) that do not generalise. Training loss measures fit to what has already been seen; validation loss measures generalisation to held-out data. As the model memorises rather than learns structure, it scores better on training data and worse on everything else.