datarekha

Loss Functions: MSE, Cross-Entropy & Friends

The loss is the only thing your network actually optimizes. Pick the wrong one and training chases the wrong goal — here is how to choose.

8 min read Intermediate Deep Learning Lesson 6 of 17

What you'll learn

  • Why MSE punishes large errors quadratically and when that hurts you
  • How cross-entropy loss derives from maximum likelihood and why it blows up on confident wrong predictions
  • Which loss to pair with which output layer — and why MSE + sigmoid is a trap

A team once trained a fraud-detection model for weeks. Accuracy on the test set hit 99.8 %. They shipped it. The fraud rate did not budge.

The model had learned one trick: predict “not fraud” for every transaction. With 0.2 % fraud in the data, that gives 99.8 % accuracy and zero useful signal. The problem was not the architecture or the learning rate. It was the loss function (the single number the optimizer minimizes each step). Accuracy itself cannot be differentiated, so it cannot serve as the loss directly; the team had to train on a surrogate, and the surrogate they chose rewarded the trivial solution.

The loss is the only thing your network actually optimizes. Everything else — architecture, data augmentation, regularization — only matters insofar as it shapes what the loss landscape looks like. Get the loss wrong and you are optimizing the wrong goal, no matter how clever the rest of your setup is.

Regression losses

Regression tasks predict a continuous number — house price, temperature, pixel intensity. You have three main options.

Mean Squared Error (MSE)

MSE averages the squared difference between predictions and targets:

MSE = (1/n) * sum( (pred_i - target_i)^2 )

Squaring does two things. First, it makes every term non-negative, so errors in opposite directions do not cancel. Second, it penalizes large errors quadratically (the penalty grows with the square of the error): an error of 10 costs 100 times more than an error of 1. That amplification is good when large errors really are disproportionately costly, and it gives a smooth, differentiable landscape that gradient descent loves.

The same amplification is the weakness. One outlier — a mislabeled target or an extreme value — dominates the loss and pulls the whole model toward it.
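To make the outlier effect concrete, here is a minimal sketch in plain Python. The data is hypothetical (four predictions, with one target deliberately mislabeled by 100), and `mse` is just an illustrative helper name.

```python
def mse(preds, targets):
    """Mean squared error over paired predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

# Hypothetical batch: four reasonable predictions.
preds = [10.5, 11.5, 11.0, 13.5]
targets_clean = [10.0, 12.0, 11.0, 13.0]
# Same batch, but the last target is mislabeled by 100 (an assumed outlier).
targets_outlier = [10.0, 12.0, 11.0, 113.0]

clean = mse(preds, targets_clean)      # 0.1875
dirty = mse(preds, targets_outlier)    # 2475.1875

# Fraction of the total squared error that comes from the single outlier.
outlier_share = (preds[-1] - targets_outlier[-1]) ** 2 / (dirty * len(preds))
print(clean, dirty, outlier_share)
```

In this toy batch the single bad target accounts for more than 99.9 % of the loss, so almost all of the gradient would point at fixing that one sample.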

Mean Absolute Error (MAE)

MAE averages the absolute difference:

MAE = (1/n) * sum( |pred_i - target_i| )

Every unit of error costs the same regardless of magnitude, so outliers have far less influence. The catch: the absolute value function has a kink (a non-differentiable point) at zero, and away from zero its gradient has the same magnitude however small the error is. Updates therefore do not shrink as predictions approach the target, which makes gradient-based training jumpy close to the optimum.
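A small sketch of the gradients involved may help. It compares the per-sample gradient of the absolute error with that of the squared error, both taken with respect to the prediction. The helper names and the error values are illustrative assumptions, and returning 0 at the kink is one common convention rather than the only valid choice.

```python
def mae_grad(error):
    """Subgradient of |error| with respect to the prediction: the sign of the error."""
    if error > 0:
        return 1.0
    if error < 0:
        return -1.0
    return 0.0  # at the kink any value in [-1, 1] is valid; 0 is a common convention

def mse_grad(error):
    """Gradient of error**2 with respect to the prediction."""
    return 2.0 * error

errors = [5.0, 0.5, 0.01, -0.01]            # hypothetical errors, large to tiny
mae_grads = [mae_grad(e) for e in errors]   # magnitude stays 1 however close we get
mse_grads = [mse_grad(e) for e in errors]   # shrinks toward 0 along with the error
print(mae_grads, mse_grads)
```

Note how the MAE gradient flips from +1 to -1 across zero without ever getting smaller, while the MSE gradient fades out smoothly.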

Huber loss

Huber loss (parameterized by a threshold delta) blends both: it is MSE-like for small errors and MAE-like for large ones.

Huber(e) = 0.5 * e^2          if |e| <= delta
         = delta*(|e| - 0.5*delta)   otherwise

You get smooth gradients near zero and robustness to outliers. The trade-off is one extra hyperparameter to tune.
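The piecewise formula above translates almost line for line into code. This is a sketch in plain Python, assuming delta = 1.0 and a hypothetical set of errors that includes one outlier; it also computes MSE and MAE on the same errors for comparison.

```python
def huber(error, delta=1.0):
    """Huber loss for one error: quadratic inside delta, linear outside."""
    a = abs(error)
    if a <= delta:
        return 0.5 * error ** 2
    return delta * (a - 0.5 * delta)

errors = [0.5, -0.5, 0.0, 10.0]   # hypothetical errors; the last one is an outlier

mse_value = sum(e ** 2 for e in errors) / len(errors)     # 25.125
mae_value = sum(abs(e) for e in errors) / len(errors)     # 2.75
huber_value = sum(huber(e) for e in errors) / len(errors) # 2.4375
print(mse_value, mae_value, huber_value)
```

The two branches meet at |e| = delta (both give 0.5 * delta^2), which is what keeps the loss and its gradient continuous there.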

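[Figure. Left panel, "Regression losses": loss plotted against error for MSE, Huber, and MAE. Right panel, "Cross-entropy vs true-class probability": −log(p) plotted against the true-class prob (p) from 0 to 1, with loss → 0 as p → 1 and loss → ∞ as p → 0.]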

Left: MSE (parabola) punishes large errors quadratically; MAE (V) is linear; Huber transitions between them. Right: Cross-entropy explodes as the true-class probability approaches 0 — exactly the gradient signal you need.

Classification losses

Cross-entropy (log loss)

For a single sample, cross-entropy loss is:

CE = -log( p_true )

where p_true is the probability your model assigned to the correct class. That is the entire formula for the single-label case.

Why the negative log? Two reasons.

Mathematically: this falls directly out of maximum likelihood estimation (MLE) — the principle of finding parameters that make the observed data most probable. Maximizing log(p_true) is equivalent to minimizing -log(p_true), so cross-entropy is just MLE with a sign flip to turn it into a minimization problem.

Practically: the gradient signal is exactly right. When your model is confidently wrong, giving the true class only p = 0.1, the loss is -log(0.1) = 2.303. When it is confidently right at p = 0.9, the loss is -log(0.9) = 0.105. That is roughly a 22x difference in penalty on the same sample, driven purely by how much probability the model put on the correct class. A softly wrong model gets a gentle nudge; a confidently wrong model gets slammed, which is exactly what drives learning.
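These numbers are easy to check. A minimal sketch using only the standard library (`cross_entropy` is an illustrative helper name, and log here is the natural log):

```python
import math

def cross_entropy(p_true):
    """Single-sample cross-entropy: negative natural log of the true-class probability."""
    return -math.log(p_true)

loss_wrong = cross_entropy(0.1)   # model gave the true class only 10 %
loss_right = cross_entropy(0.9)   # model gave the true class 90 %
ratio = loss_wrong / loss_right

print(round(loss_wrong, 4), round(loss_right, 4), round(ratio, 1))
```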

The playground output:

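  • MSE = 0.1667, MAE = 0.3333, Huber = 0.0833 on the same three errors. MSE comes out below MAE here, which is what squaring does to errors smaller than 1; the amplification only kicks in for errors larger than 1.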
  • Cross-entropy at p = 0.90 is 0.1054; at p = 0.10 it is 2.3026 — the confident-wrong penalty is 21.9x larger.

Hinge loss

Hinge loss is the SVM family’s native loss:

Hinge = max(0, 1 - y * score)

where y is +1 or -1 and score is the raw output (no probability). It does not require a probability estimate at all. You will encounter it when reading about SVMs or when a paper uses a margin-based objective, but it is rare in modern deep learning.
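As a quick illustration, the formula can be evaluated on a few hypothetical (label, score) pairs. The function name and the example values are assumptions made for this sketch.

```python
def hinge(y, score):
    """Hinge loss for a label y in {+1, -1} and a raw (unnormalized) score."""
    return max(0.0, 1.0 - y * score)

confident_correct = hinge(+1, 2.5)   # margin beyond 1: zero loss
barely_correct = hinge(+1, 0.3)      # right side, but inside the margin
wrong_side = hinge(-1, 0.3)          # wrong side of the boundary
print(confident_correct, barely_correct, wrong_side)
```

A correctly classified sample still pays a penalty until its score clears the margin of 1, and once it does, the loss is exactly zero and that sample stops contributing gradient.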

The output layer must match the loss

Each loss assumes a specific form of output.

Task                        | Output layer                 | Loss
Binary classification       | Sigmoid (output in (0, 1))   | Binary cross-entropy
Multi-class (one correct)   | Softmax (outputs sum to 1)   | Categorical cross-entropy
Regression                  | Linear (no activation)       | MSE, MAE, or Huber
Multi-label (many correct)  | Sigmoid per class            | Binary cross-entropy per class

Softmax + cross-entropy is the canonical pairing because softmax turns raw scores (called logits — the unnormalized pre-softmax values) into a proper probability distribution, and cross-entropy measures how far that distribution is from a one-hot truth. PyTorch’s nn.CrossEntropyLoss fuses the softmax and the log internally for numerical stability — you feed it raw logits, not softmax outputs.
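To see why fusing the two steps helps, here is a rough sketch in plain Python of computing cross-entropy directly from logits with the log-sum-exp trick. It illustrates the idea only and is not PyTorch's actual implementation; the function names and logits are assumptions.

```python
import math

def softmax(logits):
    """Softmax with the max subtracted first so exp() cannot overflow."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_logits(logits, true_index):
    """-log(softmax(logits)[true_index]) computed as logsumexp(logits) - logit_true."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[true_index]

logits = [2.0, 1.0, 0.1]                       # hypothetical raw scores
fused = cross_entropy_from_logits(logits, 0)
two_step = -math.log(softmax(logits)[0])       # same value via explicit softmax

# Extreme logits: the true class has probability ~exp(-1000), which underflows
# to 0.0 in a softmax-then-log pipeline, yet the fused form stays finite.
extreme = cross_entropy_from_logits([1000.0, 0.0], 1)
print(fused, two_step, extreme)
```

With ordinary logits both routes agree. With extreme logits, taking the log of an explicit softmax output would hit log(0), while the fused form returns a finite loss.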

Why cross-entropy falls out of maximum likelihood

If you want to understand why cross-entropy is the right choice, not just that it is, here is the one-paragraph derivation.

Suppose your model outputs a probability p_i for the true class of sample i. You observe n independent samples. The likelihood of the whole dataset is the product p_1 * p_2 * ... * p_n. Taking the log turns the product into a sum: log(p_1) + log(p_2) + ... + log(p_n). Maximizing this (MLE) is the same as minimizing -(1/n) * sum(log(p_i)), which is exactly mean cross-entropy. There is no hand-waving: cross-entropy is MLE for a categorical distribution. That is why it is the principled default for any task where the output is a probability over classes.
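The derivation can be checked numerically. In this sketch the true-class probabilities are made-up values for four hypothetical samples; the point is only that the log of the product equals the sum of the logs, and that mean cross-entropy is that sum negated and divided by n.

```python
import math

# Hypothetical true-class probabilities for n = 4 independent samples.
p_true = [0.9, 0.6, 0.8, 0.3]
n = len(p_true)

likelihood = math.prod(p_true)                      # p_1 * p_2 * ... * p_n
log_likelihood = sum(math.log(p) for p in p_true)   # log(p_1) + ... + log(p_n)
mean_cross_entropy = -log_likelihood / n            # -(1/n) * sum(log(p_i))

print(likelihood, log_likelihood, mean_cross_entropy)
```

Because log is monotonic and -(1/n) is a negative constant, any parameter change that raises the likelihood lowers the mean cross-entropy, and vice versa.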

Next

Optimizers: SGD, Adam, and when to switch — now that you know what the loss measures, learn how the optimizer steps through it.

Practice this in an interview

Why use cross-entropy loss instead of MSE for classification?

MSE treats class probabilities as continuous values and produces tiny, saturating gradients when a sigmoid output is near 0 or 1, stalling learning. Cross-entropy is the proper log-likelihood loss for categorical distributions; it keeps gradients large and informative even when the network is very wrong, and its minimum aligns with the true class probabilities.
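The saturation argument can be sketched numerically. For a sigmoid output s = sigmoid(z) and target y, the chain rule gives a gradient of 2 * (s - y) * s * (1 - s) with respect to the logit z for squared error, and s - y for binary cross-entropy. The function names and the logit value below are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mse_grad_wrt_logit(z, y):
    """d/dz of (sigmoid(z) - y)**2; the s * (1 - s) factor vanishes when s saturates."""
    s = sigmoid(z)
    return 2.0 * (s - y) * s * (1.0 - s)

def bce_grad_wrt_logit(z, y):
    """d/dz of binary cross-entropy on sigmoid(z); the s * (1 - s) factor cancels."""
    return sigmoid(z) - y

# A confidently wrong prediction: the true label is 1 but the logit is very negative.
z, y = -10.0, 1.0
g_mse = mse_grad_wrt_logit(z, y)
g_bce = bce_grad_wrt_logit(z, y)
print(g_mse, g_bce)
```

At this point the model is as wrong as it can be, yet the squared-error gradient is smaller than 1e-4 in magnitude, while the cross-entropy gradient is close to -1.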

What is log loss and why does it penalise confident wrong predictions more than uncertain ones?

Log loss (cross-entropy loss) measures how well a model's predicted probabilities match the true labels: it is the negative log-likelihood of the correct class. It penalises confident wrong predictions severely because -log(p) grows without bound as the probability p assigned to the true class approaches zero. In a binary problem, predicting 0.99 for the wrong class leaves 0.01 for the true class and costs -ln(0.01) ≈ 4.61, about five times the ≈ 0.92 cost of predicting 0.6 for the wrong class, and the gap keeps widening as confidence in the wrong answer grows. A perfect model achieves 0; a binary classifier that always predicts 0.5 achieves ln(2) ≈ 0.693.

What loss function does logistic regression optimize, and why is it convex?

Logistic regression minimizes binary cross-entropy (log-loss), which is the negative log-likelihood of the Bernoulli distribution given the sigmoid-transformed linear predictions. The Hessian of log-loss with respect to the weights is positive semi-definite everywhere, so the surface is convex and any local minimum is a global one. The minimizer is unique only when the objective is strictly convex (for instance, when L2 regularization is added).

Why does training loss keep falling while validation loss rises?

This divergence is the signature of overfitting: the model has enough capacity to memorise training-set specifics — noise, label errors, dataset-specific patterns — that do not generalise. Training loss measures fit to what has already been seen; validation loss measures generalisation to held-out data. As the model memorises rather than learns structure, it scores better on training data and worse on everything else.
