Why use cross-entropy loss instead of MSE for classification?

MSE treats class probabilities as continuous values and produces tiny, saturating gradients when a sigmoid output is near 0 or 1, stalling learning. Cross-entropy is the proper log-likelihood loss for categorical distributions; it keeps gradients large and informative even when the network is very wrong, and its minimum aligns with the true class probabilities.
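A small numerical sketch of the saturation claim, assuming a single sigmoid output and a true label of 1. The helper names (`sigmoid`, `grad_mse`, `grad_ce`) are illustrative and not from any library; both gradients are taken with respect to the pre-sigmoid score z.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_mse(z, y):
    # d/dz of (sigmoid(z) - y)^2 = 2 * (p - y) * p * (1 - p)
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)

def grad_ce(z, y):
    # d/dz of binary cross-entropy with a sigmoid output simplifies to (p - y)
    p = sigmoid(z)
    return p - y

y = 1.0  # true label
for z in (-8.0, -4.0, 0.0, 4.0):
    print(f"z = {z:+.1f}  p = {sigmoid(z):.4f}  "
          f"dMSE/dz = {grad_mse(z, y):+.6f}  dCE/dz = {grad_ce(z, y):+.6f}")
```

At z = -8 the model is confidently wrong: the extra p * (1 - p) factor drives the MSE gradient toward zero, while the cross-entropy gradient stays close to -1.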

What is log loss and why does it penalise confident wrong predictions more than uncertain ones?

Log loss (cross-entropy loss) measures how well a model's predicted probabilities match the true labels: it is the negative log-likelihood of the correct class. It penalises confident wrong predictions severely because -log(p) grows without bound as the probability p assigned to the correct class approaches zero. In a binary problem, predicting 0.99 for the wrong class leaves 0.01 for the correct one and costs -ln(0.01) ≈ 4.61, about 5x the -ln(0.4) ≈ 0.92 incurred by predicting 0.6 for the wrong class, and the penalty keeps climbing as confidence nears 1. A perfect model achieves 0; a binary classifier that always predicts 0.5 achieves ln(2) ≈ 0.693.
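A quick way to check these magnitudes yourself, assuming a binary problem so that the correct class receives one minus whatever went to the wrong class. `log_loss_one` is an illustrative helper, not a library function.

```python
import math

def log_loss_one(p_true):
    # Negative log-likelihood of the correct class for one sample
    return -math.log(p_true)

# Binary case: the correct class gets 1 minus the wrong-class probability
confident_wrong = log_loss_one(1 - 0.99)  # 0.99 placed on the wrong class
unsure_wrong    = log_loss_one(1 - 0.60)  # 0.60 placed on the wrong class
coin_flip       = log_loss_one(0.5)       # uninformed 50/50 guess
ratio           = confident_wrong / unsure_wrong

print(f"confident wrong: {confident_wrong:.3f}")
print(f"unsure wrong:    {unsure_wrong:.3f}")
print(f"ratio:           {ratio:.1f}x")
print(f"coin flip:       {coin_flip:.3f}")
```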

What loss function does logistic regression optimize, and why is it convex?

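Logistic regression minimizes binary cross-entropy (log-loss), which is the negative log-likelihood of the Bernoulli distribution given the sigmoid-transformed linear predictions. The Hessian of log-loss is positive semi-definite everywhere, so the loss surface is convex and every local minimum is a global minimum. Strict convexity, and hence a unique minimizer, additionally requires full-rank features; with perfectly separable data no finite minimizer exists unless regularization is added.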
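A short derivation sketch of the positive semi-definite claim. The notation is assumed for illustration and does not come from the text: x_i is a feature vector, y_i in {0, 1} its label, w the weight vector, and sigma the sigmoid.

```latex
\begin{aligned}
L(w) &= -\sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i)\log(1 - p_i) \right],
        \qquad p_i = \sigma(w^\top x_i) \\
\nabla L(w) &= \sum_{i=1}^{n} (p_i - y_i)\, x_i \\
\nabla^2 L(w) &= \sum_{i=1}^{n} p_i (1 - p_i)\, x_i x_i^\top \\
v^\top \nabla^2 L(w)\, v &= \sum_{i=1}^{n} p_i (1 - p_i)\, (x_i^\top v)^2 \;\ge\; 0
        \quad \text{for all } v
\end{aligned}
```

Each weight p_i (1 - p_i) is positive, so the quadratic form can never be negative, which is the definition of a positive semi-definite Hessian.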

Why does training loss keep falling while validation loss rises?

This divergence is the signature of overfitting: the model has enough capacity to memorise training-set specifics — noise, label errors, dataset-specific patterns — that do not generalise. Training loss measures fit to what has already been seen; validation loss measures generalisation to held-out data. As the model memorises rather than learns structure, it scores better on training data and worse on everything else.
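One way to spot the divergence in logged losses, as a rough sketch. The per-epoch numbers below are made up for illustration, and `divergence_epoch` is a hypothetical helper rather than part of any framework.

```python
# Made-up per-epoch losses, for illustration only
train_loss = [0.90, 0.62, 0.45, 0.33, 0.25, 0.19, 0.14]
val_loss   = [0.95, 0.70, 0.58, 0.55, 0.57, 0.63, 0.71]

def divergence_epoch(train, val):
    """Return the first epoch index where training loss fell
    while validation loss rose, or None if that never happens."""
    for epoch in range(1, len(train)):
        train_fell = train[epoch] < train[epoch - 1]
        val_rose = val[epoch] > val[epoch - 1]
        if train_fell and val_rose:
            return epoch
    return None

# Epoch with the lowest validation loss (a natural checkpoint to keep)
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)

print("lowest validation loss at epoch", best_epoch)
print("divergence starts at epoch", divergence_epoch(train_loss, val_loss))
```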

Loss Functions: MSE, Cross-Entropy & Friends — Deep Learning

A team once trained a fraud-detection model for weeks. Accuracy on the test set hit 99.8 %. They shipped it. The fraud rate did not budge.

The model had learned one trick: predict “not fraud” for every transaction. With 0.2 % fraud in the data, that gives 99.8 % accuracy and zero useful signal. The problem was not the architecture or the learning rate. It was the loss function (the single number the optimizer minimizes each step). They had judged the model on accuracy, which cannot be differentiated and so cannot serve as a loss directly, and the surrogate loss they trained on rewarded the trivial solution.
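The arithmetic behind that trap is easy to reproduce. This sketch assumes a hypothetical dataset of 10,000 transactions with the same 0.2 % fraud rate; the fraction of fraud cases caught (often called recall) is added here as a contrast to accuracy.

```python
# Hypothetical dataset: 10,000 transactions, 0.2 % of them fraud (label 1)
n_total = 10_000
n_fraud = 20
labels = [1] * n_fraud + [0] * (n_total - n_fraud)

# The trivial model: predict "not fraud" (0) for every transaction
preds = [0] * n_total

accuracy = sum(p == y for p, y in zip(preds, labels)) / n_total

# Fraction of actual fraud cases the model flagged
caught = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
recall = caught / n_fraud

print(f"accuracy        = {accuracy:.1%}")
print(f"recall on fraud = {recall:.1%}")
```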

The loss is the only thing your network actually optimizes. Everything else — architecture, data augmentation, regularization — only matters insofar as it shapes what the loss landscape looks like. Get the loss wrong and you are optimizing the wrong goal, no matter how clever the rest of your setup is.

Regression losses

Regression tasks predict a continuous number — house price, temperature, pixel intensity. You have three main options.

Mean Squared Error (MSE)

MSE averages the squared difference between predictions and targets:

MSE = (1/n) * sum( (pred_i - target_i)^2 )

Squaring does two things. First, it makes every term non-negative, so errors in opposite directions do not cancel. Second, it penalizes large errors quadratically (it grows with the square of the error) — an error of 10 costs 100 times more than an error of 1. That amplification is good when all errors genuinely matter proportionally, and it gives a smooth, differentiable landscape that gradient descent loves.

The same amplification is the weakness. One outlier — a mislabeled target or an extreme value — dominates the loss and pulls the whole model toward it.
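A minimal sketch of how much one bad target can dominate, using illustrative numbers: four predictions that are each off by 1, then the same data with the last target corrupted.

```python
def mse(preds, targets):
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

# Illustrative values: every prediction is off by exactly 1
preds   = [10.0, 12.0, 11.0, 13.0]
targets = [11.0, 11.0, 12.0, 12.0]

# Same data, but the last target is corrupted into an outlier
targets_outlier = [11.0, 11.0, 12.0, 112.0]

mse_clean = mse(preds, targets)
mse_outlier = mse(preds, targets_outlier)

# Share of the total squared error contributed by the single outlier
sq_errors = [(p - t) ** 2 for p, t in zip(preds, targets_outlier)]
outlier_share = sq_errors[-1] / sum(sq_errors)

print("MSE, clean targets:  ", mse_clean)
print("MSE, with one outlier:", mse_outlier)
print(f"outlier share of squared error: {outlier_share:.2%}")
```

Nearly all of the loss now comes from one sample, so nearly all of the gradient does too.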

Mean Absolute Error (MAE)

MAE averages the absolute difference:

MAE = (1/n) * sum( |pred_i - target_i| )

Every unit of error costs the same regardless of magnitude, so outliers have far less influence. The catch: the absolute value function has a kink (a non-differentiable point) at zero, and its gradient has the same magnitude everywhere else, so updates do not shrink as predictions approach the target and can oscillate around it.

Huber loss

Huber loss (parameterized by a threshold delta) blends both: it is MSE-like for small errors and MAE-like for large ones.

Huber(e) = 0.5 * e^2          if |e| <= delta
         = delta*(|e| - 0.5*delta)   otherwise

You get smooth gradients near zero and robustness to outliers. The trade-off is one extra hyperparameter to tune.
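The differences between the three losses are easiest to see in their per-sample gradients with respect to the error e. This is a sketch under two assumptions: the MSE term is taken as e^2 (no 1/n factor), and the MAE gradient at exactly 0 is set to 0 by convention. The function names are illustrative.

```python
def grad_mse(e):
    # d/de of e^2: grows linearly with the error
    return 2.0 * e

def grad_mae(e):
    # d/de of |e|: always -1, 0, or +1 (0 only by convention at e == 0)
    return (e > 0) - (e < 0)

def grad_huber(e, delta=1.0):
    # d/de of Huber: equals e inside the threshold, +/- delta outside
    if abs(e) <= delta:
        return e
    return delta if e > 0 else -delta

for e in (-10.0, -1.0, -0.01, 0.01, 1.0, 10.0):
    print(f"e = {e:+6.2f}  MSE' = {grad_mse(e):+7.2f}  "
          f"MAE' = {grad_mae(e):+d}  Huber' = {grad_huber(e):+.2f}")
```

The MSE gradient scales with the error, the MAE gradient is the same size at e = 0.01 as at e = 10, and the Huber gradient shrinks near zero but is capped at delta for large errors.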

Figure. Left: MSE (parabola) punishes large errors quadratically; MAE (V) is linear; Huber transitions between them. Right: cross-entropy explodes as the true-class probability approaches 0, which is exactly the gradient signal you need.

Classification losses

Cross-entropy (log loss)

For a single sample, cross-entropy loss is:

CE = -log( p_true )

where p_true is the probability your model assigned to the correct class. That is the entire formula for the single-label case.

Why the negative log? Two reasons.

Mathematically: this falls directly out of maximum likelihood estimation (MLE) — the principle of finding parameters that make the observed data most probable. Maximizing log(p_true) is equivalent to minimizing -log(p_true), so cross-entropy is just MLE with a sign flip to turn it into a minimization problem.

Practically: the gradient signal is exactly right. When your model is confidently wrong (it gives the true class only p = 0.1) the loss is -log(0.1) = 2.303. When it is almost right at p = 0.9, the loss is -log(0.9) = 0.105. That is a roughly 22x difference in penalty between two single predictions, driven entirely by how much probability went to the true class. A nearly right model gets a gentle nudge; a confidently wrong model gets slammed, which is exactly what drives learning.

import math

# ── Regression losses ──────────────────────────────────────────
# 3 predictions vs targets
preds   = [2.5, 0.0, 2.0]
targets = [3.0, -0.5, 2.0]
errors  = [p - t for p, t in zip(preds, targets)]

mse = sum(e**2 for e in errors) / len(errors)
mae = sum(abs(e) for e in errors) / len(errors)

delta = 1.0
def huber_one(e, d):
    return 0.5 * e**2 if abs(e) <= d else d * (abs(e) - 0.5 * d)
huber = sum(huber_one(e, delta) for e in errors) / len(errors)

print("── Regression ────────────────────")
print("errors:", errors)
print(f"MSE  = {mse:.4f}")
print(f"MAE  = {mae:.4f}")
print(f"Huber (delta=1) = {huber:.4f}")

# ── Cross-entropy ──────────────────────────────────────────────
# 3 classification samples; store only the true-class probability
true_class_probs = [0.90, 0.60, 0.10]

ce_losses = [-math.log(p) for p in true_class_probs]
ce_mean   = sum(ce_losses) / len(ce_losses)

print()
print("── Cross-entropy ─────────────────")
for p, loss in zip(true_class_probs, ce_losses):
    print(f"  p = {p:.2f}  ->  -log(p) = {loss:.4f}")
print(f"mean CE = {ce_mean:.4f}")

print()
print("── Confidence penalty ratio ──────")
high = -math.log(0.90)
low  = -math.log(0.10)
print(f"-log(0.90) = {high:.3f}")
print(f"-log(0.10) = {low:.3f}")
print(f"ratio      = {low/high:.1f}x")

── Regression ────────────────────
errors: [-0.5, 0.5, 0.0]
MSE  = 0.1667
MAE  = 0.3333
Huber (delta=1) = 0.0833

── Cross-entropy ─────────────────
  p = 0.90  ->  -log(p) = 0.1054
  p = 0.60  ->  -log(p) = 0.5108
  p = 0.10  ->  -log(p) = 2.3026
mean CE = 0.9729

── Confidence penalty ratio ──────
-log(0.90) = 0.105
-log(0.10) = 2.303
ratio      = 21.9x

Reading the numbers:

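MSE = 0.1667, MAE = 0.3333, Huber = 0.0833: the same three errors, yet MSE comes out smaller than MAE. Every error here is below 1 in magnitude, so squaring shrinks it; MSE only amplifies errors larger than 1. Huber with delta = 1 is 0.5 * e^2 for these errors, which is exactly half of MSE.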
Cross-entropy at p = 0.90 is 0.1054; at p = 0.10 it is 2.3026 — the confident-wrong penalty is 21.9x larger.

Hinge loss

Hinge loss is the SVM family’s native loss:

Hinge = max(0, 1 - y * score)

where y is +1 or -1 and score is the raw output (no probability). It does not require a probability estimate at all. You will encounter it when reading about SVMs or when a paper uses a margin-based objective, but it is rare in modern deep learning.
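A minimal sketch of the formula on a few hand-picked (label, score) pairs. The values are illustrative; `hinge` is a local helper, not a library call.

```python
def hinge(y, score):
    # y is +1 or -1; score is the raw, unbounded model output
    return max(0.0, 1.0 - y * score)

# (label, score) pairs chosen for illustration
examples = [
    (+1,  2.5),   # correct and outside the margin
    (+1,  0.3),   # correct side, but inside the margin
    (-1,  0.3),   # wrong side of the decision boundary
    (-1, -4.0),   # correct and far outside the margin
]

for y, score in examples:
    print(f"y = {y:+d}  score = {score:+.1f}  hinge = {hinge(y, score):.2f}")
```

Samples that are correct by a margin of at least 1 contribute zero loss, so only points near or across the boundary influence training.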

The output layer must match the loss

Each loss assumes a specific form of output.

Task	Output layer	Loss
Binary classification	Sigmoid (output in `(0, 1)`)	Binary cross-entropy
Multi-class (one correct)	Softmax (outputs sum to 1)	Categorical cross-entropy
Regression	Linear (no activation)	MSE, MAE, or Huber
Multi-label (many correct)	Sigmoid per class	Binary cross-entropy per class

Softmax + cross-entropy is the canonical pairing because softmax turns raw scores (called logits — the unnormalized pre-softmax values) into a proper probability distribution, and cross-entropy measures how far that distribution is from a one-hot truth. PyTorch’s nn.CrossEntropyLoss fuses the softmax and the log internally for numerical stability — you feed it raw logits, not softmax outputs.
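To see why fusing helps, here is a plain-Python sketch of the idea using the log-sum-exp trick. It is an illustration of the principle, not PyTorch's actual implementation, and the function names and logits are assumptions made for the example.

```python
import math

def softmax(logits):
    # Subtract the max logit before exponentiating to avoid overflow
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_logits(logits, true_index):
    # -log(softmax(logits)[true_index]) computed as log-sum-exp minus the
    # true-class logit, so no probability is ever rounded down to 0
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[true_index]

logits = [2.0, 1.0, 0.1]   # illustrative raw scores for 3 classes
true_index = 0

probs = softmax(logits)
print("probs:", [round(p, 4) for p in probs])
print("CE via probs :", -math.log(probs[true_index]))
print("CE via logits:", cross_entropy_from_logits(logits, true_index))

# Extreme logits: softmax gives the true class a probability of exactly 0.0,
# so -log(prob) would fail, while the fused form stays finite
extreme = [1000.0, 0.0]
print("fused CE, true class 1:", cross_entropy_from_logits(extreme, 1))
```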

Why cross-entropy falls out of maximum likelihood

If you want to understand why cross-entropy is the right choice, not just that it is, here is the one-paragraph derivation.

Suppose your model outputs a probability p for the true class. You observe n independent samples. The likelihood of the whole dataset is the product p_1 * p_2 * ... * p_n. Taking the log turns the product into a sum: log(p_1) + log(p_2) + ... + log(p_n). Maximizing this (MLE) is the same as minimizing -(1/n) * sum(log(p_i)) — which is exactly mean cross-entropy. There is no hand-waving: cross-entropy is MLE for a categorical distribution. That is why it is the principled default for any task where the output is a probability over classes.
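The derivation can be checked numerically. This sketch reuses the three true-class probabilities from the worked example earlier in the lesson and assumes Python 3.8+ for `math.prod`.

```python
import math

# True-class probabilities for three independent samples
p = [0.90, 0.60, 0.10]
n = len(p)

likelihood = math.prod(p)                        # p_1 * p_2 * ... * p_n
log_likelihood = sum(math.log(pi) for pi in p)   # log turns the product into a sum
mean_ce = -log_likelihood / n                    # mean cross-entropy

print(f"likelihood      = {likelihood:.4f}")
print(f"log(likelihood) = {math.log(likelihood):.4f}")
print(f"sum of log(p_i) = {log_likelihood:.4f}")
print(f"mean CE         = {mean_ce:.4f}")
```

The log of the product matches the sum of the logs, and flipping the sign and dividing by n gives the same mean cross-entropy as the earlier example.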

In one breath

The loss is the only thing the optimizer minimizes — pick the wrong one and you chase the wrong goal, however good the rest of the setup.
Regression: MSE punishes large errors quadratically (smooth, but outlier-sensitive); MAE is robust but kinked at zero; Huber blends both with a delta knob.
Classification: cross-entropy = −log(p_true), which falls straight out of maximum likelihood and slams confident-wrong predictions (−log 0.1 is ~22× −log 0.9).
Match the output layer to the loss: sigmoid + binary cross-entropy, softmax + categorical cross-entropy, linear + MSE/MAE/Huber.
Never use MSE on a sigmoid/softmax classifier — its gradient vanishes exactly when the model is confidently wrong; cross-entropy keeps the signal alive.

Quick check

Q1. Your model assigns probability 0.02 to the true class. Roughly what is the cross-entropy loss for that sample?

Q2. You are predicting house prices and your dataset has a few extreme mansions worth 100x the median. Which loss is the safest starting point?

Q3. A new paper proposes training an image classifier with softmax output but MSE loss (comparing the softmax vector to a one-hot vector). What is the most likely failure mode?

Optimizers: SGD, Adam, and when to switch — now that you know what the loss measures, learn how the optimizer steps through it.
