What does softmax do, and why is it used in the output layer?
Softmax converts a vector of raw scores (logits) into a valid probability distribution — all values positive and summing to one — by exponentiating each score and normalising by the total. It is used in classification output layers because the resulting probabilities pair naturally with cross-entropy loss and allow confident predictions to dominate while preserving the relative ordering of logits.
How to think about it
Softmax is the canonical output activation for multi-class classification. It maps arbitrary real-valued logits to a point on the probability simplex: a vector of non-negative entries that sum to 1.
The formula
For a logit vector z of length K:
softmax(z)_i = exp(z_i) / Σ_j exp(z_j),  for i = 1..K and j running over all K classes
The exponential ensures every value is strictly positive; dividing by the sum ensures the values add to 1.
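As a minimal sketch, the formula above can be transcribed directly into plain Python using only the standard library. The function name `softmax_naive` is illustrative and not taken from any library, and this version has no overflow protection.

```python
import math

def softmax_naive(logits):
    """Direct transcription of exp(z_i) / sum_j exp(z_j); no overflow protection."""
    exps = [math.exp(z) for z in logits]   # every term is strictly positive
    total = sum(exps)                      # normalising constant
    return [e / total for e in exps]

probs = softmax_naive([3.0, 1.0, 0.2])
print([round(p, 4) for p in probs])  # [0.836, 0.1131, 0.0508]
```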
Why exponentiation?
Exponentiation amplifies differences between scores non-linearly. If class A has a logit 2 units higher than class B, it receives e² ≈ 7.4 times the probability of class B, regardless of the other logits. Large gaps therefore translate into sharp confidence, while small gaps leave the probabilities close together.
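One way to check this numerically, sketched in plain Python with an illustrative helper named `softmax`: the ratio of two output probabilities depends only on the gap between their logits, so a gap of 2 gives a ratio of e² whatever the third logit is.

```python
import math

def softmax(logits):
    # Illustrative helper: exponentiate, then normalise
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Class A (index 0) sits 2 logit units above class B (index 1) in both cases;
# only the third logit changes.
for third in (-1.0, 4.0):
    p = softmax([3.0, 1.0, third])
    ratio = p[0] / p[1]
    print(round(ratio, 3))  # 7.389 both times, i.e. e**2
```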
Numerical stability in practice
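Naive exponentiation overflows for large logits: in float32, exp(z) becomes inf once z exceeds roughly 88, and the normalisation then produces NaN. The standard fix is to subtract the maximum logit before exponentiating. This leaves the output mathematically unchanged, because the common factor exp(-max) cancels between numerator and denominator, but keeps every exponent at or below zero. Library implementations handle this for you, so in application code you can call the built-in directly: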
import torch
import torch.nn.functional as F
logits = torch.tensor([3.0, 1.0, 0.2])
# F.softmax is numerically stable; no manual max-subtraction is needed
probs = F.softmax(logits, dim=-1)
# tensor([0.8360, 0.1131, 0.0508])
# For loss computation, skip the explicit softmax: nn.CrossEntropyLoss
# expects raw logits and applies log-softmax internally
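The max-subtraction trick described above can also be sketched by hand in plain Python. The names `softmax_stable` and `big_logits` are illustrative. Note that Python's `math.exp` raises `OverflowError` for large arguments, whereas floating-point tensor libraries typically return inf, which turns into NaN after normalisation.

```python
import math

def softmax_stable(logits):
    """Softmax with the maximum logit subtracted before exponentiating."""
    m = max(logits)
    # Shifted logits are all <= 0, so every exp() is at most 1 and cannot overflow
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)  # at least 1.0, because the largest logit contributes exp(0)
    return [e / total for e in exps]

big_logits = [1000.0, 999.0, 998.0]

try:
    math.exp(big_logits[0])  # naive exponentiation of a large logit
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

print(naive_overflowed)  # True
print([round(p, 4) for p in softmax_stable(big_logits)])  # [0.6652, 0.2447, 0.09]
```

Because only the differences between logits matter, the result for [1000, 999, 998] is the same as for [2, 1, 0].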
Temperature scaling
Dividing logits by a temperature T before softmax sharpens (T < 1) or flattens (T > 1) the distribution. This is used in knowledge distillation and language-model sampling.
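A minimal sketch of temperature scaling in plain Python, assuming a hypothetical helper `softmax_with_temperature` (not a library function): the logits are divided by T and then passed through a max-subtracted softmax.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature; temperature must be positive."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)                          # subtract the max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [3.0, 1.0, 0.2]
sharp = softmax_with_temperature(logits, temperature=0.5)  # T < 1: more peaked
plain = softmax_with_temperature(logits, temperature=1.0)  # T = 1: ordinary softmax
flat = softmax_with_temperature(logits, temperature=2.0)   # T > 1: closer to uniform

print(max(sharp) > max(plain) > max(flat))  # True
```

As T grows, the output approaches the uniform distribution; as T approaches 0, it approaches a one-hot vector on the largest logit.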