Softmax
How a network turns raw scores into a probability distribution: the exp-over-sum formula, why exp, the subtract-the-max stability trick, and how temperature dials it from greedy argmax to coin-flip random.
What you'll learn
- The formula softmax(z)_i = exp(z_i) / sum_j exp(z_j) and why exp is the right choice
- The subtract-the-max trick — exact, not approximate — that every library uses
- Temperature — T below 1 sharpens toward argmax, T above 1 flattens toward uniform
- Softmax vs sigmoid — one competing distribution vs independent per-class scores
- Where it lives in practice — classifier output layers, attention, and LLM sampling
Before you start
The raw scores are called logits — the unnormalized outputs of the last linear layer. Softmax is the bridge from logits to a probability distribution: every output lands in the open interval between 0 and 1, and together they sum to exactly 1. One formula, three jobs across modern deep learning.
The formula
For a vector of logits z with K entries, the probability assigned to class i is:
softmax(z)_i = exp(z_i) / sum_j exp(z_j), with j running over all K classes.
Plug in the example: softmax([2.0, 1.0, 0.1]) gives [0.659, 0.242, 0.099],
which sums to 1. Class 0 had the biggest logit, so it gets the biggest
probability — softmax never reorders the classes.
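As a quick check of the worked example, here is a minimal NumPy sketch of the formula exactly as written. The function name `softmax_naive` is an illustrative choice, and the sketch deliberately has no overflow protection.

```python
import numpy as np

def softmax_naive(z):
    """Textbook softmax: exp of each logit divided by the sum of all exps."""
    e = np.exp(np.asarray(z, dtype=float))
    return e / e.sum()

probs = softmax_naive([2.0, 1.0, 0.1])
print(np.round(probs, 3))       # [0.659 0.242 0.099]
print(probs.sum())              # 1.0 up to floating-point rounding
print(int(np.argmax(probs)))    # 0: the largest logit keeps the largest probability
```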
Why exp, of all functions?
You could imagine normalizing logits some other way — divide each by the sum,
say. Softmax uses exp for four reasons that all matter at once:
That last one is the deepest. A hard argmax — “just pick the biggest” — is
flat almost everywhere, so its gradient is zero or undefined and a network
can’t learn through it. Softmax is a smooth, differentiable stand-in. Its
other name, softargmax, says it plainly: a soft, trainable version of
argmax. Notice too that only the differences between logits matter — the
ratio of two probabilities, p_i / p_j, equals exp(z_i - z_j). Add the
same constant to every logit and nothing changes. Hold onto that fact.
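The two properties in this paragraph are easy to verify numerically. The sketch below uses made-up logits with one negative entry: it shows that dividing by the plain sum does not produce valid probabilities, that the ratio of two softmax outputs equals exp of the logit difference, and that shifting every logit by a constant changes nothing.

```python
import numpy as np

def softmax(z):
    e = np.exp(np.asarray(z, dtype=float))
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5])   # illustrative logits, one of them negative

# Dividing by the plain sum breaks as soon as a logit is negative:
# the result has a negative entry and an entry above 1.
naive = z / z.sum()

# Softmax keeps every entry strictly between 0 and 1.
p = softmax(z)

# Only differences matter: p_i / p_j equals exp(z_i - z_j).
ratio_ok = np.isclose(p[0] / p[2], np.exp(z[0] - z[2]))

# Adding the same constant to every logit leaves the output unchanged.
shift_ok = np.allclose(softmax(z + 5.0), p)

print(naive)
print(p)
print(ratio_ok, shift_ok)
```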
Play with it
[Interactive widget: sliders for each logit and for temperature, with probability bars that always sum to 100%.]
The subtract-the-max trick
Here is the detail that separates a textbook formula from production code.
exp(1000) overflows to infinity in floating point, so the naive formula
computes infinity divided by infinity and returns NaN. The fix exploits the
fact you just learned: softmax is unchanged when you add a constant to every
logit. So before exponentiating, subtract the largest logit from all of them:
softmax(z)_i = exp(z_i - max(z)) / sum_j exp(z_j - max(z))
The biggest logit becomes 0, so its exponential is exp(0) = 1, and the largest
term can never overflow. Every other term is exp of a non-positive number, so
it lies between 0 and 1 (a very negative one may underflow to 0, which is
harmless because the denominator is at least 1). The shift cancels exactly in
the ratio, so this is the same function mathematically, not an approximation;
only the floating-point evaluation changes. This “safe softmax” is what
mainstream libraries do under the hood, and in NumPy it takes about three lines.
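A minimal sketch of the safe version for a single 1-D vector of logits follows. The function body is the three lines in question; for batched inputs you would pass an axis to `max` and `sum`, which is left out here to keep the sketch short.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax for a 1-D vector of logits."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())   # largest entry becomes exp(0) = 1, nothing overflows
    return e / e.sum()

small = softmax([2.0, 1.0, 0.1])
large = softmax([1000.0, 999.0, 998.0])   # naive exp(1000) would overflow to inf

print(np.round(small, 3))   # [0.659 0.242 0.099]
print(large)                # finite values, same as softmax([2.0, 1.0, 0.0])
```

Because only differences between logits matter, `[1000, 999, 998]` gives the same distribution as `[2, 1, 0]`.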
Temperature — one knob from greedy to random
Divide the logits by a temperature T before softmax: softmax(z / T). It is
a single dial over how peaked the distribution is.
| Temperature | Effect | Limit |
|---|---|---|
| T = 1 | Plain softmax | n/a |
| T < 1 | Sharpens: more peaked, more confident/greedy | T -> 0 becomes one-hot at the argmax |
| T > 1 | Flattens: closer to uniform, more random | T -> infinity becomes uniform, 1/K each |
On [2.0, 1.0, 0.1]: at T = 0.5 you get [0.86, 0.12, 0.02] (sharper); at
T = 2 you get [0.50, 0.30, 0.19] (flatter; the rounded values sum to 0.99).
This is the LLM temperature knob. T = 0 cannot be plugged into z / T directly,
so it is treated as the limit: greedy decoding that always picks the top
token. Higher T gives more diverse, surprising output. The mental model to
keep: softmax is a temperature-controllable soft argmax. Crank T down and it
hardens into argmax; crank it up and it melts into a uniform guess.
A common trap, worth saying out loud: higher temperature makes the model less confident, not more. People get this backwards constantly.
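The temperature numbers above can be reproduced with a short sketch. The function name `softmax_with_temperature` is hypothetical, and rejecting T <= 0 is a choice made here; real decoders typically special-case T = 0 as greedy argmax instead.

```python
import numpy as np

def softmax_with_temperature(z, T=1.0):
    """Softmax of z / T, computed with the subtract-the-max trick."""
    if T <= 0:
        raise ValueError("T must be positive; treat T = 0 as greedy argmax instead")
    scaled = np.asarray(z, dtype=float) / T
    e = np.exp(scaled - scaled.max())
    return e / e.sum()

logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, T=0.5)   # roughly [0.86, 0.12, 0.02]
flat = softmax_with_temperature(logits, T=2.0)    # roughly [0.50, 0.30, 0.19]

# The two limits: a tiny T approaches one-hot, a huge T approaches uniform.
near_argmax = softmax_with_temperature(logits, T=0.01)
near_uniform = softmax_with_temperature(logits, T=1e6)

print(np.round(sharp, 2), np.round(flat, 2))
print(np.round(near_argmax, 3), np.round(near_uniform, 3))
```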
Softmax vs sigmoid — they are not interchangeable
This is the confusion that bites people in code review. A sigmoid squashes one logit into one independent probability. Stack N sigmoids and each class is scored on its own — the outputs need not sum to 1. That is multi-label: an image can be both “outdoor” and “sunset” at once.
Softmax couples all the classes into one distribution that competes and sums to 1. That is multi-class, single-label: exactly one answer is right, so the classes fight over a fixed budget of probability. Softmax is the multi-class generalization of logistic regression, and in the two-class case it collapses back to a sigmoid of the logit difference: softmax([z0, z1])_0 = sigmoid(z0 - z1). Pick by the question you are asking: “which one?” is softmax; “which ones?” is sigmoid.
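Both claims can be checked in a few lines. The sketch below uses made-up logits: softmax outputs sum to 1, independent sigmoids do not, and the two-class softmax matches a sigmoid of the logit difference.

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

logits = np.array([1.2, -0.3, 0.8])   # illustrative scores for three classes

multi_class = softmax(logits)   # one distribution: entries compete and sum to 1
multi_label = sigmoid(logits)   # independent scores: the sum is unconstrained

# Two-class case: softmax([z0, z1])_0 == sigmoid(z0 - z1)
z0, z1 = 0.7, -1.1
two_class_match = np.isclose(softmax([z0, z1])[0], sigmoid(z0 - z1))

print(multi_class.sum(), multi_label.sum(), two_class_match)
```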
Where softmax shows up
Three places, same operation:
- Classifier output layer. Softmax over the final logits gives class probabilities, paired with cross-entropy. The softmax + cross-entropy combination has a beautifully clean gradient: predicted distribution minus the true one.
- Attention. Scaled dot-product attention is softmax(Q Kᵀ / sqrt(d_k)) V. Softmax runs row-wise over the similarity scores so each query’s attention weights are non-negative and sum to 1, which makes the output a weighted average over the values. The 1/sqrt(d_k) scaling keeps the scores from growing so large that softmax saturates and its gradient vanishes. This is the Softmax block inside the transformer.
- LLM sampling. The model emits a logit per vocabulary token; divide by temperature and softmax to get the next-token distribution you sample from, often after top-k or top-p truncation.
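The attention formula in the list above can be sketched directly in NumPy. This is a single-head version with no masking or batching, and the shapes and random inputs are assumptions chosen only for illustration.

```python
import numpy as np

def softmax_rows(scores):
    """Stable softmax applied independently to each row."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one head, without masking."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # one row of similarity scores per query
    weights = softmax_rows(scores)    # each row is non-negative and sums to 1
    return weights @ V, weights       # output is a weighted average of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))   # 2 queries, d_k = 4
K = rng.normal(size=(3, 4))   # 3 keys
V = rng.normal(size=(3, 5))   # 3 values of width 5

out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape)              # (2, 5)
print(weights.sum(axis=-1))   # each row sums to 1 up to rounding
```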
Next
Softmax is the last step of the classifier; the loss it feeds is the engine of learning. Continue to loss functions to see why softmax and cross-entropy are an inseparable pair, or to sampling to see temperature drive an LLM’s word choices.
Practice this in an interview
Softmax converts a vector of raw scores (logits) into a valid probability distribution, with all values positive and summing to one, by exponentiating each score and normalising by the total. It is used in classification output layers because the resulting probabilities pair naturally with cross-entropy loss and allow confident predictions to dominate while preserving the relative ordering of logits.
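For large key dimensions, the dot products between query and key vectors grow with d_k: if the components are independent with zero mean and unit variance, the dot product has variance d_k, so its typical magnitude grows like sqrt(d_k). Large scores push the softmax into saturated regions with very small gradients. Dividing by sqrt(d_k) brings the pre-softmax scores back to roughly unit variance regardless of dimension, which stabilises training.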
MSE treats class probabilities as continuous values and produces tiny, saturating gradients when a sigmoid output is near 0 or 1, stalling learning. Cross-entropy is the proper log-likelihood loss for categorical distributions; it keeps gradients large and informative even when the network is very wrong, and its minimum aligns with the true class probabilities.
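The gradient behaviour of softmax with cross-entropy can be checked numerically. The sketch below compares the analytic gradient with respect to the logits (predicted distribution minus the one-hot target) against central finite differences; the logits and the target index are made-up values, picked so that the prediction is confidently wrong.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(z, target):
    """Negative log-probability of the target class."""
    return -np.log(softmax(z)[target])

z = np.array([2.0, 1.0, 0.1])
target = 2                      # the true class is the one the model rates lowest

y = np.zeros_like(z)
y[target] = 1.0
analytic = softmax(z) - y       # predicted distribution minus one-hot target

eps = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    step = np.zeros_like(z)
    step[i] = eps
    numeric[i] = (cross_entropy(z + step, target)
                  - cross_entropy(z - step, target)) / (2 * eps)

print(np.round(analytic, 3))
print(np.allclose(analytic, numeric, atol=1e-6))
```

The target entry of the gradient is close to -1 here, which is the "large and informative" signal a badly wrong prediction receives.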
Temperature rescales the logits before softmax — lower values sharpen the distribution toward the most likely token, higher values flatten it. Top-k restricts sampling to the k highest-probability tokens; top-p (nucleus sampling) restricts it to the smallest set of tokens whose cumulative probability reaches p. In practice top-p adapts the candidate pool dynamically while top-k uses a fixed count.
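A compact sketch of that pipeline, under stated assumptions: the helper names `keep_top_k` and `keep_top_p` are hypothetical, the four-token vocabulary and its logits are invented, and real decoders differ in details such as tie handling and whether filtering happens on logits or probabilities.

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def keep_top_k(probs, k):
    """Keep the k most probable tokens, zero the rest, renormalise."""
    keep = np.argsort(probs)[::-1][:k]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

def keep_top_p(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    count = int(np.searchsorted(cumulative, p)) + 1   # first position where the sum >= p
    keep = order[:count]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

logits = np.array([2.0, 1.0, 0.1, -1.0])   # made-up logits for a 4-token vocabulary
temperature = 1.0
probs = softmax(logits / temperature)

top_k_probs = keep_top_k(probs, k=2)     # always exactly 2 candidates
top_p_probs = keep_top_p(probs, p=0.9)   # here 3 candidates are needed to reach 0.9

rng = np.random.default_rng(0)
token = int(rng.choice(len(probs), p=top_p_probs))
print(np.round(top_k_probs, 3), np.round(top_p_probs, 3), token)
```

With a more peaked distribution the same p would keep fewer tokens, which is the adaptive behaviour the paragraph describes; k stays fixed either way.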