datarekha

Softmax

How a network turns raw scores into a probability distribution: the exp-over-sum formula, why exp, the subtract-the-max stability trick, and how temperature dials it from greedy argmax to coin-flip random.

7 min read Intermediate Deep Learning Lesson 4 of 17

What you'll learn

  • The formula softmax(z)_i = exp(z_i) / sum_j exp(z_j) and why exp is the right choice
  • The subtract-the-max trick — exact, not approximate — that every library uses
  • Temperature — T below 1 sharpens toward argmax, T above 1 flattens toward uniform
  • Softmax vs sigmoid — one competing distribution vs independent per-class scores
  • Where it lives in practice — classifier output layers, attention, and LLM sampling

Before you start

The raw scores are called logits — the unnormalized outputs of the last linear layer. Softmax is the bridge from logits to a probability distribution: every output lands in the open interval between 0 and 1, and together they sum to exactly 1. One formula, three jobs across modern deep learning.

The formula

For a vector of logits z with K entries, the probability assigned to class i is:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

Exponentiate every logit, then divide by the total of all the exponentials, so the outputs sum to one.

Plug in the example: softmax([2.0, 1.0, 0.1]) gives [0.659, 0.242, 0.099], which sums to 1. Class 0 had the biggest logit, so it gets the biggest probability — softmax never reorders the classes.
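To check those numbers yourself, here is a minimal sketch of the formula exactly as written, assuming NumPy is available. The name `softmax_naive` is ours, chosen for illustration; it does no overflow handling, so treat it as a teaching aid rather than production code.

```python
import numpy as np

def softmax_naive(z):
    """Textbook softmax: exponentiate each logit, divide by the total."""
    e = np.exp(np.asarray(z, dtype=float))
    return e / e.sum()

probs = softmax_naive([2.0, 1.0, 0.1])
print(np.round(probs, 3))      # [0.659 0.242 0.099]
print(probs.sum())             # 1.0, up to floating-point rounding
print(int(np.argmax(probs)))   # 0: the largest logit keeps the largest probability
```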

Why exp, of all functions?

You could imagine normalizing logits some other way — divide each by the sum, say. Softmax uses exp for four reasons that all matter at once:

  • Positive. e^x > 0, so every probability is positive, even for negative logits.
  • Monotonic. A bigger logit gives a bigger probability, so the ranking is preserved.
  • Amplifies. p_i / p_j = e^(z_i - z_j): gaps between logits become ratios between probabilities, so the top logit wins.
  • Smooth. Differentiable everywhere, so gradients flow and you can train through it.

All four hold at once, which is why softmax, not plain normalization, is the standard.

That last one is the deepest. A hard argmax — “just pick the biggest” — is flat almost everywhere, so its gradient is zero or undefined and a network can’t learn through it. Softmax is a smooth, differentiable stand-in. Its other name, softargmax, says it plainly: a soft, trainable version of argmax. Notice too that only the differences between logits matter — the ratio of two probabilities, p_i / p_j, equals exp(z_i - z_j). Add the same constant to every logit and nothing changes. Hold onto that fact.
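Both claims in that paragraph are easy to confirm numerically. The sketch below, assuming NumPy and an illustrative helper named `softmax_naive`, checks that the ratio of two probabilities equals the exponential of the logit difference, and that shifting every logit by the same constant (5.0 here, an arbitrary choice) leaves the output unchanged.

```python
import numpy as np

def softmax_naive(z):
    """Textbook softmax, with no overflow handling."""
    e = np.exp(np.asarray(z, dtype=float))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
p = softmax_naive(z)

# The ratio of two probabilities depends only on the logit difference.
ratio_ok = np.isclose(p[0] / p[1], np.exp(z[0] - z[1]))

# Adding the same constant to every logit changes nothing.
shift_ok = np.allclose(p, softmax_naive(z + 5.0))

print(ratio_ok, shift_ok)   # True True
```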

Play with it

The widget below is the whole lesson in one place. Drag the logit sliders and watch the probability bars — they always sum to 100%. Then drag temperature and feel what it does.

The subtract-the-max trick

Here is the detail that separates a textbook formula from production code. exp(1000) overflows to infinity in floating point, and your softmax returns NaN. The fix exploits the fact you just learned: softmax is unchanged when you add a constant to every logit. So before exponentiating, subtract the largest logit from all of them:

$$\mathrm{softmax}(z) = \mathrm{softmax}\big(z - \max(z)\big)$$

Same answer, exactly: the biggest exponent is now e^0 = 1, so it never overflows. Not an approximation, but an algebraic identity that prevents overflow.

The biggest logit becomes 0, so its exponential is exp(0) = 1 and the largest term can never overflow. Every other term is exp of a non-positive number, so it lies between 0 and 1; a very negative entry may underflow to 0, which is harmless because the denominator is always at least 1. The result is mathematically identical to the unshifted formula (any difference is ordinary floating-point rounding), just computed without blowing up. This “safe softmax” is what every mainstream library does under the hood.
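A minimal NumPy sketch of the safe version follows. The function name `softmax` and the choice to normalize over the last axis are our assumptions, not a specific library's API. The naive computation is included only to show the failure it avoids.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max(axis=-1, keepdims=True)   # largest entry becomes 0
    e = np.exp(shifted)                           # every term is at most 1
    return e / e.sum(axis=-1, keepdims=True)

big = np.array([1000.0, 999.0, 998.0])

# Naive version: exp overflows to inf, and inf / inf is NaN.
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(big) / np.exp(big).sum()

safe = softmax(big)
print(naive)   # all NaN
print(safe)    # finite, roughly 0.665, 0.245, 0.090
```

Because only the differences between logits matter, `softmax([1000, 999, 998])` matches `softmax([2, 1, 0])`.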

Temperature — one knob from greedy to random

Divide the logits by a temperature T before softmax: softmax(z / T). It is a single dial over how peaked the distribution is.

Temperature | Effect | Limit
T = 1 | Plain softmax |
T < 1 | Sharpens: more peaked, more confident/greedy | T -> 0 becomes one-hot at the argmax
T > 1 | Flattens: closer to uniform, more random | T -> infinity becomes uniform, 1/K each

On [2.0, 1.0, 0.1]: at T = 0.5 you get [0.86, 0.12, 0.02] (sharper); at T = 2 you get [0.50, 0.30, 0.19] (flatter). This is the LLM temperature knob. T = 0 is conventionally treated as greedy decoding (always the top token): dividing by zero is undefined, so it is handled as the T -> 0 limit, which is argmax. Higher T gives more diverse, surprising output. The mental model the widget builds: softmax is a temperature-controllable soft argmax. Crank T down and it hardens into argmax; crank it up and it melts into a uniform guess.

A common trap, worth saying out loud: higher temperature makes the model less confident, not more. People get this backwards constantly.
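The numbers above can be reproduced with a short sketch, assuming NumPy. The helper name `softmax_t` is ours, and rejecting T <= 0 is one possible design choice; real samplers usually special-case T = 0 as argmax instead.

```python
import numpy as np

def softmax_t(z, T=1.0):
    """Softmax of z / T with the subtract-the-max shift. Requires T > 0."""
    if T <= 0:
        raise ValueError("T must be positive; handle T = 0 as argmax separately")
    s = np.asarray(z, dtype=float) / T
    e = np.exp(s - s.max())
    return e / e.sum()

z = [2.0, 1.0, 0.1]
for T in (0.5, 1.0, 2.0):
    p = softmax_t(z, T)
    # The top probability shrinks as T grows: higher T means less confident.
    print(T, np.round(p, 2), round(float(p.max()), 2))
```

The last column falls from about 0.86 to 0.66 to 0.50 as T rises, which is the "less confident, not more" point made above.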

Softmax vs sigmoid — they are not interchangeable

This is the confusion that bites people in code review. A sigmoid squashes one logit into one independent probability. Stack N sigmoids and each class is scored on its own — the outputs need not sum to 1. That is multi-label: an image can be both “outdoor” and “sunset” at once.

Softmax couples all the classes into one distribution that competes and sums to 1. That is multi-class, single-label: exactly one answer is right, so the classes fight over a fixed budget of probability. Softmax is the multi-class generalization of logistic regression, and in the two-class case it collapses back to a sigmoid of the logit difference: softmax([z0, z1])_0 = sigmoid(z0 - z1). Pick by the question you are asking: “which one?” is softmax; “which ones?” is sigmoid.
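Both halves of that comparison can be checked in a few lines. This is a sketch assuming NumPy; the logit values and the three-tag example are made up for illustration.

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))

# Two-class softmax is a sigmoid of the logit difference.
z0, z1 = 1.3, -0.4
two_class_ok = np.isclose(softmax([z0, z1])[0], sigmoid(z0 - z1))

# Same logits, two readings: one competing distribution vs independent scores.
logits = np.array([2.0, 1.0, 0.1])
single_label = softmax(logits)    # sums to 1
multi_label = sigmoid(logits)     # each in (0, 1), no constraint on the sum

print(two_class_ok, single_label.sum(), multi_label.sum())
```

Here the sigmoid outputs add up to more than 2, which is fine for multi-label tagging and meaningless as a single-label distribution.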

Where softmax shows up

Three places, same operation:

  • Classifier output layer. Softmax over the final logits gives class probabilities, paired with cross-entropy. The softmax + cross-entropy combination has a beautifully clean gradient: predicted distribution minus the true one.
  • Attention. Scaled dot-product attention is softmax(Q Kᵀ / sqrt(d_k)) V. Softmax runs row-wise over the similarity scores so each query’s attention weights are non-negative and sum to 1 — a weighted average over the values. The 1/sqrt(d_k) scaling keeps the scores from growing so large that softmax saturates and its gradient vanishes. This is the Softmax block inside the transformer.
  • LLM sampling. The model emits a logit per vocabulary token; divide by temperature and softmax to get the next-token distribution you sample from, often after top-k or top-p truncation.
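As one concrete instance, the attention formula in the second bullet can be sketched in NumPy. This is a single-head, unmasked, unbatched illustration with made-up shapes (4 queries, 6 keys, d_k = 8) and random inputs, not a full transformer layer.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with softmax applied row-wise."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)   # each row is a distribution over keys
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 3))

out, w = attention(Q, K, V)
print(out.shape)         # (4, 3)
print(w.sum(axis=-1))    # each row sums to 1, up to rounding
```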

Quick check

Q1. A teammate computes softmax by subtracting the max logit first, claiming it 'changes the numbers a little but avoids overflow.' What is wrong with that claim?
Q2. You raise an LLM's temperature from 0.7 to 1.5. What happens to the next-token distribution?
Q3. Transfer: you are building a photo tagger where one image can carry several tags at once: 'beach', 'sunset', and 'people' might all be true. Should the output layer use softmax or independent sigmoids, and why?

Next

Softmax is the last step of the classifier; the loss it feeds is the engine of learning. Continue to loss functions to see why softmax and cross-entropy are an inseparable pair, or to sampling to see temperature drive an LLM’s word choices.

Practice this in an interview

What does softmax do, and why is it used in the output layer?

Softmax converts a vector of raw scores (logits) into a valid probability distribution — all values positive and summing to one — by exponentiating each score and normalising by the total. It is used in classification output layers because the resulting probabilities pair naturally with cross-entropy loss and allow confident predictions to dominate while preserving the relative ordering of logits.
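One way to see why the pairing is natural: the gradient of cross-entropy with respect to the logits is the predicted distribution minus the one-hot target. The sketch below, assuming NumPy, compares that closed form against a central finite difference; the logits, the target index, and the step size `eps` are arbitrary illustrative choices.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(z, target):
    """Negative log-probability of the true class."""
    return -np.log(softmax(z)[target])

z = np.array([2.0, 1.0, 0.1])
target = 0
y = np.eye(len(z))[target]       # one-hot true distribution

analytic = softmax(z) - y        # predicted minus true

eps = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    step = np.zeros_like(z)
    step[i] = eps
    numeric[i] = (cross_entropy(z + step, target)
                  - cross_entropy(z - step, target)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))   # True
```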

Why do we scale by sqrt(d_k) in scaled dot-product attention?

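For large key dimensions, the dot products between query and key vectors grow in magnitude: if the components are independent with zero mean and unit variance, the variance of a dot product is d_k, so its typical size grows like sqrt(d_k). Scores that large push the softmax into regions with very small gradients. Dividing by sqrt(d_k) brings the pre-softmax scores back to roughly unit variance regardless of dimension, stabilising training.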
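A quick simulation makes the scaling argument concrete. It assumes query and key components are independent standard normal draws, which is an idealization of real activations; the helper name `dot_variances` and the sample count are ours.

```python
import numpy as np

def dot_variances(d_k, n=5000, seed=0):
    """Empirical variance of q.k before and after dividing by sqrt(d_k)."""
    rng = np.random.default_rng(seed)
    q = rng.normal(size=(n, d_k))
    k = rng.normal(size=(n, d_k))
    dots = (q * k).sum(axis=1)
    return dots.var(), (dots / np.sqrt(d_k)).var()

for d_k in (4, 64, 256):
    raw, scaled = dot_variances(d_k)
    # raw variance tracks d_k; scaled variance stays near 1
    print(d_k, round(float(raw), 1), round(float(scaled), 2))
```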

Why use cross-entropy loss instead of MSE for classification?

MSE treats class probabilities as continuous values and produces tiny, saturating gradients when a sigmoid output is near 0 or 1, stalling learning. Cross-entropy is the proper log-likelihood loss for categorical distributions; it keeps gradients large and informative even when the network is very wrong, and its minimum aligns with the true class probabilities.
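The saturation effect shows up in a one-unit example. The sketch assumes a single sigmoid output with true label 1 and a badly wrong logit of -10 (an arbitrary value), and compares the gradient with respect to the logit under each loss.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

z, y = -10.0, 1.0      # confidently wrong: predicts ~0 when the label is 1
p = sigmoid(z)

# d/dz of (p - y)^2: the sigmoid slope p * (1 - p) crushes the signal.
grad_mse = 2.0 * (p - y) * p * (1.0 - p)

# d/dz of -[y log p + (1 - y) log(1 - p)]: the slope cancels, leaving p - y.
grad_ce = p - y

print(grad_mse, grad_ce)   # roughly -9e-05 versus roughly -1
```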

What is the difference between temperature, top-k, and top-p sampling in LLMs?

Temperature rescales the logits before softmax — lower values sharpen the distribution toward the most likely token, higher values flatten it. Top-k restricts sampling to the k highest-probability tokens; top-p (nucleus sampling) restricts it to the smallest set of tokens whose cumulative probability reaches p. In practice top-p adapts the candidate pool dynamically while top-k uses a fixed count.
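The three controls compose in a fixed order: temperature first, then truncation, then sampling. Below is a minimal sketch assuming NumPy; `sample_next_token` and its parameters are hypothetical names, not any particular library's API, and real implementations differ in details such as tie handling.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Hypothetical helper: temperature, optional top-k / top-p, then sample."""
    logits = np.asarray(logits, dtype=float)
    if temperature == 0:                     # special-case greedy decoding
        return int(np.argmax(logits))
    if rng is None:
        rng = np.random.default_rng()
    probs = softmax(logits / temperature)
    order = np.argsort(probs)[::-1]          # token ids, most likely first
    n_keep = len(order)
    if top_k is not None:                    # fixed-size pool (expects top_k >= 1)
        n_keep = min(n_keep, top_k)
    if top_p is not None:                    # smallest prefix reaching top_p
        cum = np.cumsum(probs[order])
        n_keep = min(n_keep, int(np.searchsorted(cum, top_p)) + 1)
    kept = order[:n_keep]
    p = probs[kept] / probs[kept].sum()      # renormalize over the survivors
    return int(rng.choice(kept, p=p))

rng = np.random.default_rng(0)
logits = [2.0, 1.0, 0.1]
greedy = sample_next_token(logits, temperature=0)
nucleus = [sample_next_token(logits, top_p=0.7, rng=rng) for _ in range(50)]
print(greedy, sorted(set(nucleus)))
```

With these logits the probabilities are about 0.66, 0.24, 0.10, so top_p = 0.7 keeps the first two tokens and token 2 is never drawn.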
