What does softmax do, and why is it used in the output layer?

Softmax converts a vector of raw scores (logits) into a valid probability distribution — all values positive and summing to one — by exponentiating each score and normalising by the total. It is used in classification output layers because the resulting probabilities pair naturally with cross-entropy loss and allow confident predictions to dominate while preserving the relative ordering of logits.

Why do we scale the dot-product attention scores by the square root of d_k?

If query and key components have roughly zero mean and unit variance, their dot product has variance of about d_k, so for large d_k the scores grow large in magnitude and push softmax into saturated regions where gradients are tiny. Dividing by the square root of d_k brings the score variance back to about one, which stabilizes gradients and training.
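A quick numerical check of the variance argument, as a sketch only. It assumes query and key components are independent draws with zero mean and unit variance, which is an idealization rather than a property of any trained model; the helper name score_variance, the sample count, and the seed are all arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily

def score_variance(d_k, n=10000):
    """Empirical variance of q.k before and after dividing by sqrt(d_k)."""
    q = rng.standard_normal((n, d_k))
    k = rng.standard_normal((n, d_k))
    raw = (q * k).sum(axis=1)        # n independent dot products
    scaled = raw / np.sqrt(d_k)      # the attention scaling
    return raw.var(), scaled.var()

for d_k in (16, 64, 256):
    raw_var, scaled_var = score_variance(d_k)
    print(f"d_k={d_k:4d}  raw variance ~ {raw_var:7.1f}  scaled variance ~ {scaled_var:.2f}")
```

Under these assumptions the raw variance should track d_k while the scaled variance stays near one.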

How do temperature, top-k, and top-p sampling control LLM generation?

Temperature rescales the logits before softmax: low values sharpen the distribution toward greedy deterministic output and high values flatten it for more randomness. Top-k restricts sampling to the k most likely tokens, and top-p or nucleus sampling restricts it to the smallest set of tokens whose cumulative probability exceeds p, both trimming the unlikely tail.
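The three controls can be combined in one small sampler. This is a minimal sketch, not how any particular library implements decoding: sample_next_token is a hypothetical helper, it assumes temperature > 0 and top_k >= 1, and it keeps the top-p set up to the first token where the cumulative probability reaches p.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Sample one token id after temperature scaling and optional truncation.

    Hypothetical helper for illustration; real decoders differ in details
    such as tie handling and the order in which filters are applied.
    """
    if rng is None:
        rng = np.random.default_rng()
    z = np.asarray(logits, dtype=float) / temperature  # assumes temperature > 0
    e = np.exp(z - z.max())                            # subtract max for stability
    probs = e / e.sum()

    order = np.argsort(-probs)               # token ids, most likely first
    keep = np.ones(len(probs), dtype=bool)   # mask over the sorted positions

    if top_k is not None:
        keep[top_k:] = False                 # drop everything past the k best
    if top_p is not None:
        cum = np.cumsum(probs[order])
        # Keep tokens up to and including the first one where cum >= top_p.
        cutoff = int(np.searchsorted(cum, top_p)) + 1
        keep[cutoff:] = False

    kept_ids = order[keep]
    kept_probs = probs[kept_ids] / probs[kept_ids].sum()  # renormalize survivors
    return int(rng.choice(kept_ids, p=kept_probs))

rng = np.random.default_rng(0)  # arbitrary seed
print(sample_next_token([2.0, 1.0, 0.1], temperature=0.7, top_k=2, top_p=0.9, rng=rng))
```

With top_k=1 the sampler always returns the most likely token, which is greedy decoding by another route.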

Softmax — Deep Learning

The raw scores are called logits — the unnormalized outputs of the last linear layer. Softmax is the bridge from logits to a probability distribution: every output lands in the open interval between 0 and 1, and together they sum to exactly 1. One formula, three jobs across modern deep learning.

The formula

For a vector of logits z with K entries, the probability assigned to class i is:

softmax(z)_i = exp(z_i) / (exp(z_1) + exp(z_2) + ... + exp(z_K))

Exponentiate every logit, then divide by the total of all the exponentials.

Plug in the example: softmax([2.0, 1.0, 0.1]) gives [0.659, 0.242, 0.099], which sums to 1. Class 0 had the biggest logit, so it gets the biggest probability — softmax never reorders the classes.

Why exp, of all functions?

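You could imagine normalizing logits some other way, such as dividing each by the sum. Softmax uses exp for four reasons that all matter at once:

Positive. exp(z) is greater than 0 for every real z, so no probability is ever negative, even when the logits are.
Monotonic. A bigger logit always gets a bigger probability, so the ranking is preserved.
Amplifies gaps into ratios. A fixed difference between two logits becomes a fixed ratio between their probabilities.
Smooth. exp is differentiable everywhere, so gradients flow through it.

All four hold at once, which is why softmax, not plain normalization, is the standard.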

That last one is the deepest. A hard argmax — “just pick the biggest” — is flat almost everywhere, so its gradient is zero or undefined and a network can’t learn through it. Softmax is a smooth, differentiable stand-in. Its other name, softargmax, says it plainly: a soft, trainable version of argmax. Notice too that only the differences between logits matter — the ratio of two probabilities, p_i / p_j, equals exp(z_i - z_j). Add the same constant to every logit and nothing changes. Hold onto that fact.
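The two facts at the end of that paragraph are easy to check numerically. A minimal sketch: the softmax helper below is the plain formula with no stability trick, which is fine for logits this small, and the shift of 5.0 is an arbitrary choice.

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z)            # plain formula; safe here because the logits are small
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)

# The ratio of two probabilities depends only on the logit difference.
print(p[0] / p[1], np.exp(z[0] - z[1]))

# Adding the same constant to every logit leaves the output unchanged.
print(np.allclose(softmax(z + 5.0), p))
```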

See it across temperatures

The figure below is the whole lesson in one place: the same logits at three temperatures. The probability bars always sum to 100% — they sharpen as temperature drops and flatten as it rises.

The subtract-the-max trick

Here is the detail that separates a textbook formula from production code. exp(1000) overflows to infinity in floating point, and your softmax returns NaN. The fix exploits the fact you just learned: softmax is unchanged when you add a constant to every logit. So before exponentiating, subtract the largest logit from all of them:

softmax(z)_i = exp(z_i - m) / (exp(z_1 - m) + ... + exp(z_K - m)), where m = max(z_1, ..., z_K)

Not an approximation: an algebraic identity that prevents overflow.

The biggest logit becomes 0, so its exponential is exp(0) = 1 and the largest term can never overflow. Every other term is exp of a non-positive number, so it lies between 0 and 1; a very negative value may underflow to 0, which is harmless. The result is mathematically identical to the plain formula, up to ordinary floating-point rounding, just computed without blowing up. This “safe softmax” is what mainstream libraries typically do under the hood. Here it is in a few lines of NumPy:

import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()              # the stability trick — exact, prevents overflow
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.1]
p = softmax(logits)
print("probs:", np.round(p, 4))
print("sum:  ", p.sum())            # 1.0, up to floating-point rounding

# Big logits that would overflow a naive exp() — safe softmax handles them.
print("huge: ", np.round(softmax([1000.0, 1000.0, 1000.0]), 4))

probs: [0.659  0.2424 0.0986]
sum:   1.0
huge:  [0.3333 0.3333 0.3333]
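To see the failure that the trick prevents, it helps to run the plain formula next to the safe one on the same large logits. This is a sketch with two illustrative helpers, naive_softmax and safe_softmax, which are names invented here; np.errstate is used only to silence the overflow warnings that the naive version is expected to raise.

```python
import numpy as np

def naive_softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z)                # exp(1000) overflows to inf in float64
    return e / e.sum()           # inf / inf gives nan

def safe_softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())      # largest exponent is exp(0) = 1
    return e / e.sum()

big = [1000.0, 999.0, 998.0]
with np.errstate(over="ignore", invalid="ignore"):  # silence the expected warnings
    print("naive:", naive_softmax(big))
print("safe: ", safe_softmax(big))
```

The safe version returns the same distribution it would for [2.0, 1.0, 0.0], since only the differences between logits matter.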

Temperature — one knob from greedy to random

Divide the logits by a temperature T before softmax: softmax(z / T). It is a single dial over how peaked the distribution is.

Temperature	Effect	Limit
`T = 1`	Plain softmax	—
`T < 1`	Sharpens — more peaked, more confident/greedy	`T -> 0` becomes one-hot at the argmax
`T > 1`	Flattens — closer to uniform, more random	`T -> infinity` becomes uniform, 1/K each

On [2.0, 1.0, 0.1]: at T = 0.5 you get [0.86, 0.12, 0.02] (sharper); at T = 2 you get [0.50, 0.30, 0.19] (flatter). This is the LLM temperature knob: higher T gives more diverse, surprising output, and T = 0 conventionally means greedy decoding (always the top token). Since softmax(z / T) is undefined at T = 0, that setting is handled as a special case that takes the argmax rather than computed by dividing by zero. The mental model to keep: softmax is a temperature-controllable soft argmax. Crank T down and it hardens into argmax; crank it up and it melts into a uniform guess.

A common trap, worth saying out loud: higher temperature makes the model less confident, not more. People get this backwards constantly.
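One way to make "less confident" concrete is to watch the top probability fall and the entropy rise as T grows. This is a sketch: Shannon entropy is used here as one possible measure of spread and is not something the lesson itself introduces, and the entropy helper is an illustrative name.

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T   # assumes T > 0
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    """Shannon entropy in nats; higher means a flatter, less confident distribution."""
    p = p[p > 0]                         # skip zeros so the log stays finite
    return float(-(p * np.log(p)).sum())

logits = [2.0, 1.0, 0.1]
for T in (0.5, 1.0, 2.0):
    p = softmax(logits, T)
    print(f"T={T}: probs={np.round(p, 2)}  top={p.max():.2f}  entropy={entropy(p):.3f}")
```

The entropy can never exceed log(K), the value for a uniform distribution over K classes, which is the limit as T grows.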

Softmax vs sigmoid — they are not interchangeable

This is the confusion that bites people in code review. A sigmoid squashes one logit into one independent probability. Stack N sigmoids and each class is scored on its own — the outputs need not sum to 1. That is multi-label: an image can be both “outdoor” and “sunset” at once.

Softmax couples all the classes into one distribution whose entries compete and sum to 1. That is multi-class, single-label: exactly one answer is right, so the classes fight over a fixed budget of probability. Softmax is the multi-class generalization of logistic regression, and in the two-class case it collapses back to a sigmoid of the logit difference: softmax([z0, z1])_0 = sigmoid(z0 - z1). Pick by the question you are asking: “which one?” is softmax; “which ones?” is sigmoid.
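The contrast, and the two-class identity, can be checked in a few lines. A minimal sketch: the logits are the lesson's running example, and z0 and z1 are arbitrary values picked for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])

multi_label = sigmoid(logits)   # each class scored independently
multi_class = softmax(logits)   # classes compete for one unit of probability
print("sigmoids:", np.round(multi_label, 3), "sum =", round(float(multi_label.sum()), 3))
print("softmax: ", np.round(multi_class, 3), "sum =", round(float(multi_class.sum()), 3))

# Two-class softmax equals a sigmoid of the logit difference.
z0, z1 = 0.3, -1.2              # arbitrary example values
print(softmax([z0, z1])[0], sigmoid(z0 - z1))
```

The sigmoid outputs add up to more than 1 here, which is fine for multi-label scores and impossible for a softmax.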

Where softmax shows up

Three places, same operation:

Classifier output layer. Softmax over the final logits gives class probabilities, paired with cross-entropy. The softmax + cross-entropy combination has a beautifully clean gradient: predicted distribution minus the true one.
Attention. Scaled dot-product attention is softmax(Q Kᵀ / sqrt(d_k)) V. Softmax runs row-wise over the similarity scores so each query’s attention weights are non-negative and sum to 1 — a weighted average over the values. The 1/sqrt(d_k) scaling keeps the scores from growing so large that softmax saturates and its gradient vanishes. This is the Softmax block inside the transformer.
LLM sampling. The model emits a logit per vocabulary token; divide by temperature and softmax to get the next-token distribution you sample from, often after top-k or top-p truncation.
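The attention use can be sketched directly from the formula softmax(Q Kᵀ / sqrt(d_k)) V. This is a single-head illustration with no masking or batching; the function names, shapes, and seed are assumptions made for the example, not a description of any library's implementation.

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax using the subtract-the-max trick."""
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    weights = softmax_rows(Q @ K.T / np.sqrt(d_k))  # shape (n_queries, n_keys)
    return weights @ V, weights                     # weighted average of the values

rng = np.random.default_rng(0)        # arbitrary seed and shapes
Q = rng.standard_normal((4, 8))       # 4 queries, d_k = 8
K = rng.standard_normal((6, 8))       # 6 keys
V = rng.standard_normal((6, 5))       # 6 values of width 5
out, weights = attention(Q, K, V)
print(out.shape, weights.sum(axis=-1))
```

Each row of weights is one query's distribution over the keys, so every row is non-negative and sums to 1.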

In one breath

Softmax turns logits into a probability distribution: exponentiate each, divide by the sum, so every output is in (0,1) and they sum to 1.
exp is the right choice because it’s positive, monotonic (ranking preserved), amplifies gaps into ratios (p_i/p_j = e^(z_i−z_j)), and is smooth — a differentiable “soft argmax” you can train through.
Only differences between logits matter, which is why the subtract-the-max trick is exact, not approximate — it just stops exp from overflowing.
Temperature is one knob: below 1 sharpens toward the top class (greedy), above 1 flattens toward uniform (random) — higher T means less confident.
Softmax = one competing single-label distribution; stacked sigmoids = independent multi-label scores. It shows up in classifier heads, attention, and LLM next-token sampling.

Quick check

Q1. A teammate computes softmax by subtracting the max logit first, claiming it 'changes the numbers a little but avoids overflow.' What is wrong with that claim?

Q2. You raise an LLM's temperature from 0.7 to 1.5. What happens to the next-token distribution?

Q3. Transfer: you are building a photo tagger where one image can carry several tags at once ('beach', 'sunset', 'people' might all be true). Should the output layer use softmax or independent sigmoids, and why?

Softmax is the last step of the classifier; the loss it feeds is the engine of learning. Continue to loss functions to see why softmax and cross-entropy are an inseparable pair, or to sampling to see temperature drive an LLM’s word choices.
