Deep Learning · Easy · Asked at Google, Microsoft, Apple

What does softmax do, and why is it used in the output layer?

For: Data Scientist, ML Engineer, AI / LLM Engineer

The short answer

Softmax converts a vector of raw scores (logits) into a valid probability distribution — all values positive and summing to one — by exponentiating each score and normalising by the total. It is used in classification output layers because the resulting probabilities pair naturally with cross-entropy loss and allow confident predictions to dominate while preserving the relative ordering of logits.

How to think about it

Softmax is the canonical output activation for multi-class classification. It maps arbitrary real-valued logits to a point on the probability simplex: a vector of strictly positive entries that sum to 1.

The formula

For a logit vector z of length K:

softmax(z)_i = exp(z_i) / Σ_j exp(z_j),   j = 1, …, K

The exponential makes every value strictly positive; dividing by the sum makes the values add up to 1.
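As a minimal sketch, the formula can be computed directly in dependency-free Python. The function name `softmax` and the example logits are illustrative choices, and this version deliberately ignores numerical stability:

```python
import math

def softmax(z):
    """Softmax straight from the definition: exp(z_i) / sum_j exp(z_j)."""
    exps = [math.exp(v) for v in z]   # every term is strictly positive
    total = sum(exps)                 # normalising constant
    return [e / total for e in exps]  # entries now sum to 1

logits = [3.0, 1.0, 0.2]  # illustrative values
probs = softmax(logits)
print([round(p, 4) for p in probs])  # [0.836, 0.1131, 0.0508]
```

The ordering of the logits is preserved: the largest logit receives the largest probability.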

Why exponentiation?

Exponentiation amplifies differences between scores non-linearly. If class A has a logit 2 units higher than class B, it receives e² ≈ 7.4 times the probability of class B, regardless of the other logits. Large gaps therefore produce sharp, confident predictions, while small gaps leave the probabilities close together.
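A quick check of that ratio, written as a hedged sketch in plain Python (the three logit values are made up for illustration). It also shows that only the differences between logits matter: adding the same constant to every logit leaves the output unchanged.

```python
import math

def softmax(z):
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Class A (index 0) sits 2 logit units above class B (index 1).
p = softmax([2.0, 0.0, -1.0])
ratio = p[0] / p[1]  # equals exp(2.0 - 0.0); the third logit cancels out

# Shifting every logit by the same constant (+5 here) changes nothing.
p_shifted = softmax([7.0, 5.0, 4.0])

print(round(ratio, 3))  # 7.389
```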

Numerical stability in practice

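Naive exponentiation overflows for large logits. The standard fix is to subtract the maximum logit before exponentiating. This leaves the output mathematically unchanged but keeps every exponent at or below zero, so exp never overflows to inf. Library implementations such as PyTorch's F.softmax already do this: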

import torch
import torch.nn.functional as F

logits = torch.tensor([3.0, 1.0, 0.2])

# PyTorch handles stability internally
probs = F.softmax(logits, dim=-1)
# tensor([0.8360, 0.1131, 0.0508])

# For loss computation, skip the explicit softmax: nn.CrossEntropyLoss expects raw logits
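To make the max-subtraction trick concrete, here is a small sketch that compares a naive implementation with a shifted one on deliberately large logits. The helper names `naive_softmax` and `stable_softmax` and the input values are assumptions for illustration, not part of PyTorch. The last few lines also check that `F.cross_entropy` on raw logits matches the negative log-softmax of the true class.

```python
import torch
import torch.nn.functional as F

def naive_softmax(z):
    e = torch.exp(z)  # exp(1000.) overflows to inf in float32
    return e / e.sum(dim=-1, keepdim=True)

def stable_softmax(z):
    # Shift so the largest logit becomes 0; every exponent is then <= 0.
    shifted = z - z.max(dim=-1, keepdim=True).values
    e = torch.exp(shifted)
    return e / e.sum(dim=-1, keepdim=True)

big = torch.tensor([1000.0, 998.0, 997.2])  # hypothetical large logits
print(naive_softmax(big))   # nan entries, because inf / inf is undefined
print(stable_softmax(big))  # finite values that sum to 1

# Loss from raw logits: F.cross_entropy applies log-softmax internally.
logits = torch.tensor([[3.0, 1.0, 0.2]])  # shape (batch=1, classes=3)
target = torch.tensor([0])                # index of the true class
loss = F.cross_entropy(logits, target)
manual = -F.log_softmax(logits, dim=-1)[0, 0]
```

Passing softmax outputs into a cross-entropy loss that expects logits would apply the normalisation twice, which is why the loss is computed from the raw scores.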

Temperature scaling

Dividing logits by a temperature T before softmax sharpens (T < 1) or flattens (T > 1) the distribution. This is used in knowledge distillation and language-model sampling.
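The effect of temperature can be seen in a short dependency-free sketch. The function name `softmax_with_temperature`, the logits, and the chosen values of T are illustrative assumptions; T is assumed to be positive.

```python
import math

def softmax_with_temperature(z, T=1.0):
    """Softmax of z / T, with the max subtracted for stability. Assumes T > 0."""
    scaled = [v / T for v in z]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [3.0, 1.0, 0.2]
sharp = softmax_with_temperature(logits, T=0.5)  # T < 1: more peaked
base = softmax_with_temperature(logits, T=1.0)   # T = 1: ordinary softmax
flat = softmax_with_temperature(logits, T=5.0)   # T > 1: closer to uniform

for name, p in [("T=0.5", sharp), ("T=1.0", base), ("T=5.0", flat)]:
    print(name, [round(x, 3) for x in p])
```

The top class keeps its rank at every temperature; only the amount of probability mass it holds changes.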
