Entropy & information theory
Entropy measures surprise — the floor on how many bits it takes to encode outcomes. From it flow cross-entropy (your classification loss), KL divergence, decision-tree splits, and mutual information. One idea, an enormous amount of ML.
What you'll learn
- Self-information (surprise) = −log p, and entropy as average surprise
- Why entropy is maximized by a uniform distribution and zero at certainty
- Cross-entropy: the cost of encoding truth p with a wrong model q — your loss function
- KL divergence and mutual information, built from the same pieces
- Where it appears — cross-entropy loss, decision-tree information gain, VAEs, feature selection
Before you start
How surprised should you be by an outcome? A coin landing heads — barely. Winning the lottery — enormously. Information theory turns that intuition into numbers, and those numbers quietly define cross-entropy loss, decision trees, and how we compare distributions.
Surprise, then entropy
The self-information (surprise) of an outcome with probability p is:
surprise = −log p
Rare events (small p) are surprising (large −log p); certain events
(p = 1) carry zero surprise. Entropy is the average surprise over a
distribution:
H(p) = −Σᵢ pᵢ log pᵢ
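With log₂ it is measured in bits: a lower bound on the average number of
yes/no questions (equivalently, bits) needed to pin down an outcome, even
with the best possible code. With the natural log the unit is nats. Outcomes
with pᵢ = 0 contribute nothing, by the convention 0 log 0 = 0.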
Entropy is maximized by the uniform distribution (everything equally likely → most uncertainty) and zero when one outcome is certain (no surprise possible).
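To make the two formulas concrete, here is a minimal sketch in plain Python. The function names `surprise` and `entropy` and the example distributions are illustrative choices, not part of the lesson; the `base` argument picks the unit (2 for bits).

```python
import math

def surprise(p, base=2.0):
    """Self-information -log(p) of an outcome with probability p."""
    return -math.log(p) / math.log(base)

def entropy(probs, base=2.0):
    """Average surprise; zero-probability outcomes are skipped (0 log 0 = 0)."""
    return sum(p * surprise(p, base) for p in probs if p > 0)

# Hypothetical example distributions.
fair_coin = [0.5, 0.5]
biased_coin = [0.9, 0.1]
certain = [1.0, 0.0]
uniform_four = [0.25] * 4

print(entropy(fair_coin))     # 1 bit: the maximum for two outcomes
print(entropy(biased_coin))   # about 0.469 bits: less uncertainty
print(entropy(certain))       # 0 bits: no surprise possible
print(entropy(uniform_four))  # 2 bits: log2(4) for four equally likely outcomes
```

The biased coin sits between the two extremes: the further a distribution moves from uniform, the lower its entropy.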
Cross-entropy: encoding truth with the wrong model
If the true distribution is p but you encode using your model q, your
average cost is the cross-entropy:
H(p, q) = −Σᵢ pᵢ log qᵢ
For a fixed p, it is minimized over q exactly when q = p, where it equals
the entropy H(p). Sound familiar? When p is the true one-hot label and q is
your model's predicted probabilities, the sum collapses to −log q of the
true class, and H(p, q) is the cross-entropy loss: the same object the MLE
lesson derived from likelihood, now from an information angle.
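A short sketch of this, assuming natural logs (nats) as most ML libraries use. The helper `cross_entropy` and the distributions are made up for illustration.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log(q_i), in nats; terms with p_i == 0 are skipped."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical true distribution and two candidate models.
p_true = [0.7, 0.2, 0.1]
q_good = [0.6, 0.3, 0.1]
q_bad = [0.1, 0.2, 0.7]

h_p = cross_entropy(p_true, p_true)     # H(p, p) is just the entropy H(p)
h_good = cross_entropy(p_true, q_good)  # slightly above H(p)
h_bad = cross_entropy(p_true, q_bad)    # far above H(p)
print(h_p, h_good, h_bad)

# With a one-hot label, only the true class's term survives.
one_hot = [0.0, 1.0, 0.0]
predicted = [0.2, 0.7, 0.1]
loss = cross_entropy(one_hot, predicted)
print(loss, -math.log(0.7))  # the two values match
```

The closer q is to p, the closer the cost gets to the entropy floor; it never drops below it.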
KL divergence & mutual information
- KL divergence D(p‖q) = H(p, q) − H(p) ≥ 0 is the extra bits you pay for using the wrong model; it is covered in depth in its own lesson.
- Mutual information I(X;Y) is how much knowing X reduces your uncertainty about Y. It is zero iff they are independent, and it is the basis of information-gain feature selection and a lot of representation learning.
Cross-entropy is always at least the entropy; the gap is the KL divergence — literally the bits wasted by a wrong model.
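Both quantities can be sketched in a few lines. The code below assumes discrete distributions given as lists and a joint distribution given as a nested list `joint[x][y]`; the function names and example tables are hypothetical. Mutual information is computed as the KL divergence between the joint distribution and the product of its marginals, which is a standard equivalent definition.

```python
import math

def kl_divergence(p, q):
    """D(p || q) in bits; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mutual_information(joint):
    """I(X;Y) in bits from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]        # marginal of X
    py = [sum(col) for col in zip(*joint)]  # marginal of Y
    return sum(
        pxy * math.log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

# Extra bits paid for modelling a fair coin as a 90/10 coin.
print(kl_divergence([0.5, 0.5], [0.9, 0.1]))  # about 0.737 bits
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # 0: no penalty for the right model

# Hypothetical joint tables for two binary variables.
independent = [[0.25, 0.25], [0.25, 0.25]]  # X tells you nothing about Y
copy = [[0.5, 0.0], [0.0, 0.5]]             # Y is an exact copy of X
print(mutual_information(independent))  # 0 bits
print(mutual_information(copy))         # 1 bit: X removes all uncertainty about Y
```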
Where this lives in ML
- Cross-entropy loss — the default classification loss, top to bottom.
- Decision trees split on the feature that maximizes information gain (the drop in entropy).
- KL divergence regularizes VAEs and shapes RLHF / policy updates.
- Mutual information drives feature selection and self-supervised objectives.
- Perplexity, the language-model metric, is 2^(cross-entropy) when cross-entropy is measured in bits (equivalently e^(cross-entropy) in nats).
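Two of these uses are easy to sketch directly. The code below is a toy illustration, not how any particular library implements them: `information_gain` scores a candidate split by the drop in label entropy, and `perplexity` takes the probabilities a model assigned to the tokens that actually occurred. All names and data are hypothetical.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Entropy in bits of the empirical class distribution of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted = sum(len(ch) / n * label_entropy(ch) for ch in children)
    return label_entropy(parent) - weighted

def perplexity(token_probs):
    """2 ** (average cross-entropy in bits) over the observed tokens."""
    bits = -sum(math.log2(p) for p in token_probs) / len(token_probs)
    return 2 ** bits

parent = ["yes"] * 5 + ["no"] * 5
clean_split = [["yes"] * 5, ["no"] * 5]  # each child is pure
useless_split = [["yes", "no"] * 3, ["yes", "no"] * 2]  # children mirror the parent

print(information_gain(parent, clean_split))    # 1 bit: all uncertainty removed
print(information_gain(parent, useless_split))  # 0 bits: nothing learned

# A model that gives every observed token probability 1/4 is as uncertain
# as a uniform choice among 4 options.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4
```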
Quick check
Practice this in an interview
Information gain measures how much a split reduces uncertainty (entropy) in the target variable. It is the difference between the parent node's entropy and the weighted average entropy of the child nodes. The split that maximises information gain is selected at each node.
Gini impurity and entropy both measure node impurity but differ in computation and sensitivity. Gini is faster to compute because it needs no logarithm, and it slightly favors larger partitions, while entropy (information gain) is more sensitive to class probability changes near 0.5. In practice the splits they produce are nearly identical.
MSE treats class probabilities as continuous values and produces tiny, saturating gradients when a sigmoid output is near 0 or 1, stalling learning. Cross-entropy is the proper log-likelihood loss for categorical distributions; it keeps gradients large and informative even when the network is very wrong, and its minimum aligns with the true class probabilities.
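The saturation effect can be checked numerically. This is a minimal sketch for a single sigmoid output with logit z and label y; the function names are illustrative, and the gradients are the standard closed forms with respect to z.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_mse(z, y):
    """d/dz of (sigmoid(z) - y)**2: carries the saturating factor s * (1 - s)."""
    s = sigmoid(z)
    return 2.0 * (s - y) * s * (1.0 - s)

def grad_cross_entropy(z, y):
    """d/dz of -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(z): simply s - y."""
    return sigmoid(z) - y

# A confidently wrong prediction: the label is 1, the logit is very negative.
z, y = -8.0, 1.0
print(grad_mse(z, y))            # about -0.00067: almost no learning signal
print(grad_cross_entropy(z, y))  # about -0.9997: a strong push toward the label
```

The s(1 − s) factor in the MSE gradient shrinks toward zero exactly when the sigmoid saturates; with cross-entropy that factor cancels, leaving the plain error s − y.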
Log loss (cross-entropy loss) measures how well a model's predicted probabilities match the true labels: it is the negative log-likelihood of the correct class. It penalises confident wrong predictions severely because −log(p) grows without bound as the probability p assigned to the correct class approaches zero. In the binary case, predicting 0.99 for the wrong class leaves 0.01 for the correct one and costs −ln(0.01) ≈ 4.6, about five times the ≈ 0.92 cost of predicting 0.6 for the wrong class. A perfect model achieves 0; a binary classifier that always predicts 0.5 achieves ln(2) ≈ 0.693.
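A few per-example values make the shape of the penalty easier to see. This sketch uses natural logs and a hypothetical helper `log_loss` that takes the probability the model assigned to the correct class.

```python
import math

def log_loss(p_correct):
    """Per-example log loss: negative log of the probability given to the correct class."""
    return -math.log(p_correct)

confident_wrong = log_loss(0.01)  # 0.99 on the wrong class: about 4.605
mild_wrong = log_loss(0.40)       # 0.60 on the wrong class: about 0.916
coin_flip = log_loss(0.50)        # always predicting 0.5: ln 2, about 0.693
confident_right = log_loss(0.99)  # about 0.010

print(confident_wrong, mild_wrong, coin_flip, confident_right)
```

The loss rises slowly while the correct class keeps a moderate probability and steeply once that probability approaches zero.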