What is information gain and how does it relate to entropy in a decision tree split?

Information gain measures how much a split reduces uncertainty (entropy) in the target variable. It is the difference between the parent node's entropy and the weighted average entropy of the child nodes. The split that maximises information gain is selected at each node.
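The definition above can be sketched in a few lines. This is a minimal illustration, not a tree implementation: the helper names `entropy_bits` and `information_gain` and the toy labels are assumptions made for the example.

```python
import numpy as np

def entropy_bits(labels):
    """Shannon entropy (in bits) of the empirical class distribution."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(np.sum(probs * np.log2(1.0 / probs)))

def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted = sum(len(c) / n * entropy_bits(c) for c in children)
    return entropy_bits(parent) - weighted

# Hypothetical toy node: five positives followed by five negatives.
parent = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
left, right = parent[:4], parent[4:]   # left: four 1s; right: one 1 and five 0s

gain = information_gain(parent, [left, right])
print(f"parent entropy  : {entropy_bits(parent):.3f} bits")
print(f"information gain: {gain:.3f} bits")
```

On this toy split the parent holds exactly 1 bit of uncertainty and the split recovers about 0.61 bits of it; a tree learner would compare this figure across candidate splits and keep the largest.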

What is the difference between Gini impurity and entropy as splitting criteria in decision trees?

Both measure node impurity but differ in computation and sensitivity. Gini avoids logarithms, so it is faster to compute, and it slightly favors larger partitions, while entropy (information gain) changes more steeply near pure nodes, where a class probability is close to 0 or 1. In practice the splits they produce are nearly identical.
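To see how similar the two criteria are, one can tabulate both for a binary node. The sketch below assumes a two-class node described only by its positive-class probability p; the function names are illustrative.

```python
import numpy as np

def gini(p):
    """Gini impurity of a binary node with positive-class probability p."""
    return 2.0 * p * (1.0 - p)

def entropy_bits(p):
    """Binary entropy in bits; clipping avoids log2(0) at pure nodes."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

ps = np.array([0.01, 0.1, 0.3, 0.5])
for p, g, h in zip(ps, gini(ps), entropy_bits(ps)):
    print(f"p={p:.2f}  gini={g:.3f}  entropy={h:.3f} bits")
```

Both curves vanish at a pure node and peak at p = 0.5 (Gini at 0.5, entropy at 1 bit), which is why they usually rank candidate splits in the same order.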

Why use cross-entropy loss instead of MSE for classification?

MSE treats class probabilities as continuous values and produces tiny, saturating gradients when a sigmoid output is near 0 or 1, stalling learning. Cross-entropy is the proper log-likelihood loss for categorical distributions; it keeps gradients large and informative even when the network is very wrong, and its minimum aligns with the true class probabilities.
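The saturation argument can be checked numerically. The sketch below assumes a single sigmoid output with logit z and true label y = 1, and compares the gradient of each loss with respect to z; the function names are made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_mse(z, y):
    """d/dz of 0.5 * (sigmoid(z) - y)**2: includes the saturating factor a * (1 - a)."""
    a = sigmoid(z)
    return (a - y) * a * (1.0 - a)

def grad_ce(z, y):
    """d/dz of binary cross-entropy with a sigmoid output: simply a - y."""
    return sigmoid(z) - y

y = 1.0
for z in (-10.0, -2.0, 0.0):
    print(f"z={z:+.0f}  mse grad={grad_mse(z, y):+.6f}  ce grad={grad_ce(z, y):+.6f}")
```

At z = -10 the model is confidently wrong, yet the MSE gradient is nearly zero because of the a(1 - a) factor, while the cross-entropy gradient stays close to -1.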

What is log loss and why does it penalise confident wrong predictions more than uncertain ones?

Log loss (cross-entropy loss) measures how well a model's predicted probabilities match the true labels: it is the negative log-likelihood of the correct class. It penalises confident wrong predictions severely because −log(p) grows without bound as the probability p assigned to the correct class approaches zero. Predicting 0.99 for the wrong class leaves 0.01 for the correct one and costs −ln(0.01) ≈ 4.61, about five times the −ln(0.4) ≈ 0.92 incurred by predicting 0.6 for the wrong class. A perfect model achieves 0; a binary classifier that always predicts 0.5 achieves ln(2) ≈ 0.693.
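The per-example penalties are easy to compute directly. This is a small sketch using natural logarithms; `log_loss_single` is a hypothetical helper, not a library function.

```python
import math

def log_loss_single(p_correct):
    """Negative log-likelihood (in nats) of the probability given to the correct class."""
    return -math.log(p_correct)

confident_wrong = log_loss_single(0.01)   # 0.99 placed on the wrong class
unsure_wrong = log_loss_single(0.40)      # 0.60 placed on the wrong class
coin_flip = log_loss_single(0.50)         # always predicting 0.5

print(f"confident wrong: {confident_wrong:.3f}")
print(f"unsure wrong   : {unsure_wrong:.3f}")
print(f"ratio          : {confident_wrong / unsure_wrong:.2f}")
print(f"coin flip      : {coin_flip:.3f}")
```

The confident mistake costs about 4.61 nats against 0.92 for the hesitant one, and the penalty keeps growing without limit as the probability on the correct class shrinks toward zero.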

Entropy & information theory — Math for ML

Entropy & information theory

Entropy measures surprise — the floor on how many bits it takes to encode outcomes. From it flow cross-entropy (your classification loss), KL divergence, decision-tree splits, and mutual information. One idea, an enormous amount of ML.

8 min read, Intermediate, Math for ML, Lesson 32 of 37

What you'll learn

Self-information (surprise) = −log p, and entropy as average surprise

Why entropy is maximized by a uniform distribution and zero at certainty

Cross-entropy: the cost of encoding truth p with a wrong model q — your loss function

KL divergence and mutual information, built from the same pieces

Where it appears — cross-entropy loss, decision-tree information gain, VAEs, feature selection

The last lesson left us holding a loaded word. Maximum likelihood maximised Σ log p(xᵢ|θ), and we noticed that its negative, −log p, behaves like surprise: huge when the model is shocked by an outcome, near zero when the outcome was expected. Cross-entropy — the classification loss we derived — wore that idea on its sleeve. So it is time to take the word seriously and turn it into numbers, because those numbers quietly define cross-entropy loss, decision-tree splits, and how we compare any two distributions at all.

Start with the intuition. A coin landing heads — barely surprising. Rain in a rainforest — no news. Winning the lottery — astonishing. The less probable the outcome, the more information its occurrence carries. Information theory is just that instinct, made precise.

Surprise, then entropy

The self-information (surprise) of an outcome with probability p is:

surprise = −log p

A rare event (small p) is very surprising (large −log p); a certain event (p = 1) carries exactly zero surprise — −log 1 = 0. That is the same −log p the likelihood lesson minimised, now named for what it is. Entropy lifts surprise from a single outcome to a whole distribution by taking its average:

H(p) = −Σᵢ pᵢ log pᵢ

Measured with log₂, the unit is bits — the average number of yes/no questions, and the hard floor on the bits, needed to pin down an outcome drawn from p. Reshape the distribution below and watch the bit count move.

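[Interactive figure: adjustable bar chart of a distribution over four outcomes A, B, C, D, with a live entropy readout. Example setting: bars at 67%, 19%, 10% (fourth value not recovered), entropy 1.38 of a maximum 2 bits. The more uneven the distribution, the lower the entropy.]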

The answer the bars reveal: entropy is maximised by the uniform distribution, because when everything is equally likely you are maximally uncertain and the average surprise is as large as it can be. It is zero when one outcome is certain, because then nothing can surprise you. Uncertainty and information are the same quantity seen from two sides.
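The two extremes can be checked on a four-outcome source. The distributions below are illustrative choices, and `entropy_bits` is a helper written for this sketch.

```python
import numpy as np

def entropy_bits(p):
    """Entropy in bits; zero-probability outcomes contribute nothing."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(np.sum(nz * np.log2(1.0 / nz)))

uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.70, 0.10, 0.10, 0.10]
certain = [1.0, 0.0, 0.0, 0.0]

for name, dist in [("uniform", uniform), ("skewed", skewed), ("certain", certain)]:
    print(f"{name:8s} H = {entropy_bits(dist):.3f} bits")
```

The uniform case reaches the ceiling of log2(4) = 2 bits, the certain case gives 0, and any uneven distribution lands in between.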

Cross-entropy: encoding truth with the wrong model

Now suppose the true distribution is p, but you build your code — your model — around a wrong distribution q. The average cost of describing reality with the wrong model is the cross-entropy:

H(p, q) = −Σᵢ pᵢ log qᵢ

It is minimised exactly when q = p. Does that shape look familiar? Let p be the true one-hot label and q your model’s predicted probabilities, and H(p, q) is the cross-entropy loss — the very object the maximum-likelihood lesson derived from the Bernoulli/categorical negative log-likelihood. Two roads, one destination: training a classifier to minimise cross-entropy is maximising the likelihood of the labels, is shrinking the surprise the model feels at the truth.
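With a one-hot label, every term of the sum except one drops out, so the cross-entropy reduces to the negative log of the probability the model gave the true class. A minimal sketch, assuming a three-class problem and a made-up prediction vector:

```python
import numpy as np

def cross_entropy_nats(p, q):
    """H(p, q) = -sum_i p_i log q_i, skipping terms where p_i = 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(-np.sum(p[mask] * np.log(q[mask])))

label = np.array([0.0, 1.0, 0.0])   # one-hot truth: class 1
pred = np.array([0.2, 0.7, 0.1])    # hypothetical model output

loss = cross_entropy_nats(label, pred)
nll = float(-np.log(pred[1]))       # negative log-likelihood of the true class

print(f"cross-entropy: {loss:.4f}")
print(f"-log q[true] : {nll:.4f}")
```

Both numbers are -ln(0.7), roughly 0.357, which is the sense in which minimising cross-entropy and maximising the likelihood of the labels are the same procedure.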

KL divergence & mutual information

Two more quantities fall straight out of the same pieces:

KL divergence D(p‖q) = H(p,q) − H(p) ≥ 0 is the extra bits you pay for using the wrong model q instead of the truth p — cross-entropy minus the irreducible entropy floor. (Its own lesson is next.)
Mutual information I(X;Y) is how many bits knowing X shaves off your uncertainty about Y. It is zero if and only if X and Y are independent — and it is the engine of information-gain feature selection and a great deal of representation learning.
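Mutual information can be computed from a joint probability table as the KL divergence between the joint and the product of its marginals. The sketch below assumes two binary variables and two hand-built joint tables; `mutual_information_bits` is an illustrative helper.

```python
import numpy as np

def mutual_information_bits(joint):
    """I(X;Y) = sum over x, y of p(x,y) * log2( p(x,y) / (p(x) p(y)) )."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal of X (rows)
    py = joint.sum(axis=0, keepdims=True)   # marginal of Y (columns)
    mask = joint > 0                        # zero cells contribute nothing
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (px * py)[mask])))

independent = np.outer([0.5, 0.5], [0.3, 0.7])   # p(x, y) = p(x) p(y)
copied = np.array([[0.5, 0.0], [0.0, 0.5]])      # Y is an exact copy of X

mi_independent = mutual_information_bits(independent)
mi_copied = mutual_information_bits(copied)
print(f"independent: {mi_independent:.3f} bits")
print(f"copied     : {mi_copied:.3f} bits")
```

Independent variables share no information, while a fair binary variable and its exact copy share the full 1 bit.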

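import numpy as np

EPS = 1e-12  # guards against log2(0) when a probability is exactly zero

def H(p):     return -np.sum(p * np.log2(p + EPS))   # entropy of p, in bits
def CE(p, q): return -np.sum(p * np.log2(q + EPS))   # cross-entropy H(p, q), in bits
def KL(p, q): return CE(p, q) - H(p)                 # KL divergence D(p || q), in bits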

p = np.array([0.7, 0.2, 0.1])          # truth
q = np.array([0.5, 0.3, 0.2])          # model
print("entropy H(p)      :", round(H(p), 3), "bits")
print("cross-entropy H(p,q):", round(CE(p,q), 3), "bits  (≥ H(p))")
print("KL(p||q)          :", round(KL(p,q), 3), "extra bits")
print("perfect model q=p :", round(CE(p,p), 3), "= H(p), KL = 0")

entropy H(p)      : 1.157 bits
cross-entropy H(p,q): 1.28 bits  (≥ H(p))
KL(p||q)          : 0.123 extra bits
perfect model q=p : 1.157 = H(p), KL = 0

Read the arithmetic across: the truth p needs 1.157 bits to encode; using the wrong model q costs 1.28 bits; and the gap, 0.123 bits, is precisely the KL divergence — the bits wasted by the mismatch. Match the model to the truth (q = p) and the cross-entropy collapses back to the entropy, the waste vanishes, and KL hits zero. Cross-entropy is always at least the entropy; the slack is always the KL.

Where this lives in ML

Cross-entropy loss — the default classification objective, top to bottom.
Decision trees split on the feature with the highest information gain — the largest drop in entropy from parent to children.
KL divergence regularises VAEs and shapes RLHF / policy updates.
Mutual information drives feature selection and self-supervised objectives.
Perplexity, the language-model metric, is 2^(cross-entropy) when the cross-entropy is measured in bits (equivalently e^(cross-entropy) when it is measured in nats).
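The perplexity relation is a one-liner once the per-token cross-entropy is known. A small sketch, assuming made-up probabilities that a language model assigned to the tokens that actually occurred:

```python
import numpy as np

# Hypothetical probabilities assigned to each observed token.
token_probs = np.array([0.25, 0.5, 0.125, 0.5])

ce_bits = float(-np.mean(np.log2(token_probs)))   # average cross-entropy in bits
ce_nats = float(-np.mean(np.log(token_probs)))    # the same quantity in nats

perplexity_from_bits = 2.0 ** ce_bits
perplexity_from_nats = float(np.exp(ce_nats))

print(f"cross-entropy: {ce_bits:.3f} bits = {ce_nats:.3f} nats")
print(f"perplexity   : {perplexity_from_bits:.3f}")
```

The base of the exponent has to match the base of the logarithm: 2 for bits, e for nats. Either way the perplexity is the same number, here the inverse geometric mean of the token probabilities.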

In one breath

Self-information −log p is the surprise of an outcome (rare = surprising, certain = zero), and entropy H(p) = −Σ pᵢ log pᵢ is its average — in bits with log₂, the floor on how many yes/no questions encode the source. Entropy is maximal for the uniform distribution and zero at certainty. Cross-entropy H(p,q) = −Σ pᵢ log qᵢ is the cost of encoding truth p with model q, minimised at q = p — and with p a one-hot label it is the classification cross-entropy loss, the same negative log-likelihood MLE derived. KL divergence D(p‖q) = H(p,q) − H(p) ≥ 0 is the extra bits wasted by the wrong model (the demo: 1.28 − 1.157 = 0.123), and mutual information is the bits X reveals about Y. The same currency pays for cross-entropy loss, decision-tree information gain, VAE/RLHF regularisation, and perplexity 2^(cross-entropy).

Practice

Quick check

Q1. Which distribution over 4 outcomes has the highest entropy?

Q2. Cross-entropy H(p,q) is minimized when…

Q3. A decision tree chooses a split to maximize information gain. What does that mean?

A question to carry forward

One quantity in this lesson kept appearing as a remainder. Cross-entropy was always at least the entropy, and the leftover — H(p,q) − H(p), the bits wasted by using the wrong model — we named KL divergence, printed it (0.123 bits), and then waved at “its own lesson.” That remainder deserves to be the main character, because it is the single most common way machine learning measures the distance between two distributions.

So here is the thread onward. What exactly is D(p‖q), and why is it always ≥ 0 (with equality only when the distributions match)? Why is it pointedly not symmetric — why D(p‖q) ≠ D(q‖p), and which direction should you minimise when you fit a model? And how does this one asymmetric, information-flavoured “divergence” end up as the regulariser inside a VAE, the leash in RLHF, and the very thing variational inference quietly optimises?
