What is information gain and how does it relate to entropy in a decision tree split?
Information gain measures how much a split reduces uncertainty (entropy) in the target variable. It is the difference between the parent node's entropy and the weighted average entropy of the child nodes. The split that maximises information gain is selected at each node.
How to think about it
Entropy recap
For a node containing samples from K classes with proportions p_1, …, p_K:
H(node) = -Σ p_k · log₂(p_k)
H = 0 for a pure node (one class only); H = log₂(K) for maximum disorder (equal proportions).
Information gain
A split on feature X divides the node into child sets L and R with sizes n_L and n_R (n = n_L + n_R):
IG(split) = H(parent) - [n_L/n · H(L) + n_R/n · H(R)]
The algorithm selects the split with the highest IG. Equivalently, this is the mutual information between X and Y restricted to this node.
Worked example (binary classification)
Parent: 10 positive, 10 negative → H = 1.0 bit
After split:
- Left child: 8 positive, 2 negative → H ≈ 0.722
- Right child: 2 positive, 8 negative → H ≈ 0.722
IG = 1.0 - [0.5 · 0.722 + 0.5 · 0.722] = 1.0 - 0.722 = 0.278 bits
import numpy as np
def entropy(y):
classes, counts = np.unique(y, return_counts=True)
p = counts / counts.sum()
return -np.sum(p * np.log2(p + 1e-12))
def information_gain(y_parent, y_left, y_right):
n = len(y_parent)
weighted_child = (len(y_left)/n * entropy(y_left) +
len(y_right)/n * entropy(y_right))
return entropy(y_parent) - weighted_child
Gain ratio (C4.5)
Information gain is biased toward features with many distinct values (a unique-ID feature would have IG near 1.0 and pure leaves but zero predictive value). The gain ratio divides IG by the split’s own entropy:
GainRatio = IG(X) / H(X)
This penalises splits that fragment data into many tiny partitions. sklearn does not use gain ratio but the bias is instead addressed in random forests through feature subsampling.