datarekha
Machine Learning Easy Asked at GoogleAsked at AmazonAsked at Microsoft

What is information gain and how does it relate to entropy in a decision tree split?

The short answer

Information gain measures how much a split reduces uncertainty (entropy) in the target variable. It is the difference between the parent node's entropy and the weighted average entropy of the child nodes. The split that maximises information gain is selected at each node.

How to think about it

Entropy recap

For a node containing samples from K classes with proportions p_1, …, p_K:

H(node) = -Σ p_k · log₂(p_k)

H = 0 for a pure node (one class only); H = log₂(K) for maximum disorder (equal proportions).

Information gain

A split on feature X divides the node into child sets L and R with sizes n_L and n_R (n = n_L + n_R):

IG(split) = H(parent) - [n_L/n · H(L) + n_R/n · H(R)]

The algorithm selects the split with the highest IG. Equivalently, this is the mutual information between X and Y restricted to this node.

Worked example (binary classification)

Parent: 10 positive, 10 negative → H = 1.0 bit

After split:

  • Left child: 8 positive, 2 negative → H ≈ 0.722
  • Right child: 2 positive, 8 negative → H ≈ 0.722
IG = 1.0 - [0.5 · 0.722 + 0.5 · 0.722] = 1.0 - 0.722 = 0.278 bits
import numpy as np

def entropy(y):
    classes, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p + 1e-12))

def information_gain(y_parent, y_left, y_right):
    n = len(y_parent)
    weighted_child = (len(y_left)/n * entropy(y_left) +
                      len(y_right)/n * entropy(y_right))
    return entropy(y_parent) - weighted_child

Gain ratio (C4.5)

Information gain is biased toward features with many distinct values (a unique-ID feature would have IG near 1.0 and pure leaves but zero predictive value). The gain ratio divides IG by the split’s own entropy:

GainRatio = IG(X) / H(X)

This penalises splits that fragment data into many tiny partitions. sklearn does not use gain ratio but the bias is instead addressed in random forests through feature subsampling.

Learn it properly Decision trees

Keep practising

All Machine Learning questions

Explore further

Skip to content