Machine Learning Easy Asked at GoogleAsked at AmazonAsked at Microsoft

What is information gain and how does it relate to entropy in a decision tree split?

The short answer

Information gain measures how much a split reduces uncertainty (entropy) in the target variable. It is the difference between the parent node's entropy and the weighted average entropy of the child nodes. The split that maximises information gain is selected at each node.

How to think about it

Entropy recap

For a node containing samples from K classes with proportions p_1, …, p_K:

H(node) = -Σ p_k · log₂(p_k)

H = 0 for a pure node (one class only); H = log₂(K) for maximum disorder (equal proportions).

Information gain

A split on feature X divides the node into child sets L and R with sizes n_L and n_R (n = n_L + n_R):

IG(split) = H(parent) - [n_L/n · H(L) + n_R/n · H(R)]

The algorithm selects the split with the highest IG. Equivalently, this is the mutual information between X and Y restricted to this node.

Worked example (binary classification)

Parent: 10 positive, 10 negative → H = 1.0 bit

After split:

Left child: 8 positive, 2 negative → H ≈ 0.722
Right child: 2 positive, 8 negative → H ≈ 0.722

IG = 1.0 - [0.5 · 0.722 + 0.5 · 0.722] = 1.0 - 0.722 = 0.278 bits

import numpy as np

def entropy(y):
    classes, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p + 1e-12))

def information_gain(y_parent, y_left, y_right):
    n = len(y_parent)
    weighted_child = (len(y_left)/n * entropy(y_left) +
                      len(y_right)/n * entropy(y_right))
    return entropy(y_parent) - weighted_child

Gain ratio (C4.5)

Information gain is biased toward features with many distinct values (a unique-ID feature would have IG near 1.0 and pure leaves but zero predictive value). The gain ratio divides IG by the split’s own entropy:

GainRatio = IG(X) / H(X)

This penalises splits that fragment data into many tiny partitions. sklearn does not use gain ratio but the bias is instead addressed in random forests through feature subsampling.

Learn it properly Decision trees

What is information gain and how does it relate to entropy in a decision tree split?

Keep practising

Explore further