What is the difference between Gini impurity and entropy as splitting criteria in decision trees?
Both measure node impurity but differ in computation and sensitivity. Gini is cheaper to compute and slightly favors larger partitions, while entropy (the basis of information gain) reacts more strongly to class probabilities near 0 or 1, that is, to rare classes. In practice the splits they produce are nearly identical.
How to think about it
Both Gini impurity and entropy quantify how mixed the class labels are in a node. The tree algorithm exhaustively tries every feature and candidate threshold, picks the split that minimises the size-weighted impurity of the child nodes (equivalently, maximises the impurity reduction relative to the parent), and recurses on each child.
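The search described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not sklearn's implementation: the helper names `gini` and `best_split`, the midpoint thresholds, and the toy data are all assumptions made for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty or pure list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels, impurity=gini):
    """Try every feature and every midpoint threshold.

    Returns (weighted_child_impurity, feature_index, threshold) for the best
    split, or (parent_impurity, None, None) if no split improves on the parent.
    """
    n = len(labels)
    best = (impurity(labels), None, None)  # baseline: do not split
    for f in range(len(rows[0])):
        values = sorted(set(row[f] for row in rows))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            # Size-weighted average impurity of the two children.
            score = (len(left) * impurity(left) + len(right) * impurity(right)) / n
            if score < best[0]:
                best = (score, f, t)
    return best

# Toy data (hypothetical): feature 0 separates the classes, feature 1 does not.
rows = [(2.0, 7.0), (3.0, 1.0), (6.0, 8.0), (7.0, 2.0)]
labels = ["a", "a", "b", "b"]
score, feature, threshold = best_split(rows, labels)
print(score, feature, threshold)
```

On this toy data the search settles on feature 0 with a threshold of 4.5, which yields two pure children. Passing a different `impurity` function swaps the criterion without touching the search loop.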
Gini impurity for a node with K classes:
Gini = 1 - Σ p_i²
Maximum is 1 - 1/K; pure node = 0.
Entropy (Shannon, base-2):
H = -Σ p_i log₂(p_i)
Maximum is log₂(K); pure node = 0.
Information gain is the parent node's entropy minus the size-weighted average entropy of its child nodes after a split.
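As a worked check of the formulas above, the following sketch computes both impurities and the resulting reduction for one made-up split. The function names and the class counts (a 10/10 parent split into 8/2 and 2/8 children) are assumptions chosen for illustration.

```python
import math

def gini(probs):
    """Gini impurity: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in probs)

def entropy(probs):
    """Shannon entropy in bits; 0 * log2(0) is treated as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def proportions(counts):
    """Convert per-class counts into class probabilities."""
    total = sum(counts)
    return [c / total for c in counts]

def impurity_reduction(parent_counts, children_counts, impurity):
    """Parent impurity minus the size-weighted average child impurity."""
    n = sum(parent_counts)
    weighted = sum(
        sum(child) / n * impurity(proportions(child)) for child in children_counts
    )
    return impurity(proportions(parent_counts)) - weighted

# Hypothetical binary split: 10 samples of each class in the parent.
parent = [10, 10]
children = [[8, 2], [2, 8]]

info_gain = impurity_reduction(parent, children, entropy)  # about 0.278 bits
gini_gain = impurity_reduction(parent, children, gini)     # about 0.18
print(round(info_gain, 3), round(gini_gain, 3))
```

The two numbers are on different scales, so they are not comparable to each other; what matters is how each criterion ranks competing splits.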
from sklearn.tree import DecisionTreeClassifier
# Gini (sklearn default)
tree_gini = DecisionTreeClassifier(criterion="gini", max_depth=4)
# Entropy / information gain
tree_entropy = DecisionTreeClassifier(criterion="entropy", max_depth=4)
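Why they almost always agree: both are concave functions of the class probabilities, are zero for a pure node, and are maximised at equal proportions. For binary problems, once entropy is halved so that both curves peak at 0.5, they coincide at p = 0, 0.5 and 1 and differ most at skewed proportions (around p ≈ 0.1 or 0.9), where entropy assigns relatively more impurity. Empirical comparisons typically report accuracy differences of no more than about 1–2% on most datasets.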
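The shape comparison for the binary case can be checked numerically. The sketch below evaluates both curves on a grid and looks for where halved entropy and Gini differ most; the grid resolution and the choice to halve entropy (so both peak at 0.5) are assumptions made for the comparison.

```python
import math

def gini(p):
    """Binary Gini impurity: 1 - p^2 - (1 - p)^2 = 2p(1 - p)."""
    return 2 * p * (1 - p)

def entropy(p):
    """Binary Shannon entropy in bits, with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

grid = [i / 100 for i in range(101)]

# Halve entropy so both curves share the same maximum of 0.5 at p = 0.5.
gap = [entropy(p) / 2 - gini(p) for p in grid]

# The curves are symmetric about 0.5, so searching p <= 0.5 is enough.
i_max = max(range(51), key=lambda i: gap[i])
p_max_gap = grid[i_max]

print(f"gap at p=0.5: {gap[50]:.4f}")
print(f"largest gap near p={p_max_gap:.2f} (and, by symmetry, p={1 - p_max_gap:.2f})")
```

The gap is never negative, vanishes at p = 0.5, and peaks at a skewed proportion, which is why the two criteria only disagree on splits that trade off fairly unbalanced child nodes.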
When it matters: entropy is slightly more expensive to evaluate (a logarithm per class instead of a square) and marginally more sensitive to rare classes. Gini is the sklearn default and the cheaper of the two.