Decision Trees: Entropy, Gini & Info Gain
A decision tree splits data to drive down impurity. Entropy and Gini measure that impurity; information gain picks the split — a recurring GATE DA NAT.
What you'll learn
- Entropy H = −Σ pᵢ·log2(pᵢ) and Gini = 1 − Σ pᵢ² as impurity measures
- Information gain = parent impurity minus the weighted child impurity
- Computing information gain for one split by hand, step by step
- Why fully grown trees overfit, and how pruning or depth limits help
Before you start
A decision tree asks a sequence of yes/no questions — “is this feature above that threshold?” — and each answer sends a sample down a branch. The art is which question to ask: at every node the tree picks the split that leaves its two children as pure as possible (each child dominated by one class). “Pure” needs a number, and that number is an impurity measure. This same split-on-best-gain idea powers the random forests and gradient-boosted trees that still win most tabular-data problems in industry — so the arithmetic here is the engine inside tools you will actually ship.
Measuring impurity
For a node where class i holds a fraction pᵢ of the samples, two measures
dominate GATE:
- Entropy
H = −Σ pᵢ·log2(pᵢ)— measured in bits. A pure node (p = 1) hasH = 0; a 50/50 binary node hasH = 1(maximal disorder). - Gini
G = 1 − Σ pᵢ²— a pure node hasG = 0; a 50/50 node hasG = 0.5. It is the sklearn default, being slightly cheaper than a logarithm.
A split is judged by how much it cuts impurity:
Information gain = H(parent) − Σ (weighted) H(child), where each child is weighted by the fraction of samples it receives.
Higher gain means a better split. Below, add axis-aligned cuts and watch each one carve a region, recolour by majority class, and drop the Gini impurity — the greedy best cut is hinted in green.
How GATE asks this
The bread-and-butter version is a NAT: you are handed a parent node and one proposed split into two children (with class counts), and asked for the information gain — or just the entropy or Gini of a single node. Numbers are kept clean (3/1 splits, 9/1 splits, 50/50 nodes) so the arithmetic is doable in a minute. Conceptual MCQ/MSQ questions probe the rest of the ceiling: a pure node has impurity 0, higher gain is the better split, and a fully grown tree overfits (high variance).
Worked example — information gain of a split
A parent node has 10 positive and 10 negative samples. A candidate split sends them to child A =
[9 pos, 1 neg]and child B =[1 pos, 9 neg], each holding 10 samples. What is the information gain?
Step 1 — parent entropy. The parent is 10/20 and 10/20, a perfect 50/50:
H(parent) = −0.5·log2(0.5) − 0.5·log2(0.5)
= −0.5·(−1) − 0.5·(−1) = 1.0
Step 2 — child entropy. Child A is p = 0.9 and 0.1. Using
log2(0.9) ≈ −0.152 and log2(0.1) ≈ −3.322:
H(A) = −0.9·log2(0.9) − 0.1·log2(0.1)
= −0.9·(−0.152) − 0.1·(−3.322)
= 0.137 + 0.332 = 0.469
Child B is the mirror image (p = 0.1, 0.9), so H(B) = 0.469 as well.
Step 3 — weighted child entropy. Each child holds half the samples
(weight 10/20 = 0.5):
Weighted H = 0.5·0.469 + 0.5·0.469 = 0.469
Step 4 — information gain.
IG = H(parent) − Weighted H = 1.0 − 0.469 = 0.531
So this split buys 0.531 bits of information — a large gain, because it turns a useless 50/50 node into two nearly pure children.
Quick check
Quick check
Practice this in an interview
All questionsBoth measure node impurity but differ in computation and sensitivity. Gini is faster to compute and slightly favors larger partitions, while entropy (information gain) is more sensitive to class probability changes near 0.5. In practice the splits they produce are nearly identical.
Information gain measures how much a split reduces uncertainty (entropy) in the target variable. It is the difference between the parent node's entropy and the weighted average entropy of the child nodes. The split that maximises information gain is selected at each node.
At each node the algorithm iterates over every feature and every candidate threshold, scores each candidate split by the weighted impurity of the two child nodes, and selects the pair that gives the largest impurity reduction. It then recurses on each child until a stopping criterion is met.
Impurity-based importance (mean decrease in impurity) is systematically biased toward high-cardinality and continuous features because they offer more candidate splits. Permutation importance and SHAP values are less biased alternatives that measure actual predictive contribution on held-out data.