What is the difference between Gini impurity and entropy as splitting criteria in decision trees?

Both measure node impurity but differ in computation and sensitivity. Gini avoids the logarithm, so it is slightly cheaper to compute, while entropy (the basis of information gain) reacts more sharply when a class probability is close to 0 or 1. In practice the splits they produce are usually nearly identical.

What is information gain and how does it relate to entropy in a decision tree split?

Information gain measures how much a split reduces uncertainty (entropy) in the target variable. It is the difference between the parent node's entropy and the weighted average entropy of the child nodes. The split that maximises information gain is selected at each node.

Walk me through exactly how a decision tree chooses a split at each node.

At each node the algorithm iterates over every feature and every candidate threshold, scores each candidate split by the weighted impurity of the two child nodes, and selects the pair that gives the largest impurity reduction. It then recurses on each child until a stopping criterion is met.
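The search described above can be sketched in a few lines of Python. This is a minimal illustration rather than any library's implementation: the names `gini` and `best_split`, the choice of midpoints between consecutive distinct values as candidate thresholds, and the toy data are all assumptions made for the example.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(X, y):
    """Return (feature_index, threshold, impurity_reduction) for the best split.

    X is a list of rows (lists of numbers) and y the matching class labels.
    Candidate thresholds are midpoints between consecutive distinct values
    (an assumption of this sketch).
    """
    n = len(y)
    parent = gini(y)
    best = (None, None, 0.0)
    for j in range(len(X[0])):                      # every feature
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):      # every candidate threshold
            t = (lo + hi) / 2
            left = [label for row, label in zip(X, y) if row[j] <= t]
            right = [label for row, label in zip(X, y) if row[j] > t]
            weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
            reduction = parent - weighted
            if reduction > best[2]:                 # keep the largest reduction
                best = (j, t, reduction)
    return best


# Made-up toy data: feature 0 separates the classes, feature 1 does not.
X = [[1, 5], [2, 1], [3, 4], [6, 2], [7, 5], [8, 1]]
y = ["A", "A", "A", "B", "B", "B"]
feature, threshold, reduction = best_split(X, y)
print(feature, threshold, round(reduction, 3))
```

A full tree builder would call `best_split` again on the left and right subsets until a stopping rule (maximum depth, minimum samples, or zero reduction) is hit.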

What are the pitfalls of impurity-based feature importance in tree ensembles, and how do you get a more reliable estimate?

Impurity-based importance (mean decrease in impurity) is computed from the training data and is systematically biased toward high-cardinality and continuous features because they offer more candidate splits. Permutation importance, computed on held-out data, and SHAP values are less biased alternatives that measure how much each feature actually contributes to predictions.
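To make the permutation idea concrete, here is a small from-scratch sketch: shuffle one feature column at a time and record how much accuracy drops. The `predict` function is a hypothetical stand-in for a fitted tree, and the data are made up; in practice a library routine such as scikit-learn's `sklearn.inspection.permutation_importance` would normally be used instead.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows that predict() labels correctly."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy when a single feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j replaced by shuffled values.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical stand-in for a fitted tree: it only ever reads feature 0.
def predict(row):
    return 1 if row[0] > 0.45 else 0


# Made-up held-out data: feature 1 takes ten distinct values but carries no signal here.
X = [[i / 10, (i * 7 % 10) / 10] for i in range(10)]
y = [1 if row[0] > 0.45 else 0 for row in X]
imp = permutation_importance(predict, X, y)
print([round(v, 3) for v in imp])
```

Feature 1 scores exactly 0 because shuffling a column the model never reads cannot change any prediction, however many distinct values that column has.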

Decision Trees: Entropy, Gini & Info Gain — GATE DA

What you'll learn

Entropy H = −Σ pᵢ·log2(pᵢ) and Gini = 1 − Σ pᵢ² as impurity measures

Information gain = parent impurity minus the weighted child impurity

Computing information gain for one split by hand, step by step

Why fully grown trees overfit, and how pruning or depth limits help

The previous lesson left every classifier so far speaking in weight vectors: correct, but unreadable. Here is a model a person can follow with a finger. A decision tree asks a sequence of plain yes/no questions (“is this feature above that threshold?”), and each answer sends a sample down a branch until it lands at a leaf that names a class. Is income below ₹30,000? Then is debt above ₹10,000? Then refuse. No dot products, no margins; just questions, the way a human would reason.
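Read as code, a fitted tree is nothing more than nested conditionals. The sketch below writes out the loan example; only the "refuse" path comes from the text, while the function name `loan_decision` and the two "approve" leaves are assumptions added so that every branch ends in a class.

```python
def loan_decision(income, debt):
    """Toy tree for the loan example; thresholds are in rupees."""
    if income < 30_000:        # question 1: is income below 30,000?
        if debt > 10_000:      # question 2: is debt above 10,000?
            return "refuse"    # the leaf named in the lesson
        return "approve"       # hypothetical leaf, assumed for illustration
    return "approve"           # hypothetical leaf, assumed for illustration


print(loan_decision(income=25_000, debt=15_000))
```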

The whole art is which question to ask first, and the rest of the lesson is that one decision made precise. At every node the tree picks the split leaving its two children as pure as possible — each child dominated by a single class — because a pure child is a confident answer. But “pure” is a feeling until you put a number on it, and that number is an impurity measure. This same split-on-best-gain idea powers the random forests and gradient-boosted trees that still win most tabular-data contests, so the arithmetic here is the engine inside tools you will actually ship.

Measuring impurity

For a node where class i holds a fraction pᵢ of the samples, two measures dominate GATE:

Two ways to score disorder. Entropy uses log2; Gini uses squared probabilities — they agree closely.

Entropy H = −Σ pᵢ·log2(pᵢ), measured in bits. A pure node (p = 1) has H = 0; a 50/50 binary node has H = 1 (maximal disorder).
Gini G = 1 − Σ pᵢ². A pure node has G = 0; a 50/50 binary node has G = 0.5. It is the default criterion in scikit-learn's DecisionTreeClassifier and is slightly cheaper to compute because it needs no logarithm.
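Both formulas translate directly into code. The sketch below takes a list of class fractions and checks the landmark values quoted above; the function names `entropy_bits` and `gini_impurity` are chosen for this example only.

```python
from math import log2


def entropy_bits(probs):
    """Entropy in bits; -p*log2(p) is written as p*log2(1/p), and p = 0 terms are skipped."""
    return sum(p * log2(1 / p) for p in probs if p > 0)


def gini_impurity(probs):
    """Gini impurity: one minus the sum of squared class fractions."""
    return 1 - sum(p * p for p in probs)


for probs in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(probs, round(entropy_bits(probs), 3), round(gini_impurity(probs), 3))
```

Terms with p = 0 are skipped because p·log2(p) tends to 0 as p tends to 0, whereas `log2(0)` itself would raise an error.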

A split is then judged by how much it cuts impurity:

Information gain = H(parent) − Σₖ (nₖ / n)·H(childₖ), where nₖ is the number of samples that child k receives and n is the number of samples in the parent.

Higher gain means a better split. Below, add axis-aligned cuts and watch each one carve a region, recolour by majority class, and drop the Gini impurity — the greedy best cut is hinted in green.

[Interactive demo: Decision tree, split the space. Carve the plane into pure rectangles, one axis-aligned cut at a time. With no splits there is a single region: 1 leaf, total Gini 0.500, accuracy 50% (pure guessing). Each added cut is chosen to reduce Gini impurity, and the highest-gain cut is marked with a green dashed line. Deeper trees carve finer rectangles and can fit anything, which is also how they overfit.]

How GATE asks this

The bread-and-butter version is a NAT (numerical answer type) question: you are handed a parent node and one proposed split into two children (with class counts), and asked for the information gain, or just the entropy or Gini of a single node. Numbers are kept clean (3/1 splits, 9/1 splits, 50/50 nodes) so the arithmetic is doable in a minute. Conceptual MCQ/MSQ questions probe the rest of the topic: a pure node has impurity 0, higher gain means a better split, and a fully grown tree overfits (high variance).

Worked example — information gain of a split

A parent node has 10 positive and 10 negative samples. A candidate split sends them to child A = [9 pos, 1 neg] and child B = [1 pos, 9 neg], each holding 10 samples. What is the information gain?

Step 1 — parent entropy. The parent is a perfect 50/50:

H(parent) = −0.5·log2(0.5) − 0.5·log2(0.5)
          = −0.5·(−1) − 0.5·(−1) = 1.0

Step 2 — child entropy. Child A is p = 0.9 and 0.1. Using log2(0.9) ≈ −0.152 and log2(0.1) ≈ −3.322:

H(A) = −0.9·log2(0.9) − 0.1·log2(0.1)
     = −0.9·(−0.152) − 0.1·(−3.322)
     = 0.137 + 0.332 = 0.469

Child B is the mirror image (p = 0.1, 0.9), so H(B) = 0.469 as well.

Step 3 — weighted child entropy. Each child holds half the samples (weight 10/20 = 0.5):

Weighted H = 0.5·0.469 + 0.5·0.469 = 0.469

Step 4 — information gain.

IG = H(parent) − Weighted H = 1.0 − 0.469 = 0.531

So this split buys 0.531 bits — a large gain, exactly because it turns a useless 50/50 node into two nearly pure children. The same arithmetic in Python:

from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c/n) * log2(c/n) for c in counts if c)   # skip 0-count classes

h_parent = entropy([10, 10])
h_a      = entropy([9, 1])
h_b      = entropy([1, 9])
weighted = 0.5 * h_a + 0.5 * h_b
gain     = h_parent - weighted

print(f"H(parent) = {h_parent:.3f}")
print(f"H(child)  = {h_a:.3f}")
print(f"info gain = {gain:.3f}")

H(parent) = 1.000
H(child)  = 0.469
info gain = 0.531
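The script above hardcodes the 0.5 weights, which only works because both children happen to be the same size. A slightly more general sketch, with the hypothetical helper name `impurity_gain`, derives the weights from the class counts and accepts either impurity measure, so the same split can also be scored with Gini.

```python
from math import log2


def entropy(counts):
    """Entropy in bits from class counts (zero counts are skipped)."""
    n = sum(counts)
    return sum((c / n) * log2(n / c) for c in counts if c)


def gini(counts):
    """Gini impurity from class counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)


def impurity_gain(parent, children, impurity):
    """Parent impurity minus the sample-weighted impurity of the children."""
    n = sum(parent)
    weighted = sum(sum(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted


parent = [10, 10]
children = [[9, 1], [1, 9]]
print(f"entropy gain = {impurity_gain(parent, children, entropy):.3f}")
print(f"Gini gain    = {impurity_gain(parent, children, gini):.3f}")
```

For this split the entropy gain is 0.531 bits, as in the worked example, and the Gini gain is 0.5 − 0.18 = 0.320. The two numbers sit on different scales, so they should only be compared against other splits scored with the same measure.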

In one breath

A decision tree classifies by a readable cascade of yes/no questions, choosing at each node the split that most reduces impurity — measured by entropy H = −Σ pᵢ log2 pᵢ (0 for pure, 1 bit for a 50/50 binary node) or Gini G = 1 − Σ pᵢ² (0 for pure, 0.5 for 50/50) — and ranks splits by information gain = parent impurity minus the sample-weighted child impurity, always taking the largest; left to grow unchecked it drives training impurity to zero by memorising noise, so depth limits or pruning are what keep it from overfitting.

Practice

Quick check

Q1. Recall: Which statements about decision-tree impurity and splitting are TRUE? (select all that apply)

Q2. Recall: Entropy and Gini both measure node impurity. Which fact distinguishes them?

Q3. Trace: A node contains 3 positive and 1 negative sample. What is its entropy in bits? (numerical answer, 2 decimals)

Q4. Trace: What is the Gini impurity of a node that is 50% positive and 50% negative? (numerical answer, 2 decimals)

Q5. Trace: A parent node (entropy 1.0) of 8 samples splits into a PURE child of 4 and another PURE child of 4. What is the information gain? (numerical answer, 1 decimal)

Q6. Apply: A parent of 10 samples (entropy 1.0, a 5/5 split) splits into child A = [4 pos, 1 neg] and child B = [1 pos, 4 neg]. Each child has entropy ≈ 0.722. What is the information gain? (numerical answer, 3 decimals)

A question to carry forward

The tree was a detour — a readable, rule-based model that owes nothing to weights or geometry. But it came at a price: greedy axis-aligned cuts, a tendency to overfit, and no kinship at all with the brain that first inspired the word “learning.” So return, one last time, to the linear-score family — and ask the question that cracks the whole field open.

Every linear classifier so far was fitted: solved by a normal equation, or descended on a loss. None of them learned in the plain sense of watching its own mistakes and correcting. What is the simplest unit that does exactly that — sees a point it got wrong, nudges itself a little toward getting it right, and repeats, with no calculus and no probability anywhere? Here is the thread onward: that humble self-correcting unit is the literal ancestor of every neural network — what is its one update rule, and what is the one kind of problem it can never solve, no matter how long it runs?
