datarekha

Entropy & information theory

Entropy measures surprise — the floor on how many bits it takes to encode outcomes. From it flow cross-entropy (your classification loss), KL divergence, decision-tree splits, and mutual information. One idea, an enormous amount of ML.

8 min read · Intermediate · Math for ML · Lesson 28 of 30

What you'll learn

  • Self-information (surprise) = −log p, and entropy as average surprise
  • Why entropy is maximized by a uniform distribution and zero at certainty
  • Cross-entropy: the cost of encoding truth p with a wrong model q — your loss function
  • KL divergence and mutual information, built from the same pieces
  • Where it appears — cross-entropy loss, decision-tree information gain, VAEs, feature selection

How surprised should you be by an outcome? A coin landing heads — barely. Winning the lottery — enormously. Information theory turns that intuition into numbers, and those numbers quietly define cross-entropy loss, decision trees, and how we compare distributions.

Surprise, then entropy

The self-information (surprise) of an outcome with probability p is:

surprise = −log p

Rare events (small p) are surprising (large −log p); certain events (p = 1) carry zero surprise. Entropy is the average surprise over a distribution:

H(p) = −Σᵢ pᵢ log pᵢ

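With log₂ it’s measured in bits: the average number of well-chosen yes/no questions needed to pin down an outcome, and the floor on the average bits per outcome that any lossless code can achieve. With the natural log the unit is nats instead.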

Entropy is maximized by the uniform distribution (everything equally likely → most uncertainty) and zero when one outcome is certain (no surprise possible).
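To make the two formulas above concrete, here is a minimal sketch in plain Python. The function names `surprise` and `entropy` and the three example distributions are illustrative choices, not part of the lesson; the log is base 2, so results are in bits.

```python
import math

def surprise(p):
    """Self-information -log2(p) of an outcome with probability p, in bits."""
    return -math.log2(p)

def entropy(probs):
    """Average surprise over a distribution, in bits.

    Outcomes with probability 0 are skipped (they contribute nothing).
    """
    return sum(p * surprise(p) for p in probs if p > 0)

# Illustrative distributions over 4 outcomes.
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]
certain = [1.0, 0.0, 0.0, 0.0]

print(entropy(uniform))  # 2 bits: the maximum for 4 outcomes
print(entropy(skewed))   # about 1.357 bits: less uncertainty
print(entropy(certain))  # 0 bits: no surprise possible
```

The uniform case hits log₂ 4 = 2 bits, and any move away from uniform lowers the value.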

Cross-entropy: encoding truth with the wrong model

If the true distribution is p but you encode using your model q, your average cost is the cross-entropy:

H(p, q) = −Σᵢ pᵢ log qᵢ

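For a fixed p, it’s minimized over q exactly when q = p, where it equals the entropy H(p). Sound familiar? When p is the true one-hot label and q is your model’s predicted probabilities, the sum collapses to −log of the probability assigned to the correct class, and H(p, q) is the cross-entropy loss: the same object the MLE lesson derived from likelihood, now from an information angle.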
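As a small illustration of the one-hot case, the sketch below computes H(p, q) directly. The helper `cross_entropy` and the example predictions are assumptions made for this example; natural log is used, as most ML libraries do, so values are in nats.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log(q_i), in nats.

    Terms with p_i = 0 are skipped, so q_i may be 0 there.
    """
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

label = [0.0, 1.0, 0.0]       # one-hot truth: class 1 (illustrative)
good = [0.05, 0.90, 0.05]     # confident and right
bad = [0.70, 0.20, 0.10]      # confident and wrong

loss_good = cross_entropy(label, good)  # same as -log(0.90)
loss_bad = cross_entropy(label, bad)    # same as -log(0.20)
print(loss_good, loss_bad)
```

With a one-hot label only the correct class's term survives, which is why the loss depends solely on the probability the model gave the right answer.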

KL divergence & mutual information

  • KL divergence D(p‖q) = H(p,q) − H(p) ≥ 0 is the extra bits you pay for using the wrong model — covered in depth in its own lesson.
  • Mutual information I(X;Y) is how much knowing X reduces your uncertainty about Y. Zero iff they’re independent; the basis of information-gain feature selection and a lot of representation learning.

Cross-entropy is always at least the entropy; the gap is the KL divergence — literally the bits wasted by a wrong model.
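The identity D(p‖q) = H(p,q) − H(p) and the independence property of mutual information can be checked numerically. This is a sketch under stated assumptions: the helper names, the distributions, and the 2×2 joint tables are made up for illustration, and everything is in bits.

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D(p || q) in bits: the extra cost of coding p with a code built for q."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mutual_information(joint):
    """I(X;Y) in bits from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]          # marginal of X
    py = [sum(col) for col in zip(*joint)]    # marginal of Y
    return sum(
        pxy * math.log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

p = [0.5, 0.25, 0.25]
q = [1 / 3, 1 / 3, 1 / 3]
gap = cross_entropy(p, q) - entropy(p)
print(kl_divergence(p, q), gap)  # the two values agree

independent = [[0.25, 0.25], [0.25, 0.25]]  # X tells you nothing about Y
copied = [[0.5, 0.0], [0.0, 0.5]]           # Y always equals X
print(mutual_information(independent))      # 0 bits
print(mutual_information(copied))           # 1 bit
```

Mutual information is itself a KL divergence, between the joint distribution and the product of its marginals, which is why it is zero exactly under independence.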

Where this lives in ML

  • Cross-entropy loss is the default classification loss, from logistic regression to deep networks.
  • Decision trees split on the feature that maximizes information gain (the drop in entropy).
  • KL divergence regularizes VAEs and shapes RLHF / policy updates.
  • Mutual information drives feature selection and self-supervised objectives.
  • Perplexity, the language-model metric, is just exponentiated cross-entropy: 2^(cross-entropy) when the cross-entropy is in bits, e^(cross-entropy) when it is in nats.
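The perplexity relationship in the list above can be sketched as follows. The function `perplexity` and the token probabilities are hypothetical; the input stands for the probability a language model assigned to each token that actually occurred.

```python
import math

def perplexity(token_probs):
    """Exponentiated average negative log-likelihood of the observed tokens."""
    nll_nats = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll_nats)

# A model that spreads its belief evenly over 4 candidates at every step.
probs = [0.25] * 8
ppl_from_nats = perplexity(probs)

# Same quantity via bits: the base of the exponent must match the log base.
ce_bits = -sum(math.log2(p) for p in probs) / len(probs)
ppl_from_bits = 2 ** ce_bits

print(ppl_from_nats, ppl_from_bits)  # both are 4 (up to float rounding)
```

A perplexity of 4 reads as "the model is as uncertain as if it were choosing uniformly among 4 tokens".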

Quick check

Q1. Which distribution over 4 outcomes has the highest entropy?
Q2. Cross-entropy H(p,q) is minimized when…
Q3. A decision tree chooses a split to maximize information gain. What does that mean?

Practice this in an interview

What is information gain and how does it relate to entropy in a decision tree split?

Information gain measures how much a split reduces uncertainty (entropy) in the target variable. It is the difference between the parent node's entropy and the weighted average entropy of the child nodes. The split that maximises information gain is selected at each node.
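The definition above translates almost line for line into code. This is a minimal sketch, not how any particular library implements it; the function names and the toy "yes"/"no" labels are assumptions for illustration.

```python
import math
from collections import Counter

def entropy_of_labels(labels):
    """Entropy (bits) of the empirical class distribution in a node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted = sum(len(ch) / n * entropy_of_labels(ch) for ch in children)
    return entropy_of_labels(parent) - weighted

parent = ["yes"] * 4 + ["no"] * 4

# A split that separates the classes perfectly.
perfect = [["yes"] * 4, ["no"] * 4]
# A split whose children are as mixed as the parent.
useless = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]

print(information_gain(parent, perfect))  # 1 bit: all uncertainty removed
print(information_gain(parent, useless))  # 0 bits: nothing learned
```

A tree learner would evaluate this quantity for each candidate split and keep the one with the largest value.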

What is the difference between Gini impurity and entropy as splitting criteria in decision trees?

Both measure node impurity but differ in computation and sensitivity. Gini is faster to compute because it avoids the logarithm, and it is often said to slightly favor larger partitions. Both criteria peak at class probability 0.5 for a binary node, but entropy rises more steeply near 0 and 1, so it reacts more strongly to small amounts of impurity. In practice the splits they produce are nearly identical.
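To see how the two criteria compare on a binary node, the sketch below evaluates both at a few class probabilities. The function names and sample points are illustrative assumptions; entropy is in bits.

```python
import math

def gini(p):
    """Gini impurity of a binary node with positive-class probability p."""
    return 1.0 - (p * p + (1.0 - p) * (1.0 - p))

def binary_entropy(p):
    """Entropy (bits) of a binary node with positive-class probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

for p in (0.0, 0.01, 0.1, 0.3, 0.5):
    print(f"p={p:.2f}  gini={gini(p):.3f}  entropy={binary_entropy(p):.3f}")
```

Both are 0 for a pure node and largest at p = 0.5 (0.5 for Gini, 1 bit for entropy); Gini needs only multiplications, while entropy needs a logarithm per class.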

Why use cross-entropy loss instead of MSE for classification?

MSE treats class probabilities as continuous values and produces tiny, saturating gradients when a sigmoid output is near 0 or 1, stalling learning. Cross-entropy is the proper log-likelihood loss for categorical distributions; it keeps gradients large and informative even when the network is very wrong, and its minimum aligns with the true class probabilities.
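The gradient argument can be checked on a single sigmoid output. This is a sketch assuming a squared-error loss of ½(a − y)² and binary cross-entropy, both differentiated with respect to the logit z; the function names and the chosen logit are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mse_grad_wrt_logit(z, y):
    """d/dz of 0.5 * (sigmoid(z) - y)**2: includes the sigmoid slope a * (1 - a)."""
    a = sigmoid(z)
    return (a - y) * a * (1.0 - a)

def ce_grad_wrt_logit(z, y):
    """d/dz of binary cross-entropy: the sigmoid slope cancels, leaving a - y."""
    return sigmoid(z) - y

# The true label is 1, but the network is confidently wrong (output near 0).
z, y = -8.0, 1.0
mse_g = mse_grad_wrt_logit(z, y)
ce_g = ce_grad_wrt_logit(z, y)
print(mse_g, ce_g)
```

Here the squared-error gradient is on the order of 1e-4 because the saturated sigmoid slope multiplies it, while the cross-entropy gradient is close to −1, so the badly wrong prediction still produces a strong correction.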

What is log loss and why does it penalise confident wrong predictions more than uncertain ones?

Log loss (cross-entropy loss) measures how well a model's predicted probabilities match the true labels: it is the negative log-likelihood of the correct class. It penalises confident wrong predictions severely because −log(p) grows without bound as the probability p assigned to the correct class approaches zero. Putting 0.99 on the wrong class (0.01 on the right one) costs −ln(0.01) ≈ 4.61, roughly 5x the −ln(0.4) ≈ 0.92 incurred by putting 0.6 on the wrong class, and the penalty keeps climbing as confidence in the wrong answer rises. A perfect model achieves 0; a binary classifier that always predicts 0.5 achieves ln(2) ≈ 0.693.
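A short sketch shows how the penalty grows as the probability given to the correct class shrinks. The function `log_loss` here is a hand-written illustration in nats, not a library call, and it does no clipping, so a predicted probability of exactly 0 for the true class would raise an error.

```python
import math

def log_loss(y_true, p_pred):
    """Mean binary log loss in nats; p_pred is the predicted probability of class 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Probability the model assigned to the class that actually occurred.
        p_correct = p if y == 1 else 1.0 - p
        total += -math.log(p_correct)
    return total / len(y_true)

# Loss on a single positive example as the model grows more wrong.
for p_correct in (0.9, 0.5, 0.1, 0.01, 0.001):
    print(p_correct, round(log_loss([1], [p_correct]), 3))

# Always predicting 0.5 gives ln(2) regardless of the labels.
coin_flip = log_loss([1, 0, 1, 0], [0.5] * 4)
print(coin_flip)
```

Each tenfold drop in the probability of the correct class adds the same amount, ln(10) ≈ 2.3 nats, to the loss.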
