What do skewness and kurtosis measure, and what are their practical implications?

Skewness measures the asymmetry of a distribution's tails — positive skew means a longer right tail, negative skew a longer left tail. Kurtosis measures the heaviness of the tails relative to a normal distribution; excess kurtosis above zero indicates more probability mass in the tails and peak than a Gaussian, which matters for risk and outlier frequency.

Why use cross-entropy loss instead of MSE for classification?

MSE treats class probabilities as continuous values and produces tiny, saturating gradients when a sigmoid output is near 0 or 1, stalling learning. Cross-entropy is the proper log-likelihood loss for categorical distributions; it keeps gradients large and informative even when the network is very wrong, and its minimum aligns with the true class probabilities.

Why does training loss keep falling while validation loss rises?

This divergence is the signature of overfitting: the model has enough capacity to memorise training-set specifics — noise, label errors, dataset-specific patterns — that do not generalise. Training loss measures fit to what has already been seen; validation loss measures generalisation to held-out data. As the model memorises rather than learns structure, it scores better on training data and worse on everything else.

What is log loss and why does it penalise confident wrong predictions more than uncertain ones?

Log loss (cross-entropy loss) measures how well a model's predicted probabilities match the true labels: it is the negative log-likelihood of the correct class. It penalises confident wrong predictions severely because -log(p) grows without bound as the probability p assigned to the correct class approaches zero. In a binary problem, predicting 0.99 for the wrong class leaves 0.01 for the correct one and costs -ln(0.01) ≈ 4.61, roughly 5x the -ln(0.4) ≈ 0.92 incurred by predicting 0.6 for the wrong class, and the gap widens without limit as the misplaced confidence approaches 1. A perfect model achieves 0; a binary classifier that always predicts 0.5 achieves ln(2) ≈ 0.693.
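
The penalty for a single prediction can be checked directly. The snippet below is a minimal sketch for the binary case, using natural logs; the function name `log_loss` and the three example probabilities are illustrative choices, not part of any particular library.

```python
import math

def log_loss(p_correct):
    """Negative log-likelihood (in nats) of the probability given to the correct class."""
    return -math.log(p_correct)

# 0.99 on the wrong class leaves 0.01 for the correct one.
confident_wrong = log_loss(0.01)
# 0.60 on the wrong class leaves 0.40 for the correct one.
unsure_wrong = log_loss(0.40)
# Always predicting 0.5 in a binary problem.
coin_flip = log_loss(0.50)

print(f"confident and wrong: {confident_wrong:.3f}")
print(f"unsure and wrong:    {unsure_wrong:.3f}")
print(f"always 0.5:          {coin_flip:.3f}")
print(f"ratio:               {confident_wrong / unsure_wrong:.2f}")
```

The ratio keeps growing as the probability left for the correct class shrinks, which is the unbounded penalty in action.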

KL Divergence — Math for ML

The last lesson ended on a remainder we refused to throw away. Cross-entropy always cost at least the entropy — H(p,q) ≥ H(p) — and the gap, the bits wasted by encoding the truth p with the wrong model q, we named KL divergence and promptly shelved. Here it is, off the shelf. It is the single most common way machine learning measures how far one distribution sits from another, and once you hold it, cross-entropy loss, VAEs, distillation, and RLHF all turn out to be the same idea seen from different angles.

Make it concrete first. A language model was trained to predict the next token. On one test sentence it assigned probability 0.9 to the correct token and spread the remaining 0.1 across 49,999 other tokens. A second model assigned 0.5 to the correct token and 0.5 across the rest. Which model is further from the true distribution — where the correct token has probability 1.0 and every other token 0.0? Cross-entropy loss answers that numerically; KL divergence is the quantity hiding inside it that explains why.
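
For a one-hot target, the cross-entropy reduces to minus the log of the probability the model gave to the correct token, so the two models can be compared in a few lines. This is a small sketch under that assumption; the helper name `one_hot_cross_entropy_bits` is hypothetical.

```python
import math

def one_hot_cross_entropy_bits(p_correct):
    """Cross-entropy in bits when the true distribution puts all mass on one token."""
    return -math.log2(p_correct)

loss_a = one_hot_cross_entropy_bits(0.9)  # first model: 0.9 on the correct token
loss_b = one_hot_cross_entropy_bits(0.5)  # second model: 0.5 on the correct token

print(f"model A: {loss_a:.3f} bits")
print(f"model B: {loss_b:.3f} bits")
```

Note that how each model spreads its leftover probability over the other 49,999 tokens does not enter the calculation: with a one-hot target, only the mass on the correct token counts.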

The core formula

Let P be the true distribution (what the data actually is) and Q the model distribution (what your model guesses). KL divergence from Q to P — written D(P||Q), read “KL of P from Q” — is:

D(P||Q) = sum over all x of  P(x) * log2( P(x) / Q(x) )

This lesson uses log base 2, so the unit is bits. (Some texts use the natural log; the unit is then nats — same formula, same properties, only the scale changes.)
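
To see that the choice of base only rescales the result, here is a short sketch that computes the same divergence in bits and in nats. The distributions `P_demo` and `Q_demo` are arbitrary illustrative values, and the `log` parameter is just a convenience for this example.

```python
import math

def kl(p, q, log=math.log2):
    """Sum of p(x) * log(p(x) / q(x)); terms with p(x) = 0 contribute nothing."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P_demo = [0.2, 0.8]  # illustrative "true" distribution
Q_demo = [0.5, 0.5]  # illustrative model distribution

kl_bits = kl(P_demo, Q_demo, log=math.log2)  # base 2 -> bits
kl_nats = kl(P_demo, Q_demo, log=math.log)   # natural log -> nats

print(f"{kl_bits:.4f} bits = {kl_nats:.4f} nats")
print(f"bits * ln(2) = {kl_bits * math.log(2):.4f}")
```

Multiplying a value in bits by ln(2) gives the value in nats, so any result in this lesson can be converted after the fact.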

What one term means

Pick a single outcome x. The fraction P(x) / Q(x) measures how wrong your model is there: if Q says x has probability 0.1 but the truth P says 0.5, the ratio is 5 — Q underestimated this outcome fivefold, so it will be surprised when it shows up. log2(5) turns that surprise ratio into bits. Multiplying by P(x) weights it by how often the outcome actually occurs. The sum collects these weighted surprises across every outcome.

In plain words: D(P||Q) is the average extra bits you pay per observation because you coded the data with Q instead of P. It is exactly the wasted bits the entropy lesson pointed at.

Three essential properties

1. Always non-negative: D(P||Q) ≥ 0 for any two distributions, so you can never save bits with the wrong code. (Proof sketch: ln(x) ≤ x − 1 for all x > 0; apply it to each ratio Q(x)/P(x), weight by P(x), and the sum gives −D(P||Q) ≤ 0. Jensen’s inequality on the concave log yields the same bound.)

2. Zero if and only if equal: D(P||Q) = 0 exactly when P(x) = Q(x) for every x. Match the truth perfectly and there is no extra surprise left to pay.

3. Asymmetric: in general D(P||Q) ≠ D(Q||P). This is one reason KL is called a divergence rather than a distance: a true distance (a metric), like Euclidean distance, is symmetric and obeys the triangle inequality, and KL does neither.
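
The non-negativity argument can be written out in full. The derivation below is a sketch using the natural log (changing the base only multiplies by a positive constant) and summing over the outcomes where P(x) > 0, which is an assumption made here to keep every ratio well defined.

```latex
\begin{aligned}
-D(P\|Q) &= \sum_{x} P(x)\,\ln\frac{Q(x)}{P(x)} \\
         &\le \sum_{x} P(x)\left(\frac{Q(x)}{P(x)} - 1\right)
            && \text{since } \ln t \le t - 1 \\
         &= \sum_{x} Q(x) - \sum_{x} P(x) \\
         &\le 1 - 1 = 0 .
\end{aligned}
```

Equality in ln t ≤ t − 1 holds only at t = 1, so the bound is tight only when Q(x) = P(x) on every outcome P can produce, which is property 2.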

Worked example: two coins

Let P = [0.5, 0.5] (a fair coin) and Q = [0.9, 0.1] (a biased model that thinks heads is nine times more likely than tails).

Computing D(P||Q):

D(P||Q) = 0.5 * log2(0.5 / 0.9)  +  0.5 * log2(0.5 / 0.1)
        = 0.5 * log2(0.5556)       +  0.5 * log2(5.0)
        = 0.5 * (-0.8480)          +  0.5 * (2.3219)
        = -0.4240 + 1.1610
        = 0.737 bits

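When the fair coin lands tails (probability 0.5 under P), the biased model assigns only 0.1: a ratio of 5, which costs log2(5) ≈ 2.32 extra bits each time tails occurs and 1.161 bits once weighted by P(tails) = 0.5. That tails term dominates. On heads, Q assigns more probability than P does, so that term is negative (−0.424) and partially cancels it, leaving a net penalty of 0.737 bits per observation.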

Computing D(Q||P):

D(Q||P) = 0.9 * log2(0.9 / 0.5)  +  0.1 * log2(0.1 / 0.5)
        = 0.9 * log2(1.8)          +  0.1 * log2(0.2)
        = 0.9 * (0.8480)           +  0.1 * (-2.3219)
        = 0.7632 + (-0.2322)
        = 0.531 bits

Two directions, two different numbers — 0.737 bits versus 0.531 bits. That gap is the asymmetry, made arithmetic.

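import math

P = [0.5, 0.5]  # true distribution: fair coin
Q = [0.9, 0.1]  # model distribution: biased coin

def kl(p, q):
    """KL divergence D(p||q) in bits.

    Terms with p(x) = 0 contribute 0; assumes q(x) > 0 wherever p(x) > 0.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

dpq = kl(P, Q)  # cost of coding P-data with Q
dqp = kl(Q, P)  # cost of coding Q-data with P

print(f"D(P||Q) = {dpq:.4f} bits  (P=fair, Q=biased)")
print(f"D(Q||P) = {dqp:.4f} bits  (Q=biased, P=fair)")
print(f"Symmetric? {abs(dpq - dqp) < 1e-9}")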

D(P||Q) = 0.7370 bits  (P=fair, Q=biased)
D(Q||P) = 0.5310 bits  (Q=biased, P=fair)
Symmetric? False

The hand calculation and the code agree to the bit, and Symmetric? False is the whole point in one line: order matters.

The distribution diagram

The figure shows P and Q side by side, with the per-outcome surprise contributions shaded for the D(P||Q) direction.

Heads: Q over-estimates (Q=0.9 vs P=0.5), contributing −0.424 bits. Tails: Q severely under-estimates (Q=0.1 vs P=0.5), contributing +1.161 bits. Net = 0.737 bits.

The cross-entropy identity

Now close the loop with the previous lesson. The entropy of P is the average bits needed with a perfect code, and the cross-entropy is the average bits when you code P-data with Q-codes:

H(P)    = - sum over x of  P(x) * log2( P(x) )
H(P, Q) = - sum over x of  P(x) * log2( Q(x) )

Subtract and KL falls straight out:

H(P, Q)  =  H(P)  +  D(P||Q)

This identity is why minimizing cross-entropy loss is minimizing the KL divergence to the true label distribution: H(P) depends only on the data, not on the model, so any change in Q that lowers H(P, Q) lowers D(P||Q) by exactly the same amount. When the labels are one-hot (one class gets probability 1, the rest 0), H(P) is zero and H(P, Q) = D(P||Q) exactly. Every gradient step that shrinks cross-entropy shrinks the KL gap between your model and the truth. The loss you have been minimizing all along was a divergence in disguise.
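
The identity is easy to confirm numerically. The following sketch reuses the two-coin distributions from the worked example and adds a one-hot case; the helper names `entropy`, `cross_entropy`, and `kl` are defined here for illustration only.

```python
import math

def entropy(p):
    """H(p) in bits; outcomes with p(x) = 0 contribute nothing."""
    return sum(-pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) in bits: average code length when p-data is coded with q."""
    return sum(-pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """D(p||q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.5]
Q = [0.9, 0.1]
h, hpq, d = entropy(P), cross_entropy(P, Q), kl(P, Q)
print(f"H(P) = {h:.4f}, H(P,Q) = {hpq:.4f}, H(P) + D(P||Q) = {h + d:.4f}")

# One-hot labels: entropy is zero, so cross-entropy and KL coincide.
one_hot = [1.0, 0.0]
print(f"H(one_hot) = {entropy(one_hot):.4f}")
print(f"H(one_hot,Q) = {cross_entropy(one_hot, Q):.4f}, D(one_hot||Q) = {kl(one_hot, Q):.4f}")
```

Because H(P) is fixed by the data, the two quantities differ by a constant during training, and by nothing at all when the labels are one-hot.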

Where KL shows up in practice

Variational Autoencoders (VAEs). The loss has a reconstruction term plus a KL term D(q(z|x) || p(z)) — the encoder’s approximate posterior against a standard-Gaussian prior. The KL term penalizes drift from the prior, keeping the latent space smooth enough to sample.
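
When the approximate posterior is a diagonal Gaussian and the prior is a standard Gaussian, this KL term has a well-known closed form per latent dimension. The sketch below assumes exactly that setup and works in nats rather than bits; the function name is hypothetical and the inputs are illustrative.

```python
import math

def kl_gaussian_to_standard_normal(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ) in nats, for a single latent dimension."""
    return 0.5 * (mu ** 2 + sigma ** 2 - 1.0 - math.log(sigma ** 2))

# Encoder output that matches the prior exactly: no penalty.
print(kl_gaussian_to_standard_normal(0.0, 1.0))
# Mean shifted away from the prior: penalty grows with mu^2.
print(kl_gaussian_to_standard_normal(2.0, 1.0))
# Variance shrunk below the prior: also penalized.
print(kl_gaussian_to_standard_normal(0.0, 0.5))
```

For a multi-dimensional latent with independent dimensions, the per-dimension values are summed.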

Knowledge distillation. A small student is trained to match a large teacher’s soft outputs. Minimizing forward KL (P = teacher, Q = student) pushes the student to cover all the modes the teacher weights — including the small probabilities on “almost-right” answers that carry generalization signal.

RLHF and PPO. When fine-tuning a language model on human feedback, a KL penalty D(policy || reference), scaled by a coefficient, is subtracted from the reward to stop the policy drifting too far from the base model. Without it, the model tends to collapse to a narrow set of reward-hacking outputs.
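
One common way to apply such a penalty is to subtract a scaled KL term from the raw reward. The toy sketch below assumes that form and a three-token vocabulary; every number in it (`reference`, `policy`, `raw_reward`, `beta`) is hypothetical, and real implementations typically estimate the KL from sampled tokens rather than full distributions.

```python
import math

def kl_bits(p, q):
    """D(p||q) in bits; terms with p(x) = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

reference = [0.70, 0.20, 0.10]  # hypothetical base-model next-token probabilities
policy = [0.40, 0.10, 0.50]     # hypothetical fine-tuned policy probabilities
raw_reward = 1.0                # hypothetical reward-model score
beta = 0.1                      # hypothetical penalty coefficient

penalty = kl_bits(policy, reference)          # D(policy || reference)
shaped_reward = raw_reward - beta * penalty   # reward after the KL leash

print(f"KL penalty:    {penalty:.4f} bits")
print(f"shaped reward: {shaped_reward:.4f}")
```

A larger `beta` keeps the policy closer to the reference; a policy identical to the reference pays no penalty at all.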

The asymmetry matters in practice

The two directions of KL are two different personalities, and choosing one is a modeling decision:

Forward KL (D(P||Q), P true, Q model): wherever P puts probability, Q is forced to put some too — or the log ratio blows up. Q tries to cover all modes of P. Call it mass-covering.
Reverse KL (D(Q||P), Q model, P true): wherever Q puts probability, P must have some — an infinite penalty for mass where P is zero. So Q picks one mode and commits. Call it mode-seeking.

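The VAE’s KL term D(q(z|x) || p(z)) puts the model’s own distribution first, so it has the reverse form, and variational inference generally uses reverse KL because the expectation in D(Q||P) is taken under Q, which you can sample from, even when P is an intractable posterior. Maximum-likelihood training with cross-entropy, by contrast, minimizes forward KL. The choice is not cosmetic: it changes what the model learns to approximate.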
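
The mass-covering and mode-seeking behaviours described above can be seen in a toy experiment: fit a single bump to a two-mode target by brute-force search, once per KL direction. This is a sketch under simplifying assumptions (21 discrete outcomes, a Gaussian-shaped model family, an integer grid of means and widths); the names `bump`, `XS`, and `candidates` are invented for the example.

```python
import math

XS = range(21)  # discrete outcomes 0..20

def bump(mean, std):
    """Gaussian-shaped weights over XS, normalized to sum to 1."""
    w = [math.exp(-((x - mean) ** 2) / (2 * std ** 2)) for x in XS]
    total = sum(w)
    return [wi / total for wi in w]

def kl(p, q):
    """D(p||q) in bits; terms with p(x) = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Target P: two equally weighted narrow modes at 5 and 15.
P = [0.5 * a + 0.5 * b for a, b in zip(bump(5, 1), bump(15, 1))]

# Model family Q: one bump with an integer mean and width.
candidates = [(m, s) for m in XS for s in range(1, 7)]

forward_best = min(candidates, key=lambda ms: kl(P, bump(*ms)))  # minimizes D(P||Q)
reverse_best = min(candidates, key=lambda ms: kl(bump(*ms), P))  # minimizes D(Q||P)

print("forward KL picks (mean, std):", forward_best)
print("reverse KL picks (mean, std):", reverse_best)
```

Under forward KL the best single bump sits between the two modes and is wide enough to put mass on both; under reverse KL it is narrow and sits on one mode, ignoring the other.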

In one breath

KL divergence D(P||Q) = Σ P(x) log₂(P(x)/Q(x)) is the average extra bits you pay for coding true-distribution P with model Q — exactly the wasted bits left over when cross-entropy exceeds entropy. Three properties: it is ≥ 0 always, zero iff P = Q, and asymmetric (D(P||Q) ≠ D(Q||P), so it is a divergence, not a distance — the two-coin demo: 0.737 vs 0.531 bits). The identity H(P,Q) = H(P) + D(P||Q) means minimizing cross-entropy loss is minimizing KL to the true labels (and equals it exactly for one-hot labels). Its asymmetry has teeth: forward KL is mass-covering, reverse KL is mode-seeking — which is why it shows up as the VAE regularizer, the distillation objective, the RLHF leash, and the thing variational inference quietly optimizes.

Practice

Quick check

Q1. P = [0.5, 0.5], Q = [0.9, 0.1]. What is D(P||Q) rounded to two decimal places, in bits?

Q2. A classifier's cross-entropy loss on a one-hot label distribution decreases from 1.2 to 0.4. What happened to D(true labels || model output)?

Q3. A new generative model uses reverse KL, D(Q||P), where Q is the model and P is the data distribution. Compared to a model trained with forward KL, you expect the reverse-KL model to produce outputs that are:

A question to carry forward

Stand back and look at what this whole chapter has been doing. Covariance, correlation, mutual information, now KL divergence — every one is a number for a relationship, a way to say how much one quantity tells you about another. We have learned to trust those numbers. The last lesson of the chapter is here to shake that trust, hard.

Here is the unsettling fact. A relationship — Treatment A beats Treatment B, a feature predicts churn — can hold in every single subgroup of your data and then reverse the moment you pool the groups together. Same rows, same arithmetic, opposite conclusion. No calculation is wrong. So what is Simpson’s paradox, what hidden “lurking variable” engineers the reversal, and — the question no formula can answer for you — when the subgroups and the aggregate disagree, which number should you actually act on?
