Cross-entropy and KL divergence

evergreen#foundations#information-theory

Bar: compute entropy, cross-entropy, and KL by hand for a 3-outcome distribution — and explain in one sentence what each means.

Intuition first

Information is "surprise." A rare event carries more bits than a common one: an outcome with probability $p(x)$ carries $-\log p(x)$. Average that surprise over outcomes and you get the three quantities below.

Entropy $H(p)$: the irreducible average surprise of $p$; the shortest average code length any lossless code for $p$ can achieve.
Cross-entropy $H(p,q)$: the average surprise you actually pay when you believe $q$ but reality is $p$.
KL divergence $D_{KL}(p\|q)$: the extra bits you waste for believing the wrong model, $H(p,q)-H(p)$.
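
To make "average surprise" concrete, here is a minimal Python sketch. The helper names `surprisal_bits` and `average_surprise_bits`, and the coin example, are made up for this note; they are not from any library.

```python
import math

def surprisal_bits(prob: float) -> float:
    """Surprise of one outcome with probability `prob`, in bits."""
    return -math.log2(prob)

def average_surprise_bits(truth, belief):
    """Average surprise when outcomes follow `truth` but are scored with `belief`."""
    return sum(t * surprisal_bits(b) for t, b in zip(truth, belief) if t > 0)

fair_coin = [0.5, 0.5]        # reality p (illustrative)
biased_belief = [0.9, 0.1]    # mistaken model q (illustrative)

print(surprisal_bits(0.5))    # a coin flip carries 1 bit
print(surprisal_bits(1 / 8))  # a 1-in-8 event carries 3 bits

entropy = average_surprise_bits(fair_coin, fair_coin)            # H(p)
cross_entropy = average_surprise_bits(fair_coin, biased_belief)  # H(p, q)
wasted_bits = cross_entropy - entropy                            # D_KL(p || q)
print(entropy, cross_entropy, wasted_bits)
```

Scoring the fair coin with the biased belief costs more bits per flip than scoring it with the truth; that gap is the KL divergence.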

The math

$$H(p) = -\sum_x p(x)\log p(x)$$

$$H(p,q) = -\sum_x p(x)\log q(x)$$

$$D_{KL}(p\|q) = \sum_x p(x)\log\frac{p(x)}{q(x)} = H(p,q) - H(p)$$
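
The three definitions translate almost line for line into code. The following is a sketch using only the standard library; the function names and the `base` parameter are choices made here, and distributions are assumed to be equal-length lists of probabilities that each sum to 1.

```python
import math

def entropy(p, base=2.0):
    """H(p) = -sum_x p(x) log p(x); terms with p(x) = 0 are skipped."""
    return -sum(px * math.log(px, base) for px in p if px > 0)

def cross_entropy(p, q, base=2.0):
    """H(p, q) = -sum_x p(x) log q(x); inf if q(x) = 0 where p(x) > 0."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue
        if qx == 0:
            return math.inf
        total -= px * math.log(qx, base)
    return total

def kl_divergence(p, q, base=2.0):
    """D_KL(p || q) = sum_x p(x) log(p(x) / q(x))."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue
        if qx == 0:
            return math.inf
        total += px * math.log(px / qx, base)
    return total

p = [0.5, 0.5]
q = [0.75, 0.25]
# The two expressions for KL agree up to floating-point error.
print(kl_divergence(p, q), cross_entropy(p, q) - entropy(p))
```

Passing `base=math.e` gives the same quantities in nats instead of bits.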

Log base 2 → bits; natural log → nats. Key properties:

$D_{KL}\ge 0$, with equality iff $p=q$ (Gibbs' inequality). So $H(p,q)\ge H(p)$ always.
Asymmetric: $D_{KL}(p\|q)\ne D_{KL}(q\|p)$ in general. It is not a distance/metric.
Infinite if $q(x)=0$ for some $x$ where $p(x)>0$ (infinite surprise), which is why we smooth probabilities. Terms with $p(x)=0$ contribute nothing, by the convention $0\log 0=0$.
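
These properties are easy to poke at numerically. In the sketch below, `kl_bits` and `additive_smoothing` are helper names invented for this note, the distributions are arbitrary, and additive smoothing with `alpha=0.01` is just one common smoothing choice.

```python
import math

def kl_bits(p, q):
    """D_KL(p || q) in bits; inf when q(x) = 0 but p(x) > 0."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue  # 0 * log(0 / q) is taken to be 0
        if qx == 0:
            return math.inf
        total += px * math.log2(px / qx)
    return total

def additive_smoothing(q, alpha=0.01):
    """Add `alpha` to every entry, then renormalize so the result sums to 1."""
    total = sum(q) + alpha * len(q)
    return [(qx + alpha) / total for qx in q]

p = [0.6, 0.3, 0.1]
q = [0.8, 0.2, 0.0]  # q rules out the third outcome entirely

same = kl_bits(p, p)                          # equality case: exactly 0
blown_up = kl_bits(p, q)                      # inf: p allows what q forbids
reverse = kl_bits(q, p)                       # finite: p covers all of q's support
smoothed = kl_bits(p, additive_smoothing(q))  # finite again after smoothing

print(same, blown_up, reverse, smoothed)
```

The same pair gives an infinite divergence in one direction and a finite one in the other, which is the asymmetry in its most extreme form.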

Worked example (by hand, in bits)

Let $p=(\tfrac12,\tfrac14,\tfrac14)$ and $q=(\tfrac14,\tfrac14,\tfrac12)$.

$$H(p) = -\big(\tfrac12\log_2\tfrac12 + \tfrac14\log_2\tfrac14 + \tfrac14\log_2\tfrac14\big) = \tfrac12(1) + \tfrac14(2) + \tfrac14(2) = 1.5\ \text{bits}$$

$$H(p,q) = -\big(\tfrac12\log_2\tfrac14 + \tfrac14\log_2\tfrac14 + \tfrac14\log_2\tfrac12\big) = \tfrac12(2)+\tfrac14(2)+\tfrac14(1) = 1.75\ \text{bits}$$

$$D_{KL}(p\|q) = H(p,q)-H(p) = 1.75 - 1.5 = 0.25\ \text{bits}$$

Check directly: $\tfrac12\log_2\tfrac{0.5}{0.25} + \tfrac14\log_2\tfrac{0.25}{0.25} + \tfrac14\log_2\tfrac{0.25}{0.5} = \tfrac12(1) + \tfrac14(0) + \tfrac14(-1) = 0.25$ ✓
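
The hand computation can be confirmed with a few lines of Python. This is a sketch using only `math.log2`; the variable names are arbitrary. Every probability here is a power of two, so the floating-point results are exact.

```python
import math

p = [1 / 2, 1 / 4, 1 / 4]
q = [1 / 4, 1 / 4, 1 / 2]

h_p = -sum(px * math.log2(px) for px in p)                # entropy H(p)
h_pq = -sum(px * math.log2(qx) for px, qx in zip(p, q))   # cross-entropy H(p, q)
kl_from_difference = h_pq - h_p                           # H(p, q) - H(p)
kl_direct = sum(px * math.log2(px / qx) for px, qx in zip(p, q))

print(h_p, h_pq, kl_from_difference, kl_direct)  # 1.5 1.75 0.25 0.25
```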

Why this is everywhere in ML

Cross-entropy loss = maximum likelihood. With $p$ the empirical label distribution, $H(p,q)$ is the average negative log-likelihood of the data under the model $q$, so minimizing it over $q$ is MLE. Next-event prediction on a clickstream minimizes cross-entropy over the event vocabulary.
Perplexity $= 2^{H(p,q)}$ with $H$ in bits (equivalently $e^{H(p,q)}$ in nats): the effective branching factor, or "how many equally-likely choices the model is confused among." (More in Information theory basics.)
Proper scoring rule: log-loss is minimized in expectation only by the true probabilities → it rewards honesty and calibration. See Evaluation theory.
Forward vs reverse KL shapes behavior: minimizing $D_{KL}(p\|q)$ is mode-covering (q must put mass everywhere p does); $D_{KL}(q\|p)$ is mode-seeking. This distinction recurs in variational methods.
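
The link between cross-entropy, likelihood, and perplexity can be seen on a toy clickstream. Everything in this sketch is hypothetical: the event names, the eight-event log, and the context-free `model` probabilities are invented for illustration, and a real next-event model would condition on history.

```python
import math
from collections import Counter

# Hypothetical observed events and a hypothetical fixed model q over the vocabulary.
events = ["click", "click", "scroll", "click", "purchase", "scroll", "click", "click"]
model = {"click": 0.5, "scroll": 0.3, "purchase": 0.2}

# Average negative log-likelihood of the data under the model, in bits per event.
avg_nll_bits = -sum(math.log2(model[e]) for e in events) / len(events)

# Cross-entropy between the empirical distribution p and the model q.
counts = Counter(events)
empirical = {e: c / len(events) for e, c in counts.items()}
cross_entropy_bits = -sum(empirical[e] * math.log2(model[e]) for e in empirical)

# The empirical distribution scored against itself gives the lowest achievable value.
best_bits = -sum(pe * math.log2(pe) for pe in empirical.values())

perplexity = 2 ** cross_entropy_bits

# A uniform model over 3 events is "confused among" exactly 3 choices.
uniform_bits = -sum(empirical[e] * math.log2(1 / 3) for e in empirical)
uniform_perplexity = 2 ** uniform_bits

print(avg_nll_bits, cross_entropy_bits, best_bits, perplexity, uniform_perplexity)
```

The first two printed numbers match because summing over events and summing over distinct outcomes weighted by frequency are the same sum.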

By-hand exercise (meets the bar)

With the $p,q$ above, $D_{KL}(q\|p)$ happens to equal $0.25$ bits as well, so this particular pair does not show the asymmetry. Now take $q'=(0.7,0.2,0.1)$ and show $D_{KL}(p\|q')\ne D_{KL}(q'\|p)$, which demonstrates asymmetry.
Compute the perplexity of $p$, i.e. $2^{H(p)}$. (Answer: $2^{1.5}\approx 2.83$.)
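
After working the exercise by hand, a short script can check the answers. This is a sketch: `kl_bits` is a helper name chosen here, and it assumes every probability is strictly positive, which holds for these distributions.

```python
import math

def kl_bits(a, b):
    """D_KL(a || b) in bits, assuming all entries of a and b are positive."""
    return sum(ax * math.log2(ax / bx) for ax, bx in zip(a, b))

p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]
q_prime = [0.7, 0.2, 0.1]

kl_q_p = kl_bits(q, p)             # the coincidentally symmetric pair
forward = kl_bits(p, q_prime)      # D_KL(p || q')
reverse = kl_bits(q_prime, p)      # D_KL(q' || p)

entropy_p = -sum(px * math.log2(px) for px in p)
perplexity_p = 2 ** entropy_p

print(kl_q_p, forward, reverse, perplexity_p)
```

Compare the printed values with your hand results; the two directions for $q'$ should differ.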