Cross-entropy and KL divergence
evergreen#foundations#information-theory
Bar: compute entropy, cross-entropy, and KL by hand for a 3-outcome distribution — and explain in one sentence what each means.
Intuition first
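Information is "surprise." A rare event carries more bits than a common one: $I(x) = -\log_2 p(x)$.
- Entropy $H(p) = -\sum_x p(x)\log_2 p(x)$: the irreducible average surprise of $p$; the best you could ever compress it to.
- Cross-entropy $H(p, q) = -\sum_x p(x)\log_2 q(x)$: the average surprise you actually pay when you believe $q$ but reality is $p$.
- KL divergence $D_{\mathrm{KL}}(p \| q) = \sum_x p(x)\log_2\tfrac{p(x)}{q(x)}$: the extra bits you waste for believing the wrong model: $D_{\mathrm{KL}}(p \| q) = H(p, q) - H(p)$.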
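The three quantities above can be sketched in a few lines of Python. This is a minimal sketch rather than a reference implementation: the function names `entropy`, `cross_entropy`, and `kl_divergence` are illustrative, distributions are assumed to be plain lists of probabilities over the same outcomes, and terms with $p(x) = 0$ are skipped under the convention $0 \log 0 = 0$.

```python
import math

def entropy(p):
    """H(p) = -sum_x p(x) log2 p(x), in bits. Terms with p(x) = 0 contribute 0."""
    return -sum(px * math.log2(px) for px in p if px > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log2 q(x), in bits.

    Returns math.inf if q assigns zero probability to an outcome p can produce.
    """
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue
        if qx == 0:
            return math.inf
        total -= px * math.log2(qx)
    return total

def kl_divergence(p, q):
    """D_KL(p || q) = H(p, q) - H(p): extra bits paid for using q instead of p."""
    return cross_entropy(p, q) - entropy(p)

# A fair coin carries exactly 1 bit per flip.
fair = [0.5, 0.5]
biased = [0.9, 0.1]
print(entropy(fair))                     # 1.0
print(kl_divergence(fair, fair))         # 0.0
print(kl_divergence(fair, biased) > 0)   # True
```

Computing KL as the difference of the other two keeps the identity $D_{\mathrm{KL}}(p \| q) = H(p, q) - H(p)$ visible in the code; a direct sum over $p(x)\log_2\tfrac{p(x)}{q(x)}$ would give the same value.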
The math
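Log base 2 → bits; natural log → nats. Key properties:
- $D_{\mathrm{KL}}(p \| q) \ge 0$, with equality iff $p = q$ (Gibbs' inequality). So $H(p, q) \ge H(p)$ always.
- Asymmetric: $D_{\mathrm{KL}}(p \| q) \ne D_{\mathrm{KL}}(q \| p)$ in general. It is not a distance/metric.
- Undefined (taken as $+\infty$) if $q(x) = 0$ where $p(x) > 0$: the model assigns infinite surprise to an event that actually occurs, which is why we smooth probabilities.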
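The zero-probability case and the smoothing fix can be illustrated with a short sketch. The helper names `kl_bits` and `additive_smooth` and the pseudo-count `alpha = 0.01` are assumptions chosen for illustration; additive (Laplace-style) smoothing is one common option, not the only one.

```python
import math

def kl_bits(p, q):
    """D_KL(p || q) in bits; math.inf if q(x) = 0 somewhere p(x) > 0."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue  # convention: 0 * log(0 / q) = 0
        if qx == 0:
            return math.inf
        total += px * math.log2(px / qx)
    return total

def additive_smooth(q, alpha=0.01):
    """Add a pseudo-count alpha to every outcome, then renormalize."""
    total = sum(q) + alpha * len(q)
    return [(qx + alpha) / total for qx in q]

p = [0.5, 0.25, 0.25]
q_zero = [0.5, 0.5, 0.0]  # q rules out an outcome that p actually produces

print(kl_bits(p, q_zero))                    # inf
q_smooth = additive_smooth(q_zero)
print(math.isfinite(kl_bits(p, q_smooth)))   # True
print(kl_bits(p, p))                         # 0.0 (equality case of Gibbs' inequality)
```

The smoothed value depends on `alpha`, so it should be treated as a modeling choice rather than a correction that recovers the "true" divergence.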
Worked example (by hand, in bits)
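Let $p = (0.5, 0.25, 0.25)$ and $q = (0.25, 0.25, 0.5)$.
- $H(p) = \tfrac12\log_2 2 + \tfrac14\log_2 4 + \tfrac14\log_2 4 = 0.5 + 0.5 + 0.5 = 1.5$ bits
- $H(p, q) = \tfrac12\log_2 4 + \tfrac14\log_2 4 + \tfrac14\log_2 2 = 1 + 0.5 + 0.25 = 1.75$ bits
- $D_{\mathrm{KL}}(p \| q) = H(p, q) - H(p) = 1.75 - 1.5 = 0.25$ bits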
Check directly: $\tfrac12\log_2\tfrac{0.5}{0.25} + \tfrac14\log_2\tfrac{0.25}{0.25} + \tfrac14\log_2\tfrac{0.25}{0.5} = \tfrac12(1) + \tfrac14(0) + \tfrac14(-1) = 0.25$ ✓
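The hand computation can be double-checked numerically. The sketch below reads $p = (0.5, 0.25, 0.25)$ and $q = (0.25, 0.25, 0.5)$ off the terms of the direct check above; the variable names are illustrative. Every probability here is a power of two, so the floating-point results are exact.

```python
import math

p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

# Entropy of p, cross-entropy of q under p, and KL computed term by term.
h_p = -sum(px * math.log2(px) for px in p)
h_pq = -sum(px * math.log2(qx) for px, qx in zip(p, q))
kl_direct = sum(px * math.log2(px / qx) for px, qx in zip(p, q))

print(h_p)          # 1.5
print(h_pq)         # 1.75
print(kl_direct)    # 0.25
print(h_pq - h_p)   # 0.25
```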
Why this is everywhere in ML
- Cross-entropy loss = maximum likelihood. Minimizing $H(p, q)$ over your model $q$ for the empirical label distribution $p$ is MLE. Next-event prediction on a clickstream minimizes cross-entropy over the event vocabulary.
- Perplexity $= 2^{H(p, q)}$ (cross-entropy in bits): the effective branching factor / "how many equally-likely choices the model is confused among." (More in Information theory basics.)
- Proper scoring rule: log-loss is minimized in expectation only by the true probabilities, so it rewards honesty and calibration. See Evaluation theory.
- Forward vs reverse KL shapes behavior: minimizing $D_{\mathrm{KL}}(p \| q)$ is mode-covering (q must put mass everywhere p does); minimizing $D_{\mathrm{KL}}(q \| p)$ is mode-seeking. This distinction recurs in variational methods.
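The link between cross-entropy, likelihood, and perplexity can be checked numerically. The sketch below uses a made-up sequence of clickstream events and a hypothetical fixed model `q` over a three-event vocabulary (both are illustrative assumptions, not data from this note). It shows that the average negative log-likelihood of the data equals the cross-entropy between the empirical distribution `p_hat` and `q`, and computes perplexity as 2 raised to that cross-entropy.

```python
import math
from collections import Counter

# Hypothetical observed events and a hypothetical model over the vocabulary.
events = ["view", "view", "click", "view", "buy", "click", "view", "view"]
q = {"view": 0.5, "click": 0.25, "buy": 0.25}

# Average negative log-likelihood of the data under q, in bits per event.
avg_nll = -sum(math.log2(q[e]) for e in events) / len(events)

# Cross-entropy between the empirical distribution p_hat and the model q.
counts = Counter(events)
p_hat = {e: c / len(events) for e, c in counts.items()}
cross_entropy = -sum(p_hat[e] * math.log2(q[e]) for e in p_hat)

# Perplexity: effective number of equally likely choices per event.
perplexity = 2 ** cross_entropy

print(avg_nll, cross_entropy)   # 1.375 1.375
print(perplexity)
```

Because the two numbers coincide for any `q`, choosing the `q` that minimizes cross-entropy is the same as choosing the `q` that maximizes the likelihood of the observed events.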
By-hand exercise (meets the bar)
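- With the $p, q$ above, this particular pair happens to give $D_{\mathrm{KL}}(q \| p) = 0.25$ bits too. Now take a different $q$ and show $D_{\mathrm{KL}}(p \| q) \ne D_{\mathrm{KL}}(q \| p)$, proving asymmetry.
- Compute the perplexity of $p$. (Answer: $2^{H(p)} = 2^{1.5} \approx 2.83$.)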
Links
- Built on by: Information theory basics · Evaluation theory · Optimization & gradients
- Used in: Prediction (next-event log-loss), Concept drift in production (KL between time windows)