Cross-entropy and KL divergence

evergreen #foundations #information-theory

Up: 10_Foundations-MOC

Bar: compute entropy, cross-entropy, and KL by hand for a 3-outcome distribution — and explain in one sentence what each means.

Intuition first

Information is "surprise." A rare event carries more bits than a common one: $-\log p(x)$. Average that surprise and you get the three quantities below.

The math

$$H(p) = -\sum_x p(x)\log p(x)$$
$$H(p,q) = -\sum_x p(x)\log q(x)$$
$$D_{KL}(p\|q) = \sum_x p(x)\log\frac{p(x)}{q(x)} = H(p,q) - H(p)$$
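
The identity $D_{KL}(p\|q) = H(p,q) - H(p)$ follows from splitting the logarithm of the ratio. A short derivation, using only the definitions above:

$$
\begin{aligned}
D_{KL}(p\|q) &= \sum_x p(x)\log\frac{p(x)}{q(x)} \\
&= \sum_x p(x)\log p(x) - \sum_x p(x)\log q(x) \\
&= -H(p) + H(p,q).
\end{aligned}
$$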

Log base 2 → bits; natural log → nats. Key properties:

Worked example (by hand, in bits)

Let $p=(\tfrac12,\tfrac14,\tfrac14)$ and $q=(\tfrac14,\tfrac14,\tfrac12)$.

$$H(p) = -\big(\tfrac12\log_2\tfrac12 + \tfrac14\log_2\tfrac14 + \tfrac14\log_2\tfrac14\big) = \tfrac12(1) + \tfrac14(2) + \tfrac14(2) = 1.5\ \text{bits}$$
$$H(p,q) = -\big(\tfrac12\log_2\tfrac14 + \tfrac14\log_2\tfrac14 + \tfrac14\log_2\tfrac12\big) = \tfrac12(2)+\tfrac14(2)+\tfrac14(1) = 1.75\ \text{bits}$$
$$D_{KL}(p\|q) = H(p,q)-H(p) = 1.75 - 1.5 = 0.25\ \text{bits}$$

Check directly: $\tfrac12\log_2\tfrac{0.5}{0.25} + \tfrac14\log_2\tfrac{0.25}{0.25} + \tfrac14\log_2\tfrac{0.25}{0.5} = \tfrac12(1) + \tfrac14(0) + \tfrac14(-1) = 0.25$ ✓
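
The hand computation can also be cross-checked numerically. Below is a minimal sketch, assuming distributions are plain tuples of probabilities over the same outcomes in the same order. The function names and the `base` parameter are illustrative, not from this note, and the handling of zero probabilities (skip terms with $p(x)=0$, return infinity when $q(x)=0$ but $p(x)>0$) is the usual convention, added here as an assumption.

```python
import math


def entropy(p, base=2.0):
    """H(p) = -sum_x p(x) log p(x). Terms with p(x) = 0 contribute 0."""
    return -sum(px * math.log(px, base) for px in p if px > 0)


def cross_entropy(p, q, base=2.0):
    """H(p, q) = -sum_x p(x) log q(x)."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue  # assumed convention: 0 * log(anything) = 0
        if qx == 0:
            return math.inf  # q rules out an outcome that p allows
        total -= px * math.log(qx, base)
    return total


def kl_divergence(p, q, base=2.0):
    """D_KL(p || q) = sum_x p(x) log(p(x) / q(x))."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue
        if qx == 0:
            return math.inf
        total += px * math.log(px / qx, base)
    return total


# The distributions from the worked example.
p = (0.5, 0.25, 0.25)
q = (0.25, 0.25, 0.5)

h_p = entropy(p)
h_pq = cross_entropy(p, q)
d_pq = kl_divergence(p, q)

print(f"H(p)        = {h_p:.2f} bits")   # H(p)        = 1.50 bits
print(f"H(p, q)     = {h_pq:.2f} bits")  # H(p, q)     = 1.75 bits
print(f"D_KL(p||q)  = {d_pq:.2f} bits")  # D_KL(p||q)  = 0.25 bits

# The identity D_KL(p||q) = H(p, q) - H(p), up to floating-point error.
assert math.isclose(d_pq, h_pq - h_p)
```

Passing `base=math.e` gives the same quantities in nats instead of bits.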

Why this is everywhere in ML

By-hand exercise (meets the bar)

  1. For the $p,q$ above, $D_{KL}(q\|p)$ happens to equal $0.25$ bits too, so this pair hides the asymmetry. Now take $q'=(0.7,0.2,0.1)$ and show $D_{KL}(p\|q')\ne D_{KL}(q'\|p)$, which proves that KL divergence is not symmetric.
  2. Compute the perplexity of $p$. (Answer: $2^{1.5}\approx 2.83$.)
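
Once the exercise is done by hand, the answers can be checked with a few lines of Python. This is a sketch only: `kl_bits` is a hypothetical helper name, it assumes every $q(x)$ is positive wherever $p(x)$ is, and perplexity is taken as $2^{H(p)}$ with entropy in bits, which matches the answer quoted in item 2.

```python
import math


def kl_bits(p, q):
    """D_KL(p || q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)


p = (0.5, 0.25, 0.25)
q = (0.25, 0.25, 0.5)
q_prime = (0.7, 0.2, 0.1)

# Item 1: the original pair gives the same value in both directions...
same_forward = kl_bits(p, q)
same_reverse = kl_bits(q, p)

# ...while q' gives two different values (roughly 0.168 and 0.143 bits).
forward = kl_bits(p, q_prime)
reverse = kl_bits(q_prime, p)

# Item 2: perplexity as 2 raised to the entropy in bits.
entropy_p = -sum(px * math.log2(px) for px in p)
perplexity_p = 2 ** entropy_p

print(f"D_KL(p||q)  = {same_forward:.3f}, D_KL(q||p)  = {same_reverse:.3f}")
print(f"D_KL(p||q') = {forward:.3f}, D_KL(q'||p) = {reverse:.3f}")
print(f"perplexity(p) = {perplexity_p:.2f}")
```

A perplexity of about 2.83 reads as "$p$ is as uncertain as a uniform choice among roughly 2.83 outcomes", slightly less than the 3 a uniform distribution over three outcomes would give.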