Information theory basics

evergreen#foundations#information-theory

Up: 10_Foundations-MOC

Bar: compute entropy and mutual information for a 2×2 joint by hand, and say what perplexity means in one sentence.

Entropy — average surprise

H(X)=xp(x)logp(x)H(X)=-\sum_x p(x)\log p(x)

Maximal for a uniform distribution (logk\log k for kk outcomes), zero for a certain one. Units: bits (log2\log_2) or nats (ln\ln). (Cross-entropy and KL build on this — see Cross-entropy and KL divergence.)

Joint, conditional, and the chain rule

Mutual information — shared information

I(X;Y)=H(X)H(XY)=H(X)+H(Y)H(X,Y)=DKL(p(x,y)p(x)p(y))I(X;Y)=H(X)-H(X\mid Y)=H(X)+H(Y)-H(X,Y)=D_{KL}\big(p(x,y)\,\|\,p(x)p(y)\big)

I0I\ge0; it's 00 iff XYX\perp Y. It measures how much knowing one variable reduces uncertainty about the other — a nonlinear dependence measure (unlike correlation, it catches any relationship). Useful for feature selection: which event/feature carries information about conversion?

Worked (by hand)

Perfect dependence: joint [0.5000.5]\begin{bmatrix}0.5&0\\0&0.5\end{bmatrix}. Marginals are (0.5,0.5)(0.5,0.5) both, so H(X)=H(Y)=1H(X)=H(Y)=1 bit. H(X,Y)=2(0.5log20.5)=1H(X,Y)=-2(0.5\log_2 0.5)=1 bit. Then

I(X;Y)=H(X)+H(Y)H(X,Y)=1+11=1 bit.I(X;Y)=H(X)+H(Y)-H(X,Y)=1+1-1=1\ \text{bit}.
Knowing XX tells you YY completely — exactly 1 bit shared.

Independence: joint [0.250.250.250.25]\begin{bmatrix}0.25&0.25\\0.25&0.25\end{bmatrix}H(X,Y)=2H(X,Y)=2, H(X)=H(Y)=1H(X)=H(Y)=1, so I=1+12=0I=1+1-2=0. As expected.

Perplexity — effective branching factor

PP=2H(or eH in nats)\text{PP}=2^{H}\quad(\text{or } e^{H}\text{ in nats})

"How many equally-likely options is the model effectively choosing among?" Uniform over kk → perplexity kk. A next-event model with perplexity 8 is as confused as if guessing uniformly among 8 events. It's the standard report for sequence models and ties directly to cross-entropy loss.

Why this matters here

By-hand exercise (meets the bar)

  1. Compute I(X;Y)I(X;Y) for joint [0.40.10.10.4]\begin{bmatrix}0.4&0.1\\0.1&0.4\end{bmatrix}. (Marginals (0.5,0.5)(0.5,0.5); H(X,Y)=2(0.4log20.4)2(0.1log20.1)1.722H(X,Y)=-2(0.4\log_2 0.4)-2(0.1\log_2 0.1)\approx1.722; I0.278I\approx0.278 bits.)
  2. What is the perplexity of a fair 6-sided die? (Answer: 2log26=62^{\log_2 6}=6.)