Information theory basics

evergreen#foundations#information-theory

Bar: compute entropy and mutual information for a 2×2 joint by hand, and say what perplexity means in one sentence.

Entropy — average surprise

$$H(X)=-\sum_x p(x)\log p(x)$$

Maximal for a uniform distribution ( $\log k$ for $k$ outcomes), zero for a certain one. Units: bits ( $\log_2$ ) or nats ( $\ln$ ). (Cross-entropy and KL build on this — see Cross-entropy and KL divergence.)
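As a quick numerical check of the definition, here is a minimal sketch in plain Python. The function name `entropy` and its `base` parameter are illustrative choices, not part of the note; zero-probability outcomes are skipped under the usual convention that $0\log 0=0$.

```python
import math

def entropy(probs, base=2.0):
    """Shannon entropy of a discrete distribution given as a sequence of probabilities.

    base=2 gives bits, base=math.e gives nats.
    """
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Outcomes with p == 0 are skipped: they contribute nothing to the sum.
    return sum(p * math.log(1.0 / p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))               # fair coin
print(entropy([1.0, 0.0]))               # certain outcome, no surprise
print(entropy([0.25] * 4))               # uniform over 4 outcomes, log2(4) up to float rounding
print(entropy([0.5, 0.5], base=math.e))  # the same coin measured in nats
```

Switching units only rescales the result: the value in nats is the value in bits multiplied by $\ln 2$.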

Joint, conditional, and the chain rule

Joint $H(X,Y)=-\sum_{x,y} p(x,y)\log p(x,y)$.
Conditional $H(Y\mid X)=H(X,Y)-H(X)$: the surprise left in $Y$ once you know $X$.
Chain rule $H(X,Y)=H(X)+H(Y\mid X)$.
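For reference, the chain rule follows in a few lines from $p(x,y)=p(x)\,p(y\mid x)$. This sketch uses the direct definition $H(Y\mid X)=-\sum_{x,y}p(x,y)\log p(y\mid x)$, which the note does not state explicitly but which agrees with the difference form given here.

$$
\begin{aligned}
H(X,Y) &= -\sum_{x,y} p(x,y)\log p(x,y) \\
&= -\sum_{x,y} p(x,y)\big[\log p(x)+\log p(y\mid x)\big] \\
&= -\sum_{x} p(x)\log p(x)\;-\;\sum_{x,y} p(x,y)\log p(y\mid x) \\
&= H(X)+H(Y\mid X).
\end{aligned}
$$

The third line uses $\sum_y p(x,y)=p(x)$ to collapse the first sum onto the marginal.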

Mutual information — shared information

$$I(X;Y)=H(X)-H(X\mid Y)=H(X)+H(Y)-H(X,Y)=D_{KL}\big(p(x,y)\,\|\,p(x)p(y)\big)$$

$I\ge0$, with $I=0$ iff $X\perp Y$. It measures how much knowing one variable reduces uncertainty about the other. Unlike Pearson correlation, which captures only linear association, MI is nonzero for any statistical dependence, linear or not. Useful for feature selection: which event/feature carries information about conversion?

Worked (by hand)

Perfect dependence: joint $\begin{bmatrix}0.5&0\\0&0.5\end{bmatrix}$. Both marginals are $(0.5,0.5)$, so $H(X)=H(Y)=1$ bit. $H(X,Y)=-2(0.5\log_2 0.5)=1$ bit. Then

$$I(X;Y)=H(X)+H(Y)-H(X,Y)=1+1-1=1\ \text{bit}.$$

Knowing $X$ tells you $Y$ completely: exactly 1 bit shared.

Independence: joint $\begin{bmatrix}0.25&0.25\\0.25&0.25\end{bmatrix}$ → $H(X,Y)=2$ , $H(X)=H(Y)=1$ , so $I=1+1-2=0$ . As expected.
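The two hand computations can be checked numerically. The sketch below computes $I(X;Y)$ from a joint table in two ways, via the entropy identity and via the KL form, so the two can be compared. The helper names `H` and `mutual_information` are illustrative, and the joint is assumed to be a list of rows that sums to 1.

```python
import math

def H(probs):
    """Entropy in bits of a flat list of probabilities (zeros are skipped)."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def mutual_information(joint):
    """Return I(X;Y) in bits computed two ways: (entropy identity, KL form).

    joint[i][j] is p(x=i, y=j); rows index x, columns index y.
    """
    px = [sum(row) for row in joint]            # marginal of X
    py = [sum(col) for col in zip(*joint)]      # marginal of Y
    flat = [p for row in joint for p in row]    # all joint cells
    via_entropies = H(px) + H(py) - H(flat)
    # KL form: sum over cells of p(x,y) * log2( p(x,y) / (p(x) p(y)) )
    via_kl = sum(
        p * math.log2(p / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0
    )
    return via_entropies, via_kl

perfect = [[0.5, 0.0], [0.0, 0.5]]
independent = [[0.25, 0.25], [0.25, 0.25]]
print(mutual_information(perfect))      # both forms should agree at 1 bit
print(mutual_information(independent))  # both forms should agree at 0 bits
```

Both routes should give the same number up to floating-point rounding; a disagreement usually means the table does not sum to 1.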

Perplexity — effective branching factor

$$\text{PP}=2^{H}\quad(\text{or } e^{H}\text{ in nats})$$

"How many equally likely options is the model effectively choosing among?" Uniform over $k$ → perplexity $k$. A next-event model with perplexity 8 is as confused as if it were guessing uniformly among 8 events. For a model scored on data, $H$ is the average cross-entropy per event, so perplexity is the exponentiated cross-entropy loss; that is why it is the standard report for sequence models.

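To make the link to cross-entropy concrete, here is a minimal sketch that computes perplexity from the probabilities a model assigned to the events that actually occurred. The function name `perplexity` and the input format (a flat list of assigned probabilities, all strictly positive) are assumptions for illustration.

```python
import math

def perplexity(assigned_probs):
    """Perplexity from the probabilities a model gave to the observed events.

    Each entry must be > 0; math.log2 raises ValueError on a zero probability.
    """
    if not assigned_probs:
        raise ValueError("need at least one event")
    # Average negative log2-likelihood per event = empirical cross-entropy in bits.
    cross_entropy_bits = -sum(math.log2(q) for q in assigned_probs) / len(assigned_probs)
    return 2.0 ** cross_entropy_bits

# A model that always spreads its mass uniformly over 8 events.
print(perplexity([1 / 8] * 10))
# A sharper model that usually puts high probability on what happens next.
print(perplexity([0.9, 0.8, 0.95, 0.5]))
```

The uniform case should come out at about 8, matching the branching-factor reading; the sharper model should land much closer to 1.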
Why this matters here

Feature/event selection: rank candidate features by $I(\text{feature};\text{target})$ — but beware, high MI with the target can also signal leakage (Leakage checklist).
Drift detection: KL divergence between a feature's distribution in two time windows, or MI between the feature and a window indicator, flags distribution shift (Concept drift in production).
Model reporting: perplexity and cross-entropy are the honest scoreboard for Prediction.
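The feature-ranking idea can be sketched with a plug-in MI estimate from paired discrete observations. Everything here is hypothetical: the feature names `clicked_ad` and `weekday`, the toy data, and the helper `empirical_mi`. The plug-in estimate is biased upward on small samples, so treat it as a ranking heuristic rather than a calibrated measurement.

```python
import math
from collections import Counter

def empirical_mi(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete observations."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and the same length")
    cx, cy, cxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x,y) / (p(x) p(y)) simplifies to count(x,y) * n / (count(x) * count(y)).
    return sum(
        (c / n) * math.log2(c * n / (cx[x] * cy[y]))
        for (x, y), c in cxy.items()
    )

# Hypothetical toy data: two candidate features and a binary conversion target.
target     = [1, 1, 0, 0, 1, 0, 1, 0]
clicked_ad = [1, 1, 0, 0, 1, 0, 1, 0]  # identical to the target in this sample
weekday    = [0, 1, 0, 1, 0, 1, 1, 0]  # empirically independent of the target here

features = {"clicked_ad": clicked_ad, "weekday": weekday}
scores = {name: empirical_mi(values, target) for name, values in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking)
```

In this toy sample `clicked_ad` reaches the full entropy of the target, which is exactly the pattern the leakage warning is about: a feature that explains everything deserves a second look before it is trusted.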

By-hand exercise (meets the bar)

Compute $I(X;Y)$ for joint $\begin{bmatrix}0.4&0.1\\0.1&0.4\end{bmatrix}$ . (Marginals $(0.5,0.5)$ ; $H(X,Y)=-2(0.4\log_2 0.4)-2(0.1\log_2 0.1)\approx1.722$ ; $I\approx0.278$ bits.)
What is the perplexity of a fair 6-sided die? (Answer: $2^{\log_2 6}=6$ .)