Information theory basics
evergreen#foundations#information-theory
Up: 10_Foundations-MOC
Bar: compute entropy and mutual information for a 2×2 joint by hand, and say what perplexity means in
one sentence.
Entropy — average surprise
H(X)=−x∑p(x)logp(x)Maximal for a uniform distribution (logk for k outcomes), zero for a certain one. Units: bits (log2)
or nats (ln). (Cross-entropy and KL build on this — see Cross-entropy and KL divergence.)
Joint, conditional, and the chain rule
- Joint H(X,Y)=−∑p(x,y)logp(x,y).
- Conditional H(Y∣X)=H(X,Y)−H(X) = surprise left in Y once you know X.
- Chain rule H(X,Y)=H(X)+H(Y∣X).
I(X;Y)=H(X)−H(X∣Y)=H(X)+H(Y)−H(X,Y)=DKL(p(x,y)∥p(x)p(y))I≥0; it's 0 iff X⊥Y. It measures how much knowing one variable reduces uncertainty about the
other — a nonlinear dependence measure (unlike correlation, it catches any relationship). Useful for
feature selection: which event/feature carries information about conversion?
Worked (by hand)
Perfect dependence: joint [0.5000.5]. Marginals are (0.5,0.5) both,
so H(X)=H(Y)=1 bit. H(X,Y)=−2(0.5log20.5)=1 bit. Then
I(X;Y)=H(X)+H(Y)−H(X,Y)=1+1−1=1 bit.
Knowing X tells you Y completely — exactly 1 bit shared.
Independence: joint [0.250.250.250.25] → H(X,Y)=2, H(X)=H(Y)=1,
so I=1+1−2=0. As expected.
Perplexity — effective branching factor
PP=2H(or eH in nats)"How many equally-likely options is the model effectively choosing among?" Uniform over k → perplexity k.
A next-event model with perplexity 8 is as confused as if guessing uniformly among 8 events. It's the
standard report for sequence models and ties directly to cross-entropy loss.
Why this matters here
- Feature/event selection: rank candidate features by I(feature;target) — but beware,
high MI with the target can also signal leakage (Leakage checklist).
- Drift detection: KL/MI between time windows flags distribution shift (Concept drift in production).
- Model reporting: perplexity and cross-entropy are the honest scoreboard for Prediction.
By-hand exercise (meets the bar)
- Compute I(X;Y) for joint [0.40.10.10.4].
(Marginals (0.5,0.5); H(X,Y)=−2(0.4log20.4)−2(0.1log20.1)≈1.722; I≈0.278 bits.)
- What is the perplexity of a fair 6-sided die? (Answer: 2log26=6.)
Links