Probability for sequences
evergreen#foundations#probability
Up: 10_Foundations-MOC
Bar: factor a joint with the chain rule, apply Bayes to a base-rate problem, and state the difference
between MLE and MAP — all by hand.
The core objects
- Joint P(x1,…,xn) — probability of a whole sequence.
- Marginal P(x1)=∑x2,…P(x1,…) — sum out the rest.
- Conditional P(xt∣x<t)=P(x<t)P(x≤t).
- Chain rule (always true, no assumptions):
P(x1,…,xn)=t=1∏nP(xt∣x1,…,xt−1)
A Markov model just truncates the conditioning to the last k events → Markov chains and HMMs.
A neural sequence model keeps all of it. Same chain rule, different conditioning budget.
Independence vs conditional independence
- Independent: P(x,y)=P(x)P(y).
- Conditionally independent given z: P(x,y∣z)=P(x∣z)P(y∣z) — the workhorse assumption
(naïve Bayes, HMMs). Two events can be dependent marginally but independent given a confounder, and vice
versa (links to Causal inference primer).
Bayes — and the base-rate trap (worked)
P(H∣E)=P(E)P(E∣H)P(H),P(E)=h∑P(E∣h)P(h)Bot detection. Prior P(bot)=0.01. A bot fires "fast" requests with P(fast∣bot)=0.9;
humans rarely, P(fast∣human)=0.05. Saw "fast" — is it a bot?
P(bot∣fast)=0.9⋅0.01+0.05⋅0.990.9⋅0.01=0.009+0.04950.009=0.05850.009≈0.154
Even a "90% accurate" signal is wrong 85% of the time here, because bots are rare. Base rates dominate
— the same arithmetic governs rare-conversion and fraud problems (Evaluation theory, Bots, fraud, and invalid traffic).
MLE vs MAP (worked)
Estimate a 3-page transition distribution from counts c=(nA,nB,nC), total N.
- MLE (maximize likelihood): p^A=nA/N. Zero count → zero probability (brittle for rare events).
- MAP with a symmetric Dirichlet(α) prior (a.k.a. add-α / Laplace smoothing):
p^AMAP=N+K(α−1)nA+α−1,posterior mean=N+KαnA+α
With counts (8,1,0), N=9, K=3, α=1 (Laplace): smoothed estimates
129,122,121 — the never-seen page C no longer has probability 0.
MLE = MAP with a flat prior and infinite data; the prior is your regularizer when data is thin.
By-hand exercise (meets the bar)
- Redo the bot example with prior 0.1 instead of 0.01. (Answer: ≈0.667 — base rate is the lever.)
- Factor P(x1,x2,x3) under a first-order Markov assumption and count how many parameters a full
joint needs vs the Markov version for a 5-symbol alphabet, length-3 sequences.
Links