Probability for sequences

evergreen#foundations#probability

Bar: factor a joint with the chain rule, apply Bayes to a base-rate problem, and state the difference between MLE and MAP — all by hand.

The core objects

Joint $P(x_1,\dots,x_n)$ — probability of a whole sequence.
Marginal $P(x_1)=\sum_{x_2,\dots,x_n}P(x_1,x_2,\dots,x_n)$: sum out the rest.
Conditional $P(x_t\mid x_{<t})=\dfrac{P(x_{\le t})}{P(x_{<t})}$, defined whenever $P(x_{<t})>0$.
Chain rule (always true, no assumptions):
$P(x_1,\dots,x_n)=\prod_{t=1}^{n}P(x_t\mid x_1,\dots,x_{t-1})$
A Markov model just truncates the conditioning to the last $k$ events → Markov chains and HMMs. A neural sequence model keeps all of it. Same chain rule, different conditioning budget.
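A quick numerical check of the chain rule, as a minimal sketch: the joint below is a made-up table over binary sequences of length 3 (the weights are arbitrary assumptions, not data), and `prefix_prob` and `chain_rule_prob` are hypothetical helper names. Each conditional is computed as $P(x_{\le t})/P(x_{<t})$, exactly as defined above, and the product is compared against the joint.

```python
from itertools import product

# Hypothetical joint over binary sequences of length 3.
# The weights are arbitrary (all positive), then normalized to sum to 1.
weights = {seq: 1 + i for i, seq in enumerate(product((0, 1), repeat=3))}
total = sum(weights.values())
joint = {seq: w / total for seq, w in weights.items()}


def prefix_prob(prefix):
    """P(x_1..x_t = prefix): marginalize the joint over the remaining positions."""
    t = len(prefix)
    return sum(p for seq, p in joint.items() if seq[:t] == prefix)


def chain_rule_prob(seq):
    """Product over t of P(x_t | x_<t), each taken as P(x_<=t) / P(x_<t)."""
    prob = 1.0
    for t in range(1, len(seq) + 1):
        prob *= prefix_prob(seq[:t]) / prefix_prob(seq[:t - 1])
    return prob


# The factorization reproduces every entry of the joint (up to float error).
for seq, p in joint.items():
    assert abs(chain_rule_prob(seq) - p) < 1e-12
```

The product telescopes, which is why no assumption about the joint is needed. A Markov model would replace `seq[:t - 1]` with only its last $k$ symbols, and that step is an approximation rather than an identity.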

Independence vs conditional independence

Independent: $P(x,y)=P(x)P(y)$.
Conditionally independent given $z$: $P(x,y\mid z)=P(x\mid z)P(y\mid z)$. This is the workhorse assumption (naïve Bayes, HMMs). Two events can be dependent marginally but independent given a confounder; the reverse also happens, where two marginally independent events become dependent once you condition on a common effect (links to Causal inference primer).
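To make the confounder case concrete, here is a small sketch that builds a joint as $P(z)\,P(x\mid z)\,P(y\mid z)$ and measures both kinds of dependence. All probabilities are invented for illustration, and `bern` and `prob` are hypothetical helpers.

```python
from itertools import product

# Assumed numbers: z is a binary confounder that drives both x and y.
p_z = {0: 0.5, 1: 0.5}
p_x1_given_z = {0: 0.1, 1: 0.9}  # P(x=1 | z)
p_y1_given_z = {0: 0.2, 1: 0.8}  # P(y=1 | z)


def bern(p_one, value):
    """Probability of a binary value when P(value=1) = p_one."""
    return p_one if value == 1 else 1 - p_one


# Joint built so that x and y are independent given z by construction.
joint = {
    (x, y, z): p_z[z] * bern(p_x1_given_z[z], x) * bern(p_y1_given_z[z], y)
    for x, y, z in product((0, 1), repeat=3)
}


def prob(x=None, y=None, z=None):
    """Sum the joint over every variable left as None."""
    return sum(
        p for (xi, yi, zi), p in joint.items()
        if x in (None, xi) and y in (None, yi) and z in (None, zi)
    )


# Marginally: P(x=1, y=1) differs from P(x=1) P(y=1), so x and y are dependent.
marginal_gap = abs(prob(x=1, y=1) - prob(x=1) * prob(y=1))

# Given z: P(x=1, y=1 | z) matches P(x=1 | z) P(y=1 | z) for both values of z.
cond_gap = max(
    abs(prob(x=1, y=1, z=z) / prob(z=z)
        - (prob(x=1, z=z) / prob(z=z)) * (prob(y=1, z=z) / prob(z=z)))
    for z in (0, 1)
)

print(marginal_gap, cond_gap)
```

With these numbers the marginal gap is about 0.12 while the conditional gap is zero up to float error: knowing $x$ tells you about $z$, which tells you about $y$, until $z$ itself is observed.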

Bayes — and the base-rate trap (worked)

$P(H\mid E)=\frac{P(E\mid H)\,P(H)}{P(E)},\quad P(E)=\sum_h P(E\mid h)P(h)$

Bot detection. Prior $P(\text{bot})=0.01$. A bot fires "fast" requests with $P(\text{fast}\mid\text{bot})=0.9$; humans rarely do, $P(\text{fast}\mid\text{human})=0.05$. You observe a "fast" request: is it a bot?

$P(\text{bot}\mid\text{fast})=\frac{0.9\cdot0.01}{0.9\cdot0.01+0.05\cdot0.99}=\frac{0.009}{0.009+0.0495}=\frac{0.009}{0.0585}\approx 0.154$

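A signal that fires on 90% of bots is still pointing at a human about 85% of the time it fires ($1-0.154\approx0.846$), because bots are rare: the 1% of traffic that is bot contributes $0.009$ to $P(\text{fast})$, while the 99% that is human contributes $0.0495$. Base rates dominate, and the same arithmetic governs rare-conversion and fraud problems (Evaluation theory, Bots, fraud, and invalid traffic).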
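The same calculation as a few lines of code, useful for seeing how the posterior moves with the prior. This is a sketch for a two-hypothesis problem only; `posterior` is a hypothetical helper name, and the extra priors in the loop are illustrative values, not from any dataset.

```python
def posterior(prior, p_e_given_h, p_e_given_not_h):
    """P(H | E) for a binary hypothesis, with P(E) expanded by total probability."""
    evidence = p_e_given_h * prior + p_e_given_not_h * (1 - prior)
    return p_e_given_h * prior / evidence


# The bot example: prior 0.01, P(fast | bot) = 0.9, P(fast | human) = 0.05.
print(round(posterior(0.01, 0.9, 0.05), 3))  # 0.154

# Sweep the prior while holding the likelihoods fixed.
for prior in (0.001, 0.01, 0.1, 0.5):
    print(prior, posterior(prior, 0.9, 0.05))
```

Holding the likelihoods fixed and varying only the prior makes the point of the section visible: the signal's quality never changed, only how rare the hypothesis is.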

MLE vs MAP (worked)

Estimate a 3-page transition distribution from counts $\mathbf{c}=(n_A,n_B,n_C)$ , total $N$ .

MLE (maximize likelihood): $\hat p_A = n_A/N$ . Zero count → zero probability (brittle for rare events).
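MAP with a symmetric Dirichlet$(\alpha)$ prior. Add-$\alpha$ smoothing (Laplace smoothing when $\alpha=1$) is the posterior mean under that prior, or equivalently the MAP under a Dirichlet$(\alpha+1)$ prior:
$\hat p_A^{\text{MAP}}=\frac{n_A+\alpha-1}{N+K(\alpha-1)},\qquad \text{posterior mean}=\frac{n_A+\alpha}{N+K\alpha}$
With counts $(8,1,0)$, $N=9$, $K=3$, $\alpha=1$ (Laplace): the posterior mean gives the smoothed estimates $\tfrac{9}{12},\tfrac{2}{12},\tfrac{1}{12}$, so the never-seen page $C$ no longer has probability 0. The MAP at $\alpha=1$ is just the MLE $\tfrac{8}{9},\tfrac{1}{9},0$, because that prior is flat; the MAP needs $\alpha=2$ to produce the same $\tfrac{9}{12},\tfrac{2}{12},\tfrac{1}{12}$. MLE = MAP with a flat prior, and for any fixed prior the MAP converges to the MLE as $N\to\infty$; the prior is your regularizer when data is thin.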
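The two formulas above are easy to mix up, so here is a minimal sketch that evaluates the MLE, the Dirichlet MAP, and the posterior mean on the counts $(8,1,0)$. The function names are hypothetical, and exact fractions are used so the results can be compared with the hand calculation; `alpha` is assumed to be an integer (or a `Fraction`) with $\alpha\ge1$ so the MAP formula stays valid.

```python
from fractions import Fraction


def mle(counts):
    """Maximum likelihood: n_i / N."""
    n = sum(counts)
    return [Fraction(c, n) for c in counts]


def dirichlet_map(counts, alpha):
    """Posterior mode under a symmetric Dirichlet(alpha) prior.

    Assumes alpha >= 1 (integer or Fraction) so every numerator is non-negative.
    """
    n, k = sum(counts), len(counts)
    return [Fraction(c + alpha - 1, n + k * (alpha - 1)) for c in counts]


def posterior_mean(counts, alpha):
    """Posterior mean under a symmetric Dirichlet(alpha) prior (add-alpha smoothing)."""
    n, k = sum(counts), len(counts)
    return [Fraction(c + alpha, n + k * alpha) for c in counts]


counts = (8, 1, 0)
print(mle(counts))                # 8/9, 1/9, 0
print(dirichlet_map(counts, 1))   # flat prior: identical to the MLE
print(posterior_mean(counts, 1))  # 3/4, 1/6, 1/12, i.e. 9/12, 2/12, 1/12
print(dirichlet_map(counts, 2))   # same values as the posterior mean at alpha = 1
```

The shift by one between the two estimators is the thing to remember: the mode of a Dirichlet$(\alpha)$ posterior and the mean of a Dirichlet$(\alpha-1)$ posterior give the same smoothed counts.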

By-hand exercise (meets the bar)

Redo the bot example with prior $0.1$ instead of $0.01$ . (Answer: $\approx 0.667$ — base rate is the lever.)
Factor $P(x_1,x_2,x_3)$ under a first-order Markov assumption and count how many parameters a full joint needs vs the Markov version for a 5-symbol alphabet, length-3 sequences.