Markov chains and HMMs


Bar: write a transition matrix, solve for its stationary distribution by hand, and explain the forward and Viterbi recursions in one line each.

Markov chains

Markov property: given the present, the future is independent of the past, i.e. $P(x_{t+1}\mid x_t,x_{t-1},\dots)=P(x_{t+1}\mid x_t)$. Encode it as a transition matrix $P$ where $P_{ij}=P(\text{next}=j\mid\text{now}=i)$; each row sums to 1.

$n$-step transitions: $P^{(n)}=P^n$ (matrix power), so $(P^n)_{ij}$ is the probability of being in state $j$ exactly $n$ steps after state $i$.
Distribution after one step: $\boldsymbol\pi_{t+1}=\boldsymbol\pi_t P$ (row vector convention).
Stationary distribution $\boldsymbol\pi$ solves $\boldsymbol\pi = \boldsymbol\pi P$ with $\sum_i\pi_i=1$. If the chain is finite, irreducible, and aperiodic, $\boldsymbol\pi$ is unique and $\boldsymbol\pi_t\to\boldsymbol\pi$ from any start.
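The three facts above can be checked numerically. A minimal sketch in plain Python, assuming an illustrative two-state matrix; the helper names (`step`, `mat_pow`, `stationary_by_iteration`) are made up for this note, not taken from any library.

```python
def step(pi, P):
    """One step of the chain: row vector pi times matrix P."""
    n = len(P)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]


def mat_mul(A, B):
    """Plain matrix product of two square matrices."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(P, n):
    """n-step transition matrix P^n by repeated multiplication."""
    size = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, P)
    return result


def stationary_by_iteration(P, tol=1e-12, max_steps=10_000):
    """Iterate pi <- pi P from a uniform start until it stops changing.

    Assumes the chain is irreducible and aperiodic, so the iteration converges.
    """
    pi = [1.0 / len(P)] * len(P)
    for _ in range(max_steps):
        nxt = step(pi, P)
        if max(abs(a - b) for a, b in zip(pi, nxt)) < tol:
            return nxt
        pi = nxt
    raise RuntimeError("did not converge; is the chain periodic or reducible?")


# Illustrative two-state matrix (rows sum to 1).
P = [[0.9, 0.1],
     [0.5, 0.5]]

two_step = mat_pow(P, 2)          # two-step transition probabilities
after_one = step([1.0, 0.0], P)   # start in the first state, take one step
pi = stationary_by_iteration(P)   # fixed point of pi = pi P
```

Iterating to a fixed point is the lazy numerical route; for small chains, solving the linear system by hand gives the exact answer.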

Worked: stationary distribution (by hand)

$$P=\begin{bmatrix}0.9&0.1\\0.5&0.5\end{bmatrix}$$

Solve $\pi_1 = 0.9\pi_1 + 0.5\pi_2$ with $\pi_2=1-\pi_1$:

$$\pi_1 = 0.9\pi_1 + 0.5(1-\pi_1)\ \Rightarrow\ 0.6\pi_1 = 0.5\ \Rightarrow\ \pi_1=\tfrac56\approx0.833,\ \pi_2=\tfrac16.$$
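The same elimination works for any two-state chain. Writing the off-diagonal entries as $a$ and $b$ (symbols introduced here for illustration only, with $a+b>0$), the balance equation gives a closed form:

$$P=\begin{bmatrix}1-a&a\\b&1-b\end{bmatrix},\qquad \pi_1 a=\pi_2 b\ \Rightarrow\ \boldsymbol\pi=\Big(\tfrac{b}{a+b},\ \tfrac{a}{a+b}\Big).$$

With $a=0.1$ and $b=0.5$ this reproduces $\boldsymbol\pi=(\tfrac56,\tfrac16)$.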

In the long run, the chain spends 5/6 of its time in state 1. The matrix is the ground truth you plant when generating synthetic page-transition data (Generating background traffic).
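To make "plant the matrix, then recover it" concrete, here is a rough sketch: sample a long path from the planted $P$, then check that the occupancy and the row-normalised transition counts land near the planted values. The function names and the seed are assumptions for illustration, and the estimator assumes every state is visited at least once.

```python
import random


def simulate_chain(P, start, n_steps, seed=0):
    """Sample a state path of length n_steps + 1 from transition matrix P."""
    rng = random.Random(seed)
    states = list(range(len(P)))
    path = [start]
    for _ in range(n_steps):
        # Draw the next state from the row of the current state.
        path.append(rng.choices(states, weights=P[path[-1]])[0])
    return path


def estimate_transitions(path, n_states):
    """Row-normalised transition counts (maximum-likelihood estimate of P).

    Assumes every state appears as a source at least once in the path.
    """
    counts = [[0] * n_states for _ in range(n_states)]
    for now, nxt in zip(path, path[1:]):
        counts[now][nxt] += 1
    return [[c / sum(row) for c in row] for row in counts]


# Planted ground truth; "state 1" in the text is index 0 here.
P = [[0.9, 0.1],
     [0.5, 0.5]]

path = simulate_chain(P, start=0, n_steps=100_000)
occupancy = path.count(0) / len(path)            # expected to be close to 5/6
P_hat = estimate_transitions(path, n_states=2)   # expected to be close to P
```

The gap between `P_hat` and `P` shrinks roughly with the square root of the number of visits to each row's state, so rarely visited states are estimated worst.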

Hidden Markov Models

Now the state is hidden; you see only emissions. Parameters $\lambda=(A,B,\boldsymbol\pi)$:

$A$: state transition matrix (Markov over hidden states),
$B$: emission probabilities $b_j(o)=P(\text{observe } o\mid\text{state } j)$,
$\boldsymbol\pi$: initial state distribution.

Three canonical problems:

Problem	Question	Algorithm
Likelihood	$P(O\mid\lambda)$ of an observation sequence?	Forward (sum)
Decoding	most likely hidden path?	Viterbi (max)
Learning	best $\lambda$ from data?	Baum–Welch (EM)

Forward vs Viterbi (the only difference is sum vs max)

$$\text{Forward: } \alpha_t(j)=\Big[\sum_i \alpha_{t-1}(i)\,A_{ij}\Big]\,b_j(o_t)$$

$$\text{Viterbi: } \ \delta_t(j)=\Big[\max_i \delta_{t-1}(i)\,A_{ij}\Big]\,b_j(o_t)\ \ (+\text{ backpointers})$$

Both start from the same initialization: $\alpha_1(j)=\delta_1(j)=\pi_j\,b_j(o_1)$.

Forward sums over all paths to get the total probability, $P(O\mid\lambda)=\sum_j\alpha_T(j)$; Viterbi maximizes to recover the single best path (then traces backpointers). Both are $O(T\cdot S^2)$ dynamic programs for $T$ observations and $S$ hidden states: linear in sequence length, which is why HMMs are practical.
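The sum-versus-max symmetry is easiest to see side by side. A minimal sketch, assuming states and observations are encoded as integer indices and `B[j][o]` holds $b_j(o)$; the toy parameters at the bottom are hypothetical and unrelated to the worked example in this note.

```python
def forward(obs, A, B, pi):
    """Return P(O | lambda) by summing over all hidden paths.

    obs: list of observation indices; A[i][j]: transition i -> j;
    B[j][o]: probability that state j emits o; pi[j]: initial distribution.
    """
    n = len(pi)
    alpha = [pi[j] * B[j][obs[0]] for j in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)


def viterbi(obs, A, B, pi):
    """Return (best_path, joint probability of that path and obs)."""
    n = len(pi)
    delta = [pi[j] * B[j][obs[0]] for j in range(n)]
    backptrs = []
    for o in obs[1:]:
        new_delta, ptr = [], []
        for j in range(n):
            # Same recursion as forward, with max in place of sum.
            best_i = max(range(n), key=lambda i: delta[i] * A[i][j])
            ptr.append(best_i)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][o])
        delta = new_delta
        backptrs.append(ptr)
    # Trace the backpointers from the best final state.
    state = max(range(n), key=lambda j: delta[j])
    best_prob = delta[state]
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    return path[::-1], best_prob


# Hypothetical toy HMM: 2 hidden states, 2 observation symbols.
A = [[0.8, 0.2], [0.3, 0.7]]
B = [[0.5, 0.5], [0.1, 0.9]]
pi = [0.5, 0.5]
obs = [0, 1]

likelihood = forward(obs, A, B, pi)
best_path, best_prob = viterbi(obs, A, B, pi)
```

The best path's probability can never exceed the total likelihood, which makes a cheap sanity check. For long sequences the raw products underflow, so practical versions usually work in log space or rescale $\alpha_t$ at each step.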

Worked: one forward step

Hidden $\{R,S\}$, $\boldsymbol\pi=(0.6,0.4)$, emission of "walk": $b_R(\text{walk})=0.1,\ b_S(\text{walk})=0.6$.

$$\alpha_1(R)=0.6\cdot0.1=0.06,\qquad \alpha_1(S)=0.4\cdot0.6=0.24.$$

So $P(o_1=\text{walk})=0.06+0.24=0.30$. Extend with the recursion for $\alpha_2$, etc.
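The same arithmetic in code, as a sketch. The first half uses only the numbers given above; `forward_step` is a hypothetical helper for the next step, and it needs a transition matrix `A` and emission probabilities for the next observation, neither of which this note specifies, so they are left as arguments.

```python
# Values from the worked example: states R and S, first observation "walk".
pi = {"R": 0.6, "S": 0.4}
b_walk = {"R": 0.1, "S": 0.6}

# Initialization: alpha_1(j) = pi_j * b_j(o_1).
alpha_1 = {s: pi[s] * b_walk[s] for s in pi}
likelihood_1 = sum(alpha_1.values())   # P(o_1 = walk)


def forward_step(alpha_prev, A, b_next):
    """One application of the forward recursion.

    A[i][j] is the transition probability i -> j, and b_next[j] is the
    probability that state j emits the next observation. Both must be
    supplied by the caller; the worked example does not define them.
    """
    return {j: sum(alpha_prev[i] * A[i][j] for i in alpha_prev) * b_next[j]
            for j in alpha_prev}
```

Calling `forward_step(alpha_1, A, b_next)` with your own `A` and `b_next` yields $\alpha_2$; summing its values gives the likelihood of the two-observation sequence.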

Why this matters here

The transition matrix is the answer key for synthetic discovery experiments (Planting signals).
Sessionization, next-page models, and journey segmentation are Markov/HMM-shaped (Sessionization, Pattern & structure discovery).
Shifting $A$ at a known timestamp = a planted concept drift to measure detection delay (Anomaly & change detection).
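A planted drift can be sketched as a chain that switches matrices at a known step. Everything here is an assumption for illustration: the shifted matrix `A_after`, the change point, the seed, and the function names. The sketch only plants the change and confirms it is visible in the empirical transition rates; measuring detection delay is left to whatever detector is under test.

```python
import random


def simulate_with_drift(A_before, A_after, change_at, n_steps, start=0, seed=0):
    """Sample a path using A_before for steps t < change_at, A_after afterwards."""
    rng = random.Random(seed)
    states = list(range(len(A_before)))
    path = [start]
    for t in range(n_steps):
        A = A_before if t < change_at else A_after
        path.append(rng.choices(states, weights=A[path[-1]])[0])
    return path


def transition_rate(path, src, dst):
    """Fraction of departures from src that go to dst within the path.

    Assumes src is visited at least once before the final position.
    """
    pairs = list(zip(path, path[1:]))
    from_src = [p for p in pairs if p[0] == src]
    return sum(1 for p in from_src if p[1] == dst) / len(from_src)


A_before = [[0.9, 0.1], [0.5, 0.5]]
A_after = [[0.6, 0.4], [0.5, 0.5]]   # hypothetical shifted matrix
change_at = 20_000                   # known timestamp of the planted drift

path = simulate_with_drift(A_before, A_after, change_at, n_steps=40_000)
# Transitions 0..change_at-1 were drawn from A_before, the rest from A_after.
rate_before = transition_rate(path[:change_at + 1], 0, 1)   # expected near 0.1
rate_after = transition_rate(path[change_at:], 0, 1)        # expected near 0.4
```

Because `change_at` is known exactly, a detector's delay is simply the step at which it fires minus `change_at`.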

By-hand exercise (meets the bar)

Find the stationary distribution of $P=\begin{bmatrix}0.7&0.3\\0.4&0.6\end{bmatrix}$. (Answer: $\boldsymbol\pi=(4/7,3/7)$.)
Add a second emission and compute $\alpha_2$ for the HMM above for the sequence (walk, shop).