30 · Methods — MOC

seed#moc#methods

Up: plan

The core of the discipline: the three research areas (plus a fourth, causal one) and how to build ground truth. Each area maps to a different known pattern that you plant in synthetic data, which is what lets you measure an algorithm honestly.

The three (+1) areas

Pattern & structure discovery (unsupervised) — sequential pattern mining, motif discovery, journey clustering, sessionization
Prediction (supervised) — next-event, conversion/churn/propensity, time-to-event / survival
Anomaly & change detection — bots, fraud, concept drift, change points
Attribution / uplift (the 4th, causal) — changing outcomes, not just predicting them

Synthetic data: plant ground truth, measure recovery

The reason to build synthetic data is that you know the answer key. Generate realistic background, inject a known signal, score how well the algorithm recovers it.

Generating background traffic — Markov/HMM transitions (the matrix is the truth), point processes for timing (Hawkes for burstiness), heavy-tailed dwell times + Zipf popularity
Planting signals — inject a known subsequence/cohort (discovery), a controllable-effect feature + decoys (prediction), or a labeled bot / mid-stream matrix shift (anomaly/drift)
Difficulty knobs — signal-to-noise, pattern rarity, class imbalance (1–3% conversions!), corruption. Sweep them; plot where the method breaks.
Synthetic-data toolkit — numpy/scipy, tick (Hawkes), SimPy (agent simulation)
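A minimal sketch of the plant-and-recover loop for the discovery case, using numpy only. Everything specific here is an assumption made for illustration: the five-state event vocabulary, the transition matrix `P`, the planted `MOTIF`, and especially the scoring rule (support gain over a motif-free control sample), which is a deliberately naive stand-in for a real sequential pattern miner. The plant rate plays the role of the pattern-rarity knob.

```python
import numpy as np
from collections import Counter

# Hypothetical event vocabulary and planted pattern (illustrative only).
STATES = ["home", "search", "product", "cart", "checkout"]
MOTIF = (1, 2, 1, 3)  # search -> product -> search -> cart

def sample_sessions(rng, P, n_sessions, length):
    """Sample sessions from a first-order Markov chain; P is the ground truth."""
    cdf = np.cumsum(P, axis=1)
    seqs = np.empty((n_sessions, length), dtype=int)
    seqs[:, 0] = rng.integers(P.shape[0], size=n_sessions)
    for t in range(1, length):
        u = rng.random(n_sessions)
        nxt = (u[:, None] > cdf[seqs[:, t - 1]]).sum(axis=1)
        seqs[:, t] = np.minimum(nxt, P.shape[0] - 1)  # guard against float round-off
    return seqs

def plant_motif(rng, seqs, motif, plant_rate):
    """Overwrite a random window with the motif in a fraction of sessions."""
    seqs = seqs.copy()
    planted = rng.random(len(seqs)) < plant_rate
    starts = rng.integers(0, seqs.shape[1] - len(motif) + 1, size=len(seqs))
    for i in np.flatnonzero(planted):
        seqs[i, starts[i]:starts[i] + len(motif)] = motif
    return seqs, planted

def ngram_support(seqs, n):
    """Fraction of sessions that contain each n-gram at least once."""
    counts = Counter()
    for row in seqs.tolist():
        counts.update({tuple(row[i:i + n]) for i in range(len(row) - n + 1)})
    return {gram: c / len(seqs) for gram, c in counts.items()}

def motif_rank(planted_seqs, control_seqs, motif):
    """Rank of the motif when n-grams are sorted by support gain over control."""
    sup_p = ngram_support(planted_seqs, len(motif))
    sup_c = ngram_support(control_seqs, len(motif))
    gain = {g: sup_p.get(g, 0.0) - sup_c.get(g, 0.0) for g in set(sup_p) | set(sup_c)}
    ranked = sorted(gain, key=gain.get, reverse=True)
    return ranked.index(tuple(motif)) + 1 if tuple(motif) in gain else None

rng = np.random.default_rng(0)
# Assumed background: half uniform noise, half a "next step in the funnel" cycle.
P = 0.5 * np.full((5, 5), 0.2) + 0.5 * np.roll(np.eye(5), 1, axis=1)
control = sample_sessions(rng, P, 2000, 20)

ranks = {}
for plant_rate in (0.2, 0.05, 0.01):  # sweep the rarity knob
    background = sample_sessions(rng, P, 2000, 20)
    with_signal, _ = plant_motif(rng, background, MOTIF, plant_rate)
    ranks[plant_rate] = motif_rank(with_signal, control, MOTIF)
    print(f"plant_rate={plant_rate:.2f}  motif rank={ranks[plant_rate]}")
```

A rank of 1 means the planted motif was the top-scoring 4-gram. Plotting rank against plant rate (and against session count or noise level) is one way to see where a method stops recovering the signal.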

First project that exercises all three: a Markov+Hawkes generator with one planted motif, one predictive feature, one mid-stream drift — then see which algorithm family catches its target. → Track — Synthetic generator + 3-algorithm bake-off
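For the timing half of such a generator, a bursty event stream can be sketched with a univariate Hawkes process simulated by Ogata's thinning method. This is a numpy-only illustration, not the `tick` API; the exponential kernel and the parameter values (`mu`, `alpha`, `beta`, `horizon`) are assumptions chosen so that the branching ratio `alpha / beta` is 0.5.

```python
import numpy as np

def simulate_hawkes(rng, mu, alpha, beta, horizon):
    """Ogata thinning for intensity mu + sum_i alpha * exp(-beta * (t - t_i))."""
    if alpha >= beta:
        raise ValueError("need alpha < beta for a stationary process")
    events = []
    t, excitation = 0.0, 0.0  # excitation = self-exciting part of the intensity at time t
    while True:
        lam_bar = mu + excitation           # upper bound: intensity only decays until the next event
        wait = rng.exponential(1.0 / lam_bar)
        t += wait
        if t > horizon:
            break
        excitation *= np.exp(-beta * wait)  # decay the excitation over the waiting time
        if rng.random() * lam_bar <= mu + excitation:
            events.append(t)                # accept the candidate as a real event
            excitation += alpha             # each event raises the intensity by alpha
    return np.array(events)

rng = np.random.default_rng(1)
mu, alpha, beta, horizon = 0.5, 1.0, 2.0, 1000.0  # assumed values for illustration
times = simulate_hawkes(rng, mu, alpha, beta, horizon)

gaps = np.diff(times)
print("empirical rate:", len(times) / horizon)
print("stationary rate mu / (1 - alpha/beta):", mu / (1 - alpha / beta))
print("inter-event CV (1.0 for a Poisson process):", gaps.std() / gaps.mean())
```

The known parameters are the ground truth for timing. These event times can be paired with states drawn from the Markov chain to produce a full session stream.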

Foundations these lean on

Markov chains and HMMs · Point processes · Heavy-tailed distributions · Evaluation theory · Causal inference primer
