30 · Methods — MOC
seed#moc#methods
Up: plan
The core of the discipline: the three research areas (plus a causal fourth) and how to build ground truth. Each area maps to a different kind of known pattern you plant in synthetic data, which is how you measure an algorithm honestly.
The three (+1) areas
- Pattern & structure discovery (unsupervised) — sequential pattern mining, motif discovery, journey clustering, sessionization
- Prediction (supervised) — next-event, conversion/churn/propensity, time-to-event / survival
- Anomaly & change detection — bots, fraud, concept drift, change points
- Attribution / uplift (the 4th, causal) — changing outcomes, not just predicting them
Synthetic data: plant ground truth, measure recovery
The reason to build synthetic data is that you know the answer key. Generate realistic background, inject a known signal, score how well the algorithm recovers it.
- Generating background traffic — Markov/HMM transitions (the matrix is the truth), point processes for timing (Hawkes for burstiness), heavy-tailed dwell times + Zipf popularity
- Planting signals — inject a known subsequence/cohort (discovery), a controllable-effect feature + decoys (prediction), or a labeled bot / mid-stream matrix shift (anomaly/drift)
- Difficulty knobs — signal-to-noise, pattern rarity, class imbalance (1–3% conversions!), corruption. Sweep them; plot where the method breaks.
- Synthetic-data toolkit — numpy/scipy, tick (Hawkes), SimPy (agent simulation)
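The plant-and-recover loop above can be sketched in a few lines of numpy. This is a minimal illustration, not a reference implementation: the state names, the motif, the session length, and the 5% plant rate are all made-up values, and `contains` is an oracle check standing in for whatever discovery algorithm is under test.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical event vocabulary; any small set of states would do.
STATES = ["home", "search", "product", "cart", "checkout"]
K = len(STATES)

# Ground-truth transition matrix: each row is a Dirichlet draw, so rows sum to 1.
P_true = rng.dirichlet(np.ones(K), size=K)

def sample_session(P, length, rng):
    """Sample one session of state indices from a first-order Markov chain."""
    seq = [int(rng.integers(P.shape[0]))]
    for _ in range(length - 1):
        seq.append(int(rng.choice(P.shape[0], p=P[seq[-1]])))
    return seq

MOTIF = [1, 2, 1, 2, 3]  # planted signal: search, product, search, product, cart
PLANT_RATE = 0.05        # pattern-rarity knob (assumed value)

sessions, planted = [], []
for _ in range(2000):
    seq = sample_session(P_true, 20, rng)
    has_motif = bool(rng.random() < PLANT_RATE)
    if has_motif:
        # Overwrite a random window of the background with the motif.
        start = int(rng.integers(0, len(seq) - len(MOTIF) + 1))
        seq[start:start + len(MOTIF)] = MOTIF
    sessions.append(seq)
    planted.append(has_motif)
planted = np.array(planted)

def contains(seq, motif):
    """Oracle detector: True if motif occurs as a contiguous subsequence."""
    m = len(motif)
    return any(seq[i:i + m] == motif for i in range(len(seq) - m + 1))

# Score recovery against the answer key.
detected = np.array([contains(s, MOTIF) for s in sessions])
tp = int(np.sum(detected & planted))
recall = tp / max(int(planted.sum()), 1)
precision = tp / max(int(detected.sum()), 1)

# "The matrix is the truth": re-estimate it from unplanted sessions and compare.
counts = np.zeros((K, K))
for seq, is_planted in zip(sessions, planted):
    if not is_planted:
        for a, b in zip(seq[:-1], seq[1:]):
            counts[a, b] += 1
P_hat = counts / counts.sum(axis=1, keepdims=True)
max_err = float(np.abs(P_hat - P_true).max())

print(f"recall={recall:.2f} precision={precision:.2f} max |P_hat - P_true|={max_err:.3f}")
```

Precision can fall below 1 here because the background chain may emit the motif by chance, which is exactly the signal-to-noise effect the difficulty knobs are meant to expose. Sweeping `PLANT_RATE` or the session length and re-running is one way to find where a real miner starts to fail.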
First project that exercises all three: a Markov+Hawkes generator with one planted motif, one predictive feature, one mid-stream drift — then see which algorithm family catches its target. → Track — Synthetic generator + 3-algorithm bake-off
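For the timing half of such a generator, a Hawkes process with an exponential kernel can be simulated with Ogata's thinning algorithm. The sketch below uses plain numpy rather than tick, and the function name `simulate_hawkes` and the parameter values are assumptions chosen for illustration. The intensity is taken to be mu + alpha * sum(exp(-beta * (t - t_i))), so alpha / beta is the branching ratio and must stay below 1 for the process to be stable.

```python
import numpy as np

def simulate_hawkes(mu, alpha, beta, horizon, rng):
    """Ogata thinning for a univariate Hawkes process with exponential kernel.

    Intensity: lambda(t) = mu + alpha * sum_i exp(-beta * (t - t_i)).
    """
    if alpha >= beta:
        raise ValueError("need alpha / beta < 1 for a stable process")
    events = []
    t = 0.0
    excite = 0.0  # self-exciting part of the intensity at time t
    while True:
        # The intensity only decays between events, so its current value
        # is a valid upper bound until the next accepted event.
        lam_bar = mu + excite
        wait = rng.exponential(1.0 / lam_bar)
        t_new = t + wait
        if t_new >= horizon:
            break
        excite_new = excite * np.exp(-beta * wait)
        # Accept the candidate with probability lambda(t_new) / lam_bar.
        if rng.random() * lam_bar <= mu + excite_new:
            events.append(t_new)
            excite_new += alpha  # each event bumps the intensity
        t, excite = t_new, excite_new
    return np.array(events)

rng = np.random.default_rng(1)
mu, alpha, beta, horizon = 0.5, 0.8, 1.0, 2000.0  # assumed demo values
times = simulate_hawkes(mu, alpha, beta, horizon, rng)

# Burstiness check: Poisson arrivals have inter-event CV of 1; Hawkes exceeds it.
gaps = np.diff(times)
cv = float(gaps.std() / gaps.mean())
expected_rate = mu / (1.0 - alpha / beta)  # stationary mean rate
print(f"events={len(times)} empirical rate={len(times) / horizon:.2f} "
      f"expected rate={expected_rate:.2f} inter-event CV={cv:.2f}")
```

These timestamps can be attached to Markov-sampled events to give bursty sessions. A mid-stream drift would then amount to switching the transition matrix at a known timestamp, which becomes the change point a detector has to recover.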
Foundations these lean on
Markov chains and HMMs · Point processes · Heavy-tailed distributions · Evaluation theory · Causal inference primer