🧭 Research Vault — HOME (Map of Content)

Purpose: A living knowledge base to take me from software engineer to bona fide researcher in event streams, web analytics, and clickstream prediction within one year, in a for-profit context.

How to use this: This is the top-level index (a "Map of Content"). Each [[bracketed link]] is a note to create in Obsidian. Don't fill it all in up front — grow it as you go. The structure matters more than the completeness. Plain Markdown, version-controlled with git.

The one-sentence thesis

An engineer asks "does it work?" A researcher asks "is it true, and why?" This entire vault exists to retrain that reflex.

Vault architecture

A folder-per-domain layout. Numeric prefixes keep them ordered. One idea per note; link liberally; let the graph emerge.

/Research-Vault
├── 00_Home/                  ← this file + sub-MOCs
├── 10_Foundations/           ← the math & fundamentals to actually learn
├── 20_Domain/                ← clickstream / event-stream subject knowledge
├── 30_Methods/               ← the 3 research areas + synthetic data craft
├── 40_Experiments/           ← lab notebook: one note per experiment
├── 50_Literature/            ← Zotero-linked paper notes
├── 60_Craft/                 ← how to be a researcher (temperament, traps)
├── 70_Tracks/                ← research tracks / project portfolio
├── 80_Templates/             ← note templates (experiment, paper, track)
└── 90_Inbox/                 ← fleeting notes, capture-first, sort later

Top-level maps to build next: 10_Foundations-MOC · 20_Domain-MOC · 30_Methods-MOC · 60_Craft-MOC · 70_Tracks-MOC

10 · Foundations — go deep on basics, not 2026 fads

The fundamentals haven't changed in decades; the frameworks will. Learn the thing under the thing. For each, the bar is "can I derive/visualize it from memory and compute a tiny example by hand?" — not "have I read about it."

Cross-entropy and KL divergence — compute by hand for a 3-outcome distribution
SVD and low-rank structure — visualize it; relate to PCA, embeddings, matrix factorization
Probability for sequences — joint/conditional/marginal, Bayes, MLE vs MAP
Markov chains and HMMs — transition matrices, stationary distribution, the forward and Viterbi algorithms
Point processes — Poisson (homogeneous + inhomogeneous), Hawkes self-excitation
Heavy-tailed distributions — log-normal, Pareto, Zipf; why real clickstreams have fat tails
Information theory basics — entropy, mutual information, perplexity
Evaluation theory — precision/recall/AUC/log-loss, calibration, proper scoring rules
Causal inference primer — confounding, potential outcomes, why prediction ≠ causation
Optimization & gradients — SGD, the ideas behind policy gradients (not the framework du jour)
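
As a by-hand check for the cross-entropy and KL item above, here is a minimal sketch for a 3-outcome distribution. The distributions `p` and `q` and the page names in the comments are invented for illustration; the point is only to confirm the identity H(p, q) = H(p) + KL(p || q) against a pencil-and-paper calculation.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross-entropy H(p, q) in nats: expected log-loss of q when data follow p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) in nats; zero when q equals p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Invented example: true next-page distribution vs. a model's prediction.
p = [0.5, 0.25, 0.25]   # e.g. home, search, checkout
q = [0.25, 0.25, 0.5]

h_p = entropy(p)            # 1.5  * ln 2
h_pq = cross_entropy(p, q)  # 1.75 * ln 2
kl = kl_divergence(p, q)    # 0.25 * ln 2

print(f"H(p)={h_p:.4f}  H(p,q)={h_pq:.4f}  KL(p||q)={kl:.4f}")
```

By hand: H(p) = 0.5·ln 2 + 2·(0.25·ln 4) = 1.5·ln 2, and the model pays an extra 0.25·ln 2 nats per event for putting its mass on the wrong outcome.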

Anti-fad rule: be wary of any topic that's been hot for < 6 months. Learn the 40-year-old idea underneath it first.

20 · Domain — event streams, clickstream, web analytics

The subject-matter substrate. This is where my engineering background is an asset.

What is a clickstream — events, sessions, users, identity stitching
Sessionization — windowing strategies, timeout choices, the boundary problem
Event schema and drift — schema evolution, new event types, reprocessing
Funnels and journeys — conversion paths, drop-off, multi-touch
Attribution models — last-touch, position-based, data-driven; their failure modes
Concept drift in production — seasonality, campaigns, bot waves; detection & retraining
Online vs offline gap — why offline lift so often dies in the A/B test ⚠️ (a core trap)
Bots, fraud, and invalid traffic — the anomaly side of the domain
Privacy & consent constraints — GDPR, cookieless, what data I can actually use
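
The sessionization item above (timeout choices, the boundary problem) can be sketched in a few lines. This assumes a hypothetical event schema of `(user_id, timestamp_seconds, page)` tuples and a plain inactivity timeout; real pipelines often add rules such as midnight or campaign-change cuts, which are not shown here.

```python
from collections import defaultdict

def sessionize(events, timeout_s=1800):
    """Assign a per-user session index using an inactivity timeout.

    events: iterable of (user_id, timestamp_seconds, page) tuples (assumed schema).
    Returns (user_id, session_index, timestamp_seconds, page) rows,
    sorted by user and then by time.
    """
    by_user = defaultdict(list)
    for user, ts, page in events:
        by_user[user].append((ts, page))

    rows = []
    for user in sorted(by_user):
        session = 0
        prev_ts = None
        for ts, page in sorted(by_user[user]):
            # A gap strictly longer than the timeout starts a new session.
            if prev_ts is not None and ts - prev_ts > timeout_s:
                session += 1
            rows.append((user, session, ts, page))
            prev_ts = ts
    return rows

events = [
    ("u1", 0, "home"),
    ("u1", 600, "search"),
    ("u1", 2500, "product"),  # 1900 s after the previous event
    ("u2", 100, "home"),
]

rows_30 = sessionize(events, timeout_s=1800)  # 30-minute timeout
rows_60 = sessionize(events, timeout_s=3600)  # 60-minute timeout
```

With the 30-minute timeout the third `u1` event opens a second session; with 60 minutes it does not. The same raw events yield different session counts, which is the boundary problem in miniature.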

Query & compute substrate

DuckDB for local event analytics — Parquet, sessionize, funnels on a laptop
ClickHouse — when local stops scaling (built for this exact workload)
Polars over pandas — event tables get big
Batch vs streaming decision — Kafka/Flink only if in-flight latency genuinely matters

30 · Methods — the 3 research areas + how to build ground truth

The core of the discipline. Each area maps to a different known pattern you plant in synthetic data — which is how you measure your algorithm honestly.

The three areas (plus a causal fourth)

Pattern & structure discovery (unsupervised) — sequential pattern mining, motif discovery, journey clustering, sessionization
Prediction (supervised) — next-event, conversion/churn/propensity, time-to-event / survival
Anomaly & change detection — bots, fraud, concept drift, change points
Attribution / uplift (the 4th, causal) — changing outcomes, not just predicting them

Synthetic data: plant ground truth, measure recovery

The reason to build synthetic data is that you know the answer key. Generate realistic background, inject a known signal, score how well the algorithm recovers it.
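
A minimal sketch of that loop, under loud assumptions: the background is uniform random page views (not yet a realistic model), the planted signal is a hypothetical three-page motif, and the "algorithm" is a trivial exact matcher standing in for whatever method is actually under test.

```python
import random

PAGES = ["home", "search", "product", "cart", "checkout", "help"]
MOTIF = ("help", "cart", "help")  # hypothetical planted pattern

def make_sessions(n, length, plant_rate, rng):
    """Return (sessions, labels); labels[i] is True where the motif was planted."""
    sessions, labels = [], []
    for _ in range(n):
        session = [rng.choice(PAGES) for _ in range(length)]
        planted = rng.random() < plant_rate
        if planted:
            start = rng.randrange(length - len(MOTIF) + 1)
            session[start:start + len(MOTIF)] = MOTIF
        sessions.append(session)
        labels.append(planted)
    return sessions, labels

def contains(session, motif):
    """True if motif occurs as a contiguous run in session."""
    m = len(motif)
    return any(tuple(session[i:i + m]) == tuple(motif)
               for i in range(len(session) - m + 1))

def precision_recall(predicted, actual):
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

rng = random.Random(0)  # fixed seed so the run is reproducible
sessions, labels = make_sessions(n=2000, length=12, plant_rate=0.05, rng=rng)
predicted = [contains(s, MOTIF) for s in sessions]
precision, recall = precision_recall(predicted, labels)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Recall is 1.0 by construction, because an exact matcher finds every planted copy. Precision is the interesting number: a three-page motif over six uniform pages appears by chance in roughly 4 to 5 percent of 12-event background sessions, about as often as it is planted, so many "hits" are not the planted signal. That is a property of the answer key, and worth knowing before blaming the algorithm.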

Generating background traffic
- Markov / HMM page-transition models (the matrix is the ground truth)
- Point processes for timing (Hawkes for burstiness)
- Heavy-tailed dwell times + Zipf page popularity
Planting signals
- Discovery: inject a known subsequence or cohort → score precision/recall, ARI/NMI
- Prediction: one feature with controllable effect size + decoy noise features → test that it finds the signal, ignores the decoys, and stays calibrated
- Anomaly/drift: inject labeled bots, or shift the transition matrix at a known timestamp → measure detection delay
Difficulty knobs — signal-to-noise, pattern rarity, class imbalance (1–3% conversions!), corruption (dropped/duplicated/jittered events). Sweep them, plot where the method breaks.
Synthetic-data toolkit — numpy/scipy, tick (Hawkes), SimPy (agent simulation)
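
For the anomaly/drift item above, a sketch of "shift the transition matrix at a known timestamp, then measure detection delay". Everything specific here is an assumption: a two-state chain, invented transition probabilities, and a deliberately naive windowed-share detector with hand-picked `window` and `threshold`. It is a harness for measuring delay, not a recommended detector.

```python
import random

# Invented two-state chain: state 0 = "browse", state 1 = "cart".
# Row s holds [P(next=0 | s), P(next=1 | s)].
P_BEFORE = [[0.9, 0.1], [0.5, 0.5]]
P_AFTER = [[0.5, 0.5], [0.5, 0.5]]
DRIFT_AT = 2500  # ground truth: the step at which the matrix changes

def simulate(n_steps, drift_at, p_before, p_after, rng):
    """Two-state Markov chain whose transition matrix switches at drift_at."""
    state, path = 0, []
    for t in range(n_steps):
        matrix = p_before if t < drift_at else p_after
        state = 0 if rng.random() < matrix[state][0] else 1
        path.append(state)
    return path

def detect_shift(path, baseline_n=1000, window=300, threshold=0.2):
    """Index of the first event at which the windowed share of state 1
    differs from the baseline share by more than threshold, else None."""
    baseline = sum(path[:baseline_n]) / baseline_n
    for end in range(baseline_n + window, len(path) + 1):
        share = sum(path[end - window:end]) / window
        if abs(share - baseline) > threshold:
            return end - 1
    return None

rng = random.Random(42)
path = simulate(4000, DRIFT_AT, P_BEFORE, P_AFTER, rng)
detected_at = detect_shift(path)
delay = None if detected_at is None else detected_at - DRIFT_AT
print(f"detected_at={detected_at} delay={delay}")
```

A negative delay would mean a false alarm before the true change. Sweeping `window` and `threshold` trades false alarms against delay, which is one of the difficulty-knob plots worth making.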

First project that exercises all three: a Markov+Hawkes generator with one planted motif, one predictive feature, one mid-stream drift — then see which algorithm family catches its target. → see 70_Tracks-MOC
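
The timing half of that generator could start from something like the following: a pure-Python sketch of a univariate Hawkes process with an exponential kernel, simulated by Ogata-style thinning. It does not use the `tick` API; the parameter names `mu`, `alpha`, `beta` and their values are assumptions chosen for illustration, with intensity λ(t) = mu + Σ alpha·exp(−beta·(t − tᵢ)) over past events tᵢ.

```python
import math
import random

def simulate_hawkes(mu, alpha, beta, horizon, rng):
    """Event times on [0, horizon] for a self-exciting process.

    mu:    baseline rate (events per unit time)
    alpha: jump in intensity after each event
    beta:  exponential decay rate of that jump
    Requires alpha / beta < 1 so the process does not explode.
    """
    if alpha / beta >= 1:
        raise ValueError("branching ratio alpha/beta must be below 1")
    events = []
    t = 0.0
    excitation = 0.0  # current value of the self-exciting term
    while True:
        # Intensity only decays between events, so its current value
        # is a valid upper bound until the next accepted event.
        bound = mu + excitation
        wait = rng.expovariate(bound)
        t += wait
        if t > horizon:
            return events
        excitation *= math.exp(-beta * wait)  # decay to the candidate time
        # Accept the candidate with probability intensity / bound.
        if rng.random() * bound <= mu + excitation:
            events.append(t)
            excitation += alpha

rng = random.Random(1)
events = simulate_hawkes(mu=0.5, alpha=0.8, beta=1.0, horizon=200.0, rng=rng)
print(len(events), "events; long-run rate is mu / (1 - alpha/beta) = 2.5 per unit time")
```

Setting `alpha=0` reduces this to a homogeneous Poisson process, which makes a useful control when checking that a method actually responds to burstiness.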

40 · Experiments — the lab notebook

One note per experiment, created before you run it (hypothesis first). This is the single highest-leverage habit for becoming a researcher. Template: 80_Experiment-Template.

Every note links to its run ID (MLflow / W&B URL) and data version (DVC hash)
Log many metrics; if any look wrong, you cannot move on until you know why
Negative results get the same write-up as positive ones — often they teach more
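
One way to enforce "hypothesis first" is to make the note generator refuse to run without one. This is a sketch only: the field names and layout are hypothetical and are not the contents of 80_Experiment-Template; the run URL and data version stand in for the MLflow / W&B link and DVC hash mentioned above.

```python
from datetime import date

def experiment_note(title, hypothesis, metrics,
                    run_url="TBD", data_version="TBD", on=None):
    """Build the text of an experiment note before the run happens.

    All field names are illustrative assumptions, not a fixed template.
    """
    if not hypothesis.strip():
        raise ValueError("write the hypothesis before running the experiment")
    on = on or date.today()
    lines = [
        f"Experiment: {title}",
        f"Date: {on.isoformat()}",
        f"Hypothesis: {hypothesis.strip()}",
        "Metrics to log: " + ", ".join(metrics),
        f"Run: {run_url}",
        f"Data version: {data_version}",
        "Result: (fill in after the run, positive or negative)",
    ]
    return "\n".join(lines)

note = experiment_note(
    title="Session window length vs. next-event accuracy",
    hypothesis="Windows past 30 min do not improve next-event accuracy.",
    metrics=["accuracy", "log-loss", "calibration error"],
    on=date(2025, 1, 15),
)
print(note)
```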

Index: Experiment-Log-MOC (running list, newest first)

50 · Literature — paper notes, Zotero-linked

How I read a paper — three-pass method; what to extract
One note per paper, linked to its Zotero entry + BibTeX key. Template: 80_Literature-Template
Caveat to "don't read too much": as a newcomer to this domain, read more than the gurus advise — there's decades of prior art (sessionization, sequential recommendation, survival models). Don't reinvent solved wheels. Reach for novelty only once you've mapped what's known.
Reading lists: Clickstream foundational papers · Sequence modeling papers · Anomaly & drift papers · Uplift modeling papers

60 · Craft — temperament > talent

The meta-skill. The part no one teaches directly. This is the section most likely to make or break the transition.

Mindset shifts (engineer → researcher)

It runs ≠ it's true — a green pipeline means the code executed, not that the result is valid ⚠️
Healthy paranoia — distrust your good results most; too-good-to-be-true almost always is (usually a bug or leakage)
Experimental equanimity — went well? great. went poorly? also great. Both are information.
Beginner's mind — my engineering priors are sometimes the trap; hold ideas loosely
Do things other than research — walks, distance from the keyboard; aha moments don't come at the desk

Workflow discipline

Fast iteration beats parallelism — short cold-starts, tiny fast evals; context-switching is the enemy
Beware coding-agent dragons — agents silently shorten prompts/sequence lengths/configs; small engineering errors become grave scientific errors. Understand the system that produced your result.
Don't over-engineer the research infra — my senior-engineer reflex; build the minimum to iterate + reproduce, not a cathedral ⚠️
Gruntwork is the job — hundreds of hours of unglamorous labeling/filtering behind every good result

⚠️ The trap checklists (run these before trusting any number)

Leakage checklist — temporal leakage, target leakage, split by user & time (never by row), no session-boundary peeking
Offline→online checklist — does the lift survive an A/B test? is the offline metric a faithful proxy for business value?
Predict vs cause checklist — does the business want to predict or change the outcome? propensity vs uplift model. (Goodhart: a propensity model just finds people who'd convert anyway.)
For-profit tension checklist — am I doing real research or incremental engineering relabeled? am I going so open-ended I'll ship nothing?
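
For the leakage checklist's "split by user & time (never by row)", a minimal sketch. It assumes events are tuples whose first two fields are `(user_id, timestamp)`, assigns users to the test side by a stable hash, and keeps only pre-cutoff events for training users and post-cutoff events for test users. The function names and the 20% default are illustrative choices.

```python
import hashlib

def is_test_user(user_id, test_fraction=0.2):
    """Stable user assignment from a hash of the id (not Python's salted hash())."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return digest[0] / 256 < test_fraction

def split_by_user_and_time(events, cutoff_ts, test_fraction=0.2):
    """Train on earlier events of train users; test on later events of test users.

    events: iterable of tuples starting with (user_id, timestamp, ...).
    Events that satisfy neither condition are dropped on purpose.
    """
    train, test = [], []
    for event in events:
        user, ts = event[0], event[1]
        if is_test_user(user, test_fraction):
            if ts >= cutoff_ts:
                test.append(event)
        elif ts < cutoff_ts:
            train.append(event)
    return train, test

# Tiny invented log: 50 users, 10 events each at timestamps 0..9.
events = [(f"u{u}", ts, "page") for u in range(50) for ts in range(10)]
train, test = split_by_user_and_time(events, cutoff_ts=7)

train_users = {e[0] for e in train}
test_users = {e[0] for e in test}
print(len(train), "train events;", len(test), "test events;",
      "shared users:", len(train_users & test_users))
```

This throws data away (training users' late events, test users' early events). That is the price of a split in which no user and no future timestamp is shared across the boundary; whether the strict version is needed depends on how the model will be deployed.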

70 · Tracks — the research portfolio

Run a portfolio, not one bet: a couple of short-horizon wins that de-risk delivery + one deeper line. Convert every negative result into a business decision ("windows past 30 min don't help → stop investing there"). Template: 80_Track-Template.

Track — Synthetic generator + 3-algorithm bake-off (the starter project)
Track — (short bet, TBD)
Track — (deep bet, TBD)
Portfolio-MOC — status board (Backlog / Active / Parked / Shipped)

80 · Templates

80_Experiment-Template · 80_Literature-Template · 80_Track-Template

The 12-month transition roadmap

Phased, deliberately front-loading the reflexes over the results.

Q1 — Foundations & the lab habit (months 1–3)

Set up the vault, Zotero, MLflow/DVC, DuckDB. Keep it minimal. (Don't over-engineer the research infra)
Work through 10_Foundations-MOC: cross-entropy, SVD, Markov/HMM, point processes — by hand.
Deliverable: the Markov+Hawkes synthetic generator with one planted motif. First experiment note written before the run.
Transition marker: I can plant a known pattern and measure recovery with honest precision/recall.

Q2 — First real track + the skepticism reflex (months 4–6)

Pick one short business-relevant track (e.g. a churn or next-event model on real event logs).
Run the Leakage checklist and Offline→online checklist on everything. Break your own results on purpose.
Deliverable: one shippable-or-killed result, with a written negative-or-positive conclusion either way.
Transition marker: I caught at least one leakage/measurement bug that would have fooled past-me.

Q3 — Depth + the causal turn (months 7–9)

Go deep on one method area; add the Attribution / uplift dimension.
Add a longer deep-bet track to the portfolio. Read the harder literature now (not just when stuck).
Deliverable: an uplift vs propensity comparison on a real decision; a result that isn't just a benchmark number.
Transition marker: I framed a problem the business didn't know it had (predict vs cause).
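
A toy illustration of why the uplift vs propensity comparison in the Q3 deliverable matters. The four segments and their conversion rates are invented numbers, and "propensity" here is simply the conversion probability with no intervention; a real comparison would estimate both from randomized treatment data.

```python
# name: (P(convert | not treated), P(convert | treated)); invented numbers
SEGMENTS = {
    "sure_things": (0.90, 0.90),
    "persuadables": (0.10, 0.40),
    "lost_causes": (0.02, 0.02),
    "sleeping_dogs": (0.30, 0.20),
}

def propensity(segment):
    """Chance of converting with no intervention at all."""
    return SEGMENTS[segment][0]

def uplift(segment):
    """Change in conversion probability caused by the treatment."""
    control, treated = SEGMENTS[segment]
    return treated - control

by_propensity = sorted(SEGMENTS, key=propensity, reverse=True)
by_uplift = sorted(SEGMENTS, key=uplift, reverse=True)

for name, ranking in [("propensity", by_propensity), ("uplift", by_uplift)]:
    top = ranking[0]
    extra = 1000 * uplift(top)
    print(f"target top segment by {name}: {top}, "
          f"about {extra:.0f} extra conversions per 1000 treated")
```

Ranking by propensity spends the budget on users who convert anyway (zero incremental conversions in this toy), while ranking by uplift finds the segment the treatment actually moves and puts the segment it harms last.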

Q4 — Communicate & compound (months 10–12)

Write up two tracks as internal research notes/talks. Build the reproducibility story (data versions → result).
Reflect: which results held up? which were drift/bugs? Update the 60_Craft-MOC from lived experience.
Deliverable: a portfolio of documented tracks + a written "what I learned about being wrong."
Transition marker: colleagues come to me for how to test whether something is real, not just how to build it.

How I'll know it worked (transition scorecard)

Track quarterly. The goal isn't output volume — it's the change in reflex.

| Signal | Engineer-me | Researcher-me |
| --- | --- | --- |
| First reaction to a great result | ship it | what's the bug? |
| Train/test split | by row | by user & time |
| Definition of "done" | tests pass | I understand why the number is what it is |
| A negative result | failure | information (often more than a positive) |
| The business goal | predict the outcome | know whether I should predict or cause it |
| My infra | maximal, elegant | minimal, fast to iterate, reproducible |

Maintenance notes

Capture first, sort later. Anything goes into 90_Inbox/ instantly; triage weekly.
Link, don't file. A note's value is its connections. Prefer [[links]] over deep folders.
One quote per source, paraphrase the rest in literature notes — build the muscle of saying it in your own words.
Commit the vault to git. Your research history is an audit trail.

Created as the seed of a living knowledge base. Delete nothing; grow everything.