Research Vault: HOME (Map of Content)
Purpose: A living knowledge base to take me from software engineer to bona fide researcher in event streams, web analytics, and clickstream prediction, within one year, in a for-profit context.
How to use this: This is the top-level index (a "Map of Content"). Each `[[bracketed link]]` is a note to create in Obsidian. Don't fill it all in up front; grow it as you go. The structure matters more than the completeness. Plain Markdown, version-controlled with git.
The one-sentence thesis
An engineer asks "does it work?" A researcher asks "is it true, and why?" This entire vault exists to retrain that reflex.
Vault architecture
A folder-per-domain layout. Numeric prefixes keep them ordered. One idea per note; link liberally; let the graph emerge.
/Research-Vault
├── 00_Home/ ← this file + sub-MOCs
├── 10_Foundations/ ← the math & fundamentals to actually learn
├── 20_Domain/ ← clickstream / event-stream subject knowledge
├── 30_Methods/ ← the 3 research areas + synthetic data craft
├── 40_Experiments/ ← lab notebook: one note per experiment
├── 50_Literature/ ← Zotero-linked paper notes
├── 60_Craft/ ← how to be a researcher (temperament, traps)
├── 70_Tracks/ ← research tracks / project portfolio
├── 80_Templates/ ← note templates (experiment, paper, track)
└── 90_Inbox/ ← fleeting notes, capture-first, sort later
Top-level maps to build next: 10_Foundations-MOC · 20_Domain-MOC · 30_Methods-MOC · 60_Craft-MOC · 70_Tracks-MOC
10 · Foundations: go deep on basics, not 2026 fads
The fundamentals haven't changed in decades; the frameworks will. Learn the thing under the thing. For each, the bar is "can I derive/visualize it from memory and compute a tiny example by hand?", not "have I read about it."
- Cross-entropy and KL divergence: compute by hand for a 3-outcome distribution
- SVD and low-rank structure: visualize it; relate to PCA, embeddings, matrix factorization
- Probability for sequences: joint/conditional/marginal, Bayes, MLE vs MAP
- Markov chains and HMMs: transition matrices, stationary distribution, the forward and Viterbi algorithms
- Point processes: Poisson (homogeneous + inhomogeneous), Hawkes self-excitation
- Heavy-tailed distributions: log-normal, Pareto, Zipf; why real clickstreams have fat tails
- Information theory basics: entropy, mutual information, perplexity
- Evaluation theory: precision/recall/AUC/log-loss, calibration, proper scoring rules
- Causal inference primer: confounding, potential outcomes, why prediction ≠ causation
- Optimization & gradients: SGD, the ideas behind policy gradients (not the framework du jour)
Anti-fad rule: be wary of any topic that's been hot for < 6 months. Learn the 40-year-old idea underneath it first.
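To make the by-hand bar concrete, here is a minimal sketch of the first item: cross-entropy and KL divergence for a 3-outcome distribution. The distributions `p` and `q` and the function names are illustrative assumptions, chosen so the arithmetic is easy to check on paper (natural log, so results are in nats).

```python
import math

def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross-entropy H(p, q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) in nats; equals H(p, q) - H(p)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative 3-outcome distributions (assumptions, not real data).
p = [0.5, 0.25, 0.25]   # "true" distribution
q = [0.25, 0.5, 0.25]   # model distribution

h_p = entropy(p)
h_pq = cross_entropy(p, q)
kl = kl_divergence(p, q)
print(f"H(p)={h_p:.4f}  H(p,q)={h_pq:.4f}  KL(p||q)={kl:.4f}")
```

With these numbers, H(p) = 1.5 ln 2 and H(p, q) = 1.75 ln 2, so KL(p || q) = 0.25 ln 2 ≈ 0.173 nats. The identity KL = H(p, q) - H(p) is the part worth being able to derive from memory.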
20 · Domain: event streams, clickstream, web analytics
The subject-matter substrate. This is where my engineering background is an asset.
- What is a clickstream: events, sessions, users, identity stitching
- Sessionization: windowing strategies, timeout choices, the boundary problem
- Event schema and drift: schema evolution, new event types, reprocessing
- Funnels and journeys: conversion paths, drop-off, multi-touch
- Attribution models: last-touch, position-based, data-driven; their failure modes
- Concept drift in production: seasonality, campaigns, bot waves; detection & retraining
- Online vs offline gap: why offline lift so often dies in the A/B test ⚠️ (a core trap)
- Bots, fraud, and invalid traffic: the anomaly side of the domain
- Privacy & consent constraints: GDPR, cookieless, what data I can actually use
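The sessionization item above (timeout choices, the boundary problem) can be sketched in a few lines. This is a minimal inactivity-timeout version, not the only windowing strategy; the `(user_id, unix_ts)` tuple layout, the function name, and the 30-minute default are assumptions for illustration.

```python
from collections import defaultdict

def sessionize(events, timeout_s=1800):
    """Assign session ids using an inactivity timeout.

    events: iterable of (user_id, unix_ts) tuples, in any order.
    Returns a list of (user_id, unix_ts, session_id), sorted by user then time.
    A new session starts when the gap to the previous event exceeds timeout_s.
    """
    by_user = defaultdict(list)
    for user, ts in events:
        by_user[user].append(ts)

    out = []
    for user in sorted(by_user):
        session_idx = 0
        prev_ts = None
        for ts in sorted(by_user[user]):
            if prev_ts is not None and ts - prev_ts > timeout_s:
                session_idx += 1
            out.append((user, ts, f"{user}-{session_idx}"))
            prev_ts = ts
    return out

# Toy events: u1 has gaps of 600 s, 1800 s, and 1900 s.
events = [("u1", 0), ("u1", 600), ("u1", 2400), ("u1", 4300), ("u2", 100)]
sessions = sessionize(events)
for row in sessions:
    print(row)
```

The 1800-second gap sits exactly on the boundary and stays in the same session here because the comparison is strict; rerunning with `timeout_s=1799` splits it. That sensitivity is the boundary problem in miniature.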
Query & compute substrate
- DuckDB for local event analytics: Parquet, sessionize, funnels on a laptop
- ClickHouse: when local stops scaling (built for this exact workload)
- Polars over pandas: event tables get big
- Batch vs streaming decision: Kafka/Flink only if in-flight latency genuinely matters
30 · Methods: the 3 research areas + how to build ground truth
The core of the discipline. Each area maps to a different known pattern you plant in synthetic data, which is how you measure your algorithm honestly.
The three areas
- Pattern & structure discovery (unsupervised): sequential pattern mining, motif discovery, journey clustering, sessionization
- Prediction (supervised): next-event, conversion/churn/propensity, time-to-event / survival
- Anomaly & change detection: bots, fraud, concept drift, change points
- Attribution / uplift (the 4th, causal): changing outcomes, not just predicting them
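The difference between predicting an outcome and changing it is easiest to see with numbers. The sketch below uses hypothetical counts from an imagined randomized experiment (segment names and figures are made up for illustration): propensity is the conversion rate under treatment, uplift is treated rate minus control rate.

```python
# Hypothetical (conversions, users) per arm; all numbers are made up.
segments = {
    "loyal":       {"treated": (450, 500), "control": (450, 500)},
    "persuadable": {"treated": (150, 500), "control": (50, 500)},
}

def rate(pair):
    """Conversion rate from a (conversions, users) pair."""
    conversions, users = pair
    return conversions / users

# Propensity: how likely the segment is to convert when treated.
propensity = {name: rate(arms["treated"]) for name, arms in segments.items()}
# Uplift: how much the treatment changes the conversion rate.
uplift = {name: rate(arms["treated"]) - rate(arms["control"])
          for name, arms in segments.items()}

best_by_propensity = max(propensity, key=propensity.get)
best_by_uplift = max(uplift, key=uplift.get)
print("target by propensity:", best_by_propensity)
print("target by uplift:    ", best_by_uplift)
```

Ranking by propensity targets the segment that converts anyway (0.9 treated, 0.9 control, zero uplift); ranking by uplift targets the segment the treatment actually moves (0.3 vs 0.1).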
Synthetic data: plant ground truth, measure recovery
The reason to build synthetic data is that you know the answer key. Generate realistic background, inject a known signal, score how well the algorithm recovers it.
- Generating background traffic
  - Markov / HMM page-transition models (the matrix is the ground truth)
  - Point processes for timing (Hawkes for burstiness)
  - Heavy-tailed dwell times + Zipf page popularity
- Planting signals
  - Discovery: inject a known subsequence or cohort; score precision/recall, ARI/NMI
  - Prediction: one feature with controllable effect size + decoy noise features; test that the model finds the signal, ignores the decoys, and stays calibrated
  - Anomaly/drift: inject labeled bots, or shift the transition matrix at a known timestamp; measure detection delay
- Difficulty knobs: signal-to-noise, pattern rarity, class imbalance (1–3% conversions!), corruption (dropped/duplicated/jittered events). Sweep them, plot where the method breaks.
- Synthetic-data toolkit: numpy/scipy, `tick` (Hawkes), SimPy (agent simulation)
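A minimal sketch of the plant-and-recover loop for the discovery case, using only the standard library. Everything specific here is an assumption for illustration: the page names, the transition probabilities, the motif, the plant rate, and the exact-match "detector", which merely stands in for whatever mining algorithm is under test.

```python
import random

PAGES = ["home", "search", "product", "cart", "checkout"]
# Row-stochastic transition matrix: the ground truth for background traffic.
# Values are illustrative assumptions, not estimates from real data.
TRANSITIONS = {
    "home":     [0.10, 0.50, 0.30, 0.05, 0.05],
    "search":   [0.10, 0.30, 0.50, 0.05, 0.05],
    "product":  [0.15, 0.30, 0.30, 0.20, 0.05],
    "cart":     [0.10, 0.10, 0.30, 0.20, 0.30],
    "checkout": [0.60, 0.10, 0.10, 0.10, 0.10],
}
MOTIF = ["cart", "search", "cart", "checkout"]  # the planted signal

def background_session(rng, length):
    """Sample one session from the Markov chain, starting at 'home'."""
    page = "home"
    session = [page]
    for _ in range(length - 1):
        page = rng.choices(PAGES, weights=TRANSITIONS[page])[0]
        session.append(page)
    return session

def contains(session, motif):
    """True if motif occurs as a contiguous subsequence of session."""
    m = len(motif)
    return any(session[i:i + m] == motif for i in range(len(session) - m + 1))

def make_dataset(n_sessions=2000, plant_rate=0.05, length=12, seed=0):
    """Return (session, planted_label) pairs; the labels are the answer key."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_sessions):
        session = background_session(rng, length)
        planted = rng.random() < plant_rate
        if planted:
            start = rng.randrange(1, length - len(MOTIF) + 1)
            session[start:start + len(MOTIF)] = MOTIF  # overwrite in place
        data.append((session, planted))
    return data

def precision_recall(data, detector):
    """Score a session-level detector against the planted labels."""
    tp = sum(1 for s, y in data if detector(s) and y)
    fp = sum(1 for s, y in data if detector(s) and not y)
    fn = sum(1 for s, y in data if not detector(s) and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

data = make_dataset()
precision, recall = precision_recall(data, lambda s: contains(s, MOTIF))
print(f"precision={precision:.3f} recall={recall:.3f}")
```

The exact-match detector recovers every planted session by construction; any false positives come from the background chain producing the motif by chance, which is the signal-to-noise knob. Lowering `plant_rate` or picking a motif made of likelier transitions shows where precision starts to break.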
First project that exercises all three: a Markov+Hawkes generator with one planted motif, one predictive feature, one mid-stream drift; then see which algorithm family catches its target. → see 70_Tracks-MOC
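For the mid-stream drift piece, the measurement is detection delay: plant the change at a known step, then see how long the detector takes to notice. The sketch below is one possible setup under stated assumptions: a two-state chain, made-up transition matrices, and a one-sided CUSUM as the detector. CUSUM is my stand-in here, not a method this vault prescribes, and the `slack` and `threshold` values are untuned guesses.

```python
import random

P_BEFORE = [[0.9, 0.1], [0.6, 0.4]]   # illustrative matrix before the drift
P_AFTER = [[0.5, 0.5], [0.3, 0.7]]    # illustrative matrix after the drift
CHANGE_AT = 1000                      # known drift step (the answer key)

def simulate(n_steps, change_at, seed=0):
    """Two-state Markov chain whose transition matrix switches at change_at."""
    rng = random.Random(seed)
    state, states = 0, []
    for t in range(n_steps):
        matrix = P_BEFORE if t < change_at else P_AFTER
        # matrix[state][1] is the probability of moving to state 1.
        state = 1 if rng.random() < matrix[state][1] else 0
        states.append(state)
    return states

def cusum_detect(states, slack=0.4, threshold=15.0):
    """One-sided CUSUM on the state-1 indicator.

    Returns the index of the first alarm, or None if no alarm is raised.
    """
    s = 0.0
    for t, x in enumerate(states):
        s = max(0.0, s + x - slack)
        if s > threshold:
            return t
    return None

states = simulate(n_steps=2000, change_at=CHANGE_AT)
alarm = cusum_detect(states)
delay = None if alarm is None else alarm - CHANGE_AT
print("alarm at step:", alarm, "detection delay:", delay)
```

A negative delay would mean a false alarm before the planted change, which is as informative as a late detection. Sweeping `threshold` trades false alarms against delay, and shrinking the gap between the two matrices is the difficulty knob.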
40 · Experiments: the lab notebook
One note per experiment, created before you run it (hypothesis first). This is the single highest-leverage habit for becoming a researcher. Template: 80_Experiment-Template.
- Every note links to its run ID (MLflow / W&B URL) and data version (DVC hash)
- Log many metrics; if any look wrong, you cannot move on until you know why
- Negative results get the same write-up as positive ones; often they teach more
Index: Experiment-Log-MOC (running list, newest first)
50 · Literature: paper notes, Zotero-linked
- How I read a paper: three-pass method; what to extract
- One note per paper, linked to its Zotero entry + BibTeX key. Template: 80_Literature-Template
- Caveat to "don't read too much": as a newcomer to this domain, read more than the gurus advise; there are decades of prior art (sessionization, sequential recommendation, survival models). Don't reinvent solved wheels. Reach for novelty only once you've mapped what's known.
- Reading lists: Clickstream foundational papers · Sequence modeling papers · Anomaly & drift papers · Uplift modeling papers
60 · Craft: temperament > talent
The meta-skill. The part no one teaches directly. This is the section most likely to make or break the transition.
Mindset shifts (engineer → researcher)
- It runs ≠ it's true: a green pipeline means the code executed, not that the result is valid ⚠️
- Healthy paranoia: distrust your good results most; too-good-to-be-true almost always is (usually a bug or leakage)
- Experimental equanimity: went well? great. went poorly? also great. Both are information.
- Beginner's mind: my engineering priors are sometimes the trap; hold ideas loosely
- Do things other than research: walks, distance from the keyboard; aha moments don't come at the desk
Workflow discipline
- Fast iteration beats parallelism: short cold-starts, tiny fast evals; context-switching is the enemy
- Beware coding-agent dragons: agents silently shorten prompts/seq-lengths/configs; small engineering errors are grave science errors. Understand the system that produced your result.
- Don't over-engineer the research infra: my senior-engineer reflex; build the minimum to iterate + reproduce, not a cathedral ⚠️
- Gruntwork is the job: hundreds of hours of unglamorous labeling/filtering behind every good result
⚠️ The traps checklists (run these before trusting any number)
- Leakage checklist: temporal leakage, target leakage, split by user & time (never by row), no session-boundary peeking
- Offline→online checklist: does the lift survive an A/B test? is the offline metric a faithful proxy for business value?
- Predict vs cause checklist: does the business want to predict or change the outcome? propensity vs uplift model. (Goodhart: a propensity model just finds people who'd convert anyway.)
- For-profit tension checklist: am I doing real research or incremental engineering relabeled? am I going so open-ended I'll ship nothing?
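The "split by user & time (never by row)" rule from the leakage checklist can be sketched as follows. The event layout (dicts with `user_id` and `ts`), the hash-bucket assignment, and the function names are assumptions for illustration; the point is that no user and no time period appears on both sides.

```python
import hashlib

def user_bucket(user_id, n_buckets=10):
    """Deterministic bucket for a user id (stable across runs and machines)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def split_by_user_and_time(events, cutoff_ts, test_buckets=frozenset({0, 1})):
    """Train: pre-cutoff events from train users.
    Test: post-cutoff events from held-out users.
    Events from the other two quadrants are dropped on purpose, so that
    neither a user nor a time period leaks across the split.
    """
    train, test = [], []
    for e in events:
        is_test_user = user_bucket(e["user_id"]) in test_buckets
        if not is_test_user and e["ts"] < cutoff_ts:
            train.append(e)
        elif is_test_user and e["ts"] >= cutoff_ts:
            test.append(e)
    return train, test

# Toy data: 200 users, one event per user at each of 10 time steps.
events = [{"user_id": f"u{i}", "ts": t} for i in range(200) for t in range(10)]
train, test = split_by_user_and_time(events, cutoff_ts=5)

train_users = {e["user_id"] for e in train}
test_users = {e["user_id"] for e in test}
print(len(train), "train events,", len(test), "test events")
print("shared users:", len(train_users & test_users))
```

Dropping two of the four quadrants costs data, which is the price of a clean boundary. A row-level random split would keep every event but put the same user's past and future on both sides.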
70 · Tracks: the research portfolio
Run a portfolio, not one bet: a couple of short-horizon wins that de-risk delivery + one deeper line. Convert every negative result into a business decision ("windows past 30 min don't help → stop investing there"). Template: 80_Track-Template.
- Track: Synthetic generator + 3-algorithm bake-off (the starter project)
- Track: (short bet, TBD)
- Track: (deep bet, TBD)
- Portfolio-MOC: status board (Backlog / Active / Parked / Shipped)
80 · Templates
The 12-month transition roadmap
Phased, deliberately front-loading the reflexes over the results.
Q1: Foundations & the lab habit (months 1–3)
- Set up the vault, Zotero, MLflow/DVC, DuckDB. Keep it minimal. (Don't over-engineer the research infra)
- Work through 10_Foundations-MOC: cross-entropy, SVD, Markov/HMM, point processes, all by hand.
- Deliverable: the Markov+Hawkes synthetic generator with one planted motif. First experiment note written before the run.
- Transition marker: I can plant a known pattern and measure recovery with honest precision/recall.
Q2: First real track + the skepticism reflex (months 4–6)
- Pick one short business-relevant track (e.g. a churn or next-event model on real event logs).
- Run the Leakage checklist and Offline→online checklist on everything. Break your own results on purpose.
- Deliverable: one shippable-or-killed result, with a written negative-or-positive conclusion either way.
- Transition marker: I caught at least one leakage/measurement bug that would have fooled past-me.
Q3: Depth + the causal turn (months 7–9)
- Go deep on one method area; add the Attribution / uplift dimension.
- Add a longer deep-bet track to the portfolio. Read the harder literature now (not just when stuck).
- Deliverable: an uplift vs propensity comparison on a real decision; a result that isn't just a benchmark number.
- Transition marker: I framed a problem the business didn't know it had (predict vs cause).
Q4: Communicate & compound (months 10–12)
- Write up two tracks as internal research notes/talks. Build the reproducibility story (data versions → result).
- Reflect: which results held up? which were drift/bugs? Update the 60_Craft-MOC from lived experience.
- Deliverable: a portfolio of documented tracks + a written "what I learned about being wrong."
- Transition marker: colleagues come to me for how to test whether something is real, not just how to build it.
How I'll know it worked (transition scorecard)
Track quarterly. The goal isn't output volume; it's the change in reflex.
| Signal | Engineer-me | Researcher-me |
|---|---|---|
| First reaction to a great result | ship it | what's the bug? |
| Train/test split | by row | by user & time |
| Definition of "done" | tests pass | I understand why the number is what it is |
| A negative result | failure | information (often more than a positive) |
| The business goal | predict the outcome | know whether I should predict or cause it |
| My infra | maximal, elegant | minimal, fast to iterate, reproducible |
Maintenance notes
- Capture first, sort later. Anything goes into `90_Inbox/` instantly; triage weekly.
- Link, don't file. A note's value is its connections. Prefer `[[links]]` over deep folders.
- One quote per source, paraphrase the rest in literature notes: build the muscle of saying it in your own words.
- Commit the vault to git. Your research history is an audit trail.
Created as the seed of a living knowledge base. Delete nothing; grow everything.