Clickstream foundational papers
Tight starter set — decades-old prior art so you don't reinvent solved wheels (Sessionization especially). OA legend: ✅ open · 🟡 free author/preprint copy · 🔒 paywalled. Read with How I read a paper; spin up a full note per paper from 80_Literature-Template.
🎯 Start here:
srivastava2000webusagemining: it defines the field's shared vocabulary, which every other paper here assumes. Path: srivastava (P1) → spiliopoulou (P2) → bucklin2003 (P5) → montgomery (P3) → bucklin2009 (P4) → mobasher (P6)
⬜ P1 · Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data
Srivastava, Cooley, Deshpande & Tan (2000) · SIGKDD Explorations 1(2):12–23 · survey · srivastava2000webusagemining · 🟡 partial
🔗 https://dl.acm.org/doi/10.1145/846183.846188
Why: The canonical entry point — defines the three-phase WUM pipeline (preprocessing → pattern discovery → analysis) and the problem taxonomy.
⬜ P2 · A Framework for the Evaluation of Session Reconstruction Heuristics in Web-Usage Analysis
Spiliopoulou, Mobasher, Berendt & Nakagawa (2003) · INFORMS J. Computing 15(2):171–190 · spiliopoulou2003sessionreconstruction · ✅ yes
🔗 http://facweb.cs.depaul.edu/mobasher/research/papers/SMBN03.pdf
Why: The definitive treatment of sessionization, with a taxonomy and evaluation of time-out, maximal-forward-reference, and navigation heuristics. Sessionization is a preprocessing step in nearly every clickstream pipeline, so these heuristics carry over directly.
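A minimal sketch of the simplest time-out idea, to make the heuristic concrete before reading. It is an inactivity-gap variant, not the paper's exact definitions: the function name `sessionize`, the `(user_id, timestamp, url)` tuple layout, and the 30-minute threshold are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def sessionize(events, timeout=timedelta(minutes=30)):
    """Group (user_id, timestamp, url) events into per-user sessions.

    A new session starts whenever the gap since the same user's previous
    request exceeds `timeout` (an assumed threshold, not one from the paper).
    """
    sessions = {}  # user_id -> list of sessions; each session is a list of (timestamp, url)
    # Sort by user, then by time, so each user's requests are visited in order.
    for user_id, ts, url in sorted(events, key=lambda e: (e[0], e[1])):
        user_sessions = sessions.setdefault(user_id, [])
        if user_sessions and ts - user_sessions[-1][-1][0] <= timeout:
            user_sessions[-1].append((ts, url))  # gap is small: same session
        else:
            user_sessions.append([(ts, url)])  # first request or long gap: new session
    return sessions


# Hypothetical log: u1 returns after a 50-minute gap, which opens a second session.
events = [
    ("u1", datetime(2024, 1, 1, 10, 0), "/home"),
    ("u1", datetime(2024, 1, 1, 10, 10), "/search"),
    ("u1", datetime(2024, 1, 1, 11, 0), "/home"),
    ("u2", datetime(2024, 1, 1, 10, 5), "/home"),
]
for user, user_sessions in sessionize(events).items():
    print(user, [len(s) for s in user_sessions])
```

The threshold is the whole design decision here: a shorter `timeout` splits long reading pauses into separate sessions, while a longer one merges distinct visits, which is the kind of trade-off the paper's evaluation framework is meant to quantify.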
⬜ P3 · Modeling Online Browsing and Path Analysis Using Clickstream Data
Montgomery, Li, Srinivasan & Liechty (2004) · Marketing Science 23(4):579–595 · montgomery2004pathanalysis · ✅ yes (CMU author page)
🔗 https://www.andrew.cmu.edu/user/alm3/papers/purchase%20conversion.pdf
Why: The core Markov path-prediction paper. It shows that a first-order Markov model is insufficient because browsing has memory. This is the "solved wheel" that your synthetic generator's transition matrix reinvents on purpose (Markov chains and HMMs).
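For reference, the first-order baseline that the paper argues against can be estimated by simple counting. This is a sketch of that baseline only, not the paper's model; the function name `transition_matrix`, the page labels, and the toy sessions are hypothetical.

```python
from collections import Counter, defaultdict


def transition_matrix(sessions):
    """Estimate first-order probabilities P(next page | current page) by counting."""
    counts = defaultdict(Counter)
    for pages in sessions:
        # Pair each page with its immediate successor within the same session.
        for current, nxt in zip(pages, pages[1:]):
            counts[current][nxt] += 1
    # Normalize each row of counts so its probabilities sum to 1.
    return {
        current: {nxt: n / sum(nexts.values()) for nxt, n in nexts.items()}
        for current, nexts in counts.items()
    }


# Toy sessions (hypothetical page categories), each an ordered list of page views.
sessions = [
    ["home", "search", "product"],
    ["home", "product", "cart"],
    ["home", "search", "search", "product"],
]
matrix = transition_matrix(sessions)
for current, row in matrix.items():
    print(current, row)
```

Each row conditions only on the current page, so two visitors who reach `product` by different routes get the same predicted next step. That missing history is the limitation the note refers to as "memory matters".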
⬜ P4 · Click Here for Internet Insight: Advances in Clickstream Data Analysis in Marketing
Bucklin & Sismeiro (2009) · J. Interactive Marketing 23(1):35–48 · review · bucklin2009clickstreaminsight · 🟡 SSRN
🔗 https://ssrn.com/abstract=1118315
Why: Authoritative literature review of clickstream research — a structured map of prior art, and what clickstream data can vs. cannot support.
⬜ P5 · A Model of Web Site Browsing Behavior Estimated on Clickstream Data
Bucklin & Sismeiro (2003) · J. Marketing Research 40(3):249–267 · bucklin2003browsingmodel · 🔒 no
🔗 https://doi.org/10.1509/jmkr.40.3.249.19241
Why: The foundational empirical browsing model (page-duration + navigation choice jointly). Read alongside P3 to see measurement model vs. prediction model.
⬜ P6 · Automatic Personalization Based on Web Usage Mining
Mobasher, Cooley & Srivastava (2000) · Communications of the ACM 43(8):142–151 · mobasher2000personalization · 🟡 author copy
🔗 https://doi.org/10.1145/345124.345169 · author PDFs: http://facweb.cs.depaul.edu/mobasher/pubs.html
Why: Turns mined usage patterns into a real recommendation engine — the WUM taxonomy made into a deployable architecture.