Offline compute stack

seed #domain #tooling #setup

Everything you need to do clickstream research on a laptop, offline. Philosophy: the minimum to iterate fast and reproduce — not a cathedral (Don't over-engineer the research infra). Install once while online; then it all runs locally.

⚠️ Document-only note. Nothing here is auto-installed. These are the commands to run yourself; pin versions when you do, and record what you actually installed in Inbox or here.

The stack at a glance

| Need | Tool | Why this one |
| --- | --- | --- |
| Local OLAP / SQL on files | DuckDB | Query Parquet/CSV with SQL, zero server, sessionize + funnels on a laptop |
| Fast dataframes | Polars | Multi-core, lazy, Arrow-native; event tables outgrow pandas |
| Numerics / stats | numpy + scipy | The substrate for every generator and metric |
| Hawkes / point processes | tick | Simulate + fit self-exciting processes (Point processes) |
| Agent simulation | SimPy | Discrete-event sim for synthetic user journeys |
| Experiment tracking | MLflow | Local file store; params/metrics/artifacts per run |
| Data/version control | DVC | Version big data + pipelines by hash, keep git text-only |
| Plotting | matplotlib | Log-log CCDFs, reliability diagrams, drift plots |
| Notebooks | JupyterLab | Fast iteration, throwaway exploration |
| Scale-up (optional) | ClickHouse | When local DuckDB stops scaling (ClickHouse) |
| Reference manager | Zotero | Paper PDFs + BibTeX, links from 50_Literature-MOC |
| The vault | Obsidian | Graph + backlinks over this folder (plain Markdown, optional) |

One-shot Python environment

Use a single project venv. With uv (fast, recommended) or stock pip:

```
# create + activate an isolated env (Python 3.11+)
uv venv .venv && source .venv/bin/activate     # or: python -m venv .venv && source .venv/bin/activate

# core research stack
uv pip install \
  duckdb polars pyarrow \
  numpy scipy pandas matplotlib \
  scikit-learn \
  simpy \
  mlflow dvc \
  jupyterlab

# Hawkes / point processes (tick can be finicky; install separately so a failure doesn't block the rest)
uv pip install tick
```

tick sometimes lacks a wheel for the newest Python — if it won't install, pin an older Python in the venv, or substitute a hand-rolled Hawkes simulator (the math is in Point processes). Don't let one stubborn dependency stall the whole setup.
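If tick will not install, a hand-rolled simulator can be quite short. The sketch below uses Ogata's thinning algorithm with an exponential kernel, which is one common choice and not necessarily the formulation used in Point processes. The function name `simulate_hawkes` and the parametrisation (baseline rate `mu`, branching ratio `alpha`, decay rate `beta`) are assumptions made for illustration. It only simulates; it does not fit parameters the way tick does.

```python
import math
import random


def simulate_hawkes(mu, alpha, beta, horizon, seed=None):
    """Simulate a univariate Hawkes process on [0, horizon) by thinning.

    Assumed intensity (illustrative parametrisation):
        lambda(t) = mu + sum over past events t_i of
                    alpha * beta * exp(-beta * (t - t_i))
    so alpha is the expected number of direct offspring per event.
    """
    if not (mu > 0 and beta > 0 and 0 <= alpha < 1):
        raise ValueError("need mu > 0, beta > 0 and 0 <= alpha < 1")

    rng = random.Random(seed)
    events = []
    t = 0.0
    excitation = 0.0  # self-exciting part of the intensity, evaluated at time t

    while True:
        # Between events the intensity only decays, so its current value
        # is a valid upper bound until the next accepted event.
        lam_bar = mu + excitation
        wait = rng.expovariate(lam_bar)
        if t + wait >= horizon:
            break

        # Move to the candidate time and decay the excitation accordingly.
        t += wait
        excitation *= math.exp(-beta * wait)

        # Accept the candidate with probability lambda(t) / lam_bar.
        if rng.random() * lam_bar <= mu + excitation:
            events.append(t)
            excitation += alpha * beta  # each event jumps the intensity

    return events
```

For example, `simulate_hawkes(mu=0.5, alpha=0.6, beta=2.0, horizon=200.0, seed=1)` returns an increasing list of event times. Over long horizons the average event rate tends toward `mu / (1 - alpha)`, which is a quick sanity check on any replacement simulator.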

Freeze what worked so it's reproducible offline later:

```
uv pip freeze > requirements.lock     # commit this text file; it's tiny and it's the audit trail
```
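A lock file only helps if the environment still matches it months later. As a minimal sketch, assuming the lock contains plain `name==version` lines (other line types are skipped), the helpers below compare those pins against what is installed. The names `parse_lock` and `find_drift` are hypothetical, not part of uv or pip.

```python
from importlib import metadata


def parse_lock(text):
    """Return {name: version} for 'name==version' lines.

    Comments, blank lines, editable installs and anything without '=='
    are skipped; environment markers after ';' are dropped.
    """
    pins = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].split(";", 1)[0].strip()
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def find_drift(pins):
    """Return {name: (pinned, installed)} for packages that differ.

    installed is None when the package is missing from the environment.
    """
    drift = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != pinned:
            drift[name] = (pinned, installed)
    return drift
```

Calling `find_drift(parse_lock(open("requirements.lock").read()))` inside the activated venv returns an empty dict when nothing has drifted.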

DuckDB — the local event warehouse

No server, no daemon. Point it at Parquet and write SQL. Typical sessionization sketch:

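```
-- gap-based sessions: new session when >30 min since this user's previous event
-- session_id is a per-user running count starting at 0,
-- so (user_id, session_id) is the session key
SELECT *,
  SUM(new_session) OVER (PARTITION BY user_id ORDER BY ts) AS session_id
FROM (
  SELECT *,
    -- a user's first event has no previous ts (LAG is NULL), so it falls to ELSE 0
    CASE WHEN ts - LAG(ts) OVER (PARTITION BY user_id ORDER BY ts)
              > INTERVAL 30 MINUTE THEN 1 ELSE 0 END AS new_session
  FROM read_parquet('events/*.parquet')
);
```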

See DuckDB for local event analytics and Sessionization for the boundary subtleties.
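To check the boundary behaviour without a Parquet file, the same gap rule can be restated in a few lines of plain Python. This is a cross-check sketch, not a replacement for the SQL. The function name `sessionize` and the `(user_id, ts)` tuple layout are assumptions for illustration.

```python
from datetime import datetime, timedelta
from itertools import groupby
from operator import itemgetter


def sessionize(events, gap=timedelta(minutes=30)):
    """Assign per-user session ids to (user_id, ts) pairs.

    Mirrors the SQL sketch: a new session starts only when the gap to the
    user's previous event is strictly greater than `gap`, and each user's
    first session is numbered 0.
    """
    out = []
    ordered = sorted(events, key=itemgetter(0, 1))  # by user, then time
    for user_id, rows in groupby(ordered, key=itemgetter(0)):
        session_id = 0
        prev_ts = None
        for _, ts in rows:
            if prev_ts is not None and ts - prev_ts > gap:
                session_id += 1
            out.append((user_id, ts, session_id))
            prev_ts = ts
    return out


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 9, 0)
    demo = [
        ("u1", t0),
        ("u1", t0 + timedelta(minutes=30)),  # exactly 30 min: same session
        ("u1", t0 + timedelta(minutes=61)),  # 31 min after previous: new session
        ("u2", t0),
    ]
    for row in sessionize(demo):
        print(row)
```

The demo makes one subtlety visible: an event exactly 30 minutes after the previous one stays in the same session, because the comparison is strict.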

Experiment tracking & data versioning (keep it minimal)

MLflow local file store — no server needed:
```
mlflow ui --backend-store-uri ./mlruns      # browse runs at localhost:5000
```
Every experiment note links its run ID; that's the bridge from a written hypothesis to its numbers.
DVC versions the big stuff git ignores (see .gitignore):
```
dvc init                                    # run inside the git repo
dvc add data/events.parquet                 # tracks a hash; the .dvc pointer is committed, the data isn't
git add data/events.parquet.dvc data/.gitignore
```
Experiment notes record the data version (the DVC hash) so a result is reproducible.
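Copying the hash into a note by hand is easy to get wrong, so it can be read from the pointer file instead. The sketch below assumes the `.dvc` pointer keeps an `md5:` entry under `outs`, as DVC writes it at the time of writing; check this against a pointer from your own DVC version. The names `dvc_hash` and `provenance_line`, and the one-line note format, are hypothetical.

```python
import re

# Matches a line such as "- md5: <32 hex chars>"; directory outputs end in ".dir".
_MD5_LINE = re.compile(
    r"^\s*-?\s*md5:\s*([0-9a-f]{32}(?:\.dir)?)\s*$", re.MULTILINE
)


def dvc_hash(pointer_text):
    """Return the first md5 recorded in the text of a .dvc pointer file."""
    match = _MD5_LINE.search(pointer_text)
    if match is None:
        raise ValueError("no md5 entry found in pointer text")
    return match.group(1)


def provenance_line(run_id, data_path, pointer_text):
    """Format one line tying an MLflow run ID to the data version it used."""
    return f"run {run_id} | {data_path} @ md5 {dvc_hash(pointer_text)}"
```

With the pointer from the example above, `provenance_line(run_id, "data/events.parquet", open("data/events.parquet.dvc").read())` gives a line that can be pasted into the experiment note next to the run link.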

What's intentionally not here

No Kafka/Flink/Spark. Streaming infra earns its keep only if in-flight latency genuinely matters (Batch vs streaming decision) — for research, batch on DuckDB/Polars is faster to iterate.
No cloud anything by default. Offline-first. ClickHouse only when a single laptop truly can't hold the workload.
No heavyweight orchestration. A handful of scripts + MLflow + DVC is the whole reproducibility story.