Offline compute stack
Up: 20_Domain-MOC
Everything you need to do clickstream research on a laptop, offline. Philosophy: the minimum to iterate fast and reproduce — not a cathedral (Don't over-engineer the research infra). Install once while online; then it all runs locally.
⚠️ Document-only note. Nothing here is auto-installed. These are the commands to run yourself; pin versions when you do, and record what you actually installed in Inbox or here.
The stack at a glance
| Need | Tool | Why this one |
|---|---|---|
| Local OLAP / SQL on files | DuckDB | Query Parquet/CSV with SQL, zero server, sessionize + funnels on a laptop |
| Fast dataframes | Polars | Multi-core, lazy, Arrow-native; event tables outgrow pandas |
| Numerics / stats | numpy + scipy | The substrate for every generator and metric |
| Hawkes / point processes | tick | Simulate + fit self-exciting processes (Point processes) |
| Agent simulation | SimPy | Discrete-event sim for synthetic user journeys |
| Experiment tracking | MLflow | Local file store; params/metrics/artifacts per run |
| Data/version control | DVC | Version big data + pipelines by hash, keep git text-only |
| Plotting | matplotlib | Log-log CCDFs, reliability diagrams, drift plots |
| Notebooks | JupyterLab | Fast iteration, throwaway exploration |
| Scale-up (optional) | ClickHouse | When local DuckDB stops scaling (ClickHouse) |
| Reference manager | Zotero | Paper PDFs + BibTeX, links from 50_Literature-MOC |
| The vault | Obsidian | Graph + backlinks over this folder (plain Markdown, optional) |
One-shot Python environment
Use a single project venv. With uv (fast, recommended) or stock pip:
# create + activate an isolated env (Python 3.11+)
uv venv .venv && source .venv/bin/activate # or: python -m venv .venv && source .venv/bin/activate
# core research stack
uv pip install \
duckdb polars pyarrow \
numpy scipy pandas matplotlib \
scikit-learn \
simpy \
mlflow dvc \
jupyterlab
# Hawkes / point processes (tick can be finicky; install separately so a failure doesn't block the rest)
uv pip install tick
tick sometimes lacks a wheel for the newest Python. If it won't install, pin an older Python in the venv, or substitute a hand-rolled Hawkes simulator (the math is in Point processes). Don't let one stubborn dependency stall the whole setup.
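If tick won't install, a hand-rolled substitute can be quite small. The sketch below uses Ogata's thinning algorithm for a univariate Hawkes process with an exponential kernel, using only the standard library. The parameterization (intensity `mu + alpha * sum(exp(-beta * (t - t_i)))`, stationary when `alpha < beta`) and the function name `simulate_hawkes` are assumptions made for illustration; check them against the definitions in Point processes before relying on it.

```python
import math
import random


def simulate_hawkes(mu, alpha, beta, horizon, seed=0):
    """Simulate event times on [0, horizon) by Ogata thinning.

    Assumed intensity: lambda(t) = mu + alpha * sum_i exp(-beta * (t - t_i)),
    summed over past events t_i < t. Times are in arbitrary units.
    """
    if mu <= 0 or alpha < 0 or beta <= 0:
        raise ValueError("need mu > 0, alpha >= 0, beta > 0")
    if alpha >= beta:
        raise ValueError("alpha / beta must be < 1 for a stationary process")

    rng = random.Random(seed)
    events = []
    t = 0.0
    excitation = 0.0  # self-exciting part of the intensity, evaluated at time t

    while True:
        # Between events the intensity only decays, so its current value
        # is a valid upper bound until the next accepted event.
        upper = mu + excitation
        wait = rng.expovariate(upper)
        t += wait
        if t >= horizon:
            break
        # Decay the excitation to the candidate time.
        excitation *= math.exp(-beta * wait)
        # Accept the candidate with probability lambda(t) / upper.
        if rng.random() * upper <= mu + excitation:
            events.append(t)
            excitation += alpha  # each accepted event bumps the intensity

    return events


if __name__ == "__main__":
    times = simulate_hawkes(mu=0.5, alpha=0.8, beta=1.6, horizon=100.0, seed=42)
    print(len(times), "events; first few:", [round(x, 2) for x in times[:5]])
```

Under this parameterization the long-run event rate is `mu / (1 - alpha / beta)`, which gives a quick sanity check on simulated output. Fitting is not covered here; this only replaces the simulation side of tick.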
Freeze what worked so it's reproducible offline later:
uv pip freeze > requirements.lock # commit this text file; it's tiny and it's the audit trail
DuckDB — the local event warehouse
No server, no daemon. Point it at Parquet and write SQL. Typical sessionization sketch:
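-- gap-based sessions: a new session starts when more than 30 min have passed since this user's previous event
-- session_id is a per-user running count starting at 0, so it is unique only in combination with user_id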
SELECT *,
SUM(new_session) OVER (PARTITION BY user_id ORDER BY ts) AS session_id
FROM (
SELECT *,
CASE WHEN ts - LAG(ts) OVER (PARTITION BY user_id ORDER BY ts)
> INTERVAL 30 MINUTE THEN 1 ELSE 0 END AS new_session
FROM read_parquet('events/*.parquet')
);
See DuckDB for local event analytics and Sessionization for the boundary subtleties.
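To check the SQL against a handful of rows, it can help to have the same gap rule in plain Python. The sketch below mirrors the query's logic under the same assumptions (strict "greater than 30 minutes" comparison, per-user session counter starting at 0). The function name `sessionize` and the `(user_id, ts)` tuple layout are hypothetical, chosen only for this example.

```python
from datetime import datetime, timedelta
from itertools import groupby
from operator import itemgetter


def sessionize(events, gap=timedelta(minutes=30)):
    """Assign per-user session ids to (user_id, ts) tuples.

    Returns (user_id, ts, session_id) rows ordered by user, then time.
    A new session starts only when the gap to the user's previous event
    is strictly greater than `gap`, matching the `>` in the SQL.
    """
    out = []
    ordered = sorted(events, key=itemgetter(0, 1))
    for user_id, rows in groupby(ordered, key=itemgetter(0)):
        session_id = 0
        prev_ts = None  # plays the role of LAG(ts) being NULL on the first row
        for _, ts in rows:
            if prev_ts is not None and ts - prev_ts > gap:
                session_id += 1
            out.append((user_id, ts, session_id))
            prev_ts = ts
    return out


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 9, 0)
    demo = [
        ("u1", t0),
        ("u1", t0 + timedelta(minutes=10)),
        ("u1", t0 + timedelta(minutes=40)),  # exactly 30 min after previous: same session
        ("u1", t0 + timedelta(minutes=71)),  # 31 min after previous: new session
        ("u2", t0 + timedelta(minutes=5)),
    ]
    for row in sessionize(demo):
        print(row)
```

The exactly-30-minutes case is one of the boundary choices worth pinning down: with a strict comparison it stays in the same session, and switching to `>=` would split it.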
Experiment tracking & data versioning (keep it minimal)
- MLflow with a local file store, no server needed:
  mlflow ui --backend-store-uri ./mlruns # browse runs at localhost:5000
  Every experiment note links its run ID; that's the bridge from a written hypothesis to its numbers.
- DVC versions the big stuff git ignores (see .gitignore):
  dvc init
  dvc add data/events.parquet # tracks a hash; the .dvc pointer is committed, the data isn't
  Experiment notes record the data version (the DVC hash) so a result is reproducible.
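Versioning "by hash" means the data file is identified by a digest of its bytes, so any change to the file produces a different identifier. The sketch below computes an MD5 content digest with the standard library, which can be pasted into an experiment note or compared against a recorded value before rerunning. It assumes DVC's pointer file records an MD5 of the file contents; whether this helper's output matches DVC's value exactly for your DVC version is an assumption to verify, and `file_md5` is a hypothetical helper, not part of DVC.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"" at EOF
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    import sys

    # Usage: python file_md5.py data/events.parquet
    for arg in sys.argv[1:]:
        print(file_md5(arg), arg)
```

Reading in chunks matters for event files that are larger than memory; the digest is the same regardless of chunk size.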
What's intentionally not here
- No Kafka/Flink/Spark. Streaming infra earns its keep only if in-flight latency genuinely matters (Batch vs streaming decision) — for research, batch on DuckDB/Polars is faster to iterate.
- No cloud anything by default. Offline-first. ClickHouse only when a single laptop truly can't hold the workload.
- No heavyweight orchestration. A handful of scripts + MLflow + DVC is the whole reproducibility story.
Links
- Substrate notes: DuckDB for local event analytics · Polars over pandas · ClickHouse · Batch vs streaming decision
- Used by: Synthetic-data toolkit · Generating background traffic · every experiment note
- Discipline: Don't over-engineer the research infra · Fast iteration beats parallelism