Offline compute stack

#seed #domain #tooling #setup

Up: 20_Domain-MOC

Everything you need to do clickstream research on a laptop, offline. Philosophy: the minimum to iterate fast and reproduce — not a cathedral (Don't over-engineer the research infra). Install once while online; then it all runs locally.

⚠️ Document-only note. Nothing here is auto-installed. These are the commands to run yourself; pin versions when you do, and record what you actually installed in Inbox or here.

The stack at a glance

| Need | Tool | Why this one |
| --- | --- | --- |
| Local OLAP / SQL on files | DuckDB | Query Parquet/CSV with SQL, zero server, sessionize + funnels on a laptop |
| Fast dataframes | Polars | Multi-core, lazy, Arrow-native; event tables outgrow pandas |
| Numerics / stats | numpy + scipy | The substrate for every generator and metric |
| Hawkes / point processes | tick | Simulate + fit self-exciting processes (Point processes) |
| Agent simulation | SimPy | Discrete-event sim for synthetic user journeys |
| Experiment tracking | MLflow | Local file store; params/metrics/artifacts per run |
| Data/version control | DVC | Version big data + pipelines by hash, keep git text-only |
| Plotting | matplotlib | Log-log CCDFs, reliability diagrams, drift plots |
| Notebooks | JupyterLab | Fast iteration, throwaway exploration |
| Scale-up (optional) | ClickHouse | When local DuckDB stops scaling (ClickHouse) |
| Reference manager | Zotero | Paper PDFs + BibTeX, links from 50_Literature-MOC |
| The vault | Obsidian | Graph + backlinks over this folder (plain Markdown, optional) |

One-shot Python environment

Use a single project venv. With uv (fast, recommended) or stock pip:

# create + activate an isolated env (Python 3.11+)
uv venv .venv && source .venv/bin/activate     # or: python -m venv .venv && source .venv/bin/activate

# core research stack
uv pip install \
  duckdb polars pyarrow \
  numpy scipy pandas matplotlib \
  scikit-learn \
  simpy \
  mlflow dvc \
  jupyterlab

# Hawkes / point processes (tick can be finicky; install separately so a failure doesn't block the rest)
uv pip install tick

tick sometimes lacks a wheel for the newest Python — if it won't install, pin an older Python in the venv, or substitute a hand-rolled Hawkes simulator (the math is in Point processes). Don't let one stubborn dependency stall the whole setup.
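If tick won't install, a fallback simulator can be quite small. The sketch below uses Ogata's thinning algorithm for a univariate Hawkes process with an exponential kernel, standard library only. The parameterisation (`mu`, `alpha`, `beta`), the function name `simulate_hawkes`, and the seed handling are assumptions made for illustration; they are not taken from Point processes or from tick's API, and this covers simulation only, not fitting.

```python
import math
import random


def simulate_hawkes(mu, alpha, beta, horizon, seed=0):
    """Simulate event times on [0, horizon) by Ogata thinning.

    Assumed intensity (illustrative parameterisation):
        lambda(t) = mu + sum_i alpha * beta * exp(-beta * (t - t_i))
    so alpha is the branching ratio and must stay below 1 for stability.
    """
    if not (mu > 0 and 0 <= alpha < 1 and beta > 0):
        raise ValueError("need mu > 0, 0 <= alpha < 1, beta > 0")

    rng = random.Random(seed)
    events = []
    t = 0.0
    excitation = 0.0  # self-exciting part of the intensity, evaluated at time t

    while True:
        # Between events the intensity only decays, so its current value
        # is a valid upper bound until the next accepted event.
        lam_bar = mu + excitation
        wait = rng.expovariate(lam_bar)
        t += wait
        if t >= horizon:
            break
        # Decay the excitation to the candidate time.
        excitation *= math.exp(-beta * wait)
        # Accept the candidate with probability lambda(t) / lam_bar.
        if rng.random() * lam_bar <= mu + excitation:
            events.append(t)
            excitation += alpha * beta  # each event adds a kernel jump

    return events


# Hypothetical parameters, chosen only to exercise the function.
timestamps = simulate_hawkes(mu=0.5, alpha=0.6, beta=1.2, horizon=200.0, seed=1)
```

Setting `alpha=0` reduces this to a homogeneous Poisson process with rate `mu`, which makes a convenient sanity check before trusting the self-exciting case.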

Freeze what worked so it's reproducible offline later:

uv pip freeze > requirements.lock     # commit this text file; it's tiny and it's the audit trail
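Later, offline, it can be useful to confirm that the active environment still matches the lock file. A minimal sketch using only the standard library follows; the helper names `parse_pins` and `find_drift` are hypothetical, and it assumes the lock file consists mostly of plain `name==version` lines (anything else, such as editable installs or URL pins, is skipped rather than checked).

```python
import re
from importlib import metadata


def parse_pins(lock_text):
    """Return {normalized_name: version} for 'name==version' lines.

    Comments, blank lines, and lines without '==' are ignored.
    """
    pins = {}
    for raw in lock_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        version = version.split(";", 1)[0].strip()  # drop environment markers
        # Normalize names the way package indexes do (runs of -, _, . become -).
        key = re.sub(r"[-_.]+", "-", name.strip()).lower()
        pins[key] = version
    return pins


def find_drift(pins):
    """Return {name: (pinned, installed)} where they differ.

    installed is None when the package is missing from the environment.
    """
    drift = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            drift[name] = (wanted, installed)
    return drift
```

Typical use from the project root would be `find_drift(parse_pins(open("requirements.lock").read()))`; an empty dict means every pinned package is installed at the pinned version.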

DuckDB — the local event warehouse

No server, no daemon. Point it at Parquet and write SQL. Typical sessionization sketch:

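-- gap-based sessions: new session when >30 min since this user's previous event
-- session_id is a per-user running count (0 for the first session);
-- pair it with user_id when a globally unique key is needed
SELECT *,
  SUM(new_session) OVER (PARTITION BY user_id ORDER BY ts) AS session_id
FROM (
  SELECT *,
    -- LAG is NULL on a user's first event, so the CASE falls through to 0
    CASE WHEN ts - LAG(ts) OVER (PARTITION BY user_id ORDER BY ts)
              > INTERVAL 30 MINUTE THEN 1 ELSE 0 END AS new_session
  FROM read_parquet('events/*.parquet')
);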

See DuckDB for local event analytics and Sessionization for the boundary subtleties.
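When checking a sessionization query against a tiny hand-built fixture, a plain-Python reference for the same gap rule can help. The sketch below is an assumption-laden mirror of the SQL above: the function name `sessionize` and the `(user_id, ts)` tuple layout are hypothetical, the session counter is per user and 0-based, and a gap of exactly 30 minutes does not start a new session (strict `>`).

```python
from datetime import datetime, timedelta


def sessionize(events, gap=timedelta(minutes=30)):
    """Assign per-user, 0-based session ids by inactivity gap.

    events: iterable of (user_id, ts) with ts as datetime.
    Returns a list of (user_id, ts, session_id) sorted by user, then ts.
    """
    out = []
    last_ts = {}   # user_id -> timestamp of that user's previous event
    session = {}   # user_id -> current session counter
    for user_id, ts in sorted(events):
        prev = last_ts.get(user_id)
        if prev is None:
            session[user_id] = 0        # first event for this user
        elif ts - prev > gap:
            session[user_id] += 1       # strictly more than the gap: new session
        last_ts[user_id] = ts
        out.append((user_id, ts, session[user_id]))
    return out


# Made-up fixture, for illustration only.
t0 = datetime(2024, 1, 1, 9, 0)
fixture = [
    ("a", t0),
    ("a", t0 + timedelta(minutes=10)),
    ("a", t0 + timedelta(minutes=40)),   # exactly 30 min after the previous event
    ("a", t0 + timedelta(minutes=71)),   # 31 min after the previous event
    ("b", t0 + timedelta(minutes=5)),
]
labelled = sessionize(fixture)
```

On this fixture, user `a` gets session ids 0, 0, 0, 1 and user `b` gets 0, which is what the SQL should also produce for the same rows.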

Experiment tracking & data versioning (keep it minimal)

What's intentionally not here