Causal inference primer

evergreen#foundations#causality

Bar: state the potential-outcomes definition of a treatment effect, explain confounding with a worked Simpson's-paradox table, and say why a propensity model ≠ an uplift model.

Prediction ≠ causation (the whole point)

A predictive model answers "given what I see, what will happen?" A causal model answers "if I intervene, what changes?" These come apart constantly: ice-cream sales predict drownings (both caused by summer), but banning ice cream saves no one. The business often thinks it wants a prediction when it actually wants to change the outcome (Predict vs cause checklist).

Potential outcomes

Each unit has two potential outcomes: $Y(1)$ if treated, $Y(0)$ if not.

Individual effect $\tau_i=Y_i(1)-Y_i(0)$ — never directly observable (you see only one).
ATE $=\mathbb{E}[Y(1)-Y(0)]$ ; CATE $=\mathbb{E}[Y(1)-Y(0)\mid X=x]$ (effect for a subgroup).
Fundamental problem of causal inference: for any unit you observe one outcome, never the counterfactual. Causal inference is the science of recovering the missing half.
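
To make these definitions concrete, here is a minimal Python sketch on made-up data. The six units, the `new` / `returning` segment labels, and every outcome value are hypothetical. Both potential outcomes are listed only because the data are invented; with real observations only one of them is ever recorded.

```python
from statistics import mean

# Hypothetical units: (segment, Y(1), Y(0), treated?).
# All values are invented for illustration.
units = [
    ("new", 1, 0, True),
    ("new", 1, 1, False),
    ("new", 0, 0, True),
    ("returning", 1, 0, False),
    ("returning", 1, 0, True),
    ("returning", 0, 0, False),
]

# Individual effects tau_i = Y_i(1) - Y_i(0): computable only in a toy table.
tau = [y1 - y0 for _, y1, y0, _ in units]

# ATE averages over everyone; CATE averages within a subgroup X = x.
ate = mean(tau)
cate = {
    seg: mean(y1 - y0 for s, y1, y0, _ in units if s == seg)
    for seg in ("new", "returning")
}

# What an analyst actually sees: one outcome per unit, never the counterfactual.
observed = [(seg, t, y1 if t else y0) for seg, y1, y0, t in units]

# The observed difference in means need not equal the ATE.
naive = mean(y for _, t, y in observed if t) - mean(y for _, t, y in observed if not t)

print("ATE:", ate)
print("CATE:", cate)
print("naive difference:", round(naive, 3))
```

In this toy table the ATE is 0.5 while the observed difference in means is about 0.333, because the units that happened to be treated are not a random sample of all units.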

Confounding (why correlation lies)

A confounder $Z$ causes both treatment $T$ and outcome $Y$, creating a spurious $T\!-\!Y$ association ($T\leftarrow Z\rightarrow Y$). The naïve contrast $\mathbb{E}[Y\mid T=1]-\mathbb{E}[Y\mid T=0]$ then mixes the real effect with selection bias: the treated and untreated groups would have differed in outcome even if nobody had been treated.
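
One way to see the mixing is the standard decomposition of the naïve contrast. It assumes consistency, i.e. the observed outcome equals the potential outcome under the treatment actually received, $Y=Y(T)$; that assumption is not stated in the note itself. Add and subtract $\mathbb{E}[Y(0)\mid T=1]$:

$$
\mathbb{E}[Y\mid T=1]-\mathbb{E}[Y\mid T=0]
=\underbrace{\mathbb{E}[Y(1)-Y(0)\mid T=1]}_{\text{effect on the treated}}
+\underbrace{\mathbb{E}[Y(0)\mid T=1]-\mathbb{E}[Y(0)\mid T=0]}_{\text{selection bias}}
$$

The second term is zero when $T$ is independent of the potential outcomes, and in that case the first term also equals the ATE.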

Worked: Simpson's paradox (by hand)

Recovery rates for a treatment, split by case severity:

Severity	Treated	Untreated
Mild	90% (180/200)	85% (170/200)
Severe	50% (100/200)	40% (80/200)
Aggregate	70% (280/400)	62.5% (250/400)

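Within each severity level the treatment helps, and because every cell here has 200 patients, the aggregate agrees (70% vs 62.5%). Now suppose doctors give the treatment mostly to severe cases: the treated group is then dominated by low-recovery patients, so the aggregate can reverse and make the treatment look harmful. Severity is the confounder, so you must condition on it. An unadjusted aggregate is not, in general, the causal number.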
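
The reversal can be checked with a few lines of arithmetic. The sketch below keeps the per-stratum recovery rates from the table but swaps in a hypothetical, skewed allocation (20 mild and 180 severe patients treated, the mirror image untreated). Those counts are an assumption chosen for illustration, and they are only one of many allocations that produce a reversal.

```python
from fractions import Fraction

# Per-stratum recovery rates, taken from the table above.
rate = {
    ("mild", "treated"): Fraction(180, 200),
    ("mild", "untreated"): Fraction(170, 200),
    ("severe", "treated"): Fraction(100, 200),
    ("severe", "untreated"): Fraction(80, 200),
}

# Hypothetical allocation (an assumption, not from the table):
# doctors treat mostly severe cases.
n = {
    ("mild", "treated"): 20,
    ("severe", "treated"): 180,
    ("mild", "untreated"): 180,
    ("severe", "untreated"): 20,
}

def aggregate(arm):
    """Pooled recovery rate for one arm, ignoring severity."""
    recovered = sum(rate[(s, arm)] * n[(s, arm)] for s in ("mild", "severe"))
    total = sum(n[(s, arm)] for s in ("mild", "severe"))
    return recovered / total

agg_treated = aggregate("treated")
agg_untreated = aggregate("untreated")

print("treated:", float(agg_treated))
print("untreated:", float(agg_untreated))
```

With these counts the treated arm pools to 108/200 = 54% and the untreated arm to 161/200 = 80.5%, even though the treatment is better within both the mild and the severe stratum.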

How to actually estimate effects

Randomization (A/B test): assignment is independent of the potential outcomes and of every covariate, measured or not, so there is no confounding. This is the gold standard, and it is why offline lift must be confirmed online (Online vs offline gap).
Observational adjustment: condition on confounders via stratification, regression, propensity-score weighting (IPW), or matching. Valid only if there are no unmeasured confounders (an untestable assumption) and every stratum contains both treated and untreated units (overlap).
Quasi-experiments: difference-in-differences, instrumental variables, regression discontinuity — borrow a source of "as-if random" variation.
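
As an illustration of observational adjustment, the sketch below applies stratification (standardization) and IPW to a small confounded dataset. The counts are hypothetical: they reuse the per-stratum recovery rates from the Simpson's-paradox table but assume doctors treated 20 mild and 180 severe patients, leaving 180 mild and 20 severe untreated. Severity is assumed to be the only confounder, and the function names are made up for this example.

```python
from fractions import Fraction

# Hypothetical counts: stratum -> arm -> (n_units, n_recovered).
counts = {
    "mild": {"treated": (20, 18), "untreated": (180, 153)},
    "severe": {"treated": (180, 90), "untreated": (20, 8)},
}
N = sum(n for arms in counts.values() for n, _ in arms.values())

def naive():
    """Unadjusted difference in pooled recovery rates."""
    def pooled(arm):
        n = sum(counts[z][arm][0] for z in counts)
        r = sum(counts[z][arm][1] for z in counts)
        return Fraction(r, n)
    return pooled("treated") - pooled("untreated")

def standardized():
    """Stratum-specific effects, weighted by each stratum's population share."""
    ate = Fraction(0)
    for arms in counts.values():
        n_t, r_t = arms["treated"]
        n_u, r_u = arms["untreated"]
        effect_z = Fraction(r_t, n_t) - Fraction(r_u, n_u)
        ate += Fraction(n_t + n_u, N) * effect_z
    return ate

def ipw():
    """Inverse-propensity weighting with e(z) = P(T=1 | Z=z) estimated from counts."""
    y1 = y0 = Fraction(0)
    for arms in counts.values():
        n_t, r_t = arms["treated"]
        n_u, r_u = arms["untreated"]
        e = Fraction(n_t, n_t + n_u)
        y1 += r_t / e        # each treated recovery counts 1/e times
        y0 += r_u / (1 - e)  # each untreated recovery counts 1/(1-e) times
    return (y1 - y0) / N

print("naive:", float(naive()))
print("standardized:", float(standardized()))
print("IPW:", float(ipw()))
```

On these counts the naive difference is -0.265, while both adjustments return +0.075. The two estimators agree exactly here because the propensity score is estimated within the same strata used for standardization; with continuous covariates or a fitted propensity model they generally differ.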

Propensity vs uplift (the for-profit crux)

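Propensity model: $P(\text{convert}\mid X)$, i.e. who will convert. This is the marketing sense of "propensity"; it is not the propensity score $P(T=1\mid X)$ used in IPW above. Target the top scorers and you mostly reach people who would have converted anyway (Goodhart: you optimize a proxy and waste budget).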
Uplift / CATE model: $P(\text{convert}\mid X, \text{treat}) - P(\text{convert}\mid X, \text{no treat})$ — who converts because of the treatment. That's the population worth spending on.
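
A toy ranking shows how far the two targeting rules can diverge. The four segments and their conversion probabilities below are hypothetical, and the untreated conversion rate stands in for the propensity score as a simplifying assumption.

```python
# Hypothetical segments: (name, P(convert | X, treat), P(convert | X, no treat)).
segments = [
    ("loyal", 0.90, 0.88),
    ("persuadable", 0.40, 0.10),
    ("lost cause", 0.05, 0.04),
    ("sleeping dog", 0.30, 0.35),
]

def uplift(segment):
    """Conversion probability with treatment minus without."""
    _, p_treat, p_control = segment
    return p_treat - p_control

# Propensity targeting: rank by how likely the segment is to convert at all.
by_propensity = sorted(segments, key=lambda s: s[2], reverse=True)

# Uplift targeting: rank by how much the treatment changes conversion.
by_uplift = sorted(segments, key=uplift, reverse=True)

for label, ranking in (("propensity", by_propensity), ("uplift", by_uplift)):
    top = ranking[0]
    extra = round(1000 * uplift(top))
    print(f"{label}: target '{top[0]}', ~{extra} incremental conversions per 1000 treated")
```

Propensity ranking picks the `loyal` segment, which yields about 20 incremental conversions per 1000 treated; uplift ranking picks `persuadable`, which yields about 300. The `sleeping dog` segment has negative uplift, so treating it loses conversions even though its propensity is second highest.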

This distinction is the Q3 "causal turn" of the roadmap → Attribution / uplift.

By-hand exercise (meets the bar)

Construct numbers where treatment helps in two subgroups yet hurts in aggregate (reverse the table above).
Sketch the DAG $T\leftarrow Z\rightarrow Y$, $T\rightarrow Y$ and mark which path randomization removes.