Evaluation theory

#evergreen #foundations #evaluation

Bar: build a confusion matrix by hand, compute precision/recall/F1, and explain why accuracy and even ROC-AUC mislead under 1–3% class imbalance.

The confusion matrix and its ratios

	Predicted +	Predicted −
Actual +	TP	FN
Actual −	FP	TN

Precision $=\dfrac{TP}{TP+FP}$ — of those I flagged, how many were right? (cost of false alarms)
Recall / TPR $=\dfrac{TP}{TP+FN}$ — of the real positives, how many did I catch? (cost of misses)
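F1 $=\dfrac{2\,PR}{P+R}$, with $P$ = precision and $R$ = recall: the harmonic mean, which is pulled toward the smaller of the two and so punishes a lopsided pair.
Specificity / TNR $=\dfrac{TN}{TN+FP}$; its complement is the false positive rate, FPR $=\dfrac{FP}{FP+TN}=1-\text{TNR}$.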
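The definitions above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `confusion_counts` and `ratios` are made up for this note, labels are assumed to be coded 1 (positive) and 0 (negative), and returning 0.0 on a zero denominator is one convention among several.

```python
from typing import Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> dict:
    """Count TP, FP, FN, TN for binary labels coded as 1 (positive) / 0 (negative)."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            counts["TP"] += 1
        elif actual == 0 and predicted == 1:
            counts["FP"] += 1
        elif actual == 1 and predicted == 0:
            counts["FN"] += 1
        else:
            # Assumption: inputs are strictly 0/1, so what remains is actual 0, predicted 0.
            counts["TN"] += 1
    return counts


def ratios(c: dict) -> dict:
    """Precision, recall, F1 and specificity from a counts dict."""
    def safe_div(num, den):
        # Convention chosen for this sketch: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    precision = safe_div(c["TP"], c["TP"] + c["FP"])
    recall = safe_div(c["TP"], c["TP"] + c["FN"])
    return {
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "specificity": safe_div(c["TN"], c["TN"] + c["FP"]),
    }


# Toy data: 3 actual positives, 5 actual negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
counts = confusion_counts(y_true, y_pred)
metrics = ratios(counts)
print(counts)   # TP=2, FP=1, FN=1, TN=4
print(metrics)
```

On this toy data precision and recall are both 2/3, so F1 is also 2/3, and specificity is 4/5.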

Worked (by hand)

$TP=8,\ FP=2,\ FN=4,\ TN=86$ (100 cases, 12 actual positives).

Precision $=8/10=0.80$, Recall $=8/12\approx0.667$, F1 $=\dfrac{2(0.8)(0.667)}{0.8+0.667}\approx0.727$.
Accuracy $=(8+86)/100=0.94$ — looks great, but a model that predicts "negative always" scores $88/100=0.88$ here. Accuracy rewards the majority class. At a 1% base rate, all-negative scores 99% and is useless. Never judge a rare-event model by accuracy.
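The all-negative baseline is easy to check numerically. The sketch below uses a hypothetical helper, `all_negative_scores`, and evaluates it on the worked example (12 positives in 100) and on a 1% base rate.

```python
def all_negative_scores(n_pos: int, n_neg: int) -> tuple:
    """Accuracy and recall of a model that predicts 'negative' for every case."""
    # Every negative becomes a TN and every positive an FN; there are no TP or FP.
    accuracy = n_neg / (n_pos + n_neg)
    recall = 0.0  # no positive is ever caught
    return accuracy, recall


for n_pos, n_neg in [(12, 88), (1, 99)]:
    accuracy, recall = all_negative_scores(n_pos, n_neg)
    print(f"{n_pos} positives in {n_pos + n_neg}: accuracy={accuracy:.2f}, recall={recall:.2f}")
```

Accuracy climbs as positives get rarer while recall stays at zero, which is why the baseline has to be stated next to any accuracy figure.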

Threshold-free views

ROC curve = TPR vs FPR across thresholds; AUC-ROC = P(model ranks a random positive above a random negative). It is insensitive to base rate, which is exactly the problem: FPR is normalized by the number of negatives, so when negatives vastly outnumber positives even a small FPR can produce more false positives than true positives. AUC can look healthy while precision is abysmal.
Precision–Recall curve + AUC-PR: PR is the right lens under heavy imbalance because a random classifier's AUC-PR is roughly the positive rate (about 0.01 at a 1% base rate), not 0.5, so the baseline moves with the imbalance. Prefer PR for 1–3% conversions (Heavy-tailed distributions).
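A small experiment makes the base-rate point concrete. The sketch below computes AUC-ROC straight from its pairwise definition and precision at a fixed threshold, then repeats the negatives 40 times to push the positive rate to about 2%. The scores, the 0.55 threshold, and the helper names `roc_auc` and `precision_at` are all invented for illustration.

```python
def roc_auc(pos_scores, neg_scores):
    """P(a random positive scores above a random negative); ties count as half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def precision_at(threshold, pos_scores, neg_scores):
    """Precision when every score >= threshold is flagged as positive."""
    tp = sum(s >= threshold for s in pos_scores)
    fp = sum(s >= threshold for s in neg_scores)
    return tp / (tp + fp) if (tp + fp) else 0.0


pos = [0.9, 0.8, 0.6, 0.4]        # scores of actual positives
neg = [0.7, 0.5, 0.3, 0.2, 0.1]   # scores of actual negatives
rare_neg = neg * 40               # same score distribution, 200 negatives vs 4 positives

for label, negatives in [("balanced", neg), ("about 2% positives", rare_neg)]:
    auc = roc_auc(pos, negatives)
    prec = precision_at(0.55, pos, negatives)
    print(f"{label}: AUC={auc:.2f}, precision@0.55={prec:.3f}")
```

AUC stays at 0.85 in both settings because it depends only on how positives rank against negatives, while precision at the same threshold falls from 0.75 to 3/43, roughly 0.07.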

Probabilities, not just labels: proper scoring rules

A (strictly) proper scoring rule is minimized in expectation only by reporting the true probabilities, so it rewards honest, calibrated outputs and can't be gamed.

Log-loss $=-\frac1N\sum[y\log\hat p+(1-y)\log(1-\hat p)]$ — this is cross-entropy (Cross-entropy and KL divergence).
Brier score $=\frac1N\sum(\hat p-y)^2$.
Accuracy is not proper (it ignores confidence). If a decision uses the probability (ranking, thresholds, expected value), score the probability.
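The "minimized only by the true probability" property can be checked numerically. The sketch below assumes a single event with true probability 0.3, scans reported probabilities on a 0.01 grid, and finds where expected log-loss and expected Brier score are smallest; it also shows that expected accuracy (with an assumed 0.5 decision threshold) cannot tell an honest 0.3 from an extreme 0.01. All function names are illustrative.

```python
import math


def log_loss(y_true, p_hat):
    """Mean negative log-likelihood; assumes every p_hat is strictly between 0 and 1."""
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p) for y, p in zip(y_true, p_hat))
    return -total / len(y_true)


def brier(y_true, p_hat):
    """Mean squared error between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_hat)) / len(y_true)


def expected_log_loss(true_p, q):
    """Expected log-loss of reporting q when the event occurs with probability true_p."""
    return -(true_p * math.log(q) + (1 - true_p) * math.log(1 - q))


def expected_brier(true_p, q):
    """Expected Brier score of reporting q when the event occurs with probability true_p."""
    return true_p * (q - 1) ** 2 + (1 - true_p) * q ** 2


def expected_accuracy(true_p, q, threshold=0.5):
    """Expected accuracy when the label is 'positive' iff q >= threshold."""
    return true_p if q >= threshold else 1 - true_p


true_p = 0.3
grid = [i / 100 for i in range(1, 100)]  # candidate reports 0.01 .. 0.99
best_log = min(grid, key=lambda q: expected_log_loss(true_p, q))
best_brier = min(grid, key=lambda q: expected_brier(true_p, q))
print(best_log, best_brier)  # both land on the true probability
print(expected_accuracy(true_p, 0.3), expected_accuracy(true_p, 0.01))  # identical
```

Both proper rules bottom out at the report 0.3, whereas accuracy gives the same expected score to any report below the threshold.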

Calibration

A model is calibrated if, among cases it says "0.7", ~70% are actually positive. Check with a reliability diagram (predicted-prob bins vs observed frequency). A model can have great AUC yet be badly miscalibrated — fatal when the business acts on the probability (budgets, bids, expected revenue).
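The numbers behind a reliability diagram are just a per-bin comparison of mean predicted probability and observed positive rate. The sketch below uses equal-width bins and made-up data; `reliability_table` is a hypothetical helper, and real checks usually need far more cases per bin than this toy set.

```python
def reliability_table(y_true, p_hat, n_bins=10):
    """Rows of (bin index, count, mean predicted prob, observed positive rate)."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, p_hat):
        # Equal-width bins on [0, 1]; p == 1.0 is placed in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    rows = []
    for idx, items in enumerate(bins):
        if not items:
            continue  # skip empty bins rather than report an undefined rate
        mean_p = sum(p for p, _ in items) / len(items)
        observed = sum(y for _, y in items) / len(items)
        rows.append((idx, len(items), mean_p, observed))
    return rows


# Made-up data: 20 cases scored 0.75 (15 positive), 20 cases scored 0.25 (10 positive).
p_hat = [0.75] * 20 + [0.25] * 20
y_true = [1] * 15 + [0] * 5 + [1] * 10 + [0] * 10
table = reliability_table(y_true, p_hat)
for idx, n, mean_p, observed in table:
    print(f"bin {idx}: n={n}, mean predicted={mean_p:.2f}, observed={observed:.2f}")
```

Here the 0.75 bin is calibrated (75% observed) while the 0.25 bin is not (50% observed); plotting mean predicted against observed per bin gives the reliability diagram.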

Why this matters here

Pick the metric that matches the business cost of FP vs FN, and the base rate. Wrong metric → confident nonsense.
Split by user & time, never by row — otherwise the score is leakage, not skill (Leakage checklist).
Offline metric ≠ business value: a great log-loss can still die in the A/B test (Online vs offline gap, Offline→online checklist).
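One possible reading of "split by user and time" is sketched below: hold out a set of users entirely, and additionally require that training rows come before a cutoff and test rows at or after it. The row layout (`user_id`, `ts`, `y`), the helper name, and the choice to drop rows that fit neither side are assumptions made for this illustration, not something the note specifies.

```python
def split_by_user_and_time(rows, test_users, cutoff):
    """Train on non-test users before the cutoff; test on held-out users from the cutoff on."""
    train = [r for r in rows if r["user_id"] not in test_users and r["ts"] < cutoff]
    test = [r for r in rows if r["user_id"] in test_users and r["ts"] >= cutoff]
    # Rows matching neither condition are dropped, so no user and no time period
    # appears on both sides of the split.
    return train, test


rows = [
    {"user_id": "u1", "ts": 1, "y": 0},
    {"user_id": "u1", "ts": 5, "y": 1},
    {"user_id": "u2", "ts": 2, "y": 0},
    {"user_id": "u2", "ts": 6, "y": 0},
    {"user_id": "u3", "ts": 3, "y": 1},
    {"user_id": "u3", "ts": 7, "y": 0},
]
train, test = split_by_user_and_time(rows, test_users={"u3"}, cutoff=4)
print([(r["user_id"], r["ts"]) for r in train])
print([(r["user_id"], r["ts"]) for r in test])
```

The split discards data on purpose: a row-level random split would keep every row but would let the same user, and the future, leak into training.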

By-hand exercise (meets the bar)

From $TP=20,\ FP=180,\ FN=5,\ TN=9795$ (10,000 cases, 25 actual positives, a 0.25% base rate), compute precision, recall, accuracy. Notice precision $=0.10$ despite 98%+ accuracy.
Two models: A has AUC 0.85 but says "0.5" for everything that's actually 0.9; B has AUC 0.80 but is calibrated. Which do you ship if you bid real money on the probability? (B — calibration.)
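For the first exercise, the arithmetic can be checked against the following working (the four counts sum to 10,000 cases):

Precision $=\dfrac{20}{20+180}=0.10$, Recall $=\dfrac{20}{20+5}=0.80$, Accuracy $=\dfrac{20+9795}{10000}=0.9815$.
All-negative baseline: Accuracy $=\dfrac{180+9795}{10000}=0.9975$, Recall $=0$.

The model's accuracy is below the all-negative baseline even though it catches 80% of the positives, which is the same trap as in the worked example.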