Evaluation theory

#evergreen #foundations #evaluation

Up: 10_Foundations-MOC

Bar: build a confusion matrix by hand, compute precision/recall/F1, and explain why accuracy and even ROC-AUC mislead under 1–3% class imbalance.

The confusion matrix and its ratios

            Predicted +    Predicted −
Actual +    TP             FN
Actual −    FP             TN
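Filling the four cells is just a tally over (actual, predicted) pairs. A minimal sketch, assuming binary labels encoded as 1 for positive and 0 for negative; the function name `confusion_counts` and the toy data are made up for illustration:

```python
def confusion_counts(y_true, y_pred):
    """Tally TP, FP, FN, TN for binary labels (1 = positive, 0 = negative)."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1  # actual +, predicted +
        elif actual == 0 and predicted == 1:
            fp += 1  # actual -, predicted +
        elif actual == 1 and predicted == 0:
            fn += 1  # actual +, predicted -
        else:
            tn += 1  # actual -, predicted -
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Hypothetical toy data, not taken from any real model.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
counts = confusion_counts(y_true, y_pred)
print(counts)  # {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 4}
```

Doing the same tally on paper for a dozen rows is a quick way to check that the row and column orientation of the table has sunk in.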

Worked (by hand)

TP=8, FP=2, FN=4, TN=86 (100 cases, 12 actual positives).
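To check the hand arithmetic on these counts, here is a small sketch using the standard definitions: precision = TP/(TP+FP), recall = TP/(TP+FN), F1 = harmonic mean of the two, accuracy = (TP+TN)/total. The helper name `ratios` is illustrative only.

```python
def ratios(tp, fp, fn, tn):
    """Return (precision, recall, F1, accuracy) from the four confusion-matrix counts."""
    precision = tp / (tp + fp)  # of predicted positives, how many are real
    recall = tp / (tp + fn)     # of real positives, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy


precision, recall, f1, accuracy = ratios(tp=8, fp=2, fn=4, tn=86)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
# precision=0.800 recall=0.667 f1=0.727 accuracy=0.940
```

By hand: precision = 8/10, recall = 8/12, F1 = 16/22, accuracy = 94/100. Accuracy is the highest of the four even though a third of the positives were missed.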

Threshold-free views

Probabilities, not just labels: proper scoring rules

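A strictly proper scoring rule, such as log loss or the Brier score, is one whose expected value is minimized only by reporting the true probabilities. It therefore rewards honest, calibrated outputs and can't be gamed by shading predictions up or down.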
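One way to see the property is to fix a true probability and scan over every possible reported probability. The sketch below assumes a single Bernoulli event with true probability 0.7 (an arbitrary choice) and compares the Brier score with absolute error, which is not a proper rule; the function names are illustrative.

```python
def expected_brier(p, q):
    """Expected squared error when the event has true probability p and we report q."""
    return p * (1 - q) ** 2 + (1 - p) * q ** 2


def expected_abs_error(p, q):
    """Expected absolute error for the same setup (an improper rule, for contrast)."""
    return p * (1 - q) + (1 - p) * q


p_true = 0.7  # assumed true probability, chosen for illustration
grid = [i / 100 for i in range(101)]  # candidate reports 0.00, 0.01, ..., 1.00

best_brier = min(grid, key=lambda q: expected_brier(p_true, q))
best_abs = min(grid, key=lambda q: expected_abs_error(p_true, q))
print(best_brier, best_abs)  # 0.7 1.0
```

Under the Brier score the best report is the truth (0.7). Under absolute error the best report is the extreme 1.0, so a model scored that way is pushed toward overconfident answers.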

Calibration

A model is calibrated if, among cases it says "0.7", ~70% are actually positive. Check with a reliability diagram (predicted-prob bins vs observed frequency). A model can have great AUC yet be badly miscalibrated — fatal when the business acts on the probability (budgets, bids, expected revenue).
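The numbers behind a reliability diagram are just a per-bin comparison of mean predicted probability and observed positive rate. A minimal sketch, assuming equal-width bins over [0, 1]; the function `reliability_table` and the scores and labels are hypothetical, made up to show one calibrated bin and one miscalibrated bin.

```python
def reliability_table(probs, labels, n_bins=5):
    """Return (mean predicted prob, observed positive rate, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    rows = []
    for members in bins:
        if not members:
            continue  # skip empty bins
        mean_pred = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        rows.append((mean_pred, observed, len(members)))
    return rows


# Hypothetical predictions: four cases scored 0.1, ten cases scored 0.7.
probs = [0.1] * 4 + [0.7] * 10
labels = [0, 0, 0, 1] + [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]

rows = reliability_table(probs, labels)
for mean_pred, observed, count in rows:
    print(f"predicted={mean_pred:.2f} observed={observed:.2f} n={count}")
```

In this toy data the 0.7 bin is calibrated (7 of 10 positive), while the 0.1 bin under-predicts (1 of 4 positive, i.e. 0.25). Plotting observed against predicted, with the diagonal as the reference, gives the reliability diagram.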

Why this matters here

By-hand exercise (meets the bar)

  1. From TP=20, FP=180, FN=5, TN=9795 (10,000 cases, 25 actual positives, i.e. 0.25%), compute precision, recall, and accuracy. Notice that precision = 0.10 despite 98%+ accuracy.
  2. Two models: A has AUC 0.85 but says "0.5" for everything that's actually 0.9; B has AUC 0.80 but is calibrated. Which do you ship if you bid real money on the probability? (B — calibration.)
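For checking item 1 after working it on paper, the sketch below recomputes the ratios from the same counts and adds two comparisons that are not part of the original exercise: the false positive rate (the x-axis quantity of a ROC curve) and the accuracy of a trivial baseline that predicts "negative" for every case.

```python
tp, fp, fn, tn = 20, 180, 5, 9795
total = tp + fp + fn + tn

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / total
false_positive_rate = fp / (fp + tn)  # what a ROC curve plots on its x-axis

# Trivial baseline: predict "negative" for everything, so every actual
# negative is counted correct and every actual positive is missed.
baseline_accuracy = (fp + tn) / total

print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.4f}")
print(f"FPR={false_positive_rate:.4f} baseline_accuracy={baseline_accuracy:.4f}")
# precision=0.10 recall=0.80 accuracy=0.9815
# FPR=0.0180 baseline_accuracy=0.9975
```

Two things stand out. The do-nothing baseline beats the model on accuracy (99.75% vs 98.15%), so accuracy alone cannot justify shipping it. And a false positive rate under 2% looks excellent on a ROC curve, yet it still means 9 of every 10 flagged cases are false alarms, because the negatives vastly outnumber the positives.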