Evaluation theory
evergreen#foundations#evaluation
Bar: build a confusion matrix by hand, compute precision/recall/F1, and explain why accuracy and even ROC-AUC mislead under 1–3% class imbalance.
The confusion matrix and its ratios
| | Predicted + | Predicted − |
|---|---|---|
| Actual + | TP | FN |
| Actual − | FP | TN |
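- Precision $= \frac{TP}{TP + FP}$: of those I flagged, how many were right? (cost of false alarms)
- Recall / TPR $= \frac{TP}{TP + FN}$: of the real positives, how many did I catch? (cost of misses)
- F1 $= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$: harmonic mean; punishes a lopsided pair.
- Specificity / TNR $= \frac{TN}{TN + FP}$.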
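The four ratios above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `confusion_ratios` and the choice to return 0.0 when a ratio is undefined (zero denominator) are assumptions made here, not part of the note.

```python
def confusion_ratios(tp, fn, fp, tn):
    """Return precision, recall, F1 and specificity from the four confusion-matrix counts."""

    def safe_div(num, den):
        # Assumption: report 0.0 when a ratio is undefined (empty denominator).
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)    # of those flagged, how many were right
    recall = safe_div(tp, tp + fn)       # of the real positives, how many were caught
    specificity = safe_div(tn, tn + fp)  # of the real negatives, how many were left alone
    f1 = safe_div(2 * precision * recall, precision + recall)  # harmonic mean
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "specificity": specificity,
    }
```

Calling `confusion_ratios(8, 2, 8, 82)` gives precision 0.5 and recall 0.8; F1 lands closer to the weaker of the two, which is the "punishes a lopsided pair" behaviour.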
Worked (by hand)
- Precision, Recall, F1: read the four cells off the matrix and apply the ratios above.
- Accuracy: looks great, but a model that always predicts "negative" also scores highly here. Accuracy rewards the majority class. At a 1% base rate, all-negative scores 99% and is useless. Never judge a rare-event model by accuracy.
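To make the accuracy trap concrete, here is a small sketch with hypothetical counts. The numbers are invented for illustration (1,000 cases at a 1% base rate) and are not the figures from the original worked example.

```python
# Hypothetical counts: 1,000 cases, 10 real positives (1% base rate).
tp, fn, fp, tn = 6, 4, 12, 978

total = tp + fn + fp + tn
accuracy = (tp + tn) / total   # 984 / 1000
precision = tp / (tp + fp)     # 6 / 18
recall = tp / (tp + fn)        # 6 / 10

# "Negative always" baseline: every real positive is missed,
# every real negative is counted as correct.
baseline_accuracy = (tn + fp) / total  # 990 / 1000
baseline_recall = 0 / (tp + fn)        # catches nothing

print(f"model:    accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
print(f"baseline: accuracy={baseline_accuracy:.3f} recall={baseline_recall:.3f}")
```

Under these assumed counts the do-nothing baseline beats the model on accuracy (0.990 vs 0.984) while catching zero positives, and the model's precision is only one in three.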
Threshold-free views
- ROC curve = TPR vs FPR across thresholds; AUC-ROC = P(model ranks a random positive above a random negative). Insensitive to base rate — which is exactly the problem: it can look healthy while precision is abysmal because negatives vastly outnumber positives.
- Precision–Recall curve + AUC-PR: PR is the right lens under heavy imbalance because a random classifier's AUC-PR equals the positive rate (about 0.01 to 0.03 here), whereas its ROC-AUC is 0.5 at any base rate. Prefer PR for 1–3% conversions (Heavy-tailed distributions).
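The base-rate point can be checked with a toy sketch. The scores below are hypothetical, and the helper names `roc_auc` and `precision_at` are introduced here for illustration only. The negatives are replicated 50 times so that their score distribution stays the same while the positive rate drops to roughly 1.6%.

```python
def roc_auc(pos_scores, neg_scores):
    """P(random positive is ranked above random negative); ties count as half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def precision_at(threshold, pos_scores, neg_scores):
    """Precision when everything scoring at or above `threshold` is flagged."""
    tp = sum(s >= threshold for s in pos_scores)
    fp = sum(s >= threshold for s in neg_scores)
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical model scores, invented for illustration.
pos = [0.9, 0.8, 0.6, 0.4]
neg = [0.7, 0.5, 0.3, 0.2, 0.1]

# Same negative score distribution, but 50x as many negatives.
neg_imbalanced = neg * 50

auc_balanced = roc_auc(pos, neg)
auc_imbalanced = roc_auc(pos, neg_imbalanced)
prec_balanced = precision_at(0.55, pos, neg)
prec_imbalanced = precision_at(0.55, pos, neg_imbalanced)

print(f"AUC-ROC:   balanced={auc_balanced:.3f}  imbalanced={auc_imbalanced:.3f}")
print(f"precision: balanced={prec_balanced:.3f}  imbalanced={prec_imbalanced:.3f}")
```

In this sketch the AUC-ROC is identical in both settings, while precision at the same threshold falls from 0.75 to about 0.06. That is the sense in which ROC can look healthy while precision is abysmal.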
Probabilities, not just labels: proper scoring rules
A (strictly) proper scoring rule is minimized in expectation only by reporting the true probabilities, so it rewards honest, calibrated outputs and can't be gamed.
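- Log-loss $= -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$: this is cross-entropy (Cross-entropy and KL divergence).
- Brier score $= \frac{1}{N}\sum_{i=1}^{N}(p_i - y_i)^2$.
Here $y_i \in \{0, 1\}$ is the true label and $p_i$ is the predicted probability of the positive class. Accuracy is not proper (it ignores confidence). If a decision uses the probability (ranking, thresholds, expected value), score the probability.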
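A minimal sketch of both scores follows, along with a grid check of the "can't be gamed" property for the Brier score. The clipping constant `eps`, the helper `expected_brier`, and the choice of a true probability of 0.9 are assumptions for illustration.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


def expected_brier(true_p, reported_q):
    """Expected Brier score when the event truly occurs with probability `true_p`."""
    return true_p * (1 - reported_q) ** 2 + (1 - true_p) * reported_q ** 2


# Propriety check on a grid: if the true probability is 0.9,
# which reported probability q gives the lowest expected score?
grid = [i / 100 for i in range(1, 100)]
best_q = min(grid, key=lambda q: expected_brier(0.9, q))
print(best_q)
```

On this grid the minimizer is the true probability itself: shading the report toward 0.5 or toward 1.0 only raises the expected penalty.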
Calibration
A model is calibrated if, among the cases it scores at 0.7, about 70% are actually positive. Check with a reliability diagram (predicted-probability bins vs. observed frequency). A model can have great AUC yet be badly miscalibrated, which is fatal when the business acts on the probability (budgets, bids, expected revenue).
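The data behind a reliability diagram can be sketched as follows. Equal-width bins and the count-weighted summary (often called expected calibration error) are one common choice and an assumption here; the note itself does not prescribe a binning scheme.

```python
def reliability_bins(y_true, p_pred, n_bins=10):
    """Group predictions into equal-width probability bins.

    Returns one (mean_predicted, observed_frequency, count) tuple per non-empty bin.
    """
    sums = [[0.0, 0, 0] for _ in range(n_bins)]  # [sum of p, positives, count]
    for y, p in zip(y_true, p_pred):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        sums[idx][0] += p
        sums[idx][1] += y
        sums[idx][2] += 1
    return [(s / c, pos / c, c) for s, pos, c in sums if c]


def expected_calibration_error(y_true, p_pred, n_bins=10):
    """Count-weighted average gap between predicted and observed rates across bins."""
    n = len(y_true)
    return sum(
        c / n * abs(mean_p - freq)
        for mean_p, freq, c in reliability_bins(y_true, p_pred, n_bins)
    )
```

Plotting `mean_predicted` against `observed_frequency` for each bin gives the diagram; a calibrated model sits on the diagonal. Bins with few cases are noisy, so the per-bin count is returned alongside.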
Why this matters here
- Pick the metric that matches the business cost of FP vs FN, and the base rate. Wrong metric → confident nonsense.
- Split by user & time, never by row — otherwise the score is leakage, not skill (Leakage checklist).
- Offline metric ≠ business value: a great log-loss can still die in the A/B test (Online vs offline gap, Offline→online checklist).
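One way to read "split by user & time" is sketched below. The row keys `"user_id"` and `"ts"`, the helper names, and the bucket scheme are all hypothetical. This version trains on earlier events from one set of users and tests on later events from a disjoint set, dropping everything else, which is stricter than some setups require.

```python
import hashlib


def user_bucket(user_id, n_buckets=10):
    """Stable bucket for a user; hashlib is used because built-in hash() of str varies per process."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def split_by_user_and_time(rows, cutoff, test_buckets=(0, 1)):
    """Train on earlier events from train users; test on later events from held-out users.

    `rows` is assumed to be an iterable of dicts with hypothetical keys "user_id" and "ts".
    Rows that are neither (early, train user) nor (late, held-out user) are dropped.
    """
    train, test = [], []
    for row in rows:
        held_out = user_bucket(row["user_id"]) in test_buckets
        if row["ts"] < cutoff and not held_out:
            train.append(row)
        elif row["ts"] >= cutoff and held_out:
            test.append(row)
    return train, test
```

A random row-level split would instead put the same user, and events from the same period, on both sides, so the test score would partly measure memorization.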
By-hand exercise (meets the bar)
- From a confusion matrix at a 1% positive rate, compute precision, recall, accuracy. Notice how poor precision can be despite 98%+ accuracy.
- Two models: A has AUC 0.85 but says "0.5" for everything that's actually 0.9; B has AUC 0.80 but is calibrated. Which do you ship if you bid real money on the probability? (B: calibration.)
Links
- Built on: Information theory basics · Cross-entropy and KL divergence
- Guards: Leakage checklist · Offline→online checklist · used by every experiment note