Precision / Recall / Prevalence @ Scale Playground

Fix a model's threshold (and hence its TPR and FPR), then see how precision and false-positive counts change when you deploy to larger areas and/or to regions where the positive class is rarer.

What's going on when "precision looks great" on a balanced set, but you see many false positives when you run your model at scale?

Let \(\pi = P(Y=1)\) be the true prevalence of a class you care about in the region you evaluate on, and let \(\text{TPR} = P(\hat Y=1 \mid Y=1)\) and \(\text{FPR} = P(\hat Y=1 \mid Y=0)\) be properties of the model you trained for detecting this class at a fixed threshold. Precision is \(\text{Prec}=P(Y=1\mid \hat Y=1)\). By Bayes' rule:

\[ \text{Prec}(\pi) = \frac{\text{TPR}\,\pi}{\text{TPR}\,\pi + \text{FPR}\,(1-\pi)}. \]
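To make this concrete, here is a minimal Python sketch of the formula (the function name `precision` is mine), evaluated at the playground's default operating point of TPR = 0.90 and FPR = 0.001:

```python
def precision(pi: float, tpr: float, fpr: float) -> float:
    """Precision as a function of prevalence pi, via Bayes' rule."""
    return (tpr * pi) / (tpr * pi + fpr * (1 - pi))

# Playground defaults: TPR = 0.90, FPR = 0.001.
for pi in (0.5, 0.01, 0.001):
    print(f"pi = {pi:<6} precision = {precision(pi, 0.90, 0.001):.3f}")

# pi = 0.5    precision = 0.999
# pi = 0.01   precision = 0.901
# pi = 0.001  precision = 0.474
```

With these settings, precision falls from about 0.999 at \(\pi = 0.5\) to roughly 0.47 at \(\pi = 0.001\), even though the model itself hasn't changed.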

Two different "scale" effects get mixed together:

(A) Base-rate (prevalence) effect: Precision depends on \(\pi\). If you move from a balanced evaluation (\(\pi=0.5\)) to a deployment region where \(\pi\) is tiny, precision can drop sharply unless \(\text{FPR}\) is extremely small.

(B) Volume (area) effect: The count of false positives depends on how many negatives you scan. If you deploy to an area \(k\) times larger than your validation set of \(N_0\) samples, the deployment region contains about \(k\,N_0\) samples, of which roughly \((1-\pi)\,k\,N_0\) are negatives, each false-alarming with probability \(\text{FPR}\). So expected false positives scale linearly in \(k\):

\[ \mathbb{E}[\text{FP}] \approx \text{FPR}\,(1-\pi)\,k\,N_0. \]

Notice that \(\mathbb{E}[\text{FP}]\) grows linearly with deployment area, even if precision stays unchanged. So you can have a model that "looks good" on balanced tests but still produces a painful number of false alarms when deployment is huge and the landscape is mostly negative.
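Continuing the sketch with the playground's defaults (FPR = 0.001, deployment \(\pi = 0.01\), \(N_0 = 1000\); the helper name is mine):

```python
def expected_fp(fpr: float, pi: float, k: float, n0: int) -> float:
    """Expected false-positive count when scanning k * n0 samples at prevalence pi."""
    return fpr * (1 - pi) * k * n0

# Precision is constant in k, but E[FP] grows linearly with it.
for k in (1, 10, 100, 1000):
    print(f"k = {k:<5} E[FP] ~= {expected_fp(0.001, 0.01, k, 1000):.0f}")

# k = 1     E[FP] ~= 1
# k = 10    E[FP] ~= 10
# k = 100   E[FP] ~= 99
# k = 1000  E[FP] ~= 990
```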

In this playground, the validation set is fixed at prevalence \(\pi_0=0.5\) and size \(N_0=1000\). You can pick a TPR and FPR, then simulate scaling the model over a larger deployment area and optionally change the deployment prevalence \(\pi\) to see how precision and false positives change.
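If you want to reproduce the playground's numbers offline, one plausible scheme (an assumption on my part; the playground's internal sampling may differ) is to draw the confusion counts from binomials:

```python
import numpy as np

def simulate(n: int, pi: float, tpr: float, fpr: float, seed: int = 0):
    """Simulate precision, recall, and FP count for n samples at prevalence pi."""
    rng = np.random.default_rng(seed)
    n_pos = rng.binomial(n, pi)    # true positives present in the region
    n_neg = n - n_pos
    tp = rng.binomial(n_pos, tpr)  # positives the model detects
    fp = rng.binomial(n_neg, fpr)  # negatives that false-alarm
    prec = tp / (tp + fp) if tp + fp else float("nan")
    rec = tp / n_pos if n_pos else float("nan")
    return prec, rec, fp

# Validation: pi0 = 0.5, N0 = 1000.  Deployment: pi = 0.01, 1000x larger.
print(simulate(1000, 0.5, 0.90, 0.001))
print(simulate(1000 * 1000, 0.01, 0.90, 0.001))
```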

Model (threshold-fixed)
TPR: 0.90
FPR: 0.0010
When positives are rare, precision is often dominated by FPR.

Deployment scale
Scale factor: 1000x (how many times larger deployment is than validation, from 1x to 10000x)
Validation area: ~1,750 km² (about 1.3x Phoenix, AZ); the deployment area is this times the scale factor.

Deployment prevalence
pi: 0.0100 (fraction of true positives in the deployment region)
Slider uses a log scale: -2 = 0.01, -3 = 0.001, -4 = 0.0001.
Readouts: two panels, Validation (pi=0.5, N=1000) and Deployment (pi, N x scale), each reporting precision, recall (= TPR), and the false-positive count. Values update live in the playground.

Charts: precision vs prevalence, and false positives vs deployment scale.
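Both charts follow directly from the formulas above; here is a hedged matplotlib sketch (axis ranges and scales are my choices, not necessarily the playground's):

```python
import numpy as np
import matplotlib.pyplot as plt

TPR, FPR, N0 = 0.90, 0.001, 1000

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Left: precision vs prevalence, pi on a log axis like the playground slider.
pi = np.logspace(-4, 0, 200)
ax1.semilogx(pi, TPR * pi / (TPR * pi + FPR * (1 - pi)))
ax1.set(xlabel="prevalence pi", ylabel="precision", title="Precision vs prevalence")

# Right: expected false positives vs deployment scale at fixed pi = 0.01.
k = np.logspace(0, 4, 200)
ax2.loglog(k, FPR * (1 - 0.01) * k * N0)
ax2.set(xlabel="deployment scale k", ylabel="E[FP]",
        title="False positives vs deployment scale")

plt.tight_layout()
plt.show()
```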