Precision / Recall / Prevalence @ Scale Playground

Fix a model's threshold (and hence its TPR and FPR), then see how precision and false-positive counts change when you deploy to larger areas and/or to regions where the positive class is rarer.

What's going on when "precision looks great" on a balanced set, but you see many false positives when you run your model at scale?

Let \(\pi = P(Y=1)\) be the true prevalence of a class you care about in the region you evaluate on, and let \(\text{TPR} = P(\hat Y=1 \mid Y=1)\) and \(\text{FPR} = P(\hat Y=1 \mid Y=0)\) be properties of the model you trained for detecting this class at a fixed threshold. Precision is \(\text{Prec}=P(Y=1\mid \hat Y=1)\). By Bayes' rule:

\[ \text{Prec}(\pi) = \frac{\text{TPR}\,\pi}{\text{TPR}\,\pi + \text{FPR}\,(1-\pi)}. \]
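To make this concrete, here is a minimal Python sketch of the formula (the function name `precision` is mine), evaluated at the playground's default operating point of TPR = 0.90 and FPR = 0.001:

```python
def precision(pi: float, tpr: float, fpr: float) -> float:
    """Precision as a function of prevalence pi, via Bayes' rule."""
    return (tpr * pi) / (tpr * pi + fpr * (1 - pi))

# Playground defaults: TPR = 0.90, FPR = 0.001.
for pi in (0.5, 0.01, 0.001):
    print(f"pi = {pi:<6} precision = {precision(pi, 0.90, 0.001):.3f}")

# pi = 0.5    precision = 0.999
# pi = 0.01   precision = 0.901
# pi = 0.001  precision = 0.474
```

With these settings, precision falls from about 0.999 at \(\pi = 0.5\) to roughly 0.47 at \(\pi = 0.001\), even though the model itself hasn't changed.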

Two different "scale" effects get mixed together:

(A) Base-rate (prevalence) effect: Precision depends on \(\pi\). If you move from a balanced evaluation (\(\pi=0.5\)) to a deployment region where \(\pi\) is tiny, precision can drop sharply unless \(\text{FPR}\) is extremely small.

(B) Volume (area) effect: The count of false positives depends on how many negatives you scan. If you deploy to an area \(k\) times larger than your validation set of \(N_0\) samples, the deployment region contains about \(k\,N_0\) samples, of which roughly \((1-\pi)\,k\,N_0\) are negatives, each false-alarming with probability \(\text{FPR}\). So expected false positives scale linearly in \(k\):

\[ \mathbb{E}[\text{FP}] \approx \text{FPR}\,(1-\pi)\,k\,N_0. \]

Notice that \(\mathbb{E}[\text{FP}]\) grows linearly with deployment area, even if precision stays unchanged. So you can have a model that "looks good" on balanced tests but still produces a painful number of false alarms when deployment is huge and the landscape is mostly negative.
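Continuing the sketch with the playground's defaults (FPR = 0.001, deployment \(\pi = 0.01\), \(N_0 = 1000\); the helper name is mine):

```python
def expected_fp(fpr: float, pi: float, k: float, n0: int) -> float:
    """Expected false-positive count when scanning k * n0 samples at prevalence pi."""
    return fpr * (1 - pi) * k * n0

# Precision is constant in k, but E[FP] grows linearly with it.
for k in (1, 10, 100, 1000):
    print(f"k = {k:<5} E[FP] ~= {expected_fp(0.001, 0.01, k, 1000):.0f}")

# k = 1     E[FP] ~= 1
# k = 10    E[FP] ~= 10
# k = 100   E[FP] ~= 99
# k = 1000  E[FP] ~= 990
```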

In this playground, the validation set is fixed at prevalence \(\pi_0=0.5\) and size \(N_0=1000\). You can pick a TPR and FPR, then simulate scaling the model over a larger deployment area and optionally change the deployment prevalence \(\pi\) to see how precision and false positives change.
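If you want to reproduce the playground's numbers offline, one plausible scheme (an assumption on my part; the playground's internal sampling may differ) is to draw the confusion counts from binomials:

```python
import numpy as np

def simulate(n: int, pi: float, tpr: float, fpr: float, seed: int = 0):
    """Simulate precision, recall, and FP count for n samples at prevalence pi."""
    rng = np.random.default_rng(seed)
    n_pos = rng.binomial(n, pi)    # true positives present in the region
    n_neg = n - n_pos
    tp = rng.binomial(n_pos, tpr)  # positives the model detects
    fp = rng.binomial(n_neg, fpr)  # negatives that false-alarm
    prec = tp / (tp + fp) if tp + fp else float("nan")
    rec = tp / n_pos if n_pos else float("nan")
    return prec, rec, fp

# Validation: pi0 = 0.5, N0 = 1000.  Deployment: pi = 0.01, 1000x larger.
print(simulate(1000, 0.5, 0.90, 0.001))
print(simulate(1000 * 1000, 0.01, 0.90, 0.001))
```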

Model (threshold-fixed)
TPR: 0.90
FPR: 0.0010
When positives are rare, precision is often dominated by FPR.

Deployment scale
Scale factor: 1000x (how many times larger deployment is than validation, from 1x to 10000x)
Validation area: ~1,750 km² (about 1.3x Phoenix, AZ); the deployment area is this times the scale factor.

Deployment prevalence
pi: 0.0100 (fraction of true positives in the deployment region)
Slider uses a log scale: -2 = 0.01, -3 = 0.001, -4 = 0.0001.
Readouts: two panels, Validation (pi=0.5, N=1000) and Deployment (pi, N x scale), each reporting precision, recall (= TPR), and the false-positive count. Values update live in the playground.

Charts: precision vs prevalence, and false positives vs deployment scale.
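Both charts follow directly from the formulas above; here is a hedged matplotlib sketch (axis ranges and scales are my choices, not necessarily the playground's):

```python
import numpy as np
import matplotlib.pyplot as plt

TPR, FPR, N0 = 0.90, 0.001, 1000

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Left: precision vs prevalence, pi on a log axis like the playground slider.
pi = np.logspace(-4, 0, 200)
ax1.semilogx(pi, TPR * pi / (TPR * pi + FPR * (1 - pi)))
ax1.set(xlabel="prevalence pi", ylabel="precision", title="Precision vs prevalence")

# Right: expected false positives vs deployment scale at fixed pi = 0.01.
k = np.logspace(0, 4, 200)
ax2.loglog(k, FPR * (1 - 0.01) * k * N0)
ax2.set(xlabel="deployment scale k", ylabel="E[FP]",
        title="False positives vs deployment scale")

plt.tight_layout()
plt.show()
```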