voidly
Sentinel · 30-day backtest

The honest calibration plot

When the Sentinel model says “5% risk,” does the real outcome actually happen ~5% of the time? Below is the answer: 900 live (predicted, observed) pairs from the last 30 days, binned into a standard reliability diagram.

Updated every 30 min · last refresh Jun 15, 2026 · CC BY 4.0 · Binned JSON · Raw outcomes

Brier score: 0.246 (lower is better)
Calibration MAE: 0.221 (0 = perfect)
Accuracy: 53.7% (900 evaluated)
F1 (binary 0.5): 0.40 (P=0.32, R=0.52)

Reliability diagram

Each point is one prediction bin. X axis is the mean predicted probability inside the bin; Y axis is the fraction of those forecasts where the real outcome actually happened. Perfect calibration is the diagonal line — points above the line mean the model UNDER-estimates risk; points below mean it OVER-estimates.
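The binning described above can be sketched in a few lines. This is a minimal illustration, not the Sentinel pipeline's actual code: the function name `reliability_bins`, the ten equal-width bins, and the choice to fold a prediction of exactly 1.0 into the last bin are all assumptions made for the example.

```python
from collections import defaultdict


def reliability_bins(predicted, observed, n_bins=10):
    """Group (probability, 0/1 outcome) pairs into equal-width bins.

    Hypothetical helper: returns one dict per non-empty bin with the
    bin's mean predicted probability, observed positive rate, and count.
    """
    grouped = defaultdict(list)
    for p, y in zip(predicted, observed):
        # Assumption: p == 1.0 is folded into the last bin rather than dropped.
        idx = min(int(p * n_bins), n_bins - 1)
        grouped[idx].append((p, y))

    rows = []
    for idx in sorted(grouped):
        pairs = grouped[idx]
        n = len(pairs)
        mean_pred = sum(p for p, _ in pairs) / n
        obs_rate = sum(y for _, y in pairs) / n
        rows.append({
            "bin": (idx / n_bins, (idx + 1) / n_bins),
            "predicted_mean": mean_pred,   # X coordinate of the point
            "observed_rate": obs_rate,     # Y coordinate of the point
            "delta": obs_rate - mean_pred, # > 0: risk was under-estimated
            "n": n,                        # drives the bubble area
        })
    return rows
```

Each returned row corresponds to one point on the diagram: a positive `delta` puts the point above the diagonal, a negative one below it.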

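[Reliability diagram: x axis = Predicted probability (bin mean), y axis = Observed positive rate, both from 0.00 to 1.00; the diagonal marks perfect calibration, and each bubble is labeled with its bin count (n=711, 69, 2, 7, 23, 8, 31, 7, 17, 25).]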

Bubble area scales with bin count · Red = model under-estimated · Blue = model over-estimated

Per-bin breakdown

Bin          Predicted mean   Observed rate   Δ        n
[0.0, 0.1)   0.048            0.274           +0.227   711
[0.1, 0.2)   0.152            0.203           +0.051   69
[0.2, 0.3)   0.266            0.000           -0.266   2
[0.3, 0.4)   0.326            0.714           +0.388   7
[0.4, 0.5)   0.452            0.435           -0.017   23
[0.5, 0.6)   0.578            0.000           -0.578   8
[0.6, 0.7)   0.638            0.226           -0.412   31
[0.7, 0.8)   0.761            0.143           -0.618   7
[0.8, 0.9)   0.841            0.412           -0.430   17
[0.9, 1.0)   0.948            1.000           +0.052   25

Δ = observed − predicted. The [0.0, 0.1) bin (the “0.1 bin”) holds 711 of the 900 forecasts (79%), so it dominates any count-weighted aggregate; it is also where the May 20 isotonic recalibration was aimed. See /sentinel/calibration for the time-series view of how this gap evolves day over day.
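As a quick arithmetic check on how much the largest bin matters, the per-bin figures can be averaged two ways. The sketch below uses only the rounded values from the per-bin table; the variable names are made up for the example, and because the inputs are rounded to three decimals the results agree with the published figures only to within rounding.

```python
# (bin label, predicted mean, observed rate, n), copied from the per-bin table.
BINS = [
    ("[0.0, 0.1)", 0.048, 0.274, 711),
    ("[0.1, 0.2)", 0.152, 0.203, 69),
    ("[0.2, 0.3)", 0.266, 0.000, 2),
    ("[0.3, 0.4)", 0.326, 0.714, 7),
    ("[0.4, 0.5)", 0.452, 0.435, 23),
    ("[0.5, 0.6)", 0.578, 0.000, 8),
    ("[0.6, 0.7)", 0.638, 0.226, 31),
    ("[0.7, 0.8)", 0.761, 0.143, 7),
    ("[0.8, 0.9)", 0.841, 0.412, 17),
    ("[0.9, 1.0)", 0.948, 1.000, 25),
]

total = sum(n for *_, n in BINS)

# Absolute gap per bin, weighted by how many forecasts fall in the bin.
weighted_mae = sum(abs(obs - pred) * n for _, pred, obs, n in BINS) / total

# Same gaps, but every bin counts equally regardless of size.
unweighted_mae = sum(abs(obs - pred) for _, pred, obs, _ in BINS) / len(BINS)

# Share of all forecasts that sit in the lowest bin.
share_lowest = BINS[0][3] / total

print(f"total={total} weighted={weighted_mae:.3f} "
      f"unweighted={unweighted_mae:.3f} lowest-bin share={share_lowest:.0%}")
```

The count-weighted average comes out at roughly 0.22, in line with the Calibration MAE shown at the top of the page, while the unweighted average is closer to 0.30. The difference is almost entirely the [0.0, 0.1) bin pulling the weighted figure toward its own gap.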

Per-country backtest (worst Brier first, n ≥ 5)

Countries where the forecast is currently performing worst — useful for targeting feature engineering or seeking expert review.

Country       Code   Brier   Accuracy   P       R      n    Pos rate
Bangladesh    BD     0.000   0%         0.50    0.27   30   0%
Brazil        BR     0.000   0%         0.20    0.14   30   0%
Belarus       BY     0.000   0%         0.36    0.24   30   0%
China         CN     0.000   0%         0.07    1.00   30   0%
Cuba          CU     0.000   0%         0.14    0.09   30   0%
Egypt         EG     0.000   0%         0.66    0.95   30   0%
ER            ER     0.000   0%         0.00†          30   0%
Ethiopia      ET     0.000   0%         0.15    1.00   30   0%
Indonesia     ID     0.000   0%         0.42    0.83   30   0%
India         IN     0.000   0%         0.93    0.46   30   0%
Iran          IR     0.000   0%         0.90    0.67   30   0%
North Korea   KP     0.000   0%         0.00†          30   0%
Kazakhstan    KZ     0.000   0%         0.09    0.14   30   0%
Lebanon       LB     0.000   0%         0.00†          30   0%
Myanmar       MM     0.000   0%         0.00    0.00   30   0%
Malaysia      MY     0.000   0%         0.00    0.00   30   0%
Nigeria       NG     0.000   0%         0.00†          30   0%
Nicaragua     NI     0.000   0%         0.07    0.20   30   0%
Philippines   PH     0.000   0%         0.00    0.00   30   0%
Pakistan      PK     0.000   0%         1.00    0.77   30   0%

† Only one of P and R is shown for this row; which of the two columns it belongs to is not recoverable.

How to read these numbers

  • Brier score: mean squared error between the predicted probability and the actual 0/1 outcome. Lower is better. Below 0.10 is excellent; 0.10 to 0.30 is OK; above 0.30 is concerning.
  • Calibration MAE: average absolute gap between predicted mean and observed rate across bins. 0.00 means the predicted mean matches the observed rate in every bin.
  • Reliability diagram: the visual version of calibration MAE. Bubble size = bin sample count.
  • F1, precision (P), and recall (R): binary classification metrics at the 0.5 threshold. Useful when downstream decisions are binary (alert / no-alert).
  • The May 20, 2026 isotonic recalibration targeted the 0.1 bin specifically; see the recalibration finding.
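The definitions above translate directly into code. The following is a sketch under stated assumptions rather than Sentinel's implementation: the function names are hypothetical, a prediction exactly at the threshold is counted as an alert, and precision or recall with a zero denominator is reported as 0.0.

```python
def brier_score(predicted, observed):
    """Mean squared error between probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(predicted, observed)) / len(predicted)


def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def binary_metrics(predicted, observed, threshold=0.5):
    """Accuracy, precision, recall, and F1 after thresholding.

    Assumption: p >= threshold counts as an alert.
    """
    tp = fp = fn = tn = 0
    for p, y in zip(predicted, observed):
        alert = p >= threshold
        if alert and y == 1:
            tp += 1
        elif alert and y == 0:
            fp += 1
        elif not alert and y == 1:
            fn += 1
        else:
            tn += 1

    # Assumption: undefined ratios (zero denominator) are reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1_from_pr(precision, recall),
    }
```

As a consistency check on the headline numbers, `f1_from_pr(0.32, 0.52)` evaluates to about 0.40, which agrees with the F1 reported alongside P=0.32 and R=0.52.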

Related