Sentinel · 30-day backtest

The honest calibration plot

When the Sentinel model says “5% risk,” does the real outcome actually happen ~5% of the time? Below is the answer: 900 live (predicted, observed) pairs from the last 30 days, binned into a standard reliability diagram.

Updated every 30 min · last refresh Jun 15, 2026 · CC BY 4.0 · Binned JSON · Raw outcomes

Metric	Value	Note
Brier score	0.246	lower is better
Calibration MAE	0.221	0 = perfect
Accuracy	53.7%	900 evaluated
F1 (binary 0.5)	0.40	P=0.32 R=0.52

Reliability diagram

Each point is one prediction bin. The x-axis is the mean predicted probability inside the bin; the y-axis is the fraction of those forecasts where the real outcome actually happened. Perfect calibration is the diagonal line. Points above the line mean the model under-estimates risk; points below mean it over-estimates.

Bubble area scales with bin count · Red = model under-estimated · Blue = model over-estimated
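The binning behind a diagram like this can be sketched in a few lines. This is a minimal illustration, not Sentinel's actual pipeline: the `Bin` class, the `reliability_bins` function, the choice of ten equal-width bins, and the handling of a prediction of exactly 1.0 are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Bin:
    lo: float         # inclusive lower edge of the bin
    hi: float         # exclusive upper edge of the bin
    mean_pred: float  # x coordinate on the reliability diagram
    obs_rate: float   # y coordinate on the reliability diagram
    n: int            # number of forecasts in the bin (bubble area)


def reliability_bins(preds, outcomes, n_bins=10):
    """Group (predicted, observed) pairs into equal-width probability bins."""
    buckets = [[] for _ in range(n_bins)]
    for p, y in zip(preds, outcomes):
        # Assumption: a prediction of exactly 1.0 is folded into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        buckets[idx].append((p, y))

    rows = []
    for i, bucket in enumerate(buckets):
        if not bucket:
            continue  # an empty bin contributes no point to the diagram
        n = len(bucket)
        rows.append(Bin(
            lo=i / n_bins,
            hi=(i + 1) / n_bins,
            mean_pred=sum(p for p, _ in bucket) / n,
            obs_rate=sum(y for _, y in bucket) / n,
            n=n,
        ))
    return rows


# Toy data, not taken from the Sentinel backtest.
rows = reliability_bins([0.05, 0.07, 0.15, 0.95], [0, 1, 0, 1])
```

A point sits above the diagonal whenever `obs_rate > mean_pred`, which is the under-estimation case described above.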

Per-bin breakdown

Bin	Predicted mean	Observed rate	Δ	n
[0.0, 0.1)	0.048	0.274	+0.227	711
[0.1, 0.2)	0.152	0.203	+0.051	69
[0.2, 0.3)	0.266	0.000	-0.266	2
[0.3, 0.4)	0.326	0.714	+0.388	7
[0.4, 0.5)	0.452	0.435	-0.017	23
[0.5, 0.6)	0.578	0.000	-0.578	8
[0.6, 0.7)	0.638	0.226	-0.412	31
[0.7, 0.8)	0.761	0.143	-0.618	7
[0.8, 0.9)	0.841	0.412	-0.430	17
[0.9, 1.0)	0.948	1.000	+0.052	25

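Δ = observed − predicted. The [0.0, 0.1) bin holds 711 of the 900 forecasts (79%), so it dominates the aggregate calibration numbers, and it is the bin the May 20 isotonic recalibration was aimed at. Inside it the model predicts 4.8% on average while the outcome occurred 27.4% of the time. See /sentinel/calibration for the time-series view of how this gap evolves day over day.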
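As a cross-check, the headline calibration MAE can be recomputed from the per-bin table. The sketch below copies the Δ and n columns as published; the function names are made up for the example, and the count weighting is inferred from the numbers rather than stated by the page.

```python
# (bin label, delta = observed - predicted, n), copied from the per-bin table.
BINS = [
    ("[0.0, 0.1)", +0.227, 711),
    ("[0.1, 0.2)", +0.051, 69),
    ("[0.2, 0.3)", -0.266, 2),
    ("[0.3, 0.4)", +0.388, 7),
    ("[0.4, 0.5)", -0.017, 23),
    ("[0.5, 0.6)", -0.578, 8),
    ("[0.6, 0.7)", -0.412, 31),
    ("[0.7, 0.8)", -0.618, 7),
    ("[0.8, 0.9)", -0.430, 17),
    ("[0.9, 1.0)", +0.052, 25),
]


def weighted_calibration_mae(bins):
    """Mean |delta| where each bin counts in proportion to its sample size."""
    total = sum(n for _, _, n in bins)
    return sum(abs(delta) * n for _, delta, n in bins) / total


def unweighted_calibration_mae(bins):
    """Mean |delta| where every bin counts equally, regardless of size."""
    return sum(abs(delta) for _, delta, _ in bins) / len(bins)


total_n = sum(n for _, _, n in BINS)
weighted = weighted_calibration_mae(BINS)
unweighted = unweighted_calibration_mae(BINS)
print(total_n, round(weighted, 3), round(unweighted, 3))  # 900 0.221 0.304
```

Only the count-weighted version matches the published 0.221, which suggests the headline figure is driven almost entirely by the [0.0, 0.1) bin.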

Per-country backtest (worst Brier first, n ≥ 5)

Countries where the forecast currently performs worst. Useful for targeting feature engineering or seeking expert review.

Country	Code	Accuracy	Precision	Recall	n	Pos rate
Bangladesh	BD	0%	0.50	0.27	30	0%
Brazil	BR	0%	0.20	0.14	30	0%
Belarus	BY	0%	0.36	0.24	30	0%
China	CN	0%	0.07	1.00	30	0%
Cuba	CU	0%	0.14	0.09	30	0%
Egypt	EG	0%	0.66	0.95	30	0%
Eritrea	ER	0%	0.00	—	30	0%
Ethiopia	ET	0%	0.15	1.00	30	0%
Indonesia	ID	0%	0.42	0.83	30	0%
India	IN	0%	0.93	0.46	30	0%
Iran	IR	0%	0.90	0.67	30	0%
North Korea	KP	0%	0.00	—	30	0%
Kazakhstan	KZ	0%	0.09	0.14	30	0%
Lebanon	LB	0%	0.00	—	30	0%
Myanmar	MM	0%	0.00	0.00	30	0%
Malaysia	MY	0%	0.00	0.00	30	0%
Nigeria	NG	0%	0.00	—	30	0%
Nicaragua	NI	0%	0.07	0.20	30	0%
Philippines	PH	0%	0.00	0.00	30	0%
Pakistan	PK	0%	1.00	0.77	30	0%
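A per-country table of this shape could be produced along the following lines. This is a hedged sketch only: the record layout `(country_code, predicted_prob, outcome)`, the function name `per_country_metrics`, and the use of `None` for an undefined precision or recall (assumed here to correspond to the "—" cells) are illustrative choices, not the page's documented method.

```python
from collections import defaultdict


def per_country_metrics(records, threshold=0.5, min_n=5):
    """records: iterable of (country_code, predicted_prob, outcome) tuples."""
    groups = defaultdict(list)
    for code, p, y in records:
        groups[code].append((p, y))

    rows = []
    for code, pairs in groups.items():
        n = len(pairs)
        if n < min_n:
            continue  # mirrors the "n >= 5" filter in the section heading
        tp = sum(1 for p, y in pairs if p >= threshold and y == 1)
        fp = sum(1 for p, y in pairs if p >= threshold and y == 0)
        fn = sum(1 for p, y in pairs if p < threshold and y == 1)
        correct = sum(1 for p, y in pairs if (p >= threshold) == (y == 1))
        rows.append({
            "code": code,
            "n": n,
            "brier": sum((p - y) ** 2 for p, y in pairs) / n,
            "accuracy": correct / n,
            # Undefined when there are no alerts / no positives, respectively.
            "precision": tp / (tp + fp) if tp + fp else None,
            "recall": tp / (tp + fn) if tp + fn else None,
            "pos_rate": sum(y for _, y in pairs) / n,
        })
    # Worst (highest) Brier score first.
    return sorted(rows, key=lambda r: r["brier"], reverse=True)


# Toy records with made-up country codes, not Sentinel data.
toy = ([("AA", 0.9, 0)] * 5
       + [("BB", 0.1, 0)] * 4 + [("BB", 0.9, 1)]
       + [("CC", 0.5, 1)] * 2)
table = per_country_metrics(toy)
```

In the toy data, "AA" sorts first because confident false alarms are penalized heavily by the Brier score, and "CC" is dropped for having fewer than five forecasts.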

How to read these numbers

Brier score: mean squared error between the predicted probability and the actual 0/1 outcome. Lower is better. Below 0.10 is excellent; 0.10 to 0.30 is OK; above 0.30 is concerning. For reference, always forecasting 0.5 scores exactly 0.25.
Calibration MAE: average absolute gap between predicted mean and observed rate across bins, weighted by bin count (the per-bin table reproduces the headline 0.221 only with that weighting). 0.00 means every bin's mean prediction matches its observed rate.
Reliability diagram: the visual version of calibration MAE. Bubble size = bin sample count.
F1, P, R: F1 score, precision, and recall at the 0.5 threshold. Useful when downstream decisions are binary (alert / no-alert).
The May 20, 2026 isotonic recalibration targeted the [0.0, 0.1) bin specifically; see the recalibration finding.
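The two scalar definitions above are short enough to write out. The helpers below are a minimal sketch with illustrative names (`brier_score`, `f1_score`); they follow the standard textbook formulas and are not taken from the Sentinel codebase.

```python
def brier_score(preds, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(preds, outcomes)) / len(preds)


def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# A constant 0.5 forecast scores 0.25 whatever the outcomes are.
coin_flip = brier_score([0.5] * 4, [0, 1, 1, 0])

# The headline P=0.32, R=0.52 gives roughly 0.396, displayed as 0.40.
headline_f1 = f1_score(0.32, 0.52)
```

Against that 0.25 reference, the headline Brier score of 0.246 is only marginally better than an uninformative constant forecast.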

Related

/sentinel/calibration: 90-day time series of empirical coverage vs the 90% conformal target
/methodology#validation: the three honest accuracy splits (LOCO, stratified, time-based)
/atlas/forecast/IR: per-country calibrated forecast detail with SHAP drivers