The honest calibration plot
When the Sentinel model says “5% risk,” does the real outcome actually happen ~5% of the time? Below is the answer: 900 live (predicted, observed) pairs from the last 30 days, binned into a standard reliability diagram.
Updated every 30 min · last refresh Jun 15, 2026 · CC BY 4.0 · Binned JSON · Raw outcomes
Reliability diagram
Each point is one prediction bin. X axis is the mean predicted probability inside the bin; Y axis is the fraction of those forecasts where the real outcome actually happened. Perfect calibration is the diagonal line — points above the line mean the model UNDER-estimates risk; points below mean it OVER-estimates.
Bubble area scales with bin count · Red = model under-estimated · Blue = model over-estimated
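The binning behind the diagram can be sketched as follows. This is a minimal illustration, not the Sentinel pipeline itself: the `Bin` type, the `reliability_bins` function, the ten equal-width bins, and the choice to clamp a prediction of exactly 1.0 into the last bin are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Bin:
    lo: float              # inclusive lower edge
    hi: float              # exclusive upper edge
    predicted_mean: float  # x coordinate in the diagram
    observed_rate: float   # y coordinate in the diagram
    n: int                 # drives the bubble area


def reliability_bins(pairs, n_bins=10):
    """Group (predicted, observed) pairs into equal-width probability bins.

    `pairs` is an iterable of (p, y) with p in [0, 1] and y in {0, 1}.
    Empty bins are skipped because they have no mean or rate.
    """
    buckets = [[] for _ in range(n_bins)]
    for p, y in pairs:
        # Assumption: p == 1.0 is clamped into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        buckets[idx].append((p, y))

    bins = []
    for i, bucket in enumerate(buckets):
        if not bucket:
            continue
        n = len(bucket)
        bins.append(Bin(
            lo=i / n_bins,
            hi=(i + 1) / n_bins,
            predicted_mean=sum(p for p, _ in bucket) / n,
            observed_rate=sum(y for _, y in bucket) / n,
            n=n,
        ))
    return bins
```

A point lands above the diagonal when `observed_rate > predicted_mean` (the model under-estimated) and below it otherwise.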
Per-bin breakdown
| Bin | Predicted mean | Observed rate | Δ | n |
|---|---|---|---|---|
| [0.0, 0.1) | 0.048 | 0.274 | +0.227 | 711 |
| [0.1, 0.2) | 0.152 | 0.203 | +0.051 | 69 |
| [0.2, 0.3) | 0.266 | 0.000 | -0.266 | 2 |
| [0.3, 0.4) | 0.326 | 0.714 | +0.388 | 7 |
| [0.4, 0.5) | 0.452 | 0.435 | -0.017 | 23 |
| [0.5, 0.6) | 0.578 | 0.000 | -0.578 | 8 |
| [0.6, 0.7) | 0.638 | 0.226 | -0.412 | 31 |
| [0.7, 0.8) | 0.761 | 0.143 | -0.618 | 7 |
| [0.8, 0.9) | 0.841 | 0.412 | -0.430 | 17 |
| [0.9, 1.0) | 0.948 | 1.000 | +0.052 | 25 |
Δ = observed − predicted. The [0.0, 0.1) bin holds 711 of the 900 forecasts. This is where most of the action happens, and where the May 20 isotonic recalibration was aimed. See /sentinel/calibration for the time-series view of how this gap evolves day over day.
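As a worked example, the per-bin gaps and a summary calibration error can be recomputed from the table. The tuples below are copied from the rounded table values, so a recomputed Δ can differ from the published one in the third decimal, presumably because the page rounds after subtracting. The function names are illustrative, and whether the published calibration MAE weights bins by count is not stated here, so both variants are shown.

```python
# (predicted_mean, observed_rate, n) copied from the per-bin table above.
BINS = [
    (0.048, 0.274, 711),
    (0.152, 0.203, 69),
    (0.266, 0.000, 2),
    (0.326, 0.714, 7),
    (0.452, 0.435, 23),
    (0.578, 0.000, 8),
    (0.638, 0.226, 31),
    (0.761, 0.143, 7),
    (0.841, 0.412, 17),
    (0.948, 1.000, 25),
]


def calibration_gaps(bins):
    """Delta per bin: observed rate minus predicted mean."""
    return [obs - pred for pred, obs, _ in bins]


def calibration_mae(bins, weighted=False):
    """Mean absolute gap across bins.

    weighted=False treats every bin equally; weighted=True weights each
    bin by its sample count, so sparse bins (n = 2, n = 7) count for less.
    """
    if weighted:
        total = sum(n for _, _, n in bins)
        return sum(abs(obs - pred) * n for pred, obs, n in bins) / total
    return sum(abs(obs - pred) for pred, obs, _ in bins) / len(bins)


if __name__ == "__main__":
    print(f"unweighted: {calibration_mae(BINS):.3f}")
    print(f"count-weighted: {calibration_mae(BINS, weighted=True):.3f}")
```

The count-weighted figure is dominated by the [0.0, 0.1) bin, which is one reason to report it alongside the unweighted one.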
Per-country backtest (worst Brier first, n ≥ 5)
Countries where the forecast is currently performing worst — useful for targeting feature engineering or seeking expert review.
| Country | Brier | Accuracy | Precision | Recall | n | Positive rate |
|---|---|---|---|---|---|---|
| Bangladesh (BD) | 0.000 | 0% | 0.50 | 0.27 | 30 | 0% |
| Brazil (BR) | 0.000 | 0% | 0.20 | 0.14 | 30 | 0% |
| Belarus (BY) | 0.000 | 0% | 0.36 | 0.24 | 30 | 0% |
| China (CN) | 0.000 | 0% | 0.07 | 1.00 | 30 | 0% |
| Cuba (CU) | 0.000 | 0% | 0.14 | 0.09 | 30 | 0% |
| Egypt (EG) | 0.000 | 0% | 0.66 | 0.95 | 30 | 0% |
| ER (ER) | 0.000 | 0% | 0.00 | — | 30 | 0% |
| Ethiopia (ET) | 0.000 | 0% | 0.15 | 1.00 | 30 | 0% |
| Indonesia (ID) | 0.000 | 0% | 0.42 | 0.83 | 30 | 0% |
| India (IN) | 0.000 | 0% | 0.93 | 0.46 | 30 | 0% |
| Iran (IR) | 0.000 | 0% | 0.90 | 0.67 | 30 | 0% |
| North Korea (KP) | 0.000 | 0% | 0.00 | — | 30 | 0% |
| Kazakhstan (KZ) | 0.000 | 0% | 0.09 | 0.14 | 30 | 0% |
| Lebanon (LB) | 0.000 | 0% | 0.00 | — | 30 | 0% |
| Myanmar (MM) | 0.000 | 0% | 0.00 | 0.00 | 30 | 0% |
| Malaysia (MY) | 0.000 | 0% | 0.00 | 0.00 | 30 | 0% |
| Nigeria (NG) | 0.000 | 0% | 0.00 | — | 30 | 0% |
| Nicaragua (NI) | 0.000 | 0% | 0.07 | 0.20 | 30 | 0% |
| Philippines (PH) | 0.000 | 0% | 0.00 | 0.00 | 30 | 0% |
| Pakistan (PK) | 0.000 | 0% | 1.00 | 0.77 | 30 | 0% |
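The "worst Brier first, n ≥ 5" ordering can be sketched like this. The record shape `(country_code, predicted, observed)` and the function name are assumptions for illustration; the real backtest may group and filter differently.

```python
from collections import defaultdict


def brier_by_country(records, min_n=5):
    """Return (country, brier, n) rows sorted with the worst Brier first.

    `records` is an iterable of (country_code, predicted, observed), where
    predicted is a probability and observed is 0 or 1. Countries with
    fewer than `min_n` forecasts are dropped.
    """
    squared_errors = defaultdict(list)
    for country, p, y in records:
        squared_errors[country].append((p - y) ** 2)

    rows = [
        (country, sum(errs) / len(errs), len(errs))
        for country, errs in squared_errors.items()
        if len(errs) >= min_n
    ]
    # Higher Brier means worse forecasts, so sort in descending order.
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows
```

The `min_n` filter keeps a country with one or two badly missed forecasts from topping the list on noise alone.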
How to read these numbers
- Brier score: mean squared error between the predicted probability and the actual 0/1 outcome. Lower is better. Less than 0.10 is excellent; 0.10 to 0.30 is OK; above 0.30 is concerning.
- Calibration MAE: mean absolute gap between predicted mean and observed rate across bins. 0.00 means the predicted mean matches the observed rate in every bin.
- Reliability diagram: the visual version of calibration MAE. Bubble size = bin sample count.
- P, R, and F1: precision, recall, and their harmonic mean, computed at the 0.5 threshold. Useful when downstream decisions are binary (alert / no-alert).
- The May 20, 2026 isotonic recalibration targeted the [0.0, 0.1) bin specifically; see the recalibration finding.
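The threshold metrics in the list above can be sketched as follows. This is an illustrative implementation, not the page's own code: the function name is hypothetical, and treating a prediction of exactly 0.5 as an alert is an assumption. Undefined ratios are returned as `None`, which a table could render as a dash.

```python
def precision_recall_f1(pairs, threshold=0.5):
    """Compute precision, recall, and F1 for (predicted, observed) pairs.

    A forecast counts as a positive call when predicted >= threshold
    (assumption: the boundary value itself is an alert).
    """
    tp = fp = fn = 0
    for p, y in pairs:
        called_positive = p >= threshold
        if called_positive and y == 1:
            tp += 1
        elif called_positive:
            fp += 1
        elif y == 1:
            fn += 1

    # Precision is undefined with no positive calls; recall is undefined
    # with no actual positives.
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None

    if precision is None or recall is None or precision + recall == 0:
        f1 = None
    else:
        # F1 is the harmonic mean of precision and recall.
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For instance, `precision_recall_f1([(0.9, 1), (0.8, 0), (0.2, 1), (0.1, 0)])` has one true positive, one false positive, and one false negative, so all three metrics come out at 0.5.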
Related
- /sentinel/calibration — 90-day time series of empirical coverage vs the 90% conformal target
- /methodology#validation — the three honest accuracy splits (LOCO, stratified, time-based)
- /atlas/forecast/IR — per-country calibrated forecast detail with SHAP drivers