Sentinel forecast — live calibration

Model honesty, public

Every Sentinel shutdown forecast ships with a 90% conformal interval. This page tracks how often the real outcome lands inside that interval — the closer to 90%, the more honest the model. Data lives at /v1/sentinel/calibration/history and updates every 24h.

Self-published warning: Stratified AUC overstates real-world performance by 50.8pp vs. time-based split. Do not cite the stratified number as a deployment figure; use the loco_median or the prod_rolling block once it populates.

⚡ Recent fix: On 2026-05-20 we refit isotonic regression on 810 live (predicted, observed) pairs. The base XGBoost was underestimating risk by ~15× in the dominant prediction range.

Brier: 0.5904 → 0.2231 · Calibration MAE: 0.6040 → 0.0000 · Iran 7-day risk: 0.146 → 0.74

Live numbers below catch up over the next 24h. Read the full refit writeup →
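As a rough illustration of what an isotonic refit on (predicted, observed) pairs does, here is a minimal pure-Python sketch using a pool-adjacent-violators pass. It is not the production refit code: the function names `fit_isotonic` and `calibrate` and the toy pairs are hypothetical, and the result is a simple step function rather than an interpolated curve.

```python
from bisect import bisect_left


def fit_isotonic(predicted, observed):
    """Fit a non-decreasing step map from raw score to observed rate (PAVA)."""
    blocks = []  # each block: [sum of outcomes, count, largest raw score]
    for p, y in sorted(zip(predicted, observed)):
        blocks.append([float(y), 1, p])
        # Pool with the previous block while it shares the same raw score
        # or its mean is higher (a monotonicity violation).
        while len(blocks) > 1 and (
            blocks[-2][2] == blocks[-1][2]
            or blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, top = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = top
    thresholds = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return thresholds, values


def calibrate(p, thresholds, values):
    """Map one raw score to the pooled rate of the first block that covers it."""
    i = bisect_left(thresholds, p)
    return values[min(i, len(values) - 1)]


# Hypothetical (predicted, observed) pairs, for illustration only.
raw_scores = [0.1, 0.2, 0.3, 0.4]
outcomes = [0, 1, 0, 1]
thresholds, values = fit_isotonic(raw_scores, outcomes)
print(thresholds, values)
print(calibrate(0.2, thresholds, values))
```

The out-of-order pair (0.2 → 1, 0.3 → 0) is pooled into one block with rate 0.5, which is how the fit keeps the calibrated probability from decreasing as the raw score rises.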

Calibration ≠ onset skill. This page shows the forecast probabilities are the right magnitude. It does not show the forecast can predict a new shutdown before it starts — a separate audit found it cannot. The 7-day forecast is a current-regime risk signal: honest forward-temporal AUC 0.589, and on the rows where a shutdown actually begins, AUC ~0.33 (below chance). Read the onset-skill finding →

🔁 ACI online conformal (live): Replaces manual isotonic recalibration with an online update (Gibbs & Candès, NeurIPS 2021). After every observed outcome the conformal quantile α_t nudges toward the empirical-coverage target, so calibration never drifts more than ~5pp from the 90% nominal even when the data distribution shifts.

Initial state replay (840 outcomes, Apr 17 → May 14): α = 0.10 → 0.21 · empirical coverage 91.3% · cron 03:45 UTC

Live ACI state visible in every /v1/forecast/{cc}/7day response under aci_alpha + aci.* fields. Full ACI methodology →
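The online update can be sketched with the rule from Gibbs & Candès: α_{t+1} = α_t + γ(α − err_t), where err_t is 1 when the outcome fell outside the interval and 0 otherwise. The sketch below is an illustration under assumptions, not the service's code: the step size `gamma=0.005` and the function names are hypothetical, and the replay consumes recorded hit/miss flags instead of rebuilding each interval from α_t.

```python
def aci_step(alpha_t, covered, target_alpha=0.10, gamma=0.005):
    """One ACI update: raise alpha after a hit, lower it after a miss."""
    err = 0.0 if covered else 1.0
    return alpha_t + gamma * (target_alpha - err)


def replay(covered_flags, alpha0=0.10, target_alpha=0.10, gamma=0.005):
    """Replay recorded hit/miss flags; return final alpha and empirical coverage."""
    alpha = alpha0
    for covered in covered_flags:
        alpha = aci_step(alpha, covered, target_alpha, gamma)
    coverage = sum(1 for c in covered_flags if c) / len(covered_flags)
    return alpha, coverage


# Hypothetical flags: 9 hits and 1 miss, i.e. exactly 90% coverage.
flags = [True] * 9 + [False]
final_alpha, coverage = replay(flags)
print(round(final_alpha, 6), coverage)
```

At exactly 90% coverage the hits and misses cancel and α stays put. Coverage above the target pushes α up (narrower intervals), which matches the direction of the 0.10 → 0.21 replay reported above.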

Latest coverage

91.3%

empirical (target 90%)

Latest q90

0.125

conformal width

Drift alerts (90d)

days

Model version

since Jun 7

Live forecast accuracy (prod_rolling, 30-day window)

Accuracy

53.7%

Brier score

0.25

Calibration MAE

0.22

Evaluated

900

Brier < 0.10 is good, > 0.30 is concerning. Calibration MAE < 0.05 means predicted-probabilities track observed-rates closely. See /sentinel/backtest for the actual reliability diagram (predicted-mean vs observed-rate scatter) and /methodology#validation for the full evaluation methodology + 3-split honest baselines.
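For readers who want to reproduce the two scores on their own data, here is a minimal sketch. The Brier score is the standard mean squared error between probability and 0/1 outcome. The calibration MAE is an assumption: the page does not spell out its binning, so this version uses equal-width bins and an unweighted average over non-empty bins. The sample data is hypothetical.

```python
def brier_score(probs, outcomes):
    """Mean squared gap between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def calibration_mae(probs, outcomes, n_bins=10):
    """Mean |mean predicted - observed rate| over non-empty equal-width bins.

    Assumed definition; the production metric may bin or weight differently.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    gaps = []
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        obs_rate = sum(y for _, y in members) / len(members)
        gaps.append(abs(mean_pred - obs_rate))
    return sum(gaps) / len(gaps)


# Hypothetical forecasts and outcomes, for illustration only.
probs = [0.1, 0.1, 0.8, 0.8]
outcomes = [0, 0, 1, 1]
print(round(brier_score(probs, outcomes), 4))
print(round(calibration_mae(probs, outcomes), 4))
```

As a reference point when reading the thresholds, a forecast of 0.5 on every row scores a Brier of exactly 0.25 regardless of the outcomes.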

Empirical coverage — 90-day rolling

The blue line is the actual fraction of forecasts where the real outcome landed inside the 90% conformal interval. The green dashed line is the nominal target (0.90). If the blue stays close to the green, the model is well calibrated.

Blue: empirical coverage · Dashed green: nominal 0.90 target · Orange circles: drift alerts
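The blue line's quantity can be sketched as follows. This is a minimal illustration, assuming outcomes and interval bounds are on the same scale and that each day contributes a (hits, count) pair pooled over the window; the function names and sample numbers are hypothetical.

```python
def empirical_coverage(outcomes, intervals):
    """Fraction of outcomes that fall inside their (lower, upper) interval."""
    hits = sum(1 for y, (lo, hi) in zip(outcomes, intervals) if lo <= y <= hi)
    return hits / len(outcomes)


def rolling_coverage(daily, window=90):
    """Pooled coverage per day over a trailing window.

    daily: list of (hits, n) tuples, oldest first.
    """
    series = []
    for i in range(len(daily)):
        chunk = daily[max(0, i - window + 1): i + 1]
        hits = sum(h for h, _ in chunk)
        n = sum(c for _, c in chunk)
        series.append(hits / n if n else None)
    return series


# Hypothetical data, for illustration only.
print(empirical_coverage([0.2, 0.9, 0.5], [(0.1, 0.3), (0.0, 0.4), (0.4, 0.6)]))
print(rolling_coverage([(9, 10), (8, 10), (10, 10)], window=2))
```

A value above 0.90 means the intervals are wider than they need to be; a value below means they are too narrow.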

Last 14 days

Date	Coverage	q90	n holdout	Drift?
Jun 15	91.3%	0.125	2,203	—
Jun 14	91.3%	0.125	2,203	—
Jun 13	91.3%	0.125	2,203	—
Jun 12	91.3%	0.125	2,203	—
Jun 11	91.3%	0.125	2,203	—
Jun 10	91.3%	0.125	2,203	⚠️
Jun 9	91.3%	0.125	2,203	⚠️
Jun 8	91.3%	0.125	2,203	⚠️
Jun 7	91.3%	0.125	2,203	⚠️
Jun 6	90.9%	0.077	2,181	—
Jun 5	90.9%	0.077	2,181	—
Jun 4	90.9%	0.077	2,181	⚠️
Jun 3	90.9%	0.077	2,181	⚠️
Jun 2	90.9%	0.077	2,181	⚠️

What features the model actually uses

Sklearn feature_importances_ on the underlying XGBoost. 39 features total. Top-3 sum: 0.219 · Top-5: 0.317 · Top-10: 0.493. Healthy distribution — no single feature dominates the model.

1. recent_shutdown 10.0%
2. week_of_year 6.0%
3. month 5.9%
4. high_urgency_signals_7d 5.8%
5. gdelt_unrest_30d 4.0%
6. election_in_7days 3.8%
7. high_importance_event 3.7%
8. block_rate_roll30_mean 3.6%
9. critical_incident_7d 3.4%
10. ioda_alert_7d 3.2%
11. blocked_count_roll14_mean 3.0%
12. block_rate_roll14_mean 2.9%
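The top-k sums can be rechecked from the listed shares. This sketch hard-codes the twelve rounded values shown above (the remaining 27 features are not listed on this page), so a sum can differ from the page's figure in the last digit. The helper name `top_k_share` is hypothetical.

```python
# Rounded importances copied from the list above, as fractions of 1.
importances = {
    "recent_shutdown": 0.100,
    "week_of_year": 0.060,
    "month": 0.059,
    "high_urgency_signals_7d": 0.058,
    "gdelt_unrest_30d": 0.040,
    "election_in_7days": 0.038,
    "high_importance_event": 0.037,
    "block_rate_roll30_mean": 0.036,
    "critical_incident_7d": 0.034,
    "ioda_alert_7d": 0.032,
    "blocked_count_roll14_mean": 0.030,
    "block_rate_roll14_mean": 0.029,
}


def top_k_share(scores, k):
    """Sum of the k largest importance values."""
    ranked = sorted(scores.values(), reverse=True)
    return sum(ranked[:k])


for k in (3, 5, 10):
    print(k, round(top_k_share(importances, k), 3))
```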

Interpretation: The forecast model's top feature is gdelt_unrest_30d (0.25) — protest + conflict signals from the GDELT 1.0 global news feed. recent_shutdown, block_rate rolling means, and incident counts follow. risk_tier — the leaky country-level encoding that dominated our older classifier at 85% — contributes only ~2% here. Healthy distribution; no single feature dominates.

Raw JSON: /v1/sentinel/feature-importance

Related

/methodology#validation — 3-split honest baselines (LOCO 0.91 AUC vs stratified 0.98)
/v1/sentinel/accuracy — full evaluation JSON, updated nightly
/atlas/elections — see the forecast in action (90-day upcoming elections)