voidly
Sentinel forecast — live calibration

Model honesty, public

Every Sentinel shutdown forecast ships with a 90% conformal interval. This page tracks how often the real outcome lands inside that interval — the closer to 90%, the more honest the model. Data lives at /v1/sentinel/calibration/history and updates every 24h.
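As a rough illustration of the coverage figure tracked here, the sketch below counts how often an observed outcome falls inside its forecast interval. The record layout (lower bound, upper bound, observed value) and the function name are assumptions made for the example; they are not the schema returned by /v1/sentinel/calibration/history.

```python
from typing import Iterable, Tuple


def empirical_coverage(records: Iterable[Tuple[float, float, float]]) -> float:
    """Fraction of outcomes that fall inside their [lower, upper] interval."""
    rows = list(records)
    if not rows:
        raise ValueError("need at least one record")
    hits = sum(1 for lower, upper, observed in rows if lower <= observed <= upper)
    return hits / len(rows)


# Toy data (hypothetical): (interval lower, interval upper, observed outcome).
sample = [
    (0.0, 0.3, 0.0),
    (0.1, 0.6, 1.0),  # outcome lands outside the interval
    (0.5, 1.0, 1.0),
    (0.0, 0.2, 0.0),
]

coverage = empirical_coverage(sample)
print(f"coverage = {coverage:.2f}, gap to 0.90 target = {coverage - 0.90:+.2f}")
```

A well-calibrated 90% interval should give a value near 0.90 once enough outcomes have accumulated; a four-row sample like this one is only there to show the arithmetic.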

Self-published warning: Stratified AUC overstates real-world performance by 50.8pp vs. time-based split. Do not cite the stratified number as a deployment figure; use the loco_median or the prod_rolling block once it populates.
⚡ Recent fix: On 2026-05-20 we refit isotonic regression on 810 live (predicted, observed) pairs. The base XGBoost was underestimating risk by ~15× in the dominant prediction range.
Brier: 0.5904 → 0.2231 · Calibration MAE: 0.6040 → 0.0000 · Iran 7-day risk: 0.146 → 0.74
Live numbers below catch up over the next 24h. Read the full refit writeup →
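To make the refit step concrete, here is a minimal pure-Python sketch of isotonic recalibration using the pool-adjacent-violators algorithm. It is not the production refit: the function names, the step-function lookup, and the six toy pairs are assumptions for illustration, and a library implementation may interpolate between steps instead.

```python
from bisect import bisect_left
from typing import List, Sequence, Tuple


def fit_isotonic(pred: Sequence[float], obs: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Pool-adjacent-violators fit.

    Returns (thresholds, values): values[i] is the calibrated probability for
    raw predictions up to thresholds[i]. Both lists are non-decreasing.
    """
    # Each block holds [sum of observed outcomes, count, largest raw prediction].
    blocks: List[List[float]] = []
    for p, y in sorted(zip(pred, obs)):
        blocks.append([y, 1, p])
        # Merge backwards while the previous block's mean exceeds the new one's.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, n, top = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
            blocks[-1][2] = top
    thresholds = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return thresholds, values


def calibrate(p: float, thresholds: Sequence[float], values: Sequence[float]) -> float:
    """Step-function lookup; predictions above the last threshold use the last value."""
    i = bisect_left(thresholds, p)
    return values[min(i, len(values) - 1)]


def brier(probs: Sequence[float], outcomes: Sequence[float]) -> float:
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# Toy pairs (hypothetical) in which the raw model underestimates risk.
raw = [0.02, 0.03, 0.04, 0.05, 0.06, 0.07]
observed = [0, 1, 0, 1, 1, 1]

thresholds, values = fit_isotonic(raw, observed)
recalibrated = [calibrate(p, thresholds, values) for p in raw]

brier_before = brier(raw, observed)
brier_after = brier(recalibrated, observed)
print(f"Brier before = {brier_before:.4f}, after = {brier_after:.4f} (in-sample)")
```

Both scores here are computed on the same pairs used for fitting, so the improvement is an in-sample figure; held-out outcomes are needed to judge how well the mapping generalizes.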
Calibration ≠ onset skill. This page shows the forecast probabilities are the right magnitude. It does not show the forecast can predict a new shutdown before it starts — a separate audit found it cannot. The 7-day forecast is a current-regime risk signal: honest forward-temporal AUC 0.589, and on the rows where a shutdown actually begins, AUC ~0.33 (below chance). Read the onset-skill finding →
🔁 ACI online conformal (live): Replaces manual isotonic recalibration with an online update (Gibbs & Candès, NeurIPS 2021). After every observed outcome the conformal quantile α_t nudges toward the empirical-coverage target, so calibration never drifts more than ~5pp from the 90% nominal even when the data distribution shifts.
Initial state replay (840 outcomes, Apr 17 → May 14): α = 0.10 → 0.21 · empirical coverage 91.3% · cron 03:45 UTC
Live ACI state visible in every /v1/forecast/{cc}/7day response under aci_alpha + aci.* fields. Full ACI methodology →
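The online update can be sketched with the adaptive conformal inference rule from Gibbs & Candès: α_{t+1} = α_t + γ(α − err_t), where err_t is 1 when the outcome fell outside the interval and 0 otherwise. The step size γ = 0.005, the function names, and the synthetic hit/miss sequences below are assumptions for illustration; the production step size and state handling are not stated on this page.

```python
from typing import Iterable


def aci_update(alpha_t: float, covered: bool,
               target_alpha: float = 0.10, gamma: float = 0.005) -> float:
    """One ACI step: raise alpha after a hit, lower it after a miss.

    A larger alpha means a narrower interval, so a run of hits tightens the
    interval and a miss widens it again.
    """
    err = 0.0 if covered else 1.0
    return alpha_t + gamma * (target_alpha - err)


def replay(covered_flags: Iterable[bool], alpha0: float = 0.10,
           target_alpha: float = 0.10, gamma: float = 0.005) -> float:
    """Replay a sequence of hit/miss outcomes and return the final alpha."""
    alpha = alpha0
    for covered in covered_flags:
        alpha = aci_update(alpha, covered, target_alpha, gamma)
    return alpha


# Synthetic sequences (hypothetical), 100 outcomes each.
steady = [i % 10 != 9 for i in range(100)]  # exactly 90% covered
over = [True] * 100                          # every outcome covered

alpha_steady = replay(steady)
alpha_over = replay(over)
print(f"alpha after 90% coverage: {alpha_steady:.4f}")
print(f"alpha after 100% coverage: {alpha_over:.4f}")
```

When observed coverage matches the 90% target the upward and downward steps cancel and α stays put; when intervals cover too often, α climbs and the intervals narrow. That is consistent with the direction of the α = 0.10 → 0.21 replay reported above.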
Latest coverage: 91.3% empirical (target 90%)
Latest q90: 0.125 (conformal width)
Drift alerts (90d): 16 days
Model version: v1 (since Jun 7)

Live forecast accuracy (prod_rolling, 30-day window)

Accuracy: 53.7%
Brier score: 0.25
Calibration MAE: 0.22
Evaluated: 900

Brier < 0.10 is good; > 0.30 is concerning. Calibration MAE < 0.05 means predicted probabilities track observed rates closely. See /sentinel/backtest for the reliability diagram (predicted mean vs. observed rate scatter) and /methodology#validation for the full evaluation methodology and the 3-split honest baselines.
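For readers who want to reproduce these two metrics on their own data, the sketch below computes a Brier score and a binned calibration MAE. The exact definition of Calibration MAE used on this page is not stated, so the ten equal-width bins and the unweighted average over non-empty bins are assumptions. For reference, a constant 0.5 forecast always scores a Brier of exactly 0.25.

```python
from typing import List, Sequence, Tuple


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def calibration_mae(probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10) -> float:
    """Mean |predicted mean - observed rate| across non-empty probability bins.

    Assumed definition: equal-width bins, unweighted average over bins.
    """
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    gaps = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            rate = sum(y for _, y in members) / len(members)
            gaps.append(abs(mean_p - rate))
    return sum(gaps) / len(gaps)


# Toy forecasts (hypothetical).
coin_flip = brier_score([0.5, 0.5, 0.5, 0.5], [0, 1, 0, 1])
well_calibrated = calibration_mae([0.2] * 5, [0, 0, 0, 0, 1])
overconfident = calibration_mae([0.9] * 5, [0, 0, 0, 0, 1])

print(f"Brier of a constant 0.5 forecast: {coin_flip}")
print(f"Calibration MAE, 0.2 predicted vs 1-in-5 observed: {well_calibrated:.3f}")
print(f"Calibration MAE, 0.9 predicted vs 1-in-5 observed: {overconfident:.3f}")
```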

Empirical coverage — 90-day rolling

The blue line is the actual fraction of forecasts where the real outcome landed inside the 90% conformal interval. The green dashed line is the nominal target (0.90). If the blue stays close to the green, the model is well calibrated.

[Chart: empirical coverage, 90-day rolling. y-axis 0.7 to 1.0 with target line at 0.90; x-axis Apr 18 to Jun 15]

Blue: empirical coverage · Dashed green: nominal 0.90 target · Orange circles: drift alerts

Last 14 days

Date    Coverage  q90    n holdout  Drift?
Jun 15  91.3%     0.125  2,203
Jun 14  91.3%     0.125  2,203
Jun 13  91.3%     0.125  2,203
Jun 12  91.3%     0.125  2,203
Jun 11  91.3%     0.125  2,203
Jun 10  91.3%     0.125  2,203      ⚠️
Jun 9   91.3%     0.125  2,203      ⚠️
Jun 8   91.3%     0.125  2,203      ⚠️
Jun 7   91.3%     0.125  2,203      ⚠️
Jun 6   90.9%     0.077  2,181
Jun 5   90.9%     0.077  2,181
Jun 4   90.9%     0.077  2,181      ⚠️
Jun 3   90.9%     0.077  2,181      ⚠️
Jun 2   90.9%     0.077  2,181      ⚠️

What features the model actually uses

Sklearn feature_importances_ on the underlying XGBoost. 39 features total. Top-3 sum: 0.219 · Top-5: 0.317 · Top-10: 0.493. Healthy distribution — no single feature dominates the model.

  • 1. recent_shutdown: 10.0%
  • 2. week_of_year: 6.0%
  • 3. month: 5.9%
  • 4. high_urgency_signals_7d: 5.8%
  • 5. gdelt_unrest_30d: 4.0%
  • 6. election_in_7days: 3.8%
  • 7. high_importance_event: 3.7%
  • 8. block_rate_roll30_mean: 3.6%
  • 9. critical_incident_7d: 3.4%
  • 10. ioda_alert_7d: 3.2%
  • 11. blocked_count_roll14_mean: 3.0%
  • 12. block_rate_roll14_mean: 2.9%
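The Top-3, Top-5, and Top-10 sums quoted above can be checked against the listed importances with a few lines of Python. The values are copied from the list and are already rounded to 0.1pp, so the result may differ from the published sums in the third decimal; the dictionary and function name are assumptions for the example, not the shape of the /v1/sentinel/feature-importance payload.

```python
from typing import Dict

# Importances copied from the list above (rounded to one decimal place in %).
importances: Dict[str, float] = {
    "recent_shutdown": 0.100,
    "week_of_year": 0.060,
    "month": 0.059,
    "high_urgency_signals_7d": 0.058,
    "gdelt_unrest_30d": 0.040,
    "election_in_7days": 0.038,
    "high_importance_event": 0.037,
    "block_rate_roll30_mean": 0.036,
    "critical_incident_7d": 0.034,
    "ioda_alert_7d": 0.032,
    "blocked_count_roll14_mean": 0.030,
    "block_rate_roll14_mean": 0.029,
}


def top_k_share(imp: Dict[str, float], k: int) -> float:
    """Sum of the k largest importances."""
    return sum(sorted(imp.values(), reverse=True)[:k])


for k in (3, 5, 10):
    print(f"Top-{k} share: {top_k_share(importances, k):.3f}")
```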

Interpretation: The forecast model's top feature is recent_shutdown (10.0%), followed by the calendar features week_of_year and month, then high_urgency_signals_7d and gdelt_unrest_30d (protest + conflict signals from the GDELT 1.0 global news feed). Election and event flags, block_rate rolling means, and incident counts follow. risk_tier, the leaky country-level encoding that dominated our older classifier at 85%, contributes only ~2% here. Healthy distribution; no single feature dominates.

Raw JSON: /v1/sentinel/feature-importance

Related