Conformal Calibration of Escalation Thresholds
Problem
The model cascade router uses a heuristic threshold: route to the cheapest tier where P(success) >= 0.65. This threshold was hand-tuned. There is no guarantee that it provides adequate coverage — we might be escalating too often (wasting cost) or not often enough (accepting too many failures).
Solution
Conformal risk control (Angelopoulos et al., 2022, arXiv:2208.02814) provides distribution-free coverage guarantees. Applied to our escalation problem following RouteNLP (arXiv:2604.23577):
Guarantee: P(failure AND no escalation) ≤ α
This means: across all routed queries, the rate of events where the router keeps the cheap model (no escalation) and the cheap model then fails is at most α. Note this is a joint bound over both events, not the conditional failure rate given no escalation; the conditional rate is what gets checked empirically in step 4 below.
Method
1. Nonconformity Score
For each calibration example i at tier t (a code sketch follows this list):
- Compute the calibrated P(success) from the XGBoost + isotonic model
- If the model failed (y_i = 0), the nonconformity score is s_i = P(success)_i
- If the model succeeded (y_i = 1), the nonconformity score is s_i = -∞ (a success can never cause a violation, so it places no constraint on the threshold)
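A minimal sketch of the scoring step, assuming `psuccess` is an array of calibrated P(success) values for one tier and `outcomes` the matching 0/1 results (both names are illustrative, not the module's API):

```python
import numpy as np

def nonconformity_scores(psuccess: np.ndarray, outcomes: np.ndarray) -> np.ndarray:
    # Failed examples (y_i = 0) score their calibrated P(success);
    # successes score -inf so they never push the threshold up.
    return np.where(outcomes == 0, psuccess, -np.inf)
```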
2. Conformal Threshold
Set the escalation threshold as the ⌈(1−α)(n+1)⌉/n empirical quantile of the nonconformity scores, computed among the n failed examples only (following RouteNLP's correctly-handled subset approach). Equivalently, λ̂_t is the ⌈(1−α)(n+1)⌉-th smallest failed-example score; if that rank exceeds n, set λ̂_t = ∞ and always escalate.
This gives a threshold λ̂_t for each tier t (sketched below).
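A sketch of the quantile step under the same assumptions; n counts failed calibration examples only:

```python
def conformal_threshold(psuccess: np.ndarray, outcomes: np.ndarray,
                        alpha: float = 0.05) -> float:
    # Threshold = the ceil((1 - alpha) * (n + 1))-th smallest score
    # among the n failed examples.
    fail_scores = np.sort(psuccess[outcomes == 0])
    n = len(fail_scores)
    k = int(np.ceil((1 - alpha) * (n + 1)))
    if k > n:
        return np.inf  # too few failures to certify this tier: always escalate
    return float(fail_scores[k - 1])
```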
3. Routing Decision
```
if P(success at tier t) >= λ̂_t:
    use tier t (no escalation needed)
else:
    escalate to tier t+1
```
4. Coverage Verification
On a held-out test set, verify that:
- P(y = fail | P(success) >= λ̂_t) ≤ α for each tier t
- This conditional rate is the empirical violation rate and should be ≤ α; it is stricter than the joint guarantee above, since P(fail AND no escalation) = P(fail | no escalation) · P(no escalation). A code sketch follows this list.
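A sketch of the check, reusing the illustrative names on held-out arrays:

```python
def violation_rate(psuccess_test: np.ndarray, outcomes_test: np.ndarray,
                   lam: float) -> float:
    # Empirical P(y = fail | P(success) >= lambda_hat): the fraction of
    # non-escalated test examples that actually failed.
    kept = psuccess_test >= lam
    if not kept.any():
        return 0.0
    return float((outcomes_test[kept] == 0).mean())
```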
Expected Impact
From RouteNLP's results:
- With 500 calibration examples per tier, violation rate is 4.2% at α=0.05
- With 100 examples, violation rate is 7.2%
- With 1000 examples, violation rate is 3.9%
Our SWE-Router dataset has 500 tasks × 8 models = 4000 total outcomes, or ~800 per tier across the five tiers. Expected violation rate: ~4%.
Implementation
The module is in aco/conformal.py:
```python
from aco.conformal import ConformalEscalationCalibrator

# Calibrate: fit per-tier thresholds from calibration data
cal = ConformalEscalationCalibrator(alpha=0.05)
thresholds = cal.calibrate(psuccess, outcomes)

# Use in routing
if cal.should_escalate(tier=2, psuccess=0.62):
    # Escalate to tier 3
    ...
```
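For reference, a minimal sketch of what such a calibrator could look like internally. This is a reconstruction from the API above (per-tier dict inputs are a guess), not the actual contents of aco/conformal.py:

```python
import numpy as np

class ConformalEscalationCalibrator:
    """Hypothetical minimal implementation; aco/conformal.py may differ."""

    def __init__(self, alpha: float = 0.05):
        self.alpha = alpha
        self.thresholds: dict[int, float] = {}

    def calibrate(self, psuccess: dict[int, np.ndarray],
                  outcomes: dict[int, np.ndarray]) -> dict[int, float]:
        # One conformal threshold per tier, from that tier's failures only.
        for tier, p in psuccess.items():
            fail_scores = np.sort(p[outcomes[tier] == 0])
            n = len(fail_scores)
            k = int(np.ceil((1 - self.alpha) * (n + 1)))
            self.thresholds[tier] = float(fail_scores[k - 1]) if k <= n else np.inf
        return self.thresholds

    def should_escalate(self, tier: int, psuccess: float) -> bool:
        return psuccess < self.thresholds[tier]
```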
Integration with v10 Router
The conformal calibrator replaces the hardcoded 0.65 threshold in route_v10():
```python
# Before
for t in range(1, 6):
    if tier_probs[t] >= 0.65:  # heuristic
        return t, tier_probs[t], tier_probs

# After
for t in range(1, 6):
    if not cal.should_escalate(t, tier_probs[t]):  # conformal
        return t, tier_probs[t], tier_probs
```
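Unlike the fixed 0.65, the λ̂_t thresholds differ by tier and move with the chosen α, so the escalation rate becomes a controlled consequence of the risk budget rather than a hand-tuned constant (see the sensitivity table below).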
Sensitivity Analysis
| α | Expected Violation Rate | Expected Escalation Rate | Cost Impact |
|---|---|---|---|
| 0.01 | ~1% | High (conservative) | +10-15% cost |
| 0.05 | ~4% | Medium | Baseline |
| 0.10 | ~8% | Low (aggressive) | -5-10% cost |
| 0.20 | ~15% | Very low | -15-20% cost |
For production use, α=0.05 is recommended. For high-risk domains (legal, medical), use α=0.01.
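The table's trends can be checked empirically by recalibrating at each α. A sketch, using the hypothetical calibrator and illustrative per-tier array names from above:

```python
for alpha in (0.01, 0.05, 0.10, 0.20):
    cal = ConformalEscalationCalibrator(alpha=alpha)
    thresholds = cal.calibrate(psuccess_cal, outcomes_cal)
    # Measure escalation and violation rates per tier on held-out data.
    for tier, lam in thresholds.items():
        kept = psuccess_test[tier] >= lam
        viol = (outcomes_test[tier][kept] == 0).mean() if kept.any() else 0.0
        print(f"alpha={alpha:.2f} tier={tier} "
              f"escalation_rate={1 - kept.mean():.2f} violation_rate={viol:.3f}")
```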
Caveats
- Exchangeability assumption: Conformal guarantees require that calibration and test data are exchangeable. Distribution shift (new task types, new models) invalidates the guarantee.
- Sample size: With only 500 tasks, per-tier calibration is thin: of the ~800 outcomes per tier, only failed examples contribute nonconformity scores, so the effective calibration set can be on the order of 100 per tier. More data improves calibration.
- Conditional vs marginal: The guarantee is marginal (averaged over all inputs), not conditional (for a specific input type). Conditional coverage requires stronger assumptions.
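To make the sample-size caveat concrete: the measured violation rate is itself a binomial estimate whose noise shrinks as 1/√n, consistent with the RouteNLP numbers above. A quick check:

```python
import math

# Standard error of an empirical violation rate near the alpha = 0.05 target
for n in (100, 500, 1000):
    se = math.sqrt(0.05 * 0.95 / n)
    print(f"n={n}: standard error ≈ {se:.3f}")
# n=100  -> ~0.022: an observed 7.2% is about one SE from the 5% target
# n=1000 -> ~0.007: deviations of a point or two become meaningful
```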
References
- Angelopoulos, A.N., et al. "Conformal Risk Control." ICLR 2024. arXiv:2208.02814
- RouteNLP: "Closed-Loop LLM Routing with Conformal Cascading and Distillation Co-Optimization." arXiv:2604.23577
- CP-Router: "An Uncertainty-Aware Router Between LLM and LRM." arXiv:2505.19970
- CAP: "Learning Conformal Abstention Policies for Adaptive Risk Management." arXiv:2502.06884