Conformal Calibration of Escalation Thresholds
Problem
The model cascade router uses a heuristic threshold: route to the cheapest tier where P(success) >= 0.65. This threshold was hand-tuned. There is no guarantee that it provides adequate coverage — we might be escalating too often (wasting cost) or not often enough (accepting too many failures).
Solution
Conformal risk control (Angelopoulos et al., 2022, arXiv:2208.02814) provides distribution-free coverage guarantees. Applied to our escalation problem following RouteNLP (arXiv:2604.23577):
Guarantee: P(failure AND no escalation) ≤ α
This means: across all routed queries, the rate of events where the router keeps the cheap model (no escalation) and the cheap model then fails is at most α. Note this is a joint bound over both events, not the conditional failure rate given no escalation; the conditional rate is what gets checked empirically in step 4 below.
Method
1. Nonconformity Score
For each calibration example i at tier t (a code sketch follows this list):
- Compute the calibrated P(success) from the XGBoost + isotonic model
- If the model failed (y_i = 0), the nonconformity score is s_i = P(success)_i
- If the model succeeded (y_i = 1), the nonconformity score is s_i = -∞ (a success can never cause a violation, so it places no constraint on the threshold)
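A minimal sketch of the scoring step, assuming `psuccess` is an array of calibrated P(success) values for one tier and `outcomes` the matching 0/1 results (both names are illustrative, not the module's API):

```python
import numpy as np

def nonconformity_scores(psuccess: np.ndarray, outcomes: np.ndarray) -> np.ndarray:
    # Failed examples (y_i = 0) score their calibrated P(success);
    # successes score -inf so they never push the threshold up.
    return np.where(outcomes == 0, psuccess, -np.inf)
```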
2. Conformal Threshold
Set the escalation threshold as the ⌈(1−α)(n+1)⌉/n empirical quantile of the nonconformity scores, computed among the n failed examples only (following RouteNLP's correctly-handled subset approach). Equivalently, λ̂_t is the ⌈(1−α)(n+1)⌉-th smallest failed-example score; if that rank exceeds n, set λ̂_t = ∞ and always escalate.
This gives a threshold λ̂_t for each tier t (sketched below).
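A sketch of the quantile step under the same assumptions; n counts failed calibration examples only:

```python
def conformal_threshold(psuccess: np.ndarray, outcomes: np.ndarray,
                        alpha: float = 0.05) -> float:
    # Threshold = the ceil((1 - alpha) * (n + 1))-th smallest score
    # among the n failed examples.
    fail_scores = np.sort(psuccess[outcomes == 0])
    n = len(fail_scores)
    k = int(np.ceil((1 - alpha) * (n + 1)))
    if k > n:
        return np.inf  # too few failures to certify this tier: always escalate
    return float(fail_scores[k - 1])
```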
3. Routing Decision
```
if P(success at tier t) >= λ̂_t:
    use tier t (no escalation needed)
else:
    escalate to tier t+1
```
4. Coverage Verification
On a held-out test set, verify that:
- P(y = fail | P(success) >= λ̂_t) ≤ α for each tier t
- This conditional rate is the empirical violation rate and should be ≤ α; it is stricter than the joint guarantee above, since P(fail AND no escalation) = P(fail | no escalation) · P(no escalation). A code sketch follows this list.
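A sketch of the check, reusing the illustrative names on held-out arrays:

```python
def violation_rate(psuccess_test: np.ndarray, outcomes_test: np.ndarray,
                   lam: float) -> float:
    # Empirical P(y = fail | P(success) >= lambda_hat): the fraction of
    # non-escalated test examples that actually failed.
    kept = psuccess_test >= lam
    if not kept.any():
        return 0.0
    return float((outcomes_test[kept] == 0).mean())
```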
Expected Impact
From RouteNLP's results:
- With 500 calibration examples per tier, violation rate is 4.2% at α=0.05
- With 100 examples, violation rate is 7.2%
- With 1000 examples, violation rate is 3.9%
Our SWE-Router dataset has 500 tasks × 8 models = 4000 total outcomes, or ~800 per tier across the five tiers. Expected violation rate: ~4%.
Implementation
The module is in aco/conformal.py:
```python
from aco.conformal import ConformalEscalationCalibrator

# Calibrate: fit per-tier thresholds from calibration data
cal = ConformalEscalationCalibrator(alpha=0.05)
thresholds = cal.calibrate(psuccess, outcomes)

# Use in routing
if cal.should_escalate(tier=2, psuccess=0.62):
    # Escalate to tier 3
    ...
```
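For reference, a minimal sketch of what such a calibrator could look like internally. This is a reconstruction from the API above (per-tier dict inputs are a guess), not the actual contents of aco/conformal.py:

```python
import numpy as np

class ConformalEscalationCalibrator:
    """Hypothetical minimal implementation; aco/conformal.py may differ."""

    def __init__(self, alpha: float = 0.05):
        self.alpha = alpha
        self.thresholds: dict[int, float] = {}

    def calibrate(self, psuccess: dict[int, np.ndarray],
                  outcomes: dict[int, np.ndarray]) -> dict[int, float]:
        # One conformal threshold per tier, from that tier's failures only.
        for tier, p in psuccess.items():
            fail_scores = np.sort(p[outcomes[tier] == 0])
            n = len(fail_scores)
            k = int(np.ceil((1 - self.alpha) * (n + 1)))
            self.thresholds[tier] = float(fail_scores[k - 1]) if k <= n else np.inf
        return self.thresholds

    def should_escalate(self, tier: int, psuccess: float) -> bool:
        return psuccess < self.thresholds[tier]
```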
Integration with v10 Router
The conformal calibrator replaces the hardcoded 0.65 threshold in route_v10():
```python
# Before
for t in range(1, 6):
    if tier_probs[t] >= 0.65:  # heuristic
        return t, tier_probs[t], tier_probs

# After
for t in range(1, 6):
    if not cal.should_escalate(t, tier_probs[t]):  # conformal
        return t, tier_probs[t], tier_probs
```
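Unlike the fixed 0.65, the λ̂_t thresholds differ by tier and move with the chosen α, so the escalation rate becomes a controlled consequence of the risk budget rather than a hand-tuned constant (see the sensitivity table below).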
Sensitivity Analysis
| α | Expected Violation Rate | Expected Escalation Rate | Cost Impact |
|---|---|---|---|
| 0.01 | ~1% | High (conservative) | +10-15% cost |
| 0.05 | ~4% | Medium | Baseline |
| 0.10 | ~8% | Low (aggressive) | -5-10% cost |
| 0.20 | ~15% | Very low | -15-20% cost |
For production use, α=0.05 is recommended. For high-risk domains (legal, medical), use α=0.01.
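The table's trends can be checked empirically by recalibrating at each α. A sketch, using the hypothetical calibrator and illustrative per-tier array names from above:

```python
for alpha in (0.01, 0.05, 0.10, 0.20):
    cal = ConformalEscalationCalibrator(alpha=alpha)
    thresholds = cal.calibrate(psuccess_cal, outcomes_cal)
    # Measure escalation and violation rates per tier on held-out data.
    for tier, lam in thresholds.items():
        kept = psuccess_test[tier] >= lam
        viol = (outcomes_test[tier][kept] == 0).mean() if kept.any() else 0.0
        print(f"alpha={alpha:.2f} tier={tier} "
              f"escalation_rate={1 - kept.mean():.2f} violation_rate={viol:.3f}")
```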
Caveats
- Exchangeability assumption: Conformal guarantees require that calibration and test data are exchangeable. Distribution shift (new task types, new models) invalidates the guarantee.
- Sample size: With only 500 tasks, per-tier calibration is thin: of the ~800 outcomes per tier, only failed examples contribute nonconformity scores, so the effective calibration set can be on the order of 100 per tier. More data improves calibration.
- Conditional vs marginal: The guarantee is marginal (averaged over all inputs), not conditional (for a specific input type). Conditional coverage requires stronger assumptions.
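To make the sample-size caveat concrete: the measured violation rate is itself a binomial estimate whose noise shrinks as 1/√n, consistent with the RouteNLP numbers above. A quick check:

```python
import math

# Standard error of an empirical violation rate near the alpha = 0.05 target
for n in (100, 500, 1000):
    se = math.sqrt(0.05 * 0.95 / n)
    print(f"n={n}: standard error ≈ {se:.3f}")
# n=100  -> ~0.022: an observed 7.2% is about one SE from the 5% target
# n=1000 -> ~0.007: deviations of a point or two become meaningful
```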
References
- Angelopoulos, A.N., et al. "Conformal Risk Control." ICLR 2024. arXiv:2208.02814
- RouteNLP: "Closed-Loop LLM Routing with Conformal Cascading and Distillation Co-Optimization." arXiv:2604.23577
- CP-Router: "An Uncertainty-Aware Router Between LLM and LRM." arXiv:2505.19970
- CAP: "Learning Conformal Abstention Policies for Adaptive Risk Management." arXiv:2502.06884