# Conformal Calibration of Escalation Thresholds

## Problem

The model cascade router uses a heuristic threshold: route to the cheapest tier where P(success) >= 0.65. This threshold was hand-tuned, so nothing guarantees that it bounds the failure rate: we might be escalating too often (wasting cost) or not often enough (accepting too many failures).

## Solution

Conformal risk control (Angelopoulos et al., 2022, arxiv:2208.02814) provides distribution-free, finite-sample guarantees on the expected value of a bounded, monotone loss. Applied to our escalation problem following RouteNLP (arxiv:2604.23577):

**Guarantee**: P(failure AND no escalation) ≤ α

This is a joint statement: across all tasks considered at a tier, the fraction that are both kept at that tier (no escalation) and fail there is at most α. It does not by itself bound P(failure | no escalation), which can be larger whenever some tasks are escalated.
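
The joint probability in the guarantee can be related to the two conditional rates it is easily confused with. Writing F for "the tier-t model fails" and N for "the router does not escalate" (shorthand introduced here for illustration only):

```latex
\begin{align*}
P(F \cap N) &= P(N \mid F)\,P(F) \;\le\; P(N \mid F), \\
P(F \cap N) &= P(F \mid N)\,P(N) \;\le\; P(F \mid N).
\end{align*}
```

So a bound of α on P(N | F), the share of failures that slip through without escalation, is sufficient for the joint guarantee, while the joint guarantee alone does not bound P(F | N).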

## Method

### 1. Nonconformity Score

For each calibration example i at tier t:
- Compute the calibrated P(success) from the XGBoost + isotonic model
- If the model failed (y_i = 0), the nonconformity score is s_i = P(success)_i
- If the model succeeded (y_i = 1), the nonconformity score is s_i = -∞ (always safe, so it never constrains the threshold)

### 2. Conformal Threshold

Set the escalation threshold to the ⌈(1-α)(n+1)⌉/n empirical quantile of the nonconformity scores among **failed examples only** (following RouteNLP's correctly-handled subset approach). Equivalently, take the ⌈(1-α)(n+1)⌉-th smallest score, where n is the number of failed calibration examples at tier t. If ⌈(1-α)(n+1)⌉ > n, no finite threshold meets the target and the tier should always escalate.

This gives threshold λ̂_t for each tier t.
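
Steps 1 and 2 can be sketched in a few lines of NumPy. This is a minimal illustration under the definitions above, not the code in `aco/conformal.py`; the function name `conformal_threshold` and its argument layout are assumptions made for the example.

```python
import math

import numpy as np


def conformal_threshold(psuccess, outcomes, alpha):
    """Escalation threshold for one tier (hypothetical helper).

    psuccess: calibrated P(success) for each calibration example
    outcomes: 1 if the tier's model succeeded, 0 if it failed
    alpha:    target level for the guarantee
    """
    psuccess = np.asarray(psuccess, dtype=float)
    outcomes = np.asarray(outcomes, dtype=int)

    # Only failed examples carry a finite nonconformity score (s_i = P(success)_i).
    failed_scores = np.sort(psuccess[outcomes == 0])
    n = failed_scores.size

    # Rank of the conformal quantile: the ceil((1 - alpha)(n + 1))-th smallest score.
    k = math.ceil((1 - alpha) * (n + 1))
    if k > n:
        # Too few failed examples for this alpha: always escalate.
        return math.inf
    return float(failed_scores[k - 1])
```

For example, with ten failed calibration scores 0.1, 0.2, ..., 1.0 and α = 0.2, the rank is ⌈0.8 × 11⌉ = 9, so the threshold is the ninth smallest score, 0.9. One design point to keep in mind: isotonic calibration tends to produce tied scores, and examples that sit exactly on the threshold are kept at the tier under a `>=` rule, so a strict `>` comparison is the more conservative choice.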

### 3. Routing Decision

```
if P(success at tier t) >= λ̂_t:
    use tier t (no escalation needed)
else:
    escalate to tier t+1
```
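
### 4. Coverage Verification

On a held-out test set, check for each tier t:
- The violation rate P(y = fail AND P(success) >= λ̂_t), which should be ≤ α to match the guarantee
- The conditional failure rate P(y = fail | P(success) >= λ̂_t), reported as a diagnostic only, since the guarantee does not bound it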
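
A sketch of this held-out check for a single tier follows. It reports the joint rate that appears in the guarantee and, for comparison, the failure rate among non-escalated tasks. The function name `violation_rates` and the returned keys are illustrative assumptions.

```python
import numpy as np


def violation_rates(psuccess, outcomes, threshold):
    """Empirical failure rates on a held-out set for one tier (hypothetical helper)."""
    psuccess = np.asarray(psuccess, dtype=float)
    failed = np.asarray(outcomes, dtype=int) == 0
    kept = psuccess >= threshold  # tasks the router would NOT escalate

    n_kept = int(kept.sum())
    return {
        # Fraction of all tasks that were kept at this tier and failed.
        "joint": float(np.mean(failed & kept)),
        # Fraction of kept tasks that failed (0.0 if nothing was kept).
        "conditional": float(np.mean(failed[kept])) if n_kept else 0.0,
        # Fraction of tasks kept at this tier.
        "kept_fraction": n_kept / psuccess.size,
    }
```

The joint rate equals the conditional rate multiplied by the kept fraction, so the two only coincide when nothing is escalated.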

## Expected Impact

From RouteNLP's results:
- With 500 calibration examples per tier, violation rate is 4.2% at α=0.05
- With 100 examples, violation rate is 7.2%
- With 1000 examples, violation rate is 3.9%

Our SWE-Router dataset has 500 tasks × 8 models = 4000 total outcomes, or ~800 per tier if they are spread evenly across the 5 tiers. If most of those are usable for calibration, the expected violation rate is ~4%.

## Implementation

The module is in `aco/conformal.py`:

```python
from aco.conformal import ConformalEscalationCalibrator

# Calibrate on held-out calibration data:
# psuccess holds calibrated P(success) values, outcomes the observed success/failure labels
cal = ConformalEscalationCalibrator(alpha=0.05)
thresholds = cal.calibrate(psuccess, outcomes)

# Use in routing
if cal.should_escalate(tier=2, psuccess=0.62):
    # Escalate to tier 3
    ...
```

## Integration with v10 Router

The conformal calibrator replaces the hardcoded 0.65 threshold in `route_v10()`:

```python
# Before
for t in range(1, 6):
    if tier_probs[t] >= 0.65:  # heuristic
        return t, tier_probs[t], tier_probs

# After
for t in range(1, 6):
    if not cal.should_escalate(t, tier_probs[t]):  # conformal
        return t, tier_probs[t], tier_probs
```
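
The snippet above does not show what happens when every tier asks to escalate. The sketch below is a self-contained version of the "After" loop that uses plain per-tier thresholds instead of the calibrator object. The function name `route_with_thresholds`, the dict layout, and the fallback to the top tier are all assumptions made for illustration, not the behavior of `route_v10()`.

```python
import math


def route_with_thresholds(tier_probs, thresholds, top_tier=5):
    """Pick the cheapest tier whose P(success) clears its conformal threshold.

    tier_probs, thresholds: dicts keyed by tier number (1..top_tier).
    """
    for t in range(1, top_tier + 1):
        if tier_probs[t] >= thresholds[t]:
            return t, tier_probs[t]
    # Assumption: if every tier escalates, fall back to the most capable tier.
    return top_tier, tier_probs[top_tier]


tier_probs = {1: 0.40, 2: 0.70, 3: 0.90, 4: 0.90, 5: 0.95}
thresholds = {1: 0.60, 2: 0.65, 3: 0.70, 4: 0.70, 5: 0.70}
print(route_with_thresholds(tier_probs, thresholds))  # tier 2 is the first to clear its threshold
```

An infinite threshold (the "always escalate" case) needs no special handling, because no finite probability can clear it.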

## Sensitivity Analysis

| α | Expected Violation Rate | Expected Escalation Rate | Cost Impact (vs. α=0.05) |
|---|------------------------|-------------------------|-------------|
| 0.01 | ~1% | High (conservative) | +10 to +15% cost |
| 0.05 | ~4% | Medium | Baseline |
| 0.10 | ~8% | Low (aggressive) | -5 to -10% cost |
| 0.20 | ~15% | Very low | -15 to -20% cost |

For production use, α=0.05 is recommended. For high-risk domains (legal, medical), use α=0.01.
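
The qualitative trend in the table (a larger α lowers the threshold, escalates less, and admits more violations) can be reproduced with a small sweep. The sketch below runs on synthetic, perfectly calibrated scores that are purely illustrative. It does not use SWE-Router data, and the helper names `failed_only_threshold` and `sweep_alpha` are assumptions.

```python
import math

import numpy as np


def failed_only_threshold(psuccess, outcomes, alpha):
    """Conformal quantile over failed examples only (hypothetical helper)."""
    failed = np.sort(psuccess[outcomes == 0])
    k = math.ceil((1 - alpha) * (failed.size + 1))
    return float(failed[k - 1]) if k <= failed.size else math.inf


def sweep_alpha(cal_p, cal_y, test_p, test_y, alphas):
    """Calibrate on (cal_p, cal_y) and measure rates on (test_p, test_y)."""
    rows = []
    for alpha in alphas:
        lam = failed_only_threshold(cal_p, cal_y, alpha)
        kept = test_p >= lam  # tasks not escalated at this tier
        rows.append({
            "alpha": alpha,
            "threshold": lam,
            "escalation_rate": float(np.mean(~kept)),
            "violation_rate": float(np.mean(kept & (test_y == 0))),
        })
    return rows


# Synthetic data: success is drawn with probability equal to the score.
rng = np.random.default_rng(0)
p = rng.uniform(0.0, 1.0, size=4000)
y = (rng.uniform(0.0, 1.0, size=4000) < p).astype(int)

rows = sweep_alpha(p[:2000], y[:2000], p[2000:], y[2000:], [0.01, 0.05, 0.10, 0.20])
for row in rows:
    print(row)
```

Running the same sweep on the real calibration and test splits would replace the expected figures in the table with measured ones.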

## Caveats

1. **Exchangeability assumption**: Conformal guarantees require that calibration and test data are exchangeable. Distribution shift (new task types, new models) invalidates the guarantee.
2. **Sample size**: The ~800 outcomes per tier are an upper bound on the calibration sample. The threshold is computed from failed examples only, and part of the data must be held out for verification, so the effective per-tier sample can be much smaller (possibly closer to ~100). More data improves calibration.
3. **Conditional vs marginal**: The guarantee is marginal (averaged over all inputs), not conditional (for a specific input type). Conditional coverage requires stronger assumptions.

## References

- Angelopoulos, A.N., et al. "Conformal Risk Control." ICLR 2024 (arXiv preprint 2022). arxiv:2208.02814
- RouteNLP: "Closed-Loop LLM Routing with Conformal Cascading and Distillation Co-Optimization." arxiv:2604.23577
- CP-Router: "An Uncertainty-Aware Router Between LLM and LRM." arxiv:2505.19970
- CAP: "Learning Conformal Abstention Policies for Adaptive Risk Management." arxiv:2502.06884