# Conformal Calibration of Escalation Thresholds

## Problem

The model cascade router uses a heuristic threshold: route to the cheapest tier where P(success) >= 0.65. This threshold was hand-tuned, and there is no guarantee that it provides adequate coverage: we might be escalating too often (wasting cost) or not often enough (accepting too many failures).

## Solution

Conformal risk control (Angelopoulos et al., 2022, arXiv:2208.02814) provides distribution-free coverage guarantees. We apply it to our escalation problem following RouteNLP (arXiv:2604.23577):

**Guarantee**: P(failure AND no escalation) ≤ α

In other words: the probability that the router keeps a cheap model's answer (i.e., decides not to escalate) and that answer turns out to be a failure is at most α.

## Method

### 1. Nonconformity Score

For each calibration example i at tier t:

- Compute the calibrated P(success) from the XGBoost + isotonic model
- If the model failed (y_i = 0), the nonconformity score is s_i = P(success)_i
- If the model succeeded (y_i = 1), the nonconformity score is s_i = -∞ (always safe)
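The score definition above can be sketched in a few lines of NumPy (the function name and array layout are illustrative, not the module's actual API):

```python
import numpy as np

def nonconformity_scores(p_success: np.ndarray, y: np.ndarray) -> np.ndarray:
    """s_i = P(success)_i for failures (y_i = 0), -inf for successes (y_i = 1)."""
    return np.where(y == 0, p_success, -np.inf)
```

On failures the score is the model's (overconfident) success probability, so large scores mark confident mistakes; successes never constrain the threshold.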
### 2. Conformal Threshold

Set the escalation threshold to the ⌈(1-α)(n+1)⌉/n empirical quantile of the nonconformity scores among the n **failed examples only** (following RouteNLP's correctly-handled-subset approach).

This gives a threshold λ̂_t for each tier t.
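A minimal sketch of the threshold computation for one tier, assuming `p_success` and `y` are that tier's calibration arrays (names are illustrative):

```python
import numpy as np

def conformal_threshold(p_success: np.ndarray, y: np.ndarray, alpha: float) -> float:
    """lambda-hat for one tier: the ceil((1-alpha)(n+1))/n empirical quantile
    of the scores among the n failed calibration examples."""
    scores = np.sort(p_success[y == 0])      # failure scores, ascending
    n = len(scores)
    if n == 0:
        return 0.0                           # no observed failures: never escalate
    k = min(int(np.ceil((1 - alpha) * (n + 1))), n)  # conformal rank, clamped
    return float(scores[k - 1])
```

When there are too few failures relative to α, the rank clamps at n and the threshold becomes the largest failure score, i.e. as conservative as the data allows.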
### 3. Routing Decision

```
if P(success at tier t) >= λ̂_t:
    use tier t (no escalation needed)
else:
    escalate to tier t+1
```
### 4. Coverage Verification

On a held-out test set, verify that:

- P(y=fail | P(success) >= λ̂_t) ≤ α for each tier t
- This is the empirical violation rate; it should be ≤ α
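The check above can be sketched as follows (illustrative names; `lam` is the tier's λ̂_t):

```python
import numpy as np

def violation_rate(p_success: np.ndarray, y: np.ndarray, lam: float) -> float:
    """Failure rate among held-out examples the router would keep at this tier
    (those with P(success) >= lambda-hat)."""
    kept = p_success >= lam
    if not kept.any():
        return 0.0                 # router escalates everything: no violations
    return float((y[kept] == 0).mean())
```

Compare the returned rate to α for each tier; rates persistently above α suggest the calibration set is stale or too small.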
## Expected Impact

From RouteNLP's results:

- With 500 calibration examples per tier, violation rate is 4.2% at α=0.05
- With 100 examples, violation rate is 7.2%
- With 1000 examples, violation rate is 3.9%

Our SWE-Router dataset has 500 tasks × 8 models = 4000 total outcomes, giving us ~800 per tier. Expected violation rate: ~4%.
## Implementation

The module is in `aco/conformal.py`:

```python
from aco.conformal import ConformalEscalationCalibrator

# Calibrate
cal = ConformalEscalationCalibrator(alpha=0.05)
thresholds = cal.calibrate(psuccess, outcomes)

# Use in routing
if cal.should_escalate(tier=2, psuccess=0.62):
    # Escalate to tier 3
    ...
```
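For orientation, here is one plausible shape of the calibrator (a sketch only; the shipped `aco/conformal.py` may differ). It assumes `psuccess` and `outcomes` are dicts mapping each tier to per-example NumPy arrays:

```python
import numpy as np

class ConformalEscalationCalibrator:
    """Sketch of the calibrator's interface; the real module may differ."""

    def __init__(self, alpha: float = 0.05):
        self.alpha = alpha
        self.thresholds = {}  # tier -> lambda-hat

    def calibrate(self, psuccess, outcomes):
        """psuccess/outcomes: dicts of tier -> arrays of P(success) and 0/1 labels."""
        for tier, p in psuccess.items():
            scores = np.sort(p[outcomes[tier] == 0])   # failure scores only
            n = len(scores)
            if n == 0:
                self.thresholds[tier] = 0.0            # no failures observed
                continue
            k = min(int(np.ceil((1 - self.alpha) * (n + 1))), n)
            self.thresholds[tier] = float(scores[k - 1])
        return self.thresholds

    def should_escalate(self, tier: int, psuccess: float) -> bool:
        # Escalate when the calibrated success probability is below lambda-hat_t.
        return psuccess < self.thresholds[tier]
```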
## Integration with v10 Router

The conformal calibrator replaces the hardcoded 0.65 threshold in `route_v10()`:

```python
# Before
for t in range(1, 6):
    if tier_probs[t] >= 0.65:  # heuristic
        return t, tier_probs[t], tier_probs

# After
for t in range(1, 6):
    if not cal.should_escalate(t, tier_probs[t]):  # conformal
        return t, tier_probs[t], tier_probs
```
## Sensitivity Analysis

| α | Expected Violation Rate | Expected Escalation Rate | Cost Impact |
|---|------------------------|--------------------------|-------------|
| 0.01 | ~1% | High (conservative) | +10-15% cost |
| 0.05 | ~4% | Medium | Baseline |
| 0.10 | ~8% | Low (aggressive) | -5-10% cost |
| 0.20 | ~15% | Very low | -15-20% cost |

For production use, α=0.05 is recommended. For high-risk domains (legal, medical), use α=0.01.
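A table like this can be regenerated by sweeping α over one tier's calibration data (a sketch with illustrative names; it reports the implied escalation rate, not cost):

```python
import numpy as np

def sweep_alpha(p_success, y, alphas=(0.01, 0.05, 0.10, 0.20)):
    """For each alpha: lambda-hat and the fraction of calibration examples
    that would escalate (P(success) < lambda-hat)."""
    failures = np.sort(p_success[y == 0])   # failure scores, ascending
    n = len(failures)
    assert n > 0, "need at least one observed failure"
    rows = []
    for alpha in alphas:
        k = min(int(np.ceil((1 - alpha) * (n + 1))), n)
        lam = float(failures[k - 1])
        rows.append((alpha, lam, float((p_success < lam).mean())))
    return rows
```

The escalation rate falls as α grows, which is the cost lever the table summarizes.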
## Caveats

1. **Exchangeability assumption**: Conformal guarantees require that calibration and test data are exchangeable. Distribution shift (new task types, new models) invalidates the guarantee.
2. **Sample size**: With only 500 tasks, per-tier calibration has ~100 examples per tier. More data improves calibration.
3. **Conditional vs. marginal**: The guarantee is marginal (averaged over all inputs), not conditional (holding for a specific input type). Conditional coverage requires stronger assumptions.
## References

- Angelopoulos, A. N., et al. "Conformal Risk Control." ICLR 2024. arXiv:2208.02814
- RouteNLP: "Closed-Loop LLM Routing with Conformal Cascading and Distillation Co-Optimization." arXiv:2604.23577
- CP-Router: "An Uncertainty-Aware Router Between LLM and LRM." arXiv:2505.19970
- CAP: "Learning Conformal Abstention Policies for Adaptive Risk Management." arXiv:2502.06884