narcolepticchicken commited on
Commit
0087553
·
verified ·
1 Parent(s): ec846ef

Upload docs/conformal_report.md

Files changed (1)
  1. docs/conformal_report.md +109 -0
docs/conformal_report.md ADDED
@@ -0,0 +1,109 @@
1
+ # Conformal Calibration of Escalation Thresholds
2
+
3
+ ## Problem
4
+
5
+ The model cascade router uses a heuristic threshold: route to the cheapest tier where P(success) >= 0.65. This threshold was hand-tuned, so it carries no guarantee on the failure rate: we might be escalating too often (wasting cost) or not often enough (accepting too many failures).
6
+
7
+ ## Solution
8
+
9
+ Conformal risk control (Angelopoulos et al., 2022, arxiv:2208.02814) provides distribution-free guarantees on an expected loss. We apply it to our escalation problem following RouteNLP (arxiv:2604.23577):
10
+
11
+ **Guarantee**: P(failure AND no escalation) ≤ α
12
+
13
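+ This means: across all requests, the probability that the router keeps a request at the cheap tier (no escalation) and the cheap model then fails is at most α. This is a joint probability over all requests. It is not the failure rate conditional on not escalating, which can be higher when few requests are kept.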
14
+
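The relationship between the joint guarantee and the conditional failure rate can be written out explicitly. The sketch below assumes that the failed-only quantile described in the Method section bounds P(no escalation | failure) by α under exchangeability; the event names are shorthand introduced here for illustration.

```latex
% F = "the tier's model fails", K = "the request is kept (no escalation)"
\begin{align*}
P(F \wedge K) &= P(K \mid F)\, P(F) \;\le\; \alpha\, P(F) \;\le\; \alpha, \\
P(F \mid K)   &= \frac{P(F \wedge K)}{P(K)} \;\le\; \frac{\alpha\, P(F)}{P(K)} .
\end{align*}
```

The first line is the stated guarantee. The second shows why the conditional rate is only bounded by α when P(K) ≥ P(F), that is, when the router keeps at least as large a share of requests as the tier's failure rate.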
15
+ ## Method
16
+
17
+ ### 1. Nonconformity Score
18
+
19
+ For each calibration example i at tier t:
20
+ - Compute the calibrated P(success) from the XGBoost + isotonic model
21
+ - If the model failed (y_i = 0), the nonconformity score is s_i = P(success)_i
22
+ - If the model succeeded (y_i = 1), the nonconformity score is s_i = -∞ (always safe)
23
+
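The scoring rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the contents of `aco/conformal.py`; the function name `nonconformity_scores` and the toy inputs are assumptions made for the example.

```python
import numpy as np

def nonconformity_scores(psuccess, outcomes):
    """Hypothetical helper: s_i = P(success)_i for failures, -inf for successes."""
    p = np.asarray(psuccess, dtype=float)
    y = np.asarray(outcomes, dtype=int)
    if p.shape != y.shape:
        raise ValueError("psuccess and outcomes must have the same shape")
    # Failures (y == 0) keep their predicted P(success); successes are always safe.
    return np.where(y == 0, p, -np.inf)

# Toy calibration data for one tier: two successes and two failures.
scores = nonconformity_scores([0.9, 0.4, 0.7, 0.2], [1, 0, 0, 1])
print(scores)
```

A failure with a high predicted P(success) gets a high score, which is exactly the case the threshold has to guard against.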
24
+ ### 2. Conformal Threshold
25
+
26
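+ Set the escalation threshold as the ⌈(1-α)(n+1)⌉/n empirical quantile of the nonconformity scores among **failed examples only**, where n is the number of failed calibration examples at the tier (following RouteNLP's correctly-handled subset approach). Successful examples, whose score is -∞, do not enter this quantile.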
27
+
28
+ This gives threshold λ̂_t for each tier t.
29
+
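One way to compute such a threshold is to take the k-th smallest failed score with k = ⌈(1-α)(n+1)⌉. The sketch below assumes that reading of the quantile; `conformal_threshold` is a hypothetical name, and returning infinity when there are too few failures (so the tier always escalates) is a conservative choice made here, not something the report specifies.

```python
import math
import numpy as np

def conformal_threshold(psuccess, outcomes, alpha):
    """Hypothetical helper: finite-sample quantile of P(success) among failures."""
    p = np.asarray(psuccess, dtype=float)
    y = np.asarray(outcomes, dtype=int)
    failed = np.sort(p[y == 0])          # scores of failed examples only
    n = failed.size
    # Rank of the order statistic; the small epsilon guards against float noise
    # pushing an exact integer such as 19.0 up to 20.
    k = math.ceil((1 - alpha) * (n + 1) - 1e-9)
    if k > n:
        return float("inf")              # too few failures: always escalate
    return float(failed[k - 1])

fail_p = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50]
succ_p = [0.60, 0.70, 0.80, 0.90]
lam = conformal_threshold(fail_p + succ_p, [0] * 10 + [1] * 4, alpha=0.2)
print(lam)                               # 9th smallest failed score: 0.45

lam_small = conformal_threshold([0.2, 0.5, 0.7], [0, 0, 0], alpha=0.05)
print(lam_small)                         # inf, since ceil(0.95 * 4) = 4 > 3
```

Isotonic calibration produces piecewise-constant probabilities, so ties at the threshold are likely in practice. With the `>=` routing rule, a tie keeps the request at the tier, which is worth checking when verifying coverage.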
30
+ ### 3. Routing Decision
31
+
32
+ ```
33
+ if P(success at tier t) >= λ̂_t:
34
+     use tier t (no escalation needed)
35
+ else:
36
+     escalate to tier t+1
37
+ ```
38
+
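Applied across the whole cascade, the rule amounts to walking the tiers from cheapest to most expensive and stopping at the first one that clears its threshold. The sketch below uses a hypothetical `route` function with dict inputs; falling back to the top tier when nothing clears is an assumption, since the report does not say what happens in that case.

```python
def route(tier_probs, thresholds):
    """Hypothetical cascade: return the cheapest tier whose P(success) clears λ̂_t.

    tier_probs and thresholds map tier number -> float, with lower tier
    numbers assumed to be cheaper.
    """
    tiers = sorted(tier_probs)
    for t in tiers:
        if tier_probs[t] >= thresholds[t]:
            return t                 # no escalation needed at tier t
    return tiers[-1]                 # assumed fallback: use the top tier

thresholds = {1: 0.70, 2: 0.60, 3: 0.55}
chosen = route({1: 0.40, 2: 0.62, 3: 0.90}, thresholds)
fallback = route({1: 0.10, 2: 0.20, 3: 0.30}, thresholds)
print(chosen, fallback)              # 2 3
```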
39
+ ### 4. Coverage Verification
40
+
41
+ On a held-out test set, verify that:
42
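+ - P(y=fail AND P(success) >= λ̂_t) ≤ α for each tier t, the joint form used in the guarantee
43
+ - This is the violation rate. The conditional rate P(y=fail | P(success) >= λ̂_t) is worth reporting too, but it is not bounded by the guarantee and can exceed α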
44
+
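A held-out check for one tier might look like the following. It is a minimal sketch with a hypothetical `violation_rates` helper and toy data; it reports both the joint rate, which matches the stated guarantee P(failure AND no escalation), and the rate conditional on not escalating.

```python
import numpy as np

def violation_rates(psuccess, outcomes, threshold):
    """Hypothetical helper: (joint, conditional) failure rates among kept requests."""
    p = np.asarray(psuccess, dtype=float)
    y = np.asarray(outcomes, dtype=int)
    kept = p >= threshold                      # router does not escalate
    failed_and_kept = kept & (y == 0)
    joint = float(failed_and_kept.mean())      # P(fail AND kept)
    # P(fail | kept); defined as 0.0 here when nothing is kept.
    conditional = float(failed_and_kept.sum() / kept.sum()) if kept.any() else 0.0
    return joint, conditional

joint, conditional = violation_rates(
    psuccess=[0.9, 0.8, 0.7, 0.3, 0.2],
    outcomes=[1, 0, 1, 0, 1],
    threshold=0.6,
)
print(joint, conditional)                      # 1 of 5 overall, 1 of 3 kept
```

On real data both numbers should come with a confidence interval, since a few hundred held-out examples per tier leave noticeable sampling noise.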
45
+ ## Expected Impact
46
+
47
+ From RouteNLP's results:
48
+ - With 500 calibration examples per tier, violation rate is 4.2% at α=0.05
49
+ - With 100 examples, violation rate is 7.2%
50
+ - With 1000 examples, violation rate is 3.9%
51
+
52
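+ Our SWE-Router dataset has 500 tasks × 8 models = 4000 total outcomes, giving us ~800 per tier. If RouteNLP's results transfer to our setting, we would expect a violation rate of roughly 4%, though this is an extrapolation and not a measurement.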
53
+
54
+ ## Implementation
55
+
56
+ The module is in `aco/conformal.py`:
57
+
58
+ ```python
59
+ from aco.conformal import ConformalEscalationCalibrator
60
+
61
+ # Calibrate
62
+ cal = ConformalEscalationCalibrator(alpha=0.05)
63
+ thresholds = cal.calibrate(psuccess, outcomes)
64
+
65
+ # Use in routing
66
+ if cal.should_escalate(tier=2, psuccess=0.62):
67
+     # Escalate to tier 3
68
+     ...
69
+ ```
70
+
71
+ ## Integration with v10 Router
72
+
73
+ The conformal calibrator replaces the hardcoded 0.65 threshold in `route_v10()`:
74
+
75
+ ```python
76
+ # Before
77
+ for t in range(1, 6):
78
+     if tier_probs[t] >= 0.65:  # heuristic
79
+         return t, tier_probs[t], tier_probs
80
+
81
+ # After
82
+ for t in range(1, 6):
83
+     if not cal.should_escalate(t, tier_probs[t]):  # conformal
84
+         return t, tier_probs[t], tier_probs
85
+ ```
86
+
87
+ ## Sensitivity Analysis
88
+
89
+ | α | Expected Violation Rate | Expected Escalation Rate | Cost Impact |
90
+ |---|------------------------|-------------------------|-------------|
91
+ | 0.01 | ~1% | High (conservative) | +10-15% cost |
92
+ | 0.05 | ~4% | Medium | Baseline |
93
+ | 0.10 | ~8% | Low (aggressive) | -5-10% cost |
94
+ | 0.20 | ~15% | Very low | -15-20% cost |
95
+
96
+ For production use, α=0.05 is recommended. For high-risk domains (legal, medical), use α=0.01.
97
+
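The direction of the trade-off in the table can be illustrated with a small sweep over α. The sketch below runs on synthetic, perfectly calibrated data generated with a fixed seed, so its numbers say nothing about SWE-Router or about the cost figures above; it only shows that a smaller α yields a higher threshold and more escalation. All names here are hypothetical.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n = 2000
psuccess = rng.uniform(0.0, 1.0, size=n)
# Synthetic outcomes: success with probability equal to the predicted P(success).
outcomes = (rng.uniform(size=n) < psuccess).astype(int)

failed = np.sort(psuccess[outcomes == 0])

def threshold_for(alpha):
    """k-th smallest failed score with k = ceil((1 - alpha) * (n_failed + 1))."""
    k = math.ceil((1 - alpha) * (failed.size + 1) - 1e-9)
    return float("inf") if k > failed.size else float(failed[k - 1])

rows = []
for alpha in (0.01, 0.05, 0.10, 0.20):
    lam = threshold_for(alpha)
    kept = psuccess >= lam
    escalation_rate = float(1 - kept.mean())
    joint_violation = float((kept & (outcomes == 0)).mean())
    rows.append((alpha, lam, escalation_rate, joint_violation))
    print(f"alpha={alpha:.2f}  threshold={lam:.3f}  "
          f"escalation={escalation_rate:.3f}  violation={joint_violation:.4f}")
```

The violation rates printed here are measured on the same data used to set the threshold, so they are optimistic. A real sensitivity analysis would evaluate each α on a held-out split.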
98
+ ## Caveats
99
+
100
+ 1. **Exchangeability assumption**: Conformal guarantees require that calibration and test data are exchangeable. Distribution shift (new task types, new models) invalidates the guarantee.
101
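+ 2. **Sample size**: The dataset has only 500 tasks. The threshold is computed from failed examples only, so the effective calibration set per tier is smaller than the ~800 outcomes per tier quoted above and may be closer to ~100. More data improves calibration.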
102
+ 3. **Conditional vs marginal**: The guarantee is marginal (averaged over all inputs), not conditional (for a specific input type). Conditional coverage requires stronger assumptions.
103
+
104
+ ## References
105
+
106
+ - Angelopoulos, A.N., et al. "Conformal Risk Control." ICLR 2024 (arXiv preprint 2022). arxiv:2208.02814
107
+ - RouteNLP: "Closed-Loop LLM Routing with Conformal Cascading and Distillation Co-Optimization." arxiv:2604.23577
108
+ - CP-Router: "An Uncertainty-Aware Router Between LLM and LRM." arxiv:2505.19970
109
+ - CAP: "Learning Conformal Abstention Policies for Adaptive Risk Management." arxiv:2502.06884