# Conformal Calibration of Escalation Thresholds

## Problem

The model cascade router uses a heuristic threshold: route to the cheapest tier where P(success) >= 0.65. This threshold was hand-tuned, and there is no guarantee that it provides adequate coverage: we might be escalating too often (wasting cost) or not often enough (accepting too many failures).

## Solution

Conformal risk control (Angelopoulos et al., 2022, arXiv:2208.02814) provides distribution-free coverage guarantees. We apply it to our escalation problem following RouteNLP (arXiv:2604.23577):

**Guarantee**: P(failure AND no escalation) ≤ α

In other words: the probability that the router keeps a cheap model's answer (i.e., decides not to escalate) and that answer turns out to be a failure is at most α.

## Method

### 1. Nonconformity Score

For each calibration example i at tier t:

- Compute the calibrated P(success) from the XGBoost + isotonic model
- If the model failed (y_i = 0), the nonconformity score is s_i = P(success)_i
- If the model succeeded (y_i = 1), the nonconformity score is s_i = -∞ (always safe)
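The score definition above can be sketched in a few lines of NumPy (the function name and array layout are illustrative, not the module's actual API):

```python
import numpy as np

def nonconformity_scores(p_success: np.ndarray, y: np.ndarray) -> np.ndarray:
    """s_i = P(success)_i for failures (y_i = 0), -inf for successes (y_i = 1)."""
    return np.where(y == 0, p_success, -np.inf)
```

On failures the score is the model's (overconfident) success probability, so large scores mark confident mistakes; successes never constrain the threshold.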
### 2. Conformal Threshold

Set the escalation threshold to the ⌈(1-α)(n+1)⌉/n empirical quantile of the nonconformity scores among the n **failed examples only** (following RouteNLP's correctly-handled-subset approach).

This gives a threshold λ̂_t for each tier t.
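A minimal sketch of the threshold computation for one tier, assuming `p_success` and `y` are that tier's calibration arrays (names are illustrative):

```python
import numpy as np

def conformal_threshold(p_success: np.ndarray, y: np.ndarray, alpha: float) -> float:
    """lambda-hat for one tier: the ceil((1-alpha)(n+1))/n empirical quantile
    of the scores among the n failed calibration examples."""
    scores = np.sort(p_success[y == 0])      # failure scores, ascending
    n = len(scores)
    if n == 0:
        return 0.0                           # no observed failures: never escalate
    k = min(int(np.ceil((1 - alpha) * (n + 1))), n)  # conformal rank, clamped
    return float(scores[k - 1])
```

When there are too few failures relative to α, the rank clamps at n and the threshold becomes the largest failure score, i.e. as conservative as the data allows.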
### 3. Routing Decision

```
if P(success at tier t) >= λ̂_t:
    use tier t (no escalation needed)
else:
    escalate to tier t+1
```
### 4. Coverage Verification

On a held-out test set, verify that:

- P(y=fail | P(success) >= λ̂_t) ≤ α for each tier t
- This is the empirical violation rate; it should be ≤ α
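The check above can be sketched as follows (illustrative names; `lam` is the tier's λ̂_t):

```python
import numpy as np

def violation_rate(p_success: np.ndarray, y: np.ndarray, lam: float) -> float:
    """Failure rate among held-out examples the router would keep at this tier
    (those with P(success) >= lambda-hat)."""
    kept = p_success >= lam
    if not kept.any():
        return 0.0                 # router escalates everything: no violations
    return float((y[kept] == 0).mean())
```

Compare the returned rate to α for each tier; rates persistently above α suggest the calibration set is stale or too small.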
## Expected Impact

From RouteNLP's results:

- With 500 calibration examples per tier, violation rate is 4.2% at α=0.05
- With 100 examples, violation rate is 7.2%
- With 1000 examples, violation rate is 3.9%

Our SWE-Router dataset has 500 tasks × 8 models = 4000 total outcomes, giving us ~800 per tier. Expected violation rate: ~4%.
## Implementation

The module is in `aco/conformal.py`:

```python
from aco.conformal import ConformalEscalationCalibrator

# Calibrate
cal = ConformalEscalationCalibrator(alpha=0.05)
thresholds = cal.calibrate(psuccess, outcomes)

# Use in routing
if cal.should_escalate(tier=2, psuccess=0.62):
    # Escalate to tier 3
    ...
```
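For orientation, here is one plausible shape of the calibrator (a sketch only; the shipped `aco/conformal.py` may differ). It assumes `psuccess` and `outcomes` are dicts mapping each tier to per-example NumPy arrays:

```python
import numpy as np

class ConformalEscalationCalibrator:
    """Sketch of the calibrator's interface; the real module may differ."""

    def __init__(self, alpha: float = 0.05):
        self.alpha = alpha
        self.thresholds = {}  # tier -> lambda-hat

    def calibrate(self, psuccess, outcomes):
        """psuccess/outcomes: dicts of tier -> arrays of P(success) and 0/1 labels."""
        for tier, p in psuccess.items():
            scores = np.sort(p[outcomes[tier] == 0])   # failure scores only
            n = len(scores)
            if n == 0:
                self.thresholds[tier] = 0.0            # no failures observed
                continue
            k = min(int(np.ceil((1 - self.alpha) * (n + 1))), n)
            self.thresholds[tier] = float(scores[k - 1])
        return self.thresholds

    def should_escalate(self, tier: int, psuccess: float) -> bool:
        # Escalate when the calibrated success probability is below lambda-hat_t.
        return psuccess < self.thresholds[tier]
```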
## Integration with v10 Router

The conformal calibrator replaces the hardcoded 0.65 threshold in `route_v10()`:

```python
# Before
for t in range(1, 6):
    if tier_probs[t] >= 0.65:  # heuristic
        return t, tier_probs[t], tier_probs

# After
for t in range(1, 6):
    if not cal.should_escalate(t, tier_probs[t]):  # conformal
        return t, tier_probs[t], tier_probs
```
## Sensitivity Analysis

| α | Expected Violation Rate | Expected Escalation Rate | Cost Impact |
|---|------------------------|--------------------------|-------------|
| 0.01 | ~1% | High (conservative) | +10-15% cost |
| 0.05 | ~4% | Medium | Baseline |
| 0.10 | ~8% | Low (aggressive) | -5-10% cost |
| 0.20 | ~15% | Very low | -15-20% cost |

For production use, α=0.05 is recommended. For high-risk domains (legal, medical), use α=0.01.
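A table like this can be regenerated by sweeping α over one tier's calibration data (a sketch with illustrative names; it reports the implied escalation rate, not cost):

```python
import numpy as np

def sweep_alpha(p_success, y, alphas=(0.01, 0.05, 0.10, 0.20)):
    """For each alpha: lambda-hat and the fraction of calibration examples
    that would escalate (P(success) < lambda-hat)."""
    failures = np.sort(p_success[y == 0])   # failure scores, ascending
    n = len(failures)
    assert n > 0, "need at least one observed failure"
    rows = []
    for alpha in alphas:
        k = min(int(np.ceil((1 - alpha) * (n + 1))), n)
        lam = float(failures[k - 1])
        rows.append((alpha, lam, float((p_success < lam).mean())))
    return rows
```

The escalation rate falls as α grows, which is the cost lever the table summarizes.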
## Caveats

1. **Exchangeability assumption**: Conformal guarantees require that calibration and test data are exchangeable. Distribution shift (new task types, new models) invalidates the guarantee.
2. **Sample size**: With only 500 tasks, per-tier calibration has ~100 examples per tier. More data improves calibration.
3. **Conditional vs. marginal**: The guarantee is marginal (averaged over all inputs), not conditional (holding for a specific input type). Conditional coverage requires stronger assumptions.
## References

- Angelopoulos, A. N., et al. "Conformal Risk Control." ICLR 2024. arXiv:2208.02814
- RouteNLP: "Closed-Loop LLM Routing with Conformal Cascading and Distillation Co-Optimization." arXiv:2604.23577
- CP-Router: "An Uncertainty-Aware Router Between LLM and LRM." arXiv:2505.19970
- CAP: "Learning Conformal Abstention Policies for Adaptive Risk Management." arXiv:2502.06884