# Conformal Calibration of Escalation Thresholds
## Problem
The model cascade router uses a heuristic threshold: route to the cheapest tier where P(success) >= 0.65. This threshold was hand-tuned. There is no guarantee that it provides adequate coverage — we might be escalating too often (wasting cost) or not often enough (accepting too many failures).
## Solution
Conformal risk control (Angelopoulos et al., 2022, arXiv:2208.02814) provides distribution-free coverage guarantees. Applied to our escalation problem following RouteNLP (arXiv:2604.23577):
**Guarantee**: P(failure AND no escalation) ≤ α
In words: the event that the router declines to escalate (i.e., it trusts the cheap model) and the cheap model actually fails occurs with probability at most α. Note this is a joint bound, not a guarantee conditional on the decision not to escalate.
## Method
### 1. Nonconformity Score
For each calibration example i at tier t:
- Compute the calibrated P(success) from the XGBoost + isotonic model
- If the model failed (y_i = 0), the nonconformity score is s_i = P(success)_i
- If the model succeeded (y_i = 1), the nonconformity score is s_i = -∞ (always safe)
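A minimal sketch of the score computation, assuming per-tier parallel arrays `p_success` and `outcomes` (hypothetical helper names, not the actual `aco/conformal.py` code):
```python
import numpy as np

def nonconformity_scores(p_success: np.ndarray, outcomes: np.ndarray) -> np.ndarray:
    """Failures (y=0) score their predicted P(success); successes score
    -inf, since a success can never produce a coverage violation."""
    return np.where(outcomes == 0, p_success, -np.inf)
```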
### 2. Conformal Threshold
Set the escalation threshold λ̂_t to the ⌈(n+1)(1-α)⌉/n empirical quantile of the nonconformity scores among **failed examples only**, where n is the number of failed calibration examples at tier t (following RouteNLP's subset approach).
This gives threshold λ̂_t for each tier t.
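Under the same assumptions, the threshold step looks like this (a sketch, not the repo implementation; ties at the threshold are ignored for simplicity):
```python
import math
import numpy as np

def conformal_threshold(p_success: np.ndarray, outcomes: np.ndarray,
                        alpha: float = 0.05) -> float:
    """Return lambda-hat for one tier: the conformal quantile of
    P(success) over the failed calibration examples only."""
    fail_scores = np.sort(p_success[outcomes == 0])
    n = len(fail_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # 1-indexed conformal rank
    if k > n:
        return float("inf")  # too few failures observed: always escalate
    return float(fail_scores[k - 1])
```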
### 3. Routing Decision
```
if P(success at tier t) >= λ̂_t:
    use tier t (no escalation needed)
else:
    escalate to tier t+1
```
### 4. Coverage Verification
On a held-out test set, verify that:
- P(y = fail | P(success) ≥ λ̂_t) ≤ α for each tier t
- This empirical failure rate among non-escalated examples is the **violation rate**; it should stay at or below α.
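A hypothetical held-out check (names illustrative), reusing the arrays from the sketches above:
```python
import numpy as np

def violation_rate(p_success: np.ndarray, outcomes: np.ndarray,
                   lam: float) -> float:
    """Empirical P(fail | no escalation) on a held-out test set."""
    kept = p_success >= lam  # cases where the router trusts the cheap tier
    if kept.sum() == 0:
        return 0.0  # router never trusted the tier, so no violations
    return float((outcomes[kept] == 0).mean())
```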
## Expected Impact
From RouteNLP's results:
- With 100 calibration examples per tier, the violation rate is 7.2% at α=0.05
- With 500 examples, 4.2%
- With 1000 examples, 3.9%
Our SWE-Router dataset has 500 tasks × 8 models = 4000 total outcomes; with the models grouped into five tiers, that is ~800 outcomes per tier. Expected violation rate: ~4%.
## Implementation
The module is in `aco/conformal.py`:
```python
from aco.conformal import ConformalEscalationCalibrator

# Calibrate
cal = ConformalEscalationCalibrator(alpha=0.05)
thresholds = cal.calibrate(psuccess, outcomes)

# Use in routing
if cal.should_escalate(tier=2, psuccess=0.62):
    # Escalate to tier 3
    ...
```
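Putting the sketched helpers together on synthetic data (everything here is illustrative; `conformal_threshold` is the sketch from the Method section, not the `aco` API):
```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic calibration data for one tier: predicted P(success) and outcomes
p_cal = rng.uniform(0.3, 0.95, size=800)
y_cal = rng.binomial(1, p_cal)  # outcomes roughly follow the predictions

lam = conformal_threshold(p_cal, y_cal, alpha=0.05)
print(f"lambda-hat = {lam:.3f}")

# Route a new task: trust the cheap tier only above the conformal threshold
p_new = 0.62
print("escalate" if p_new < lam else "stay at this tier")
```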
## Integration with v10 Router
The conformal calibrator replaces the hardcoded 0.65 threshold in `route_v10()`:
```python
# Before
for t in range(1, 6):
    if tier_probs[t] >= 0.65:  # heuristic
        return t, tier_probs[t], tier_probs

# After
for t in range(1, 6):
    if not cal.should_escalate(t, tier_probs[t]):  # conformal
        return t, tier_probs[t], tier_probs
```
## Sensitivity Analysis
| α | Expected Violation Rate | Expected Escalation Rate | Cost Impact |
|---|------------------------|-------------------------|-------------|
| 0.01 | ~1% | High (conservative) | +10-15% cost |
| 0.05 | ~4% | Medium | Baseline |
| 0.10 | ~8% | Low (aggressive) | -5-10% cost |
| 0.20 | ~15% | Very low | -15-20% cost |
For production use, α=0.05 is recommended. For high-risk domains (legal, medical), use α=0.01.
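To see how α trades escalation rate against risk on your own data, one can sweep it with the earlier sketches (illustrative; a proper check would use a held-out split rather than the calibration set itself):
```python
for alpha in (0.01, 0.05, 0.10, 0.20):
    lam = conformal_threshold(p_cal, y_cal, alpha=alpha)
    esc = float((p_cal < lam).mean())         # fraction of tasks escalated
    viol = violation_rate(p_cal, y_cal, lam)  # in-sample sanity check only
    print(f"alpha={alpha:.2f}  lambda={lam:.3f}  "
          f"escalation={esc:.1%}  violation={viol:.1%}")
```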
## Caveats
1. **Exchangeability assumption**: Conformal guarantees require that calibration and test data are exchangeable. Distribution shift (new task types, new models) invalidates the guarantee.
2. **Sample size**: Although each tier has ~800 outcomes, calibration uses failed examples only, so the effective calibration set may be on the order of 100 examples per tier. More data improves calibration.
3. **Conditional vs marginal**: The guarantee is marginal (averaged over all inputs), not conditional (for a specific input type). Conditional coverage requires stronger assumptions.
## References
- Angelopoulos, A. N., et al. "Conformal Risk Control." 2022. arXiv:2208.02814
- RouteNLP: "Closed-Loop LLM Routing with Conformal Cascading and Distillation Co-Optimization." arXiv:2604.23577
- CP-Router: "An Uncertainty-Aware Router Between LLM and LRM." arXiv:2505.19970
- CAP: "Learning Conformal Abstention Policies for Adaptive Risk Management." arXiv:2502.06884