| # Literature Review: Cost-Aware Agent Routing |
|
|
| ## What Exists |
|
|
| ### Model Routing |
|
|
| **RouteLLM** (2406.18665, UC Berkeley/LMSYS, 2024): Trains BERT-based router on Chatbot Arena preference data. Achieves 2x+ cost reduction without sacrificing quality. Simple BERT classifier is surprisingly effective. Does NOT use execution feedback: routes based on input features only. |
|
|
| **HybridLLM** (2404.14618, 2024): Probabilistic router predicts Pr[H(x) ≥ 0] (quality gap favorable for small model). Uses BART score as quality proxy. 40% fewer calls to large model with no quality drop. |
|
|
| **CARROT** (SPROUT dataset, 2025): Multi-model routing benchmark with per-model scores and token counts across 13 models and 44K prompts. Provides ground truth for which model succeeds on which task. |
|
|
| ### Cascade Inference |
|
|
| **Cascade Routing** (2410.10347, ETH SRI, ICLR 2025): Unifies routing (ex-ante) with cascading (post-hoc). Key finding: **quality estimators are the #1 factor**. Post-hoc estimates dramatically outperform ex-ante. Low σ_post benefits cascading; low σ_ante benefits routing. The combination wins. |
|
|
| **RouteNLP** (2604.23577, 2026): 3-component system: difficulty-aware router + confidence-calibrated cascading + distillation-routing co-optimization. Token-level uncertainty u(m,x) = (1/L) Σ_i (1 - p(y_i|x)) from softmax probabilities. Conformal risk control with α=0.05. **58% cost reduction in production** (5K queries/day). |
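
The token-level uncertainty formula above can be sketched in a few lines. This is a minimal illustration, assuming per-token log-probabilities of the generated tokens are available from the serving stack; the function name `token_uncertainty` is hypothetical and not taken from the RouteNLP paper.

```python
import math
from typing import Sequence


def token_uncertainty(token_logprobs: Sequence[float]) -> float:
    """Mean of (1 - p(y_i | x)) over the L generated tokens.

    `token_logprobs` holds the natural-log probability the model assigned
    to each token it actually emitted (an assumed input format).
    """
    if not token_logprobs:
        raise ValueError("need at least one generated token")
    # exp(logprob) recovers the softmax probability of the chosen token.
    return sum(1.0 - math.exp(lp) for lp in token_logprobs) / len(token_logprobs)
```

A score near 0 means the model was confident on every token; a score near 1 means it was guessing. In a cascade, the score would be compared against a calibrated threshold to decide whether to escalate.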
| |
| **CP-Router** (2505.19970, 2025): Training-free uncertainty-aware routing between LLM and Large Reasoning Model. Uses entropy from cheap model output as escalation signal. No training required: just compute entropy and compare to conformal threshold. |
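
The entropy-versus-threshold idea can be illustrated as follows. This is a sketch under stated assumptions, not CP-Router's actual procedure: it assumes the cheap model exposes a probability distribution over answer options, and it picks the threshold with a generic split-conformal quantile over entropies from calibration queries the cheap model answered correctly. All function names are hypothetical.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def conformal_threshold(calib_entropies: Sequence[float], alpha: float = 0.05) -> float:
    """Split-conformal quantile: the ceil((n+1)(1-alpha))-th smallest score.

    Assumption: `calib_entropies` come from held-out queries that the cheap
    model answered correctly and that are exchangeable with future queries.
    """
    n = len(calib_entropies)
    if n == 0:
        raise ValueError("calibration set is empty")
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: never escalate.
        return math.inf
    return sorted(calib_entropies)[k - 1]


def should_escalate(probs: Sequence[float], threshold: float) -> bool:
    """Escalate to the stronger model when the cheap model is too uncertain."""
    return entropy(probs) > threshold
```

Under the exchangeability assumption, at most a fraction `alpha` of queries the cheap model would have answered correctly are escalated unnecessarily. No router weights are trained.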
| |
| ### Agentic Routing |
| |
| **BAAR** (2602.21227, 2025): Budget-Aware Agentic Routing via Boundary-Guided Training. Trains router (Qwen2.5-7B) to decide per-step which model to use. Two-phase training: BoSFT (difficulty taxonomy: Easy/Hard/Intractable) + BoPO (GRPO with boundary-relative rewards). Generalizes to strict per-task budget constraints. |
| |
| **BEST-Route** (2506.22716, Microsoft, 2025): Generates best-of-n samples from cheap model, selects best via proxy reward model. Router predicts both model and number of samples n. Up to 60% cost reduction with <1% performance drop. |
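
The best-of-n selection step can be sketched as below. This is a minimal illustration, assuming a sampling function for the cheap model and a proxy reward function are supplied by the caller; `generate`, `reward`, and `best_of_n` are hypothetical names, and the part of BEST-Route that predicts the model and n is not shown.

```python
from typing import Callable, Tuple


def best_of_n(
    prompt: str,
    n: int,
    generate: Callable[[str, int], str],
    reward: Callable[[str, str], float],
) -> Tuple[str, float]:
    """Draw n samples from the cheap model and keep the highest-reward one.

    generate(prompt, i) returns the i-th sample; reward(prompt, response)
    returns a proxy quality score where higher is better.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt, i) for i in range(n)]
    scores = [reward(prompt, c) for c in candidates]
    # Index of the top-scoring candidate; ties resolve to the earliest sample.
    best = max(range(n), key=scores.__getitem__)
    return candidates[best], scores[best]
```

The cost of this step grows linearly with n, which is why choosing n per query matters for the overall budget.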
| |
| ### Execution Feedback |
| |
| **ClawTrace** (2604.23853): Per-step cost attribution in agent traces. TraceCard format with USD cost + token counts + redundancy flags. **Prune patches cut median cost 32%.** |
| |
| **LLMRouterBench** (2601.07206): 400K instances, 21 datasets, 33 models. Finding: **Simple baselines often match complex routers.** Model complementarity is real but hard to exploit. |
| |
| ### Failure Prediction |
| |
| **AgentRewardBench** (2504.08942): 1,302 web agent trajectories with expert success/side-effect/repetition labels across 5 benchmarks and 4 LLMs. |
| |
| ### Selective Verification |
| |
| **Process Reward Models** (multiple): Train a verifier to score intermediate steps. Invoking the verifier only when confidence is low or risk is high reduces verification cost by 70-90% while maintaining safety. |
| |
| ## What Is Useful |
| |
| | Paper | Key Takeaway | Applied In ACO | |
| |-------|-------------|---------------| |
| | RouteNLP | Conformal cascading with token-level uncertainty | Execution-feedback router (module 4) | |
| | Cascade Routing | Post-hoc >> ex-ante quality estimates | v9 feedback escalation | |
| | BAAR | Per-step routing with difficulty taxonomy | Per-step router (module 3) | |
| | BEST-Route | Best-of-N cheap sampling + reward model | Planned next step | |
| | CP-Router | Training-free entropy-based escalation | Simple fallback router | |
| | ClawTrace | Per-step cost attribution | Telemetry schema | |
| | SPROUT | Multi-model eval data | v11 training data | |
| |
| ## What Is Overkill |
| |
| - **Full agent simulation environments** (SciWorld, ALFWorld): we don't need to simulate the entire agent, just route each step |
| - **GRPO-based RL training** (BAAR): XGBoost with real data outperforms RL with synthetic data |
| - **Distillation-routing co-optimization** (RouteNLP): we're not training task-specific models |
| - **Complex multi-stage pipelines**: simple cascade + feedback is 80% of the benefit |
| |
| ## What Is Missing |
| |
| 1. **Execution-feedback routing with real model logprobs**: all work uses simulated or API-provided logprobs |
| 2. **Conformal calibration for agent routing**: no paper provides distribution-free quality guarantees |
| 3. **Per-step routing with per-step training data**: BAAR routes per step but trains on task-level outcomes |
| 4. **Cost-quality Pareto frontier construction**: no paper constructs the full frontier, only point comparisons |
| 5. **Real agent benchmarks with cost data**: SWE-Router is the only dataset with real USD costs per task |
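
Item 4 concerns the cost-quality Pareto frontier. As a sketch of what constructing it involves, assume each router configuration has already been evaluated to a single (cost, quality) pair, with lower cost and higher quality preferred; the function name and input format are hypothetical.

```python
from typing import Iterable, List, Tuple

Point = Tuple[float, float]  # (cost, quality)


def pareto_frontier(points: Iterable[Point]) -> List[Point]:
    """Return the non-dominated points, ordered by increasing cost.

    A point is kept only if no other point is at least as cheap and has
    strictly higher quality, or is strictly cheaper with equal quality.
    """
    frontier: List[Point] = []
    best_quality = float("-inf")
    # Sort by cost ascending; at equal cost, put the higher quality first.
    for cost, quality in sorted(points, key=lambda p: (p[0], -p[1])):
        if quality > best_quality:
            frontier.append((cost, quality))
            best_quality = quality
    return frontier
```

Sweeping a router's escalation threshold and passing the resulting pairs through this function yields a frontier that can be compared across routers, rather than a single operating point.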
| |
| ## What To Implement First |
| |
| 1. **Execution-feedback escalation** (RouteNLP pattern): highest ROI, validated in production |
| 2. **Per-tier XGBoost with real data** (our v10/v11 approach): simple, effective, requires real traces |
| 3. **Per-step routing** (BAAR pattern): significant savings from routing steps differently |
| 4. **Conformal calibration** (CP-Router pattern): safety guarantees without training |
| 5. **Best-of-N cheap sampling** (BEST-Route pattern): orthogonal improvement to routing |
| |
| Priority: Execution feedback > Real data training > Per-step routing > Conformal calibration > Best-of-N |
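
The top-priority item, execution-feedback escalation, amounts to a cascade loop. The following is a minimal sketch, assuming each tier is a callable that returns an answer together with a post-hoc uncertainty score in [0, 1] and that each tier has a fixed per-call cost; the `Tier` tuple layout, `cascade`, and `max_uncertainty` are hypothetical and do not describe the ACO modules.

```python
from typing import Callable, Sequence, Tuple

# (tier name, cost per call, run(query) -> (answer, uncertainty))
Tier = Tuple[str, float, Callable[[str], Tuple[str, float]]]


def cascade(query: str, tiers: Sequence[Tier], max_uncertainty: float) -> Tuple[str, str, float]:
    """Try tiers from cheapest to most expensive, escalating on high uncertainty.

    Returns (name of the tier whose answer was accepted, answer, total cost).
    The last tier's answer is always accepted, since there is nowhere to escalate.
    """
    if not tiers:
        raise ValueError("tiers must be non-empty")
    total_cost = 0.0
    for idx, (name, cost, run) in enumerate(tiers):
        answer, uncertainty = run(query)
        total_cost += cost  # every attempted tier is paid for
        is_last = idx == len(tiers) - 1
        if uncertainty <= max_uncertainty or is_last:
            return name, answer, total_cost
    raise AssertionError("unreachable: the last tier always returns")
```

Escalated queries pay for every tier attempted, so the savings depend on how often the cheap tier's answer is accepted.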
| |