	# Literature Review: Cost-Aware Agent Routing

	## What Exists

	### Model Routing

	RouteLLM (2406.18665, UC Berkeley/LMSYS, 2024): Trains BERT-based router on Chatbot Arena preference data. Achieves 2x+ cost reduction without sacrificing quality. Simple BERT classifier is surprisingly effective. Does NOT use execution feedback — routes based on input features only.

	HybridLLM (2404.14618, 2024): Probabilistic router predicts Pr[H(x) ≥ 0] (quality gap favorable for small model). Uses BART score as quality proxy. 40% fewer calls to large model with no quality drop.

	CARROT (SPROUT dataset, 2025): Multi-model routing benchmark with per-model scores and token counts across 13 models and 44K prompts. Provides ground truth for which model succeeds on which task.

	### Cascade Inference

	Cascade Routing (2410.10347, ETH SRI, ICLR 2025): Unifies routing (ex-ante) with cascading (post-hoc). Key finding: quality estimators are the #1 factor. Post-hoc estimates dramatically outperform ex-ante. Low σ_post benefits cascading; low σ_ante benefits routing. The combination wins.
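The routing/cascading distinction above can be sketched as one decision function. This is a simplified illustration, not the paper's optimal strategy: `ex_ante_quality`, `post_hoc_quality`, `skip_below`, and `accept_at` are hypothetical names, and the toy lambdas stand in for real models and estimators.

```python
from typing import Callable, List, Tuple


def cascade_route(
    query: str,
    cheap_model: Callable[[str], str],
    strong_model: Callable[[str], str],
    ex_ante_quality: Callable[[str], float],
    post_hoc_quality: Callable[[str, str], float],
    skip_below: float,
    accept_at: float,
) -> Tuple[str, List[str]]:
    """Return (answer, models_called) for one query."""
    calls: List[str] = []
    # Routing (ex-ante): decide before any model has run.
    if ex_ante_quality(query) < skip_below:
        calls.append("strong")
        return strong_model(query), calls
    # Cascading (post-hoc): run the cheap model, then judge its actual output.
    calls.append("cheap")
    draft = cheap_model(query)
    if post_hoc_quality(query, draft) >= accept_at:
        return draft, calls
    calls.append("strong")
    return strong_model(query), calls


# Toy stand-ins (assumptions for illustration only).
cheap = lambda q: "cheap:" + q
strong = lambda q: "strong:" + q
ex_ante = lambda q: 0.1 if "prove" in q else 0.8
post_hoc = lambda q, a: 0.3 if "refactor" in q else 0.9

for q in ["prove lemma", "refactor module", "rename variable"]:
    print(q, cascade_route(q, cheap, strong, ex_ante, post_hoc, 0.3, 0.7))
```

The ex-ante check saves the wasted cheap call on queries that are clearly too hard. The post-hoc check catches failures the ex-ante estimator could not foresee.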

	RouteNLP (2604.23577, 2026): 3-component system: difficulty-aware router + confidence-calibrated cascading + distillation-routing co-optimization. Token-level uncertainty u(m,x) = (1/L) Σ_i (1 - p(y_i | x)) from softmax probabilities. Conformal risk control with α=0.05. 58% cost reduction in production (5K queries/day).
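A minimal sketch of the token-level uncertainty score, assuming the model returns one log-probability per generated token. The function names and the `0.2` threshold are placeholders. In the described system the threshold would come from conformal risk control, which is not shown here.

```python
import math
from typing import Sequence


def token_uncertainty(token_logprobs: Sequence[float]) -> float:
    """Mean of (1 - p(y_i | x)) over the L generated tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return sum(1.0 - math.exp(lp) for lp in token_logprobs) / len(token_logprobs)


def should_escalate(token_logprobs: Sequence[float], threshold: float) -> bool:
    """Escalate to a stronger model when the mean uncertainty exceeds the threshold."""
    return token_uncertainty(token_logprobs) > threshold


# Toy log-probabilities (assumed values, not real model output).
confident = [math.log(0.99), math.log(0.95), math.log(0.98)]
hesitant = [math.log(0.60), math.log(0.35), math.log(0.80)]

print(should_escalate(confident, threshold=0.2))  # low uncertainty: stay on the cheap model
print(should_escalate(hesitant, threshold=0.2))   # high uncertainty: escalate
```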

	CP-Router (2505.19970, 2025): Training-free uncertainty-aware routing between LLM and Large Reasoning Model. Uses entropy from cheap model output as escalation signal. No training required — just compute entropy and compare to conformal threshold.
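The entropy-versus-threshold idea can be illustrated as below. This is a generic split-conformal quantile applied to entropy scores, not necessarily CP-Router's exact procedure. The calibration scores are made-up values, assumed to be entropies of cheap-model outputs that turned out correct.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a distribution over answer options."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def conformal_threshold(calibration_scores: Sequence[float], alpha: float) -> float:
    """Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest score."""
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(calibration_scores)[rank - 1]


def route(probs: Sequence[float], threshold: float) -> str:
    """Escalate when the cheap model's output entropy exceeds the threshold."""
    return "escalate" if entropy(probs) > threshold else "keep_cheap"


# Hypothetical calibration entropies (n = 7) and alpha = 0.25.
calibration = [0.10, 0.20, 0.15, 0.30, 0.25, 0.40, 0.90]
tau = conformal_threshold(calibration, alpha=0.25)

print(tau)
print(route([0.97, 0.01, 0.01, 0.01], tau))  # peaked distribution
print(route([0.25, 0.25, 0.25, 0.25], tau))  # uniform distribution
```

Under exchangeability, a new correct cheap-model answer has entropy at or below `tau` with probability at least 1 - α, so at most roughly an α fraction of good answers are escalated unnecessarily.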

	### Agentic Routing

	BAAR (2602.21227, 2025): Budget-Aware Agentic Routing via Boundary-Guided Training. Trains router (Qwen2.5-7B) to decide per-step which model to use. Two-phase training: BoSFT (difficulty taxonomy: Easy/Hard/Intractable) + BoPO (GRPO with boundary-relative rewards). Generalizes to strict per-task budget constraints.

	BEST-Route (2506.22716, Microsoft, 2025): Generates best-of-n samples from cheap model, selects best via proxy reward model. Router predicts both model and number of samples n. Up to 60% cost reduction with <1% performance drop.
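A sketch of the best-of-n selection step only. The router that predicts the model and n is not shown. `sample` and `reward` are hypothetical callables standing in for the cheap model and the proxy reward model.

```python
from typing import Callable, List, Tuple


def best_of_n(
    prompt: str,
    sample: Callable[[str, int], str],
    reward: Callable[[str, str], float],
    n: int,
) -> Tuple[str, float]:
    """Draw n candidates from the cheap model and keep the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates: List[str] = [sample(prompt, i) for i in range(n)]
    scored = [(reward(prompt, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# Toy stand-ins: a fixed list of drafts and a keyword-based reward (assumptions).
drafts = ["return x", "return x + 1", "return sorted(x)"]
toy_sample = lambda prompt, i: drafts[i % len(drafts)]
toy_reward = lambda prompt, c: 1.0 if "sorted" in c else 0.2

print(best_of_n("sort the list", toy_sample, toy_reward, n=1))
print(best_of_n("sort the list", toy_sample, toy_reward, n=3))
```

Larger n raises the chance that one cheap sample is good enough, at a cost that grows linearly in n. That trade-off is why the router also has to pick n.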

	### Execution Feedback

	ClawTrace (2604.23853): Per-step cost attribution in agent traces. TraceCard format with USD cost + token counts + redundancy flags. Prune patches cut median cost 32%.
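To make per-step cost attribution concrete, here is a hypothetical record layout with USD cost, token counts, and a redundancy flag. The field names are assumptions for illustration and are not ClawTrace's actual TraceCard schema.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class StepRecord:
    step: int
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    redundant: bool = False  # e.g. a repeated tool call that added nothing


def summarize(steps: List[StepRecord]) -> Dict[str, float]:
    """Total cost, cost of redundant steps, and their share of the total."""
    total = sum(s.cost_usd for s in steps)
    redundant = sum(s.cost_usd for s in steps if s.redundant)
    share = redundant / total if total else 0.0
    return {"total_usd": total, "redundant_usd": redundant, "redundant_share": share}


# Made-up trace for illustration.
trace = [
    StepRecord(0, "large", 1200, 300, 0.02),
    StepRecord(1, "small", 800, 150, 0.01),
    StepRecord(2, "small", 800, 150, 0.01, redundant=True),
]
print(summarize(trace))
```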

	LLMRouterBench (2601.07206): 400K instances, 21 datasets, 33 models. Finding: Simple baselines often match complex routers. Model complementarity is real but hard to exploit.

	### Failure Prediction

	AgentRewardBench (2504.08942): 1,302 web agent trajectories with expert success/side-effect/repetition labels across 5 benchmarks and 4 LLMs.

	### Selective Verification

	Process Reward Models (multiple): Train verifier to score intermediate steps. Use only when confidence is low or risk is high. Reduces verification cost by 70-90% while maintaining safety.
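The "verify only when confidence is low or risk is high" rule can be written as a simple gate. The thresholds and the `(confidence, risk)` pairs below are illustrative assumptions, not values from any cited paper.

```python
from typing import List, Tuple


def needs_verification(
    confidence: float, risk: float, min_confidence: float, max_risk: float
) -> bool:
    """Call the verifier only for low-confidence or high-risk steps."""
    return confidence < min_confidence or risk > max_risk


def verified_steps(
    steps: List[Tuple[float, float]], min_confidence: float, max_risk: float
) -> List[int]:
    """Indices of the steps that would be sent to the verifier."""
    return [
        i
        for i, (confidence, risk) in enumerate(steps)
        if needs_verification(confidence, risk, min_confidence, max_risk)
    ]


# (confidence, risk) per agent step; made-up numbers.
steps = [(0.95, 0.1), (0.55, 0.1), (0.90, 0.9), (0.99, 0.0), (0.97, 0.2)]
selected = verified_steps(steps, min_confidence=0.8, max_risk=0.5)
print(selected, f"{len(selected)}/{len(steps)} steps verified")
```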

	## What Is Useful

	| Paper | Key Takeaway | Applied In ACO |
	|-------|-------------|---------------|
	| RouteNLP | Conformal cascading with token-level uncertainty | Execution-feedback router (module 4) |
	| Cascade Routing | Post-hoc >> ex-ante quality estimates | v9 feedback escalation |
	| BAAR | Per-step routing with difficulty taxonomy | Per-step router (module 3) |
	| BEST-Route | Best-of-N cheap sampling + reward model | Planned next step |
	| CP-Router | Training-free entropy-based escalation | Simple fallback router |
	| ClawTrace | Per-step cost attribution | Telemetry schema |
	| SPROUT | Multi-model eval data | v11 training data |

	## What Is Overkill

	- Full agent simulation environments (SciWorld, ALFWorld) — we don't need to simulate the entire agent, just route each step
	- GRPO-based RL training (BAAR) — XGBoost with real data outperforms RL with synthetic data
	- Distillation-routing co-optimization (RouteNLP) — we're not training task-specific models
	- Complex multi-stage pipelines — simple cascade + feedback is 80% of the benefit

	## What Is Missing

	1. Execution-feedback routing with real model logprobs — all work uses simulated or API-provided logprobs
	2. Conformal calibration for agent routing — no paper provides distribution-free quality guarantees
	3. Per-step routing with per-step training data — BAAR routes per step but trains on task-level outcomes
	4. Cost-quality Pareto frontier construction — no paper constructs the full frontier, only point comparisons
	5. Real agent benchmarks with cost data — SWE-Router is the only dataset with real USD costs per task
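For item 4, constructing a cost-quality Pareto frontier from measured points is straightforward once per-configuration cost and quality are available. The sketch below assumes each routing configuration is summarized as a `(cost, quality)` pair. The sample points are invented.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (cost, quality): lower cost and higher quality are better


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Keep the points that no other point beats on both cost and quality."""
    frontier: List[Point] = []
    best_quality = float("-inf")
    # Sort by cost ascending; on equal cost, try the higher-quality point first.
    for cost, quality in sorted(points, key=lambda p: (p[0], -p[1])):
        if quality > best_quality:
            frontier.append((cost, quality))
            best_quality = quality
    return frontier


# Invented (USD per task, success rate) pairs for six routing configurations.
configs = [(1.0, 0.60), (2.0, 0.58), (3.0, 0.75), (3.0, 0.70), (5.0, 0.75), (8.0, 0.90)]
print(pareto_frontier(configs))
```

Sweeping a router's escalation threshold and passing the resulting points through this filter gives the full frontier rather than a single point comparison.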

	## What To Implement First

	1. Execution-feedback escalation (RouteNLP pattern) — highest ROI, validated in production
	2. Per-tier XGBoost with real data (our v10/v11 approach) — simple, effective, requires real traces
	3. Per-step routing (BAAR pattern) — significant savings from routing steps differently
	4. Conformal calibration (CP-Router pattern) — safety guarantees without training
	5. Best-of-N cheap sampling (BEST-Route pattern) — orthogonal improvement to routing

	Priority: Execution feedback > Real data training > Per-step routing > Conformal calibration > Best-of-N
