narcolepticchicken committed on
Commit
bd77292
·
verified ·
1 Parent(s): 18e1e42

Upload docs/literature_review.md

Files changed (1)
  1. docs/literature_review.md +48 -254
docs/literature_review.md CHANGED
@@ -1,282 +1,76 @@
1
- # Literature Review: Agent Cost Optimization
2
 
3
- ## Executive Summary
4
 
5
- This literature review synthesizes findings from 50+ papers across model routing, context compression, tool-use optimization, verifier gating, and failure recovery. The key insight: **compound optimization** (routing + caching + selective verification + meta-tools) has been studied piecemeal but never as a unified system. This gap is the core opportunity for the Agent Cost Optimizer.
6
 
7
- ---
8
 
9
- ## 1. Model Routing & Cascade Inference
10
 
11
- ### What Exists
12
 
13
- | Paper | Key Result | Practicality |
- |-------|-----------|--------------|
- | **FrugalGPT** (Chen et al., 2023) | 98.3% cost reduction on HEADLINES matching GPT-4 accuracy | ★★★★★ Simplest cascade: 3-tier scoring model |
- | **RouteLLM** (Ong et al., 2024) | Preference-trained router, 40%+ cost reduction | ★★★★☆ Requires preference data |
- | **RouterBench** (Hu et al., 2024) | 405K outcomes, 14 LLMs; standard benchmark | ★★★★★ Go-to dataset |
- | **R2-Router** (2026) | Joint model + length selection | ★★★★☆ Extends routing to output budgeting |
- | **xRouter** (2025) | RL-trained cost-aware router | ★★★★☆ End-to-end RL but needs online env |
- | **Arch-Router** (2025) | 1.5B model matches proprietary routers | ★★★★★ Production-ready |
- | **BAAR** (2026) | Step-level routing with GRPO, dominates Pareto frontier | ★★★★☆ Best for multi-turn agents |
22
 
23
- ### What Is Useful
24
- - **FrugalGPT cascade** is the simplest deployable win: cheap → medium → frontier, with a small scoring model gating each level.
25
- - **RouterBench** provides the training data for any router.
26
- - **BAAR's boundary-guided SFT + GRPO** is the strongest approach for step-level agent routing.
27
 
28
- ### What Is Overkill
29
- - Methods requiring online interaction during training (some bandit approaches) are hard to deploy.
30
- - Methods assuming static API graphs don't adapt to changing tool catalogs.
31
 
32
- ### What Is Missing
33
- - No unified router that jointly optimizes: model, context length, tool batching, and verification in one decision.
34
- - No router trained on agent traces with multi-step outcomes.
35
 
36
- ---
37
 
38
- ## 2. Prompt Caching & Context Compression
39
 
40
- ### What Exists
41
 
42
- | Paper | Key Result | Practicality |
- |-------|-----------|--------------|
- | **CacheGen** (2023) | KV cache compression via adaptive quantization | ★★★★☆ Reduces bandwidth |
- | **H2O** (2023) | 20% KV budget maintains accuracy; 29× throughput | ★★★★★ Already in vLLM |
- | **StreamingLLM** (2023) | Attention sinks for infinite-length generation | ★★★★★ Production standard |
- | **CacheBlend** (2024) | Selective recompute for RAG KV caches | ★★★★☆ Best for RAG |
- | **EpiCache** (2025) | Episodic KV management for long QA | ★★★★☆ Apple, strong for chat |
- | **KVCOMM** (2025) | Cross-agent KV cache sharing | ★★★★☆ Multi-agent systems |
50
 
51
- ### What Is Useful
52
- - **Prefix caching** (vLLM/SGLang/DeepSeek) gives ~50% cost reduction on repeated system prompts.
53
- - **H2O/StreamingLLM** are essential for long-context agents.
54
- - **Cache-aware prompt layout** (stable prefix + dynamic suffix) is a free optimization.
55
 
56
- ### What Is Overkill
57
- - Full KV cache compression methods (CacheGen, KVzip) require inference-system integration that most agent harnesses don't control.
58
- - Cross-agent KV sharing (KVCOMM) is niche.
59
 
60
- ### What Is Missing
61
- - No "cache budgeter" that decides *which* context to cache based on predicted reuse frequency.
62
- - No cost-aware context eviction policy for agents with mixed short/long tasks.
63
 
64
- ---
65
 
66
- ## 3. Tool-Use Routing & Optimization
67
 
68
- ### What Exists
69
 
70
- | Paper | Key Result | Practicality |
- |-------|-----------|--------------|
- | **Graph-Based Self-Healing Tool Routing** (2026) | 93% control-plane LLM call reduction | ★★★★★ Deterministic graph routing |
- | **Optimizing Agentic Workflows (AWO)** (2026) | 11.9% LLM call reduction, +4.2pp success | ★★★★★ Meta-tool extraction |
- | **Less is More** (2024) | Reducing tool count improves edge performance | ★★★☆☆ Edge-specific |
- | **Small Model as Master Orchestrator** (2026) | Lightweight orchestrator for parallel decomposition | ★★★★☆ Unified action space |
- | **CASTER** (2026) | Dual-signal router for multi-agent graph workflows | ★★★★☆ Graph-based systems |
77
 
78
- ### What Is Useful
79
- - **Self-Healing Tool Routing** eliminates LLM calls for 93% of tool decisions by using Dijkstra on a cost-weighted tool graph.
80
- - **AWO meta-tools** compress repeated multi-step patterns into deterministic macros.
 
 
 
 
 
 
81
 
82
- ### What Is Overkill
83
- - Full multi-agent orchestration frameworks (CASTER) are powerful but heavy for simple tool pipelines.
84
 
85
- ### What Is Missing
86
- - No "tool necessity predictor" that decides *whether* to call a tool at all, not just which one.
87
- - No cost-aware batching that groups independent tool calls while respecting dependencies.
 
88
 
89
- ---
90
 
91
- ## 4. Verifier Gating & Selective Verification
 
 
 
 
92
 
93
- ### What Exists
94
 
95
- | Paper | Key Result | Practicality |
- |-------|-----------|--------------|
- | **Self-Calibration** (2025) | ECE 13.70→3.79, accuracy ↑3pp | ★★★★☆ Requires SFT |
- | **ESC** (2024) | 33-84% sampling cost reduction, zero accuracy loss | ★★★★★ Drop-in replacement |
- | **SmartSnap** (2025) | Proactive evidence seeking for self-verification | ★★★★☆ RL-based |
- | **The Art of Building Verifiers** (2026) | 4 design principles for computer-use agents | ★★★★★ Practical framework |
- | **Generalized Correctness Model** (2025) | Cross-model verification, selective deferral | ★★★☆☆ Needs multi-model labels |
- | **Agentic Confidence Calibration** (2026) | Trajectory-based calibration across systems | ★★★★☆ Multi-agent focus |
103
 
104
- ### What Is Useful
105
- - **Early-Stopping Self-Consistency (ESC)** is the highest-ROI change: replace standard self-consistency with window-based stopping.
106
- - **Self-Calibration** enables single-forward-pass confidence for routing and early stopping.
107
- - **Heuristic verifier budgeter** (risk-weighted) is sufficient for most agents.
108
-
109
- ### What Is Overkill
110
- - Training a Generalized Correctness Model across 10+ LLMs is expensive and data-hungry.
111
- - Formal verification frameworks (VeriGuard) are essential only for safety-critical applications.
112
-
113
- ### What Is Missing
114
- - No verifier that can estimate its *own* reliability per task type and adjust thresholds.
115
- - No framework for "verifier cascading" (cheap verifier first, expensive one only on disagreement).
116
-
117
- ---
118
-
119
- ## 5. Early Exit & Failure Detection
120
-
121
- ### What Exists
122
-
123
- | Paper | Key Result | Practicality |
- |-------|-----------|--------------|
- | **VLAA-GUI** (2026) | Modular framework for GUI agent stopping/loop breaking | ★★★★★ Modular |
- | **LYNX** (2025) | Hidden-state early-exit for reasoning | ★★★★☆ Requires model access |
- | **SpecExit** (2025) | Speculative exit for reasoning models | ★★★★☆ Reduces generation length |
- | **FAMA** (2026) | Failure-aware meta-agent, +4.6-11.6% on τ-bench | ★★★★☆ Failure clustering |
- | **Confidence Dichotomy** (2026) | Tool-use agents have task-specific calibration | ★★★★☆ RL calibration |
130
-
131
- ### What Is Useful
132
- - **Doom detection via signal aggregation** (repeated failures, cost explosion, stagnant progress) is the practical approach.
133
- - **FAMA's failure clustering** identifies dominant error patterns for targeted recovery.
134
-
135
- ### What Is Overkill
136
- - Hidden-state early exit requires access to model weights, which is not available for API-only agents.
137
- - Speculative exit requires model architecture changes.
138
-
139
- ### What Is Missing
140
- - No "run health score" that combines all signals into a single terminate/continue decision with calibrated confidence.
141
- - No online learning from false-stop vs. false-continue outcomes.
142
-
143
- ---
144
-
145
- ## 6. Test-Time Compute Allocation
146
-
147
- ### What Exists
148
-
149
- | Paper | Key Result | Practicality |
- |-------|-----------|--------------|
- | **Trust but Verify Survey** (2025) | Comprehensive taxonomy of TTS methods | ★★★★★ Reference |
- | **PAG** (2025) | Policy-as-Verifier for multi-turn self-correction | ★★★★☆ RL framework |
- | **Compute-Optimal Scaling** (various) | Best-of-N and PRM for reasoning | ★★★★☆ Math-focused |
- | **Self-Healing Tool Router** (2026) | Cost-weighted graph routing as compute allocation | ★★★★★ Practical |
155
-
156
- ### What Is Useful
157
- - **Best-of-N with early stopping** (ESC) is the most practical test-time scaling optimization.
158
- - **Process Reward Models** are powerful but require training data.
159
-
160
- ### What Is Overkill
161
- - Full MCTS search over reasoning paths is too expensive for most agent tasks.
162
-
163
- ### What Is Missing
164
- - No adaptive compute allocator that distributes budget across: routing, tool calls, verification, and model strength.
165
-
166
- ---
167
-
168
- ## 7. Meta-Tool & Workflow Compression
169
-
170
- ### What Exists
171
-
172
- | Paper | Key Result | Practicality |
- |-------|-----------|--------------|
- | **AWO** (2026) | State-graph merging, hot-path extraction | ★★★★★ No training needed |
- | **Agent-as-Tool** (2026) | Unified parallel orchestration | ★★★★☆ Standardized action space |
- | **WebClipper** (2026) | Graph-based trajectory pruning for web agents | ★★★★☆ Web-specific |
177
-
178
- ### What Is Useful
179
- - **AWO's trace → state graph → hot path → meta-tool** pipeline is directly applicable.
180
- - Meta-tools eliminate LLM reasoning for known sub-routines.
181
-
182
- ### What Is Overkill
183
- - Full workflow synthesis from scratch is complex; incremental mining from traces is better.
184
-
185
- ### What Is Missing
186
- - No "meta-tool validation" step that checks whether a compressed workflow still handles edge cases.
187
- - No A/B testing framework for meta-tool vs. original LLM-based execution.
188
-
189
- ---
190
-
191
- ## 8. Cost-Quality Frontiers
192
-
193
- ### What Exists
194
-
195
- | Paper | Key Result | Practicality |
- |-------|-----------|--------------|
- | **RouterBench** (2024) | Systematic cost-quality evaluation framework | ★★★★★ Standard |
- | **R2-Bench** (2026) | Length-constrained cost-quality benchmark | ★★★★☆ Extends RouterBench |
- | **BAAR** (2026) | Pareto frontier on ALFWorld, SciWorld, AppWorld | ★★★★☆ Interactive agents |
200
-
201
- ### What Is Useful
202
- - RouterBench provides the evaluation protocol.
203
- - Pareto frontier plotting (cost vs. accuracy) is the correct way to compare systems.
204
-
205
- ### What Is Missing
206
- - No benchmark that measures cost-quality frontier for *compound* optimizations simultaneously.
207
-
208
- ---
209
-
210
- ## 9. Confidence Calibration
211
-
212
- ### What Exists
213
-
214
- | Paper | Key Result | Practicality |
- |-------|-----------|--------------|
- | **Self-Calibration** (2025) | Distills self-consistency into model confidence | ★★★★☆ SFT required |
- | **Agentic Confidence Calibration** (2026) | Trajectory-level calibration | ★★★★☆ Multi-agent |
- | **Black-Box Reliability** (2026) | Self-consistency + conformal calibration | ★★★★★ Distribution-free guarantees |
219
-
220
- ### What Is Useful
221
- - Self-consistency-based confidence is the most reliable signal for black-box APIs.
222
- - Calibration enables better routing, early stopping, and verifier gating.
223
-
224
- ### What Is Missing
225
- - No calibration method specifically for multi-step agent traces with tool outcomes.
226
-
227
- ---
228
-
229
- ## 10. Retrieval Gating & Context Selection
230
-
231
- ### What Exists
232
-
233
- | Paper | Key Result | Practicality |
- |-------|-----------|--------------|
- | **CacheBlend** (2024) | Selective KV recompute for RAG | ★★★★☆ RAG-specific |
- | **DynamicKV** (2024) | Task-aware KV compression | ★★★★☆ Long-context |
- | **CompressKV** (2025) | Semantic retrieval heads for token importance | ★★★☆☆ Research |
238
-
239
- ### What Is Useful
240
- - **Context budgeters** that select which chunks to include based on predicted relevance are practical.
241
- - **H2O** shows that only 20% of KV cache is needed.
242
-
243
- ### What Is Missing
244
- - No "learned context selector" that is trained end-to-end with the router to maximize task success per token.
245
-
246
- ---
247
-
248
- ## Recommendations for Implementation
249
-
250
- ### Phase 1 (Immediate)
251
- 1. **Deploy FrugalGPT cascade**: 50-98% cost reduction, simple scoring model
- 2. **Enable prefix caching**: free optimization for repeated system/tool prompts
- 3. **Replace self-consistency with ESC**: 33-84% sampling reduction, zero accuracy loss
254
-
255
- ### Phase 2 (Short-term)
256
- 4. **Train self-calibration model**: enables confidence-based routing and early stopping
- 5. **Implement AWO meta-tools**: collect 100+ traces, extract hot paths
- 6. **Build heuristic verifier budgeter**: risk-weighted selective verification
259
-
260
- ### Phase 3 (Medium-term)
261
- 7. **Deploy BAAR step-level routing**: GRPO-trained router for multi-turn agents
- 8. **Add self-healing tool graph**: Dijkstra routing for API-heavy agents
- 9. **Implement doom detector**: multi-signal early termination
264
-
265
- ### Phase 4 (Long-term)
266
- 10. **Train unified compound optimizer**: jointly optimize all 10 dimensions
- 11. **Online learning from traces**: update policies based on real deployment outcomes
- 12. **Cross-agent cache sharing**: KVCOMM-style sharing for multi-agent systems
269
-
270
- ---
271
-
272
- ## Key Gaps & Opportunities
273
-
274
- | Gap | Opportunity |
- |-----|-------------|
- | No unified compound optimizer | **Agent Cost Optimizer** fills this gap |
- | No benchmark for compound optimization | Create AgentCostBench |
- | No online learning for routing | Deploy Thompson sampling / contextual bandits |
- | No verifier cascading | Build cheap → expensive verifier chains |
- | No cache budgeter | Learn which prefixes to cache |
- | No meta-tool validation | A/B test compressed vs. original workflows |
- | No trajectory-level calibration | Extend Self-Calibration to multi-step |
 
1
+ # Literature Review: Cost-Aware Agent Routing
2
 
3
+ ## What Exists
4
 
5
+ ### Model Routing
6
 
7
+ **RouteLLM** (2406.18665, UC Berkeley/LMSYS, 2024): Trains a BERT-based router on Chatbot Arena preference data. Achieves 2x+ cost reduction without sacrificing quality. A simple BERT classifier is surprisingly effective. Does NOT use execution feedback; it routes on input features only.
8
 
9
+ **HybridLLM** (2404.14618, 2024): Probabilistic router predicts Pr[H(x) ≥ 0], the probability that the quality gap H(x) is favorable for the small model. Uses BART score as the quality proxy. 40% fewer calls to the large model with no quality drop.
10
 
11
+ **CARROT** (SPROUT dataset, 2025): Multi-model routing benchmark with per-model scores and token counts across 13 models and 44K prompts. Provides ground truth for which model succeeds on which task.
12
 
13
+ ### Cascade Inference
 
 
 
 
 
 
 
 
14
 
15
+ **Cascade Routing** (2410.10347, ETH SRI, ICLR 2025): Unifies routing (ex-ante) with cascading (post-hoc). Key finding: **quality estimators are the #1 factor**. Post-hoc estimates dramatically outperform ex-ante. Low σ_post benefits cascading; low σ_ante benefits routing. The combination wins.
 
 
 
16
 
17
+ **RouteNLP** (2604.23577, 2026): Three-component system: difficulty-aware router + confidence-calibrated cascading + distillation-routing co-optimization. Token-level uncertainty u(m,x) = (1/L) Σ_i (1 - p(y_i|x)), computed from softmax probabilities and averaged over the L output tokens. Conformal risk control with α=0.05. **58% cost reduction in production** (5K queries/day).
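The token-level uncertainty score quoted for RouteNLP can be sketched in a few lines. This is a minimal illustration under assumptions, not RouteNLP's implementation: it assumes per-token log-probabilities of the generated tokens are available, and the names `token_uncertainty` and `should_escalate`, the example inputs, and the threshold are hypothetical.

```python
import math
from typing import Sequence


def token_uncertainty(token_logprobs: Sequence[float]) -> float:
    """Mean of (1 - p(y_i | x)) over the L generated tokens."""
    if not token_logprobs:
        raise ValueError("need at least one generated token")
    # Convert each log-probability back to a probability.
    probs = [math.exp(lp) for lp in token_logprobs]
    return sum(1.0 - p for p in probs) / len(probs)


def should_escalate(token_logprobs: Sequence[float], threshold: float) -> bool:
    """Escalate to a stronger model when uncertainty exceeds the threshold.

    The threshold is a hypothetical parameter; in a conformal setup it
    would be calibrated on held-out data rather than hand-picked.
    """
    return token_uncertainty(token_logprobs) > threshold


# Made-up log-probabilities for two cheap-model outputs.
confident = [math.log(0.9), math.log(0.8), math.log(1.0)]
unsure = [math.log(0.5), math.log(0.25)]

print(should_escalate(confident, threshold=0.3))
print(should_escalate(unsure, threshold=0.3))
```

With these made-up inputs, the first output averages to an uncertainty of about 0.1 and stays on the cheap model, while the second averages to about 0.625 and is escalated.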
 
 
18
 
19
+ **CP-Router** (2505.19970, 2025): Training-free, uncertainty-aware routing between an LLM and a Large Reasoning Model. Uses the entropy of the cheap model's output as the escalation signal. No training required: compute the entropy and compare it to a conformal threshold.
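One way to read "compare entropy to a conformal threshold" is a split-conformal quantile over calibration entropies. The sketch below is a simplified illustration under that assumption and is not CP-Router's actual procedure: the calibration scores are made-up entropies of cheap-model outputs on prompts it answered acceptably, and `conformal_threshold` and `route` are hypothetical names.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def conformal_threshold(calibration_scores: Sequence[float], alpha: float) -> float:
    """Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest score.

    Under exchangeability, a new score drawn like the calibration scores
    exceeds this threshold with probability at most alpha.
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: never escalate.
        return math.inf
    return sorted(calibration_scores)[rank - 1]


def route(probs: Sequence[float], threshold: float) -> str:
    """Send high-entropy outputs to the large model, keep the rest cheap."""
    return "large" if entropy(probs) > threshold else "cheap"


# Made-up calibration entropies (hypothetical data).
calibration = [0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.90]
threshold = conformal_threshold(calibration, alpha=0.25)

print(threshold)
print(route([0.97, 0.01, 0.01, 0.01], threshold))
print(route([0.25, 0.25, 0.25, 0.25], threshold))
```

The design choice here is that no model is trained: the only fitted quantity is one quantile, so the router can be recalibrated cheaply whenever the cheap model or the traffic mix changes.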
 
 
20
 
21
+ ### Agentic Routing
22
 
23
+ **BAAR** (2602.21227, 2026): Budget-Aware Agentic Routing via Boundary-Guided Training. Trains a router (Qwen2.5-7B) to decide per step which model to use. Two-phase training: BoSFT (difficulty taxonomy: Easy/Hard/Intractable) + BoPO (GRPO with boundary-relative rewards). Generalizes to strict per-task budget constraints.
24
 
25
+ **BEST-Route** (2506.22716, Microsoft, 2025): Generates best-of-n samples from cheap model, selects best via proxy reward model. Router predicts both model and number of samples n. Up to 60% cost reduction with <1% performance drop.
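The selection step behind best-of-n sampling can be sketched as follows. This is a generic illustration, not BEST-Route's code: `sample_fn` stands in for a call to the cheap model, `reward_fn` for the proxy reward model, and both stubs below (canned responses, length as reward) are hypothetical placeholders.

```python
from typing import Callable, List, Tuple


def best_of_n(
    prompt: str,
    sample_fn: Callable[[str], str],
    reward_fn: Callable[[str, str], float],
    n: int,
) -> Tuple[str, float]:
    """Draw n samples from the cheap model and keep the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates: List[str] = [sample_fn(prompt) for _ in range(n)]
    scored = [(reward_fn(prompt, c), c) for c in candidates]
    # max() returns the first maximal element, so ties keep the earliest sample.
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# Hypothetical stand-ins: a canned sampler and a length-based "reward".
_canned = iter(["a", "abcd", "ab"])


def fake_sampler(prompt: str) -> str:
    return next(_canned)


def fake_reward(prompt: str, response: str) -> float:
    return float(len(response))


answer, score = best_of_n("2+2?", fake_sampler, fake_reward, n=3)
print(answer, score)
```

In the routed setting described above, `n` would itself be predicted per query, so easy prompts use a single cheap sample and harder ones spend more samples before falling back to the expensive model.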
26
 
27
+ ### Execution Feedback
 
 
 
 
 
 
 
28
 
29
+ **ClawTrace** (2604.23853): Per-step cost attribution in agent traces. TraceCard format with USD cost + token counts + redundancy flags. **Prune patches cut median cost 32%.**
 
 
 
30
 
31
+ **LLMRouterBench** (2601.07206): 400K instances, 21 datasets, 33 models. Finding: **Simple baselines often match complex routers.** Model complementarity is real but hard to exploit.
 
 
32
 
33
+ ### Failure Prediction
 
 
34
 
35
+ **AgentRewardBench** (2504.08942): 1,302 web agent trajectories with expert success/side-effect/repetition labels across 5 benchmarks and 4 LLMs.
36
 
37
+ ### Selective Verification
38
 
39
+ **Process Reward Models** (multiple): Train verifier to score intermediate steps. Use only when confidence is low or risk is high. Reduces verification cost by 70-90% while maintaining safety.
40
 
41
+ ## What Is Useful
 
 
 
 
 
 
42
 
43
+ | Paper | Key Takeaway | Applied In ACO |
+ |-------|-------------|---------------|
+ | RouteNLP | Conformal cascading with token-level uncertainty | Execution-feedback router (module 4) |
+ | Cascade Routing | Post-hoc >> ex-ante quality estimates | v9 feedback escalation |
+ | BAAR | Per-step routing with difficulty taxonomy | Per-step router (module 3) |
+ | BEST-Route | Best-of-N cheap sampling + reward model | Planned next step |
+ | CP-Router | Training-free entropy-based escalation | Simple fallback router |
+ | ClawTrace | Per-step cost attribution | Telemetry schema |
+ | SPROUT | Multi-model eval data | v11 training data |
52
 
53
+ ## What Is Overkill
 
54
 
55
+ - **Full agent simulation environments** (SciWorld, ALFWorld): we don't need to simulate the entire agent, just route each step
+ - **GRPO-based RL training** (BAAR): XGBoost with real data outperforms RL with synthetic data
+ - **Distillation-routing co-optimization** (RouteNLP): we're not training task-specific models
+ - **Complex multi-stage pipelines**: simple cascade + feedback is 80% of the benefit
59
 
60
+ ## What Is Missing
61
 
62
+ 1. **Execution-feedback routing with real model logprobs**: all work uses simulated or API-provided logprobs
+ 2. **Conformal calibration for agent routing**: no paper provides distribution-free quality guarantees
+ 3. **Per-step routing with per-step training data**: BAAR routes per step but trains on task-level outcomes
+ 4. **Cost-quality Pareto frontier construction**: no paper constructs the full frontier, only point comparisons
+ 5. **Real agent benchmarks with cost data**: SWE-Router is the only dataset with real USD costs per task
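Constructing a cost-quality Pareto frontier, as mentioned in item 4, needs only a sort and one pass. The sketch below is a generic illustration with made-up configurations; the `RunResult` type, its field names, and the numbers are hypothetical and do not come from any cited paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RunResult:
    name: str
    cost_usd: float   # lower is better
    quality: float    # higher is better


def pareto_frontier(results: List[RunResult]) -> List[RunResult]:
    """Return the non-dominated results, ordered by increasing cost.

    A result is kept only if it is strictly better in quality than every
    cheaper (or equally cheap) result already kept.
    """
    # Sort by cost; among equal costs, put the highest quality first.
    ordered = sorted(results, key=lambda r: (r.cost_usd, -r.quality))
    frontier: List[RunResult] = []
    best_quality = float("-inf")
    for r in ordered:
        if r.quality > best_quality:
            frontier.append(r)
            best_quality = r.quality
    return frontier


# Made-up routing configurations (hypothetical data).
runs = [
    RunResult("A", 1.0, 0.60),
    RunResult("B", 2.0, 0.55),  # dominated by A
    RunResult("C", 3.0, 0.80),
    RunResult("D", 3.0, 0.70),  # dominated by C
    RunResult("E", 5.0, 0.80),  # same quality as C at higher cost
]

print([r.name for r in pareto_frontier(runs)])
```

Sweeping a router's escalation threshold and feeding each resulting (cost, quality) pair through this function yields the full frontier rather than a single point comparison.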
67
 
68
+ ## What To Implement First
69
 
70
+ 1. **Execution-feedback escalation** (RouteNLP pattern): highest ROI, validated in production
+ 2. **Per-tier XGBoost with real data** (our v10/v11 approach): simple, effective, requires real traces
+ 3. **Per-step routing** (BAAR pattern): significant savings from routing steps differently
+ 4. **Conformal calibration** (CP-Router pattern): safety guarantees without training
+ 5. **Best-of-N cheap sampling** (BEST-Route pattern): orthogonal improvement to routing
 
 
 
75
 
76
+ Priority: Execution feedback > Real data training > Per-step routing > Conformal calibration > Best-of-N