narcolepticchicken committed on
Commit 023e5d7 · verified · 1 Parent(s): dce68d9

Upload docs/literature_review.md

Files changed (1)
  1. docs/literature_review.md +282 -0
docs/literature_review.md ADDED
@@ -0,0 +1,282 @@
+ # Literature Review: Agent Cost Optimization
+
+ ## Executive Summary
+
+ This literature review synthesizes findings from 50+ papers across model routing, context compression, tool-use optimization, verifier gating, and failure recovery. The key insight: **compound optimization** (routing + caching + selective verification + meta-tools) has been studied piecemeal but never as a unified system. This gap is the core opportunity for the Agent Cost Optimizer.
+
+ ---
+
+ ## 1. Model Routing & Cascade Inference
+
+ ### What Exists
+
+ | Paper | Key Result | Practicality |
+ |-------|-----------|--------------|
+ | **FrugalGPT** (Chen et al., 2023) | 98.3% cost reduction on HEADLINES matching GPT-4 accuracy | ★★★★★ Simplest cascade: three tiers gated by a scoring model |
+ | **RouteLLM** (Ong et al., 2024) | Preference-trained router, 40%+ cost reduction | ★★★★☆ Requires preference data |
+ | **RouterBench** (Hu et al., 2024) | 405K outcomes, 14 LLMs; standard benchmark | ★★★★★ Go-to dataset |
+ | **R2-Router** (2026) | Joint model + length selection | ★★★★☆ Extends routing to output budgeting |
+ | **xRouter** (2025) | RL-trained cost-aware router | ★★★★☆ End-to-end RL but needs online env |
+ | **Arch-Router** (2025) | 1.5B model matches proprietary routers | ★★★★★ Production-ready |
+ | **BAAR** (2026) | Step-level routing with GRPO, dominates Pareto frontier | ★★★★☆ Best for multi-turn agents |
+
+ ### What Is Useful
+ - **FrugalGPT cascade** is the simplest deployable win: cheap → medium → frontier, with a small scoring model gating each level.
+ - **RouterBench** provides ready-made training and evaluation data for routers.
+ - **BAAR's boundary-guided SFT + GRPO** is the strongest approach for step-level agent routing.
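The cascade idea in the first bullet can be sketched as follows. This is a minimal illustration under stated assumptions, not FrugalGPT's actual implementation: the `Tier` class, the `cascade` function, the cost units, the thresholds, and the toy scorer are all hypothetical stand-ins, and a real deployment would replace the lambdas with model calls and the scorer with a trained scoring model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Tier:
    name: str
    cost: float                     # hypothetical cost units per call
    generate: Callable[[str], str]  # stand-in for a real model call
    threshold: float                # minimum score needed to accept the answer


def cascade(query: str, tiers: List[Tier],
            scorer: Callable[[str, str], float]) -> Tuple[str, str, float]:
    """Try tiers cheapest-first; return (answer, tier_name, total_cost)."""
    if not tiers:
        raise ValueError("at least one tier is required")
    spent = 0.0
    for i, tier in enumerate(tiers):
        answer = tier.generate(query)
        spent += tier.cost
        is_last = i == len(tiers) - 1
        # Accept when the scorer is confident enough, or when no tier is left.
        if is_last or scorer(query, answer) >= tier.threshold:
            return answer, tier.name, spent
    raise RuntimeError("unreachable: the last tier always returns")


# Toy stand-ins so the sketch runs without any API access.
def toy_scorer(query: str, answer: str) -> float:
    return 0.0 if answer == "unsure" else 0.9


TIERS = [
    Tier("cheap", 1.0, lambda q: "4" if q == "2+2" else "unsure", 0.8),
    Tier("medium", 5.0, lambda q: "unsure", 0.8),
    Tier("frontier", 30.0, lambda q: "detailed answer", 0.0),
]

print(cascade("2+2", TIERS, toy_scorer))
print(cascade("hard question", TIERS, toy_scorer))
```

Note that an escalated query pays for every tier it passed through, so the thresholds trade accepted-answer quality against cumulative cost.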
+
+ ### What Is Overkill
+ - Methods requiring online interaction during training (some bandit approaches) are hard to deploy.
+ - Methods assuming static API graphs don't adapt to changing tool catalogs.
+
+ ### What Is Missing
+ - No unified router that jointly optimizes: model, context length, tool batching, and verification in one decision.
+ - No router trained on agent traces with multi-step outcomes.
+
+ ---
+
+ ## 2. Prompt Caching & Context Compression
+
+ ### What Exists
+
+ | Paper | Key Result | Practicality |
+ |-------|-----------|--------------|
+ | **CacheGen** (2023) | KV cache compression via adaptive quantization | ★★★★☆ Reduces bandwidth |
+ | **H2O** (2023) | 20% KV budget maintains accuracy; up to 29× throughput | ★★★★★ Already in vLLM |
+ | **StreamingLLM** (2023) | Attention sinks for infinite-length generation | ★★★★★ Production standard |
+ | **CacheBlend** (2024) | Selective recompute for RAG KV caches | ★★★★☆ Best for RAG |
+ | **EpiCache** (2025) | Episodic KV management for long QA | ★★★★☆ Apple, strong for chat |
+ | **KVCOMM** (2025) | Cross-agent KV cache sharing | ★★★★☆ Multi-agent systems |
+
+ ### What Is Useful
+ - **Prefix caching** (vLLM/SGLang/DeepSeek) gives ~50% cost reduction on repeated system prompts.
+ - **H2O/StreamingLLM** are essential for long-context agents.
+ - **Cache-aware prompt layout** (stable prefix + dynamic suffix) is a free optimization.
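The stable-prefix layout from the last bullet can be illustrated with a small sketch. The function names, the prompt template, and the example tools are assumptions made for illustration; the only point shown is that canonicalizing and front-loading the unchanging parts makes consecutive prompts share a long common prefix, which is what a prefix cache can reuse.

```python
import json
from typing import Dict, List


def build_prompt(system: str, tools: List[Dict[str, str]],
                 history: List[str], user_msg: str) -> Dict[str, str]:
    """Put everything stable first so consecutive calls share a long prefix."""
    # Canonical ordering keeps the tool block byte-identical across calls.
    tool_block = json.dumps(sorted(tools, key=lambda t: t["name"]), sort_keys=True)
    prefix = f"{system}\n\nTOOLS:\n{tool_block}\n\n"
    # Volatile content (history, current message) goes last.
    suffix = "\n".join(history + [f"USER: {user_msg}"])
    return {"prefix": prefix, "suffix": suffix, "full": prefix + suffix}


def shared_prefix_chars(a: str, b: str) -> int:
    """Length of the common leading substring of two prompts."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


SYSTEM = "You are a billing support agent."
TOOLS_A = [{"name": "refund", "desc": "issue a refund"},
           {"name": "lookup", "desc": "find an order"}]
TOOLS_B = list(reversed(TOOLS_A))  # same tools, different registration order

p1 = build_prompt(SYSTEM, TOOLS_A, [], "Where is order 17?")
p2 = build_prompt(SYSTEM, TOOLS_B, ["USER: hi", "AGENT: hello"], "Refund order 17")
print(shared_prefix_chars(p1["full"], p2["full"]), len(p1["prefix"]))
```

Anything that varies per request (timestamps, request IDs, retrieved documents) would break the shared prefix if placed before the tool block, so in this layout it belongs in the suffix.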
+
+ ### What Is Overkill
+ - Full KV cache compression methods (CacheGen, KVzip) require inference-system integration that most agent harnesses don't control.
+ - Cross-agent KV sharing (KVCOMM) is niche.
+
+ ### What Is Missing
+ - No "cache budgeter" that decides *which* context to cache based on predicted reuse frequency.
+ - No cost-aware context eviction policy for agents with mixed short/long tasks.
+
+ ---
+
+ ## 3. Tool-Use Routing & Optimization
+
+ ### What Exists
+
+ | Paper | Key Result | Practicality |
+ |-------|-----------|--------------|
+ | **Graph-Based Self-Healing Tool Routing** (2026) | 93% control-plane LLM call reduction | ★★★★★ Deterministic graph routing |
+ | **Optimizing Agentic Workflows (AWO)** (2026) | 11.9% LLM call reduction, +4.2pp success | ★★★★★ Meta-tool extraction |
+ | **Less is More** (2024) | Reducing tool count improves edge performance | ★★★☆☆ Edge-specific |
+ | **Small Model as Master Orchestrator** (2026) | Lightweight orchestrator for parallel decomposition | ★★★★☆ Unified action space |
+ | **CASTER** (2026) | Dual-signal router for multi-agent graph workflows | ★★★★☆ Graph-based systems |
+
+ ### What Is Useful
+ - **Self-Healing Tool Routing** eliminates LLM calls for 93% of tool decisions by using Dijkstra on a cost-weighted tool graph.
+ - **AWO meta-tools** compress repeated multi-step patterns into deterministic macros.
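As a rough sketch of Dijkstra routing over a cost-weighted tool graph, the example below picks the cheapest tool chain and re-plans when a tool is marked as failed. The graph, the edge costs, and the `failed` set used to model "self-healing" are assumptions for illustration and are not taken from the paper.

```python
import heapq
from typing import Dict, List, Optional, Set, Tuple

Graph = Dict[str, Dict[str, float]]  # node -> {next_node: edge_cost}


def cheapest_route(graph: Graph, start: str, goal: str,
                   failed: Optional[Set[str]] = None) -> Tuple[float, List[str]]:
    """Dijkstra shortest path that skips tools currently marked as failed."""
    failed = failed or set()
    frontier: List[Tuple[float, str, List[str]]] = [(0.0, start, [start])]
    best = {start: 0.0}
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, weight in graph.get(node, {}).items():
            if nxt in failed:
                continue
            new_cost = cost + weight
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(frontier, (new_cost, nxt, path + [nxt]))
    return float("inf"), []  # no route left


# Hypothetical tool graph; weights are illustrative cost units.
TOOL_GRAPH: Graph = {
    "query": {"cache_lookup": 1.0, "web_search": 4.0},
    "cache_lookup": {"answer": 1.0},
    "web_search": {"summarize": 2.0},
    "summarize": {"answer": 1.0},
}

print(cheapest_route(TOOL_GRAPH, "query", "answer"))
print(cheapest_route(TOOL_GRAPH, "query", "answer", failed={"cache_lookup"}))
```

In this sketch no LLM call is needed to choose the route; an LLM would only be consulted when the search returns no path.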
+
+ ### What Is Overkill
+ - Full multi-agent orchestration frameworks (CASTER) are powerful but heavy for simple tool pipelines.
+
+ ### What Is Missing
+ - No "tool necessity predictor" that decides *whether* to call a tool at all, not just which one.
+ - No cost-aware batching that groups independent tool calls while respecting dependencies.
+
+ ---
+
+ ## 4. Verifier Gating & Selective Verification
+
+ ### What Exists
+
+ | Paper | Key Result | Practicality |
+ |-------|-----------|--------------|
+ | **Self-Calibration** (2025) | ECE 13.70 → 3.79, accuracy ↑3pp | ★★★★☆ Requires SFT |
+ | **ESC** (2024) | 33-84% sampling cost reduction at comparable accuracy | ★★★★★ Drop-in replacement |
+ | **SmartSnap** (2025) | Proactive evidence seeking for self-verification | ★★★★☆ RL-based |
+ | **The Art of Building Verifiers** (2026) | 4 design principles for computer-use agents | ★★★★★ Practical framework |
+ | **Generalized Correctness Model** (2025) | Cross-model verification, selective deferral | ★★★☆☆ Needs multi-model labels |
+ | **Agentic Confidence Calibration** (2026) | Trajectory-based calibration across systems | ★★★★☆ Multi-agent focus |
+
+ ### What Is Useful
+ - **Early-Stopping Self-Consistency (ESC)** is the highest-ROI change: replace standard self-consistency with window-based stopping.
+ - **Self-Calibration** enables single-forward-pass confidence for routing and early stopping.
+ - **Heuristic verifier budgeter** (risk-weighted) is sufficient for most agents.
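Window-based stopping, as mentioned for ESC above, might look like the sketch below. It assumes the simplest stopping rule (stop as soon as every answer in one window agrees, otherwise fall back to a majority vote once the sample budget is used up); the function names, window size, and scripted sampler are hypothetical, and the paper's exact criterion may differ in detail.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple


def esc_sample(sample_answer: Callable[[], str], window: int = 5,
               max_samples: int = 40) -> Tuple[str, int]:
    """Sample in windows; stop early when one full window is unanimous.

    Returns (chosen_answer, number_of_samples_drawn).
    """
    answers: List[str] = []
    while len(answers) < max_samples:
        size = min(window, max_samples - len(answers))
        batch = [sample_answer() for _ in range(size)]
        answers.extend(batch)
        # Only a full-size window counts as evidence for early stopping.
        if len(batch) == window and len(set(batch)) == 1:
            return batch[0], len(answers)
    # Budget exhausted: standard self-consistency majority vote.
    return Counter(answers).most_common(1)[0][0], len(answers)


def scripted(seq: Iterable[str]) -> Callable[[], str]:
    """Deterministic stand-in for sampling a model at temperature > 0."""
    it = iter(seq)
    return lambda: next(it)


print(esc_sample(scripted(["7", "8", "7", "7", "7", "7"]), window=3, max_samples=9))
print(esc_sample(scripted(["a", "b", "b", "a", "a"]), window=2, max_samples=5))
```

The first call stops after six samples instead of nine because the second window is unanimous; the second call never sees a unanimous window and falls back to the vote.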
+
+ ### What Is Overkill
+ - Training a Generalized Correctness Model across 10+ LLMs is expensive and data-hungry.
+ - Formal verification frameworks (VeriGuard) are essential only for safety-critical applications.
+
+ ### What Is Missing
+ - No verifier that can estimate its *own* reliability per task type and adjust thresholds.
+ - No framework for "verifier cascading" (cheap verifier first, expensive one only on disagreement).
+
+ ---
+
+ ## 5. Early Exit & Failure Detection
+
+ ### What Exists
+
+ | Paper | Key Result | Practicality |
+ |-------|-----------|--------------|
+ | **VLAA-GUI** (2026) | Modular framework for GUI agent stopping/loop breaking | ★★★★★ Modular |
+ | **LYNX** (2025) | Hidden-state early-exit for reasoning | ★★★★☆ Requires model access |
+ | **SpecExit** (2025) | Speculative exit for reasoning models | ★★★★☆ Reduces generation length |
+ | **FAMA** (2026) | Failure-aware meta-agent, +4.6-11.6% on τ-bench | ★★★★☆ Failure clustering |
+ | **Confidence Dichotomy** (2026) | Tool-use agents have task-specific calibration | ★★★★☆ RL calibration |
+
+ ### What Is Useful
+ - **Doom detection via signal aggregation** (repeated failures, cost explosion, stagnant progress) is the practical approach.
+ - **FAMA's failure clustering** identifies dominant error patterns for targeted recovery.
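A minimal sketch of doom detection by signal aggregation follows, covering the three signals named in the first bullet. The `Step` fields, the thresholds, and the "stop when at least two signals fire" rule are assumptions chosen for illustration, not values from any of the cited papers.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Step:
    ok: bool         # did the step succeed?
    cost: float      # hypothetical cost units spent on the step
    progress: float  # task-specific progress estimate in [0, 1]


def doom_signals(steps: List[Step], budget: float,
                 fail_streak: int = 3, stall_window: int = 4) -> Dict[str, bool]:
    """Compute the three signals over the run so far."""
    # Count consecutive failures at the end of the trace.
    streak = 0
    for step in reversed(steps):
        if step.ok:
            break
        streak += 1
    spent = sum(s.cost for s in steps)
    # Stagnation: progress at the end of the window is no higher than at its start.
    recent = steps[-stall_window:]
    stalled = (len(recent) == stall_window
               and recent[-1].progress <= recent[0].progress)
    return {
        "repeated_failures": streak >= fail_streak,
        "cost_explosion": spent > budget,
        "stagnant_progress": stalled,
    }


def should_stop(signals: Dict[str, bool], min_signals: int = 2) -> bool:
    """Terminate only when several independent signals agree."""
    return sum(signals.values()) >= min_signals


HEALTHY = [Step(True, 1.0, 0.2), Step(True, 1.0, 0.5), Step(False, 1.0, 0.5)]
DOOMED = [Step(True, 1.0, 0.3)] + [Step(False, 3.0, 0.3) for _ in range(4)]

print(should_stop(doom_signals(HEALTHY, budget=10.0)))
print(should_stop(doom_signals(DOOMED, budget=10.0)))
```

Requiring agreement between signals is one way to reduce false stops; a single failed step or a brief plateau does not end the run on its own.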
+
+ ### What Is Overkill
+ - Hidden-state early exit requires access to model weights, which is not available for API-only agents.
+ - Speculative exit requires model architecture changes.
+
+ ### What Is Missing
+ - No "run health score" that combines all signals into a single terminate/continue decision with calibrated confidence.
+ - No online learning from false-stop vs. false-continue outcomes.
+
+ ---
+
+ ## 6. Test-Time Compute Allocation
+
+ ### What Exists
+
+ | Paper | Key Result | Practicality |
+ |-------|-----------|--------------|
+ | **Trust but Verify Survey** (2025) | Comprehensive taxonomy of TTS methods | ★★★★★ Reference |
+ | **PAG** (2025) | Policy-as-Verifier for multi-turn self-correction | ★★★★☆ RL framework |
+ | **Compute-Optimal Scaling** (various) | Best-of-N and PRM for reasoning | ★★★★☆ Math-focused |
+ | **Self-Healing Tool Routing** (2026) | Cost-weighted graph routing as compute allocation | ★★★★★ Practical |
+
+ ### What Is Useful
+ - **Best-of-N with early stopping** (ESC) is the most practical test-time scaling optimization.
+ - **Process Reward Models** are powerful but require training data.
+
+ ### What Is Overkill
+ - Full MCTS search over reasoning paths is too expensive for most agent tasks.
+
+ ### What Is Missing
+ - No adaptive compute allocator that distributes budget across: routing, tool calls, verification, and model strength.
+
+ ---
+
+ ## 7. Meta-Tool & Workflow Compression
+
+ ### What Exists
+
+ | Paper | Key Result | Practicality |
+ |-------|-----------|--------------|
+ | **AWO** (2026) | State-graph merging, hot-path extraction | ★★★★★ No training needed |
+ | **Agent-as-Tool** (2026) | Unified parallel orchestration | ★★★★☆ Standardized action space |
+ | **WebClipper** (2026) | Graph-based trajectory pruning for web agents | ★★★★☆ Web-specific |
+
+ ### What Is Useful
+ - **AWO's trace → state graph → hot path → meta-tool** pipeline is directly applicable.
+ - Meta-tools eliminate LLM reasoning for known sub-routines.
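To make the hot-path step concrete, here is a deliberately simplified sketch that mines frequent contiguous tool-call sequences from traces. It swaps AWO's state-graph merging for plain n-gram counting, so it is a stand-in for the idea rather than the paper's method; the function name, the traces, and the support threshold are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Pattern = Tuple[str, ...]


def hot_paths(traces: List[List[str]], n: int = 3,
              min_support: int = 2) -> List[Tuple[Pattern, int]]:
    """Return length-n tool sequences that occur in at least min_support traces."""
    counts: Counter = Counter()
    for trace in traces:
        # Use a set so each trace contributes at most once per pattern.
        seen = {tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)}
        counts.update(seen)
    frequent = [(p, c) for p, c in counts.items() if c >= min_support]
    # Most frequent first; ties broken alphabetically for stable output.
    return sorted(frequent, key=lambda pc: (-pc[1], pc[0]))


TRACES = [
    ["login", "search", "open", "extract", "logout"],
    ["login", "search", "open", "extract", "summarize"],
    ["search", "open", "extract", "email"],
]

for pattern, support in hot_paths(TRACES):
    print(support, " -> ".join(pattern))
```

Each returned pattern is a candidate meta-tool: a fixed sequence that could run as one deterministic macro instead of three separate LLM-chosen steps.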
+
+ ### What Is Overkill
+ - Full workflow synthesis from scratch is complex; incremental mining from traces is better.
+
+ ### What Is Missing
+ - No "meta-tool validation" step that checks whether a compressed workflow still handles edge cases.
+ - No A/B testing framework for meta-tool vs. original LLM-based execution.
+
+ ---
+
+ ## 8. Cost-Quality Frontiers
+
+ ### What Exists
+
+ | Paper | Key Result | Practicality |
+ |-------|-----------|--------------|
+ | **RouterBench** (2024) | Systematic cost-quality evaluation framework | ★★★★★ Standard |
+ | **R2-Bench** (2026) | Length-constrained cost-quality benchmark | ★★★★☆ Extends RouterBench |
+ | **BAAR** (2026) | Pareto frontier on ALFWorld, SciWorld, AppWorld | ★★★★☆ Interactive agents |
+
+ ### What Is Useful
+ - RouterBench provides the evaluation protocol.
+ - Pareto frontier plotting (cost vs. accuracy) is the correct way to compare systems.
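Computing the frontier itself is short. The sketch below keeps only the systems that are not dominated, meaning no other system is both cheaper (or equal in cost) and more accurate. The system names and the cost and accuracy numbers are made up for illustration.

```python
from typing import List, Tuple

Point = Tuple[str, float, float]  # (system name, cost, accuracy)


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Keep points for which no cheaper-or-equal point has higher-or-equal accuracy."""
    frontier: List[Point] = []
    best_acc = float("-inf")
    # Scan from cheapest to most expensive; at equal cost, best accuracy first.
    for name, cost, acc in sorted(points, key=lambda p: (p[1], -p[2])):
        if acc > best_acc:  # strictly better than everything cheaper
            frontier.append((name, cost, acc))
            best_acc = acc
    return frontier


# Illustrative numbers only, not measurements.
SYSTEMS: List[Point] = [
    ("cheap-only", 1.0, 0.62),
    ("cascade", 3.0, 0.81),
    ("random-mix", 4.0, 0.70),
    ("frontier-only", 10.0, 0.83),
]

for name, cost, acc in pareto_frontier(SYSTEMS):
    print(f"{name}: cost={cost}, accuracy={acc}")
```

Here `random-mix` drops out because `cascade` is both cheaper and more accurate; the remaining three points are the ones worth plotting as the frontier.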
+
+ ### What Is Missing
+ - No benchmark that measures cost-quality frontier for *compound* optimizations simultaneously.
+
+ ---
+
+ ## 9. Confidence Calibration
+
+ ### What Exists
+
+ | Paper | Key Result | Practicality |
+ |-------|-----------|--------------|
+ | **Self-Calibration** (2025) | Distills self-consistency into model confidence | ★★★★☆ SFT required |
+ | **Agentic Confidence Calibration** (2026) | Trajectory-level calibration | ★★★★☆ Multi-agent |
+ | **Black-Box Reliability** (2026) | Self-consistency + conformal calibration | ★★★★★ Distribution-free guarantees |
+
+ ### What Is Useful
+ - Self-consistency-based confidence is the most reliable signal for black-box APIs.
+ - Calibration enables better routing, early stopping, and verifier gating.
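One simple way to turn self-consistency into a confidence score for a black-box API is the agreement rate of the majority answer, sketched below. The normalization step, the function names, and the 0.7 escalation threshold are assumptions for illustration; the raw agreement rate is not calibrated by itself and would still need a calibration step on held-out data.

```python
from collections import Counter
from typing import List, Tuple


def self_consistency_confidence(answers: List[str]) -> Tuple[str, float]:
    """Return (majority_answer, fraction_of_samples_that_agree_with_it)."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Light normalization so trivially different strings count as the same answer.
    normalized = [a.strip().lower() for a in answers]
    answer, votes = Counter(normalized).most_common(1)[0]
    return answer, votes / len(normalized)


def needs_escalation(confidence: float, threshold: float = 0.7) -> bool:
    """Hypothetical gate: send low-agreement cases to a verifier or stronger model."""
    return confidence < threshold


answer, conf = self_consistency_confidence(["42", "42 ", "41", "42"])
print(answer, conf, needs_escalation(conf))
```

The same score can feed routing, early stopping, and verifier gating, which is why the sampling cost is often shared across those decisions.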
+
+ ### What Is Missing
+ - No calibration method specifically for multi-step agent traces with tool outcomes.
+
+ ---
+
+ ## 10. Retrieval Gating & Context Selection
+
+ ### What Exists
+
+ | Paper | Key Result | Practicality |
+ |-------|-----------|--------------|
+ | **CacheBlend** (2024) | Selective KV recompute for RAG | ★★★★☆ RAG-specific |
+ | **DynamicKV** (2024) | Task-aware KV compression | ★★★★☆ Long-context |
+ | **CompressKV** (2025) | Semantic retrieval heads for token importance | ★★★☆☆ Research |
+
+ ### What Is Useful
+ - **Context budgeters** that select which chunks to include based on predicted relevance are practical.
+ - **H2O** reports that retaining roughly 20% of the KV cache (the heavy-hitter tokens) preserves accuracy on its evaluated tasks.
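A context budgeter of the kind described in the first bullet can be as simple as a greedy selection by relevance per token. In the sketch below the relevance scores are assumed to come from some upstream predictor (for example a retriever score); the chunk names, scores, and token counts are made up, and greedy selection is a heuristic rather than an optimal knapsack solution.

```python
from typing import List, Tuple

Chunk = Tuple[str, float, int]  # (chunk id, predicted relevance, token count)


def select_chunks(chunks: List[Chunk], budget_tokens: int) -> Tuple[List[str], int]:
    """Greedily pick chunks with the best relevance-per-token until the budget is full."""
    usable = [c for c in chunks if c[2] > 0]  # guard against zero-length chunks
    ranked = sorted(usable, key=lambda c: c[1] / c[2], reverse=True)
    chosen: List[str] = []
    used = 0
    for chunk_id, _relevance, tokens in ranked:
        if used + tokens <= budget_tokens:
            chosen.append(chunk_id)
            used += tokens
    return chosen, used


# Illustrative values only.
CHUNKS: List[Chunk] = [
    ("readme", 0.9, 300),
    ("changelog", 0.2, 400),
    ("api_ref", 0.8, 500),
    ("faq", 0.5, 100),
]

print(select_chunks(CHUNKS, budget_tokens=900))
```

With a 900-token budget the low-relevance `changelog` chunk is left out, while the short `faq` chunk is kept because it is cheap relative to its score.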
+
+ ### What Is Missing
+ - No "learned context selector" that is trained end-to-end with the router to maximize task success per token.
+
+ ---
+
+ ## Recommendations for Implementation
+
+ ### Phase 1 (Immediate)
+ 1. **Deploy FrugalGPT cascade**: 50-98% cost reduction, simple scoring model
+ 2. **Enable prefix caching**: Free optimization for repeated system/tool prompts
+ 3. **Replace self-consistency with ESC**: 33-84% sampling reduction at comparable accuracy
+
+ ### Phase 2 (Short-term)
+ 4. **Train self-calibration model**: Enables confidence-based routing and early stopping
+ 5. **Implement AWO meta-tools**: Collect 100+ traces, extract hot paths
+ 6. **Build heuristic verifier budgeter**: Risk-weighted selective verification
+
+ ### Phase 3 (Medium-term)
+ 7. **Deploy BAAR step-level routing**: GRPO-trained router for multi-turn agents
+ 8. **Add self-healing tool graph**: Dijkstra routing for API-heavy agents
+ 9. **Implement doom detector**: Multi-signal early termination
+
+ ### Phase 4 (Long-term)
+ 10. **Train unified compound optimizer**: Jointly optimize all 10 dimensions
+ 11. **Online learning from traces**: Update policies based on real deployment outcomes
+ 12. **Cross-agent cache sharing**: KVCOMM-style sharing for multi-agent systems
+
+ ---
+
+ ## Key Gaps & Opportunities
+
+ | Gap | Opportunity |
+ |-----|-------------|
+ | No unified compound optimizer | **Agent Cost Optimizer** fills this gap |
+ | No benchmark for compound optimization | Create AgentCostBench |
+ | No online learning for routing | Deploy Thompson sampling / contextual bandits |
+ | No verifier cascading | Build cheap → expensive verifier chains |
+ | No cache budgeter | Learn which prefixes to cache |
+ | No meta-tool validation | A/B test compressed vs. original workflows |
+ | No calibration for multi-step traces with tool outcomes | Extend Self-Calibration to multi-step traces |