narcolepticchicken/agent-cost-optimizer

narcolepticchicken committed 1 day ago

Commit 023e5d7 (verified), 1 parent: dce68d9

Upload docs/literature_review.md

Files changed (1)

docs/literature_review.md +282 -0

docs/literature_review.md ADDED

	@@ -0,0 +1,282 @@

+# Literature Review: Agent Cost Optimization
+## Executive Summary
+This literature review synthesizes findings from 50+ papers across model routing, context compression, tool-use optimization, verifier gating, and failure recovery. The key insight: **compound optimization** (routing + caching + selective verification + meta-tools) has been studied piecemeal but never as a unified system. This gap is the core opportunity for the Agent Cost Optimizer.
+---
+## 1. Model Routing & Cascade Inference
+### What Exists
+| Paper | Key Result | Practicality |
+|-------|-----------|--------------|
+| **FrugalGPT** (Chen et al., 2023) | 98.3% cost reduction on HEADLINES while matching GPT-4 accuracy | ★★★★★ Simplest cascade: three model tiers gated by a small scoring model |
+| **RouteLLM** (Ong et al., 2024) | Preference-trained router, 40%+ cost reduction | ★★★★☆ Requires preference data |
+| **RouterBench** (Hu et al., 2024) | 405K outcomes, 14 LLMs — standard benchmark | ★★★★★ Go-to dataset |
+| **R2-Router** (2026) | Joint model + length selection | ★★★★☆ Extends routing to output budgeting |
+| **xRouter** (2025) | RL-trained cost-aware router | ★★★★☆ End-to-end RL but needs online env |
+| **Arch-Router** (2025) | 1.5B model matches proprietary routers | ★★★★★ Production-ready |
+| **BAAR** (2026) | Step-level routing with GRPO, dominates Pareto frontier | ★★★★☆ Best for multi-turn agents |
+### What Is Useful
+- **FrugalGPT cascade** is the simplest deployable win: cheap → medium → frontier, with a small scoring model gating each level.
+- **RouterBench** provides the training data for any router.
+- **BAAR's boundary-guided SFT + GRPO** is the strongest approach for step-level agent routing.
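The cheap → medium → frontier cascade described above can be sketched in a few lines. This is a minimal illustration rather than FrugalGPT's actual implementation: the `Tier` class, the flat per-call costs, the thresholds, and the toy scorer are all hypothetical stand-ins for real model clients and a trained scoring model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Tier:
    name: str
    cost_per_call: float            # hypothetical flat cost per call
    generate: Callable[[str], str]  # stand-in for a real model client
    threshold: float                # minimum score needed to accept this tier's answer


def run_cascade(
    query: str,
    tiers: List[Tier],
    score: Callable[[str, str], float],
) -> Tuple[str, str, float]:
    """Try tiers in order; return (answer, accepting tier name, total cost)."""
    total = 0.0
    for i, tier in enumerate(tiers):
        answer = tier.generate(query)
        total += tier.cost_per_call
        is_last = i == len(tiers) - 1
        # Accept when the scorer is satisfied, or when no stronger tier remains.
        if is_last or score(query, answer) >= tier.threshold:
            return answer, tier.name, total
    raise ValueError("tiers must be non-empty")


def toy_score(query: str, answer: str) -> float:
    # Toy scorer (assumption): treats the literal answer "unsure" as low quality.
    return 0.0 if answer == "unsure" else 1.0


tiers = [
    Tier("cheap", 0.001, lambda q: "4" if q == "2+2" else "unsure", 0.5),
    Tier("medium", 0.01, lambda q: "unsure", 0.5),
    Tier("frontier", 0.1, lambda q: "detailed answer", 0.5),
]
easy = run_cascade("2+2", tiers, toy_score)
hard = run_cascade("prove it", tiers, toy_score)
```

Note that an escalated query pays for every tier it passes through, so the cascade only saves money when the scorer accepts most queries at a cheap tier.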
+### What Is Overkill
+- Methods requiring online interaction during training (some bandit approaches) are hard to deploy.
+- Methods assuming static API graphs don't adapt to changing tool catalogs.
+### What Is Missing
+- No unified router that jointly optimizes: model, context length, tool batching, and verification in one decision.
+- No router trained on agent traces with multi-step outcomes.
+---
+## 2. Prompt Caching & Context Compression
+### What Exists
+| Paper | Key Result | Practicality |
+|-------|-----------|--------------|
+| **CacheGen** (2023) | KV cache compression via adaptive quantization | ★★★★☆ Reduces bandwidth |
+| **H2O** (2023) | 20% KV budget maintains accuracy; 29× throughput | ★★★★★ Already in vLLM |
+| **StreamingLLM** (2023) | Attention sinks for infinite-length generation | ★★★★★ Production standard |
+| **CacheBlend** (2024) | Selective recompute for RAG KV caches | ★★★★☆ Best for RAG |
+| **EpiCache** (2025) | Episodic KV management for long QA | ★★★★☆ Apple, strong for chat |
+| **KVCOMM** (2025) | Cross-agent KV cache sharing | ★★★★☆ Multi-agent systems |
+### What Is Useful
+- **Prefix caching** (vLLM/SGLang/DeepSeek) gives roughly 50% cost reduction on repeated system prompts; actual savings depend on provider pricing and cache hit rate.
+- **H2O/StreamingLLM** are essential for long-context agents.
+- **Cache-aware prompt layout** (stable prefix + dynamic suffix) is a free optimization.
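A cache-aware layout keeps everything that rarely changes at the front of the prompt and serializes it deterministically, so successive requests share the longest possible prefix. The sketch below assumes a plain-text prompt format; the function names and the `TOOLS:` / `USER:` markers are hypothetical and not tied to any serving framework.

```python
import json
import os
from typing import Dict, List


def stable_prefix(system: str, tools: List[Dict]) -> str:
    """Serialize the rarely-changing part deterministically (sorted tools, sorted keys)."""
    tool_block = json.dumps(sorted(tools, key=lambda t: t["name"]), sort_keys=True)
    return f"{system}\n\nTOOLS:\n{tool_block}\n\n"


def build_prompt(system: str, tools: List[Dict], history: List[str], user_msg: str) -> str:
    """Stable prefix first, then the per-request dynamic suffix."""
    dynamic_suffix = "\n".join(history + [f"USER: {user_msg}"])
    return stable_prefix(system, tools) + dynamic_suffix


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two prompts have in common."""
    return len(os.path.commonprefix([a, b]))


system = "You are a helpful agent."
search = {"name": "search", "description": "web search"}
calc = {"name": "calc", "description": "arithmetic"}

# Tool order differs between calls, but the serialized prefix is identical.
p1 = build_prompt(system, [search, calc], [], "hi")
p2 = build_prompt(system, [calc, search], ["USER: hi", "ASSISTANT: hello"], "next")
reusable = shared_prefix_len(p1, p2)
```

Real prefix caches match on tokens or fixed-size blocks rather than characters, so the character count here is only a proxy for how much of the prompt is reusable.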
+### What Is Overkill
+- Full KV cache compression methods (CacheGen, KVzip) require inference-system integration that most agent harnesses don't control.
+- Cross-agent KV sharing (KVCOMM) is niche.
+### What Is Missing
+- No "cache budgeter" that decides *which* context to cache based on predicted reuse frequency.
+- No cost-aware context eviction policy for agents with mixed short/long tasks.
+---
+## 3. Tool-Use Routing & Optimization
+### What Exists
+| Paper | Key Result | Practicality |
+|-------|-----------|--------------|
+| **Graph-Based Self-Healing Tool Routing** (2026) | 93% control-plane LLM call reduction | ★★★★★ Deterministic graph routing |
+| **Optimizing Agentic Workflows (AWO)** (2026) | 11.9% LLM call reduction, +4.2pp success | ★★★★★ Meta-tool extraction |
+| **Less is More** (2024) | Reducing tool count improves edge performance | ★★★☆☆ Edge-specific |
+| **Small Model as Master Orchestrator** (2026) | Lightweight orchestrator for parallel decomposition | ★★★★☆ Unified action space |
+| **CASTER** (2026) | Dual-signal router for multi-agent graph workflows | ★★★★☆ Graph-based systems |
+### What Is Useful
+- **Self-Healing Tool Routing** eliminates LLM calls for 93% of tool decisions by using Dijkstra on a cost-weighted tool graph.
+- **AWO meta-tools** compress repeated multi-step patterns into deterministic macros.
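Routing over a cost-weighted tool graph with Dijkstra's algorithm can be illustrated as follows. This is a sketch under stated assumptions, not the paper's implementation: the graph, node names, and edge costs are hypothetical, and "self-healing" is modeled simply as re-running the search with failed tools excluded.

```python
import heapq
from typing import Dict, FrozenSet, List, Tuple

Graph = Dict[str, Dict[str, float]]


def cheapest_path(
    graph: Graph,
    start: str,
    goal: str,
    disabled: FrozenSet[str] = frozenset(),
) -> Tuple[float, List[str]]:
    """Dijkstra over {node: {neighbor: cost}}; returns (cost, path) or (inf, [])."""
    frontier = [(0.0, start, [start])]
    settled: Dict[str, float] = {}
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if node in settled and settled[node] <= cost:
            continue  # already reached this node at least as cheaply
        settled[node] = cost
        for nxt, weight in graph.get(node, {}).items():
            if nxt not in disabled:
                heapq.heappush(frontier, (cost + weight, nxt, path + [nxt]))
    return float("inf"), []


# Hypothetical tool graph: edge weights are illustrative per-step costs.
graph: Graph = {
    "query": {"cache_lookup": 0.1, "web_search": 1.0},
    "cache_lookup": {"answer": 0.1},
    "web_search": {"summarize": 0.5},
    "summarize": {"answer": 0.2},
}

cost, path = cheapest_path(graph, "query", "answer")
# If cache_lookup fails at runtime, reroute around it without calling an LLM.
healed_cost, healed_path = cheapest_path(
    graph, "query", "answer", disabled=frozenset({"cache_lookup"})
)
```

The LLM is only needed when no path exists, which here shows up as an infinite cost and an empty path.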
+### What Is Overkill
+- Full multi-agent orchestration frameworks (CASTER) are powerful but heavy for simple tool pipelines.
+### What Is Missing
+- No "tool necessity predictor" that decides *whether* to call a tool at all, not just which one.
+- No cost-aware batching that groups independent tool calls while respecting dependencies.
+---
+## 4. Verifier Gating & Selective Verification
+### What Exists
+| Paper | Key Result | Practicality |
+|-------|-----------|--------------|
+| **Self-Calibration** (2025) | ECE 13.70→3.79, accuracy ↑3pp | ★★★★☆ Requires SFT |
+| **ESC** (2024) | 33-84% sampling cost reduction at comparable accuracy | ★★★★★ Drop-in replacement |
+| **SmartSnap** (2025) | Proactive evidence seeking for self-verification | ★★★★☆ RL-based |
+| **The Art of Building Verifiers** (2026) | 4 design principles for computer-use agents | ★★★★★ Practical framework |
+| **Generalized Correctness Model** (2025) | Cross-model verification, selective deferral | ★★★☆☆ Needs multi-model labels |
+| **Agentic Confidence Calibration** (2026) | Trajectory-based calibration across systems | ★★★★☆ Multi-agent focus |
+### What Is Useful
+- **Early-Stopping Self-Consistency (ESC)** is the highest-ROI change: replace standard self-consistency with window-based stopping.
+- **Self-Calibration** enables single-forward-pass confidence for routing and early stopping.
+- **Heuristic verifier budgeter** (risk-weighted) is sufficient for most agents.
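Window-based early stopping for self-consistency can be sketched as below. The idea is to draw samples in small windows and stop as soon as one full window is unanimous, falling back to a majority vote if the budget runs out. The function name, default window size, and the scripted sampler are assumptions for illustration; a real `sample` callable would query the model.

```python
from collections import Counter
from typing import Callable, Iterator, Tuple


def early_stopping_self_consistency(
    sample: Callable[[], str],
    window: int = 5,
    max_samples: int = 40,
) -> Tuple[str, int]:
    """Return (answer, samples used). Stops when a full window agrees."""
    if window < 1 or max_samples < 1:
        raise ValueError("window and max_samples must be positive")
    answers = []
    while len(answers) < max_samples:
        size = min(window, max_samples - len(answers))
        batch = [sample() for _ in range(size)]
        answers.extend(batch)
        # A full, unanimous window is taken as evidence the answer has converged.
        if size == window and len(set(batch)) == 1:
            return batch[0], len(answers)
    # Budget exhausted: fall back to a majority vote over everything sampled.
    return Counter(answers).most_common(1)[0][0], len(answers)


# Scripted sampler (assumption) so the example is deterministic.
stream: Iterator[str] = iter(["7", "8", "7", "7", "7", "7", "7", "7", "7", "7", "7", "7"])
answer, used = early_stopping_self_consistency(lambda: next(stream), window=3, max_samples=12)

noisy: Iterator[str] = iter(["1", "2", "1", "3", "1", "2"])
fallback_answer, fallback_used = early_stopping_self_consistency(
    lambda: next(noisy), window=3, max_samples=6
)
```

In the first run the second window is unanimous, so only 6 of the 12 budgeted samples are drawn; the second run never converges and uses the full budget.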
+### What Is Overkill
+- Training a Generalized Correctness Model across 10+ LLMs is expensive and data-hungry.
+- Formal verification frameworks (VeriGuard) are essential only for safety-critical applications.
+### What Is Missing
+- No verifier that can estimate its *own* reliability per task type and adjust thresholds.
+- No framework for "verifier cascading" (cheap verifier first, expensive one only on disagreement).
+---
+## 5. Early Exit & Failure Detection
+### What Exists
+| Paper | Key Result | Practicality |
+|-------|-----------|--------------|
+| **VLAA-GUI** (2026) | Modular framework for GUI agent stopping/loop breaking | ★★★★★ Modular |
+| **LYNX** (2025) | Hidden-state early-exit for reasoning | ★★★★☆ Requires model access |
+| **SpecExit** (2025) | Speculative exit for reasoning models | ★★★★☆ Reduces generation length |
+| **FAMA** (2026) | Failure-aware meta-agent, +4.6-11.6% on τ-bench | ★★★★☆ Failure clustering |
+| **Confidence Dichotomy** (2026) | Tool-use agents have task-specific calibration | ★★★★☆ RL calibration |
+### What Is Useful
+- **Doom detection via signal aggregation** (repeated failures, cost explosion, stagnant progress) is the practical approach.
+- **FAMA's failure clustering** identifies dominant error patterns for targeted recovery.
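The signal-aggregation approach to doom detection can be sketched as a few threshold checks combined by a simple vote. Every field name, threshold, and the "at least two signals" rule below is a hypothetical choice for illustration, not a value taken from any of the papers above.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class RunState:
    consecutive_failures: int   # tool or step failures in a row
    cost_so_far: float
    cost_budget: float
    steps_since_progress: int   # steps without a new sub-goal completed


def doom_signals(
    state: RunState,
    max_failures: int = 3,
    cost_ratio: float = 0.8,
    max_stagnant_steps: int = 5,
) -> Dict[str, bool]:
    """Evaluate each signal independently against its (assumed) threshold."""
    return {
        "repeated_failures": state.consecutive_failures >= max_failures,
        "cost_explosion": state.cost_so_far >= cost_ratio * state.cost_budget,
        "stagnant_progress": state.steps_since_progress >= max_stagnant_steps,
    }


def should_terminate(state: RunState, min_signals: int = 2) -> Tuple[bool, List[str]]:
    """Terminate when at least `min_signals` signals fire; also report which ones."""
    fired = [name for name, on in doom_signals(state).items() if on]
    return len(fired) >= min_signals, fired


healthy = should_terminate(RunState(0, 0.10, 1.0, 1))
doomed = should_terminate(RunState(4, 0.90, 1.0, 2))
```

Requiring more than one signal is a crude guard against false stops; returning the fired signal names makes each decision easy to log and audit.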
+### What Is Overkill
+- Hidden-state early exit requires access to model weights, which API-only agents do not have.
+- Speculative exit requires model architecture changes.
+### What Is Missing
+- No "run health score" that combines all signals into a single terminate/continue decision with calibrated confidence.
+- No online learning from false-stop vs. false-continue outcomes.
+---
+## 6. Test-Time Compute Allocation
+### What Exists
+| Paper | Key Result | Practicality |
+|-------|-----------|--------------|
+| **Trust but Verify Survey** (2025) | Comprehensive taxonomy of TTS methods | ★★★★★ Reference |
+| **PAG** (2025) | Policy-as-Verifier for multi-turn self-correction | ★★★★☆ RL framework |
+| **Compute-Optimal Scaling** (various) | Best-of-N and PRM for reasoning | ★★★★☆ Math-focused |
+| **Graph-Based Self-Healing Tool Routing** (2026) | Cost-weighted graph routing as compute allocation | ★★★★★ Practical |
+### What Is Useful
+- **Self-consistency sampling with early stopping** (ESC) is the most practical test-time scaling optimization.
+- **Process Reward Models** are powerful but require training data.
+### What Is Overkill
+- Full MCTS search over reasoning paths is too expensive for most agent tasks.
+### What Is Missing
+- No adaptive compute allocator that distributes budget across: routing, tool calls, verification, and model strength.
+---
+## 7. Meta-Tool & Workflow Compression
+### What Exists
+| Paper | Key Result | Practicality |
+|-------|-----------|--------------|
+| **AWO** (2026) | State-graph merging, hot-path extraction | ★★★★★ No training needed |
+| **Agent-as-Tool** (2026) | Unified parallel orchestration | ★★★★☆ Standardized action space |
+| **WebClipper** (2026) | Graph-based trajectory pruning for web agents | ★★★★☆ Web-specific |
+### What Is Useful
+- **AWO's trace → state graph → hot path → meta-tool** pipeline is directly applicable.
+- Meta-tools eliminate LLM reasoning for known sub-routines.
+### What Is Overkill
+- Full workflow synthesis from scratch is complex; incremental mining from traces is better.
+### What Is Missing
+- No "meta-tool validation" step that checks whether a compressed workflow still handles edge cases.
+- No A/B testing framework for meta-tool vs. original LLM-based execution.
+---
+## 8. Cost-Quality Frontiers
+### What Exists
+| Paper | Key Result | Practicality |
+|-------|-----------|--------------|
+| **RouterBench** (2024) | Systematic cost-quality evaluation framework | ★★★★★ Standard |
+| **R2-Bench** (2026) | Length-constrained cost-quality benchmark | ★★★★☆ Extends RouterBench |
+| **BAAR** (2026) | Pareto frontier on ALFWorld, SciWorld, AppWorld | ★★★★☆ Interactive agents |
+### What Is Useful
+- RouterBench provides the evaluation protocol.
+- Pareto frontier plotting (cost vs. accuracy) is the correct way to compare systems.
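Computing the cost-accuracy Pareto frontier is a short sort-and-scan. The sketch below keeps a system only if it is strictly more accurate than every system that costs the same or less; the system names and numbers are made up purely for illustration.

```python
from typing import List, Tuple

System = Tuple[str, float, float]  # (name, cost, accuracy)


def pareto_frontier(systems: List[System]) -> List[System]:
    """Return non-dominated systems, ordered by increasing cost."""
    # Sort by cost ascending; at equal cost, put the most accurate first.
    ordered = sorted(systems, key=lambda s: (s[1], -s[2]))
    frontier: List[System] = []
    best_acc = float("-inf")
    for name, cost, acc in ordered:
        # Keep a system only if it beats everything cheaper or equally priced.
        if acc > best_acc:
            frontier.append((name, cost, acc))
            best_acc = acc
    return frontier


# Illustrative, made-up (cost, accuracy) points.
systems = [
    ("cheap-only", 1.0, 0.70),
    ("routed", 3.0, 0.80),
    ("wasteful", 4.0, 0.75),
    ("frontier-only", 10.0, 0.90),
]
frontier = pareto_frontier(systems)
```

Here `wasteful` is dropped because `routed` is both cheaper and more accurate; the remaining points are the ones worth plotting and comparing.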
+### What Is Missing
+- No benchmark that measures cost-quality frontier for *compound* optimizations simultaneously.
+---
+## 9. Confidence Calibration
+### What Exists
+| Paper | Key Result | Practicality |
+|-------|-----------|--------------|
+| **Self-Calibration** (2025) | Distills self-consistency into model confidence | ★★★★☆ SFT required |
+| **Agentic Confidence Calibration** (2026) | Trajectory-level calibration | ★★★★☆ Multi-agent |
+| **Black-Box Reliability** (2026) | Self-consistency + conformal calibration | ★★★★★ Distribution-free guarantees |
+### What Is Useful
+- Self-consistency-based confidence is the most reliable signal for black-box APIs.
+- Calibration enables better routing, early stopping, and verifier gating.
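For a black-box API, a self-consistency confidence is simply the share of sampled answers that agree with the majority, and its calibration can be checked with expected calibration error (ECE). The sketch below uses equal-width bins over [0, 1] with a confidence of exactly 1.0 placed in the last bin; the binning convention and function names are assumptions for illustration.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def self_consistency_confidence(answers: Sequence[str]) -> Tuple[str, float]:
    """Majority answer and the fraction of samples that agree with it."""
    top, count = Counter(answers).most_common(1)[0]
    return top, count / len(answers)


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    n = len(confidences)
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(1 for _, ok in members if ok) / len(members)
            ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


answer, confidence = self_consistency_confidence(["a", "a", "b", "a"])
# Two predictions at 0.9 confidence, only one correct: overconfident by 0.4.
ece = expected_calibration_error([0.9, 0.9], [True, False])
```

A low ECE means the confidence can be thresholded directly for routing, early stopping, or deciding when to invoke a verifier.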
+### What Is Missing
+- No calibration method specifically for multi-step agent traces with tool outcomes.
+---
+## 10. Retrieval Gating & Context Selection
+### What Exists
+| Paper | Key Result | Practicality |
+|-------|-----------|--------------|
+| **CacheBlend** (2024) | Selective KV recompute for RAG | ★★★★☆ RAG-specific |
+| **DynamicKV** (2024) | Task-aware KV compression | ★★★★☆ Long-context |
+| **CompressKV** (2025) | Semantic retrieval heads for token importance | ★★★☆☆ Research |
+### What Is Useful
+- **Context budgeters** that select which chunks to include based on predicted relevance are practical.
+- **H2O** reports that retaining about 20% of the KV cache (the heavy-hitter tokens) is enough to maintain accuracy on its evaluated tasks.
+### What Is Missing
+- No "learned context selector" that is trained end-to-end with the router to maximize task success per token.
+---
+## Recommendations for Implementation
+### Phase 1 (Immediate)
+1. **Deploy FrugalGPT cascade** — 50-98% cost reduction, simple scoring model
+2. **Enable prefix caching** — Free optimization for repeated system/tool prompts
+3. **Replace self-consistency with ESC**: 33-84% sampling reduction at comparable accuracy
+### Phase 2 (Short-term)
+4. **Train self-calibration model** — Enables confidence-based routing and early stopping
+5. **Implement AWO meta-tools** — Collect 100+ traces, extract hot paths
+6. **Build heuristic verifier budgeter** — Risk-weighted selective verification
+### Phase 3 (Medium-term)
+7. **Deploy BAAR step-level routing** — GRPO-trained router for multi-turn agents
+8. **Add self-healing tool graph** — Dijkstra routing for API-heavy agents
+9. **Implement doom detector** — Multi-signal early termination
+### Phase 4 (Long-term)
+10. **Train unified compound optimizer** — Jointly optimize all 10 dimensions
+11. **Online learning from traces** — Update policies based on real deployment outcomes
+12. **Cross-agent cache sharing** — KVCOMM-style sharing for multi-agent systems
+---
+## Key Gaps & Opportunities
+| Gap | Opportunity |
+|-----|-------------|
+| No unified compound optimizer | **Agent Cost Optimizer** fills this gap |
+| No benchmark for compound optimization | Create AgentCostBench |
+| No online learning for routing | Deploy Thompson sampling / contextual bandits |
+| No verifier cascading | Build cheap → expensive verifier chains |
+| No cache budgeter | Learn which prefixes to cache |
+| No meta-tool validation | A/B test compressed vs. original workflows |
+| No calibration for multi-step agent traces with tool outcomes | Extend Self-Calibration to multi-step traces |