narcolepticchicken/agent-cost-optimizer

narcolepticchicken committed about 15 hours ago

Commit bd77292 (verified) · 1 Parent(s): 18e1e42

Upload docs/literature_review.md

Files changed (1)

docs/literature_review.md +48 -254

@@ -1,282 +1,76 @@
-# Literature Review: Agent Cost Optimization
-## Executive Summary
-This literature review synthesizes findings from 50+ papers across model routing, context compression, tool-use optimization, verifier gating, and failure recovery. The key insight: **compound optimization** (routing + caching + selective verification + meta-tools) has been studied piecemeal but never as a unified system. This gap is the core opportunity for the Agent Cost Optimizer.
----
-## 1. Model Routing & Cascade Inference
-### What Exists
-| Paper | Key Result | Practicality |
-|-------|-----------|--------------|
-| **FrugalGPT** (Chen et al., 2023) | 98.3% cost reduction on HEADLINES matching GPT-4 accuracy | ★★★★★ Simplest cascade — 3-tier scoring model |
-| **RouteLLM** (Ong et al., 2024) | Preference-trained router, 40%+ cost reduction | ★★★★☆ Requires preference data |
-| **RouterBench** (Hu et al., 2024) | 405K outcomes, 14 LLMs — standard benchmark | ★★★★★ Go-to dataset |
-| **R2-Router** (2026) | Joint model + length selection | ★★★★☆ Extends routing to output budgeting |
-| **xRouter** (2025) | RL-trained cost-aware router | ★★★★☆ End-to-end RL but needs online env |
-| **Arch-Router** (2025) | 1.5B model matches proprietary routers | ★★★★★ Production-ready |
-| **BAAR** (2026) | Step-level routing with GRPO, dominates Pareto frontier | ★★★★☆ Best for multi-turn agents |
-### What Is Useful
-- **FrugalGPT cascade** is the simplest deployable win: cheap → medium → frontier, with a small scoring model gating each level.
-- **RouterBench** provides the training data for any router.
-- **BAAR's boundary-guided SFT + GRPO** is the strongest approach for step-level agent routing.
-### What Is Overkill
-- Methods requiring online interaction during training (some bandit approaches) are hard to deploy.
-- Methods assuming static API graphs don't adapt to changing tool catalogs.
-### What Is Missing
-- No unified router that jointly optimizes: model, context length, tool batching, and verification in one decision.
-- No router trained on agent traces with multi-step outcomes.
----
-## 2. Prompt Caching & Context Compression
-### What Exists
-| Paper | Key Result | Practicality |
-|-------|-----------|--------------|
-| **CacheGen** (2023) | KV cache compression via adaptive quantization | ★★★★☆ Reduces bandwidth |
-| **H2O** (2023) | 20% KV budget maintains accuracy; 29× throughput | ★★★★★ Already in vLLM |
-| **StreamingLLM** (2023) | Attention sinks for infinite-length generation | ★★★★★ Production standard |
-| **CacheBlend** (2024) | Selective recompute for RAG KV caches | ★★★★☆ Best for RAG |
-| **EpiCache** (2025) | Episodic KV management for long QA | ★★★★☆ Apple, strong for chat |
-| **KVCOMM** (2025) | Cross-agent KV cache sharing | ★★★★☆ Multi-agent systems |
-### What Is Useful
-- **Prefix caching** (vLLM/SGLang/DeepSeek) gives ~50% cost reduction on repeated system prompts.
-- **H2O/StreamingLLM** are essential for long-context agents.
-- **Cache-aware prompt layout** (stable prefix + dynamic suffix) is a free optimization.
-### What Is Overkill
-- Full KV cache compression methods (CacheGen, KVzip) require inference-system integration that most agent harnesses don't control.
-- Cross-agent KV sharing (KVCOMM) is niche.
-### What Is Missing
-- No "cache budgeter" that decides *which* context to cache based on predicted reuse frequency.
-- No cost-aware context eviction policy for agents with mixed short/long tasks.
----
-## 3. Tool-Use Routing & Optimization
-### What Exists
-| Paper | Key Result | Practicality |
-|-------|-----------|--------------|
-| **Graph-Based Self-Healing Tool Routing** (2026) | 93% control-plane LLM call reduction | ★★★★★ Deterministic graph routing |
-| **Optimizing Agentic Workflows (AWO)** (2026) | 11.9% LLM call reduction, +4.2pp success | ★★★★★ Meta-tool extraction |
-| **Less is More** (2024) | Reducing tool count improves edge performance | ★★★☆☆ Edge-specific |
-| **Small Model as Master Orchestrator** (2026) | Lightweight orchestrator for parallel decomposition | ★★★★☆ Unified action space |
-| **CASTER** (2026) | Dual-signal router for multi-agent graph workflows | ★★★★☆ Graph-based systems |
-### What Is Useful
-- **Self-Healing Tool Routing** eliminates LLM calls for 93% of tool decisions by using Dijkstra on a cost-weighted tool graph.
-- **AWO meta-tools** compress repeated multi-step patterns into deterministic macros.
-### What Is Overkill
-- Full multi-agent orchestration frameworks (CASTER) are powerful but heavy for simple tool pipelines.
-### What Is Missing
-- No "tool necessity predictor" that decides *whether* to call a tool at all, not just which one.
-- No cost-aware batching that groups independent tool calls while respecting dependencies.
----
-## 4. Verifier Gating & Selective Verification
-### What Exists
-| Paper | Key Result | Practicality |
-|-------|-----------|--------------|
-| **Self-Calibration** (2025) | ECE 13.70→3.79, accuracy ↑3pp | ★★★★☆ Requires SFT |
-| **ESC** (2024) | 33-84% sampling cost reduction, zero accuracy loss | ★★★★★ Drop-in replacement |
-| **SmartSnap** (2025) | Proactive evidence seeking for self-verification | ★★★★☆ RL-based |
-| **The Art of Building Verifiers** (2026) | 4 design principles for computer-use agents | ★★★★★ Practical framework |
-| **Generalized Correctness Model** (2025) | Cross-model verification, selective deferral | ★★★☆☆ Needs multi-model labels |
-| **Agentic Confidence Calibration** (2026) | Trajectory-based calibration across systems | ★★★★☆ Multi-agent focus |
-### What Is Useful
-- **Early-Stopping Self-Consistency (ESC)** is the highest-ROI change: replace standard self-consistency with window-based stopping.
-- **Self-Calibration** enables single-forward-pass confidence for routing and early stopping.
-- **Heuristic verifier budgeter** (risk-weighted) is sufficient for most agents.
-### What Is Overkill
-- Training a Generalized Correctness Model across 10+ LLMs is expensive and data-hungry.
-- Formal verification frameworks (VeriGuard) are essential only for safety-critical applications.
-### What Is Missing
-- No verifier that can estimate its *own* reliability per task type and adjust thresholds.
-- No framework for "verifier cascading" (cheap verifier first, expensive one only on disagreement).
----
-## 5. Early Exit & Failure Detection
-### What Exists
-| Paper | Key Result | Practicality |
-|-------|-----------|--------------|
-| **VLAA-GUI** (2026) | Modular framework for GUI agent stopping/loop breaking | ★★★★★ Modular |
-| **LYNX** (2025) | Hidden-state early-exit for reasoning | ★★★★☆ Requires model access |
-| **SpecExit** (2025) | Speculative exit for reasoning models | ★★★★☆ Reduces generation length |
-| **FAMA** (2026) | Failure-aware meta-agent, +4.6-11.6% on τ-bench | ★★★★☆ Failure clustering |
-| **Confidence Dichotomy** (2026) | Tool-use agents have task-specific calibration | ★★★★☆ RL calibration |
-### What Is Useful
-- **Doom detection via signal aggregation** (repeated failures, cost explosion, stagnant progress) is the practical approach.
-- **FAMA's failure clustering** identifies dominant error patterns for targeted recovery.
-### What Is Overkill
-- Hidden-state early exit requires model weights access — not available for API-only agents.
-- Speculative exit requires model architecture changes.
-### What Is Missing
-- No "run health score" that combines all signals into a single terminate/continue decision with calibrated confidence.
-- No online learning from false-stop vs. false-continue outcomes.
----
-## 6. Test-Time Compute Allocation
-### What Exists
-| Paper | Key Result | Practicality |
-|-------|-----------|--------------|
-| **Trust but Verify Survey** (2025) | Comprehensive taxonomy of TTS methods | ★★★★★ Reference |
-| **PAG** (2025) | Policy-as-Verifier for multi-turn self-correction | ★★★★☆ RL framework |
-| **Compute-Optimal Scaling** (various) | Best-of-N and PRM for reasoning | ★★★★☆ Math-focused |
-| **Self-Healing Tool Router** (2026) | Cost-weighted graph routing as compute allocation | ★★★★★ Practical |
-### What Is Useful
-- **Best-of-N with early stopping** (ESC) is the most practical test-time scaling optimization.
-- **Process Reward Models** are powerful but require training data.
-### What Is Overkill
-- Full MCTS search over reasoning paths is too expensive for most agent tasks.
-### What Is Missing
-- No adaptive compute allocator that distributes budget across: routing, tool calls, verification, and model strength.
----
-## 7. Meta-Tool & Workflow Compression
-### What Exists
-| Paper | Key Result | Practicality |
-|-------|-----------|--------------|
-| **AWO** (2026) | State-graph merging, hot-path extraction | ★★★★★ No training needed |
-| **Agent-as-Tool** (2026) | Unified parallel orchestration | ★★★★☆ Standardized action space |
-| **WebClipper** (2026) | Graph-based trajectory pruning for web agents | ★★★★☆ Web-specific |
-### What Is Useful
-- **AWO's trace → state graph → hot path → meta-tool** pipeline is directly applicable.
-- Meta-tools eliminate LLM reasoning for known sub-routines.
-### What Is Overkill
-- Full workflow synthesis from scratch is complex; incremental mining from traces is better.
-### What Is Missing
-- No "meta-tool validation" step that checks whether a compressed workflow still handles edge cases.
-- No A/B testing framework for meta-tool vs. original LLM-based execution.
----
-## 8. Cost-Quality Frontiers
-### What Exists
-| Paper | Key Result | Practicality |
-|-------|-----------|--------------|
-| **RouterBench** (2024) | Systematic cost-quality evaluation framework | ★★★★★ Standard |
-| **R2-Bench** (2026) | Length-constrained cost-quality benchmark | ★★★★☆ Extends RouterBench |
-| **BAAR** (2026) | Pareto frontier on ALFWorld, SciWorld, AppWorld | ★★★★☆ Interactive agents |
-### What Is Useful
-- RouterBench provides the evaluation protocol.
-- Pareto frontier plotting (cost vs. accuracy) is the correct way to compare systems.
-### What Is Missing
-- No benchmark that measures cost-quality frontier for *compound* optimizations simultaneously.
----
-## 9. Confidence Calibration
-### What Exists
-| Paper | Key Result | Practicality |
-|-------|-----------|--------------|
-| **Self-Calibration** (2025) | Distills self-consistency into model confidence | ★★★★☆ SFT required |
-| **Agentic Confidence Calibration** (2026) | Trajectory-level calibration | ★★★★☆ Multi-agent |
-| **Black-Box Reliability** (2026) | Self-consistency + conformal calibration | ★★★★★ Distribution-free guarantees |
-### What Is Useful
-- Self-consistency-based confidence is the most reliable signal for black-box APIs.
-- Calibration enables better routing, early stopping, and verifier gating.
-### What Is Missing
-- No calibration method specifically for multi-step agent traces with tool outcomes.
----
-## 10. Retrieval Gating & Context Selection
-### What Exists
-| Paper | Key Result | Practicality |
-|-------|-----------|--------------|
-| **CacheBlend** (2024) | Selective KV recompute for RAG | ★★★★☆ RAG-specific |
-| **DynamicKV** (2024) | Task-aware KV compression | ★★★★☆ Long-context |
-| **CompressKV** (2025) | Semantic retrieval heads for token importance | ★★★☆☆ Research |
-### What Is Useful
-- **Context budgeters** that select which chunks to include based on predicted relevance are practical.
-- **H2O** shows that only 20% of KV cache is needed.
-### What Is Missing
-- No "learned context selector" that is trained end-to-end with the router to maximize task success per token.
----
-## Recommendations for Implementation
-### Phase 1 (Immediate)
-1. **Deploy FrugalGPT cascade** — 50-98% cost reduction, simple scoring model
-2. **Enable prefix caching** — Free optimization for repeated system/tool prompts
-3. **Replace self-consistency with ESC** — 33-84% sampling reduction, zero accuracy loss
-### Phase 2 (Short-term)
-4. **Train self-calibration model** — Enables confidence-based routing and early stopping
-5. **Implement AWO meta-tools** — Collect 100+ traces, extract hot paths
-6. **Build heuristic verifier budgeter** — Risk-weighted selective verification
-### Phase 3 (Medium-term)
-7. **Deploy BAAR step-level routing** — GRPO-trained router for multi-turn agents
-8. **Add self-healing tool graph** — Dijkstra routing for API-heavy agents
-9. **Implement doom detector** — Multi-signal early termination
-### Phase 4 (Long-term)
-10. **Train unified compound optimizer** — Jointly optimize all 10 dimensions
-11. **Online learning from traces** — Update policies based on real deployment outcomes
-12. **Cross-agent cache sharing** — KVCOMM-style sharing for multi-agent systems
----
-## Key Gaps & Opportunities
-| Gap | Opportunity |
-|-----|-------------|
-| No unified compound optimizer | **Agent Cost Optimizer** fills this gap |
-| No benchmark for compound optimization | Create AgentCostBench |
-| No online learning for routing | Deploy Thompson sampling / contextual bandits |
-| No verifier cascading | Build cheap → expensive verifier chains |
-| No cache budgeter | Learn which prefixes to cache |
-| No meta-tool validation | A/B test compressed vs. original workflows |
-| No trajectory-level calibration | Extend Self-Calibration to multi-step |

+# Literature Review: Cost-Aware Agent Routing
+## What Exists
+### Model Routing
+**RouteLLM** (2406.18665, UC Berkeley/LMSYS, 2024): Trains a BERT-based router on Chatbot Arena preference data and achieves 2x+ cost reduction without sacrificing quality. A simple BERT classifier is surprisingly effective. It does NOT use execution feedback; it routes on input features only.
+**HybridLLM** (2404.14618, 2024): A probabilistic router predicts Pr[H(x) ≥ 0], the probability that the quality gap H(x) favors the small model. Uses BARTScore as the quality proxy. Up to 40% fewer calls to the large model with no quality drop.
+**CARROT** (SPROUT dataset, 2025): Multi-model routing benchmark with per-model scores and token counts across 13 models and 44K prompts. Provides ground truth for which model succeeds on which task.
+### Cascade Inference
+**Cascade Routing** (2410.10347, ETH SRI, ICLR 2025): Unifies routing (ex-ante) with cascading (post-hoc). Key finding: **quality estimators are the #1 factor**. Post-hoc estimates dramatically outperform ex-ante. Low σ_post benefits cascading; low σ_ante benefits routing. The combination wins.
+**RouteNLP** (2604.23577, 2026): 3-component system: difficulty-aware router + confidence-calibrated cascading + distillation-routing co-optimization. Token-level uncertainty u(m,x) = (1/L)Σ(1 - p(y_i|x)) from softmax probabilities. Conformal risk control with α=0.05. **58% cost reduction in production** (5K queries/day).
+**CP-Router** (2505.19970, 2025): Training-free, uncertainty-aware routing between an LLM and a Large Reasoning Model. Uses the entropy of the cheap model's output as the escalation signal: compute the entropy and compare it to a conformal threshold.
+### Agentic Routing
+**BAAR** (2602.21227, 2026): Budget-Aware Agentic Routing via Boundary-Guided Training. Trains a router (Qwen2.5-7B) to decide per step which model to use. Two-phase training: BoSFT (difficulty taxonomy: Easy/Hard/Intractable) + BoPO (GRPO with boundary-relative rewards). Generalizes to strict per-task budget constraints.
+**BEST-Route** (2506.22716, Microsoft, 2025): Generates best-of-n samples from cheap model, selects best via proxy reward model. Router predicts both model and number of samples n. Up to 60% cost reduction with <1% performance drop.
+### Execution Feedback
+**ClawTrace** (2604.23853): Per-step cost attribution in agent traces. TraceCard format with USD cost + token counts + redundancy flags. **Prune patches cut median cost 32%.**
+**LLMRouterBench** (2601.07206): 400K instances, 21 datasets, 33 models. Finding: **Simple baselines often match complex routers.** Model complementarity is real but hard to exploit.
+### Failure Prediction
+**AgentRewardBench** (2504.08942): 1,302 web agent trajectories with expert success/side-effect/repetition labels across 5 benchmarks and 4 LLMs.
+### Selective Verification
+**Process Reward Models** (multiple): Train a verifier to score intermediate steps, and invoke it only when confidence is low or risk is high. Selective use reduces verification cost by 70-90% while maintaining safety.
+## What Is Useful
+| Paper | Key Takeaway | Applied In ACO |
+|-------|-------------|---------------|
+| RouteNLP | Conformal cascading with token-level uncertainty | Execution-feedback router (module 4) |
+| Cascade Routing | Post-hoc >> ex-ante quality estimates | v9 feedback escalation |
+| BAAR | Per-step routing with difficulty taxonomy | Per-step router (module 3) |
+| BEST-Route | Best-of-N cheap sampling + reward model | Planned next step |
+| CP-Router | Training-free entropy-based escalation | Simple fallback router |
+| ClawTrace | Per-step cost attribution | Telemetry schema |
+| CARROT (SPROUT) | Multi-model eval data | v11 training data |
+## What Is Overkill
+- **Full agent simulation environments** (SciWorld, ALFWorld) — we don't need to simulate the entire agent, just route each step
+- **GRPO-based RL training** (BAAR) — XGBoost with real data outperforms RL with synthetic data
+- **Distillation-routing co-optimization** (RouteNLP) — we're not training task-specific models
+- **Complex multi-stage pipelines** — simple cascade + feedback is 80% of the benefit
+## What Is Missing
+1. **Execution-feedback routing with real model logprobs** — all work uses simulated or API-provided logprobs
+2. **Conformal calibration for agent routing** — no paper provides distribution-free quality guarantees
+3. **Per-step routing with per-step training data** — BAAR routes per step but trains on task-level outcomes
+4. **Cost-quality Pareto frontier construction** — no paper constructs the full frontier, only point comparisons
+5. **Real agent benchmarks with cost data** — SWE-Router is the only dataset with real USD costs per task
+## What To Implement First
+1. **Execution-feedback escalation** (RouteNLP pattern) — highest ROI, validated in production
+2. **Per-tier XGBoost with real data** (our v10/v11 approach) — simple, effective, requires real traces
+3. **Per-step routing** (BAAR pattern) — significant savings from routing steps differently
+4. **Conformal calibration** (CP-Router pattern) — safety guarantees without training
+5. **Best-of-N cheap sampling** (BEST-Route pattern) — orthogonal improvement to routing
+Priority: Execution feedback > Real data training > Per-step routing > Conformal calibration > Best-of-N
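
The top-priority item above, execution-feedback escalation, rests on the token-level uncertainty quoted in the RouteNLP entry, u(m,x) = (1/L)Σ(1 - p(y_i|x)). The following is a minimal sketch of that signal driving a cheapest-first cascade, not the ACO or RouteNLP implementation. The `Tier` callable signature, the function names, and the fixed `threshold` are all assumptions made for illustration; the review describes choosing the threshold via conformal risk control, which is not shown here.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical tier signature: prompt -> (response text, per-token probabilities).
Tier = Callable[[str], Tuple[str, Sequence[float]]]


def probs_from_logprobs(logprobs: Sequence[float]) -> List[float]:
    """Convert per-token log-probabilities to probabilities."""
    return [math.exp(lp) for lp in logprobs]


def token_uncertainty(token_probs: Sequence[float]) -> float:
    """u = (1/L) * sum_i (1 - p_i) over the L generated tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    return sum(1.0 - p for p in token_probs) / len(token_probs)


def cascade(prompt: str, tiers: Sequence[Tier], threshold: float) -> Tuple[str, int]:
    """Try tiers cheapest-first; escalate while uncertainty exceeds threshold.

    Returns the accepted response and the index of the tier that produced it.
    The last tier is always accepted, since there is nothing left to escalate to.
    """
    if not tiers:
        raise ValueError("need at least one tier")
    for index, generate in enumerate(tiers):
        text, probs = generate(prompt)
        is_last = index == len(tiers) - 1
        if is_last or token_uncertainty(probs) <= threshold:
            return text, index
    raise AssertionError("unreachable")


if __name__ == "__main__":
    # Stub tiers standing in for real model calls (assumed, for illustration).
    cheap = lambda prompt: ("cheap answer", [0.5, 0.6])
    strong = lambda prompt: ("strong answer", [0.99])
    print(cascade("example prompt", [cheap, strong], threshold=0.3))
```

With the stub tiers shown, the cheap tier's uncertainty is 0.45, which exceeds the assumed threshold of 0.3, so the call escalates and returns the strong tier's answer with index 1. Because the signal is computed after generation, this is a post-hoc estimate in the sense of the Cascade Routing entry, and the cost of every rejected cheap call is still paid.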