# Agent Cost Optimizer — Roadmap

## Current Status (v1.0)

- ✅ 10 core modules implemented and benchmarked
- ✅ 28% cost reduction at iso-quality (94.3% success rate)
- ✅ Synthetic benchmark (2K traces, 19 scenarios)
- ✅ Learned router skeleton (trainable, not yet trained on real data)
- ✅ Deployment guide, model card, technical report
- ✅ Gradio dashboard code (not yet deployed)

---

## Phase 1: Learned Router (Immediate Priority)

**Goal:** Replace the heuristic router with a classifier trained on real traces.

### Why This Is #1

The ablation study shows the model router is the most critical module. A trained classifier could:

- Increase savings from 28% to 35–40%
- Reduce false escalations by 50%
- Enable task-specific routing (code → Claude, reasoning → o3-mini)

### Implementation

1. Collect 10K+ real traces with full telemetry
2. Extract (request_features, optimal_tier) pairs
3. Train a simple logistic-regression or small neural classifier
4. Alternatively, train with GRPO using a cost-adjusted reward (BAAR-style boundary-guided routing)
5. A/B test against the heuristic router
6. Fall back to the heuristic router when confidence < 0.7

Steps 2–3 and step 6 are sketched in the two code blocks below.

**Estimated effort:** 2–3 days

**Expected impact:** +7–12pp cost savings, <1pp quality regression
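
A minimal sketch of steps 2–3, assuming each trace is a dict whose feature fields (`prompt_tokens`, `n_tools`, `task_type_id`) are illustrative stand-ins for the real telemetry schema, with `optimal_tier` labeling the cheapest tier that succeeded; scikit-learn stands in for whatever training stack is ultimately chosen:

```python
# Sketch of Phase 1, steps 2-3: turn raw traces into
# (request_features, optimal_tier) pairs and fit a logistic-regression
# router. Field names are illustrative, not the real trace schema.
import numpy as np
from sklearn.linear_model import LogisticRegression

def extract_features(trace: dict) -> list[float]:
    """Map one telemetry trace to a fixed-length feature vector."""
    return [
        float(trace["prompt_tokens"]),  # request size
        float(trace["n_tools"]),        # tools available to the agent
        float(trace["task_type_id"]),   # coarse task category
    ]

def train_router(traces: list[dict]) -> LogisticRegression:
    X = np.array([extract_features(t) for t in traces])
    y = np.array([t["optimal_tier"] for t in traces])  # cheapest tier that succeeded
    return LogisticRegression(max_iter=1000).fit(X, y)
```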
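
And a sketch of the step 6 fallback. It reuses `extract_features` from the block above; `heuristic_route` is a placeholder for the existing rule-based router, whose interface is assumed here:

```python
# Sketch of Phase 1, step 6: trust the classifier only when it is
# confident, otherwise fall back to the known-good heuristic router.
CONFIDENCE_FLOOR = 0.7

def route(clf, trace: dict, heuristic_route) -> int:
    """Return a model tier for this request."""
    probs = clf.predict_proba([extract_features(trace)])[0]
    best = probs.argmax()
    if probs[best] >= CONFIDENCE_FLOOR:
        return int(clf.classes_[best])
    return heuristic_route(trace)  # low confidence: defer to the heuristic
```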

---

## Phase 2: Real Interactive Benchmark

**Goal:** Evaluate ACO against real agent tasks with actual model calls.

### Why Synthetic Is Not Enough

Our synthetic benchmark assumes fixed success probabilities per tier. Real-world model behavior:

- Is non-stationary (models improve, new models release)
- Depends on prompt engineering, not just model strength
- Has provider-specific quirks (Claude vs GPT vs DeepSeek)
- Is affected by rate limits, timeouts, and transient failures

### Implementation

1. **Coding benchmark:** Integrate with SWE-bench Lite (500 tasks)
   - Run with a cheap model first, escalate on failure
   - Measure: pass@1, LLM calls, cost, time
2. **Tool-use benchmark:** Integrate with BFCL (2,000 function-calling tasks)
   - Measure: tool accuracy, missed tools, cost
3. **Research benchmark:** 100 real research questions
   - Run retrieval + cheap model vs retrieval + frontier model
   - Human evaluation: source quality, hallucination, coverage
4. **Long-horizon benchmark:** 50 multi-step tasks (WebArena-style)
   - Measure: task completion, cost growth over steps, cache hit rate

**Estimated effort:** 1 week

**Expected impact:** Calibrate all module thresholds, discover edge cases

---

## Phase 3: Online Learning Loop

**Goal:** Update routing probabilities, tool gate thresholds, and doom detector thresholds from live trace feedback.

### Why Static Policies Fail

- Model capabilities improve (GPT-4o → GPT-5)
- New cheap models release (GPT-4o-mini → even cheaper)
- Task mix changes over time
- User behavior shifts

### Implementation

1. **Trace ingestion pipeline:** Collect traces from production runs
2. **Outcome labeling:** Success/failure/escalation labels from user feedback
3. **Online update:** Update router classifier weights weekly
4. **Thompson sampling:** Explore new routing decisions with small probability
5. **Drift detection:** Alert when success rate drops >5pp for a task type

**Estimated effort:** 1 week

**Expected impact:** Maintains 28%+ savings as models/task mix evolve

---

## Phase 4: Verifier Cascading

**Goal:** Use a cheap verifier first; escalate to the expensive verifier only on disagreement.

### Current State

- The verifier budgeter decides WHETHER to verify
- When it decides yes, it always uses the same verifier model

### Improvement

- **Tier 1:** Simple regex/rule-based checks (free)
- **Tier 2:** Cheap model verifier (GPT-4o-mini, $0.15/M tok)
- **Tier 3:** Expensive verifier (GPT-4o, $2.50/M tok) — only when tier 2 flags an issue
- **Consensus mode:** Run cheap + medium verifiers; escalate if they disagree

The cascade is sketched in the code block below.

**Estimated impact:** 60–80% verifier cost reduction on low-risk tasks
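
A minimal sketch of the cascade, covering both the plain tiered path and consensus mode; each `verify_*` callable is a placeholder for the corresponding tier implementation and returns a pass/fail verdict:

```python
# Sketch of the Phase 4 verifier cascade. The verify_* callables are
# placeholders for the real tier implementations (regex rules,
# GPT-4o-mini, a mid-tier model, GPT-4o).
from typing import Callable

Verifier = Callable[[str], bool]  # output text -> pass/fail verdict

def cascade_verify(output: str,
                   verify_rules: Verifier,      # tier 1: regex/rule checks (free)
                   verify_cheap: Verifier,      # tier 2: cheap model verifier
                   verify_medium: Verifier,     # mid-tier model, consensus mode only
                   verify_expensive: Verifier,  # tier 3: expensive verifier
                   consensus: bool = False) -> bool:
    # Tier 1: free rule-based checks; a hard failure never escalates.
    if not verify_rules(output):
        return False
    cheap_verdict = verify_cheap(output)
    if consensus:
        # Consensus mode: agreement between cheap and medium verdicts is final.
        if cheap_verdict == verify_medium(output):
            return cheap_verdict
        return verify_expensive(output)  # disagreement: escalate to tier 3
    if cheap_verdict:
        return True
    # Tier 2 flagged an issue: confirm with the tier-3 verifier.
    return verify_expensive(output)
```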

---

## Phase 5: Cross-Provider Cost Optimization

**Goal:** Route to the cheapest provider offering an adequate model tier.

### Providers with Similar-Tier Models

| Tier | OpenAI | Anthropic | DeepSeek | Google | Together | Fireworks |
|------|--------|-----------|----------|--------|----------|-----------|
| 2 (Cheap) | GPT-4o-mini | Haiku | DeepSeek-V3 | Gemini-Flash | Llama-3.1-8B | Mixtral-8x7B |
| 3 (Medium) | GPT-4o | Sonnet | DeepSeek-V3 | Gemini-Pro | Llama-3.1-70B | Qwen-2.5-72B |
| 4 (Frontier) | o1 | Opus | — | Gemini-Ultra | Llama-3.1-405B | — |

### Implementation

1. Maintain a provider pricing API (auto-fetch current prices)
2. Add provider latency/availability monitoring
3. Route to the cheapest available tier-adequate provider
4. Fallback chain: primary → secondary → tertiary
5. Cache routing decisions per provider for stability

**Estimated impact:** Additional 5–10% cost reduction on multi-provider setups

---

## Phase 6: KV Cache Sharing

**Goal:** Share prefix KV caches across concurrent agent runs that use identical system prompts.

### How It Works

- Many agent runs share the same system prompt + tool descriptions
- vLLM and SGLang support prefix caching / KV cache sharing
- Running N agents concurrently → cache the system prompt once, reuse it N-1 times

### Implementation

1. Integrate with a vLLM/SGLang backend for local models
2. Group agent runs by identical prefix hash
3. Pre-fill the shared prefix once, append a per-run suffix
4. Track cache hit rate per prefix group
5. Apply to multi-tenant agent deployments

**Estimated impact:** 20–40% cost reduction on concurrent agent farms

---

## Phase 7: Speculative Agent Actions

**Goal:** Generate the next N actions with a cheap model; validate with the frontier model only on divergence.

### How It Works

1. The cheap model generates the next action sequence (plan + tool calls)
2. The frontier model validates only the *divergent* or *high-risk* actions
3. If the cheap model's plan matches the frontier's with >0.9 similarity → use the cheap plan
4. If divergence exceeds the threshold → regenerate with the frontier model

### Use Cases

- Multi-step coding workflows (cheap model generates the plan, frontier validates critical steps)
- Research workflows (cheap model suggests search queries, frontier validates synthesis)
- Tool-heavy workflows (cheap model predicts the tool sequence, frontier validates data transformations)

**Estimated impact:** 15–25% cost reduction on multi-step tasks

---

## Phase 8: Confidence Calibration with Process Reward Models

**Goal:** Train a per-step success predictor for dynamic compute allocation.

### Current State

- The router uses task-level difficulty classification
- It does not adapt compute within a task based on step-level confidence

### Improvement

1. Train a small PRM (Process Reward Model) on agent traces
2. At each step, the PRM predicts P(success | current state)
3. If P(success) < 0.5 → escalate the model, retrieve more context, or call a verifier
4. If P(success) > 0.95 → use a cheaper model for the next step
5. Dynamically allocate compute based on real-time trajectory quality

**Estimated impact:** 10–15% cost reduction with quality preservation

---

## Phase 9: Human-in-the-Loop Integration

**Goal:** Learn from human corrections to improve routing and reduce future mistakes.

### Implementation

1. When a human corrects the agent's output → label the trace
2. If the human says "should have used a stronger model" → update routing probabilities
3. If the human says "didn't need to call that tool" → update tool gate thresholds
4. If the human says "stopped too early" → update doom detector thresholds
5. Feed corrections into the online learning loop (Phase 3)

**Estimated impact:** Reduces false-DONE rate and missed escalation rate by 30–50%

---

## Phase 10: Meta-Learning Across Tasks

**Goal:** Learn task-specific optimal policies from a small number of examples.

### How It Works

- A new task type appears (e.g., "medical diagnosis assistant")
- ACO has no prior traces for this task type
- A meta-learner transfers policies from similar task types (e.g., legal → medical, both high-risk)
- Few-shot calibration of thresholds from the first 10–20 traces

### Implementation

1. Embed task types in a semantic space
2. Find the k-nearest task types with sufficient trace history
3. Transfer router weights, tool gate thresholds, and verifier rules
4. Apply a Bayesian update as new task traces arrive
5. Converge to a task-specific policy within 50 traces

**Estimated impact:** Reduces the cold-start period from 100 traces to 20 traces

---

## Summary: Priority Ranking

| Phase | Impact | Effort | Priority |
|-------|--------|--------|----------|
| 1. Learned Router | ⭐⭐⭐⭐⭐ | Medium | **#1** |
| 2. Real Benchmark | ⭐⭐⭐⭐⭐ | High | #2 |
| 3. Online Learning | ⭐⭐⭐⭐⭐ | High | #3 |
| 4. Verifier Cascading | ⭐⭐⭐⭐ | Low | #4 |
| 5. Cross-Provider | ⭐⭐⭐⭐ | Medium | #5 |
| 6. KV Cache Sharing | ⭐⭐⭐ | High | #6 |
| 7. Speculative Actions | ⭐⭐⭐⭐ | High | #7 |
| 8. PRM Calibration | ⭐⭐⭐⭐ | High | #8 |
| 9. Human-in-the-Loop | ⭐⭐⭐ | Medium | #9 |
| 10. Meta-Learning | ⭐⭐⭐ | High | #10 |

---

## Success Metrics for Each Phase

Track these metrics for every phase:

1. **Cost per successful task** (primary)
2. **Cost per artifact** (secondary)
3. **Task success rate** (must not regress)
4. **False-DONE rate** (must not increase)
5. **Unsafe cheap-model miss rate** (must be <2%)
6. **Missed escalation rate** (must be <5%)
7. **Cache hit rate** (target >60%)
8. **Tool call efficiency** (used/called ratio >80%)
9. **Verifier pass rate** (target >85%)
10. **Latency per task** (must not increase >20%)

---

*Last updated: 2025-07-05*