# Agent Cost Optimizer – Roadmap


## Current Status (v1.0)


- ✅ 10 core modules implemented and benchmarked
- ✅ 28% cost reduction at iso-quality (94.3% success rate)
- ✅ Synthetic benchmark (2K traces, 19 scenarios)
- ✅ Learned router skeleton (trainable, not yet trained on real data)
- ✅ Deployment guide, model card, technical report
- ✅ Gradio dashboard code (not yet deployed)


---


## Phase 1: Learned Router (Immediate Priority)


**Goal:** Replace the heuristic router with a classifier trained on real traces.


### Why This Is #1
The ablation study shows the model router is the most critical module. A trained classifier could:
- Increase savings from 28% to 35–40%
- Reduce false escalations by 50%
- Enable task-specific routing (code → Claude, reasoning → o3-mini)


### Implementation
1. Collect 10K+ real traces with full telemetry
2. Extract (request_features, optimal_tier) pairs
3. Train a simple logistic regression or small neural classifier
4. Alternatively, train with GRPO using a cost-adjusted reward (BAAR-style boundary-guided routing)
5. A/B test against the heuristic router
6. Fall back to the heuristic router when confidence < 0.7 (see the sketch below)
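
A minimal sketch of step 3 plus the step 6 fallback, assuming scikit-learn and a `HeuristicRouter` with a `route(features)` method (both illustrative, not the shipped API):

```python
# Minimal sketch: logistic-regression tier router with heuristic fallback.
import numpy as np
from sklearn.linear_model import LogisticRegression

class LearnedRouter:
    def __init__(self, heuristic_router, confidence_floor=0.7):
        self.clf = LogisticRegression(max_iter=1000)
        self.heuristic = heuristic_router
        self.confidence_floor = confidence_floor

    def fit(self, request_features, optimal_tiers):
        # (request_features, optimal_tier) pairs extracted in step 2
        self.clf.fit(request_features, optimal_tiers)

    def route(self, features):
        probs = self.clf.predict_proba(np.asarray(features).reshape(1, -1))[0]
        if probs.max() < self.confidence_floor:
            # Step 6: low confidence -> defer to the heuristic router
            return self.heuristic.route(features)
        return self.clf.classes_[int(probs.argmax())]
```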


**Estimated effort:** 2–3 days
**Expected impact:** +7–12pp cost savings, <1pp quality regression


---


## Phase 2: Real Interactive Benchmark


**Goal:** Evaluate ACO against real agent tasks with actual model calls.


### Why Synthetic Is Not Enough
Our synthetic benchmark assumes fixed success probabilities per tier. Real-world model behavior:
- Is non-stationary (models improve, new models release)
- Depends on prompt engineering, not just model strength
- Has provider-specific quirks (Claude vs GPT vs DeepSeek)
- Is affected by rate limits, timeouts, transient failures


### Implementation
1. **Coding benchmark:** Integrate with SWE-bench Lite (300 tasks)
   - Run with cheap model first, escalate on failure (see the sketch below)
   - Measure: pass@1, LLM calls, cost, time
2. **Tool-use benchmark:** Integrate with BFCL (2,000 function-calling tasks)
   - Measure: tool accuracy, missed tools, cost
3. **Research benchmark:** 100 real research questions
   - Run with retrieval + cheap model vs retrieval + frontier
   - Human evaluation: source quality, hallucination, coverage
4. **Long-horizon benchmark:** 50 multi-step tasks (WebArena-style)
   - Measure: task completion, cost growth over steps, cache hit rate
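
A minimal sketch of the cheap-first escalation loop for the coding benchmark; `run_task` is a hypothetical harness call (assumed to return `.passed` and `.cost`), and the tier list is illustrative:

```python
# Minimal sketch: attempt each tier in price order, stop at the first pass.
TIERS = ["gpt-4o-mini", "gpt-4o", "o1"]  # cheap -> frontier (illustrative)

def solve_with_escalation(task):
    total_cost, calls = 0.0, 0
    for model in TIERS:
        result = run_task(task, model=model)  # hypothetical harness call
        total_cost += result.cost
        calls += 1
        if result.passed:
            return {"passed": True, "model": model,
                    "llm_calls": calls, "cost_usd": total_cost}
    return {"passed": False, "model": None,
            "llm_calls": calls, "cost_usd": total_cost}
```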


**Estimated effort:** 1 week
**Expected impact:** Calibrate all module thresholds, discover edge cases


---


## Phase 3: Online Learning Loop


**Goal:** Update routing probabilities, tool gate thresholds, and doom detector thresholds from live trace feedback.


### Why Static Policies Fail
- Model capabilities improve (GPT-4o → GPT-5)
- New cheap models release (GPT-4o-mini → even cheaper)
- Task mix changes over time
- User behavior shifts


### Implementation
1. **Trace ingestion pipeline:** Collect traces from production runs
2. **Outcome labeling:** Success/failure/escalation labels from user feedback
3. **Online update:** Update router classifier weights weekly
4. **Thompson sampling:** Explore new routing decisions with small probability (see the sketch below)
5. **Drift detection:** Alert when success rate drops >5pp for a task type
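
Step 4 could use a standard Beta-Bernoulli bandit; a minimal sketch, assuming success/failure labels arrive from the outcome-labeling step (a cost-adjusted variant would weight the draws by price):

```python
# Minimal sketch: Beta-Bernoulli Thompson sampling over routing tiers.
import random

class ThompsonRouter:
    def __init__(self, tiers):
        self.alpha = {t: 1.0 for t in tiers}  # prior successes + 1
        self.beta = {t: 1.0 for t in tiers}   # prior failures + 1

    def pick_tier(self):
        # Sample a plausible success rate per tier; pick the best draw.
        draws = {t: random.betavariate(self.alpha[t], self.beta[t])
                 for t in self.alpha}
        return max(draws, key=draws.get)

    def update(self, tier, success):
        # Called by the trace pipeline once the outcome label is known.
        if success:
            self.alpha[tier] += 1.0
        else:
            self.beta[tier] += 1.0
```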


**Estimated effort:** 1 week
**Expected impact:** Maintains 28%+ savings as models/task mix evolve


---


## Phase 4: Verifier Cascading


**Goal:** Use a cheap verifier first, escalating to an expensive verifier only on disagreement.


### Current State
- Verifier budgeter decides WHETHER to verify
- When it decides YES, it always uses the same verifier model


### Improvement
- **Tier 1:** Simple regex/rule-based checks (free)
- **Tier 2:** Cheap model verifier (GPT-4o-mini, $0.15/M tok)
- **Tier 3:** Expensive verifier (GPT-4o, $2.50/M tok) → only when Tier 2 flags an issue
- **Consensus mode:** Run cheap + medium verifiers, escalate if they disagree (see the sketch below)
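
A minimal sketch of the cascade; `llm_verify` is a hypothetical helper (assumed to return a verdict with `flagged` and `passed` fields):

```python
# Minimal sketch: three-tier verifier cascade, cheapest check first.
import re

def cascade_verify(output, task):
    # Tier 1: free rule-based check (illustrative: reject empty output)
    if not re.search(r"\S", output):
        return False
    # Tier 2: cheap model verifier
    cheap = llm_verify(output, task, model="gpt-4o-mini")
    if not cheap.flagged:
        return cheap.passed
    # Tier 3: expensive verifier, only when Tier 2 flags an issue
    return llm_verify(output, task, model="gpt-4o").passed
```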


**Estimated impact:** 60–80% verifier cost reduction on low-risk tasks


---


## Phase 5: Cross-Provider Cost Optimization


**Goal:** Route to the cheapest provider offering an adequate model tier.


### Providers with Similar-Tier Models


| Tier | OpenAI | Anthropic | DeepSeek | Google | Together | Fireworks |
|------|--------|-----------|----------|--------|----------|-----------|
| 2 (Cheap) | GPT-4o-mini | Haiku | DeepSeek-V3 | Gemini-Flash | Llama-3.1-8B | Mixtral-8x7B |
| 3 (Medium) | GPT-4o | Sonnet | DeepSeek-V3 | Gemini-Pro | Llama-3.1-70B | Qwen-2.5-72B |
| 4 (Frontier) | o1 | Opus | – | Gemini-Ultra | Llama-3.1-405B | – |


### Implementation
1. Maintain a provider pricing API (auto-fetch current prices)
2. Add provider latency/availability monitoring
3. Route to the cheapest available tier-adequate provider (see the sketch below)
4. Fallback chain: primary → secondary → tertiary
5. Cache routing decisions per provider for stability
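
A minimal sketch of steps 3–4; the prices are illustrative snapshots (step 1 would auto-fetch them), and `is_available` stands in for the step 2 monitor:

```python
# Minimal sketch: cheapest healthy provider for a tier, with fallback chain.
PRICES = {  # $ per 1M input tokens, tier -> {provider: price} (illustrative)
    2: {"openai": 0.15, "anthropic": 0.25, "deepseek": 0.14},
    3: {"openai": 2.50, "anthropic": 3.00, "google": 1.25},
}

def pick_provider(tier, is_available):
    # Walk providers in price order; the sort order doubles as the fallback chain.
    for provider, _price in sorted(PRICES[tier].items(), key=lambda kv: kv[1]):
        if is_available(provider):  # availability monitor from step 2
            return provider
    raise RuntimeError(f"no available provider for tier {tier}")
```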


**Estimated impact:** Additional 5–10% cost reduction on multi-provider setups


---


## Phase 6: KV Cache Sharing


**Goal:** Share prefix KV caches across concurrent agent runs using identical system prompts.


### How It Works
- Many agent runs share the same system prompt + tool descriptions
- vLLM and SGLang support prefix caching / KV cache sharing
- Running N agents concurrently → cache the system prompt once, reuse it N-1 times


### Implementation
1. Integrate with a vLLM/SGLang backend for local models
2. Group agent runs by identical prefix hash (see the sketch below)
3. Pre-fill the shared prefix once, append a per-run suffix
4. Track cache hit rate per prefix group
5. Apply to multi-tenant agent deployments
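
A minimal sketch of step 2's grouping; the run field names are assumptions, and the backend prefix-cache wiring is not shown:

```python
# Minimal sketch: group concurrent runs by shared-prefix hash.
import hashlib
from collections import defaultdict

def group_by_prefix(runs):
    groups = defaultdict(list)
    for run in runs:
        prefix = run["system_prompt"] + run["tool_descriptions"]
        key = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
        groups[key].append(run)
    # One prefill per group; the other members are prefix-cache hits,
    # so the hit rate is roughly 1 - len(groups) / len(runs).
    return groups
```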


**Estimated impact:** 20–40% cost reduction on concurrent agent farms


---


## Phase 7: Speculative Agent Actions


**Goal:** Generate the next N actions with a cheap model, validating with the frontier model only on divergence.


### How It Works
1. Cheap model generates the next action sequence (plan + tool calls)
2. Frontier model validates only the *divergent* or *high-risk* actions
3. If the cheap plan matches the frontier's with >0.9 similarity → use the cheap plan
4. If divergence > threshold → regenerate with the frontier model (see the sketch below)
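
A minimal sketch of the accept/regenerate decision; `propose_actions`, `validate_plan`, and `similarity` are hypothetical helpers:

```python
# Minimal sketch: speculative plan from the cheap model, frontier check.
def speculative_step(state, threshold=0.9):
    draft = propose_actions(state, model="cheap")            # cheap drafts plan
    review = validate_plan(state, draft, model="frontier")   # frontier reviews
    if similarity(draft, review.plan) >= threshold:
        return draft        # plans agree: keep the cheap plan (and its price)
    return review.plan      # divergence above threshold: take frontier's plan
```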


### Use Cases
- Multi-step coding workflows (cheap generates plan, frontier validates critical steps)
- Research workflows (cheap suggests search queries, frontier validates synthesis)
- Tool-heavy workflows (cheap predicts tool sequence, frontier validates data transformations)


**Estimated impact:** 15–25% cost reduction on multi-step tasks


---


## Phase 8: Confidence Calibration with Process Reward Models


**Goal:** Train a per-step success predictor for dynamic compute allocation.


### Current State
- Router uses task-level difficulty classification
- Does not adapt compute within a task based on step-level confidence


### Improvement
1. Train a small PRM (Process Reward Model) on agent traces
2. At each step, the PRM predicts P(success | current state)
3. If P(success) < 0.5 → escalate the model, retrieve more context, or call the verifier
4. If P(success) > 0.95 → use a cheaper model for the next step
5. Dynamically allocate compute based on real-time trajectory quality (see the sketch below)
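
A minimal sketch of the step-level gate; `prm_score`, `escalate`, and `de_escalate` are hypothetical helpers:

```python
# Minimal sketch: PRM-gated compute allocation per agent step.
def next_step_tier(state, current_tier):
    p = prm_score(state)  # PRM estimate of P(success | current state)
    if p < 0.5:
        return escalate(current_tier)      # stronger model / more context
    if p > 0.95:
        return de_escalate(current_tier)   # cheaper model for the next step
    return current_tier                    # confidence in-band: keep the tier
```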


**Estimated impact:** 10–15% cost reduction with quality preservation


---


## Phase 9: Human-in-the-Loop Integration


**Goal:** Learn from human corrections to improve routing and reduce future mistakes.


### Implementation
1. When a human corrects agent output → label the trace
2. If the human says "should have used a stronger model" → update routing probabilities
3. If the human says "didn't need to call that tool" → update tool gate thresholds
4. If the human says "stopped too early" → update doom detector thresholds
5. Feed corrections into the online learning loop (Phase 3; see the sketch below)
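
A minimal sketch of the correction-to-update mapping; the label taxonomy and the update hooks are assumptions:

```python
# Minimal sketch: route human correction labels to policy updates.
def apply_correction(trace, label, router, tool_gate, doom, online_loop):
    handlers = {
        "needed_stronger_model": lambda: router.boost_tier(trace.task_type),
        "unneeded_tool_call":    lambda: tool_gate.raise_threshold(trace.tool),
        "stopped_too_early":     lambda: doom.raise_patience(trace.task_type),
    }
    handlers[label]()                  # apply the matching threshold update
    online_loop.ingest(trace, label)   # feed the Phase 3 learning loop
```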


**Estimated impact:** Reduces false-DONE rate and missed escalation rate by 30–50%


---


## Phase 10: Meta-Learning Across Tasks


**Goal:** Learn task-specific optimal policies from a small number of examples.


### How It Works
- A new task type appears (e.g., "medical diagnosis assistant")
- ACO has no prior traces for this task type
- Meta-learner transfers policies from similar task types (e.g., legal → medical, both high-risk)
- Few-shot calibration of thresholds from the first 10–20 traces


### Implementation
1. Embed task types in a semantic space
2. Find the k-nearest task types with sufficient trace history
3. Transfer router weights, tool gate thresholds, verifier rules (see the sketch below)
4. Bayesian-update with new task traces
5. Converge to a task-specific policy within 50 traces
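
A minimal sketch of steps 1–3, assuming an `embed_task` embedding function (unit-norm vectors) and a policy store whose entries carry an `embedding` and a `thresholds` dict:

```python
# Minimal sketch: warm-start a new task type from its nearest task types.
import numpy as np

def warm_start_thresholds(new_task_type, policy_store, embed_task, k=3):
    query = embed_task(new_task_type)  # semantic embedding of the task type
    ranked = sorted(policy_store.values(),
                    key=lambda p: -float(np.dot(query, p.embedding)))
    neighbors = ranked[:k]  # k nearest task types with trace history
    # Average neighbor thresholds as the prior; step 4 Bayesian-updates it.
    return {name: float(np.mean([p.thresholds[name] for p in neighbors]))
            for name in neighbors[0].thresholds}
```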


**Estimated impact:** Reduces cold-start period from 100 traces to 20 traces


---


## Summary: Priority Ranking


| Phase | Impact | Effort | Priority |
|-------|--------|--------|----------|
| 1. Learned Router | ★★★★★ | Medium | **#1** |
| 2. Real Benchmark | ★★★★★ | High | #2 |
| 3. Online Learning | ★★★★★ | High | #3 |
| 4. Verifier Cascading | ★★★★ | Low | #4 |
| 5. Cross-Provider | ★★★★ | Medium | #5 |
| 6. KV Cache Sharing | ★★★ | High | #6 |
| 7. Speculative Actions | ★★★★ | High | #7 |
| 8. PRM Calibration | ★★★★ | High | #8 |
| 9. Human-in-the-Loop | ★★★ | Medium | #9 |
| 10. Meta-Learning | ★★★ | High | #10 |


---


## Success Metrics for Each Phase


Track these metrics for every phase (a computation sketch for the primary metric follows the list):


1. **Cost per successful task** (primary)
2. **Cost per artifact** (secondary)
3. **Task success rate** (must not regress)
4. **False-DONE rate** (must not increase)
5. **Unsafe cheap-model miss rate** (must be <2%)
6. **Missed escalation rate** (must be <5%)
7. **Cache hit rate** (target >60%)
8. **Tool call efficiency** (used/called ratio >80%)
9. **Verifier pass rate** (target >85%)
10. **Latency per task** (must not increase >20%)
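
A minimal sketch of metric 1 as it might be computed from traces; the trace field names are assumptions:

```python
# Minimal sketch: cost per successful task from a batch of traces.
def cost_per_successful_task(traces):
    total_cost = sum(t["cost_usd"] for t in traces)
    successes = sum(1 for t in traces if t["success"])
    return total_cost / max(successes, 1)  # all spend amortized over successes
```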


---


*Last updated: 2025-07-05*