Agent Cost Optimizer – Roadmap
Current Status (v1.0)
- ✅ 10 core modules implemented and benchmarked
- ✅ 28% cost reduction at iso-quality (94.3% success rate)
- ✅ Synthetic benchmark (2K traces, 19 scenarios)
- ✅ Learned router skeleton (trainable, not yet trained on real data)
- ✅ Deployment guide, model card, technical report
- ✅ Gradio dashboard code (not yet deployed)
Phase 1: Learned Router (Immediate Priority)
Goal: Replace heuristic router with classifier trained on real traces.
Why This Is #1
The ablation study shows the model router is the most critical module. A trained classifier could:
- Increase savings from 28% to 35–40%
- Reduce false escalations by 50%
- Enable task-specific routing (code → Claude, reasoning → o3-mini)
Implementation
- Collect 10K+ real traces with full telemetry
- Extract (request_features, optimal_tier) pairs
- Train simple logistic regression / small neural classifier
- Or: Train with GRPO using cost-adjusted reward (BAAR-style boundary-guided routing)
- A/B test against heuristic router
- Fall back to heuristic when confidence < 0.7
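A minimal sketch of the confidence-gated router described above. The 0.7 floor is from the plan; the feature names and weights are illustrative placeholders, not trained values:

```python
import math

# Illustrative feature weights for a trained logistic router
# (placeholder values, not from real training).
WEIGHTS = {"prompt_tokens": 0.002, "has_code": 1.1, "num_tools": 0.3}
BIAS = -1.5
CONFIDENCE_FLOOR = 0.7  # below this, fall back to the heuristic router

def p_needs_frontier(features: dict) -> float:
    """Probability that the request needs the frontier tier."""
    z = BIAS + sum(WEIGHTS[k] * features.get(k, 0.0) for k in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

def route(features: dict, heuristic_tier: str) -> str:
    p = p_needs_frontier(features)
    confidence = max(p, 1.0 - p)
    if confidence < CONFIDENCE_FLOOR:
        return heuristic_tier  # low confidence: keep the heuristic decision
    return "frontier" if p >= 0.5 else "cheap"
```

The same interface works whether the scorer is logistic regression or a small neural classifier; only `p_needs_frontier` changes.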
Estimated effort: 2–3 days
Expected impact: +7–12pp cost savings, <1pp quality regression
Phase 2: Real Interactive Benchmark
Goal: Evaluate ACO against real agent tasks with actual model calls.
Why Synthetic Is Not Enough
Our synthetic benchmark assumes fixed success probabilities per tier. Real-world model behavior:
- Is non-stationary (models improve, new models release)
- Depends on prompt engineering, not just model strength
- Has provider-specific quirks (Claude vs GPT vs DeepSeek)
- Is affected by rate limits, timeouts, transient failures
Implementation
- Coding benchmark: Integrate with SWE-bench Lite (300 tasks)
  - Run with cheap model first, escalate on failure
  - Measure: pass@1, LLM calls, cost, time
- Tool-use benchmark: Integrate with BFCL (2,000 function-calling tasks)
  - Measure: tool accuracy, missed tools, cost
- Research benchmark: 100 real research questions
  - Run with retrieval + cheap model vs retrieval + frontier
  - Human evaluation: source quality, hallucination, coverage
- Long-horizon benchmark: 50 multi-step tasks (WebArena-style)
  - Measure: task completion, cost growth over steps, cache hit rate
Estimated effort: 1 week
Expected impact: Calibrate all module thresholds, discover edge cases
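The escalate-on-failure loop shared by these benchmarks can be sketched as below; the tier names and per-task costs are illustrative, and `attempt` stands in for a real model call plus grader:

```python
# Hypothetical tier ladder with illustrative per-attempt costs ($/task).
TIERS = [("cheap", 0.002), ("medium", 0.02), ("frontier", 0.15)]

def run_with_escalation(task, attempt):
    """attempt(task, tier) -> bool. Walk up the ladder on failure.
    Returns (passed, tier_used, total_cost)."""
    total = 0.0
    for tier, cost in TIERS:
        total += cost
        if attempt(task, tier):
            return True, tier, total
    return False, TIERS[-1][0], total
```

Summing cost across failed attempts (rather than counting only the winning tier) is what makes pass@1-per-dollar comparable across routing policies.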
Phase 3: Online Learning Loop
Goal: Update routing probabilities, tool gate thresholds, and doom detector thresholds from live trace feedback.
Why Static Policies Fail
- Model capabilities improve (GPT-4o → GPT-5)
- New cheap models release (GPT-4o-mini → even cheaper)
- Task mix changes over time
- User behavior shifts
Implementation
- Trace ingestion pipeline: Collect traces from production runs
- Outcome labeling: Success/failure/escalation labels from user feedback
- Online update: Update router classifier weights weekly
- Thompson sampling: Explore new routing decisions with small probability
- Drift detection: Alert when success rate drops >5pp for a task type
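The Thompson-sampling exploration step above can be sketched with per-tier Beta posteriors over success rate; the tier names and 5% exploration probability are illustrative:

```python
import random

# Per-tier Beta posteriors: [successes + 1, failures + 1].
posteriors = {"cheap": [1, 1], "medium": [1, 1], "frontier": [1, 1]}

def sample_tier(explore_prob=0.05, default="cheap"):
    """With small probability, Thompson-sample a tier from the posteriors;
    otherwise keep the default routing decision."""
    if random.random() >= explore_prob:
        return default
    draws = {t: random.betavariate(a, b) for t, (a, b) in posteriors.items()}
    return max(draws, key=draws.get)

def record(tier, success):
    """Update the posterior from a labeled trace outcome."""
    posteriors[tier][0 if success else 1] += 1
```

`record` is the hook the trace-ingestion pipeline would call after outcome labeling.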
Estimated effort: 1 week
Expected impact: Maintains 28%+ savings as models/task mix evolve
Phase 4: Verifier Cascading
Goal: Use cheap verifier first, escalate to expensive verifier only on disagreement.
Current State
- Verifier budgeter decides WHETHER to verify
- When it decides YES, it always uses the same verifier model
Improvement
- Tier 1: Simple regex/rule-based checks (free)
- Tier 2: Cheap model verifier (GPT-4o-mini, $0.15/M tok)
- Tier 3: Expensive verifier (GPT-4o, $2.50/M tok), only when tier 2 flags an issue
- Consensus mode: Run cheap + medium verifier, escalate if disagree
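The three-tier cascade can be sketched as follows, where `cheap_verify` and `expensive_verify` are hypothetical callables wrapping the tier-2 and tier-3 verifier models, and the tier-1 regex checks are illustrative:

```python
import re

def tier1_checks(output: str) -> bool:
    """Free rule-based checks: non-empty, no obvious error markers."""
    return bool(output.strip()) and not re.search(r"Traceback|NameError", output)

def cascade_verify(output, cheap_verify, expensive_verify):
    """Returns (verdict, tier_used). Escalate only when a cheaper tier flags."""
    if not tier1_checks(output):
        return False, "tier1"           # free rejection, no model call
    if cheap_verify(output):
        return True, "tier2"            # cheap verifier satisfied: stop here
    return expensive_verify(output), "tier3"
```

Consensus mode would instead run tier 2 and a medium verifier together and escalate to tier 3 only when they disagree.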
Estimated impact: 60–80% verifier cost reduction on low-risk tasks
Phase 5: Cross-Provider Cost Optimization
Goal: Route to cheapest provider offering adequate model tier.
Providers with Similar-Tier Models
| Tier | OpenAI | Anthropic | DeepSeek | Google | Together | Fireworks |
|---|---|---|---|---|---|---|
| 2 (Cheap) | GPT-4o-mini | Haiku | DeepSeek-V3 | Gemini-Flash | Llama-3.1-8B | Mixtral-8x7B |
| 3 (Medium) | GPT-4o | Sonnet | DeepSeek-V3 | Gemini-Pro | Llama-3.1-70B | Qwen-2.5-72B |
| 4 (Frontier) | o1 | Opus | – | Gemini-Ultra | Llama-3.1-405B | – |
Implementation
- Maintain provider pricing API (auto-fetch current prices)
- Add provider latency/availability monitoring
- Route to cheapest available tier-adequate provider
- Fallback chain: primary → secondary → tertiary
- Cache routing decisions per provider for stability
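The routing-plus-fallback step can be sketched as below; the provider names, prices, and availability flags are illustrative, standing in for the live pricing and health-monitoring feeds:

```python
# Hypothetical snapshot of the pricing/health table:
# tier -> [(provider, $ per M input tokens, currently_available)].
PROVIDERS = {
    "cheap": [("deepseek", 0.14, True), ("openai", 0.15, True),
              ("together", 0.18, False)],
}

def pick_providers(tier: str) -> list:
    """Cheapest available providers for the tier, ordered as the
    fallback chain: primary, then secondary, and so on."""
    candidates = sorted((price, name)
                        for name, price, up in PROVIDERS[tier] if up)
    if not candidates:
        raise RuntimeError(f"no provider available for tier {tier}")
    return [name for _, name in candidates]
```

Caching the returned chain per tier (and refreshing it on a timer) gives the routing stability the last bullet asks for.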
Estimated impact: Additional 5–10% cost reduction on multi-provider setups
Phase 6: KV Cache Sharing
Goal: Share prefix KV caches across concurrent agent runs using identical system prompts.
How It Works
- Many agent runs share the same system prompt + tool descriptions
- vLLM and SGLang support prefix caching / KV cache sharing
- Running N agents concurrently → cache the system prompt once, reuse it N-1 times
Implementation
- Integrate with vLLM/SGLang backend for local models
- Group agent runs by identical prefix hash
- Pre-fill shared prefix once, append per-run suffix
- Track cache hit rate per prefix group
- Apply to multi-tenant agent deployments
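The grouping step can be sketched as hashing the shared prefix and bucketing runs by that hash; each group then needs only one prefill, and the cache hit rate falls out of the group sizes:

```python
import hashlib
from collections import defaultdict

def prefix_hash(system_prompt: str, tool_descriptions: str) -> str:
    """Stable key for runs that can share one prefilled KV cache."""
    payload = system_prompt + "\x00" + tool_descriptions
    return hashlib.sha256(payload.encode()).hexdigest()

def group_runs(runs):
    """runs: iterable of (run_id, system_prompt, tool_descriptions).
    Runs in the same group reuse a single prefix prefill."""
    groups = defaultdict(list)
    for run_id, system_prompt, tools in runs:
        groups[prefix_hash(system_prompt, tools)].append(run_id)
    return dict(groups)
```

With N runs and G groups, the prefix cache hit rate is (N - G) / N, which is the per-group metric the fourth bullet tracks.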
Estimated impact: 20–40% cost reduction on concurrent agent farms
Phase 7: Speculative Agent Actions
Goal: Generate next N actions with cheap model, validate with frontier only on divergence.
How It Works
- Cheap model generates next action sequence (plan + tool calls)
- Frontier model validates only the divergent or high-risk actions
- If cheap model plan matches frontier with >0.9 similarity → use cheap
- If divergence > threshold → regenerate with frontier
Use Cases
- Multi-step coding workflows (cheap generates plan, frontier validates critical steps)
- Research workflows (cheap suggests search queries, frontier validates synthesis)
- Tool-heavy workflows (cheap predicts tool sequence, frontier validates data transformations)
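A sketch of the validate-on-divergence loop, with the 0.9 floor from above; `frontier_validate` and `high_risk` are hypothetical callables (a frontier rewrite of an action, and a risk classifier), and string similarity stands in for whatever plan-comparison metric is used:

```python
from difflib import SequenceMatcher

SIMILARITY_FLOOR = 0.9  # from the plan: >0.9 agreement keeps the cheap action

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def speculate(cheap_plan, frontier_validate, high_risk):
    """cheap_plan: list of action strings proposed by the cheap model.
    Only high-risk actions pay for a frontier call; divergent ones are
    replaced by the frontier's version. Returns (final_plan, frontier_calls)."""
    final, frontier_calls = [], 0
    for action in cheap_plan:
        if not high_risk(action):
            final.append(action)            # accept the cheap action as-is
            continue
        frontier_calls += 1
        checked = frontier_validate(action)
        if similarity(action, checked) >= SIMILARITY_FLOOR:
            final.append(action)            # frontier agrees: keep cheap
        else:
            final.append(checked)           # divergence: take frontier's action
    return final, frontier_calls
```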
Estimated impact: 15–25% cost reduction on multi-step tasks
Phase 8: Confidence Calibration with Process Reward Models
Goal: Train a per-step success predictor for dynamic compute allocation.
Current State
- Router uses task-level difficulty classification
- Does not adapt compute within a task based on step-level confidence
Improvement
- Train a small PRM (Process Reward Model) on agent traces
- At each step, PRM predicts P(success | current state)
- If P(success) < 0.5 → escalate model, retrieve more context, or call verifier
- If P(success) > 0.95 → use cheaper model for next step
- Dynamically allocate compute based on real-time trajectory quality
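The per-step dispatch reduces to a small decision function over the PRM score, using the 0.5 and 0.95 thresholds named above (the action names are illustrative):

```python
# Thresholds from the plan: escalate below 0.5, downshift above 0.95.
ESCALATE_BELOW = 0.5
DOWNSHIFT_ABOVE = 0.95

def next_step_action(p_success: float) -> str:
    """Map the PRM's P(success | current state) to a compute decision."""
    if p_success < ESCALATE_BELOW:
        return "escalate"    # stronger model, more context, or verifier call
    if p_success > DOWNSHIFT_ABOVE:
        return "downshift"   # cheaper model for the next step
    return "continue"        # keep the current tier
```

In the online learning loop (Phase 3), these two thresholds would themselves be calibrated from labeled traces.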
Estimated impact: 10–15% cost reduction with quality preservation
Phase 9: Human-in-the-Loop Integration
Goal: Learn from human corrections to improve routing and reduce future mistakes.
Implementation
- When human corrects agent output → label the trace
- If human says "should have used stronger model" → update routing probabilities
- If human says "didn't need to call that tool" → update tool gate thresholds
- If human says "stopped too early" → update doom detector thresholds
- Feed corrections into online learning loop (Phase 3)
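The labeling step can be sketched as mapping each correction onto the module it should adjust; the field names and direction encoding are illustrative, not the real trace schema:

```python
# Map human corrections onto (module_to_adjust, direction).
# +1 means "should have done more" (stronger model, run longer);
# -1 means "should have done less" (skip the tool call).
CORRECTION_SIGNALS = {
    "should have used stronger model": ("router", +1),
    "didn't need to call that tool":   ("tool_gate", -1),
    "stopped too early":               ("doom_detector", +1),
}

def label_trace(trace: dict, correction: str) -> dict:
    """Attach a training label derived from a human correction."""
    module, direction = CORRECTION_SIGNALS.get(correction, (None, 0))
    trace["label"] = {"module": module, "direction": direction,
                      "text": correction}
    return trace
```

Unmapped free-form comments get a null module and are kept only as text, so the Phase 3 loop trains solely on corrections it can interpret.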
Estimated impact: Reduces false-DONE rate and missed escalation rate by 30–50%
Phase 10: Meta-Learning Across Tasks
Goal: Learn task-specific optimal policies from a small number of examples.
How It Works
- New task type appears (e.g., "medical diagnosis assistant")
- ACO has no prior traces for this task type
- Meta-learner transfers policies from similar task types (e.g., legal → medical, both high-risk)
- Few-shot calibrates thresholds from first 10–20 traces
Implementation
- Embed task types in semantic space
- Find k-nearest task types with sufficient trace history
- Transfer router weights, tool gate thresholds, verifier rules
- Bayesian update with new task traces
- Converge to task-specific policy within 50 traces
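The nearest-neighbour transfer step can be sketched as below; the task embeddings, policy fields, and values are illustrative placeholders for a real embedding model and the actual per-task policies:

```python
import math

# Hypothetical task-type embeddings and learned policies (illustrative).
TASK_EMBEDDINGS = {"legal": (0.9, 0.8), "coding": (0.1, 0.2),
                   "research": (0.3, 0.6)}
TASK_POLICIES = {"legal": {"verify_rate": 0.9},
                 "coding": {"verify_rate": 0.3},
                 "research": {"verify_rate": 0.5}}

def transfer_policy(new_task_vec, k=1):
    """Cold-start policy for a new task type: average the policies of
    its k nearest task types in embedding space."""
    nearest = sorted(TASK_EMBEDDINGS,
                     key=lambda t: math.dist(new_task_vec, TASK_EMBEDDINGS[t]))[:k]
    keys = TASK_POLICIES[nearest[0]].keys()
    return {key: sum(TASK_POLICIES[t][key] for t in nearest) / k
            for key in keys}
```

The transferred policy is only the prior; the Bayesian update from the first real traces then pulls it toward the task-specific optimum.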
Estimated impact: Reduces cold-start period from 100 traces to 20 traces
Summary: Priority Ranking
| Phase | Impact | Effort | Priority |
|---|---|---|---|
| 1. Learned Router | ★★★★★ | Medium | #1 |
| 2. Real Benchmark | ★★★★★ | High | #2 |
| 3. Online Learning | ★★★★★ | High | #3 |
| 4. Verifier Cascading | ★★★★ | Low | #4 |
| 5. Cross-Provider | ★★★★ | Medium | #5 |
| 6. KV Cache Sharing | ★★★ | High | #6 |
| 7. Speculative Actions | ★★★★ | High | #7 |
| 8. PRM Calibration | ★★★★ | High | #8 |
| 9. Human-in-the-Loop | ★★★ | Medium | #9 |
| 10. Meta-Learning | ★★★ | High | #10 |
Success Metrics for Each Phase
Track these metrics for every phase:
- Cost per successful task (primary)
- Cost per artifact (secondary)
- Task success rate (must not regress)
- False-DONE rate (must not increase)
- Unsafe cheap-model miss rate (must be <2%)
- Missed escalation rate (must be <5%)
- Cache hit rate (target >60%)
- Tool call efficiency (used/called ratio >80%)
- Verifier pass rate (target >85%)
- Latency per task (must not increase >20%)
Last updated: 2025-07-05