# Agent Cost Optimizer: Roadmap
## Current Status (v1.0)
- ✅ 10 core modules implemented and benchmarked
- ✅ 28% cost reduction at iso-quality (94.3% success rate)
- ✅ Synthetic benchmark (2K traces, 19 scenarios)
- ✅ Learned router skeleton (trainable, not yet trained on real data)
- ✅ Deployment guide, model card, technical report
- ✅ Gradio dashboard code (not yet deployed)
---
## Phase 1: Learned Router (Immediate Priority)
**Goal:** Replace heuristic router with classifier trained on real traces.
### Why This Is #1
The ablation study shows the model router is the most critical module. A trained classifier could:
- Increase savings from 28% to 35–40%
- Reduce false escalations by 50%
- Enable task-specific routing (code → Claude, reasoning → o3-mini)
### Implementation
1. Collect 10K+ real traces with full telemetry
2. Extract (request_features, optimal_tier) pairs
3. Train a simple logistic regression or small neural classifier
4. Or: Train with GRPO using cost-adjusted reward (BAAR-style boundary-guided routing)
5. A/B test against heuristic router
6. Fall back to heuristic when confidence < 0.7
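Steps 3 and 6 can be sketched together as a confidence-gated router: a scikit-learn classifier picks the tier, and any prediction below the 0.7 floor defers to the existing heuristic. The feature encoding, tier labels, and `heuristic_route` callable here are illustrative assumptions, not the actual ACO interfaces.

```python
# Sketch of the confidence-gated learned router (steps 3 and 6).
import numpy as np
from sklearn.linear_model import LogisticRegression

CONFIDENCE_FLOOR = 0.7  # below this, defer to the heuristic router

class LearnedRouter:
    def __init__(self, heuristic_route):
        self.clf = LogisticRegression(max_iter=1000)
        self.heuristic_route = heuristic_route  # fallback: features -> tier

    def fit(self, features, optimal_tiers):
        # features: (n, d) array of request features; optimal_tiers: (n,) tier labels
        self.clf.fit(features, optimal_tiers)
        return self

    def route(self, feature_vec):
        probs = self.clf.predict_proba([feature_vec])[0]
        best = int(np.argmax(probs))
        if probs[best] < CONFIDENCE_FLOOR:
            # step 6: low-confidence prediction, fall back to the heuristic
            return self.heuristic_route(feature_vec)
        return self.clf.classes_[best]
```

The same gate would wrap a GRPO-trained policy (step 4) unchanged, since it only needs per-tier probabilities.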
**Estimated effort:** 2–3 days
**Expected impact:** +7–12pp cost savings, <1pp quality regression
---
## Phase 2: Real Interactive Benchmark
**Goal:** Evaluate ACO against real agent tasks with actual model calls.
### Why Synthetic Is Not Enough
Our synthetic benchmark assumes fixed success probabilities per tier. Real-world model behavior:
- Is non-stationary (models improve, new models release)
- Depends on prompt engineering, not just model strength
- Has provider-specific quirks (Claude vs GPT vs DeepSeek)
- Is affected by rate limits, timeouts, transient failures
### Implementation
1. **Coding benchmark:** Integrate with SWE-bench lite (500 tasks)
- Run with cheap model first, escalate on failure
- Measure: pass@1, LLM calls, cost, time
2. **Tool-use benchmark:** Integrate with BFCL (2,000 function-calling tasks)
- Measure: tool accuracy, missed tools, cost
3. **Research benchmark:** 100 real research questions
- Run with retrieval + cheap model vs retrieval + frontier
- Human evaluation: source quality, hallucination, coverage
4. **Long-horizon benchmark:** 50 multi-step tasks (WebArena-style)
- Measure: task completion, cost growth over steps, cache hit rate
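The escalate-on-failure loop behind benchmark 1 can be written as a small harness that climbs the tier ladder and accounts for calls, cost, and time along the way. `solve` and `check` are hypothetical stand-ins for the benchmark's model call and grader, and the per-call pricing is illustrative.

```python
# Hypothetical escalation harness: try the cheapest tier first,
# climb only when the task's checker rejects the attempt.
import time

def run_with_escalation(task, tiers, solve, check):
    """tiers: list of (model_name, usd_per_call); solve(model, task) -> answer;
    check(task, answer) -> bool. Returns a metrics dict."""
    calls, cost, start = 0, 0.0, time.monotonic()
    for model, price in tiers:
        answer = solve(model, task)
        calls += 1
        cost += price
        if check(task, answer):
            return {"passed": True, "model": model, "llm_calls": calls,
                    "cost_usd": cost, "seconds": time.monotonic() - start}
    return {"passed": False, "model": None, "llm_calls": calls,
            "cost_usd": cost, "seconds": time.monotonic() - start}
```

Aggregating the returned dicts over SWE-bench lite tasks gives pass@1, mean calls, and cost per solved task directly.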
**Estimated effort:** 1 week
**Expected impact:** Calibrate all module thresholds, discover edge cases
---
## Phase 3: Online Learning Loop
**Goal:** Update routing probabilities, tool gate thresholds, and doom detector thresholds from live trace feedback.
### Why Static Policies Fail
- Model capabilities improve (GPT-4o → GPT-5)
- New cheap models are released (GPT-4o-mini → even cheaper)
- Task mix changes over time
- User behavior shifts
### Implementation
1. **Trace ingestion pipeline:** Collect traces from production runs
2. **Outcome labeling:** Success/failure/escalation labels from user feedback
3. **Online update:** Update router classifier weights weekly
4. **Thompson sampling:** Explore new routing decisions with small probability
5. **Drift detection:** Alert when success rate drops >5pp for a task type
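One way to read step 4: keep a Beta posterior over each tier's success rate and route by sampling from it, so under-explored tiers occasionally win the draw without a fixed exploration rate. The tier names and the cost-blind reward here are simplifying assumptions; a real version would weight success by cost.

```python
# Beta-Bernoulli Thompson sampling over routing tiers (step 4 sketch).
import random

class ThompsonTierSampler:
    def __init__(self, tiers):
        # Beta(1, 1) prior: one pseudo-success and one pseudo-failure per tier
        self.stats = {t: [1, 1] for t in tiers}  # tier -> [alpha, beta]

    def pick(self):
        # sample a plausible success rate per tier, route to the best draw
        draws = {t: random.betavariate(a, b) for t, (a, b) in self.stats.items()}
        return max(draws, key=draws.get)

    def update(self, tier, success):
        self.stats[tier][0 if success else 1] += 1
```

Because the posterior concentrates as traces accumulate, exploration decays on its own, which suits the weekly-update cadence of step 3.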
**Estimated effort:** 1 week
**Expected impact:** Maintains 28%+ savings as models/task mix evolve
---
## Phase 4: Verifier Cascading
**Goal:** Use cheap verifier first, escalate to expensive verifier only on disagreement.
### Current State
- Verifier budgeter decides WHETHER to verify
- When it decides YES, it always uses the same verifier model
### Improvement
- **Tier 1:** Simple regex/rule-based checks (free)
- **Tier 2:** Cheap model verifier (GPT-4o-mini, $0.15/M tok)
- **Tier 3:** Expensive verifier (GPT-4o, $2.50/M tok), used only when tier 2 flags an issue
- **Consensus mode:** Run the cheap and medium verifiers together, escalate if they disagree
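The three-tier cascade reduces to a short decision function: free rule checks first, then the cheap model verifier, and the expensive verifier only when the cheaper tiers cannot give a confident pass. The `True`/`False`/`None` verdict convention and the verifier callables are illustrative assumptions.

```python
# Sketch of the verifier cascade. Each stage returns True (pass),
# False (fail), or None (unsure / not applicable).
def cascade_verify(output, rule_check, cheap_verify, expensive_verify):
    verdict = rule_check(output)           # tier 1: regex/rule checks, free
    if verdict is not None:
        return verdict, "tier1"
    verdict = cheap_verify(output)         # tier 2: cheap model verifier
    if verdict is True:
        return True, "tier2"
    # tier 3: only when tier 2 fails or is unsure
    return expensive_verify(output), "tier3"
```

The returned stage label makes it easy to measure what fraction of verifications never reach tier 3, which is the source of the projected 60-80% saving.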
**Estimated impact:** 60–80% verifier cost reduction on low-risk tasks
---
## Phase 5: Cross-Provider Cost Optimization
**Goal:** Route to cheapest provider offering adequate model tier.
### Providers with Similar-Tier Models
| Tier | OpenAI | Anthropic | DeepSeek | Google | Together | Fireworks |
|------|--------|-----------|----------|--------|----------|-----------|
| 2 (Cheap) | GPT-4o-mini | Haiku | DeepSeek-V3 | Gemini-Flash | Llama-3.1-8B | Mixtral-8x7B |
| 3 (Medium) | GPT-4o | Sonnet | DeepSeek-V3 | Gemini-Pro | Llama-3.1-70B | Qwen-2.5-72B |
| 4 (Frontier) | o1 | Opus | – | Gemini-Ultra | Llama-3.1-405B | – |
### Implementation
1. Maintain provider pricing API (auto-fetch current prices)
2. Add provider latency/availability monitoring
3. Route to cheapest available tier-adequate provider
4. Fallback chain: primary β†’ secondary β†’ tertiary
5. Cache routing decisions per provider for stability
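Steps 3 and 4 can be condensed into one selection rule: among providers that offer the required tier and are currently up, pick the cheapest, and keep the rest sorted by price as the fallback chain. The catalog shape and prices below are made-up placeholders, not the pricing API of step 1.

```python
# Illustrative cheapest-available-provider rule with a fallback chain.
def pick_provider(tier, catalog, is_up):
    """catalog: {provider: {tier: usd_per_mtok}}; is_up(provider) -> bool.
    Returns (primary, fallback_chain), both sorted by ascending price."""
    candidates = sorted(
        (prices[tier], p) for p, prices in catalog.items()
        if tier in prices and is_up(p)
    )
    if not candidates:
        raise RuntimeError(f"no available provider offers tier {tier}")
    providers = [p for _, p in candidates]
    return providers[0], providers[1:]
```

Caching the returned tuple per tier (step 5) keeps routing stable between pricing refreshes.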
**Estimated impact:** Additional 5–10% cost reduction on multi-provider setups
---
## Phase 6: KV Cache Sharing
**Goal:** Share prefix KV caches across concurrent agent runs using identical system prompts.
### How It Works
- Many agent runs share the same system prompt + tool descriptions
- vLLM and SGLang support prefix caching / KV cache sharing
- Running N agents concurrently → cache the system prompt once, reuse it N-1 times
### Implementation
1. Integrate with vLLM/SGLang backend for local models
2. Group agent runs by identical prefix hash
3. Pre-fill shared prefix once, append per-run suffix
4. Track cache hit rate per prefix group
5. Apply to multi-tenant agent deployments
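Steps 2 and 4 amount to bucketing runs by a hash of their shared prefix so the backend prefills each distinct prefix once, then counting reuses. The run tuple layout is an illustrative assumption.

```python
# Group concurrent runs by shared-prefix hash and compute the cache hit rate.
import hashlib
from collections import defaultdict

def group_by_prefix(runs):
    """runs: iterable of (run_id, system_prompt, tool_desc, user_suffix).
    Returns {prefix_hash: [run_id, ...]}."""
    groups = defaultdict(list)
    for run_id, system_prompt, tool_desc, _suffix in runs:
        key = hashlib.sha256((system_prompt + "\x00" + tool_desc).encode()).hexdigest()
        groups[key].append(run_id)
    return dict(groups)

def cache_hit_rate(groups):
    # N runs sharing a prefix pay for one prefill; the other N-1 are hits
    total = sum(len(ids) for ids in groups.values())
    hits = total - len(groups)
    return hits / total if total else 0.0
```

With vLLM or SGLang prefix caching enabled, the grouping itself can stay in the scheduler; the hash only decides batching order so identical prefixes land together.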
**Estimated impact:** 20–40% cost reduction on concurrent agent farms
---
## Phase 7: Speculative Agent Actions
**Goal:** Generate next N actions with cheap model, validate with frontier only on divergence.
### How It Works
1. Cheap model generates next action sequence (plan + tool calls)
2. Frontier model validates only the *divergent* or *high-risk* actions
3. If the cheap plan matches the frontier plan with >0.9 similarity → use the cheap plan
4. If divergence exceeds the threshold → regenerate with the frontier model
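The divergence rule in steps 3-4 can be sketched as a step-wise plan comparison against the 0.9 bar. Exact-match similarity over action strings is a deliberately crude stand-in for whatever semantic comparison the real system would use.

```python
# Rough sketch of the speculative-plan acceptance rule.
SIMILARITY_BAR = 0.9

def plan_similarity(cheap_plan, frontier_plan):
    """Fraction of aligned steps that match exactly (illustrative metric)."""
    if not cheap_plan or not frontier_plan:
        return 0.0
    n = max(len(cheap_plan), len(frontier_plan))
    matches = sum(a == b for a, b in zip(cheap_plan, frontier_plan))
    return matches / n

def choose_plan(cheap_plan, frontier_plan):
    if plan_similarity(cheap_plan, frontier_plan) >= SIMILARITY_BAR:
        return cheap_plan, "cheap"      # plans agree: keep the cheap one
    return frontier_plan, "frontier"    # divergence: regenerate with frontier
```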
### Use Cases
- Multi-step coding workflows (cheap generates plan, frontier validates critical steps)
- Research workflows (cheap suggests search queries, frontier validates synthesis)
- Tool-heavy workflows (cheap predicts tool sequence, frontier validates data transformations)
**Estimated impact:** 15–25% cost reduction on multi-step tasks
---
## Phase 8: Confidence Calibration with Process Reward Models
**Goal:** Train a per-step success predictor for dynamic compute allocation.
### Current State
- Router uses task-level difficulty classification
- Does not adapt compute within a task based on step-level confidence
### Improvement
1. Train a small PRM (Process Reward Model) on agent traces
2. At each step, PRM predicts P(success | current state)
3. If P(success) < 0.5 → escalate model, retrieve more context, or call verifier
4. If P(success) > 0.95 → use cheaper model for next step
5. Dynamically allocate compute based on real-time trajectory quality
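Items 2-4 reduce to a simple per-step decision rule once the PRM is abstracted as a callable returning P(success | state); the thresholds come straight from the list above, while the action names are illustrative.

```python
# Step-level compute policy driven by the PRM's confidence (items 2-4).
ESCALATE_BELOW = 0.5
DOWNSHIFT_ABOVE = 0.95

def step_action(p_success):
    """Map the PRM's step-level confidence to a compute decision."""
    if p_success < ESCALATE_BELOW:
        return "escalate"      # stronger model, more context, or a verifier
    if p_success > DOWNSHIFT_ABOVE:
        return "downshift"     # cheaper model for the next step
    return "keep"              # stay on the current tier
```

The interesting engineering is in the PRM itself; this wrapper just makes the two thresholds explicit and easy to tune from trace data.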
**Estimated impact:** 10–15% cost reduction with quality preservation
---
## Phase 9: Human-in-the-Loop Integration
**Goal:** Learn from human corrections to improve routing and reduce future mistakes.
### Implementation
1. When human corrects agent output → label the trace
2. If human says "should have used stronger model" → update routing probabilities
3. If human says "didn't need to call that tool" → update tool gate thresholds
4. If human says "stopped too early" → update doom detector thresholds
5. Feed corrections into online learning loop (Phase 3)
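One minimal wiring for steps 2-4: map each correction label to the knob it adjusts, nudged by a small learning rate. The label strings, knob names, and rate are all illustrative placeholders for whatever the online loop (Phase 3) actually consumes.

```python
# Map human-correction labels to threshold nudges (steps 2-4 sketch).
LEARNING_RATE = 0.05

def apply_correction(thresholds, label):
    """thresholds: dict with 'route_up_bias', 'tool_gate', 'doom_patience'.
    Returns an updated copy; unknown labels leave thresholds unchanged."""
    t = dict(thresholds)
    if label == "needed_stronger_model":
        t["route_up_bias"] += LEARNING_RATE      # escalate more readily
    elif label == "unnecessary_tool_call":
        t["tool_gate"] += LEARNING_RATE          # demand more evidence before calling
    elif label == "stopped_too_early":
        t["doom_patience"] += LEARNING_RATE      # let runs continue longer
    return t
```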
**Estimated impact:** Reduces false-DONE rate and missed escalation rate by 30–50%
---
## Phase 10: Meta-Learning Across Tasks
**Goal:** Learn task-specific optimal policies from a small number of examples.
### How It Works
- New task type appears (e.g., "medical diagnosis assistant")
- ACO has no prior traces for this task type
- Meta-learner transfers policies from similar task types (e.g., legal → medical, both high-risk)
- Few-shot calibrates thresholds from first 10–20 traces
### Implementation
1. Embed task types in semantic space
2. Find k-nearest task types with sufficient trace history
3. Transfer router weights, tool gate thresholds, verifier rules
4. Bayesian update with new task traces
5. Converge to task-specific policy within 50 traces
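Steps 1-3 can be sketched as nearest-neighbour transfer in the task-embedding space: find the closest task types with enough trace history and average their thresholds as the warm start. The toy 2-D embeddings, the 50-trace cutoff, and plain averaging are all simplifying assumptions (a real version would do the Bayesian update of step 4 on top).

```python
# Nearest-task warm start for a cold-start task type (steps 1-3 sketch).
import math

def nearest_tasks(query_vec, task_index, k=2, min_traces=50):
    """task_index: {task: (embedding, n_traces, thresholds)}."""
    scored = sorted(
        (math.dist(query_vec, emb), task)
        for task, (emb, n, _) in task_index.items() if n >= min_traces
    )
    return [task for _, task in scored[:k]]

def warm_start(query_vec, task_index, k=2):
    """Average the thresholds of the k nearest well-covered task types."""
    neighbours = nearest_tasks(query_vec, task_index, k)
    keys = task_index[neighbours[0]][2].keys()
    return {key: sum(task_index[t][2][key] for t in neighbours) / len(neighbours)
            for key in keys}
```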
**Estimated impact:** Reduces cold-start period from 100 traces to 20 traces
---
## Summary: Priority Ranking
| Phase | Impact | Effort | Priority |
|-------|--------|--------|----------|
| 1. Learned Router | ⭐⭐⭐⭐⭐ | Medium | **#1** |
| 2. Real Benchmark | ⭐⭐⭐⭐⭐ | High | #2 |
| 3. Online Learning | ⭐⭐⭐⭐⭐ | High | #3 |
| 4. Verifier Cascading | ⭐⭐⭐⭐ | Low | #4 |
| 5. Cross-Provider | ⭐⭐⭐⭐ | Medium | #5 |
| 6. KV Cache Sharing | ⭐⭐⭐ | High | #6 |
| 7. Speculative Actions | ⭐⭐⭐⭐ | High | #7 |
| 8. PRM Calibration | ⭐⭐⭐⭐ | High | #8 |
| 9. Human-in-the-Loop | ⭐⭐⭐ | Medium | #9 |
| 10. Meta-Learning | ⭐⭐⭐ | High | #10 |
---
## Success Metrics for Each Phase
Track these metrics for every phase:
1. **Cost per successful task** (primary)
2. **Cost per artifact** (secondary)
3. **Task success rate** (must not regress)
4. **False-DONE rate** (must not increase)
5. **Unsafe cheap-model miss rate** (must be <2%)
6. **Missed escalation rate** (must be <5%)
7. **Cache hit rate** (target >60%)
8. **Tool call efficiency** (used/called ratio >80%)
9. **Verifier pass rate** (target >85%)
10. **Latency per task** (must not increase >20%)
---
*Last updated: 2025-07-05*