# Agent Cost Optimizer – Roadmap
## Current Status (v1.0)
- ✅ 10 core modules implemented and benchmarked
- ✅ 28% cost reduction at iso-quality (94.3% success rate)
- ✅ Synthetic benchmark (2K traces, 19 scenarios)
- ✅ Learned router skeleton (trainable, not yet trained on real data)
- ✅ Deployment guide, model card, technical report
- ✅ Gradio dashboard code (not yet deployed)
---
## Phase 1: Learned Router (Immediate Priority)
**Goal:** Replace heuristic router with classifier trained on real traces.
### Why This Is #1
The ablation study shows the model router is the most critical module. A trained classifier could:
- Increase savings from 28% to 35–40%
- Reduce false escalations by 50%
- Enable task-specific routing (code → Claude, reasoning → o3-mini)
### Implementation
1. Collect 10K+ real traces with full telemetry
2. Extract (request_features, optimal_tier) pairs
3. Train a simple logistic regression or small neural classifier
4. Alternatively, train with GRPO using a cost-adjusted reward (BAAR-style boundary-guided routing)
5. A/B test against heuristic router
6. Fall back to heuristic when confidence < 0.7 (see the sketch after this list)
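If steps 3 and 6 land, the router can stay very small. A minimal sketch assuming scikit-learn; `heuristic_route()` and the feature layout are hypothetical placeholders, not the shipped API:

```python
# Sketch only: confidence-gated routing over trace-derived features.
# heuristic_route() and the feature layout are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

CONFIDENCE_FLOOR = 0.7  # below this, defer to the heuristic router (step 6)

def train_router(features: np.ndarray, optimal_tiers: np.ndarray) -> LogisticRegression:
    """Fit the (request_features, optimal_tier) pairs extracted in step 2."""
    return LogisticRegression(max_iter=1000).fit(features, optimal_tiers)

def route(clf: LogisticRegression, x: np.ndarray, heuristic_route) -> int:
    """Pick a tier; fall back to the heuristic when the classifier is unsure."""
    probs = clf.predict_proba(x.reshape(1, -1))[0]
    if probs.max() < CONFIDENCE_FLOOR:
        return heuristic_route(x)
    return int(clf.classes_[np.argmax(probs)])
```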
**Estimated effort:** 2–3 days
**Expected impact:** +7–12pp cost savings, <1pp quality regression
---
## Phase 2: Real Interactive Benchmark
**Goal:** Evaluate ACO against real agent tasks with actual model calls.
### Why Synthetic Is Not Enough
Our synthetic benchmark assumes fixed success probabilities per tier. Real-world model behavior:
- Is non-stationary (models improve, new models release)
- Depends on prompt engineering, not just model strength
- Has provider-specific quirks (Claude vs GPT vs DeepSeek)
- Is affected by rate limits, timeouts, transient failures
### Implementation
1. **Coding benchmark:** Integrate with SWE-bench lite (500 tasks)
- Run with cheap model first, escalate on failure (loop sketched after this list)
- Measure: pass@1, LLM calls, cost, time
2. **Tool-use benchmark:** Integrate with BFCL (2,000 function-calling tasks)
- Measure: tool accuracy, missed tools, cost
3. **Research benchmark:** 100 real research questions
- Run with retrieval + cheap model vs retrieval + frontier
- Human evaluation: source quality, hallucination, coverage
4. **Long-horizon benchmark:** 50 multi-step tasks (WebArena-style)
- Measure: task completion, cost growth over steps, cache hit rate
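For the coding benchmark's cheap-first escalation, a minimal harness sketch; `run_task()` and `passes_tests()` are hypothetical stand-ins for the SWE-bench hooks, and the tier ladder is illustrative:

```python
# Sketch: cheap-first escalation for one benchmark task.
# run_task() and passes_tests() are hypothetical harness hooks.
TIERS = ["gpt-4o-mini", "gpt-4o", "o1"]  # cheap -> frontier

def solve_with_escalation(task, run_task, passes_tests):
    """Try each tier in cost order; record cost, calls, and outcome per task."""
    total_cost, calls = 0.0, 0
    for model in TIERS:
        patch, cost = run_task(task, model=model)
        total_cost += cost
        calls += 1
        if passes_tests(task, patch):
            return {"model": model, "cost": total_cost, "calls": calls, "pass": True}
    return {"model": TIERS[-1], "cost": total_cost, "calls": calls, "pass": False}
```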
**Estimated effort:** 1 week
**Expected impact:** Calibrate all module thresholds, discover edge cases
---
## Phase 3: Online Learning Loop
**Goal:** Update routing probabilities, tool gate thresholds, and doom detector thresholds from live trace feedback.
### Why Static Policies Fail
- Model capabilities improve (GPT-4o → GPT-5)
- New cheap models release (GPT-4o-mini → even cheaper)
- Task mix changes over time
- User behavior shifts
### Implementation
1. **Trace ingestion pipeline:** Collect traces from production runs
2. **Outcome labeling:** Success/failure/escalation labels from user feedback
3. **Online update:** Update router classifier weights weekly
4. **Thompson sampling:** Explore new routing decisions with small probability (sketched below)
5. **Drift detection:** Alert when success rate drops >5pp for a task type
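Step 4's exploration could keep per-(task type, tier) Beta posteriors, as in this sketch; binary task success as the reward is an assumption:

```python
# Sketch: Thompson sampling over routing arms with Beta(success, failure)
# posteriors; task types and tiers are whatever the router already emits.
import random
from collections import defaultdict

class ThompsonRouter:
    def __init__(self, tiers):
        self.tiers = tiers
        # (task_type, tier) -> [successes+1, failures+1], uniform Beta(1,1) prior
        self.posterior = defaultdict(lambda: [1.0, 1.0])

    def pick(self, task_type: str) -> str:
        """Sample each arm's success rate and route to the best sample."""
        samples = {t: random.betavariate(*self.posterior[(task_type, t)])
                   for t in self.tiers}
        return max(samples, key=samples.get)

    def update(self, task_type: str, tier: str, success: bool):
        """Fold one labeled outcome (step 2) back into the posterior."""
        self.posterior[(task_type, tier)][0 if success else 1] += 1.0
```

A production version would sample a cost-adjusted reward rather than raw success probability, in line with the GRPO option in Phase 1.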
**Estimated effort:** 1 week
**Expected impact:** Maintains 28%+ savings as models/task mix evolve
---
## Phase 4: Verifier Cascading
**Goal:** Use cheap verifier first, escalate to expensive verifier only on disagreement.
### Current State
- Verifier budgeter decides WHETHER to verify
- When it decides YES, it always uses the same verifier model
### Improvement
- **Tier 1:** Simple regex/rule-based checks (free)
- **Tier 2:** Cheap model verifier (GPT-4o-mini, $0.15/M tok)
- **Tier 3:** Expensive verifier (GPT-4o, $2.5/M tok), invoked only when tier 2 flags an issue
- **Consensus mode:** Run cheap + medium verifiers, escalate if they disagree (cascade sketched below)
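A minimal sketch of the cascade; the TODO rule, `cheap_verify()`, and `strong_verify()` are hypothetical placeholders:

```python
# Sketch: tiered verification; each tier runs only if the cheaper one
# didn't settle the question.
import re

def cascade_verify(output: str, cheap_verify, strong_verify) -> bool:
    # Tier 1: free rule-based checks (example rule: no unresolved TODOs).
    if re.search(r"\bTODO\b", output):
        return False
    # Tier 2: cheap model verifier returns "pass" or "flag".
    if cheap_verify(output) == "pass":
        return True
    # Tier 3: expensive verifier, reached only on a tier-2 flag.
    return strong_verify(output) == "pass"
```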
**Estimated impact:** 60–80% verifier cost reduction on low-risk tasks
---
## Phase 5: Cross-Provider Cost Optimization
**Goal:** Route to cheapest provider offering adequate model tier.
### Providers with Similar-Tier Models
| Tier | OpenAI | Anthropic | DeepSeek | Google | Together | Fireworks |
|------|--------|-----------|----------|--------|----------|-----------|
| 2 (Cheap) | GPT-4o-mini | Haiku | DeepSeek-V3 | Gemini-Flash | Llama-3.1-8B | Mixtral-8x7B |
| 3 (Medium) | GPT-4o | Sonnet | DeepSeek-V3 | Gemini-Pro | Llama-3.1-70B | Qwen-2.5-72B |
| 4 (Frontier) | o1 | Opus | – | Gemini-Ultra | Llama-3.1-405B | – |
### Implementation
1. Maintain provider pricing API (auto-fetch current prices)
2. Add provider latency/availability monitoring
3. Route to cheapest available tier-adequate provider (lookup sketched below)
4. Fallback chain: primary → secondary → tertiary
5. Cache routing decisions per provider for stability
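Step 3 then reduces to a lookup over the live pricing table; in this sketch the prices and the `is_healthy()` hook (fed by the monitoring in step 2) are illustrative placeholders:

```python
# Sketch: route to the cheapest healthy provider for a given tier.
# Prices are placeholders; a real table would be auto-fetched (step 1).
PRICES = {  # (tier, provider) -> $ per 1M input tokens
    (2, "openai"): 0.15, (2, "anthropic"): 0.25, (2, "deepseek"): 0.14,
    (3, "openai"): 2.50, (3, "anthropic"): 3.00, (3, "deepseek"): 0.27,
}

def cheapest_provider(tier: int, is_healthy) -> str:
    """is_healthy(provider) comes from the latency/availability monitor."""
    candidates = [(price, prov) for (t, prov), price in PRICES.items()
                  if t == tier and is_healthy(prov)]
    if not candidates:
        raise RuntimeError(f"no healthy provider for tier {tier}")
    return min(candidates)[1]
```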
**Estimated impact:** Additional 5–10% cost reduction on multi-provider setups
---
## Phase 6: KV Cache Sharing
**Goal:** Share prefix KV caches across concurrent agent runs using identical system prompts.
### How It Works
- Many agent runs share the same system prompt + tool descriptions
- vLLM and SGLang support prefix caching / KV cache sharing
- Running N agents concurrently → cache system prompt once, reuse N-1 times
### Implementation
1. Integrate with vLLM/SGLang backend for local models
2. Group agent runs by identical prefix hash (sketched below)
3. Pre-fill shared prefix once, append per-run suffix
4. Track cache hit rate per prefix group
5. Apply to multi-tenant agent deployments
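Step 2's grouping is a hash over the shared prefix, as in this sketch; the field names are hypothetical:

```python
# Sketch: batch agent runs by identical (system prompt + tool schema) prefix
# so a prefix-caching backend (vLLM/SGLang) pre-fills each prefix once.
import hashlib
from collections import defaultdict

def group_by_prefix(runs):
    """runs: iterable of dicts with 'system_prompt', 'tool_schema', 'suffix'."""
    groups = defaultdict(list)
    for run in runs:
        prefix = run["system_prompt"] + run["tool_schema"]
        key = hashlib.sha256(prefix.encode()).hexdigest()
        groups[key].append(run)
    return groups  # one pre-fill per key, N-1 cache hits per group
```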
**Estimated impact:** 20–40% cost reduction on concurrent agent farms
---
## Phase 7: Speculative Agent Actions
**Goal:** Generate next N actions with cheap model, validate with frontier only on divergence.
### How It Works
1. Cheap model generates next action sequence (plan + tool calls)
2. Frontier model validates only the *divergent* or *high-risk* actions
3. If cheap plan matches frontier with >0.9 similarity → use cheap plan
4. If divergence > threshold → regenerate with frontier (decision sketched below)
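A sketch of the accept/regenerate rule; the plan and validation callables, the `high_risk` flag, and `similarity()` (e.g., embedding cosine) are all hypothetical placeholders:

```python
# Sketch: accept the cheap plan unless it diverges from the frontier spot-check.
SIMILARITY_FLOOR = 0.9

def speculate(task, cheap_plan_fn, frontier_validate_fn, frontier_plan_fn, similarity):
    plan = cheap_plan_fn(task)                     # candidate plan from cheap model
    risky = [a for a in plan if a.get("high_risk")]
    validated = frontier_validate_fn(task, risky)  # frontier checks risky steps only
    if all(similarity(a, b) >= SIMILARITY_FLOOR for a, b in zip(risky, validated)):
        return plan                                # agreement: keep the cheap plan
    return frontier_plan_fn(task)                  # divergence: regenerate with frontier
```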
### Use Cases
- Multi-step coding workflows (cheap generates plan, frontier validates critical steps)
- Research workflows (cheap suggests search queries, frontier validates synthesis)
- Tool-heavy workflows (cheap predicts tool sequence, frontier validates data transformations)
**Estimated impact:** 15–25% cost reduction on multi-step tasks
---
## Phase 8: Confidence Calibration with Process Reward Models
**Goal:** Train a per-step success predictor for dynamic compute allocation.
### Current State
- Router uses task-level difficulty classification
- Does not adapt compute within a task based on step-level confidence
### Improvement
1. Train a small PRM (Process Reward Model) on agent traces
2. At each step, PRM predicts P(success | current state)
3. If P(success) < 0.5 → escalate model, retrieve more context, or call verifier
4. If P(success) > 0.95 → use cheaper model for next step
5. Dynamically allocate compute based on real-time trajectory quality (rule sketched below)
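The per-step rule in steps 3–4 is just a pair of thresholds on the PRM score; a minimal sketch:

```python
# Sketch: map a PRM success estimate to a compute decision (steps 3-4 above).
def allocate_compute(p_success: float) -> str:
    if p_success < 0.5:
        return "escalate"   # stronger model, more retrieval, or a verifier call
    if p_success > 0.95:
        return "downgrade"  # trajectory looks safe; use a cheaper model next step
    return "keep"           # stay on the current tier
```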
**Estimated impact:** 10–15% cost reduction with quality preservation
---
## Phase 9: Human-in-the-Loop Integration
**Goal:** Learn from human corrections to improve routing and reduce future mistakes.
### Implementation
1. When human corrects agent output → label the trace
2. If human says "should have used stronger model" → update routing probabilities
3. If human says "didn't need to call that tool" → update tool gate thresholds
4. If human says "stopped too early" → update doom detector thresholds
5. Feed corrections into the online learning loop (Phase 3); a dispatch sketch follows
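The correction-to-parameter mapping can start as a small dispatch table; the labels, parameter names, and step sizes here are illustrative:

```python
# Sketch: apply one labeled human correction to the matching knob.
# Persistent updates would flow through the Phase 3 learning loop.
def apply_correction(label: str, params: dict) -> dict:
    if label == "needed_stronger_model":
        params["router_escalation_bias"] += 0.05  # route upward more eagerly
    elif label == "unnecessary_tool_call":
        params["tool_gate_threshold"] += 0.05     # gate tool calls more strictly
    elif label == "stopped_too_early":
        params["doom_detector_patience"] += 1     # allow more steps before aborting
    return params
```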
**Estimated impact:** Reduces false-DONE rate and missed escalation rate by 30–50%
---
## Phase 10: Meta-Learning Across Tasks
**Goal:** Learn task-specific optimal policies from a small number of examples.
### How It Works
- New task type appears (e.g., "medical diagnosis assistant")
- ACO has no prior traces for this task type
- Meta-learner transfers policies from similar task types (e.g., legal → medical, both high-risk)
- Few-shot calibration of thresholds from the first 10–20 traces
### Implementation
1. Embed task types in semantic space
2. Find k-nearest task types with sufficient trace history
3. Transfer router weights, tool gate thresholds, verifier rules (sketched below)
4. Bayesian update with new task traces
5. Converge to task-specific policy within 50 traces
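Steps 1–3 amount to a nearest-neighbor lookup in task-embedding space; a sketch assuming a hypothetical `embed()` that returns unit-norm vectors:

```python
# Sketch: warm-start a new task type's policy from its k nearest task types.
# embed() and the policy/threshold fields are hypothetical placeholders.
import numpy as np

def warm_start(new_task: str, known: dict, embed, k: int = 3) -> dict:
    """known: task_type -> {'emb': unit-norm vector, 'policy': {name: value}}."""
    q = embed(new_task)
    neighbors = sorted(known.values(), key=lambda v: -float(np.dot(q, v["emb"])))[:k]
    weights = np.array([max(float(np.dot(q, v["emb"])), 0.0) for v in neighbors])
    weights = weights / weights.sum()
    # Similarity-weighted average of neighbor thresholds as the starting policy,
    # to be refined by the Bayesian update in step 4.
    return {name: float(sum(w * v["policy"][name] for w, v in zip(weights, neighbors)))
            for name in neighbors[0]["policy"]}
```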
**Estimated impact:** Reduces cold-start period from 100 traces to 20 traces
---
## Summary: Priority Ranking
| Phase | Impact | Effort | Priority |
|-------|--------|--------|----------|
| 1. Learned Router | ★★★★★ | Medium | **#1** |
| 2. Real Benchmark | ★★★★★ | High | #2 |
| 3. Online Learning | ★★★★★ | High | #3 |
| 4. Verifier Cascading | ★★★★ | Low | #4 |
| 5. Cross-Provider | ★★★★ | Medium | #5 |
| 6. KV Cache Sharing | ★★★ | High | #6 |
| 7. Speculative Actions | ★★★★ | High | #7 |
| 8. PRM Calibration | ★★★★ | High | #8 |
| 9. Human-in-the-Loop | ★★★ | Medium | #9 |
| 10. Meta-Learning | ★★★ | High | #10 |
---
## Success Metrics for Each Phase
Track these metrics for every phase:
1. **Cost per successful task** (primary)
2. **Cost per artifact** (secondary)
3. **Task success rate** (must not regress)
4. **False-DONE rate** (must not increase)
5. **Unsafe cheap-model miss rate** (must be <2%)
6. **Missed escalation rate** (must be <5%)
7. **Cache hit rate** (target >60%)
8. **Tool call efficiency** (used/called ratio >80%)
9. **Verifier pass rate** (target >85%)
10. **Latency per task** (must not increase >20%)
---
*Last updated: 2025-07-05*