# Agent Cost Optimizer: Roadmap
## Current Status (v1.0)
- ✅ 10 core modules implemented and benchmarked
- ✅ 28% cost reduction at iso-quality (94.3% success rate)
- ✅ Synthetic benchmark (2K traces, 19 scenarios)
- ✅ Learned router skeleton (trainable, not yet trained on real data)
- ✅ Deployment guide, model card, technical report
- ✅ Gradio dashboard code (not yet deployed)
---
## Phase 1: Learned Router (Immediate Priority)
**Goal:** Replace heuristic router with classifier trained on real traces.
### Why This Is #1
The ablation study shows the model router is the most critical module. A trained classifier could:
- Increase savings from 28% to 35–40%
- Reduce false escalations by 50%
- Enable task-specific routing (code → Claude, reasoning → o3-mini)
### Implementation
1. Collect 10K+ real traces with full telemetry
2. Extract (request_features, optimal_tier) pairs
3. Train a simple logistic regression or small neural classifier
4. Or: Train with GRPO using cost-adjusted reward (BAAR-style boundary-guided routing)
5. A/B test against heuristic router
6. Fall back to heuristic when confidence < 0.7
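Steps 3 and 6 can be sketched together as a confidence-gated router: a scikit-learn classifier picks the tier, and any prediction below the 0.7 floor defers to the existing heuristic. The feature encoding, tier labels, and `heuristic_route` callable here are illustrative assumptions, not the actual ACO interfaces.

```python
# Sketch of the confidence-gated learned router (steps 3 and 6).
import numpy as np
from sklearn.linear_model import LogisticRegression

CONFIDENCE_FLOOR = 0.7  # below this, defer to the heuristic router

class LearnedRouter:
    def __init__(self, heuristic_route):
        self.clf = LogisticRegression(max_iter=1000)
        self.heuristic_route = heuristic_route  # fallback: features -> tier

    def fit(self, features, optimal_tiers):
        # features: (n, d) array of request features; optimal_tiers: (n,) tier labels
        self.clf.fit(features, optimal_tiers)
        return self

    def route(self, feature_vec):
        probs = self.clf.predict_proba([feature_vec])[0]
        best = int(np.argmax(probs))
        if probs[best] < CONFIDENCE_FLOOR:
            # step 6: low-confidence prediction, fall back to the heuristic
            return self.heuristic_route(feature_vec)
        return self.clf.classes_[best]
```

The same gate would wrap a GRPO-trained policy (step 4) unchanged, since it only needs per-tier probabilities.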
**Estimated effort:** 2–3 days
**Expected impact:** +7–12pp cost savings, <1pp quality regression
---
## Phase 2: Real Interactive Benchmark
**Goal:** Evaluate ACO against real agent tasks with actual model calls.
### Why Synthetic Is Not Enough
Our synthetic benchmark assumes fixed success probabilities per tier. Real-world model behavior:
- Is non-stationary (models improve, new models release)
- Depends on prompt engineering, not just model strength
- Has provider-specific quirks (Claude vs GPT vs DeepSeek)
- Is affected by rate limits, timeouts, transient failures
### Implementation
1. **Coding benchmark:** Integrate with SWE-bench lite (500 tasks)
- Run with cheap model first, escalate on failure
- Measure: pass@1, LLM calls, cost, time
2. **Tool-use benchmark:** Integrate with BFCL (2,000 function-calling tasks)
- Measure: tool accuracy, missed tools, cost
3. **Research benchmark:** 100 real research questions
- Run with retrieval + cheap model vs retrieval + frontier
- Human evaluation: source quality, hallucination, coverage
4. **Long-horizon benchmark:** 50 multi-step tasks (WebArena-style)
- Measure: task completion, cost growth over steps, cache hit rate
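The escalate-on-failure loop behind benchmark 1 can be written as a small harness that climbs the tier ladder and accounts for calls, cost, and time along the way. `solve` and `check` are hypothetical stand-ins for the benchmark's model call and grader, and the per-call pricing is illustrative.

```python
# Hypothetical escalation harness: try the cheapest tier first,
# climb only when the task's checker rejects the attempt.
import time

def run_with_escalation(task, tiers, solve, check):
    """tiers: list of (model_name, usd_per_call); solve(model, task) -> answer;
    check(task, answer) -> bool. Returns a metrics dict."""
    calls, cost, start = 0, 0.0, time.monotonic()
    for model, price in tiers:
        answer = solve(model, task)
        calls += 1
        cost += price
        if check(task, answer):
            return {"passed": True, "model": model, "llm_calls": calls,
                    "cost_usd": cost, "seconds": time.monotonic() - start}
    return {"passed": False, "model": None, "llm_calls": calls,
            "cost_usd": cost, "seconds": time.monotonic() - start}
```

Aggregating the returned dicts over SWE-bench lite tasks gives pass@1, mean calls, and cost per solved task directly.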
**Estimated effort:** 1 week
**Expected impact:** Calibrate all module thresholds, discover edge cases
---
## Phase 3: Online Learning Loop
**Goal:** Update routing probabilities, tool gate thresholds, and doom detector thresholds from live trace feedback.
### Why Static Policies Fail
- Model capabilities improve (GPT-4o → GPT-5)
- New cheap models are released (GPT-4o-mini → even cheaper)
- Task mix changes over time
- User behavior shifts
### Implementation
1. **Trace ingestion pipeline:** Collect traces from production runs
2. **Outcome labeling:** Success/failure/escalation labels from user feedback
3. **Online update:** Update router classifier weights weekly
4. **Thompson sampling:** Explore new routing decisions with small probability
5. **Drift detection:** Alert when success rate drops >5pp for a task type
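One way to read step 4: keep a Beta posterior over each tier's success rate and route by sampling from it, so under-explored tiers occasionally win the draw without a fixed exploration rate. The tier names and the cost-blind reward here are simplifying assumptions; a real version would weight success by cost.

```python
# Beta-Bernoulli Thompson sampling over routing tiers (step 4 sketch).
import random

class ThompsonTierSampler:
    def __init__(self, tiers):
        # Beta(1, 1) prior: one pseudo-success and one pseudo-failure per tier
        self.stats = {t: [1, 1] for t in tiers}  # tier -> [alpha, beta]

    def pick(self):
        # sample a plausible success rate per tier, route to the best draw
        draws = {t: random.betavariate(a, b) for t, (a, b) in self.stats.items()}
        return max(draws, key=draws.get)

    def update(self, tier, success):
        self.stats[tier][0 if success else 1] += 1
```

Because the posterior concentrates as traces accumulate, exploration decays on its own, which suits the weekly-update cadence of step 3.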
**Estimated effort:** 1 week
**Expected impact:** Maintains 28%+ savings as models/task mix evolve
---
## Phase 4: Verifier Cascading
**Goal:** Use cheap verifier first, escalate to expensive verifier only on disagreement.
### Current State
- Verifier budgeter decides WHETHER to verify
- When it decides YES, it always uses the same verifier model
### Improvement
- **Tier 1:** Simple regex/rule-based checks (free)
- **Tier 2:** Cheap model verifier (GPT-4o-mini, $0.15/M tok)
- **Tier 3:** Expensive verifier (GPT-4o, $2.50/M tok), used only when tier 2 flags an issue
- **Consensus mode:** Run the cheap and medium verifiers together, escalate if they disagree
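The three-tier cascade reduces to a short decision function: free rule checks first, then the cheap model verifier, and the expensive verifier only when the cheaper tiers cannot give a confident pass. The `True`/`False`/`None` verdict convention and the verifier callables are illustrative assumptions.

```python
# Sketch of the verifier cascade. Each stage returns True (pass),
# False (fail), or None (unsure / not applicable).
def cascade_verify(output, rule_check, cheap_verify, expensive_verify):
    verdict = rule_check(output)           # tier 1: regex/rule checks, free
    if verdict is not None:
        return verdict, "tier1"
    verdict = cheap_verify(output)         # tier 2: cheap model verifier
    if verdict is True:
        return True, "tier2"
    # tier 3: only when tier 2 fails or is unsure
    return expensive_verify(output), "tier3"
```

The returned stage label makes it easy to measure what fraction of verifications never reach tier 3, which is the source of the projected 60-80% saving.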
**Estimated impact:** 60–80% verifier cost reduction on low-risk tasks
---
## Phase 5: Cross-Provider Cost Optimization
**Goal:** Route to cheapest provider offering adequate model tier.
### Providers with Similar-Tier Models
| Tier | OpenAI | Anthropic | DeepSeek | Google | Together | Fireworks |
|------|--------|-----------|----------|--------|----------|-----------|
| 2 (Cheap) | GPT-4o-mini | Haiku | DeepSeek-V3 | Gemini-Flash | Llama-3.1-8B | Mixtral-8x7B |
| 3 (Medium) | GPT-4o | Sonnet | DeepSeek-V3 | Gemini-Pro | Llama-3.1-70B | Qwen-2.5-72B |
| 4 (Frontier) | o1 | Opus | – | Gemini-Ultra | Llama-3.1-405B | – |
### Implementation
1. Maintain provider pricing API (auto-fetch current prices)
2. Add provider latency/availability monitoring
3. Route to cheapest available tier-adequate provider
4. Fallback chain: primary β†’ secondary β†’ tertiary
5. Cache routing decisions per provider for stability
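Steps 3 and 4 can be condensed into one selection rule: among providers that offer the required tier and are currently up, pick the cheapest, and keep the rest sorted by price as the fallback chain. The catalog shape and prices below are made-up placeholders, not the pricing API of step 1.

```python
# Illustrative cheapest-available-provider rule with a fallback chain.
def pick_provider(tier, catalog, is_up):
    """catalog: {provider: {tier: usd_per_mtok}}; is_up(provider) -> bool.
    Returns (primary, fallback_chain), both sorted by ascending price."""
    candidates = sorted(
        (prices[tier], p) for p, prices in catalog.items()
        if tier in prices and is_up(p)
    )
    if not candidates:
        raise RuntimeError(f"no available provider offers tier {tier}")
    providers = [p for _, p in candidates]
    return providers[0], providers[1:]
```

Caching the returned tuple per tier (step 5) keeps routing stable between pricing refreshes.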
**Estimated impact:** Additional 5–10% cost reduction on multi-provider setups
---
## Phase 6: KV Cache Sharing
**Goal:** Share prefix KV caches across concurrent agent runs using identical system prompts.
### How It Works
- Many agent runs share the same system prompt + tool descriptions
- vLLM and SGLang support prefix caching / KV cache sharing
- Running N agents concurrently → cache the system prompt once, reuse it N-1 times
### Implementation
1. Integrate with vLLM/SGLang backend for local models
2. Group agent runs by identical prefix hash
3. Pre-fill shared prefix once, append per-run suffix
4. Track cache hit rate per prefix group
5. Apply to multi-tenant agent deployments
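Steps 2 and 4 amount to bucketing runs by a hash of their shared prefix so the backend prefills each distinct prefix once, then counting reuses. The run tuple layout is an illustrative assumption.

```python
# Group concurrent runs by shared-prefix hash and compute the cache hit rate.
import hashlib
from collections import defaultdict

def group_by_prefix(runs):
    """runs: iterable of (run_id, system_prompt, tool_desc, user_suffix).
    Returns {prefix_hash: [run_id, ...]}."""
    groups = defaultdict(list)
    for run_id, system_prompt, tool_desc, _suffix in runs:
        key = hashlib.sha256((system_prompt + "\x00" + tool_desc).encode()).hexdigest()
        groups[key].append(run_id)
    return dict(groups)

def cache_hit_rate(groups):
    # N runs sharing a prefix pay for one prefill; the other N-1 are hits
    total = sum(len(ids) for ids in groups.values())
    hits = total - len(groups)
    return hits / total if total else 0.0
```

With vLLM or SGLang prefix caching enabled, the grouping itself can stay in the scheduler; the hash only decides batching order so identical prefixes land together.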
**Estimated impact:** 20–40% cost reduction on concurrent agent farms
---
## Phase 7: Speculative Agent Actions
**Goal:** Generate next N actions with cheap model, validate with frontier only on divergence.
### How It Works
1. Cheap model generates next action sequence (plan + tool calls)
2. Frontier model validates only the *divergent* or *high-risk* actions
3. If the cheap plan matches the frontier plan with >0.9 similarity → use the cheap plan
4. If divergence exceeds the threshold → regenerate with the frontier model
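The divergence rule in steps 3-4 can be sketched as a step-wise plan comparison against the 0.9 bar. Exact-match similarity over action strings is a deliberately crude stand-in for whatever semantic comparison the real system would use.

```python
# Rough sketch of the speculative-plan acceptance rule.
SIMILARITY_BAR = 0.9

def plan_similarity(cheap_plan, frontier_plan):
    """Fraction of aligned steps that match exactly (illustrative metric)."""
    if not cheap_plan or not frontier_plan:
        return 0.0
    n = max(len(cheap_plan), len(frontier_plan))
    matches = sum(a == b for a, b in zip(cheap_plan, frontier_plan))
    return matches / n

def choose_plan(cheap_plan, frontier_plan):
    if plan_similarity(cheap_plan, frontier_plan) >= SIMILARITY_BAR:
        return cheap_plan, "cheap"      # plans agree: keep the cheap one
    return frontier_plan, "frontier"    # divergence: regenerate with frontier
```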
### Use Cases
- Multi-step coding workflows (cheap generates plan, frontier validates critical steps)
- Research workflows (cheap suggests search queries, frontier validates synthesis)
- Tool-heavy workflows (cheap predicts tool sequence, frontier validates data transformations)
**Estimated impact:** 15–25% cost reduction on multi-step tasks
---
## Phase 8: Confidence Calibration with Process Reward Models
**Goal:** Train a per-step success predictor for dynamic compute allocation.
### Current State
- Router uses task-level difficulty classification
- Does not adapt compute within a task based on step-level confidence
### Improvement
1. Train a small PRM (Process Reward Model) on agent traces
2. At each step, PRM predicts P(success | current state)
3. If P(success) < 0.5 → escalate model, retrieve more context, or call verifier
4. If P(success) > 0.95 → use cheaper model for next step
5. Dynamically allocate compute based on real-time trajectory quality
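Items 2-4 reduce to a simple per-step decision rule once the PRM is abstracted as a callable returning P(success | state); the thresholds come straight from the list above, while the action names are illustrative.

```python
# Step-level compute policy driven by the PRM's confidence (items 2-4).
ESCALATE_BELOW = 0.5
DOWNSHIFT_ABOVE = 0.95

def step_action(p_success):
    """Map the PRM's step-level confidence to a compute decision."""
    if p_success < ESCALATE_BELOW:
        return "escalate"      # stronger model, more context, or a verifier
    if p_success > DOWNSHIFT_ABOVE:
        return "downshift"     # cheaper model for the next step
    return "keep"              # stay on the current tier
```

The interesting engineering is in the PRM itself; this wrapper just makes the two thresholds explicit and easy to tune from trace data.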
**Estimated impact:** 10–15% cost reduction with quality preservation
---
## Phase 9: Human-in-the-Loop Integration
**Goal:** Learn from human corrections to improve routing and reduce future mistakes.
### Implementation
1. When human corrects agent output → label the trace
2. If human says "should have used stronger model" → update routing probabilities
3. If human says "didn't need to call that tool" → update tool gate thresholds
4. If human says "stopped too early" → update doom detector thresholds
5. Feed corrections into online learning loop (Phase 3)
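One minimal wiring for steps 2-4: map each correction label to the knob it adjusts, nudged by a small learning rate. The label strings, knob names, and rate are all illustrative placeholders for whatever the online loop (Phase 3) actually consumes.

```python
# Map human-correction labels to threshold nudges (steps 2-4 sketch).
LEARNING_RATE = 0.05

def apply_correction(thresholds, label):
    """thresholds: dict with 'route_up_bias', 'tool_gate', 'doom_patience'.
    Returns an updated copy; unknown labels leave thresholds unchanged."""
    t = dict(thresholds)
    if label == "needed_stronger_model":
        t["route_up_bias"] += LEARNING_RATE      # escalate more readily
    elif label == "unnecessary_tool_call":
        t["tool_gate"] += LEARNING_RATE          # demand more evidence before calling
    elif label == "stopped_too_early":
        t["doom_patience"] += LEARNING_RATE      # let runs continue longer
    return t
```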
**Estimated impact:** Reduces false-DONE rate and missed escalation rate by 30–50%
---
## Phase 10: Meta-Learning Across Tasks
**Goal:** Learn task-specific optimal policies from a small number of examples.
### How It Works
- New task type appears (e.g., "medical diagnosis assistant")
- ACO has no prior traces for this task type
- Meta-learner transfers policies from similar task types (e.g., legal → medical, both high-risk)
- Few-shot calibrates thresholds from first 10–20 traces
### Implementation
1. Embed task types in semantic space
2. Find k-nearest task types with sufficient trace history
3. Transfer router weights, tool gate thresholds, verifier rules
4. Bayesian update with new task traces
5. Converge to task-specific policy within 50 traces
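Steps 1-3 can be sketched as nearest-neighbour transfer in the task-embedding space: find the closest task types with enough trace history and average their thresholds as the warm start. The toy 2-D embeddings, the 50-trace cutoff, and plain averaging are all simplifying assumptions (a real version would do the Bayesian update of step 4 on top).

```python
# Nearest-task warm start for a cold-start task type (steps 1-3 sketch).
import math

def nearest_tasks(query_vec, task_index, k=2, min_traces=50):
    """task_index: {task: (embedding, n_traces, thresholds)}."""
    scored = sorted(
        (math.dist(query_vec, emb), task)
        for task, (emb, n, _) in task_index.items() if n >= min_traces
    )
    return [task for _, task in scored[:k]]

def warm_start(query_vec, task_index, k=2):
    """Average the thresholds of the k nearest well-covered task types."""
    neighbours = nearest_tasks(query_vec, task_index, k)
    keys = task_index[neighbours[0]][2].keys()
    return {key: sum(task_index[t][2][key] for t in neighbours) / len(neighbours)
            for key in keys}
```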
**Estimated impact:** Reduces cold-start period from 100 traces to 20 traces
---
## Summary: Priority Ranking
| Phase | Impact | Effort | Priority |
|-------|--------|--------|----------|
| 1. Learned Router | ⭐⭐⭐⭐⭐ | Medium | **#1** |
| 2. Real Benchmark | ⭐⭐⭐⭐⭐ | High | #2 |
| 3. Online Learning | ⭐⭐⭐⭐⭐ | High | #3 |
| 4. Verifier Cascading | ⭐⭐⭐⭐ | Low | #4 |
| 5. Cross-Provider | ⭐⭐⭐⭐ | Medium | #5 |
| 6. KV Cache Sharing | ⭐⭐⭐ | High | #6 |
| 7. Speculative Actions | ⭐⭐⭐⭐ | High | #7 |
| 8. PRM Calibration | ⭐⭐⭐⭐ | High | #8 |
| 9. Human-in-the-Loop | ⭐⭐⭐ | Medium | #9 |
| 10. Meta-Learning | ⭐⭐⭐ | High | #10 |
---
## Success Metrics for Each Phase
Track these metrics for every phase:
1. **Cost per successful task** (primary)
2. **Cost per artifact** (secondary)
3. **Task success rate** (must not regress)
4. **False-DONE rate** (must not increase)
5. **Unsafe cheap-model miss rate** (must be <2%)
6. **Missed escalation rate** (must be <5%)
7. **Cache hit rate** (target >60%)
8. **Tool call efficiency** (used/called ratio >80%)
9. **Verifier pass rate** (target >85%)
10. **Latency per task** (must not increase >20%)
---
*Last updated: 2025-07-05*