Agent Cost Optimizer Roadmap

Current Status (v1.0)

  • ✅ 10 core modules implemented and benchmarked
  • ✅ 28% cost reduction at iso-quality (94.3% success rate)
  • ✅ Synthetic benchmark (2K traces, 19 scenarios)
  • ✅ Learned router skeleton (trainable, not yet trained on real data)
  • ✅ Deployment guide, model card, technical report
  • ✅ Gradio dashboard code (not yet deployed)

Phase 1: Learned Router (Immediate Priority)

Goal: Replace heuristic router with classifier trained on real traces.

Why This Is #1

The ablation study shows the model router is the most critical module. A trained classifier could:

  • Increase savings from 28% to 35–40%
  • Reduce false escalations by 50%
  • Enable task-specific routing (code → Claude, reasoning → o3-mini)

Implementation

  1. Collect 10K+ real traces with full telemetry
  2. Extract (request_features, optimal_tier) pairs
  3. Train a simple logistic-regression or small neural classifier
  4. Alternatively, train with GRPO using a cost-adjusted reward (BAAR-style boundary-guided routing)
  5. A/B test against heuristic router
  6. Fall back to heuristic when confidence < 0.7
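
A minimal sketch of steps 3 and 6, assuming scikit-learn; the feature layout and the heuristic_route() stub are illustrative stand-ins, not ACO's actual interfaces:

```python
# Sketch: logistic-regression router with a confidence fallback.
import numpy as np
from sklearn.linear_model import LogisticRegression

def heuristic_route(features: np.ndarray) -> str:
    """Stand-in for the existing heuristic router (hypothetical)."""
    return "medium"

class LearnedRouter:
    def __init__(self, confidence_floor: float = 0.7):
        self.clf = LogisticRegression(max_iter=1000)
        self.confidence_floor = confidence_floor

    def fit(self, X: np.ndarray, y: np.ndarray) -> None:
        # X: (n_traces, n_features) request features extracted in step 2;
        # y: label of the cheapest tier that succeeded, e.g. "cheap" / "medium" / "frontier".
        self.clf.fit(X, y)

    def route(self, features: np.ndarray) -> str:
        proba = self.clf.predict_proba(features.reshape(1, -1))[0]
        if proba.max() < self.confidence_floor:
            return heuristic_route(features)  # step 6: low-confidence fallback
        return str(self.clf.classes_[int(proba.argmax())])
```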

Estimated effort: 2–3 days
Expected impact: +7–12pp cost savings, <1pp quality regression


Phase 2: Real Interactive Benchmark

Goal: Evaluate ACO against real agent tasks with actual model calls.

Why Synthetic Is Not Enough

Our synthetic benchmark assumes fixed success probabilities per tier. Real-world model behavior:

  • Is non-stationary (models improve, new models release)
  • Depends on prompt engineering, not just model strength
  • Has provider-specific quirks (Claude vs GPT vs DeepSeek)
  • Is affected by rate limits, timeouts, transient failures

Implementation

  1. Coding benchmark: Integrate with SWE-bench lite (500 tasks)
    • Run with cheap model first, escalate on failure
    • Measure: pass@1, LLM calls, cost, time
  2. Tool-use benchmark: Integrate with BFCL (2,000 function-calling tasks)
    • Measure: tool accuracy, missed tools, cost
  3. Research benchmark: 100 real research questions
    • Run with retrieval + cheap model vs retrieval + frontier
    • Human evaluation: source quality, hallucination, coverage
  4. Long-horizon benchmark: 50 multi-step tasks (WebArena-style)
    • Measure: task completion, cost growth over steps, cache hit rate
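
A sketch of the escalate-on-failure loop from step 1. run_task() and passes_tests() are hypothetical stand-ins for the SWE-bench lite harness, and the tier ladder is illustrative:

```python
# Sketch: run cheapest model first, escalate one tier on test failure.
from dataclasses import dataclass

TIER_LADDER = ["gpt-4o-mini", "gpt-4o", "o1"]  # cheapest first (illustrative)

@dataclass
class TaskResult:
    passed: bool
    final_model: str
    llm_calls: int
    cost_usd: float

def run_task(task_id: str, model: str) -> tuple[str, float]:
    """Hypothetical: run one SWE-bench lite task, return (patch, cost)."""
    raise NotImplementedError

def passes_tests(task_id: str, patch: str) -> bool:
    """Hypothetical: apply the patch and run the task's test suite."""
    raise NotImplementedError

def solve_with_escalation(task_id: str) -> TaskResult:
    calls, total_cost = 0, 0.0
    for model in TIER_LADDER:              # cheap model first
        patch, cost = run_task(task_id, model)
        calls += 1
        total_cost += cost
        if passes_tests(task_id, patch):   # escalate only on failure
            return TaskResult(True, model, calls, total_cost)
    return TaskResult(False, TIER_LADDER[-1], calls, total_cost)
```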

Estimated effort: 1 week
Expected impact: Calibrate all module thresholds, discover edge cases


Phase 3: Online Learning Loop

Goal: Update routing probabilities, tool gate thresholds, and doom detector thresholds from live trace feedback.

Why Static Policies Fail

  • Model capabilities improve (GPT-4o → GPT-5)
  • New cheap models release (GPT-4o-mini → even cheaper)
  • Task mix changes over time
  • User behavior shifts

Implementation

  1. Trace ingestion pipeline: Collect traces from production runs
  2. Outcome labeling: Success/failure/escalation labels from user feedback
  3. Online update: Update router classifier weights weekly
  4. Thompson sampling: Explore new routing decisions with small probability
  5. Drift detection: Alert when success rate drops >5pp for a task type
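
A sketch of steps 4 and 5, assuming Beta-Bernoulli Thompson sampling per tier (ignoring cost in the reward for brevity) and a windowed drift alarm using the 5pp threshold from the list:

```python
# Sketch: posterior sampling for exploration, plus a simple drift alarm.
import random
from collections import deque

class ThompsonRouter:
    """Step 4: sample per-tier success rates from Beta posteriors."""
    def __init__(self, tiers: list[str]):
        self.posterior = {t: [1, 1] for t in tiers}  # Beta(1,1) prior: [succ+1, fail+1]

    def choose(self) -> str:
        samples = {t: random.betavariate(a, b) for t, (a, b) in self.posterior.items()}
        return max(samples, key=samples.get)

    def update(self, tier: str, success: bool) -> None:
        self.posterior[tier][0 if success else 1] += 1

class DriftDetector:
    """Step 5: alert when the recent success rate drops >5pp below baseline."""
    def __init__(self, window: int = 200, drop_pp: float = 5.0):
        self.recent = deque(maxlen=window)
        self.total = self.successes = 0
        self.drop = drop_pp / 100.0

    def observe(self, success: bool) -> bool:
        self.recent.append(success)
        self.total += 1
        self.successes += int(success)
        baseline = self.successes / self.total
        return sum(self.recent) / len(self.recent) < baseline - self.drop  # True => alert
```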

Estimated effort: 1 week
Expected impact: Maintains 28%+ savings as models/task mix evolve


Phase 4: Verifier Cascading

Goal: Use cheap verifier first, escalate to expensive verifier only on disagreement.

Current State

  • The verifier budgeter decides whether to verify
  • When it decides to verify, it always uses the same verifier model

Improvement

  • Tier 1: Simple regex/rule-based checks (free)
  • Tier 2: Cheap model verifier (GPT-4o-mini, $0.15/M tok)
  • Tier 3: Expensive verifier (GPT-4o, $2.5/M tok), used only when tier 2 flags an issue
  • Consensus mode: Run cheap + medium verifiers, escalate if they disagree
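
A sketch of the cascade, with hypothetical verify_* helpers; the regex rules and the escalate-on-flag logic are illustrative:

```python
# Sketch: three-tier verifier cascade, cheapest check first.
import re

def verify_rules(output: str) -> bool | None:
    """Tier 1: free rule-based checks; None means inconclusive."""
    if re.search(r"\bTODO\b|\bFIXME\b", output):
        return False
    return None

def verify_llm(output: str, model: str) -> bool:
    """Hypothetical: ask `model` to judge the output."""
    raise NotImplementedError

def cascade_verify(output: str) -> bool:
    verdict = verify_rules(output)
    if verdict is not None:
        return verdict                          # decided for free
    if verify_llm(output, "gpt-4o-mini"):       # tier 2: cheap verifier
        return True
    return verify_llm(output, "gpt-4o")         # tier 3: only on a tier-2 flag
```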

Estimated impact: 60–80% verifier cost reduction on low-risk tasks


Phase 5: Cross-Provider Cost Optimization

Goal: Route to cheapest provider offering adequate model tier.

Providers with Similar-Tier Models

| Tier | OpenAI | Anthropic | DeepSeek | Google | Together | Fireworks |
|---|---|---|---|---|---|---|
| 2 (Cheap) | GPT-4o-mini | Haiku | DeepSeek-V3 | Gemini-Flash | Llama-3.1-8B | Mixtral-8x7B |
| 3 (Medium) | GPT-4o | Sonnet | DeepSeek-V3 | Gemini-Pro | Llama-3.1-70B | Qwen-2.5-72B |
| 4 (Frontier) | o1 | Opus | – | Gemini-Ultra | Llama-3.1-405B | – |

Implementation

  1. Maintain provider pricing API (auto-fetch current prices)
  2. Add provider latency/availability monitoring
  3. Route to cheapest available tier-adequate provider
  4. Fallback chain: primary → secondary → tertiary
  5. Cache routing decisions per provider for stability
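
A sketch of steps 1-4, assuming a static price snapshot and a boolean health map; a real system would refresh both from the pricing API (step 1) and the availability monitoring (step 2). All prices shown are placeholders:

```python
# Sketch: cheapest healthy provider for a tier, with a fallback chain.
PRICES_PER_MTOK = {  # (provider, tier) -> illustrative input price, USD
    ("openai", 2): 0.15, ("anthropic", 2): 0.25, ("deepseek", 2): 0.14,
    ("openai", 3): 2.50, ("anthropic", 3): 3.00, ("deepseek", 3): 0.14,
}
HEALTHY = {"openai": True, "anthropic": True, "deepseek": True}  # from monitoring

def fallback_chain(tier: int) -> list[str]:
    """Providers offering the tier, cheapest first (steps 3-4)."""
    offers = sorted((p, prov) for (prov, t), p in PRICES_PER_MTOK.items() if t == tier)
    return [prov for _, prov in offers]

def route_provider(tier: int) -> str:
    for provider in fallback_chain(tier):
        if HEALTHY.get(provider, False):
            return provider
    raise RuntimeError(f"no healthy provider for tier {tier}")
```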

Estimated impact: Additional 5–10% cost reduction on multi-provider setups


Phase 6: KV Cache Sharing

Goal: Share prefix KV caches across concurrent agent runs using identical system prompts.

How It Works

  • Many agent runs share the same system prompt + tool descriptions
  • vLLM and SGLang support prefix caching / KV cache sharing
  • Running N agents concurrently → cache the system prompt once, reuse it N-1 times

Implementation

  1. Integrate with vLLM/SGLang backend for local models
  2. Group agent runs by identical prefix hash
  3. Pre-fill shared prefix once, append per-run suffix
  4. Track cache hit rate per prefix group
  5. Apply to multi-tenant agent deployments
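
A sketch of steps 2 and 4: grouping runs by a hash of their shared prefix and counting reuse. The actual KV reuse happens inside vLLM/SGLang prefix caching; this only organizes runs and measures the hit rate:

```python
# Sketch: prefix-group bookkeeping for shared-system-prompt agent runs.
import hashlib
from collections import defaultdict

def prefix_hash(system_prompt: str, tool_descriptions: str) -> str:
    """Step 2: identical prefixes hash to the same group."""
    shared = system_prompt + "\n" + tool_descriptions
    return hashlib.sha256(shared.encode()).hexdigest()[:16]

class PrefixGroups:
    def __init__(self):
        self.runs = defaultdict(list)  # hash -> run ids sharing the prefix
        self.hits = defaultdict(int)   # hash -> reuse count

    def register(self, run_id: str, system_prompt: str, tools: str) -> str:
        h = prefix_hash(system_prompt, tools)
        if self.runs[h]:
            self.hits[h] += 1          # prefix already pre-filled: a cache hit
        self.runs[h].append(run_id)
        return h

    def hit_rate(self, h: str) -> float:
        """Step 4: cache hit rate for one prefix group."""
        n = len(self.runs[h])
        return self.hits[h] / n if n else 0.0
```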

Estimated impact: 20–40% cost reduction on concurrent agent farms


Phase 7: Speculative Agent Actions

Goal: Generate next N actions with cheap model, validate with frontier only on divergence.

How It Works

  1. Cheap model generates next action sequence (plan + tool calls)
  2. Frontier model validates only the divergent or high-risk actions
  3. If cheap model plan matches frontier with >0.9 similarity → use cheap
  4. If divergence > threshold → regenerate with frontier
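
A sketch of the accept/regenerate rule, with plan() and validate() as hypothetical stand-ins; here agreement is measured as the fraction of draft steps the frontier model does not reject:

```python
# Sketch: speculative action drafting with frontier validation.
SIMILARITY_ACCEPT = 0.9  # threshold from step 3

def plan(model: str, state: dict) -> list[dict]:
    """Hypothetical: next-N actions (plan + tool calls) proposed by `model`."""
    raise NotImplementedError

def validate(state: dict, draft: list[dict]) -> list[int]:
    """Hypothetical: frontier model returns indices of draft steps it rejects."""
    raise NotImplementedError

def speculative_actions(state: dict) -> list[dict]:
    draft = plan("gpt-4o-mini", state)         # step 1: cheap draft
    rejected = validate(state, draft)          # step 2: frontier checks, not generates
    agreement = 1 - len(rejected) / max(len(draft), 1)
    if agreement >= SIMILARITY_ACCEPT:         # step 3: keep the cheap plan
        return draft
    return plan("o1", state)                   # step 4: regenerate with frontier
```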

Use Cases

  • Multi-step coding workflows (cheap generates plan, frontier validates critical steps)
  • Research workflows (cheap suggests search queries, frontier validates synthesis)
  • Tool-heavy workflows (cheap predicts tool sequence, frontier validates data transformations)

Estimated impact: 15–25% cost reduction on multi-step tasks


Phase 8: Confidence Calibration with Process Reward Models

Goal: Train a per-step success predictor for dynamic compute allocation.

Current State

  • Router uses task-level difficulty classification
  • Does not adapt compute within a task based on step-level confidence

Improvement

  1. Train a small PRM (Process Reward Model) on agent traces
  2. At each step, PRM predicts P(success | current state)
  3. If P(success) < 0.5 → escalate model, retrieve more context, or call verifier
  4. If P(success) > 0.95 → use cheaper model for next step
  5. Dynamically allocate compute based on real-time trajectory quality
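
A sketch of steps 2-4, with prm_score() standing in for the trained PRM; the thresholds are the ones given above:

```python
# Sketch: per-step compute allocation gated by PRM confidence.
ESCALATE_BELOW, DOWNSHIFT_ABOVE = 0.5, 0.95  # thresholds from steps 3-4

def prm_score(trajectory: list[str]) -> float:
    """Hypothetical: P(success | steps so far) from the trained PRM (step 2)."""
    raise NotImplementedError

def next_step_tier(trajectory: list[str], current_tier: str) -> str:
    p = prm_score(trajectory)
    if p < ESCALATE_BELOW:
        return "frontier"    # step 3: escalate (or retrieve more / call verifier)
    if p > DOWNSHIFT_ABOVE:
        return "cheap"       # step 4: trajectory looks safe, spend less
    return current_tier      # otherwise keep the current allocation
```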

Estimated impact: 10–15% cost reduction with quality preservation


Phase 9: Human-in-the-Loop Integration

Goal: Learn from human corrections to improve routing and reduce future mistakes.

Implementation

  1. When a human corrects agent output → label the trace
  2. If the human says "should have used a stronger model" → update routing probabilities
  3. If the human says "didn't need to call that tool" → update tool gate thresholds
  4. If the human says "stopped too early" → update doom detector thresholds
  5. Feed corrections into online learning loop (Phase 3)
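
A sketch of the correction-to-parameter mapping in steps 2-4; the label names, starting values, and nudge size are illustrative, and real updates would flow through the Phase 3 loop:

```python
# Sketch: nudge routing/gate/stop parameters from human correction labels.
thresholds = {"tool_gate": 0.50, "doom_stop": 0.30}  # illustrative values
route_bias = {"frontier": 0.0}  # additive bias toward stronger tiers

def apply_correction(label: str, step: float = 0.05) -> None:
    if label == "needed_stronger_model":      # step 2
        route_bias["frontier"] += step
    elif label == "unneeded_tool_call":       # step 3: gate tools harder
        thresholds["tool_gate"] = min(1.0, thresholds["tool_gate"] + step)
    elif label == "stopped_too_early":        # step 4: stop less eagerly
        thresholds["doom_stop"] = max(0.0, thresholds["doom_stop"] - step)
```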

Estimated impact: Reduces false-DONE rate and missed escalation rate by 30–50%


Phase 10: Meta-Learning Across Tasks

Goal: Learn task-specific optimal policies from a small number of examples.

How It Works

  • New task type appears (e.g., "medical diagnosis assistant")
  • ACO has no prior traces for this task type
  • Meta-learner transfers policies from similar task types (e.g., legal → medical, both high-risk)
  • Few-shot calibrates thresholds from first 10–20 traces

Implementation

  1. Embed task types in semantic space
  2. Find k-nearest task types with sufficient trace history
  3. Transfer router weights, tool gate thresholds, verifier rules
  4. Bayesian update with new task traces
  5. Converge to task-specific policy within 50 traces
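
A sketch of steps 1-3, with embed() as a hypothetical stand-in for any sentence-embedding model; the neighbors' thresholds are averaged as a warm start before the Bayesian updates of step 4:

```python
# Sketch: k-NN transfer of thresholds from similar task types.
import numpy as np

def embed(task_type: str) -> np.ndarray:
    """Hypothetical: semantic embedding of a task-type description (step 1)."""
    raise NotImplementedError

def warm_start(new_task: str, history: dict[str, dict[str, float]],
               k: int = 3) -> dict[str, float]:
    """history maps task types with enough traces to their learned thresholds."""
    query = embed(new_task)
    dists = {t: float(np.linalg.norm(embed(t) - query)) for t in history}
    neighbors = sorted(dists, key=dists.get)[:k]   # step 2: k nearest task types
    keys = history[neighbors[0]]
    return {key: float(np.mean([history[t][key] for t in neighbors]))  # step 3
            for key in keys}
```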

Estimated impact: Reduces cold-start period from 100 traces to 20 traces


Summary: Priority Ranking

| Phase | Impact | Effort | Priority |
|---|---|---|---|
| 1. Learned Router | ⭐⭐⭐⭐⭐ | Medium | #1 |
| 2. Real Benchmark | ⭐⭐⭐⭐⭐ | High | #2 |
| 3. Online Learning | ⭐⭐⭐⭐⭐ | High | #3 |
| 4. Verifier Cascading | ⭐⭐⭐⭐ | Low | #4 |
| 5. Cross-Provider | ⭐⭐⭐⭐ | Medium | #5 |
| 6. KV Cache Sharing | ⭐⭐⭐ | High | #6 |
| 7. Speculative Actions | ⭐⭐⭐⭐ | High | #7 |
| 8. PRM Calibration | ⭐⭐⭐⭐ | High | #8 |
| 9. Human-in-the-Loop | ⭐⭐⭐ | Medium | #9 |
| 10. Meta-Learning | ⭐⭐⭐ | High | #10 |

Success Metrics for Each Phase

Track these metrics for every phase:

  1. Cost per successful task (primary)
  2. Cost per artifact (secondary)
  3. Task success rate (must not regress)
  4. False-DONE rate (must not increase)
  5. Unsafe cheap-model miss rate (must be <2%)
  6. Missed escalation rate (must be <5%)
  7. Cache hit rate (target >60%)
  8. Tool call efficiency (used/called ratio >80%)
  9. Verifier pass rate (target >85%)
  10. Latency per task (must not increase >20%)
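
A sketch of computing two of these metrics from traces; the Trace fields are assumed, not ACO's actual schema:

```python
# Sketch: metric computation over collected traces.
from dataclasses import dataclass

@dataclass
class Trace:  # assumed fields for illustration
    cost_usd: float
    success: bool
    tools_called: int
    tools_used: int

def cost_per_successful_task(traces: list[Trace]) -> float:
    """Metric 1: total spend divided by number of successes."""
    successes = sum(t.success for t in traces)
    return sum(t.cost_usd for t in traces) / max(successes, 1)

def tool_call_efficiency(traces: list[Trace]) -> float:
    """Metric 8: used/called ratio, target >80%."""
    called = sum(t.tools_called for t in traces)
    return sum(t.tools_used for t in traces) / max(called, 1)
```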

Last updated: 2025-07-05