Agent Cost Optimizer Roadmap

Current Status (v1.0)

  • ✅ 10 core modules implemented and benchmarked
  • ✅ 28% cost reduction at iso-quality (94.3% success rate)
  • ✅ Synthetic benchmark (2K traces, 19 scenarios)
  • ✅ Learned router skeleton (trainable, not yet trained on real data)
  • ✅ Deployment guide, model card, technical report
  • ✅ Gradio dashboard code (not yet deployed)

Phase 1: Learned Router (Immediate Priority)

Goal: Replace heuristic router with classifier trained on real traces.

Why This Is #1

The ablation study shows the model router is the most critical module. A trained classifier could:

  • Increase savings from 28% to 35–40%
  • Reduce false escalations by 50%
  • Enable task-specific routing (code → Claude, reasoning → o3-mini)

Implementation

  1. Collect 10K+ real traces with full telemetry
  2. Extract (request_features, optimal_tier) pairs
  3. Train a simple logistic-regression or small neural classifier
  4. Alternatively, train with GRPO using a cost-adjusted reward (BAAR-style boundary-guided routing)
  5. A/B test against heuristic router
  6. Fall back to heuristic when confidence < 0.7
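
A minimal sketch of steps 3 and 6, assuming scikit-learn; the feature layout and the heuristic_route() stub are illustrative stand-ins, not ACO's actual interfaces:

```python
# Sketch: logistic-regression router with a confidence fallback.
import numpy as np
from sklearn.linear_model import LogisticRegression

def heuristic_route(features: np.ndarray) -> str:
    """Stand-in for the existing heuristic router (hypothetical)."""
    return "medium"

class LearnedRouter:
    def __init__(self, confidence_floor: float = 0.7):
        self.clf = LogisticRegression(max_iter=1000)
        self.confidence_floor = confidence_floor

    def fit(self, X: np.ndarray, y: np.ndarray) -> None:
        # X: (n_traces, n_features) request features extracted in step 2;
        # y: label of the cheapest tier that succeeded, e.g. "cheap" / "medium" / "frontier".
        self.clf.fit(X, y)

    def route(self, features: np.ndarray) -> str:
        proba = self.clf.predict_proba(features.reshape(1, -1))[0]
        if proba.max() < self.confidence_floor:
            return heuristic_route(features)  # step 6: low-confidence fallback
        return str(self.clf.classes_[int(proba.argmax())])
```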

Estimated effort: 2–3 days
Expected impact: +7–12pp cost savings, <1pp quality regression


Phase 2: Real Interactive Benchmark

Goal: Evaluate ACO against real agent tasks with actual model calls.

Why Synthetic Is Not Enough

Our synthetic benchmark assumes fixed success probabilities per tier. Real-world model behavior:

  • Is non-stationary (models improve, new models release)
  • Depends on prompt engineering, not just model strength
  • Has provider-specific quirks (Claude vs GPT vs DeepSeek)
  • Is affected by rate limits, timeouts, transient failures

Implementation

  1. Coding benchmark: Integrate with SWE-bench lite (500 tasks)
    • Run with cheap model first, escalate on failure
    • Measure: pass@1, LLM calls, cost, time
  2. Tool-use benchmark: Integrate with BFCL (2,000 function-calling tasks)
    • Measure: tool accuracy, missed tools, cost
  3. Research benchmark: 100 real research questions
    • Run with retrieval + cheap model vs retrieval + frontier
    • Human evaluation: source quality, hallucination, coverage
  4. Long-horizon benchmark: 50 multi-step tasks (WebArena-style)
    • Measure: task completion, cost growth over steps, cache hit rate
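
A sketch of the escalate-on-failure loop from step 1. run_task() and passes_tests() are hypothetical stand-ins for the SWE-bench lite harness, and the tier ladder is illustrative:

```python
# Sketch: run cheapest model first, escalate one tier on test failure.
from dataclasses import dataclass

TIER_LADDER = ["gpt-4o-mini", "gpt-4o", "o1"]  # cheapest first (illustrative)

@dataclass
class TaskResult:
    passed: bool
    final_model: str
    llm_calls: int
    cost_usd: float

def run_task(task_id: str, model: str) -> tuple[str, float]:
    """Hypothetical: run one SWE-bench lite task, return (patch, cost)."""
    raise NotImplementedError

def passes_tests(task_id: str, patch: str) -> bool:
    """Hypothetical: apply the patch and run the task's test suite."""
    raise NotImplementedError

def solve_with_escalation(task_id: str) -> TaskResult:
    calls, total_cost = 0, 0.0
    for model in TIER_LADDER:              # cheap model first
        patch, cost = run_task(task_id, model)
        calls += 1
        total_cost += cost
        if passes_tests(task_id, patch):   # escalate only on failure
            return TaskResult(True, model, calls, total_cost)
    return TaskResult(False, TIER_LADDER[-1], calls, total_cost)
```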

Estimated effort: 1 week
Expected impact: Calibrate all module thresholds, discover edge cases


Phase 3: Online Learning Loop

Goal: Update routing probabilities, tool gate thresholds, and doom detector thresholds from live trace feedback.

Why Static Policies Fail

  • Model capabilities improve (GPT-4o → GPT-5)
  • New cheap models release (GPT-4o-mini → even cheaper)
  • Task mix changes over time
  • User behavior shifts

Implementation

  1. Trace ingestion pipeline: Collect traces from production runs
  2. Outcome labeling: Success/failure/escalation labels from user feedback
  3. Online update: Update router classifier weights weekly
  4. Thompson sampling: Explore new routing decisions with small probability
  5. Drift detection: Alert when success rate drops >5pp for a task type
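
A sketch of steps 4 and 5, assuming Beta-Bernoulli Thompson sampling per tier (ignoring cost in the reward for brevity) and a windowed drift alarm using the 5pp threshold from the list:

```python
# Sketch: posterior sampling for exploration, plus a simple drift alarm.
import random
from collections import deque

class ThompsonRouter:
    """Step 4: sample per-tier success rates from Beta posteriors."""
    def __init__(self, tiers: list[str]):
        self.posterior = {t: [1, 1] for t in tiers}  # Beta(1,1) prior: [succ+1, fail+1]

    def choose(self) -> str:
        samples = {t: random.betavariate(a, b) for t, (a, b) in self.posterior.items()}
        return max(samples, key=samples.get)

    def update(self, tier: str, success: bool) -> None:
        self.posterior[tier][0 if success else 1] += 1

class DriftDetector:
    """Step 5: alert when the recent success rate drops >5pp below baseline."""
    def __init__(self, window: int = 200, drop_pp: float = 5.0):
        self.recent = deque(maxlen=window)
        self.total = self.successes = 0
        self.drop = drop_pp / 100.0

    def observe(self, success: bool) -> bool:
        self.recent.append(success)
        self.total += 1
        self.successes += int(success)
        baseline = self.successes / self.total
        return sum(self.recent) / len(self.recent) < baseline - self.drop  # True => alert
```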

Estimated effort: 1 week
Expected impact: Maintains 28%+ savings as models/task mix evolve


Phase 4: Verifier Cascading

Goal: Use cheap verifier first, escalate to expensive verifier only on disagreement.

Current State

  • The verifier budgeter decides whether to verify
  • When it decides to verify, it always uses the same verifier model

Improvement

  • Tier 1: Simple regex/rule-based checks (free)
  • Tier 2: Cheap model verifier (GPT-4o-mini, $0.15/M tok)
  • Tier 3: Expensive verifier (GPT-4o, $2.5/M tok), used only when tier 2 flags an issue
  • Consensus mode: Run cheap + medium verifiers, escalate if they disagree
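
A sketch of the cascade, with hypothetical verify_* helpers; the regex rules and the escalate-on-flag logic are illustrative:

```python
# Sketch: three-tier verifier cascade, cheapest check first.
import re

def verify_rules(output: str) -> bool | None:
    """Tier 1: free rule-based checks; None means inconclusive."""
    if re.search(r"\bTODO\b|\bFIXME\b", output):
        return False
    return None

def verify_llm(output: str, model: str) -> bool:
    """Hypothetical: ask `model` to judge the output."""
    raise NotImplementedError

def cascade_verify(output: str) -> bool:
    verdict = verify_rules(output)
    if verdict is not None:
        return verdict                          # decided for free
    if verify_llm(output, "gpt-4o-mini"):       # tier 2: cheap verifier
        return True
    return verify_llm(output, "gpt-4o")         # tier 3: only on a tier-2 flag
```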

Estimated impact: 60–80% verifier cost reduction on low-risk tasks


Phase 5: Cross-Provider Cost Optimization

Goal: Route to cheapest provider offering adequate model tier.

Providers with Similar-Tier Models

| Tier | OpenAI | Anthropic | DeepSeek | Google | Together | Fireworks |
|---|---|---|---|---|---|---|
| 2 (Cheap) | GPT-4o-mini | Haiku | DeepSeek-V3 | Gemini-Flash | Llama-3.1-8B | Mixtral-8x7B |
| 3 (Medium) | GPT-4o | Sonnet | DeepSeek-V3 | Gemini-Pro | Llama-3.1-70B | Qwen-2.5-72B |
| 4 (Frontier) | o1 | Opus | – | Gemini-Ultra | Llama-3.1-405B | – |

Implementation

  1. Maintain provider pricing API (auto-fetch current prices)
  2. Add provider latency/availability monitoring
  3. Route to cheapest available tier-adequate provider
  4. Fallback chain: primary → secondary → tertiary
  5. Cache routing decisions per provider for stability
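
A sketch of steps 1-4, assuming a static price snapshot and a boolean health map; a real system would refresh both from the pricing API (step 1) and the availability monitoring (step 2). All prices shown are placeholders:

```python
# Sketch: cheapest healthy provider for a tier, with a fallback chain.
PRICES_PER_MTOK = {  # (provider, tier) -> illustrative input price, USD
    ("openai", 2): 0.15, ("anthropic", 2): 0.25, ("deepseek", 2): 0.14,
    ("openai", 3): 2.50, ("anthropic", 3): 3.00, ("deepseek", 3): 0.14,
}
HEALTHY = {"openai": True, "anthropic": True, "deepseek": True}  # from monitoring

def fallback_chain(tier: int) -> list[str]:
    """Providers offering the tier, cheapest first (steps 3-4)."""
    offers = sorted((p, prov) for (prov, t), p in PRICES_PER_MTOK.items() if t == tier)
    return [prov for _, prov in offers]

def route_provider(tier: int) -> str:
    for provider in fallback_chain(tier):
        if HEALTHY.get(provider, False):
            return provider
    raise RuntimeError(f"no healthy provider for tier {tier}")
```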

Estimated impact: Additional 5–10% cost reduction on multi-provider setups


Phase 6: KV Cache Sharing

Goal: Share prefix KV caches across concurrent agent runs using identical system prompts.

How It Works

  • Many agent runs share the same system prompt + tool descriptions
  • vLLM and SGLang support prefix caching / KV cache sharing
  • Running N agents concurrently → cache the system prompt once, reuse it N-1 times

Implementation

  1. Integrate with vLLM/SGLang backend for local models
  2. Group agent runs by identical prefix hash
  3. Pre-fill shared prefix once, append per-run suffix
  4. Track cache hit rate per prefix group
  5. Apply to multi-tenant agent deployments
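
A sketch of steps 2 and 4: grouping runs by a hash of their shared prefix and counting reuse. The actual KV reuse happens inside vLLM/SGLang prefix caching; this only organizes runs and measures the hit rate:

```python
# Sketch: prefix-group bookkeeping for shared-system-prompt agent runs.
import hashlib
from collections import defaultdict

def prefix_hash(system_prompt: str, tool_descriptions: str) -> str:
    """Step 2: identical prefixes hash to the same group."""
    shared = system_prompt + "\n" + tool_descriptions
    return hashlib.sha256(shared.encode()).hexdigest()[:16]

class PrefixGroups:
    def __init__(self):
        self.runs = defaultdict(list)  # hash -> run ids sharing the prefix
        self.hits = defaultdict(int)   # hash -> reuse count

    def register(self, run_id: str, system_prompt: str, tools: str) -> str:
        h = prefix_hash(system_prompt, tools)
        if self.runs[h]:
            self.hits[h] += 1          # prefix already pre-filled: a cache hit
        self.runs[h].append(run_id)
        return h

    def hit_rate(self, h: str) -> float:
        """Step 4: cache hit rate for one prefix group."""
        n = len(self.runs[h])
        return self.hits[h] / n if n else 0.0
```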

Estimated impact: 20–40% cost reduction on concurrent agent farms


Phase 7: Speculative Agent Actions

Goal: Generate next N actions with cheap model, validate with frontier only on divergence.

How It Works

  1. Cheap model generates next action sequence (plan + tool calls)
  2. Frontier model validates only the divergent or high-risk actions
  3. If cheap model plan matches frontier with >0.9 similarity → use cheap
  4. If divergence > threshold → regenerate with frontier
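
A sketch of the accept/regenerate rule, with plan() and validate() as hypothetical stand-ins; here agreement is measured as the fraction of draft steps the frontier model does not reject:

```python
# Sketch: speculative action drafting with frontier validation.
SIMILARITY_ACCEPT = 0.9  # threshold from step 3

def plan(model: str, state: dict) -> list[dict]:
    """Hypothetical: next-N actions (plan + tool calls) proposed by `model`."""
    raise NotImplementedError

def validate(state: dict, draft: list[dict]) -> list[int]:
    """Hypothetical: frontier model returns indices of draft steps it rejects."""
    raise NotImplementedError

def speculative_actions(state: dict) -> list[dict]:
    draft = plan("gpt-4o-mini", state)         # step 1: cheap draft
    rejected = validate(state, draft)          # step 2: frontier checks, not generates
    agreement = 1 - len(rejected) / max(len(draft), 1)
    if agreement >= SIMILARITY_ACCEPT:         # step 3: keep the cheap plan
        return draft
    return plan("o1", state)                   # step 4: regenerate with frontier
```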

Use Cases

  • Multi-step coding workflows (cheap generates plan, frontier validates critical steps)
  • Research workflows (cheap suggests search queries, frontier validates synthesis)
  • Tool-heavy workflows (cheap predicts tool sequence, frontier validates data transformations)

Estimated impact: 15–25% cost reduction on multi-step tasks


Phase 8: Confidence Calibration with Process Reward Models

Goal: Train a per-step success predictor for dynamic compute allocation.

Current State

  • Router uses task-level difficulty classification
  • Does not adapt compute within a task based on step-level confidence

Improvement

  1. Train a small PRM (Process Reward Model) on agent traces
  2. At each step, PRM predicts P(success | current state)
  3. If P(success) < 0.5 → escalate model, retrieve more context, or call verifier
  4. If P(success) > 0.95 → use cheaper model for next step
  5. Dynamically allocate compute based on real-time trajectory quality
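
A sketch of steps 2-4, with prm_score() standing in for the trained PRM; the thresholds are the ones given above:

```python
# Sketch: per-step compute allocation gated by PRM confidence.
ESCALATE_BELOW, DOWNSHIFT_ABOVE = 0.5, 0.95  # thresholds from steps 3-4

def prm_score(trajectory: list[str]) -> float:
    """Hypothetical: P(success | steps so far) from the trained PRM (step 2)."""
    raise NotImplementedError

def next_step_tier(trajectory: list[str], current_tier: str) -> str:
    p = prm_score(trajectory)
    if p < ESCALATE_BELOW:
        return "frontier"    # step 3: escalate (or retrieve more / call verifier)
    if p > DOWNSHIFT_ABOVE:
        return "cheap"       # step 4: trajectory looks safe, spend less
    return current_tier      # otherwise keep the current allocation
```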

Estimated impact: 10–15% cost reduction with quality preservation


Phase 9: Human-in-the-Loop Integration

Goal: Learn from human corrections to improve routing and reduce future mistakes.

Implementation

  1. When a human corrects agent output → label the trace
  2. If the human says "should have used a stronger model" → update routing probabilities
  3. If the human says "didn't need to call that tool" → update tool gate thresholds
  4. If the human says "stopped too early" → update doom detector thresholds
  5. Feed corrections into online learning loop (Phase 3)
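
A sketch of the correction-to-parameter mapping in steps 2-4; the label names, starting values, and nudge size are illustrative, and real updates would flow through the Phase 3 loop:

```python
# Sketch: nudge routing/gate/stop parameters from human correction labels.
thresholds = {"tool_gate": 0.50, "doom_stop": 0.30}  # illustrative values
route_bias = {"frontier": 0.0}  # additive bias toward stronger tiers

def apply_correction(label: str, step: float = 0.05) -> None:
    if label == "needed_stronger_model":      # step 2
        route_bias["frontier"] += step
    elif label == "unneeded_tool_call":       # step 3: gate tools harder
        thresholds["tool_gate"] = min(1.0, thresholds["tool_gate"] + step)
    elif label == "stopped_too_early":        # step 4: stop less eagerly
        thresholds["doom_stop"] = max(0.0, thresholds["doom_stop"] - step)
```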

Estimated impact: Reduces false-DONE rate and missed escalation rate by 30–50%


Phase 10: Meta-Learning Across Tasks

Goal: Learn task-specific optimal policies from a small number of examples.

How It Works

  • New task type appears (e.g., "medical diagnosis assistant")
  • ACO has no prior traces for this task type
  • Meta-learner transfers policies from similar task types (e.g., legal → medical, both high-risk)
  • Few-shot calibrates thresholds from first 10–20 traces

Implementation

  1. Embed task types in semantic space
  2. Find k-nearest task types with sufficient trace history
  3. Transfer router weights, tool gate thresholds, verifier rules
  4. Bayesian update with new task traces
  5. Converge to task-specific policy within 50 traces
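
A sketch of steps 1-3, with embed() as a hypothetical stand-in for any sentence-embedding model; the neighbors' thresholds are averaged as a warm start before the Bayesian updates of step 4:

```python
# Sketch: k-NN transfer of thresholds from similar task types.
import numpy as np

def embed(task_type: str) -> np.ndarray:
    """Hypothetical: semantic embedding of a task-type description (step 1)."""
    raise NotImplementedError

def warm_start(new_task: str, history: dict[str, dict[str, float]],
               k: int = 3) -> dict[str, float]:
    """history maps task types with enough traces to their learned thresholds."""
    query = embed(new_task)
    dists = {t: float(np.linalg.norm(embed(t) - query)) for t in history}
    neighbors = sorted(dists, key=dists.get)[:k]   # step 2: k nearest task types
    keys = history[neighbors[0]]
    return {key: float(np.mean([history[t][key] for t in neighbors]))  # step 3
            for key in keys}
```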

Estimated impact: Reduces cold-start period from 100 traces to 20 traces


Summary: Priority Ranking

| Phase | Impact | Effort | Priority |
|---|---|---|---|
| 1. Learned Router | ⭐⭐⭐⭐⭐ | Medium | #1 |
| 2. Real Benchmark | ⭐⭐⭐⭐⭐ | High | #2 |
| 3. Online Learning | ⭐⭐⭐⭐⭐ | High | #3 |
| 4. Verifier Cascading | ⭐⭐⭐⭐ | Low | #4 |
| 5. Cross-Provider | ⭐⭐⭐⭐ | Medium | #5 |
| 6. KV Cache Sharing | ⭐⭐⭐ | High | #6 |
| 7. Speculative Actions | ⭐⭐⭐⭐ | High | #7 |
| 8. PRM Calibration | ⭐⭐⭐⭐ | High | #8 |
| 9. Human-in-the-Loop | ⭐⭐⭐ | Medium | #9 |
| 10. Meta-Learning | ⭐⭐⭐ | High | #10 |

Success Metrics for Each Phase

Track these metrics for every phase:

  1. Cost per successful task (primary)
  2. Cost per artifact (secondary)
  3. Task success rate (must not regress)
  4. False-DONE rate (must not increase)
  5. Unsafe cheap-model miss rate (must be <2%)
  6. Missed escalation rate (must be <5%)
  7. Cache hit rate (target >60%)
  8. Tool call efficiency (used/called ratio >80%)
  9. Verifier pass rate (target >85%)
  10. Latency per task (must not increase >20%)
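
A sketch of computing two of these metrics from traces; the Trace fields are assumed, not ACO's actual schema:

```python
# Sketch: metric computation over collected traces.
from dataclasses import dataclass

@dataclass
class Trace:  # assumed fields for illustration
    cost_usd: float
    success: bool
    tools_called: int
    tools_used: int

def cost_per_successful_task(traces: list[Trace]) -> float:
    """Metric 1: total spend divided by number of successes."""
    successes = sum(t.success for t in traces)
    return sum(t.cost_usd for t in traces) / max(successes, 1)

def tool_call_efficiency(traces: list[Trace]) -> float:
    """Metric 8: used/called ratio, target >80%."""
    called = sum(t.tools_called for t in traces)
    return sum(t.tools_used for t in traces) / max(called, 1)
```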

Last updated: 2025-07-05