narcolepticchicken/agent-cost-optimizer

narcolepticchicken committed about 11 hours ago

Commit db1085e (verified) · 1 parent: e4cea93

Upload docs/ROADMAP.md

Files changed (1)

docs/ROADMAP.md +78 -256

@@ -1,256 +1,78 @@
-# Agent Cost Optimizer — Roadmap
-## Current Status (v1.0)
-- ✅ 10 core modules implemented and benchmarked
-- ✅ 28% cost reduction at iso-quality (94.3% success rate)
-- ✅ Synthetic benchmark (2K traces, 19 scenarios)
-- ✅ Learned router skeleton (trainable, not yet trained on real data)
-- ✅ Deployment guide, model card, technical report
-- ✅ Gradio dashboard code (not yet deployed)
----
-## Phase 1: Learned Router (Immediate Priority)
-**Goal:** Replace heuristic router with classifier trained on real traces.
-### Why This Is #1
-The ablation study shows the model router is the most critical module. A trained classifier could:
-- Increase savings from 28% to 35–40%
-- Reduce false escalations by 50%
-- Enable task-specific routing (code → Claude, reasoning → o3-mini)
-### Implementation
-1. Collect 10K+ real traces with full telemetry
-2. Extract (request_features, optimal_tier) pairs
-3. Train simple logistic regression / small neural classifier
-4. Or: Train with GRPO using cost-adjusted reward (BAAR-style boundary-guided routing)
-5. A/B test against heuristic router
-6. Fall back to heuristic when confidence < 0.7
-**Estimated effort:** 2–3 days
-**Expected impact:** +7–12pp cost savings, <1pp quality regression
----
-## Phase 2: Real Interactive Benchmark
-**Goal:** Evaluate ACO against real agent tasks with actual model calls.
-### Why Synthetic Is Not Enough
-Our synthetic benchmark assumes fixed success probabilities per tier. Real-world model behavior:
-- Is non-stationary (models improve, new models release)
-- Depends on prompt engineering, not just model strength
-- Has provider-specific quirks (Claude vs GPT vs DeepSeek)
-- Is affected by rate limits, timeouts, transient failures
-### Implementation
-1. **Coding benchmark:** Integrate with SWE-bench lite (500 tasks)
-   - Run with cheap model first, escalate on failure
-   - Measure: pass@1, LLM calls, cost, time
-2. **Tool-use benchmark:** Integrate with BFCL (2,000 function-calling tasks)
-   - Measure: tool accuracy, missed tools, cost
-3. **Research benchmark:** 100 real research questions
-   - Run with retrieval + cheap model vs retrieval + frontier
-   - Human evaluation: source quality, hallucination, coverage
-4. **Long-horizon benchmark:** 50 multi-step tasks (WebArena-style)
-   - Measure: task completion, cost growth over steps, cache hit rate
-**Estimated effort:** 1 week
-**Expected impact:** Calibrate all module thresholds, discover edge cases
----
-## Phase 3: Online Learning Loop
-**Goal:** Update routing probabilities, tool gate thresholds, and doom detector thresholds from live trace feedback.
-### Why Static Policies Fail
-- Model capabilities improve (GPT-4o → GPT-5)
-- New cheap models release (GPT-4o-mini → even cheaper)
-- Task mix changes over time
-- User behavior shifts
-### Implementation
-1. **Trace ingestion pipeline:** Collect traces from production runs
-2. **Outcome labeling:** Success/failure/escalation labels from user feedback
-3. **Online update:** Update router classifier weights weekly
-4. **Thompson sampling:** Explore new routing decisions with small probability
-5. **Drift detection:** Alert when success rate drops >5pp for a task type
-**Estimated effort:** 1 week
-**Expected impact:** Maintains 28%+ savings as models/task mix evolve
----
-## Phase 4: Verifier Cascading
-**Goal:** Use cheap verifier first, escalate to expensive verifier only on disagreement.
-### Current State
-- Verifier budgeter decides WHETHER to verify
-- When it decides YES, it always uses the same verifier model
-### Improvement
-- **Tier 1:** Simple regex/rule-based checks (free)
-- **Tier 2:** Cheap model verifier (GPT-4o-mini, $0.15/M tok)
-- **Tier 3:** Expensive verifier (GPT-4o, $2.5/M tok) — only when tier 2 flags issue
-- **Consensus mode:** Run cheap + medium verifier, escalate if disagree
-**Estimated impact:** 60–80% verifier cost reduction on low-risk tasks
----
-## Phase 5: Cross-Provider Cost Optimization
-**Goal:** Route to cheapest provider offering adequate model tier.
-### Providers with Similar-Tier Models
-| Tier | OpenAI | Anthropic | DeepSeek | Google | Together | Fireworks |
-|------|--------|-----------|----------|--------|----------|-----------|
-| 2 (Cheap) | GPT-4o-mini | Haiku | DeepSeek-V3 | Gemini-Flash | Llama-3.1-8B | Mixtral-8x7B |
-| 3 (Medium) | GPT-4o | Sonnet | DeepSeek-V3 | Gemini-Pro | Llama-3.1-70B | Qwen-2.5-72B |
-| 4 (Frontier) | o1 | Opus | — | Gemini-Ultra | Llama-3.1-405B | — |
-### Implementation
-1. Maintain provider pricing API (auto-fetch current prices)
-2. Add provider latency/availability monitoring
-3. Route to cheapest available tier-adequate provider
-4. Fallback chain: primary → secondary → tertiary
-5. Cache routing decisions per provider for stability
-**Estimated impact:** Additional 5–10% cost reduction on multi-provider setups
----
-## Phase 6: KV Cache Sharing
-**Goal:** Share prefix KV caches across concurrent agent runs using identical system prompts.
-### How It Works
-- Many agent runs share the same system prompt + tool descriptions
-- vLLM and SGLang support prefix caching / KV cache sharing
-- Running N agents concurrently → cache system prompt once, reuse N-1 times
-### Implementation
-1. Integrate with vLLM/SGLang backend for local models
-2. Group agent runs by identical prefix hash
-3. Pre-fill shared prefix once, append per-run suffix
-4. Track cache hit rate per prefix group
-5. Apply to multi-tenant agent deployments
-**Estimated impact:** 20–40% cost reduction on concurrent agent farms
----
-## Phase 7: Speculative Agent Actions
-**Goal:** Generate next N actions with cheap model, validate with frontier only on divergence.
-### How It Works
-1. Cheap model generates next action sequence (plan + tool calls)
-2. Frontier model validates only the *divergent* or *high-risk* actions
-3. If cheap model plan matches frontier with >0.9 similarity → use cheap
-4. If divergence > threshold → regenerate with frontier
-### Use Cases
-- Multi-step coding workflows (cheap generates plan, frontier validates critical steps)
-- Research workflows (cheap suggests search queries, frontier validates synthesis)
-- Tool-heavy workflows (cheap predicts tool sequence, frontier validates data transformations)
-**Estimated impact:** 15–25% cost reduction on multi-step tasks
----
-## Phase 8: Confidence Calibration with Process Reward Models
-**Goal:** Train a per-step success predictor for dynamic compute allocation.
-### Current State
-- Router uses task-level difficulty classification
-- Does not adapt compute within a task based on step-level confidence
-### Improvement
-1. Train a small PRM (Process Reward Model) on agent traces
-2. At each step, PRM predicts P(success | current state)
-3. If P(success) < 0.5 → escalate model, retrieve more context, or call verifier
-4. If P(success) > 0.95 → use cheaper model for next step
-5. Dynamically allocate compute based on real-time trajectory quality
-**Estimated impact:** 10–15% cost reduction with quality preservation
----
-## Phase 9: Human-in-the-Loop Integration
-**Goal:** Learn from human corrections to improve routing and reduce future mistakes.
-### Implementation
-1. When human corrects agent output → label the trace
-2. If human says "should have used stronger model" → update routing probabilities
-3. If human says "didn't need to call that tool" → update tool gate thresholds
-4. If human says "stopped too early" → update doom detector thresholds
-5. Feed corrections into online learning loop (Phase 3)
-**Estimated impact:** Reduces false-DONE rate and missed escalation rate by 30–50%
----
-## Phase 10: Meta-Learning Across Tasks
-**Goal:** Learn task-specific optimal policies from a small number of examples.
-### How It Works
-- New task type appears (e.g., "medical diagnosis assistant")
-- ACO has no prior traces for this task type
-- Meta-learner transfers policies from similar task types (e.g., legal → medical, both high-risk)
-- Few-shot calibrates thresholds from first 10–20 traces
-### Implementation
-1. Embed task types in semantic space
-2. Find k-nearest task types with sufficient trace history
-3. Transfer router weights, tool gate thresholds, verifier rules
-4. Bayesian update with new task traces
-5. Converge to task-specific policy within 50 traces
-**Estimated impact:** Reduces cold-start period from 100 traces to 20 traces
----
-## Summary: Priority Ranking
-| Phase | Impact | Effort | Priority |
-|-------|--------|--------|----------|
-| 1. Learned Router | ⭐⭐⭐⭐⭐ | Medium | **#1** |
-| 2. Real Benchmark | ⭐⭐⭐⭐⭐ | High | #2 |
-| 3. Online Learning | ⭐⭐⭐⭐⭐ | High | #3 |
-| 4. Verifier Cascading | ⭐⭐⭐⭐ | Low | #4 |
-| 5. Cross-Provider | ⭐⭐⭐⭐ | Medium | #5 |
-| 6. KV Cache Sharing | ⭐⭐⭐ | High | #6 |
-| 7. Speculative Actions | ⭐⭐⭐⭐ | High | #7 |
-| 8. PRM Calibration | ⭐⭐⭐⭐ | High | #8 |
-| 9. Human-in-the-Loop | ⭐⭐⭐ | Medium | #9 |
-| 10. Meta-Learning | ⭐⭐⭐ | High | #10 |
----
-## Success Metrics for Each Phase
-Track these metrics for every phase:
-1. **Cost per successful task** (primary)
-2. **Cost per artifact** (secondary)
-3. **Task success rate** (must not regress)
-4. **False-DONE rate** (must not increase)
-5. **Unsafe cheap-model miss rate** (must be <2%)
-6. **Missed escalation rate** (must be <5%)
-7. **Cache hit rate** (target >60%)
-8. **Tool call efficiency** (used/called ratio >80%)
-9. **Verifier pass rate** (target >85%)
-10. **Latency per task** (must not increase >20%)
----
-*Last updated: 2025-07-05*

+# ACO Roadmap
+## Completed (v1-v11)
+- [x] Normalized trace schema
+- [x] Synthetic trace generator (10K traces)
+- [x] Cost telemetry collector
+- [x] Task cost classifier
+- [x] Model cascade router (XGBoost per-tier)
+- [x] Context budgeter
+- [x] Cache-aware prompt layout
+- [x] Tool-use cost gate
+- [x] Verifier budgeter
+- [x] Retry/recovery optimizer
+- [x] Meta-tool miner
+- [x] Early termination detector
+- [x] Execution-feedback router (entropy cascade)
+- [x] Per-step routing
+- [x] Real benchmark evaluation (SWE-bench, BFCL)
+- [x] Ablation study on real data
+- [x] Literature review
+- [x] Deployment guide
+- [x] Technical blog post
+- [x] Final report
+- [x] Model cards
+## In Progress
+- [ ] Fine-tuned DistilBERT router (cloud job training on SPROUT)
+- [ ] Gradio dashboard with real benchmark numbers
+## Next Priority (CPU-friendly)
+- [ ] Conformal calibration of escalation thresholds
+- [ ] Cost-quality Pareto frontier visualization
+- [ ] JSON schema validation for traces
+- [ ] Unit tests for all 11 modules
+- [ ] Integration test suite
+- [ ] Example notebooks
+- [ ] Provider adapter examples (OpenAI, Anthropic, local)
+- [ ] Config file validator
+- [ ] CLI improvements (batch routing, cost estimation)
+## Next Priority (GPU needed)
+- [ ] Execution-feedback with real model logprobs
+- [ ] Best-of-N cheap sampling with reward model
+- [ ] Fine-tuned BERT per-step router
+- [ ] Process reward model for selective verification
+- [ ] Real agent benchmarks (SWE-bench Live, WebArena)
+## Long-term
+- [ ] Learned context selector (vs heuristic budgeter)
+- [ ] Workflow mining from real traces
+- [ ] Online learning from new traces
+- [ ] Multi-agent cost optimization
+- [ ] Provider-aware routing (cost/latency/availability)
+- [ ] Budget-constrained decoding
+- [ ] Cross-task transfer learning
+## Known Limitations
+- Router trained on SPROUT + SWE-Router only (need more domains)
+- Execution feedback uses simulated logprobs (need real model outputs)
+- No conformal guarantees on quality (hand-tuned thresholds)
+- Per-step routing not yet integrated with v11 XGBoost
+- Cache-aware layout not benchmarked with real providers
+- No real agent harness integration tested end-to-end
+## Headroom
+Oracle routing on SWE-bench shows that an 80.3% cost reduction is achievable; v11 achieves 36.9%. The remaining 43.4 percentage points (pp) are estimated to come from:
+- Better per-step routing (~10 pp)
+- Real execution feedback (~10 pp)
+- Best-of-N cheap sampling (~8 pp)
+- Conformal calibration (~5 pp)
+- More training data from more domains (~10 pp)
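
The headroom arithmetic in the new roadmap can be checked with a short sketch. The figures are copied from the Headroom section; the dictionary keys and the `headroom_summary` helper are hypothetical names introduced for illustration and are not part of the repository.

```python
# Figures copied from the Headroom section of the roadmap.
ORACLE_REDUCTION = 80.3  # % cost reduction with oracle routing on SWE-bench
V11_REDUCTION = 36.9     # % cost reduction achieved by v11

# Approximate contributions in percentage points, as listed in the roadmap.
# The key names are illustrative labels, not identifiers from the codebase.
estimated_gains = {
    "per_step_routing": 10,
    "real_execution_feedback": 10,
    "best_of_n_cheap_sampling": 8,
    "conformal_calibration": 5,
    "more_training_domains": 10,
}


def headroom_summary(oracle, achieved, gains):
    """Compare the oracle gap with the sum of the estimated contributions."""
    gap = round(oracle - achieved, 1)
    attributed = sum(gains.values())
    return {
        "gap_pp": gap,
        "attributed_pp": attributed,
        "unattributed_pp": round(gap - attributed, 1),
    }


summary = headroom_summary(ORACLE_REDUCTION, V11_REDUCTION, estimated_gains)
print(summary)
```

The listed estimates are rounded, so they account for roughly 43 of the 43.4 points between v11 and the oracle; the small remainder is rounding rather than a missing item.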