narcolepticchicken committed on
Commit
db1085e
·
verified ·
1 Parent(s): e4cea93

Upload docs/ROADMAP.md

Files changed (1)
  1. docs/ROADMAP.md +78 -256
docs/ROADMAP.md CHANGED
@@ -1,256 +1,78 @@
1
- # Agent Cost Optimizer: Roadmap
2
-
3
- ## Current Status (v1.0)
4
-
5
- ✅ 10 core modules implemented and benchmarked
5
- ✅ 28% cost reduction at iso-quality (94.3% success rate)
6
- ✅ Synthetic benchmark (2K traces, 19 scenarios)
7
- ✅ Learned router skeleton (trainable, not yet trained on real data)
8
- ✅ Deployment guide, model card, technical report
9
- ✅ Gradio dashboard code (not yet deployed)
11
-
12
- ---
13
-
14
- ## Phase 1: Learned Router (Immediate Priority)
15
-
16
- **Goal:** Replace heuristic router with classifier trained on real traces.
17
-
18
- ### Why This Is #1
19
- The ablation study shows the model router is the most critical module. A trained classifier could:
20
- - Increase savings from 28% to 35–40%
21
- - Reduce false escalations by 50%
22
- - Enable task-specific routing (code → Claude, reasoning → o3-mini)
23
-
24
- ### Implementation
25
- 1. Collect 10K+ real traces with full telemetry
26
- 2. Extract (request_features, optimal_tier) pairs
27
- 3. Train simple logistic regression / small neural classifier
28
- 4. Or: Train with GRPO using cost-adjusted reward (BAAR-style boundary-guided routing)
29
- 5. A/B test against heuristic router
30
- 6. Fall back to heuristic when confidence < 0.7
31
-
32
- **Estimated effort:** 2–3 days
33
- **Expected impact:** +7–12pp cost savings, <1pp quality regression
34
-
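The fallback rule in step 6 of the Phase 1 plan can be sketched as below. This is a minimal illustration, not the project's router: `route_with_fallback`, `toy_predict_proba`, and `toy_heuristic` are hypothetical names, and the toy probabilities are made up. Only the 0.7 confidence floor comes from the roadmap.

```python
from typing import Callable, Dict, Tuple

CONFIDENCE_FLOOR = 0.7  # fallback threshold taken from step 6 of the plan


def route_with_fallback(
    request: dict,
    predict_proba: Callable[[dict], Dict[int, float]],
    heuristic: Callable[[dict], int],
    floor: float = CONFIDENCE_FLOOR,
) -> Tuple[int, str]:
    """Return (tier, source); trust the learned router only when it is confident."""
    probs = predict_proba(request)
    tier, confidence = max(probs.items(), key=lambda kv: kv[1])
    if confidence < floor:
        return heuristic(request), "heuristic"
    return tier, "learned"


# Hypothetical stand-ins; a trained classifier and the real heuristic would replace these.
def toy_predict_proba(request: dict) -> Dict[int, float]:
    if request["tokens"] > 1000:
        return {1: 0.05, 2: 0.15, 3: 0.80}  # confident: tier 3
    return {1: 0.40, 2: 0.35, 3: 0.25}      # uncertain: no tier reaches 0.7


def toy_heuristic(request: dict) -> int:
    return 2
```

Returning the decision source alongside the tier makes it easy to log how often the fallback fires during the A/B test in step 5.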
35
- ---
36
-
37
- ## Phase 2: Real Interactive Benchmark
38
-
39
- **Goal:** Evaluate ACO against real agent tasks with actual model calls.
40
-
41
- ### Why Synthetic Is Not Enough
42
- Our synthetic benchmark assumes fixed success probabilities per tier. Real-world model behavior:
43
- - Is non-stationary (models improve, new models release)
44
- - Depends on prompt engineering, not just model strength
45
- - Has provider-specific quirks (Claude vs GPT vs DeepSeek)
46
- - Is affected by rate limits, timeouts, transient failures
47
-
48
- ### Implementation
49
- 1. **Coding benchmark:** Integrate with SWE-bench lite (500 tasks)
50
- - Run with cheap model first, escalate on failure
51
- - Measure: pass@1, LLM calls, cost, time
52
- 2. **Tool-use benchmark:** Integrate with BFCL (2,000 function-calling tasks)
53
- - Measure: tool accuracy, missed tools, cost
54
- 3. **Research benchmark:** 100 real research questions
55
- - Run with retrieval + cheap model vs retrieval + frontier
56
- - Human evaluation: source quality, hallucination, coverage
57
- 4. **Long-horizon benchmark:** 50 multi-step tasks (WebArena-style)
58
- - Measure: task completion, cost growth over steps, cache hit rate
59
-
60
- **Estimated effort:** 1 week
61
- **Expected impact:** Calibrate all module thresholds, discover edge cases
62
-
63
- ---
64
-
65
- ## Phase 3: Online Learning Loop
66
-
67
- **Goal:** Update routing probabilities, tool gate thresholds, and doom detector thresholds from live trace feedback.
68
-
69
- ### Why Static Policies Fail
70
- - Model capabilities improve (GPT-4o → GPT-5)
71
- - New cheap models release (GPT-4o-mini → even cheaper)
72
- - Task mix changes over time
73
- - User behavior shifts
74
-
75
- ### Implementation
76
- 1. **Trace ingestion pipeline:** Collect traces from production runs
77
- 2. **Outcome labeling:** Success/failure/escalation labels from user feedback
78
- 3. **Online update:** Update router classifier weights weekly
79
- 4. **Thompson sampling:** Explore new routing decisions with small probability
80
- 5. **Drift detection:** Alert when success rate drops >5pp for a task type
81
-
82
- **Estimated effort:** 1 week
83
- **Expected impact:** Maintains 28%+ savings as models/task mix evolve
84
-
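One way to read the Thompson sampling step in Phase 3 is a Beta-Bernoulli sampler over model tiers. The sketch below is an assumption about how it could look: the class name, the tier names, the costs, and the "sampled cost per success" selection rule are all illustrative and do not come from the roadmap.

```python
import random


class ThompsonTierSampler:
    """Beta-Bernoulli Thompson sampling over tiers, scored by cost per success."""

    def __init__(self, tier_costs, seed=None):
        self.tier_costs = dict(tier_costs)
        # [alpha, beta] per tier, starting from a uniform Beta(1, 1) prior.
        self.stats = {tier: [1.0, 1.0] for tier in self.tier_costs}
        self.rng = random.Random(seed)

    def choose(self):
        # Draw a plausible success rate per tier, then pick the tier whose
        # sampled cost per successful task is lowest.
        def sampled_cost_per_success(tier):
            alpha, beta = self.stats[tier]
            p = self.rng.betavariate(alpha, beta)
            return self.tier_costs[tier] / max(p, 1e-9)

        return min(self.tier_costs, key=sampled_cost_per_success)

    def update(self, tier, success):
        # A success increments alpha, a failure increments beta.
        self.stats[tier][0 if success else 1] += 1.0
```

Because each choice uses a fresh posterior draw, tiers with little data still get tried occasionally, which is the exploration behaviour the plan asks for.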
85
- ---
86
-
87
- ## Phase 4: Verifier Cascading
88
-
89
- **Goal:** Use cheap verifier first, escalate to expensive verifier only on disagreement.
90
-
91
- ### Current State
92
- - Verifier budgeter decides WHETHER to verify
93
- - When it decides YES, it always uses the same verifier model
94
-
95
- ### Improvement
96
- - **Tier 1:** Simple regex/rule-based checks (free)
97
- - **Tier 2:** Cheap model verifier (GPT-4o-mini, $0.15/M tok)
98
- - **Tier 3:** Expensive verifier (GPT-4o, $2.5/M tok), used only when tier 2 flags an issue
99
- - **Consensus mode:** Run cheap + medium verifier, escalate if disagree
100
-
101
- **Estimated impact:** 60–80% verifier cost reduction on low-risk tasks
102
-
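The three-tier cascade from Phase 4 could be wired up roughly as follows. This is a sketch under stated assumptions: each verifier is modelled as a function returning `True` when the output looks acceptable, a failed rule check rejects immediately, and the toy verifiers are placeholders rather than real model calls.

```python
from typing import Callable, Tuple

Verifier = Callable[[str], bool]  # True means "output looks acceptable"


def cascade_verify(
    output: str,
    rule_check: Verifier,
    cheap_verifier: Verifier,
    expensive_verifier: Verifier,
) -> Tuple[bool, str]:
    """Return (accepted, deciding_tier), calling costlier tiers only when needed."""
    if not rule_check(output):          # Tier 1: free rule-based checks
        return False, "rules"
    if cheap_verifier(output):          # Tier 2: cheap model verifier
        return True, "cheap"
    # Tier 3: the expensive verifier runs only after tier 2 flags an issue.
    return expensive_verifier(output), "expensive"


# Hypothetical stand-ins for real checks and model-backed verifiers.
def toy_rules(output: str) -> bool:
    return bool(output.strip())


def toy_cheap(output: str) -> bool:
    return "TODO" not in output


def toy_expensive(output: str) -> bool:
    return output.endswith(".")
```

Recording which tier made the decision gives the data needed to check the estimated verifier cost reduction against real traffic.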
103
- ---
104
-
105
- ## Phase 5: Cross-Provider Cost Optimization
106
-
107
- **Goal:** Route to cheapest provider offering adequate model tier.
108
-
109
- ### Providers with Similar-Tier Models
110
-
111
- | Tier | OpenAI | Anthropic | DeepSeek | Google | Together | Fireworks |
112
- |------|--------|-----------|----------|--------|----------|-----------|
113
- | 2 (Cheap) | GPT-4o-mini | Haiku | DeepSeek-V3 | Gemini-Flash | Llama-3.1-8B | Mixtral-8x7B |
114
- | 3 (Medium) | GPT-4o | Sonnet | DeepSeek-V3 | Gemini-Pro | Llama-3.1-70B | Qwen-2.5-72B |
115
- | 4 (Frontier) | o1 | Opus | - | Gemini-Ultra | Llama-3.1-405B | - |
116
-
117
- ### Implementation
118
- 1. Maintain provider pricing API (auto-fetch current prices)
119
- 2. Add provider latency/availability monitoring
120
- 3. Route to cheapest available tier-adequate provider
121
- 4. Fallback chain: primary → secondary → tertiary
122
- 5. Cache routing decisions per provider for stability
123
-
124
- **Estimated impact:** Additional 5–10% cost reduction on multi-provider setups
125
-
126
- ---
127
-
128
- ## Phase 6: KV Cache Sharing
129
-
130
- **Goal:** Share prefix KV caches across concurrent agent runs using identical system prompts.
131
-
132
- ### How It Works
133
- - Many agent runs share the same system prompt + tool descriptions
134
- - vLLM and SGLang support prefix caching / KV cache sharing
135
- - Running N agents concurrently → cache system prompt once, reuse N-1 times
136
-
137
- ### Implementation
138
- 1. Integrate with vLLM/SGLang backend for local models
139
- 2. Group agent runs by identical prefix hash
140
- 3. Pre-fill shared prefix once, append per-run suffix
141
- 4. Track cache hit rate per prefix group
142
- 5. Apply to multi-tenant agent deployments
143
-
144
- **Estimated impact:** 20–40% cost reduction on concurrent agent farms
145
-
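Steps 2 and 4 of the Phase 6 plan (grouping runs by prefix hash and counting reuse) can be illustrated without any serving backend. The sketch below only does the bookkeeping; it does not call vLLM or SGLang, and the run dictionary layout (`id`, `system_prompt`, `tools`) is a hypothetical format chosen for the example.

```python
import hashlib
from collections import defaultdict


def prefix_key(system_prompt: str, tool_descriptions: str) -> str:
    """Stable hash of the shared prefix (system prompt plus tool descriptions)."""
    h = hashlib.sha256()
    h.update(system_prompt.encode("utf-8"))
    h.update(b"\x00")  # separator so ("ab", "c") and ("a", "bc") hash differently
    h.update(tool_descriptions.encode("utf-8"))
    return h.hexdigest()


def group_runs(runs):
    """Map prefix hash -> list of run ids that share that exact prefix."""
    groups = defaultdict(list)
    for run in runs:
        groups[prefix_key(run["system_prompt"], run["tools"])].append(run["id"])
    return dict(groups)


def reusable_prefills(groups) -> int:
    """With N runs in a group, the prefix is filled once and reused N-1 times."""
    return sum(len(ids) - 1 for ids in groups.values())
```

Only byte-identical prefixes land in the same group, so any per-run text (timestamps, user names) needs to stay out of the hashed portion for the grouping to pay off.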
146
- ---
147
-
148
- ## Phase 7: Speculative Agent Actions
149
-
150
- **Goal:** Generate next N actions with cheap model, validate with frontier only on divergence.
151
-
152
- ### How It Works
153
- 1. Cheap model generates next action sequence (plan + tool calls)
154
- 2. Frontier model validates only the *divergent* or *high-risk* actions
155
- 3. If cheap model plan matches frontier with >0.9 similarity → use cheap
156
- 4. If divergence > threshold → regenerate with frontier
157
-
158
- ### Use Cases
159
- - Multi-step coding workflows (cheap generates plan, frontier validates critical steps)
160
- - Research workflows (cheap suggests search queries, frontier validates synthesis)
161
- - Tool-heavy workflows (cheap predicts tool sequence, frontier validates data transformations)
162
-
163
- **Estimated impact:** 15–25% cost reduction on multi-step tasks
164
-
165
- ---
166
-
167
- ## Phase 8: Confidence Calibration with Process Reward Models
168
-
169
- **Goal:** Train a per-step success predictor for dynamic compute allocation.
170
-
171
- ### Current State
172
- - Router uses task-level difficulty classification
173
- - Does not adapt compute within a task based on step-level confidence
174
-
175
- ### Improvement
176
- 1. Train a small PRM (Process Reward Model) on agent traces
177
- 2. At each step, PRM predicts P(success | current state)
178
- 3. If P(success) < 0.5 → escalate model, retrieve more context, or call verifier
179
- 4. If P(success) > 0.95 → use cheaper model for next step
180
- 5. Dynamically allocate compute based on real-time trajectory quality
181
-
182
- **Estimated impact:** 10–15% cost reduction with quality preservation
183
-
184
- ---
185
-
186
- ## Phase 9: Human-in-the-Loop Integration
187
-
188
- **Goal:** Learn from human corrections to improve routing and reduce future mistakes.
189
-
190
- ### Implementation
191
- 1. When human corrects agent output → label the trace
192
- 2. If human says "should have used stronger model" → update routing probabilities
193
- 3. If human says "didn't need to call that tool" → update tool gate thresholds
194
- 4. If human says "stopped too early" → update doom detector thresholds
195
- 5. Feed corrections into online learning loop (Phase 3)
196
-
197
- **Estimated impact:** Reduces false-DONE rate and missed escalation rate by 30–50%
198
-
199
- ---
200
-
201
- ## Phase 10: Meta-Learning Across Tasks
202
-
203
- **Goal:** Learn task-specific optimal policies from a small number of examples.
204
-
205
- ### How It Works
206
- - New task type appears (e.g., "medical diagnosis assistant")
207
- - ACO has no prior traces for this task type
208
- - Meta-learner transfers policies from similar task types (e.g., legal → medical, both high-risk)
209
- - Few-shot calibrates thresholds from first 10–20 traces
210
-
211
- ### Implementation
212
- 1. Embed task types in semantic space
213
- 2. Find k-nearest task types with sufficient trace history
214
- 3. Transfer router weights, tool gate thresholds, verifier rules
215
- 4. Bayesian update with new task traces
216
- 5. Converge to task-specific policy within 50 traces
217
-
218
- **Estimated impact:** Reduces cold-start period from 100 traces to 20 traces
219
-
220
- ---
221
-
222
- ## Summary: Priority Ranking
223
-
224
- | Phase | Impact | Effort | Priority |
225
- |-------|--------|--------|----------|
226
- | 1. Learned Router | ⭐⭐⭐⭐⭐ | Medium | **#1** |
227
- | 2. Real Benchmark | ⭐⭐⭐⭐⭐ | High | #2 |
228
- | 3. Online Learning | ⭐⭐⭐⭐⭐ | High | #3 |
229
- | 4. Verifier Cascading | ⭐⭐⭐⭐ | Low | #4 |
230
- | 5. Cross-Provider | ⭐⭐⭐⭐ | Medium | #5 |
231
- | 6. KV Cache Sharing | ⭐⭐⭐ | High | #6 |
232
- | 7. Speculative Actions | ⭐⭐⭐⭐ | High | #7 |
233
- | 8. PRM Calibration | ⭐⭐⭐⭐ | High | #8 |
234
- | 9. Human-in-the-Loop | ⭐⭐⭐ | Medium | #9 |
235
- | 10. Meta-Learning | ⭐⭐⭐ | High | #10 |
236
-
237
- ---
238
-
239
- ## Success Metrics for Each Phase
240
-
241
- Track these metrics for every phase:
242
-
243
- 1. **Cost per successful task** (primary)
244
- 2. **Cost per artifact** (secondary)
245
- 3. **Task success rate** (must not regress)
246
- 4. **False-DONE rate** (must not increase)
247
- 5. **Unsafe cheap-model miss rate** (must be <2%)
248
- 6. **Missed escalation rate** (must be <5%)
249
- 7. **Cache hit rate** (target >60%)
250
- 8. **Tool call efficiency** (used/called ratio >80%)
251
- 9. **Verifier pass rate** (target >85%)
252
- 10. **Latency per task** (must not increase >20%)
253
-
254
- ---
255
-
256
- *Last updated: 2025-07-05*
 
1
+ # ACO Roadmap
2
+
3
+ ## Completed (v1-v11)
4
+
5
+ - [x] Normalized trace schema
6
+ - [x] Synthetic trace generator (10K traces)
7
+ - [x] Cost telemetry collector
8
+ - [x] Task cost classifier
9
+ - [x] Model cascade router (XGBoost per-tier)
10
+ - [x] Context budgeter
11
+ - [x] Cache-aware prompt layout
12
+ - [x] Tool-use cost gate
13
+ - [x] Verifier budgeter
14
+ - [x] Retry/recovery optimizer
15
+ - [x] Meta-tool miner
16
+ - [x] Early termination detector
17
+ - [x] Execution-feedback router (entropy cascade)
18
+ - [x] Per-step routing
19
+ - [x] Real benchmark evaluation (SWE-bench, BFCL)
20
+ - [x] Ablation study on real data
21
+ - [x] Literature review
22
+ - [x] Deployment guide
23
+ - [x] Technical blog post
24
+ - [x] Final report
25
+ - [x] Model cards
26
+
27
+ ## In Progress
28
+
29
+ - [ ] Fine-tuned DistilBERT router (cloud job training on SPROUT)
30
+ - [ ] Gradio dashboard with real benchmark numbers
31
+
32
+ ## Next Priority (CPU-friendly)
33
+
34
+ - [ ] Conformal calibration of escalation thresholds
35
+ - [ ] Cost-quality Pareto frontier visualization
36
+ - [ ] JSON schema validation for traces
37
+ - [ ] Unit tests for all 11 modules
38
+ - [ ] Integration test suite
39
+ - [ ] Example notebooks
40
+ - [ ] Provider adapter examples (OpenAI, Anthropic, local)
41
+ - [ ] Config file validator
42
+ - [ ] CLI improvements (batch routing, cost estimation)
43
+
44
+ ## Next Priority (GPU needed)
45
+
46
+ - [ ] Execution-feedback with real model logprobs
47
+ - [ ] Best-of-N cheap sampling with reward model
48
+ - [ ] Fine-tuned BERT per-step router
49
+ - [ ] Process reward model for selective verification
50
+ - [ ] Real agent benchmarks (SWE-bench Live, WebArena)
51
+
52
+ ## Long-term
53
+
54
+ - [ ] Learned context selector (vs heuristic budgeter)
55
+ - [ ] Workflow mining from real traces
56
+ - [ ] Online learning from new traces
57
+ - [ ] Multi-agent cost optimization
58
+ - [ ] Provider-aware routing (cost/latency/availability)
59
+ - [ ] Budget-constrained decoding
60
+ - [ ] Cross-task transfer learning
61
+
62
+ ## Known Limitations
63
+
64
+ - Router trained on SPROUT + SWE-Router only (need more domains)
65
+ - Execution feedback uses simulated logprobs (need real model outputs)
66
+ - No conformal guarantees on quality (hand-tuned thresholds)
67
+ - Per-step routing not yet integrated with v11 XGBoost
68
+ - Cache-aware layout not benchmarked with real providers
69
+ - No real agent harness integration tested end-to-end
70
+
71
+ ## Headroom
72
+
73
+ An oracle on SWE-bench shows that an 80.3% cost reduction is achievable. v11 achieves 36.9%. The remaining 43.4 percentage points would come from:
74
+ - Better per-step routing (~10%)
75
+ - Real execution feedback (~10%)
76
+ - Best-of-N cheap sampling (~8%)
77
+ - Conformal calibration (~5%)
78
+ - More training data from more domains (~10%)
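The headroom figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the Headroom section; the variable names are illustrative, and the per-item values are the approximate estimates from the list, so the small unattributed remainder is expected.

```python
ORACLE_REDUCTION = 80.3  # % cost reduction achieved by the oracle on SWE-bench
V11_REDUCTION = 36.9     # % cost reduction achieved by v11

# Approximate contributions listed in the Headroom section (percentage points).
ESTIMATES = {
    "better per-step routing": 10,
    "real execution feedback": 10,
    "best-of-N cheap sampling": 8,
    "conformal calibration": 5,
    "more training data from more domains": 10,
}

gap = round(ORACLE_REDUCTION - V11_REDUCTION, 1)  # total headroom
attributed = sum(ESTIMATES.values())              # headroom the list accounts for
unattributed = round(gap - attributed, 1)         # left over after rounding

print(f"gap={gap} pp, attributed={attributed} pp, unattributed={unattributed} pp")
```

The listed estimates account for 43 of the 43.4 percentage points, which is consistent with each item being rounded.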