# Agent Cost Optimizer Roadmap

## Current Status (v1.0)

- ✅ 10 core modules implemented and benchmarked
- ✅ 28% cost reduction at iso-quality (94.3% success rate)
- ✅ Synthetic benchmark (2K traces, 19 scenarios)
- ✅ Learned router skeleton (trainable, not yet trained on real data)
- ✅ Deployment guide, model card, technical report
- ✅ Gradio dashboard code (not yet deployed)

---

## Phase 1: Learned Router (Immediate Priority)

**Goal:** Replace the heuristic router with a classifier trained on real traces.

### Why This Is #1
The ablation study shows the model router is the most critical module. A trained classifier could:
- Increase savings from 28% to 35–40%
- Reduce false escalations by 50%
- Enable task-specific routing (code → Claude, reasoning → o3-mini)

### Implementation
1. Collect 10K+ real traces with full telemetry
2. Extract (request_features, optimal_tier) pairs
3. Train a simple logistic regression or small neural classifier
4. Alternatively, train with GRPO using a cost-adjusted reward (BAAR-style boundary-guided routing)
5. A/B test against the heuristic router
6. Fall back to the heuristic when confidence < 0.7 (see the sketch after this list)

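A minimal sketch of steps 3 and 6, assuming one feature vector per request and tier labels derived from the cheapest tier that succeeded in each trace; `LearnedRouter`, `HEURISTIC_TIER`, and `CONFIDENCE_FLOOR` are illustrative names, not existing code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

HEURISTIC_TIER = 2       # assumed default tier from the existing heuristic router
CONFIDENCE_FLOOR = 0.7   # below this, defer to the heuristic (step 6)

class LearnedRouter:
    """Multinomial tier classifier with a heuristic fallback."""

    def __init__(self):
        self.clf = LogisticRegression(max_iter=1000)

    def fit(self, features: np.ndarray, optimal_tiers: np.ndarray) -> None:
        # features: one row per trace; optimal_tiers: cheapest tier that succeeded
        self.clf.fit(features, optimal_tiers)

    def route(self, features: np.ndarray, heuristic_tier: int = HEURISTIC_TIER) -> int:
        probs = self.clf.predict_proba(features.reshape(1, -1))[0]
        best = int(np.argmax(probs))
        if probs[best] < CONFIDENCE_FLOOR:
            return heuristic_tier          # low confidence: trust the heuristic
        return int(self.clf.classes_[best])
```
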
**Estimated effort:** 2–3 days
**Expected impact:** +7–12pp cost savings, <1pp quality regression

---

## Phase 2: Real Interactive Benchmark

**Goal:** Evaluate ACO against real agent tasks with actual model calls.

### Why Synthetic Is Not Enough
Our synthetic benchmark assumes fixed success probabilities per tier. Real-world model behavior:
- Is non-stationary (models improve, new models release)
- Depends on prompt engineering, not just model strength
- Has provider-specific quirks (Claude vs GPT vs DeepSeek)
- Is affected by rate limits, timeouts, and transient failures

### Implementation
1. **Coding benchmark:** Integrate with SWE-bench Lite (300 tasks)
   - Run with a cheap model first, escalate on failure (escalation loop sketched after this list)
   - Measure: pass@1, LLM calls, cost, time
2. **Tool-use benchmark:** Integrate with BFCL (2,000 function-calling tasks)
   - Measure: tool accuracy, missed tools, cost
3. **Research benchmark:** 100 real research questions
   - Run retrieval + cheap model vs retrieval + frontier model
   - Human evaluation: source quality, hallucination, coverage
4. **Long-horizon benchmark:** 50 multi-step tasks (WebArena-style)
   - Measure: task completion, cost growth over steps, cache hit rate

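A sketch of the escalate-on-failure harness for the coding benchmark; `run_task` and `check_solution` stand in for the real SWE-bench execution and test harness, and the tier ladder is illustrative:

```python
from dataclasses import dataclass

TIER_LADDER = ["gpt-4o-mini", "gpt-4o", "o1"]  # assumed cheap-to-frontier order

@dataclass
class BenchResult:
    passed: bool
    cost_usd: float
    llm_calls: int

def run_with_escalation(task, run_task, check_solution) -> BenchResult:
    """Attempt the task on each tier in order, stopping at the first pass."""
    total_cost, calls = 0.0, 0
    for model in TIER_LADDER:
        patch, cost = run_task(task, model)  # one attempt per tier
        total_cost += cost
        calls += 1
        if check_solution(task, patch):      # counts toward pass@1 for this policy
            return BenchResult(True, total_cost, calls)
    return BenchResult(False, total_cost, calls)
```
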
**Estimated effort:** 1 week
**Expected impact:** Calibrate all module thresholds, discover edge cases

---

## Phase 3: Online Learning Loop

**Goal:** Update routing probabilities, tool gate thresholds, and doom detector thresholds from live trace feedback.

### Why Static Policies Fail
- Model capabilities improve (GPT-4o → GPT-5)
- New cheap models release (GPT-4o-mini → even cheaper)
- Task mix changes over time
- User behavior shifts

### Implementation
1. **Trace ingestion pipeline:** Collect traces from production runs
2. **Outcome labeling:** Derive success/failure/escalation labels from user feedback
3. **Online update:** Update router classifier weights weekly
4. **Thompson sampling:** Explore new routing decisions with small probability (see the sketch after this list)
5. **Drift detection:** Alert when the success rate drops >5pp for a task type

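A minimal Thompson-sampling sketch for step 4, modeling each tier's success rate as a Beta distribution; in practice the plain Bernoulli outcome would be replaced by a cost-adjusted reward, and all names here are illustrative:

```python
import random

class TierBandit:
    """Beta-Bernoulli Thompson sampling over routing tiers."""

    def __init__(self, tiers: list[int]):
        # Beta(1, 1) prior per tier, stored as [successes + 1, failures + 1]
        self.params = {t: [1, 1] for t in tiers}

    def pick(self) -> int:
        # Sample a plausible success rate per tier and route to the best draw;
        # uncertain tiers occasionally win the draw, which is the exploration.
        draws = {t: random.betavariate(a, b) for t, (a, b) in self.params.items()}
        return max(draws, key=draws.get)

    def update(self, tier: int, success: bool) -> None:
        self.params[tier][0 if success else 1] += 1
```
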
**Estimated effort:** 1 week
**Expected impact:** Maintains 28%+ savings as models and the task mix evolve

---

## Phase 4: Verifier Cascading

**Goal:** Use a cheap verifier first, escalating to an expensive verifier only on disagreement.

### Current State
- The verifier budgeter decides *whether* to verify
- When it decides yes, it always uses the same verifier model

### Improvement
- **Tier 1:** Simple regex/rule-based checks (free)
- **Tier 2:** Cheap model verifier (GPT-4o-mini, $0.15/M tok)
- **Tier 3:** Expensive verifier (GPT-4o, $2.50/M tok), invoked only when Tier 2 flags an issue
- **Consensus mode:** Run cheap + medium verifiers, escalate if they disagree (cascade sketched after this list)

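A sketch of the three-tier cascade, assuming each verifier returns a boolean verdict; `rule_check`, `cheap_verify`, and `frontier_verify` are hypothetical callables, not existing module APIs:

```python
def cascade_verify(output: str, rule_check, cheap_verify, frontier_verify) -> bool:
    # Tier 1: free structural checks (regex, schema, does-it-compile).
    if not rule_check(output):
        return False
    # Tier 2: cheap model verifier.
    if cheap_verify(output):
        return True
    # Tier 3: expensive verifier, reached only because Tier 2 flagged an issue.
    return frontier_verify(output)
```
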
**Estimated impact:** 60–80% verifier cost reduction on low-risk tasks

---

## Phase 5: Cross-Provider Cost Optimization

**Goal:** Route to the cheapest provider offering an adequate model tier.

### Providers with Similar-Tier Models

| Tier | OpenAI | Anthropic | DeepSeek | Google | Together | Fireworks |
|------|--------|-----------|----------|--------|----------|-----------|
| 2 (Cheap) | GPT-4o-mini | Haiku | DeepSeek-V3 | Gemini-Flash | Llama-3.1-8B | Mixtral-8x7B |
| 3 (Medium) | GPT-4o | Sonnet | DeepSeek-V3 | Gemini-Pro | Llama-3.1-70B | Qwen-2.5-72B |
| 4 (Frontier) | o1 | Opus | – | Gemini-Ultra | Llama-3.1-405B | – |

### Implementation
1. Maintain a provider pricing API (auto-fetch current prices)
2. Add provider latency/availability monitoring
3. Route to the cheapest available tier-adequate provider (see the sketch after this list)
4. Fallback chain: primary → secondary → tertiary
5. Cache routing decisions per provider for stability

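An illustrative cheapest-adequate-provider lookup covering steps 1, 3, and 4; the prices are placeholders that would come from the live pricing feed, not quoted rates:

```python
# (provider, tier) -> input price per M tokens; placeholder values, assumed
# to be refreshed by the pricing API in step 1.
PRICE_PER_M_TOK = {
    ("openai", 2): 0.15, ("anthropic", 2): 0.25, ("deepseek", 2): 0.15,
    ("openai", 3): 2.50, ("anthropic", 3): 3.00, ("deepseek", 3): 1.10,
}

def fallback_chain(tier: int, available: set[str]) -> list[str]:
    """All available providers at this tier, cheapest first (step 4's chain)."""
    candidates = [(price, prov) for (prov, t), price in PRICE_PER_M_TOK.items()
                  if t == tier and prov in available]
    return [prov for _, prov in sorted(candidates)]
```

For example, `fallback_chain(2, {"openai", "deepseek"})` yields the cheapest adequate provider first, with the rest as fallbacks.
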
**Estimated impact:** Additional 5–10% cost reduction on multi-provider setups

---

## Phase 6: KV Cache Sharing

**Goal:** Share prefix KV caches across concurrent agent runs that use identical system prompts.

### How It Works
- Many agent runs share the same system prompt + tool descriptions
- vLLM and SGLang support prefix caching / KV cache sharing
- Running N agents concurrently → cache the system prompt once, reuse it N-1 times

### Implementation
1. Integrate with a vLLM/SGLang backend for local models
2. Group agent runs by identical prefix hash (see the sketch after this list)
3. Prefill the shared prefix once, append a per-run suffix
4. Track cache hit rate per prefix group
5. Apply to multi-tenant agent deployments

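A sketch of step 2, bucketing concurrent runs by a hash of their shared prefix so the serving backend can prefill it once per group; the `Run` type is an illustrative stand-in for the real run record:

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Run:
    system_prompt: str
    tool_descriptions: str
    user_message: str   # per-run suffix, appended after the shared prefix

def group_by_prefix(runs: list[Run]) -> dict[str, list[Run]]:
    groups: dict[str, list[Run]] = defaultdict(list)
    for run in runs:
        prefix = run.system_prompt + run.tool_descriptions
        key = hashlib.sha256(prefix.encode()).hexdigest()
        groups[key].append(run)   # every run in a group shares one KV prefix
    return dict(groups)
```
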
**Estimated impact:** 20–40% cost reduction on concurrent agent farms

---

## Phase 7: Speculative Agent Actions

**Goal:** Generate the next N actions with a cheap model; validate with the frontier model only on divergence.

### How It Works
1. The cheap model generates the next action sequence (plan + tool calls)
2. The frontier model validates only the *divergent* or *high-risk* actions
3. If the cheap model's plan matches the frontier's with >0.9 similarity → use the cheap plan
4. If divergence exceeds the threshold → regenerate with the frontier model (see the sketch after this list)

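One possible realization of the accept/regenerate rule in steps 3–4, treating plan similarity as a pluggable function (e.g. embedding cosine); all the callables and names here are assumptions, not existing APIs:

```python
ACCEPT_THRESHOLD = 0.9   # similarity above which the cheap plan is accepted

def speculative_plan(state, cheap_generate, frontier_review, similarity):
    draft = cheap_generate(state)              # cheap model drafts the plan
    reference = frontier_review(state, draft)  # frontier produces its own view
    if similarity(draft, reference) > ACCEPT_THRESHOLD:
        return draft                           # plans agree: keep the cheap draft
    return reference                           # diverged: use the frontier plan
```
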
### Use Cases
- Multi-step coding workflows (cheap model generates the plan, frontier validates critical steps)
- Research workflows (cheap model suggests search queries, frontier validates the synthesis)
- Tool-heavy workflows (cheap model predicts the tool sequence, frontier validates data transformations)

**Estimated impact:** 15–25% cost reduction on multi-step tasks

---

## Phase 8: Confidence Calibration with Process Reward Models

**Goal:** Train a per-step success predictor for dynamic compute allocation.

### Current State
- The router uses task-level difficulty classification
- It does not adapt compute within a task based on step-level confidence

### Improvement
1. Train a small PRM (Process Reward Model) on agent traces
2. At each step, the PRM predicts P(success | current state)
3. If P(success) < 0.5 → escalate the model, retrieve more context, or call a verifier
4. If P(success) > 0.95 → use a cheaper model for the next step
5. Dynamically allocate compute based on real-time trajectory quality (gate sketched after this list)

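A sketch of the step-level gate in steps 3–4, assuming the trained PRM exposes a `p_success(state) -> float` method; the thresholds are the ones quoted above, the tier bounds are illustrative:

```python
ESCALATE_BELOW = 0.5    # step 3: trajectory looks shaky
DOWNSHIFT_ABOVE = 0.95  # step 4: trajectory looks safe

def allocate_tier(prm, state, current_tier: int,
                  min_tier: int = 2, max_tier: int = 4) -> int:
    p = prm.p_success(state)                    # P(success | current trajectory)
    if p < ESCALATE_BELOW:
        return min(current_tier + 1, max_tier)  # struggling: escalate
    if p > DOWNSHIFT_ABOVE:
        return max(current_tier - 1, min_tier)  # cruising: spend less
    return current_tier
```
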
**Estimated impact:** 10–15% cost reduction with quality preservation

---

## Phase 9: Human-in-the-Loop Integration

**Goal:** Learn from human corrections to improve routing and reduce future mistakes.

### Implementation
1. When a human corrects agent output → label the trace
2. If the human says "should have used a stronger model" → update routing probabilities
3. If the human says "didn't need to call that tool" → update tool gate thresholds
4. If the human says "stopped too early" → update doom detector thresholds
5. Feed corrections into the online learning loop (Phase 3)

**Estimated impact:** Reduces false-DONE rate and missed escalation rate by 30–50%

---

## Phase 10: Meta-Learning Across Tasks

**Goal:** Learn task-specific optimal policies from a small number of examples.

### How It Works
- A new task type appears (e.g., "medical diagnosis assistant")
- ACO has no prior traces for this task type
- A meta-learner transfers policies from similar task types (e.g., legal → medical, both high-risk)
- Few-shot calibration tunes thresholds from the first 10–20 traces

### Implementation
1. Embed task types in a semantic space
2. Find the k nearest task types with sufficient trace history
3. Transfer router weights, tool gate thresholds, and verifier rules (warm start sketched after this list)
4. Bayesian-update with new task traces
5. Converge to a task-specific policy within 50 traces

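A sketch of steps 1–3: embed task types and warm-start a new task's thresholds from its nearest neighbors. The embedding inputs and flat threshold dictionaries are assumptions about the policy format:

```python
import numpy as np

def warm_start_policy(new_emb: np.ndarray,
                      known_embs: dict[str, np.ndarray],
                      policies: dict[str, dict[str, float]],
                      k: int = 3) -> dict[str, float]:
    """Average thresholds over the k most similar task types (cosine similarity)."""
    def cos(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    nearest = sorted(known_embs, key=lambda t: cos(new_emb, known_embs[t]),
                     reverse=True)[:k]
    keys = policies[nearest[0]].keys()
    return {key: float(np.mean([policies[t][key] for t in nearest]))
            for key in keys}
```

The transferred values would then serve as the prior for the per-task Bayesian updates in step 4.
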
**Estimated impact:** Reduces cold-start period from 100 traces to 20 traces

---

## Summary: Priority Ranking

| Phase | Impact | Effort | Priority |
|-------|--------|--------|----------|
| 1. Learned Router | ⭐⭐⭐⭐⭐ | Medium | **#1** |
| 2. Real Benchmark | ⭐⭐⭐⭐⭐ | High | #2 |
| 3. Online Learning | ⭐⭐⭐⭐⭐ | High | #3 |
| 4. Verifier Cascading | ⭐⭐⭐⭐ | Low | #4 |
| 5. Cross-Provider | ⭐⭐⭐⭐ | Medium | #5 |
| 6. KV Cache Sharing | ⭐⭐⭐ | High | #6 |
| 7. Speculative Actions | ⭐⭐⭐⭐ | High | #7 |
| 8. PRM Calibration | ⭐⭐⭐⭐ | High | #8 |
| 9. Human-in-the-Loop | ⭐⭐⭐ | Medium | #9 |
| 10. Meta-Learning | ⭐⭐⭐ | High | #10 |

---

## Success Metrics for Each Phase

Track these metrics for every phase (the two cost metrics are sketched after this list):

1. **Cost per successful task** (primary)
2. **Cost per artifact** (secondary)
3. **Task success rate** (must not regress)
4. **False-DONE rate** (must not increase)
5. **Unsafe cheap-model miss rate** (must be <2%)
6. **Missed escalation rate** (must be <5%)
7. **Cache hit rate** (target >60%)
8. **Tool call efficiency** (used/called ratio >80%)
9. **Verifier pass rate** (target >85%)
10. **Latency per task** (must not increase >20%)

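A hypothetical tracker for the two primary cost metrics, showing the intended arithmetic (total cost, including failed attempts, divided by successes or artifacts); the field names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class PhaseMetrics:
    total_cost_usd: float = 0.0
    successes: int = 0
    artifacts: int = 0

    def record(self, cost_usd: float, success: bool, artifacts: int = 0) -> None:
        self.total_cost_usd += cost_usd   # failed attempts still count toward cost
        self.successes += int(success)
        self.artifacts += artifacts

    @property
    def cost_per_successful_task(self) -> float:   # metric 1 (primary)
        return self.total_cost_usd / max(self.successes, 1)

    @property
    def cost_per_artifact(self) -> float:          # metric 2 (secondary)
        return self.total_cost_usd / max(self.artifacts, 1)
```
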
---

*Last updated: 2025-07-05*