# ACO Roadmap
|
|
## Completed (v1-v11)

- [x] Normalized trace schema
- [x] Synthetic trace generator (10K traces)
- [x] Cost telemetry collector
- [x] Task cost classifier
- [x] Model cascade router (XGBoost per-tier)
- [x] Context budgeter
- [x] Cache-aware prompt layout
- [x] Tool-use cost gate
- [x] Verifier budgeter
- [x] Retry/recovery optimizer
- [x] Meta-tool miner
- [x] Early termination detector
- [x] Execution-feedback router (entropy cascade)
- [x] Per-step routing
- [x] Real benchmark evaluation (SWE-bench, BFCL)
- [x] Ablation study on real data
- [x] Literature review
- [x] Deployment guide
- [x] Technical blog post
- [x] Final report
- [x] Model cards
|
|
## In Progress

- [ ] Fine-tuned DistilBERT router (cloud training job on SPROUT)
- [ ] Gradio dashboard with real benchmark numbers
|
|
## Next Priority (CPU-friendly)

- [ ] Conformal calibration of escalation thresholds
- [ ] Cost-quality Pareto frontier visualization
- [ ] JSON schema validation for traces
- [ ] Unit tests for all 11 modules
- [ ] Integration test suite
- [ ] Example notebooks
- [ ] Provider adapter examples (OpenAI, Anthropic, local)
- [ ] Config file validator
- [ ] CLI improvements (batch routing, cost estimation)
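The conformal-calibration item would replace the current hand-tuned escalation thresholds with one carrying a finite-sample coverage guarantee. A minimal split-conformal sketch (function and variable names are illustrative, not from the codebase):

```python
import math

def conformal_threshold(cal_confidences, cal_correct, alpha=0.1):
    """Split-conformal threshold for escalation.

    Among calibration examples the cheap model answered correctly,
    find the nonconformity cutoff that bounds miscoverage at alpha,
    with the standard (n + 1) finite-sample correction.
    """
    # Nonconformity score: 1 - confidence, over correctly answered examples.
    scores = sorted(1 - c for c, ok in zip(cal_confidences, cal_correct) if ok)
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample quantile index
    return scores[min(k, n) - 1]

def should_escalate(confidence, threshold):
    # Escalate to a stronger tier when nonconformity exceeds the cutoff.
    return (1 - confidence) > threshold
```

At inference time, answers with confidence below `1 - threshold` are escalated; everything else stays on the cheap tier with the calibrated error bound.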
|
|
## Next Priority (GPU needed)

- [ ] Execution feedback with real model logprobs
- [ ] Best-of-N cheap sampling with reward model
- [ ] Fine-tuned BERT per-step router
- [ ] Process reward model for selective verification
- [ ] Real agent benchmarks (SWE-bench Live, WebArena)
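Once real logprobs are available, the simulated signal in the entropy cascade can be swapped for a cheap uncertainty proxy computed from the sampled tokens alone. A hedged sketch, assuming the provider returns per-token logprobs (names and the threshold value are illustrative):

```python
def mean_surprisal(token_logprobs):
    """Average negative log-probability of the sampled tokens.

    A cheap uncertainty proxy when full next-token distributions
    (and hence true entropy) are unavailable from the provider.
    """
    return -sum(token_logprobs) / len(token_logprobs)

def should_escalate(token_logprobs, threshold=1.0):
    """Escalate to a stronger model tier when the cheap model's own
    sampled tokens look unlikely to it (high average surprisal)."""
    return mean_surprisal(token_logprobs) > threshold
```

A confident generation like `[-0.05, -0.1]` stays on the cheap tier, while an uncertain one like `[-2.3, -1.9]` triggers escalation; the threshold itself is a natural target for the conformal calibration item above.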
|
|
## Long-term

- [ ] Learned context selector (vs. heuristic budgeter)
- [ ] Workflow mining from real traces
- [ ] Online learning from new traces
- [ ] Multi-agent cost optimization
- [ ] Provider-aware routing (cost/latency/availability)
- [ ] Budget-constrained decoding
- [ ] Cross-task transfer learning
|
## Known Limitations

- Router trained only on SPROUT + SWE-Router (more domains needed)
- Execution feedback uses simulated logprobs (real model outputs needed)
- No conformal guarantees on quality (thresholds are hand-tuned)
- Per-step routing not yet integrated with the v11 XGBoost router
- Cache-aware layout not benchmarked against real providers
- No real agent harness integration tested end-to-end
|
|
## Headroom

An oracle router on SWE-bench shows an 80.3% cost reduction is achievable; v11 achieves 36.9%. The remaining 43.4 percentage points are expected to come from:

- Better per-step routing (~10%)
- Real execution feedback (~10%)
- Best-of-N cheap sampling (~8%)
- Conformal calibration (~5%)
- More training data from more domains (~10%)
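These headroom estimates are exactly the cost-quality trade-offs the planned Pareto frontier visualization would surface. A minimal sketch of the frontier computation itself (illustrative, not tied to the codebase):

```python
def pareto_frontier(points):
    """Filter (cost, quality) configurations down to the non-dominated set.

    A point survives only if no other point is at most as costly AND at
    least as good in quality (with one of the inequalities strict).
    """
    frontier = []
    # Sort by cost ascending, breaking ties by quality descending, so a
    # single left-to-right pass keeps exactly the dominant points.
    for cost, quality in sorted(points, key=lambda p: (p[0], -p[1])):
        if not frontier or quality > frontier[-1][1]:
            frontier.append((cost, quality))
    return frontier
```

Feeding each router configuration's measured (cost, quality) pair through this filter gives the curve to plot; v11 and the oracle then appear as two points on it, with the headroom as the gap between them.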
|
|