# ACO Roadmap

## Completed (v1-v11)

- [x] Normalized trace schema
- [x] Synthetic trace generator (10K traces)
- [x] Cost telemetry collector
- [x] Task cost classifier
- [x] Model cascade router (XGBoost per-tier)
- [x] Context budgeter
- [x] Cache-aware prompt layout
- [x] Tool-use cost gate
- [x] Verifier budgeter
- [x] Retry/recovery optimizer
- [x] Meta-tool miner
- [x] Early termination detector
- [x] Execution-feedback router (entropy cascade)
- [x] Per-step routing
- [x] Real benchmark evaluation (SWE-bench, BFCL)
- [x] Ablation study on real data
- [x] Literature review
- [x] Deployment guide
- [x] Technical blog post
- [x] Final report
- [x] Model cards

## In Progress

- [ ] Fine-tuned DistilBERT router (cloud job training on SPROUT)
- [ ] Gradio dashboard with real benchmark numbers

## Next Priority (CPU-friendly)

- [ ] Conformal calibration of escalation thresholds
- [ ] Cost-quality Pareto frontier visualization
- [ ] JSON schema validation for traces
- [ ] Unit tests for all 11 modules
- [ ] Integration test suite
- [ ] Example notebooks
- [ ] Provider adapter examples (OpenAI, Anthropic, local)
- [ ] Config file validator
- [ ] CLI improvements (batch routing, cost estimation)

## Next Priority (GPU needed)

- [ ] Execution feedback with real model logprobs
- [ ] Best-of-N cheap sampling with a reward model
- [ ] Fine-tuned BERT per-step router
- [ ] Process reward model for selective verification
- [ ] Real agent benchmarks (SWE-bench Live, WebArena)

## Long-term

- [ ] Learned context selector (vs. heuristic budgeter)
- [ ] Workflow mining from real traces
- [ ] Online learning from new traces
- [ ] Multi-agent cost optimization
- [ ] Provider-aware routing (cost/latency/availability)
- [ ] Budget-constrained decoding
- [ ] Cross-task transfer learning

## Known Limitations

- Router trained on SPROUT + SWE-Router only (more domains needed)
- Execution feedback uses simulated logprobs (real model outputs needed)
- No conformal guarantees on quality (thresholds are hand-tuned)
- Per-step routing not yet integrated with the v11 XGBoost router
- Cache-aware layout not benchmarked against real providers
- No real agent-harness integration tested end to end

## Headroom

An oracle on SWE-bench shows that an 80.3% cost reduction is achievable; v11 achieves 36.9%. The remaining 43.4 points are expected to come from:

- Better per-step routing (~10%)
- Real execution feedback (~10%)
- Best-of-N cheap sampling (~8%)
- Conformal calibration (~5%)
- More training data from more domains (~10%)
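
Conformal calibration of the escalation thresholds (a CPU-friendly priority above, and the answer to the "hand-tuned thresholds" limitation) can be sketched with standard split-conformal machinery. Everything below is illustrative, not code from this repo: the confidence scores, the `should_escalate` helper, and the synthetic calibration data are all assumptions.

```python
import numpy as np

# Sketch: split-conformal calibration of an escalation threshold.
# Setup (assumed): the cheap model emits a confidence score per task;
# we escalate to the expensive model when confidence is low. We want
# the threshold set so that, on held-out calibration data, at most
# alpha of the cheap model's *failures* avoid escalation.
rng = np.random.default_rng(0)

# Calibration set: cheap-model confidence on tasks where it failed.
# (In practice these would come from labeled traces; here, synthetic.)
fail_confidence = rng.beta(2, 5, size=500)  # failures skew low-confidence

alpha = 0.05  # tolerated miss rate on failures

# Split-conformal quantile with the finite-sample correction:
n = len(fail_confidence)
q = np.ceil((n + 1) * (1 - alpha)) / n
threshold = np.quantile(fail_confidence, min(q, 1.0))

def should_escalate(confidence: float) -> bool:
    # Escalate to the expensive tier when confidence is below threshold.
    return confidence < threshold

# By construction, at least (1 - alpha) of calibration failures escalate.
escalated = float(np.mean(fail_confidence < threshold))
print(f"threshold={threshold:.3f}, failures escalated={escalated:.1%}")
```

The finite-sample correction `(n + 1)(1 - alpha) / n` is what distinguishes a conformal guarantee from simply eyeballing a quantile; the guarantee holds on exchangeable future failures, assuming the calibration traces are representative.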