ACO Roadmap
Completed (v1-v11)
- Normalized trace schema
- Synthetic trace generator (10K traces)
- Cost telemetry collector
- Task cost classifier
- Model cascade router (XGBoost per-tier; sketch after this list)
- Context budgeter
- Cache-aware prompt layout
- Tool-use cost gate
- Verifier budgeter
- Retry/recovery optimizer
- Meta-tool miner
- Early termination detector
- Execution-feedback router (entropy cascade)
- Per-step routing
- Real benchmark evaluation (SWE-bench, BFCL)
- Ablation study on real data
- Literature review
- Deployment guide
- Technical blog post
- Final report
- Model cards
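As referenced above, here is a minimal sketch of how the XGBoost cascade router and the entropy-cascade execution feedback could compose. Everything in it is illustrative: the tier names, feature layout, and the entropy threshold are assumptions, not the project's actual configuration.

```python
# Illustrative sketch only: tiers, features, and threshold are placeholders.
import numpy as np
from xgboost import XGBClassifier

TIERS = ["small", "medium", "large"]  # hypothetical tiers, cheapest first

def train_router(X: np.ndarray, y: np.ndarray) -> XGBClassifier:
    """Fit the per-tier router on traces: X holds task features (e.g. prompt
    length, tool count); y is the index of the cheapest tier that succeeded."""
    clf = XGBClassifier(n_estimators=200, max_depth=4)
    clf.fit(X, y)
    return clf

def mean_token_entropy(logprobs: np.ndarray) -> float:
    """Mean per-token entropy of a (tokens, vocab) log-probability matrix."""
    probs = np.exp(logprobs)
    return float(-(probs * logprobs).sum(axis=-1).mean())

def next_tier(clf: XGBClassifier, features: np.ndarray,
              logprobs: np.ndarray, threshold: float = 2.5) -> str:
    """Start at the predicted tier; escalate one tier when the cheap model's
    output entropy exceeds the (here hand-picked) threshold."""
    tier = int(clf.predict(features.reshape(1, -1))[0])
    if mean_token_entropy(logprobs) > threshold:
        tier = min(tier + 1, len(TIERS) - 1)
    return TIERS[tier]
```

In the real system the threshold would come from the conformal calibration item below rather than a hand-picked constant (see Known Limitations).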
In Progress
- Fine-tuned DistilBERT router (training as a cloud job on SPROUT; sketch below)
- Gradio dashboard with real benchmark numbers
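For the in-progress router fine-tune, a sketch of what the training job could look like with Hugging Face transformers, assuming a hypothetical traces.jsonl with "prompt" and "label" fields (label = tier index); the hyperparameters are placeholders, not the actual SPROUT job configuration.

```python
# Hypothetical fine-tune of DistilBERT as a 3-tier router; not the real job.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=3)  # one class per model tier

def tokenize(batch):
    return tokenizer(batch["prompt"], truncation=True, max_length=512)

# Assumed file: one JSON object per line with "prompt" and "label" fields.
train_ds = Dataset.from_json("traces.jsonl").map(tokenize, batched=True)

args = TrainingArguments(output_dir="distilbert-router",
                         per_device_train_batch_size=16,
                         num_train_epochs=3,
                         learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=train_ds,
        tokenizer=tokenizer).train()  # tokenizer enables dynamic padding
```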
Next Priority (CPU-friendly)
- Conformal calibration of escalation thresholds (see the sketch after this list)
- Cost-quality Pareto frontier visualization
- JSON schema validation for traces
- Unit tests for all 11 modules
- Integration test suite
- Example notebooks
- Provider adapter examples (OpenAI, Anthropic, local)
- Config file validator
- CLI improvements (batch routing, cost estimation)
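For the conformal calibration item, a sketch of the standard split-conformal recipe, assuming a held-out calibration set of (confidence, succeeded) pairs from the cheap tier; the variable names are illustrative.

```python
# Split-conformal cutoff for escalation; assumes the calibration set
# contains at least some failures of the cheap tier.
import numpy as np

def conformal_threshold(confidences: np.ndarray, succeeded: np.ndarray,
                        alpha: float = 0.1) -> float:
    """Return a confidence cutoff such that roughly a 1 - alpha fraction of
    tasks the cheap model would fail score at or below it (and so escalate)."""
    # Nonconformity scores: confidences of runs that actually failed.
    fail_scores = confidences[~succeeded]
    n = len(fail_scores)
    # Finite-sample corrected quantile from split conformal prediction.
    q = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return float(np.quantile(fail_scores, q))

# Usage: escalate whenever the cheap model's confidence <= this cutoff.
```

On exchangeable data this catches roughly a 1 - alpha fraction of would-be failures, replacing the hand-tuned thresholds noted under Known Limitations.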
Next Priority (GPU needed)
- Execution-feedback with real model logprobs
- Best-of-N cheap sampling with reward model (sketch after this list)
- Fine-tuned BERT per-step router
- Process reward model for selective verification
- Real agent benchmarks (SWE-bench Live, WebArena)
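For the best-of-N item, a sketch of the intended shape, with placeholder `generate` and `reward` callables standing in for the cheap model and the reward model; the escalation cutoff is an assumption.

```python
# Best-of-N from the cheap tier, scored by a reward model; placeholders only.
from typing import Callable

def best_of_n(prompt: str, n: int,
              generate: Callable[[str], str],
              reward: Callable[[str, str], float],
              escalate_below: float = 0.0) -> tuple[str, bool]:
    """Sample n candidates from the cheap model, keep the highest-reward one,
    and flag escalation to a stronger tier if even the best scores poorly."""
    candidates = [generate(prompt) for _ in range(n)]
    scores = [reward(prompt, c) for c in candidates]
    best = max(range(n), key=scores.__getitem__)
    return candidates[best], scores[best] < escalate_below
```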
Long-term
- Learned context selector (vs heuristic budgeter)
- Workflow mining from real traces
- Online learning from new traces
- Multi-agent cost optimization
- Provider-aware routing (cost/latency/availability; sketch after this list)
- Budget-constrained decoding
- Cross-task transfer learning
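In its simplest form, provider-aware routing reduces to a constrained argmin over a live provider table; the sketch below assumes hypothetical fields for cost, latency, and availability.

```python
# Constrained argmin over a provider table; all fields are hypothetical.
from dataclasses import dataclass

@dataclass
class Provider:
    name: str
    usd_per_mtok: float   # blended cost per million tokens
    p95_latency_s: float  # observed p95 latency in seconds
    available: bool       # from health checks

def pick_provider(providers: list[Provider], latency_slo_s: float) -> Provider:
    """Cheapest available provider whose p95 latency meets the SLO."""
    eligible = [p for p in providers
                if p.available and p.p95_latency_s <= latency_slo_s]
    if not eligible:
        raise RuntimeError("no provider meets the SLO; relax it or queue")
    return min(eligible, key=lambda p: p.usd_per_mtok)
```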
Known Limitations
- Router trained on SPROUT + SWE-Router only (need more domains)
- Execution feedback uses simulated logprobs (need real model outputs)
- No conformal guarantees on quality (hand-tuned thresholds)
- Per-step routing not yet integrated with v11 XGBoost
- Cache-aware layout not benchmarked with real providers
- No real agent harness integration tested end-to-end
Headroom
The oracle on SWE-bench shows that an 80.3% cost reduction is achievable; v11 achieves 36.9%. The remaining 43.4 percentage points are expected to come from:
- Better per-step routing (~10%)
- Real execution feedback (~10%)
- Best-of-N cheap sampling (~8%)
- Conformal calibration (~5%)
- More training data from additional domains (~10%)