
# ACO Roadmap

## Completed (v1-v11)

- Normalized trace schema
- Synthetic trace generator (10K traces)
- Cost telemetry collector
- Task cost classifier
- Model cascade router (XGBoost per-tier)
- Context budgeter
- Cache-aware prompt layout
- Tool-use cost gate
- Verifier budgeter
- Retry/recovery optimizer
- Meta-tool miner
- Early termination detector
- Execution-feedback router (entropy cascade)
- Per-step routing
- Real benchmark evaluation (SWE-bench, BFCL)
- Ablation study on real data
- Literature review
- Deployment guide
- Technical blog post
- Final report
- Model cards
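The cascade router above can be sketched as a threshold escalation loop: try the cheapest tier first and escalate only when that tier's confidence falls short. The tier names, costs, thresholds, and the `confidence_fn` callback below are illustrative stand-ins, not ACO's actual interface:

```python
# Minimal cascade-routing sketch: cheapest tier first, escalate on low confidence.
# Tiers, costs, and thresholds are made-up values for illustration.
TIERS = [
    {"name": "small",  "cost_per_call": 0.01, "threshold": 0.80},
    {"name": "medium", "cost_per_call": 0.10, "threshold": 0.60},
    {"name": "large",  "cost_per_call": 1.00, "threshold": 0.00},  # always accepts
]

def route(task, confidence_fn):
    """Return (tier_name, cost_spent) after escalating through the cascade."""
    spent = 0.0
    for tier in TIERS:
        spent += tier["cost_per_call"]
        if confidence_fn(task, tier["name"]) >= tier["threshold"]:
            return tier["name"], spent
    return TIERS[-1]["name"], spent

# An easy task stops at the first tier; a harder one escalates once.
easy = route("fix typo", lambda task, tier: 0.9)
hard = route("refactor module",
             lambda task, tier: {"small": 0.3, "medium": 0.7}.get(tier, 1.0))
```

In the real system the per-tier XGBoost classifiers play the role of `confidence_fn`, and the thresholds are the hand-tuned values that the conformal-calibration item below aims to replace.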

## In Progress

- Fine-tuned DistilBERT router (cloud job training on SPROUT)
- Gradio dashboard with real benchmark numbers

## Next Priority (CPU-friendly)

- Conformal calibration of escalation thresholds
- Cost-quality Pareto frontier visualization
- JSON schema validation for traces
- Unit tests for all 11 modules
- Integration test suite
- Example notebooks
- Provider adapter examples (OpenAI, Anthropic, local)
- Config file validator
- CLI improvements (batch routing, cost estimation)
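The conformal-calibration item can be sketched with split conformal prediction: compute nonconformity scores (e.g. one minus router confidence) on a held-out calibration set where the cheap model succeeded, then set the escalation threshold at the finite-sample-corrected quantile. The scores and alpha level below are made up:

```python
import math

def conformal_threshold(cal_scores, alpha=0.1):
    """Split-conformal (1 - alpha)-quantile with the (n + 1) correction."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the corrected quantile
    k = min(k, n)
    return sorted(cal_scores)[k - 1]

# Hypothetical nonconformity scores from a calibration split.
cal = [0.05, 0.12, 0.08, 0.30, 0.22, 0.15, 0.40, 0.10, 0.18, 0.25]
tau = conformal_threshold(cal, alpha=0.2)

def should_escalate(score, tau):
    # Escalate to the bigger model when nonconformity exceeds the threshold.
    return score > tau
```

This replaces a hand-tuned threshold with one that carries a distribution-free coverage guarantee, addressing the "no conformal guarantees on quality" limitation listed below.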

## Next Priority (GPU needed)

- Execution-feedback with real model logprobs
- Best-of-N cheap sampling with reward model
- Fine-tuned BERT per-step router
- Process reward model for selective verification
- Real agent benchmarks (SWE-bench Live, WebArena)
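Best-of-N cheap sampling reduces to: draw N candidates from the cheap model, score each with a reward model, and keep the argmax. A minimal sketch, where `sample_fn` and `reward_fn` are hypothetical stand-ins for the cheap sampler and the reward model:

```python
def best_of_n(sample_fn, reward_fn, n=4):
    """Draw n candidates and return the one the reward model scores highest."""
    candidates = [sample_fn() for _ in range(n)]
    return max(candidates, key=reward_fn)

# Toy stand-ins: a fixed candidate stream and length as a dummy reward.
samples = iter(["a", "ab", "abc", "b"])
best = best_of_n(lambda: next(samples), lambda c: len(c), n=4)
```

The cost argument is that N cheap samples plus one reward-model pass can be cheaper than a single expensive-model call while recovering much of its quality, which is why this item sits in the headroom breakdown below.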

## Long-term

- Learned context selector (vs heuristic budgeter)
- Workflow mining from real traces
- Online learning from new traces
- Multi-agent cost optimization
- Provider-aware routing (cost/latency/availability)
- Budget-constrained decoding
- Cross-task transfer learning

## Known Limitations

- Router trained on SPROUT + SWE-Router only (need more domains)
- Execution feedback uses simulated logprobs (need real model outputs)
- No conformal guarantees on quality (hand-tuned thresholds)
- Per-step routing not yet integrated with v11 XGBoost
- Cache-aware layout not benchmarked with real providers
- No real agent harness integration tested end-to-end

## Headroom

An oracle router on SWE-bench shows that an 80.3% cost reduction is achievable; v11 achieves 36.9%. The remaining 43.4 percentage points are expected to come from:

- Better per-step routing (~10%)
- Real execution feedback (~10%)
- Best-of-N cheap sampling (~8%)
- Conformal calibration (~5%)
- More training data from more domains (~10%)
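As a quick sanity check, using only the numbers stated above, the component estimates roughly account for the oracle gap:

```python
# Headroom decomposition check: component estimates vs. the oracle gap.
oracle, current = 80.3, 36.9
gap = oracle - current  # percentage points between oracle and v11

components = {
    "per-step routing": 10,
    "real execution feedback": 10,
    "best-of-N cheap sampling": 8,
    "conformal calibration": 5,
    "more training domains": 10,
}
estimated = sum(components.values())  # 43, close to the 43.4-point gap
```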