# ACO Roadmap
|
|
## Completed (v1-v11)

- [x] Normalized trace schema
- [x] Synthetic trace generator (10K traces)
- [x] Cost telemetry collector
- [x] Task cost classifier
- [x] Model cascade router (XGBoost per-tier)
- [x] Context budgeter
- [x] Cache-aware prompt layout
- [x] Tool-use cost gate
- [x] Verifier budgeter
- [x] Retry/recovery optimizer
- [x] Meta-tool miner
- [x] Early termination detector
- [x] Execution-feedback router (entropy cascade)
- [x] Per-step routing
- [x] Real benchmark evaluation (SWE-bench, BFCL)
- [x] Ablation study on real data
- [x] Literature review
- [x] Deployment guide
- [x] Technical blog post
- [x] Final report
- [x] Model cards
|
|
## In Progress

- [ ] Fine-tuned DistilBERT router (cloud training job on SPROUT)
- [ ] Gradio dashboard with real benchmark numbers
|
|
## Next Priority (CPU-friendly)

- [ ] Conformal calibration of escalation thresholds
- [ ] Cost-quality Pareto frontier visualization
- [ ] JSON schema validation for traces
- [ ] Unit tests for all 11 modules
- [ ] Integration test suite
- [ ] Example notebooks
- [ ] Provider adapter examples (OpenAI, Anthropic, local)
- [ ] Config file validator
- [ ] CLI improvements (batch routing, cost estimation)
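The conformal-calibration item would replace the current hand-tuned escalation thresholds with one carrying a finite-sample coverage guarantee. A minimal split-conformal sketch (function and variable names are illustrative, not from the codebase):

```python
import math

def conformal_threshold(cal_confidences, cal_correct, alpha=0.1):
    """Split-conformal threshold for escalation.

    Among calibration examples the cheap model answered correctly,
    find the nonconformity cutoff that bounds miscoverage at alpha,
    with the standard (n + 1) finite-sample correction.
    """
    # Nonconformity score: 1 - confidence, over correctly answered examples.
    scores = sorted(1 - c for c, ok in zip(cal_confidences, cal_correct) if ok)
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample quantile index
    return scores[min(k, n) - 1]

def should_escalate(confidence, threshold):
    # Escalate to a stronger tier when nonconformity exceeds the cutoff.
    return (1 - confidence) > threshold
```

At inference time, answers with confidence below `1 - threshold` are escalated; everything else stays on the cheap tier with the calibrated error bound.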
|
|
## Next Priority (GPU needed)

- [ ] Execution feedback with real model logprobs
- [ ] Best-of-N cheap sampling with reward model
- [ ] Fine-tuned BERT per-step router
- [ ] Process reward model for selective verification
- [ ] Real agent benchmarks (SWE-bench Live, WebArena)
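Once real logprobs are available, the simulated signal in the entropy cascade can be swapped for a cheap uncertainty proxy computed from the sampled tokens alone. A hedged sketch, assuming the provider returns per-token logprobs (names and the threshold value are illustrative):

```python
def mean_surprisal(token_logprobs):
    """Average negative log-probability of the sampled tokens.

    A cheap uncertainty proxy when full next-token distributions
    (and hence true entropy) are unavailable from the provider.
    """
    return -sum(token_logprobs) / len(token_logprobs)

def should_escalate(token_logprobs, threshold=1.0):
    """Escalate to a stronger model tier when the cheap model's own
    sampled tokens look unlikely to it (high average surprisal)."""
    return mean_surprisal(token_logprobs) > threshold
```

A confident generation like `[-0.05, -0.1]` stays on the cheap tier, while an uncertain one like `[-2.3, -1.9]` triggers escalation; the threshold itself is a natural target for the conformal calibration item above.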
|
|
## Long-term

- [ ] Learned context selector (vs. heuristic budgeter)
- [ ] Workflow mining from real traces
- [ ] Online learning from new traces
- [ ] Multi-agent cost optimization
- [ ] Provider-aware routing (cost/latency/availability)
- [ ] Budget-constrained decoding
- [ ] Cross-task transfer learning
|
## Known Limitations

- Router trained only on SPROUT + SWE-Router (more domains needed)
- Execution feedback uses simulated logprobs (real model outputs needed)
- No conformal guarantees on quality (thresholds are hand-tuned)
- Per-step routing not yet integrated with the v11 XGBoost router
- Cache-aware layout not benchmarked against real providers
- No real agent harness integration tested end-to-end
|
|
## Headroom

An oracle router on SWE-bench shows an 80.3% cost reduction is achievable; v11 achieves 36.9%. The remaining 43.4 percentage points are expected to come from:

- Better per-step routing (~10%)
- Real execution feedback (~10%)
- Best-of-N cheap sampling (~8%)
- Conformal calibration (~5%)
- More training data from more domains (~10%)
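These headroom estimates are exactly the cost-quality trade-offs the planned Pareto frontier visualization would surface. A minimal sketch of the frontier computation itself (illustrative, not tied to the codebase):

```python
def pareto_frontier(points):
    """Filter (cost, quality) configurations down to the non-dominated set.

    A point survives only if no other point is at most as costly AND at
    least as good in quality (with one of the inequalities strict).
    """
    frontier = []
    # Sort by cost ascending, breaking ties by quality descending, so a
    # single left-to-right pass keeps exactly the dominant points.
    for cost, quality in sorted(points, key=lambda p: (p[0], -p[1])):
        if not frontier or quality > frontier[-1][1]:
            frontier.append((cost, quality))
    return frontier
```

Feeding each router configuration's measured (cost, quality) pair through this filter gives the curve to plot; v11 and the oracle then appear as two points on it, with the headroom as the gap between them.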
|
|