| # ACO: Agent Cost Optimizer |
|
|
| A universal control layer that reduces autonomous agent cost while preserving task quality. |
|
|
| ## Quick Results (SWE-bench, 500 coding tasks, 8 real models) |
|
|
| Policy | Success | Cost/Task | Cost Reduction |
| |--------|---------|-----------|---------| |
| | Oracle | 87.0% | $0.062 | 80.3% | |
| | **v10+feedback** | **84.8%** | **$0.201** | **36.4%** | |
| | v10 direct | 76.6% | $0.188 | 40.7% | |
| | Always frontier | 78.2% | $0.317 | baseline | |
| | Always cheap | 63.2% | $0.014 | 95.5% | |
|
|
**Key finding: v10+feedback strictly dominates always-frontier** – lower cost AND higher quality. This is not a cost-quality tradeoff.
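The Cost Reduction column can be reproduced from the per-task costs. A minimal check, assuming Cost Reduction = 1 − Cost/Task ÷ always-frontier Cost/Task (displayed costs are rounded, so recomputed values can differ by a few tenths of a point):

```python
# Reproduce the Cost Reduction column from per-task costs.
# Assumption: reduction is measured relative to the always-frontier baseline.
FRONTIER_COST = 0.317  # always-frontier $/task (the "baseline" row)

def cost_reduction(cost_per_task: float) -> float:
    """Fractional cost reduction versus the always-frontier baseline."""
    return 1.0 - cost_per_task / FRONTIER_COST

for policy, cost in [("oracle", 0.062), ("v10+feedback", 0.201), ("v10 direct", 0.188)]:
    print(f"{policy}: {cost_reduction(cost):.1%}")
```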
|
|
| ## BERT Router Results |
|
|
DistilBERT was fine-tuned on SPROUT for binary classification. The binary classifier fails for tier routing – it ignores tier prefixes and predicts P(success) ≈ 89.5% for all tiers, routing everything to the cheapest model.
|
|
| A 5-class retraining is in progress (job `69fd8cccaff1cd33e8f30714`). |
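The failure mode is easy to reproduce in miniature. A cascade that picks the cheapest tier whose predicted P(success) clears a threshold degenerates to "always cheapest" when the classifier ignores the tier prefix and emits a constant probability. A hypothetical sketch (tier names, prices, and probabilities are illustrative, not the actual SPROUT tiers):

```python
# Why a tier-blind binary classifier breaks cascade routing.
# Hypothetical tiers, cheapest first; prices are illustrative only.
TIERS = [("cheap", 0.014), ("mid", 0.080), ("frontier", 0.317)]
THRESHOLD = 0.8  # route to the cheapest tier clearing this P(success)

def route(p_success) -> str:
    """Pick the cheapest tier whose predicted success probability clears THRESHOLD."""
    for tier, _cost in TIERS:
        if p_success(tier) >= THRESHOLD:
            return tier
    return TIERS[-1][0]  # nothing clears the bar: fall back to the frontier tier

# Broken: the binary classifier ignores the tier prefix entirely.
tier_blind = lambda tier: 0.895  # constant P(success) for every tier
# Working: a tier-aware (multi-class) head actually conditions on the tier.
tier_aware = lambda tier: {"cheap": 0.55, "mid": 0.82, "frontier": 0.97}[tier]

print(route(tier_blind))  # "cheap" for every task, regardless of difficulty
print(route(tier_aware))  # "mid": the cheapest tier that clears the bar
```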
|
|
| ## 11 Modules |
|
|
1. Cost Telemetry Collector – `aco/telemetry.py`
2. Task Cost Classifier – `aco/classifier.py`
3. Model Cascade Router (XGBoost + isotonic) – `aco/router_v10.py`
4. Execution-Feedback Router (entropy cascade) – `aco/execution_feedback.py`
5. Context Budgeter – `aco/context_budgeter.py`
6. Cache-Aware Prompt Layout – `aco/cache_layout.py`
7. Tool-Use Cost Gate – `aco/tool_gate.py`
8. Verifier Budgeter – `aco/verifier_budgeter.py`
9. Retry/Recovery Optimizer – `aco/retry_optimizer.py`
10. Meta-Tool Miner – `aco/meta_tool_miner.py`
11. Doom Detector – `aco/doom_detector.py`
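To make the entropy cascade (module 4) concrete: run the cheap model first, measure how uncertain its output tokens are, and escalate only when uncertainty exceeds a budgeted threshold. The sketch below is an illustration under assumed names and thresholds, not the actual `aco/execution_feedback.py` implementation:

```python
import math

# Entropy-gated cascade sketch, in the spirit of module 4.
# Function names and the 0.5-nat threshold are illustrative assumptions.

def token_entropy(probs: list[float]) -> float:
    """Shannon entropy (nats) of one token's probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_escalate(per_token_probs: list[list[float]],
                    max_mean_entropy: float = 0.5) -> bool:
    """Escalate to a stronger model when the cheap model's mean per-token
    entropy exceeds the budgeted threshold (i.e., the model is unsure)."""
    mean_h = sum(token_entropy(p) for p in per_token_probs) / len(per_token_probs)
    return mean_h > max_mean_entropy

confident = [[0.97, 0.02, 0.01], [0.99, 0.005, 0.005]]
unsure = [[0.40, 0.35, 0.25], [0.50, 0.30, 0.20]]
print(should_escalate(confident))  # False: keep the cheap model's answer
print(should_escalate(unsure))     # True: re-run on a stronger tier
```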
|
|
| ## New Modules (this session) |
|
|
- **Conformal Calibration** – `aco/conformal.py` – RouteNLP-style distribution-free escalation guarantees
- **Pareto Frontier** – `aco/pareto.py` – RouterBench NDCH + RouteLLM CPT/APGR metrics
- **Integration Test** – `tests/test_integration.py` – Full pipeline test
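A distribution-free escalation guarantee can be sketched with split conformal prediction: calibrate a confidence cutoff on held-out failed tasks so that, with probability at least 1 − α, a task that would fail on the cheap tier gets escalated. This is an assumption-laden illustration with synthetic scores, not the `aco/conformal.py` or RouteNLP procedure verbatim:

```python
import math

# Split-conformal escalation cutoff, in the spirit of aco/conformal.py.
# Calibration scores below are synthetic and purely illustrative.

def conformal_threshold(scores: list[float], alpha: float = 0.1) -> float:
    """Cutoff tau such that a future (exchangeable) failing task has
    score > tau with probability <= alpha. `scores` are confidence scores
    of calibration tasks that FAILED on the cheap tier."""
    n = len(scores)
    # Conservative finite-sample order statistic: ceil((n + 1) * (1 - alpha))
    k = math.ceil((n + 1) * (1 - alpha))
    return sorted(scores)[min(k, n) - 1]

# Cheap-tier confidence scores of calibration tasks that actually failed:
failed_scores = [0.12, 0.31, 0.40, 0.45, 0.52, 0.58, 0.61, 0.66, 0.70, 0.83]
tau = conformal_threshold(failed_scores, alpha=0.1)

def keep_on_cheap(confidence: float) -> bool:
    """Stay on the cheap tier only above the calibrated cutoff;
    otherwise escalate, bounding the miss rate by alpha."""
    return confidence > tau
```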
|
|
| ## Key Takeaway |
|
|
Training on real execution data matters more than architecture: v8, trained on synthetic data, *increased* cost by 11.6%, while v10, trained on 500 real SWE-Router outcomes, *saved* 36.4%. Same XGBoost, same features.
|
|
| ## Documentation |
|
|
| - [Final Report](docs/final_report_v2.md) |
| - [Pareto Frontier Report](docs/pareto_frontier_report.md) |
| - [Conformal Calibration Report](docs/conformal_report.md) |
| - [BERT Eval Report](docs/bert_eval_report.md) |
| - [Literature Review](docs/literature_review.md) |
| - [Deployment Guide](docs/deployment_guide.md) |
| - [Technical Blog](docs/technical_blog.md) |
| - [Roadmap](docs/ROADMAP.md) |
|
|
| ## Links |
|
|
| - **Model**: [narcolepticchicken/agent-cost-optimizer](https://huggingface.co/narcolepticchicken/agent-cost-optimizer) |
| - **Dataset**: [narcolepticchicken/agent-cost-traces](https://huggingface.co/datasets/narcolepticchicken/agent-cost-traces) |
| - **Dashboard**: [narcolepticchicken/aco-dashboard](https://huggingface.co/spaces/narcolepticchicken/aco-dashboard) |
|
|