# ACO: Agent Cost Optimizer
A universal control layer that reduces autonomous agent cost while preserving task quality.
## Quick Results (SWE-bench, 500 coding tasks, 8 real models)
| Policy | Success | Cost/Task | Cost Reduction |
|--------|---------|-----------|----------------|
| Oracle | 87.0% | $0.062 | 80.3% |
| **v10+feedback** | **84.8%** | **$0.201** | **36.4%** |
| v10 direct | 76.6% | $0.188 | 40.7% |
| Always frontier | 78.2% | $0.317 | baseline |
| Always cheap | 63.2% | $0.014 | 95.5% |
**Key finding: v10+feedback strictly dominates always-frontier** – lower cost AND higher quality. This is not a cost-quality tradeoff.
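The cost-reduction column is measured against the always-frontier baseline. A minimal sketch of the metric, using the per-task costs from the table above (recomputed figures can differ from the table by a few tenths of a percent because the table's per-task costs are themselves rounded):

```python
# Cost reduction relative to the always-frontier baseline ($0.317/task).
# Per-task costs taken from the results table; rounding may shift the
# recomputed percentages slightly vs. the table.
BASELINE = 0.317  # always-frontier cost per task

POLICIES = {
    "oracle": 0.062,
    "v10+feedback": 0.201,
    "v10 direct": 0.188,
    "always cheap": 0.014,
}

def cost_reduction(cost_per_task: float, baseline: float = BASELINE) -> float:
    """Fraction of the baseline spend a policy saves."""
    return 1.0 - cost_per_task / baseline

for name, cost in POLICIES.items():
    print(f"{name}: {cost_reduction(cost):.1%}")
```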
## BERT Router Results
DistilBERT was fine-tuned on SPROUT for binary classification. The binary classifier fails for tier routing: it ignores tier prefixes and predicts P(success) ≈ 89.5% for every tier, routing everything to the cheapest model.
A 5-class retraining is in progress (job `69fd8cccaff1cd33e8f30714`).
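To make the failure mode concrete, here is a sketch of the routing decision downstream of the classifier. The tier names, costs, and threshold are illustrative, not the actual aco configuration; the real system would use the retrained 5-class DistilBERT head as the predictor:

```python
# Tier routing sketch: pick the cheapest tier whose predicted success
# probability clears a threshold. Tier names/costs are illustrative.

TIERS = [  # (tier name, cost per task in USD), cheapest first
    ("nano", 0.001),
    ("mini", 0.005),
    ("mid", 0.030),
    ("large", 0.120),
    ("frontier", 0.317),
]

def route(p_success_by_tier: dict, threshold: float = 0.8) -> str:
    """Return the cheapest tier predicted to succeed with prob >= threshold."""
    for tier, _cost in TIERS:
        if p_success_by_tier.get(tier, 0.0) >= threshold:
            return tier
    return TIERS[-1][0]  # no tier is confident enough: fall back to frontier

# The failure mode described above: a collapsed classifier outputs ~0.895
# for every tier, so the threshold always passes on the first (cheapest) tier.
collapsed = {tier: 0.895 for tier, _ in TIERS}
print(route(collapsed))  # routes everything to "nano"
```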
## 11 Modules
1. Cost Telemetry Collector – `aco/telemetry.py`
2. Task Cost Classifier – `aco/classifier.py`
3. Model Cascade Router (XGBoost + isotonic) – `aco/router_v10.py`
4. Execution-Feedback Router (entropy cascade) – `aco/execution_feedback.py`
5. Context Budgeter – `aco/context_budgeter.py`
6. Cache-Aware Prompt Layout – `aco/cache_layout.py`
7. Tool-Use Cost Gate – `aco/tool_gate.py`
8. Verifier Budgeter – `aco/verifier_budgeter.py`
9. Retry/Recovery Optimizer – `aco/retry_optimizer.py`
10. Meta-Tool Miner – `aco/meta_tool_miner.py`
11. Doom Detector – `aco/doom_detector.py`
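Module 4, the entropy cascade, escalates through models cheapest-first and stops as soon as an answer looks confident. A minimal sketch of the idea, assuming each model exposes per-token probability distributions (model interface, names, and the threshold are illustrative, not the `aco/execution_feedback.py` API):

```python
import math

def mean_token_entropy(token_probs: list) -> float:
    """Average Shannon entropy (nats) over per-token probability distributions."""
    total = 0.0
    for dist in token_probs:
        total += -sum(p * math.log(p) for p in dist if p > 0)
    return total / len(token_probs)

def cascade(task, models, entropy_threshold: float = 1.0):
    """Try models cheapest-first; escalate when the answer looks uncertain."""
    for model in models:
        answer, token_probs = model(task)  # model -> (text, per-token dists)
        if mean_token_entropy(token_probs) <= entropy_threshold:
            return answer  # confident enough: stop paying for bigger models
    return answer  # the last (frontier) output is accepted unconditionally

# Demo: an uncertain cheap model triggers one escalation, then the
# confident model's answer is returned.
cheap = lambda task: ("patch-A", [[0.99, 0.01]])
uncertain = lambda task: ("patch-B", [[0.25, 0.25, 0.25, 0.25]])
print(cascade("fix bug", [uncertain, cheap]))  # prints "patch-A"
```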
## New Modules (this session)
- **Conformal Calibration** – `aco/conformal.py` – RouteNLP-style distribution-free escalation guarantees
- **Pareto Frontier** – `aco/pareto.py` – RouterBench NDCH + RouteLLM CPT/APGR metrics
- **Integration Test** – `tests/test_integration.py` – full pipeline test
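The conformal module's guarantee comes from split conformal prediction: pick an escalation threshold as a finite-sample-corrected quantile of nonconformity scores on a held-out calibration set. A minimal sketch (the score definition and calibration values are illustrative, not the `aco/conformal.py` internals):

```python
import math

def conformal_threshold(calib_scores: list, alpha: float = 0.1) -> float:
    """Split-conformal quantile: with probability >= 1 - alpha, a fresh score
    from the same distribution falls at or below this threshold."""
    n = len(calib_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    return sorted(calib_scores)[min(k, n) - 1]

def should_escalate(score: float, threshold: float) -> bool:
    """Escalate to a stronger model when the nonconformity score is atypical."""
    return score > threshold

# Illustrative nonconformity scores, e.g. 1 - router-predicted success
# probability on a held-out calibration set of solved tasks.
calib = [0.05, 0.12, 0.08, 0.30, 0.22, 0.15, 0.10, 0.40, 0.18, 0.25]
t = conformal_threshold(calib, alpha=0.2)
print(t, should_escalate(0.35, t))  # prints "0.3 True"
```

The quantile uses rank `ceil((n + 1) * (1 - alpha))` rather than the plain empirical quantile; that correction is what makes the coverage guarantee hold for any score distribution at finite n.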
## Key Takeaway
Training on real execution data matters more than architecture. v8 trained on synthetic data *increased* cost by 11.6%. v10 trained on 500 real SWE-Router outcomes *saved* 36.4%. Same XGBoost, same features.
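The v10 router (module 3) pairs XGBoost scores with isotonic calibration so raw scores map to monotone success probabilities. A minimal pool-adjacent-violators sketch of that calibration step, independent of the actual aco implementation (which would typically use `sklearn.isotonic.IsotonicRegression`):

```python
def isotonic_fit(scores, labels):
    """Pool-adjacent-violators: map raw router scores to calibrated, monotone
    success probabilities. `scores` must be sorted ascending; only their
    ordering matters, the fit is over `labels`."""
    merged = []  # blocks of [mean, weight]
    for y in labels:
        merged.append([float(y), 1.0])
        # Pool adjacent blocks while the monotonicity constraint is violated.
        while len(merged) > 1 and merged[-2][0] > merged[-1][0]:
            m2, w2 = merged.pop()
            m1, w1 = merged.pop()
            w = w1 + w2
            merged.append([(m1 * w1 + m2 * w2) / w, w])
    # Expand blocks back to one calibrated value per input point.
    out = []
    for mean, weight in merged:
        out.extend([mean] * int(weight))
    return out

# Raw scores (ascending) with observed outcomes: the violating pair (1, 0)
# in the middle gets pooled into a single 0.5 block.
print(isotonic_fit([0.1, 0.4, 0.5, 0.9], [0, 1, 0, 1]))  # [0.0, 0.5, 0.5, 1.0]
```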
## Documentation
- [Final Report](docs/final_report_v2.md)
- [Pareto Frontier Report](docs/pareto_frontier_report.md)
- [Conformal Calibration Report](docs/conformal_report.md)
- [BERT Eval Report](docs/bert_eval_report.md)
- [Literature Review](docs/literature_review.md)
- [Deployment Guide](docs/deployment_guide.md)
- [Technical Blog](docs/technical_blog.md)
- [Roadmap](docs/ROADMAP.md)
## Links
- **Model**: [narcolepticchicken/agent-cost-optimizer](https://huggingface.co/narcolepticchicken/agent-cost-optimizer)
- **Dataset**: [narcolepticchicken/agent-cost-traces](https://huggingface.co/datasets/narcolepticchicken/agent-cost-traces)
- **Dashboard**: [narcolepticchicken/aco-dashboard](https://huggingface.co/spaces/narcolepticchicken/aco-dashboard)