# ACO: Agent Cost Optimizer

A universal control layer that reduces autonomous agent cost while preserving task quality.

## Quick Results (SWE-bench, 500 coding tasks, 8 real models)

| Policy | Success | Cost/Task | Cost Reduction |
|--------|---------|-----------|----------------|
| Oracle | 87.0% | $0.062 | 80.3% |
| **v10+feedback** | **84.8%** | **$0.201** | **36.4%** |
| v10 direct | 76.6% | $0.188 | 40.7% |
| Always frontier | 78.2% | $0.317 | baseline |
| Always cheap | 63.2% | $0.014 | 95.5% |

**Key finding: v10+feedback strictly dominates always-frontier**, with lower cost *and* higher quality. This is not a cost-quality tradeoff.
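The dominance claim can be sanity-checked directly from the table above (a minimal sketch; the policy names and numbers are taken from the results table, nothing else is assumed):

```python
def dominates(a, b):
    """Policy a dominates b if it costs no more and succeeds at least as
    often, with a strict improvement on at least one axis."""
    return (a["cost"] <= b["cost"] and a["success"] >= b["success"]
            and (a["cost"] < b["cost"] or a["success"] > b["success"]))

policies = {
    "v10+feedback":    {"success": 0.848, "cost": 0.201},
    "always_frontier": {"success": 0.782, "cost": 0.317},
    "always_cheap":    {"success": 0.632, "cost": 0.014},
}

# v10+feedback is cheaper AND more successful than always-frontier.
assert dominates(policies["v10+feedback"], policies["always_frontier"])
# always-cheap is not dominated by it: that one trades quality for cost.
assert not dominates(policies["v10+feedback"], policies["always_cheap"])
```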

## BERT Router Results

DistilBERT was fine-tuned on SPROUT for binary classification. The binary classifier fails for tier routing: it ignores tier prefixes and predicts P(success) ≈ 89.5% for every tier, routing everything to the cheapest model.

A 5-class retraining is in progress (job `69fd8cccaff1cd33e8f30714`).
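The failure mode above is cheap to detect without retraining: probe the classifier with the same task under each tier prefix and check whether its score actually varies. A hypothetical sketch follows; `predict_success`, the tier names, and the prompt format are all illustrative stand-ins, not the actual SPROUT encoding:

```python
TIERS = ["tier-1", "tier-2", "tier-3", "tier-4", "tier-5"]

def is_tier_sensitive(predict_success, task, eps=0.01):
    """Return True if P(success) meaningfully depends on the tier prefix.
    A router whose scores are flat across tiers degenerates to
    'route everything to the cheapest model'."""
    scores = [predict_success(f"[{tier}] {task}") for tier in TIERS]
    return max(scores) - min(scores) > eps

# A degenerate binary classifier like the one described: ~89.5% everywhere.
flat = lambda prompt: 0.895
assert not is_tier_sensitive(flat, "fix failing unit test in parser")
```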

## 11 Modules

1. Cost Telemetry Collector - `aco/telemetry.py`
2. Task Cost Classifier - `aco/classifier.py`
3. Model Cascade Router (XGBoost + isotonic) - `aco/router_v10.py`
4. Execution-Feedback Router (entropy cascade) - `aco/execution_feedback.py`
5. Context Budgeter - `aco/context_budgeter.py`
6. Cache-Aware Prompt Layout - `aco/cache_layout.py`
7. Tool-Use Cost Gate - `aco/tool_gate.py`
8. Verifier Budgeter - `aco/verifier_budgeter.py`
9. Retry/Recovery Optimizer - `aco/retry_optimizer.py`
10. Meta-Tool Miner - `aco/meta_tool_miner.py`
11. Doom Detector - `aco/doom_detector.py`
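Module 4's escalation rule can be sketched as an entropy-gated cascade: run the cheap model first, and escalate only when its token-level uncertainty is high. This is a minimal sketch under stated assumptions; the threshold value, the two-model interface, and the model stubs are illustrative, not the tuned logic in `aco/execution_feedback.py`:

```python
import math

def mean_token_entropy(token_probs):
    """Average Shannon entropy (nats) over per-token probability
    distributions returned alongside the cheap model's answer."""
    def entropy(dist):
        return -sum(p * math.log(p) for p in dist if p > 0)
    return sum(entropy(d) for d in token_probs) / len(token_probs)

def route(cheap_model, frontier_model, task, threshold=1.0):
    """Entropy cascade: accept the cheap answer when the model is
    confident, otherwise escalate the task to the frontier model."""
    answer, token_probs = cheap_model(task)
    if mean_token_entropy(token_probs) <= threshold:
        return answer, "cheap"
    return frontier_model(task), "frontier"

# Illustrative stubs: a confident cheap model keeps the task,
# an uncertain one (near-uniform token distributions) escalates.
confident = lambda t: ("patch-a", [[0.97, 0.01, 0.01, 0.01]])
uncertain = lambda t: ("draft", [[0.25, 0.25, 0.25, 0.25]])
frontier = lambda t: "patch-b"
```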

## New Modules (this session)

- **Conformal Calibration** - `aco/conformal.py` - RouteNLP-style distribution-free escalation guarantees
- **Pareto Frontier** - `aco/pareto.py` - RouterBench NDCH + RouteLLM CPT/APGR metrics
- **Integration Test** - `tests/test_integration.py` - full pipeline test
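The distribution-free guarantee in the conformal module rests on the standard split-conformal recipe: choose the escalation threshold as a quantile of held-out nonconformity scores, which bounds the miss rate at roughly alpha without distributional assumptions. This is a sketch of that textbook recipe, not the exact code in `aco/conformal.py`:

```python
import math

def conformal_threshold(calibration_scores, alpha=0.1):
    """Split conformal: given nonconformity scores from a held-out
    calibration set of size n, the ceil((n+1)*(1-alpha))-th order
    statistic is a threshold that a fresh (exchangeable) score exceeds
    with probability at most ~alpha."""
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    return sorted(calibration_scores)[min(rank, n) - 1]

def should_escalate(score, threshold):
    """Escalate to the frontier model when nonconformity is too high."""
    return score > threshold

# Example: 99 calibration scores spread over 0.01..0.99, alpha = 0.1.
scores = [i / 100 for i in range(1, 100)]
t = conformal_threshold(scores, alpha=0.1)  # 90th order statistic
```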

## Key Takeaway

Training on real execution data matters more than architecture. v8, trained on synthetic data, *increased* cost by 11.6%; v10, trained on 500 real SWE-Router outcomes, *saved* 36.4%. Same XGBoost model, same features.

## Documentation

- [Final Report](docs/final_report_v2.md)
- [Pareto Frontier Report](docs/pareto_frontier_report.md)
- [Conformal Calibration Report](docs/conformal_report.md)
- [BERT Eval Report](docs/bert_eval_report.md)
- [Literature Review](docs/literature_review.md)
- [Deployment Guide](docs/deployment_guide.md)
- [Technical Blog](docs/technical_blog.md)
- [Roadmap](docs/ROADMAP.md)

## Links

- **Model**: [narcolepticchicken/agent-cost-optimizer](https://huggingface.co/narcolepticchicken/agent-cost-optimizer)
- **Dataset**: [narcolepticchicken/agent-cost-traces](https://huggingface.co/datasets/narcolepticchicken/agent-cost-traces)
- **Dashboard**: [narcolepticchicken/aco-dashboard](https://huggingface.co/spaces/narcolepticchicken/aco-dashboard)