# ACO: Agent Cost Optimizer

A universal control layer that reduces autonomous agent cost while preserving task quality.

## Quick Results (SWE-bench, 500 coding tasks, 8 real models)

| Policy | Success | Cost/Task | Cost Reduction |
|--------|---------|-----------|----------------|
| Oracle | 87.0% | $0.062 | 80.3% |
| **v10+feedback** | **84.8%** | **$0.201** | **36.4%** |
| v10 direct | 76.6% | $0.188 | 40.7% |
| Always frontier | 78.2% | $0.317 | baseline |
| Always cheap | 63.2% | $0.014 | 95.5% |

**Key finding: v10+feedback strictly dominates always-frontier**, with lower cost *and* higher quality. This is not a cost-quality tradeoff.
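The dominance claim can be sanity-checked directly from the table above (a minimal sketch; the policy names and numbers are taken from the results table, nothing else is assumed):

```python
def dominates(a, b):
    """Policy a dominates b if it costs no more and succeeds at least as
    often, with a strict improvement on at least one axis."""
    return (a["cost"] <= b["cost"] and a["success"] >= b["success"]
            and (a["cost"] < b["cost"] or a["success"] > b["success"]))

policies = {
    "v10+feedback":    {"success": 0.848, "cost": 0.201},
    "always_frontier": {"success": 0.782, "cost": 0.317},
    "always_cheap":    {"success": 0.632, "cost": 0.014},
}

# v10+feedback is cheaper AND more successful than always-frontier.
assert dominates(policies["v10+feedback"], policies["always_frontier"])
# always-cheap is not dominated by it: that one trades quality for cost.
assert not dominates(policies["v10+feedback"], policies["always_cheap"])
```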

## BERT Router Results

DistilBERT was fine-tuned on SPROUT for binary classification. The binary classifier fails for tier routing: it ignores tier prefixes and predicts P(success) ≈ 89.5% for every tier, routing everything to the cheapest model.

A 5-class retraining is in progress (job `69fd8cccaff1cd33e8f30714`).
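The failure mode above is cheap to detect without retraining: probe the classifier with the same task under each tier prefix and check whether its score actually varies. A hypothetical sketch follows; `predict_success`, the tier names, and the prompt format are all illustrative stand-ins, not the actual SPROUT encoding:

```python
TIERS = ["tier-1", "tier-2", "tier-3", "tier-4", "tier-5"]

def is_tier_sensitive(predict_success, task, eps=0.01):
    """Return True if P(success) meaningfully depends on the tier prefix.
    A router whose scores are flat across tiers degenerates to
    'route everything to the cheapest model'."""
    scores = [predict_success(f"[{tier}] {task}") for tier in TIERS]
    return max(scores) - min(scores) > eps

# A degenerate binary classifier like the one described: ~89.5% everywhere.
flat = lambda prompt: 0.895
assert not is_tier_sensitive(flat, "fix failing unit test in parser")
```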

## 11 Modules

1. Cost Telemetry Collector - `aco/telemetry.py`
2. Task Cost Classifier - `aco/classifier.py`
3. Model Cascade Router (XGBoost + isotonic) - `aco/router_v10.py`
4. Execution-Feedback Router (entropy cascade) - `aco/execution_feedback.py`
5. Context Budgeter - `aco/context_budgeter.py`
6. Cache-Aware Prompt Layout - `aco/cache_layout.py`
7. Tool-Use Cost Gate - `aco/tool_gate.py`
8. Verifier Budgeter - `aco/verifier_budgeter.py`
9. Retry/Recovery Optimizer - `aco/retry_optimizer.py`
10. Meta-Tool Miner - `aco/meta_tool_miner.py`
11. Doom Detector - `aco/doom_detector.py`
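Module 4's escalation rule can be sketched as an entropy-gated cascade: run the cheap model first, and escalate only when its token-level uncertainty is high. This is a minimal sketch under stated assumptions; the threshold value, the two-model interface, and the model stubs are illustrative, not the tuned logic in `aco/execution_feedback.py`:

```python
import math

def mean_token_entropy(token_probs):
    """Average Shannon entropy (nats) over per-token probability
    distributions returned alongside the cheap model's answer."""
    def entropy(dist):
        return -sum(p * math.log(p) for p in dist if p > 0)
    return sum(entropy(d) for d in token_probs) / len(token_probs)

def route(cheap_model, frontier_model, task, threshold=1.0):
    """Entropy cascade: accept the cheap answer when the model is
    confident, otherwise escalate the task to the frontier model."""
    answer, token_probs = cheap_model(task)
    if mean_token_entropy(token_probs) <= threshold:
        return answer, "cheap"
    return frontier_model(task), "frontier"

# Illustrative stubs: a confident cheap model keeps the task,
# an uncertain one (near-uniform token distributions) escalates.
confident = lambda t: ("patch-a", [[0.97, 0.01, 0.01, 0.01]])
uncertain = lambda t: ("draft", [[0.25, 0.25, 0.25, 0.25]])
frontier = lambda t: "patch-b"
```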

## New Modules (this session)

- **Conformal Calibration** - `aco/conformal.py` - RouteNLP-style distribution-free escalation guarantees
- **Pareto Frontier** - `aco/pareto.py` - RouterBench NDCH + RouteLLM CPT/APGR metrics
- **Integration Test** - `tests/test_integration.py` - full pipeline test
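The distribution-free guarantee in the conformal module rests on the standard split-conformal recipe: choose the escalation threshold as a quantile of held-out nonconformity scores, which bounds the miss rate at roughly alpha without distributional assumptions. This is a sketch of that textbook recipe, not the exact code in `aco/conformal.py`:

```python
import math

def conformal_threshold(calibration_scores, alpha=0.1):
    """Split conformal: given nonconformity scores from a held-out
    calibration set of size n, the ceil((n+1)*(1-alpha))-th order
    statistic is a threshold that a fresh (exchangeable) score exceeds
    with probability at most ~alpha."""
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    return sorted(calibration_scores)[min(rank, n) - 1]

def should_escalate(score, threshold):
    """Escalate to the frontier model when nonconformity is too high."""
    return score > threshold

# Example: 99 calibration scores spread over 0.01..0.99, alpha = 0.1.
scores = [i / 100 for i in range(1, 100)]
t = conformal_threshold(scores, alpha=0.1)  # 90th order statistic
```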

## Key Takeaway

Training on real execution data matters more than architecture. v8, trained on synthetic data, *increased* cost by 11.6%; v10, trained on 500 real SWE-Router outcomes, *saved* 36.4%. Same XGBoost model, same features.

## Documentation

- [Final Report](docs/final_report_v2.md)
- [Pareto Frontier Report](docs/pareto_frontier_report.md)
- [Conformal Calibration Report](docs/conformal_report.md)
- [BERT Eval Report](docs/bert_eval_report.md)
- [Literature Review](docs/literature_review.md)
- [Deployment Guide](docs/deployment_guide.md)
- [Technical Blog](docs/technical_blog.md)
- [Roadmap](docs/ROADMAP.md)

## Links

- **Model**: [narcolepticchicken/agent-cost-optimizer](https://huggingface.co/narcolepticchicken/agent-cost-optimizer)
- **Dataset**: [narcolepticchicken/agent-cost-traces](https://huggingface.co/datasets/narcolepticchicken/agent-cost-traces)
- **Dashboard**: [narcolepticchicken/aco-dashboard](https://huggingface.co/spaces/narcolepticchicken/aco-dashboard)