narcolepticchicken commited on
Commit
0b9f3e3
Β·
verified Β·
1 Parent(s): e95c4a3

Upload README.md

Browse files
Files changed (1) hide show
  1. README.md +43 -92
README.md CHANGED
@@ -1,111 +1,62 @@
1
- ---
2
- license: mit
3
- library_name: xgboost
4
- tags:
5
- - agent-cost-optimizer
6
- - model-router
7
- - cost-aware-inference
8
- - cascade-routing
9
- - execution-feedback
10
- - swebench
11
- - ml-intern
12
- ---
13
-
14
- # ACO v11: Agent Cost Optimizer
15
-
16
- A universal control layer that reduces autonomous agent cost while preserving task quality. Trained on real execution data from SPROUT (31K rows, 13 models) + SWE-Router (500 tasks, 8 models).
17
-
18
- ## What It Does
19
-
20
- ACO sits in front of any agent harness and makes cost-aware decisions:
21
- - Which model to use (tiny β†’ frontier β†’ specialist)
22
- - Whether to escalate based on output confidence
23
- - How much context to include
24
- - Whether to call tools
25
- - Whether to verify outputs
26
- - When to stop failing runs
27
-
28
- ## v11 Results (Real SWE-bench, 500 tasks Γ— 8 models)
29
 
30
  | Policy | Success | Cost/Task | CostRed |
31
  |--------|---------|-----------|---------|
32
- | Oracle | 87.0% | $0.06 | 80.3% |
33
- | v11 + feedback | 74.8% | $0.20 | 36.9% |
34
- | v11 cascade | 67.4% | $0.12 | 62.5% |
35
- | Always frontier | 78.2% | $0.32 | baseline |
36
- | v8 (synthetic) | 65.8% | $0.35 | -11.6% |
37
 
38
- ## v9 Results (Synthetic, 3K traces)
39
 
40
- | Policy | Success | CostRed |
41
- |--------|---------|---------|
42
- | v9 + feedback | 90.0% | 2.1% |
43
- | v8 router | 83.7% | 8.5% |
44
- | Always frontier | 90.0% | baseline |
45
 
46
- ## Key Finding
47
 
48
- Training data matters more than architecture. v8 trained on synthetic data *increased* cost by 11.6%. v10 trained on 500 real outcomes *saved* 23.3%. v11 with 31K SPROUT rows saves 36.9%. Same XGBoost architecture throughout.
49
 
50
- ## Quick Start
51
 
52
- ```python
53
- from aco.router_v10 import V10Router
54
- from aco.per_step_router import PerStepRouter
 
 
 
 
 
 
 
 
55
 
56
- # Task-level routing
57
- v10 = V10Router(model_path="router_models/router_bundle_v11.pkl", success_threshold=0.70)
58
- d = v10.route_cascade("Fix the auth bug in production")
59
- print(f"Tier: {d.tier}, Model: {d.model}, Cost: ${d.cost_estimate:.2f}")
60
 
61
- # Per-step routing
62
- ps = PerStepRouter(max_budget=2.0)
63
- d = ps.route_step("Search for the bug", step_num=1, task_risk="medium")
64
- print(f"Step: {d.step_type.value}, Tier: {d.adjusted_tier}")
65
- ```
66
 
67
- ## 11 Modules
68
 
69
- 1. Cost Telemetry Collector
70
- 2. Task Cost Classifier
71
- 3. Model Cascade Router (v11 XGBoost)
72
- 4. Execution-Feedback Router (entropy cascade)
73
- 5. Context Budgeter
74
- 6. Cache-Aware Prompt Layout
75
- 7. Tool-Use Cost Gate
76
- 8. Verifier Budgeter
77
- 9. Retry/Recovery Optimizer
78
- 10. Meta-Tool Miner
79
- 11. Doom Detector
 
80
 
81
  ## Links
82
 
83
  - **Model**: [narcolepticchicken/agent-cost-optimizer](https://huggingface.co/narcolepticchicken/agent-cost-optimizer)
84
  - **Dataset**: [narcolepticchicken/agent-cost-traces](https://huggingface.co/datasets/narcolepticchicken/agent-cost-traces)
85
  - **Dashboard**: [narcolepticchicken/aco-dashboard](https://huggingface.co/spaces/narcolepticchicken/aco-dashboard)
86
- - **Blog Post**: [docs/technical_blog.md](https://huggingface.co/narcolepticchicken/agent-cost-optimizer/blob/main/docs/technical_blog.md)
87
- - **Final Report**: [docs/final_report.md](https://huggingface.co/narcolepticchicken/agent-cost-optimizer/blob/main/docs/final_report.md)
88
-
89
- ## License
90
-
91
- MIT
92
-
93
- <!-- ml-intern-provenance -->
94
- ## Generated by ML Intern
95
-
96
- This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.
97
-
98
- - Try ML Intern: https://smolagents-ml-intern.hf.space
99
- - Source code: https://github.com/huggingface/ml-intern
100
-
101
- ## Usage
102
-
103
- ```python
104
- from transformers import AutoModelForCausalLM, AutoTokenizer
105
-
106
- model_id = 'narcolepticchicken/agent-cost-optimizer'
107
- tokenizer = AutoTokenizer.from_pretrained(model_id)
108
- model = AutoModelForCausalLM.from_pretrained(model_id)
109
- ```
110
-
111
- For non-causal architectures, replace `AutoModelForCausalLM` with the appropriate `AutoModel` class.
 
1
+ # ACO: Agent Cost Optimizer
2
+
3
+ A universal control layer that reduces autonomous agent cost while preserving task quality.
4
+
5
+ ## Quick Results (SWE-bench, 500 coding tasks, 8 real models)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
6
 
7
  | Policy | Success | Cost/Task | CostRed |
8
  |--------|---------|-----------|---------|
9
+ | Oracle | 87.0% | $0.062 | 80.3% |
10
+ | **v10+feedback** | **84.8%** | **$0.201** | **36.4%** |
11
+ | v10 direct | 76.6% | $0.188 | 40.7% |
12
+ | Always frontier | 78.2% | $0.317 | baseline |
13
+ | Always cheap | 63.2% | $0.014 | 95.5% |
14
 
15
+ **Key finding: v10+feedback strictly dominates always-frontier** β€” lower cost AND higher quality. This is not a cost-quality tradeoff.
16
 
17
+ ## BERT Router Results
 
 
 
 
18
 
19
+ DistilBERT was fine-tuned on SPROUT for binary classification. The binary classifier fails for tier routing β€” it ignores tier prefixes and predicts P(success) β‰ˆ 89.5% for all tiers, routing everything to the cheapest model.
20
 
21
+ A 5-class retraining is in progress (job `69fd8cccaff1cd33e8f30714`).
22
 
23
+ ## 11 Modules
24
 
25
+ 1. Cost Telemetry Collector β€” `aco/telemetry.py`
26
+ 2. Task Cost Classifier β€” `aco/classifier.py`
27
+ 3. Model Cascade Router (XGBoost + isotonic) β€” `aco/router_v10.py`
28
+ 4. Execution-Feedback Router (entropy cascade) β€” `aco/execution_feedback.py`
29
+ 5. Context Budgeter β€” `aco/context_budgeter.py`
30
+ 6. Cache-Aware Prompt Layout β€” `aco/cache_layout.py`
31
+ 7. Tool-Use Cost Gate β€” `aco/tool_gate.py`
32
+ 8. Verifier Budgeter β€” `aco/verifier_budgeter.py`
33
+ 9. Retry/Recovery Optimizer β€” `aco/retry_optimizer.py`
34
+ 10. Meta-Tool Miner β€” `aco/meta_tool_miner.py`
35
+ 11. Doom Detector β€” `aco/doom_detector.py`
36
 
37
+ ## New Modules (this session)
 
 
 
38
 
39
+ - **Conformal Calibration** β€” `aco/conformal.py` β€” RouteNLP-style distribution-free escalation guarantees
40
+ - **Pareto Frontier** β€” `aco/pareto.py` β€” RouterBench NDCH + RouteLLM CPT/APGR metrics
41
+ - **Integration Test** β€” `tests/test_integration.py` β€” Full pipeline test
 
 
42
 
43
+ ## Key Takeaway
44
 
45
+ Training on real execution data matters more than architecture. v8 trained on synthetic data *increased* cost by 11.6%. v10 trained on 500 real SWE-Router outcomes *saved* 36.4%. Same XGBoost, same features.
46
+
47
+ ## Documentation
48
+
49
+ - [Final Report](docs/final_report_v2.md)
50
+ - [Pareto Frontier Report](docs/pareto_frontier_report.md)
51
+ - [Conformal Calibration Report](docs/conformal_report.md)
52
+ - [BERT Eval Report](docs/bert_eval_report.md)
53
+ - [Literature Review](docs/literature_review.md)
54
+ - [Deployment Guide](docs/deployment_guide.md)
55
+ - [Technical Blog](docs/technical_blog.md)
56
+ - [Roadmap](docs/ROADMAP.md)
57
 
58
  ## Links
59
 
60
  - **Model**: [narcolepticchicken/agent-cost-optimizer](https://huggingface.co/narcolepticchicken/agent-cost-optimizer)
61
  - **Dataset**: [narcolepticchicken/agent-cost-traces](https://huggingface.co/datasets/narcolepticchicken/agent-cost-traces)
62
  - **Dashboard**: [narcolepticchicken/aco-dashboard](https://huggingface.co/spaces/narcolepticchicken/aco-dashboard)