narcolepticchicken committed on
Commit e4cea93 · verified · 1 Parent(s): bd77292

Upload README.md

Files changed (1)
  1. README.md +44 -65
README.md CHANGED
@@ -8,104 +8,83 @@ tags:
8
  - cascade-routing
9
  - execution-feedback
10
  - swebench
11
- - ml-intern
12
  ---
13
 
14
- # ACO v10: Agent Cost Optimizer
15
 
16
- A universal control layer that reduces the cost of autonomous agent runs while preserving task quality. **Trained on real execution data from 8 models across 500 SWE-bench tasks.**
17
 
18
- ## What's New in v10
19
 
20
- - **Real-data training**: XGBoost models trained on SWE-Router traces (500 tasks × 8 models = 4,000 real outcomes)
21
- - **v10 cascade**: 49.5% cost reduction at 67.8% success
22
- - **v10 + feedback**: 23.3% cost reduction at 73.8% success (only 4.4pp below frontier)
23
- - **Per-step routing**: Routes each agent step independently (search→cheap, edit→stronger)
24
- - **Execution feedback**: Uses cheap model output confidence to decide escalation
 
 
25
 
26
- ## Real-World Results
27
 
28
- ### SWE-bench (500 coding tasks, 8 models)
29
-
30
- | Policy | Success | Cost | CostRed |
31
- |--------|---------|------|---------|
32
  | Oracle | 87.0% | $0.06 | 80.3% |
33
- | **v10 feedback** | **73.8%** | **$0.24** | **23.3%** |
34
- | v10 cascade | 67.8% | $0.16 | 49.5% |
35
  | Always frontier | 78.2% | $0.32 | baseline |
36
  | v8 (synthetic) | 65.8% | $0.35 | -11.6% |
37
 
38
- ### BFCL v3 (82K traces, 108 models)
39
- - 84.1% of tasks solvable by cheaper models
40
- - 82.5% need only tier 1
41
 
42
- ### Synthetic Benchmark (3K traces)
43
- - v9 feedback: 90.0% success at 2.1% cost reduction (matches frontier)
44
- - v8 router: 83.7% success at 8.5% cost reduction
 
 
45
 
46
- ## 11 Modules
47
 
48
- 1. Cost Telemetry Collector
49
- 2. Task Cost Classifier
50
- 3. Model Cascade Router (v10: real-data trained)
51
- 4. Execution-Feedback Router (v9: output confidence cascade)
52
- 5. Context Budgeter
53
- 6. Cache-Aware Prompt Layout
54
- 7. Tool-Use Cost Gate
55
- 8. Verifier Budgeter
56
- 9. Retry/Recovery Optimizer
57
- 10. Meta-Tool Miner
58
- 11. Doom Detector
59
 
60
  ## Quick Start
61
 
62
  ```python
63
  from aco.router_v10 import V10Router
64
- v10 = V10Router(model_path="router_models/router_bundle_v10_fixed.pkl", success_threshold=0.70)
65
- decision = v10.route_cascade("Fix the auth bug in production")
66
- print(f"Tier: {decision.tier}, Model: {decision.model}, Cost: ${decision.cost_estimate:.2f}")
67
- ```
68
 
69
- ## Per-Step Routing
 
 
 
70
 
71
- ```python
72
- from aco.per_step_router import PerStepRouter
73
  ps = PerStepRouter(max_budget=2.0)
74
  d = ps.route_step("Search for the bug", step_num=1, task_risk="medium")
75
- print(f"Step type: {d.step_type.value}, Tier: {d.adjusted_tier}, Cost: ${d.cost_estimate:.2f}")
76
  ```
77
 
78
- ## Key Insight
79
 
80
- **Training on real execution data matters enormously.** The v8 router trained on synthetic data achieved -11.6% cost reduction on SWE-bench (it actually cost MORE than always-frontier). The v10 router trained on real SWE-Router data achieves 23.3% cost reduction at comparable quality. The gap: 34.9 percentage points.
 
 
 
 
 
 
 
 
 
 
81
 
82
  ## Links
83
 
84
  - **Model**: [narcolepticchicken/agent-cost-optimizer](https://huggingface.co/narcolepticchicken/agent-cost-optimizer)
85
  - **Dataset**: [narcolepticchicken/agent-cost-traces](https://huggingface.co/datasets/narcolepticchicken/agent-cost-traces)
86
  - **Dashboard**: [narcolepticchicken/aco-dashboard](https://huggingface.co/spaces/narcolepticchicken/aco-dashboard)
87
- - **SWE-Router data**: [SWE-Router/swebench-verified-*](https://huggingface.co/SWE-Router)
 
88
 
89
  ## License
90
 
91
  MIT
92
-
93
- <!-- ml-intern-provenance -->
94
- ## Generated by ML Intern
95
-
96
- This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.
97
-
98
- - Try ML Intern: https://smolagents-ml-intern.hf.space
99
- - Source code: https://github.com/huggingface/ml-intern
100
-
101
- ## Usage
102
-
103
- ```python
104
- from transformers import AutoModelForCausalLM, AutoTokenizer
105
-
106
- model_id = 'narcolepticchicken/agent-cost-optimizer'
107
- tokenizer = AutoTokenizer.from_pretrained(model_id)
108
- model = AutoModelForCausalLM.from_pretrained(model_id)
109
- ```
110
-
111
- For non-causal architectures, replace `AutoModelForCausalLM` with the appropriate `AutoModel` class.
 
8
  - cascade-routing
9
  - execution-feedback
10
  - swebench
 
11
  ---
12
 
13
+ # ACO v11: Agent Cost Optimizer
14
 
15
+ A universal control layer that reduces autonomous agent cost while preserving task quality. Trained on real execution data from SPROUT (31K rows, 13 models) + SWE-Router (500 tasks, 8 models).
16
 
17
+ ## What It Does
18
 
19
+ ACO sits in front of any agent harness and makes cost-aware decisions:
20
+ - Which model to use (tiny → frontier → specialist)
21
+ - Whether to escalate based on output confidence
22
+ - How much context to include
23
+ - Whether to call tools
24
+ - Whether to verify outputs
25
+ - When to stop failing runs
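
The escalation decision in the list above can be illustrated with a small entropy check. This is only a sketch under stated assumptions: the README names an "entropy cascade" but does not show its implementation, so the function names `mean_token_entropy` and `should_escalate`, the input format (one probability distribution per generated token), and the `0.5` threshold are all hypothetical.

```python
import math
from typing import Sequence


def mean_token_entropy(token_distributions: Sequence[Sequence[float]]) -> float:
    """Average Shannon entropy (in nats) over per-token probability distributions."""
    if not token_distributions:
        raise ValueError("need at least one token distribution")
    total = 0.0
    for probs in token_distributions:
        # Zero-probability entries contribute nothing to the entropy.
        total += -sum(p * math.log(p) for p in probs if p > 0.0)
    return total / len(token_distributions)


def should_escalate(
    token_distributions: Sequence[Sequence[float]],
    entropy_threshold: float = 0.5,  # hypothetical value, not taken from ACO
) -> bool:
    """Escalate to a stronger model when the cheap model's output looks uncertain."""
    return mean_token_entropy(token_distributions) > entropy_threshold


# Toy inputs: each inner list is the cheap model's distribution for one token.
confident = [[0.99, 0.01], [0.98, 0.02]]
uncertain = [[0.5, 0.5], [0.4, 0.6]]
print(should_escalate(confident))  # False: low entropy, keep the cheap answer
print(should_escalate(uncertain))  # True: high entropy, escalate
```

In a real cascade the threshold would presumably be tuned on held-out traces, trading extra frontier calls against recovered successes.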
26
 
27
+ ## v11 Results (Real SWE-bench, 500 tasks × 8 models)
28
 
29
+ | Policy | Success | Cost/Task | CostRed |
30
+ |--------|---------|-----------|---------|
 
 
31
  | Oracle | 87.0% | $0.06 | 80.3% |
32
+ | v11 + feedback | 74.8% | $0.20 | 36.9% |
33
+ | v11 cascade | 67.4% | $0.12 | 62.5% |
34
  | Always frontier | 78.2% | $0.32 | baseline |
35
  | v8 (synthetic) | 65.8% | $0.35 | -11.6% |
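
As a reading aid for the CostRed column, the sketch below computes percent cost reduction relative to the always-frontier baseline. It assumes CostRed is defined as `1 - cost / baseline_cost`; the README does not state the formula, and the helper name `cost_reduction_pct` is hypothetical. Because the Cost/Task column is rounded to cents, values recomputed this way will not exactly match the table, which was presumably computed from unrounded costs.

```python
def cost_reduction_pct(policy_cost: float, baseline_cost: float) -> float:
    """Percent cost saved relative to the baseline; negative means the policy costs more."""
    if baseline_cost <= 0:
        raise ValueError("baseline_cost must be positive")
    return (1.0 - policy_cost / baseline_cost) * 100.0


# Rounded per-task costs copied from the table; always-frontier is the baseline.
BASELINE_COST = 0.32
rounded_costs = {
    "Oracle": 0.06,
    "v11 + feedback": 0.20,
    "v11 cascade": 0.12,
    "v8 (synthetic)": 0.35,
}

for policy, cost in rounded_costs.items():
    print(f"{policy}: {cost_reduction_pct(cost, BASELINE_COST):+.1f}%")
```

A negative result, as for v8, means the policy spent more per task than always calling the frontier model.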
36
 
37
+ ## v9 Results (Synthetic, 3K traces)
 
 
38
 
39
+ | Policy | Success | CostRed |
40
+ |--------|---------|---------|
41
+ | v9 + feedback | 90.0% | 2.1% |
42
+ | v8 router | 83.7% | 8.5% |
43
+ | Always frontier | 90.0% | baseline |
44
 
45
+ ## Key Finding
46
 
47
+ Training data matters more than architecture. v8, trained on synthetic data, *increased* cost by 11.6% on SWE-bench. v10, trained on 4,000 real outcomes (500 SWE-Router tasks × 8 models), *saved* 23.3%. v11, which adds 31K SPROUT rows, saves 36.9%. All three use the same XGBoost architecture.
 
 
 
 
 
 
 
 
 
 
48
 
49
  ## Quick Start
50
 
51
  ```python
52
  from aco.router_v10 import V10Router
53
+ from aco.per_step_router import PerStepRouter
 
 
 
54
 
55
+ # Task-level routing
56
+ v10 = V10Router(model_path="router_models/router_bundle_v11.pkl", success_threshold=0.70)
57
+ d = v10.route_cascade("Fix the auth bug in production")
58
+ print(f"Tier: {d.tier}, Model: {d.model}, Cost: ${d.cost_estimate:.2f}")
59
 
60
+ # Per-step routing
 
61
  ps = PerStepRouter(max_budget=2.0)
62
  d = ps.route_step("Search for the bug", step_num=1, task_risk="medium")
63
+ print(f"Step: {d.step_type.value}, Tier: {d.adjusted_tier}")
64
  ```
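
To make the per-step idea concrete (cheap models for search steps, stronger ones for edits), here is a toy, keyword-based sketch. It is not the `PerStepRouter` implementation: the `StepType` members, keyword sets, tier numbers, and the `route_step_sketch` function are all assumptions for illustration, with tier 1 taken to be the cheapest.

```python
import re
from enum import Enum


class StepType(Enum):
    SEARCH = "search"
    EDIT = "edit"
    OTHER = "other"


# Hypothetical keyword sets and tier assignments (tier 1 = cheapest).
SEARCH_WORDS = {"search", "find", "grep", "locate", "read"}
EDIT_WORDS = {"edit", "fix", "patch", "write", "refactor"}
TIER_FOR_STEP = {StepType.SEARCH: 1, StepType.EDIT: 3, StepType.OTHER: 2}


def classify_step(description: str) -> StepType:
    """Classify a step by whole-word keyword match; search keywords are checked first."""
    words = set(re.findall(r"[a-z]+", description.lower()))
    if words & SEARCH_WORDS:
        return StepType.SEARCH
    if words & EDIT_WORDS:
        return StepType.EDIT
    return StepType.OTHER


def route_step_sketch(description: str) -> int:
    """Return the model tier for a single agent step."""
    return TIER_FOR_STEP[classify_step(description)]


for step in ("Search for the bug", "Edit the config loader", "Summarize results"):
    print(step, "->", classify_step(step).value, "tier", route_step_sketch(step))
```

A learned classifier or the task's risk level could replace the keyword match; the point is only that each step is routed independently rather than fixing one model for the whole task.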
65
 
66
+ ## 11 Modules
67
 
68
+ 1. Cost Telemetry Collector
69
+ 2. Task Cost Classifier
70
+ 3. Model Cascade Router (v11 XGBoost)
71
+ 4. Execution-Feedback Router (entropy cascade)
72
+ 5. Context Budgeter
73
+ 6. Cache-Aware Prompt Layout
74
+ 7. Tool-Use Cost Gate
75
+ 8. Verifier Budgeter
76
+ 9. Retry/Recovery Optimizer
77
+ 10. Meta-Tool Miner
78
+ 11. Doom Detector
79
 
80
  ## Links
81
 
82
  - **Model**: [narcolepticchicken/agent-cost-optimizer](https://huggingface.co/narcolepticchicken/agent-cost-optimizer)
83
  - **Dataset**: [narcolepticchicken/agent-cost-traces](https://huggingface.co/datasets/narcolepticchicken/agent-cost-traces)
84
  - **Dashboard**: [narcolepticchicken/aco-dashboard](https://huggingface.co/spaces/narcolepticchicken/aco-dashboard)
85
+ - **Blog Post**: [docs/technical_blog.md](https://huggingface.co/narcolepticchicken/agent-cost-optimizer/blob/main/docs/technical_blog.md)
86
+ - **Final Report**: [docs/final_report.md](https://huggingface.co/narcolepticchicken/agent-cost-optimizer/blob/main/docs/final_report.md)
87
 
88
  ## License
89
 
90
  MIT