narcolepticchicken committed on
Commit
4b7e4c6
·
verified ·
1 Parent(s): 391292b

Upload README.md with huggingface_hub

  1. README.md +56 -260
README.md CHANGED
@@ -1,306 +1,102 @@
1
- ---
2
- tags:
3
- - ml-intern
4
- ---
5
  # Agent Cost Optimizer (ACO)
6
 
7
  A universal control layer that reduces total cost of autonomous agent runs while **preserving task quality**.
8
 
9
  **Repository:** https://huggingface.co/narcolepticchicken/agent-cost-optimizer
10
- **Benchmark Results:** 28% cost reduction at iso-quality (94.3% success rate)
11
  **License:** MIT
12
- **Status:** Production-ready control layer, not a generative model
13
 
14
  ---
15
 
16
  ## What It Does
17
 
18
- Agent Cost Optimizer (ACO) is a **compound decision system** that bolts onto any agent harness (LangChain, AutoGPT, OpenAI Assistants, custom) and makes cost-aware decisions at every step of an agent run:
19
 
20
- - **Which model to use** (tiny local → cheap cloud → medium → frontier → specialist)
21
- - **How much context to send** (keep, summarize, omit, retrieve on-demand)
22
- - **How to structure prompts** for cache reuse
23
  - **Which tools to call** (skip, batch, use cached result)
24
- - **When to verify** (only high-risk outputs, not everything)
25
  - **When to stop** (detect doomed runs before costs spiral)
26
- - **When to reuse** past successful workflows
27
-
28
- ### Core Result
29
-
30
- On a benchmark of 2,000 synthetic agent traces across 19 realistic scenarios:
31
-
32
- | Baseline | Success Rate | Cost/Success | Total Cost | Savings |
33
- |----------|-------------|--------------|-----------|---------|
34
- | always_frontier (GPT-4o) | 94.3% | $0.2907 | $548.31 | — |
35
- | always_cheap (GPT-4o-mini) | 16.2% | $0.2531 | $82.25 | Unsafe |
36
- | cascade only | 73.9% | $0.2984 | $440.98 | Low quality |
37
- | **full_optimizer (ACO)** | **94.3%** | **$0.2089** | **$393.98** | **28.1%** |
38
-
39
- **ACO matches frontier model quality while cutting cost by 28%.**
40
-
41
- ---
42
-
43
- ## Architecture
44
-
45
- ACO is **10 interlocking modules** sharing a single normalized trace schema:
46
-
47
- | Module | What It Decides |
48
- |--------|----------------|
49
- | 1. Cost Telemetry Collector | Records every model call, tool call, cost, latency, failure |
50
- | 2. Task Cost Classifier | Predicts expected cost, risk, model strength needed |
51
- | 3. Model Cascade Router | Chooses cheapest acceptable model tier |
52
- | 4. Context Budgeter | Keeps what matters, omits/summarizes the rest |
53
- | 5. Cache-Aware Prompt Layout | Structures prompts for prefix-cache reuse |
54
- | 6. Tool-Use Cost Gate | Skips/batches/caches tool calls when not worth the cost |
55
- | 7. Verifier Budgeter | Verifies only high-risk outputs |
56
- | 8. Retry/Recovery Optimizer | Learns from failures instead of blind retry loops |
57
- | 9. Meta-Tool Miner | Compresses repeated workflows into reusable macros |
58
- | 10. Doom Detector | Stops failing runs before costs spiral |
59
-
60
- ---
61
-
62
- ## Installation
63
-
64
- ```bash
65
- pip install -e .
66
- ```
67
-
68
- ## Quick Start
69
-
70
- ```python
71
- from aco import AgentCostOptimizer
72
- from aco.config import ACOConfig, ModelConfig, RoutingPolicy
73
-
74
- config = ACOConfig(
75
- models={
76
- "gpt-4o-mini": ModelConfig(
77
- model_id="gpt-4o-mini", provider="openai",
78
- cost_per_1k_input=0.00015, cost_per_1k_output=0.0006,
79
- strength_tier=2, max_context=128000,
80
- ),
81
- "gpt-4o": ModelConfig(
82
- model_id="gpt-4o", provider="openai",
83
- cost_per_1k_input=0.0025, cost_per_1k_output=0.01,
84
- strength_tier=4, max_context=128000,
85
- ),
86
- },
87
- routing_policy=RoutingPolicy("cascade"),
88
- )
89
-
90
- optimizer = AgentCostOptimizer(config)
91
-
92
- # Before each agent step
93
- result = optimizer.optimize(
94
- user_request="Write a Python function to reverse a linked list",
95
- run_state={
96
- "trace_id": "run-001",
97
- "planned_tools": [("file_read", {"path": "linked_list.py"})],
98
- "routing_mode": "cascade",
99
- },
100
- )
101
-
102
- # Use the decisions
103
- print(f"Use model: {result.routing_decision.model_id}")
104
- print(f"Max tokens: {result.routing_decision.max_tokens}")
105
- print(f"Estimated cost: ${result.estimated_cost:.4f}")
106
- ```
107
-
108
- See `docs/deployment_guide.md` for full integration patterns and `examples/end_to_end_demo.py` for a complete walkthrough.
109
-
110
- ---
111
-
112
- ## Repository Structure
113
-
114
- ```
115
- narcolepticchicken/agent-cost-optimizer
116
- ├── aco/ # Core package
117
- │ ├── __init__.py # Main optimizer class
118
- │ ├── config.py # Configuration dataclasses
119
- │ ├── trace_schema.py # Normalized trace schema
120
- │ ├── telemetry.py # Cost telemetry collector
121
- │ ├── classifier.py # Task cost classifier
122
- │ ├── router.py # Model cascade router
123
- │ ├── learned_router.py # Trainable router classifier
124
- │ ├── context_budgeter.py # Context selection
125
- │ ├── cache_layout.py # Cache-aware prompt layout
126
- │ ├── tool_gate.py # Tool-use cost gate
127
- │ ├── verifier_budgeter.py # Selective verifier
128
- │ ├── retry_optimizer.py # Retry/recovery optimizer
129
- │ ├── meta_tool_miner.py # Workflow compression
130
- │ ├── doom_detector.py # Early termination detector
131
- │ ├── trackio_integration.py # Trackio monitoring
132
- │ ├── benchmarks/ # Benchmark suite
133
- │ └── datasets/ # Synthetic trace generator
134
- ├── examples/ # Integration examples
135
- │ ├── end_to_end_demo.py # Full demo with simulated inference
136
- │ └── integration_example.py # Agent harness integration
137
- ├── standalone_eval_v2.py # Benchmark runner (N=2000)
138
- ├── dashboard.py # Gradio dashboard
139
- ├── app.py # HF Space entrypoint
140
- ├── docs/ # Documentation
141
- │ ├── literature_review.md # 50+ paper survey
142
- │ ├── final_report.md # Complete technical report
143
- │ ├── model_card.md # Model card
144
- │ ├── deployment_guide.md # Production deployment
145
- │ └── technical_blog.md # Technical blog post
146
- ├── config.yaml # Example configuration
147
- ├── setup.py # Package setup
148
- └── requirements.txt # Dependencies
149
- ```
150
-
151
- ---
152
-
153
- ## Benchmarking
154
-
155
- ```bash
156
- # Generate 2,000 synthetic traces and run all baselines + ablations
157
- python standalone_eval_v2.py --tasks 2000 --output ./eval_results_v2
158
-
159
- # Launch dashboard
160
- python dashboard.py --results ./eval_results_v2/baseline_results.json
161
- ```
162
-
163
- ---
164
-
165
- ## Key Results
166
-
167
- ### Baseline Comparison
168
-
169
- | Baseline | Success | Cost/Success | False-DONE | Cheap Miss |
170
- |----------|---------|--------------|------------|------------|
171
- | always_frontier | 94.3% | $0.2907 | 1.9% | 9.3% |
172
- | always_cheap | 16.2% | $0.2531 | 1.9% | 1.7% |
173
- | static | 73.6% | $0.2462 | 1.9% | 5.1% |
174
- | cascade | 73.9% | $0.2984 | 1.9% | 11.0% |
175
- | **full_optimizer** | **94.3%** | **$0.2089** | **1.9%** | **1.7%** |
176
-
177
- ### Ablation Study
178
-
179
- | Module Removed | Success Rate Change | Impact |
180
- |---------------|---------------------|--------|
181
- | Router | −20.7pp | Most critical for quality |
182
- | Tool Gate | −24.5pp | Second most critical |
183
- | Verifier | −23.2pp | Critical for safety |
184
- | Early Termination | −20.7pp | Key for cost control |
185
- | Context Budget | −20.7pp | Quality preserving |
186
-
187
- **No module is individually sufficient — they reinforce each other.**
188
 
189
  ---
190
 
191
- ## Cost-Quality Frontier
192
 
193
- Pareto-optimal configurations:
194
 
195
- 1. **full_optimizer**: 94.3% success at $0.2089/success ← **Best overall**
196
- 2. **always_frontier**: 94.3% success at $0.2907/success ← Maximum quality, about 39% more expensive per success
197
- 3. **static**: 73.6% success at $0.2462/success ← Budget option
198
 
199
- `always_cheap` and `cascade` are **not Pareto-optimal** — dominated by `full_optimizer`.
200
 
201
- ---
202
-
203
- ## Safety & Ethics
204
-
205
- - Legal/regulated tasks never downgraded below tier 4 without explicit override
206
- - Irreversible actions always escalate to frontier + verifier
207
- - All routing decisions include reasoning strings for audit
208
- - Cost-adjusted score penalizes cheap-model failures more than expensive successes
209
- - Doom detector prevents runaway costs on failing runs
210
- - Every module can be individually enabled or disabled via config
211
 
212
  ---
213
 
214
- ## Citation
215
 
216
- ```bibtex
217
- @software{agent_cost_optimizer_2025,
218
- title={Agent Cost Optimizer: A Universal Control Layer for Cost-Effective Autonomous Agents},
219
- author={ML Intern},
220
- year={2025},
221
- url={https://huggingface.co/narcolepticchicken/agent-cost-optimizer}
222
- }
223
- ```
224
 
225
- ---
226
-
227
- ## Next Steps
228
-
229
- 1. **Train learned router** on 10K+ real traces (RouteLLM-style)
230
- 2. **Interactive benchmark** against SWE-bench / BFCL with real model calls
231
- 3. **Online learning** from live trace feedback
232
- 4. **Verifier cascading** (cheap verifier → expensive verifier only on disagreement)
233
- 5. **KV cache sharing** across concurrent agents via vLLM/SGLang
234
- 6. **Cross-provider routing** (DeepSeek vs OpenAI at same tier)
235
 
236
  ---
237
 
238
- *Built autonomously by ML Intern on 2025-07-05.*
239
-
240
- <!-- ml-intern-provenance -->
241
- ## Generated by ML Intern
242
-
243
- This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.
244
-
245
- - Try ML Intern: https://smolagents-ml-intern.hf.space
246
- - Source code: https://github.com/huggingface/ml-intern
247
-
248
- ## Usage
249
 
250
  ```python
251
- from transformers import AutoModelForCausalLM, AutoTokenizer
252
 
253
- model_id = 'narcolepticchicken/agent-cost-optimizer'
254
- tokenizer = AutoTokenizer.from_pretrained(model_id)
255
- model = AutoModelForCausalLM.from_pretrained(model_id)
256
  ```
257
 
258
- For non-causal architectures, replace `AutoModelForCausalLM` with the appropriate `AutoModel` class.
259
-
260
-
261
- ## Trained Router (NEW)
262
 
263
- The heuristic router has been replaced with a **trained XGBoost router** using the CARROT architecture.
264
 
265
- ### Architecture: Difficulty-First + ML Confirmation + Safety Floors
266
 
267
  1. Map task_type to difficulty (1-5)
268
  2. Compute base_tier = min(difficulty + 1, 5)
269
- 3. Apply safety floor per task_type (e.g., legal → tier 4)
270
- 4. Use per-tier P(success|query) XGBoost classifiers to confirm or escalate
271
- 5. If P(success@base_tier) < threshold, escalate one tier
272
-
273
- ### Usage
274
-
275
- ```python
276
- from aco.learned_router import TrainedRouter
277
-
278
- # Load from Hub
279
- router = TrainedRouter.from_pretrained("narcolepticchicken/agent-cost-optimizer")
280
 
281
- # Predict
282
- tier, confidence = router.predict(
283
- "Write a Python function to reverse a linked list",
284
- "coding",
285
- difficulty=3,
286
- )
287
- print(f"Recommended: tier {tier} (confidence: {confidence:.2f})")
288
- ```
289
 
290
- ### Benchmark Results (2K eval traces)
291
 
292
- | Router | Success | AvgCost | Unsafe |
293
- |--------|---------|---------|--------|
294
- | trained (t=0.55) | 85.5% | 1.107 | 4.1% |
295
- | trained (t=0.65) | 91.9% | 1.365 | 1.5% |
296
- | always_frontier | 88.8% | 1.000 | 2.5% |
297
- | heuristic_diff+1 | 83.4% | 0.940 | 4.9% |
298
- | oracle | 99.8% | 0.486 | 0.0% |
299
 
300
- The trained router at t=0.65 **outperforms always-frontier on success rate** (91.9% vs 88.8%) with lower unsafe miss rate (1.5% vs 2.5%).
301
 
302
- ### Training Data
303
 
304
- 50,000 synthetic traces with ground-truth per-tier success labels. Each trace includes all 5 tier outcomes, enabling the per-tier classifiers to learn from balanced success/failure examples.
305
 
306
- See `docs/trained_router_report.md` for full details.
1
  # Agent Cost Optimizer (ACO)
2
 
3
  A universal control layer that reduces total cost of autonomous agent runs while **preserving task quality**.
4
 
5
  **Repository:** https://huggingface.co/narcolepticchicken/agent-cost-optimizer
6
+ **Trained Router:** Hybrid heuristic + XGBoost safety net
7
  **License:** MIT
8
 
9
  ---
10
 
11
  ## What It Does
12
 
13
+ Agent Cost Optimizer (ACO) bolts onto any agent harness and makes cost-aware decisions at every step:
14
 
15
+ - **Which model to use** (tiny local to frontier)
16
+ - **How much context to send** (keep, summarize, omit, retrieve)
17
  - **Which tools to call** (skip, batch, use cached result)
18
+ - **When to verify** (only high-risk outputs)
19
  - **When to stop** (detect doomed runs before costs spiral)
20
 
21
  ---
22
 
23
+ ## Trained Router Results (N=2,000 eval traces)
24
 
25
+ After 7 iterations of training (v1-v7), the best production router is a **hybrid heuristic + ML safety net**:
26
 
27
+ | Router | Success | Cost Reduction | Unsafe Miss |
28
+ |--------|---------|----------------|-------------|
29
+ | v4 (t=0.65, safety-first) | **91.9%** | -36.5% | **1.5%** |
30
+ | v7 (s=0.25, d=0.85, hybrid) | 83.8% | **9.2%** | 4.8% |
31
+ | heuristic (diff+1) | 84.1% | 7.3% | 4.7% |
32
+ | always_frontier | 89.3% | 0% | 2.3% |
33
+ | oracle (perfect routing) | 99.8% | **52.3%** | 0.0% |
34
 
35
+ ### Key Findings
36
 
37
+ 1. **v4 at t=0.65 beats frontier on quality** (91.9% vs 89.3% success) with lower unsafe rate (1.5% vs 2.3%)
38
+ 2. **v7 hybrid adds about 1.9pp cost reduction** over heuristic (9.2% vs 7.3%) at a 0.3pp lower success rate (83.8% vs 84.1%)
39
+ 3. **Oracle routing achieves 52.3% savings**, so substantial headroom remains
40
+ 4. The ML safety net catches cases the heuristic misses; the cost saver identifies unnecessary escalation
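The Cost Reduction column appears to be measured against `always_frontier`, whose row is 0%. A minimal sketch of that arithmetic, assuming costs are normalized so that `always_frontier` costs 1.0. The function name is hypothetical and not part of the `aco` package.

```python
def cost_reduction_pct(avg_cost: float, frontier_cost: float = 1.0) -> float:
    """Percent saved relative to always_frontier; negative means more expensive.

    Assumption: both costs are in the same (normalized) unit.
    """
    if frontier_cost <= 0:
        raise ValueError("frontier_cost must be positive")
    return (1.0 - avg_cost / frontier_cost) * 100.0


# A router with normalized average cost 1.365 spends more than frontier,
# so its cost reduction comes out negative.
print(f"{cost_reduction_pct(1.365):.1f}%")
```

The value 1.365 is the AvgCost the previous README reported for the trained router at t=0.65, which lines up with the -36.5% listed for v4 in the table.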
41
 
42
  ---
43
 
44
+ ## Architecture: 10 Modules + Trained Router
45
 
46
+ ACO consists of 10 interlocking modules + a trained XGBoost router:
47
 
48
+ | Module | Decision |
49
+ |--------|----------|
50
+ | 1. Cost Telemetry | Records every call, cost, failure |
51
+ | 2. Task Classifier | Predicts risk, model tier needed |
52
+ | 3. **Trained Router** | Hybrid heuristic + ML confirmation |
53
+ | 4. Context Budgeter | Keeps what matters, omits rest |
54
+ | 5. Cache Layout | Optimizes for prefix-cache reuse |
55
+ | 6. Tool Gate | Skips unnecessary tool calls |
56
+ | 7. Verifier Budgeter | Verifies only high-risk outputs |
57
+ | 8. Retry Optimizer | Learns from failures |
58
+ | 9. Meta-Tool Miner | Compresses repeated workflows |
59
+ | 10. Doom Detector | Stops failing runs early |
60
 
61
  ---
62
 
63
+ ## Quick Start
64
 
65
  ```python
66
+ from aco.learned_router import TrainedRouter
67
 
68
+ router = TrainedRouter.from_pretrained("narcolepticchicken/agent-cost-optimizer")
69
+ tier, confidence = router.predict(
70
+ "Write a Python function to reverse a linked list",
71
+ "coding", difficulty=3)
72
+ print(f"Recommended: tier {tier} (confidence: {confidence:.2f})")
73
  ```
74
 
75
+ ---
76
 
77
+ ## What Makes The Trained Router Work
78
 
79
+ **Architecture: Difficulty-First + ML Confirmation + Safety Floors**
80
 
81
  1. Map task_type to difficulty (1-5)
82
  2. Compute base_tier = min(difficulty + 1, 5)
83
+ 3. Apply safety floor (legal → tier 4)
84
+ 4. Check P(success@base_tier) with XGBoost; if low, escalate
85
+ 5. Check P(success@tier-1); if high, downgrade (cost saver)
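The five steps above can be sketched as a single routing function. This is an illustrative sketch, not the code in `aco/learned_router.py`: the lookup tables, the `route` function, and the `p_success` callable (standing in for the per-tier XGBoost classifiers) are all hypothetical. The default thresholds assume that `s=0.25` and `d=0.85` in the v7 row are the escalation and downgrade thresholds, which the README does not confirm.

```python
from typing import Callable, Dict

# Hypothetical lookup tables; the real mappings are not given in the README.
DIFFICULTY: Dict[str, int] = {"chat": 1, "coding": 3, "legal": 3}
SAFETY_FLOOR: Dict[str, int] = {"legal": 4}
MIN_TIER, MAX_TIER = 1, 5


def route(task_type: str,
          p_success: Callable[[int], float],
          escalate_below: float = 0.25,
          downgrade_above: float = 0.85) -> int:
    """Return a model tier (1-5) for a task.

    p_success(tier) stands in for a calibrated P(success | query) at that tier.
    """
    difficulty = DIFFICULTY.get(task_type, 3)        # step 1
    base_tier = min(difficulty + 1, MAX_TIER)        # step 2
    floor = SAFETY_FLOOR.get(task_type, MIN_TIER)    # step 3
    tier = max(base_tier, floor)

    # Step 4: escalate one tier when success at the chosen tier looks unlikely.
    if p_success(tier) < escalate_below:
        return min(tier + 1, MAX_TIER)

    # Step 5: downgrade one tier when the cheaper tier is very likely to
    # succeed, but never below the safety floor.
    lower = tier - 1
    if lower >= floor and p_success(lower) > downgrade_above:
        return lower
    return tier
```

In this sketch the safety floor is also enforced on the downgrade path, so a `legal` task stays at tier 4 even when the cheaper tier looks likely to succeed. That is a design choice of the sketch, not something the README states.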
86
 
87
+ **Training Data:** 50K synthetic traces, 5 per-tier XGBoost classifiers, isotonic regression calibration, 23 features.
88
 
89
+ ---
90
 
91
+ ## Next Steps
92
 
93
+ 1. **Execution feedback features**: Use first model output as routing signal
94
+ 2. **Confidence from generation**: Model entropy as escalation signal
95
+ 3. **Multi-step routing**: Route per-step, not per-task
96
+ 4. **Real agent traces**: Train on SWE-bench/BFCL execution data
97
 
98
+ See `docs/trained_router_final_report.md` for full analysis.
99
 
100
+ ---
101
 
102
+ *Built autonomously by ML Intern, 2025-07-05.*