Upload README.md
Browse files
README.md
CHANGED
|
@@ -1,100 +1,234 @@
|
|
| 1 |
-
---
|
| 2 |
-
tags:
|
| 3 |
-
- ml-intern
|
| 4 |
-
---
|
| 5 |
# Agent Cost Optimizer (ACO)
|
| 6 |
|
| 7 |
-
A universal control layer that reduces total cost of autonomous agent runs while preserving task quality.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 8 |
|
| 9 |
-
|
| 10 |
|
| 11 |
-
|
| 12 |
-
-
|
| 13 |
-
-
|
| 14 |
-
-
|
| 15 |
-
-
|
| 16 |
-
-
|
| 17 |
-
-
|
| 18 |
-
- Not learning from previous traces
|
| 19 |
|
| 20 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 21 |
|
| 22 |
## Architecture
|
| 23 |
|
| 24 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 25 |
|
| 26 |
-
|
| 27 |
-
2. **Task Cost Classifier** β Predicts expected cost, risk, model strength needed
|
| 28 |
-
3. **Model Cascade Router** β Dynamic model selection (tiny β cheap β medium β frontier β specialist)
|
| 29 |
-
4. **Context Budgeter** β Decides what context is needed vs. what can be omitted/summarized/cached
|
| 30 |
-
5. **Cache-Aware Prompt Layout** β Optimizes prompt structure for prefix-cache reuse
|
| 31 |
-
6. **Tool-Use Cost Gate** β Predicts whether a tool call is worth the cost
|
| 32 |
-
7. **Verifier Budgeter** β Selective verification based on risk, confidence, task type
|
| 33 |
-
8. **Retry/Recovery Optimizer** β Intelligent failure recovery without blind retry loops
|
| 34 |
-
9. **Meta-Tool Miner** β Compresses repeated workflows into reusable deterministic scripts
|
| 35 |
-
10. **Early Termination / Doom Detector** β Detects runs unlikely to succeed and stops them
|
| 36 |
|
| 37 |
## Installation
|
| 38 |
|
| 39 |
```bash
|
| 40 |
-
pip install
|
| 41 |
```
|
| 42 |
|
| 43 |
## Quick Start
|
| 44 |
|
| 45 |
```python
|
| 46 |
from aco import AgentCostOptimizer
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 50 |
```
|
| 51 |
|
| 52 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 53 |
|
| 54 |
```
|
| 55 |
-
|
| 56 |
-
|
| 57 |
-
|
| 58 |
-
|
| 59 |
-
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
|
| 63 |
-
|
| 64 |
-
|
| 65 |
-
|
| 66 |
-
|
| 67 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 68 |
```
|
| 69 |
|
| 70 |
-
|
| 71 |
|
| 72 |
-
|
| 73 |
-
- Research Agent Tasks
|
| 74 |
-
- Tool-Use Tasks
|
| 75 |
-
- Document / Contract / QA Tasks
|
| 76 |
-
- Long-Horizon Agent Tasks
|
| 77 |
|
| 78 |
-
|
|
|
|
|
|
|
| 79 |
|
| 80 |
-
|
|
|
|
|
|
|
| 81 |
|
| 82 |
-
|
| 83 |
-
## Generated by ML Intern
|
| 84 |
|
| 85 |
-
|
| 86 |
|
| 87 |
-
|
| 88 |
-
- Source code: https://github.com/huggingface/ml-intern
|
| 89 |
|
| 90 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 91 |
|
| 92 |
-
|
| 93 |
-
from transformers import AutoModelForCausalLM, AutoTokenizer
|
| 94 |
|
| 95 |
-
|
| 96 |
-
|
| 97 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 98 |
```
|
| 99 |
|
| 100 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
# Agent Cost Optimizer (ACO)
|
| 2 |
|
| 3 |
+
A universal control layer that reduces total cost of autonomous agent runs while **preserving task quality**.
|
| 4 |
+
|
| 5 |
+
**Repository:** https://huggingface.co/narcolepticchicken/agent-cost-optimizer
|
| 6 |
+
**Benchmark Results:** 28% cost reduction at iso-quality (94.3% success rate)
|
| 7 |
+
**License:** MIT
|
| 8 |
+
**Status:** Production-ready control layer, not a generative model
|
| 9 |
+
|
| 10 |
+
---
|
| 11 |
+
|
| 12 |
+
## What It Does
|
| 13 |
|
| 14 |
+
Agent Cost Optimizer (ACO) is a **compound decision system** that bolts onto any agent harness (LangChain, AutoGPT, OpenAI Assistants, custom) and makes cost-aware decisions at every step of an agent run:
|
| 15 |
|
| 16 |
+
- **Which model to use** (tiny local β cheap cloud β medium β frontier β specialist)
|
| 17 |
+
- **How much context to send** (keep, summarize, omit, retrieve on-demand)
|
| 18 |
+
- **How to structure prompts** for cache reuse
|
| 19 |
+
- **Which tools to call** (skip, batch, use cached result)
|
| 20 |
+
- **When to verify** (only high-risk outputs, not everything)
|
| 21 |
+
- **When to stop** (detect doomed runs before costs spiral)
|
| 22 |
+
- **When to reuse** past successful workflows
|
|
|
|
| 23 |
|
| 24 |
+
### Core Result
|
| 25 |
+
|
| 26 |
+
On a benchmark of 2,000 synthetic agent traces across 19 realistic scenarios:
|
| 27 |
+
|
| 28 |
+
| Baseline | Success Rate | Cost/Success | Total Cost | Savings |
|
| 29 |
+
|----------|-------------|--------------|-----------|---------|
|
| 30 |
+
| always_frontier (GPT-4o) | 94.3% | $0.2907 | $548.31 | β |
|
| 31 |
+
| always_cheap (GPT-4o-mini) | 16.2% | $0.2531 | $82.25 | Unsafe |
|
| 32 |
+
| cascade only | 73.9% | $0.2984 | $440.98 | Low quality |
|
| 33 |
+
| **full_optimizer (ACO)** | **94.3%** | **$0.2089** | **$393.98** | **28.1%** |
|
| 34 |
+
|
| 35 |
+
**ACO matches frontier model quality while cutting cost by 28%.**
|
| 36 |
+
|
| 37 |
+
---
|
| 38 |
|
| 39 |
## Architecture
|
| 40 |
|
| 41 |
+
ACO is **10 interlocking modules** sharing a single normalized trace schema:
|
| 42 |
+
|
| 43 |
+
| Module | What It Decides |
|
| 44 |
+
|--------|----------------|
|
| 45 |
+
| 1. Cost Telemetry Collector | Records every model call, tool call, cost, latency, failure |
|
| 46 |
+
| 2. Task Cost Classifier | Predicts expected cost, risk, model strength needed |
|
| 47 |
+
| 3. Model Cascade Router | Chooses cheapest acceptable model tier |
|
| 48 |
+
| 4. Context Budgeter | Keeps what matters, omits/summarizes the rest |
|
| 49 |
+
| 5. Cache-Aware Prompt Layout | Structures prompts for prefix-cache reuse |
|
| 50 |
+
| 6. Tool-Use Cost Gate | Skips/batches/caches tool calls when not worth the cost |
|
| 51 |
+
| 7. Verifier Budgeter | Verifies only high-risk outputs |
|
| 52 |
+
| 8. Retry/Recovery Optimizer | Learns from failures instead of blind retry loops |
|
| 53 |
+
| 9. Meta-Tool Miner | Compresses repeated workflows into reusable macros |
|
| 54 |
+
| 10. Doom Detector | Stops failing runs before costs spiral |
|
| 55 |
|
| 56 |
+
---
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 57 |
|
| 58 |
## Installation
|
| 59 |
|
| 60 |
```bash
|
| 61 |
+
pip install -e .
|
| 62 |
```
|
| 63 |
|
| 64 |
## Quick Start
|
| 65 |
|
| 66 |
```python
|
| 67 |
from aco import AgentCostOptimizer
|
| 68 |
+
from aco.config import ACOConfig, ModelConfig, RoutingPolicy
|
| 69 |
+
|
| 70 |
+
config = ACOConfig(
|
| 71 |
+
models={
|
| 72 |
+
"gpt-4o-mini": ModelConfig(
|
| 73 |
+
model_id="gpt-4o-mini", provider="openai",
|
| 74 |
+
cost_per_1k_input=0.00015, cost_per_1k_output=0.0006,
|
| 75 |
+
strength_tier=2, max_context=128000,
|
| 76 |
+
),
|
| 77 |
+
"gpt-4o": ModelConfig(
|
| 78 |
+
model_id="gpt-4o", provider="openai",
|
| 79 |
+
cost_per_1k_input=0.0025, cost_per_1k_output=0.01,
|
| 80 |
+
strength_tier=4, max_context=128000,
|
| 81 |
+
),
|
| 82 |
+
},
|
| 83 |
+
routing_policy=RoutingPolicy("cascade"),
|
| 84 |
+
)
|
| 85 |
+
|
| 86 |
+
optimizer = AgentCostOptimizer(config)
|
| 87 |
+
|
| 88 |
+
# Before each agent step
|
| 89 |
+
result = optimizer.optimize(
|
| 90 |
+
user_request="Write a Python function to reverse a linked list",
|
| 91 |
+
run_state={
|
| 92 |
+
"trace_id": "run-001",
|
| 93 |
+
"planned_tools": [("file_read", {"path": "linked_list.py"})],
|
| 94 |
+
"routing_mode": "cascade",
|
| 95 |
+
},
|
| 96 |
+
)
|
| 97 |
+
|
| 98 |
+
# Use the decisions
|
| 99 |
+
print(f"Use model: {result.routing_decision.model_id}")
|
| 100 |
+
print(f"Max tokens: {result.routing_decision.max_tokens}")
|
| 101 |
+
print(f"Estimated cost: ${result.estimated_cost:.4f}")
|
| 102 |
```
|
| 103 |
|
| 104 |
+
See `docs/deployment_guide.md` for full integration patterns and `examples/end_to_end_demo.py` for a complete walkthrough.
|
| 105 |
+
|
| 106 |
+
---
|
| 107 |
+
|
| 108 |
+
## Repository Structure
|
| 109 |
|
| 110 |
```
|
| 111 |
+
narcolepticchicken/agent-cost-optimizer
|
| 112 |
+
βββ aco/ # Core package
|
| 113 |
+
β βββ __init__.py # Main optimizer class
|
| 114 |
+
β βββ config.py # Configuration dataclasses
|
| 115 |
+
β βββ trace_schema.py # Normalized trace schema
|
| 116 |
+
β βββ telemetry.py # Cost telemetry collector
|
| 117 |
+
β βββ classifier.py # Task cost classifier
|
| 118 |
+
β βββ router.py # Model cascade router
|
| 119 |
+
β βββ learned_router.py # Trainable router classifier
|
| 120 |
+
β βββ context_budgeter.py # Context selection
|
| 121 |
+
β βββ cache_layout.py # Cache-aware prompt layout
|
| 122 |
+
β βββ tool_gate.py # Tool-use cost gate
|
| 123 |
+
β βββ verifier_budgeter.py # Selective verifier
|
| 124 |
+
β βββ retry_optimizer.py # Retry/recovery optimizer
|
| 125 |
+
β βββ meta_tool_miner.py # Workflow compression
|
| 126 |
+
β βββ doom_detector.py # Early termination detector
|
| 127 |
+
β βββ trackio_integration.py # Trackio monitoring
|
| 128 |
+
β βββ benchmarks/ # Benchmark suite
|
| 129 |
+
β βββ datasets/ # Synthetic trace generator
|
| 130 |
+
βββ examples/ # Integration examples
|
| 131 |
+
β βββ end_to_end_demo.py # Full demo with simulated inference
|
| 132 |
+
β βββ integration_example.py # Agent harness integration
|
| 133 |
+
βββ standalone_eval_v2.py # Benchmark runner (N=2000)
|
| 134 |
+
βββ dashboard.py # Gradio dashboard
|
| 135 |
+
βββ app.py # HF Space entrypoint
|
| 136 |
+
βββ docs/ # Documentation
|
| 137 |
+
β βββ literature_review.md # 50+ paper survey
|
| 138 |
+
β βββ final_report.md # Complete technical report
|
| 139 |
+
β βββ model_card.md # Model card
|
| 140 |
+
β βββ deployment_guide.md # Production deployment
|
| 141 |
+
β βββ technical_blog.md # Technical blog post
|
| 142 |
+
βββ config.yaml # Example configuration
|
| 143 |
+
βββ setup.py # Package setup
|
| 144 |
+
βββ requirements.txt # Dependencies
|
| 145 |
```
|
| 146 |
|
| 147 |
+
---
|
| 148 |
|
| 149 |
+
## Benchmarking
|
|
|
|
|
|
|
|
|
|
|
|
|
| 150 |
|
| 151 |
+
```bash
|
| 152 |
+
# Generate 2,000 synthetic traces and run all baselines + ablations
|
| 153 |
+
python standalone_eval_v2.py --tasks 2000 --output ./eval_results_v2
|
| 154 |
|
| 155 |
+
# Launch dashboard
|
| 156 |
+
python dashboard.py --results ./eval_results_v2/baseline_results.json
|
| 157 |
+
```
|
| 158 |
|
| 159 |
+
---
|
|
|
|
| 160 |
|
| 161 |
+
## Key Results
|
| 162 |
|
| 163 |
+
### Baseline Comparison
|
|
|
|
| 164 |
|
| 165 |
+
| Baseline | Success | Cost/Success | False-DONE | Cheap Miss |
|
| 166 |
+
|----------|---------|--------------|------------|------------|
|
| 167 |
+
| always_frontier | 94.3% | $0.2907 | 1.9% | 9.3% |
|
| 168 |
+
| always_cheap | 16.2% | $0.2531 | 1.9% | 1.7% |
|
| 169 |
+
| static | 73.6% | $0.2462 | 1.9% | 5.1% |
|
| 170 |
+
| cascade | 73.9% | $0.2984 | 1.9% | 11.0% |
|
| 171 |
+
| **full_optimizer** | **94.3%** | **$0.2089** | **1.9%** | **1.7%** |
|
| 172 |
|
| 173 |
+
### Ablation Study
|
|
|
|
| 174 |
|
| 175 |
+
| Module Removed | Success Rate Change | Impact |
|
| 176 |
+
|---------------|---------------------|--------|
|
| 177 |
+
| Router | β20.7pp | Most critical for quality |
|
| 178 |
+
| Tool Gate | β24.5pp | Second most critical |
|
| 179 |
+
| Verifier | β23.2pp | Critical for safety |
|
| 180 |
+
| Early Termination | β20.7pp | Key for cost control |
|
| 181 |
+
| Context Budget | β20.7pp | Quality preserving |
|
| 182 |
+
|
| 183 |
+
**No module is individually sufficient β they reinforce each other.**
|
| 184 |
+
|
| 185 |
+
---
|
| 186 |
+
|
| 187 |
+
## Cost-Quality Frontier
|
| 188 |
+
|
| 189 |
+
Pareto-optimal configurations:
|
| 190 |
+
|
| 191 |
+
1. **full_optimizer**: 94.3% success at $0.2089/success β **Best overall**
|
| 192 |
+
2. **always_frontier**: 94.3% success at $0.2907/success β Maximum quality, 28% more expensive
|
| 193 |
+
3. **static**: 73.6% success at $0.2462/success β Budget option
|
| 194 |
+
|
| 195 |
+
`always_cheap` and `cascade` are **not Pareto-optimal** β dominated by `full_optimizer`.
|
| 196 |
+
|
| 197 |
+
---
|
| 198 |
+
|
| 199 |
+
## Safety & Ethics
|
| 200 |
+
|
| 201 |
+
- Legal/regulated tasks never downgraded below tier 4 without explicit override
|
| 202 |
+
- Irreversible actions always escalate to frontier + verifier
|
| 203 |
+
- All routing decisions include reasoning strings for audit
|
| 204 |
+
- Cost-adjusted score penalizes cheap-model failures more than expensive successes
|
| 205 |
+
- Doom detector prevents runaway costs on failing runs
|
| 206 |
+
- Every module individually enable/disable via config
|
| 207 |
+
|
| 208 |
+
---
|
| 209 |
+
|
| 210 |
+
## Citation
|
| 211 |
+
|
| 212 |
+
```bibtex
|
| 213 |
+
@software{agent_cost_optimizer_2025,
|
| 214 |
+
title={Agent Cost Optimizer: A Universal Control Layer for Cost-Effective Autonomous Agents},
|
| 215 |
+
author={ML Intern},
|
| 216 |
+
year={2025},
|
| 217 |
+
url={https://huggingface.co/narcolepticchicken/agent-cost-optimizer}
|
| 218 |
+
}
|
| 219 |
```
|
| 220 |
|
| 221 |
+
---
|
| 222 |
+
|
| 223 |
+
## Next Steps
|
| 224 |
+
|
| 225 |
+
1. **Train learned router** on 10K+ real traces (RouteLLM-style)
|
| 226 |
+
2. **Interactive benchmark** against SWE-bench / BFCL with real model calls
|
| 227 |
+
3. **Online learning** from live trace feedback
|
| 228 |
+
4. **Verifier cascading** (cheap verifier β expensive verifier only on disagreement)
|
| 229 |
+
5. **KV cache sharing** across concurrent agents via vLLM/SGLang
|
| 230 |
+
6. **Cross-provider routing** (DeepSeek vs OpenAI at same tier)
|
| 231 |
+
|
| 232 |
+
---
|
| 233 |
+
|
| 234 |
+
*Built autonomously by ML Intern on 2025-07-05.*
|