narcolepticchicken/agent-cost-optimizer

narcolepticchicken committed about 20 hours ago

Commit e4cea93 (verified) · 1 parent: bd77292

Upload README.md

Files changed (1)

README.md +44 -65

@@ -8,104 +8,83 @@ tags:
 - cascade-routing
 - execution-feedback
 - swebench
-- ml-intern
 ---
-# ACO v10: Agent Cost Optimizer
-A universal control layer that reduces the cost of autonomous agent runs while preserving task quality. **Trained on real execution data from 8 models across 500 SWE-bench tasks.**
-## What's New in v10
-- **Real-data training**: XGBoost models trained on SWE-Router traces (500 tasks × 8 models = 4,000 real outcomes)
-- **v10 cascade**: 49.5% cost reduction at 67.8% success
-- **v10 + feedback**: 23.3% cost reduction at 73.8% success (only 4.4pp below frontier)
-- **Per-step routing**: Routes each agent step independently (search→cheap, edit→stronger)
-- **Execution feedback**: Uses cheap model output confidence to decide escalation
-## Real-World Results
-### SWE-bench (500 coding tasks, 8 models)
-| Policy | Success | Cost | CostRed |
-|--------|---------|------|---------|
 | Oracle | 87.0% | $0.06 | 80.3% |
-| **v10 feedback** | **73.8%** | **$0.24** | **23.3%** |
-| v10 cascade | 67.8% | $0.16 | 49.5% |
 | Always frontier | 78.2% | $0.32 | baseline |
 | v8 (synthetic) | 65.8% | $0.35 | -11.6% |
-### BFCL v3 (82K traces, 108 models)
-- 84.1% of tasks solvable by cheaper models
-- 82.5% need only tier 1
-### Synthetic Benchmark (3K traces)
-- v9 feedback: 90.0% success at 2.1% cost reduction (matches frontier)
-- v8 router: 83.7% success at 8.5% cost reduction
-## 11 Modules
-1. Cost Telemetry Collector
-2. Task Cost Classifier
-3. Model Cascade Router (v10: real-data trained)
-4. Execution-Feedback Router (v9: output confidence cascade)
-5. Context Budgeter
-6. Cache-Aware Prompt Layout
-7. Tool-Use Cost Gate
-8. Verifier Budgeter
-9. Retry/Recovery Optimizer
-10. Meta-Tool Miner
-11. Doom Detector
 ## Quick Start
 ```python
 from aco.router_v10 import V10Router
-v10 = V10Router(model_path="router_models/router_bundle_v10_fixed.pkl", success_threshold=0.70)
-decision = v10.route_cascade("Fix the auth bug in production")
-print(f"Tier: {decision.tier}, Model: {decision.model}, Cost: ${decision.cost_estimate:.2f}")
-```
-## Per-Step Routing
-```python
-from aco.per_step_router import PerStepRouter
 ps = PerStepRouter(max_budget=2.0)
 d = ps.route_step("Search for the bug", step_num=1, task_risk="medium")
-print(f"Step type: {d.step_type.value}, Tier: {d.adjusted_tier}, Cost: ${d.cost_estimate:.2f}")
 ```
-## Key Insight
-**Training on real execution data matters enormously.** The v8 router trained on synthetic data achieved -11.6% cost reduction on SWE-bench (it actually cost MORE than always-frontier). The v10 router trained on real SWE-Router data achieves 23.3% cost reduction at comparable quality. The gap: 34.9 percentage points.
 ## Links
 - **Model**: [narcolepticchicken/agent-cost-optimizer](https://huggingface.co/narcolepticchicken/agent-cost-optimizer)
 - **Dataset**: [narcolepticchicken/agent-cost-traces](https://huggingface.co/datasets/narcolepticchicken/agent-cost-traces)
 - **Dashboard**: [narcolepticchicken/aco-dashboard](https://huggingface.co/spaces/narcolepticchicken/aco-dashboard)
-- **SWE-Router data**: [SWE-Router/swebench-verified-*](https://huggingface.co/SWE-Router)
 ## License
 MIT
-<!-- ml-intern-provenance -->
-## Generated by ML Intern
-This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.
-- Try ML Intern: https://smolagents-ml-intern.hf.space
-- Source code: https://github.com/huggingface/ml-intern
-## Usage
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-model_id = 'narcolepticchicken/agent-cost-optimizer'
-tokenizer = AutoTokenizer.from_pretrained(model_id)
-model = AutoModelForCausalLM.from_pretrained(model_id)
-```
-For non-causal architectures, replace `AutoModelForCausalLM` with the appropriate `AutoModel` class.

 - cascade-routing
 - execution-feedback
 - swebench
 ---
+# ACO v11: Agent Cost Optimizer
+A universal control layer that reduces autonomous agent cost while preserving task quality. Trained on real execution data from SPROUT (31K rows, 13 models) + SWE-Router (500 tasks, 8 models).
+## What It Does
+ACO sits in front of any agent harness and makes cost-aware decisions:
+- Which model to use (tiny → frontier → specialist)
+- Whether to escalate based on output confidence
+- How much context to include
+- Whether to call tools
+- Whether to verify outputs
+- When to stop failing runs
+## v11 Results (Real SWE-bench, 500 tasks × 8 models)
+| Policy | Success | Cost/Task | CostRed |
+|--------|---------|-----------|---------|
 | Oracle | 87.0% | $0.06 | 80.3% |
+| v11 + feedback | 74.8% | $0.20 | 36.9% |
+| v11 cascade | 67.4% | $0.12 | 62.5% |
 | Always frontier | 78.2% | $0.32 | baseline |
 | v8 (synthetic) | 65.8% | $0.35 | -11.6% |
+## v9 Results (Synthetic, 3K traces)
+| Policy | Success | CostRed |
+|--------|---------|---------|
+| v9 + feedback | 90.0% | 2.1% |
+| v8 router | 83.7% | 8.5% |
+| Always frontier | 90.0% | baseline |
+## Key Finding
+Training data matters more than architecture. v8, trained on synthetic data, *increased* cost by 11.6%. v10, trained on 4,000 real outcomes (500 SWE-bench tasks × 8 models), *saved* 23.3%. v11, which adds 31K SPROUT rows, saves 36.9%. All three use the same XGBoost architecture.
 ## Quick Start
 ```python
 from aco.router_v10 import V10Router
+from aco.per_step_router import PerStepRouter
+# Task-level routing
+v10 = V10Router(model_path="router_models/router_bundle_v11.pkl", success_threshold=0.70)
+d = v10.route_cascade("Fix the auth bug in production")
+print(f"Tier: {d.tier}, Model: {d.model}, Cost: ${d.cost_estimate:.2f}")
+# Per-step routing
 ps = PerStepRouter(max_budget=2.0)
 d = ps.route_step("Search for the bug", step_num=1, task_risk="medium")
+print(f"Step: {d.step_type.value}, Tier: {d.adjusted_tier}")
 ```
+## 11 Modules
+1. Cost Telemetry Collector
+2. Task Cost Classifier
+3. Model Cascade Router (v11 XGBoost)
+4. Execution-Feedback Router (entropy cascade)
+5. Context Budgeter
+6. Cache-Aware Prompt Layout
+7. Tool-Use Cost Gate
+8. Verifier Budgeter
+9. Retry/Recovery Optimizer
+10. Meta-Tool Miner
+11. Doom Detector
 ## Links
 - **Model**: [narcolepticchicken/agent-cost-optimizer](https://huggingface.co/narcolepticchicken/agent-cost-optimizer)
 - **Dataset**: [narcolepticchicken/agent-cost-traces](https://huggingface.co/datasets/narcolepticchicken/agent-cost-traces)
 - **Dashboard**: [narcolepticchicken/aco-dashboard](https://huggingface.co/spaces/narcolepticchicken/aco-dashboard)
+- **Blog Post**: [docs/technical_blog.md](https://huggingface.co/narcolepticchicken/agent-cost-optimizer/blob/main/docs/technical_blog.md)
+- **Final Report**: [docs/final_report.md](https://huggingface.co/narcolepticchicken/agent-cost-optimizer/blob/main/docs/final_report.md)
 ## License
 MIT
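
The CostRed column in the v11 results table appears to be the saving relative to the "Always frontier" row. The sketch below recomputes it from the rounded Cost/Task figures as a sanity check. The helper `cost_reduction_pct` and the dictionary layout are illustrative assumptions and are not part of the `aco` package. Because the published costs are rounded to the cent, the recomputed percentages should only be expected to agree with the table to within a couple of points.

```python
# Sketch only: recompute cost reduction from the rounded Cost/Task
# figures in the v11 results table. Names here are hypothetical.

FRONTIER_COST = 0.32  # "Always frontier" row, dollars per task

# Cost/Task values copied from the v11 results table.
COST_PER_TASK = {
    "Oracle": 0.06,
    "v11 + feedback": 0.20,
    "v11 cascade": 0.12,
    "v8 (synthetic)": 0.35,
}


def cost_reduction_pct(policy_cost, baseline_cost=FRONTIER_COST):
    """Percentage saved relative to the baseline; negative means costlier."""
    return 100.0 * (baseline_cost - policy_cost) / baseline_cost


recomputed = {
    name: round(cost_reduction_pct(cost), 1)
    for name, cost in COST_PER_TASK.items()
}

for name, pct in recomputed.items():
    print(f"{name}: {pct:+.1f}%")
```

A negative value, as for the v8 row, means the policy spent more per task than always calling the frontier model.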