narcolepticchicken/agent-cost-optimizer

narcolepticchicken committed on 1 day ago

Commit

4b7e4c6

verified ·

1 Parent(s): 391292b

Upload README.md with huggingface_hub

Files changed (1)

README.md +56 -260


@@ -1,306 +1,102 @@
----
-tags:
-- ml-intern
----
 # Agent Cost Optimizer (ACO)
 A universal control layer that reduces total cost of autonomous agent runs while **preserving task quality**.
 **Repository:** https://huggingface.co/narcolepticchicken/agent-cost-optimizer
-**Benchmark Results:** 28% cost reduction at iso-quality (94.3% success rate)
 **License:** MIT
-**Status:** Production-ready control layer, not a generative model
 ---
 ## What It Does
-Agent Cost Optimizer (ACO) is a **compound decision system** that bolts onto any agent harness (LangChain, AutoGPT, OpenAI Assistants, custom) and makes cost-aware decisions at every step of an agent run:
-- **Which model to use** (tiny local → cheap cloud → medium → frontier → specialist)
-- **How much context to send** (keep, summarize, omit, retrieve on-demand)
-- **How to structure prompts** for cache reuse
 - **Which tools to call** (skip, batch, use cached result)
-- **When to verify** (only high-risk outputs, not everything)
 - **When to stop** (detect doomed runs before costs spiral)
-- **When to reuse** past successful workflows
-### Core Result
-On a benchmark of 2,000 synthetic agent traces across 19 realistic scenarios:
-| Baseline | Success Rate | Cost/Success | Total Cost | Savings |
-|----------|-------------|--------------|-----------|---------|
-| always_frontier (GPT-4o) | 94.3% | $0.2907 | $548.31 | — |
-| always_cheap (GPT-4o-mini) | 16.2% | $0.2531 | $82.25 | Unsafe |
-| cascade only | 73.9% | $0.2984 | $440.98 | Low quality |
-| **full_optimizer (ACO)** | **94.3%** | **$0.2089** | **$393.98** | **28.1%** |
-**ACO matches frontier model quality while cutting cost by 28%.**
----
-## Architecture
-ACO is **10 interlocking modules** sharing a single normalized trace schema:
-| Module | What It Decides |
-|--------|----------------|
-| 1. Cost Telemetry Collector | Records every model call, tool call, cost, latency, failure |
-| 2. Task Cost Classifier | Predicts expected cost, risk, model strength needed |
-| 3. Model Cascade Router | Chooses cheapest acceptable model tier |
-| 4. Context Budgeter | Keeps what matters, omits/summarizes the rest |
-| 5. Cache-Aware Prompt Layout | Structures prompts for prefix-cache reuse |
-| 6. Tool-Use Cost Gate | Skips/batches/caches tool calls when not worth the cost |
-| 7. Verifier Budgeter | Verifies only high-risk outputs |
-| 8. Retry/Recovery Optimizer | Learns from failures instead of blind retry loops |
-| 9. Meta-Tool Miner | Compresses repeated workflows into reusable macros |
-| 10. Doom Detector | Stops failing runs before costs spiral |
----
-## Installation
-```bash
-pip install -e .
-```
-## Quick Start
-```python
-from aco import AgentCostOptimizer
-from aco.config import ACOConfig, ModelConfig, RoutingPolicy
-config = ACOConfig(
-    models={
-        "gpt-4o-mini": ModelConfig(
-            model_id="gpt-4o-mini", provider="openai",
-            cost_per_1k_input=0.00015, cost_per_1k_output=0.0006,
-            strength_tier=2, max_context=128000,
-        ),
-        "gpt-4o": ModelConfig(
-            model_id="gpt-4o", provider="openai",
-            cost_per_1k_input=0.0025, cost_per_1k_output=0.01,
-            strength_tier=4, max_context=128000,
-        ),
-    },
-    routing_policy=RoutingPolicy("cascade"),
-)
-optimizer = AgentCostOptimizer(config)
-# Before each agent step
-result = optimizer.optimize(
-    user_request="Write a Python function to reverse a linked list",
-    run_state={
-        "trace_id": "run-001",
-        "planned_tools": [("file_read", {"path": "linked_list.py"})],
-        "routing_mode": "cascade",
-    },
-)
-# Use the decisions
-print(f"Use model: {result.routing_decision.model_id}")
-print(f"Max tokens: {result.routing_decision.max_tokens}")
-print(f"Estimated cost: ${result.estimated_cost:.4f}")
-```
-See `docs/deployment_guide.md` for full integration patterns and `examples/end_to_end_demo.py` for a complete walkthrough.
----
-## Repository Structure
-```
-narcolepticchicken/agent-cost-optimizer
-├── aco/                          # Core package
-│   ├── __init__.py               # Main optimizer class
-│   ├── config.py                 # Configuration dataclasses
-│   ├── trace_schema.py           # Normalized trace schema
-│   ├── telemetry.py              # Cost telemetry collector
-│   ├── classifier.py             # Task cost classifier
-│   ├── router.py                 # Model cascade router
-│   ├── learned_router.py         # Trainable router classifier
-│   ├── context_budgeter.py       # Context selection
-│   ├── cache_layout.py           # Cache-aware prompt layout
-│   ├── tool_gate.py              # Tool-use cost gate
-│   ├── verifier_budgeter.py      # Selective verifier
-│   ├── retry_optimizer.py        # Retry/recovery optimizer
-│   ├── meta_tool_miner.py        # Workflow compression
-│   ├── doom_detector.py          # Early termination detector
-│   ├── trackio_integration.py    # Trackio monitoring
-│   ├── benchmarks/               # Benchmark suite
-│   └── datasets/                 # Synthetic trace generator
-├── examples/                     # Integration examples
-│   ├── end_to_end_demo.py        # Full demo with simulated inference
-│   └── integration_example.py    # Agent harness integration
-├── standalone_eval_v2.py         # Benchmark runner (N=2000)
-├── dashboard.py                  # Gradio dashboard
-├── app.py                        # HF Space entrypoint
-├── docs/                         # Documentation
-│   ├── literature_review.md      # 50+ paper survey
-│   ├── final_report.md           # Complete technical report
-│   ├── model_card.md             # Model card
-│   ├── deployment_guide.md       # Production deployment
-│   └── technical_blog.md         # Technical blog post
-├── config.yaml                   # Example configuration
-├── setup.py                      # Package setup
-└── requirements.txt              # Dependencies
-```
----
-## Benchmarking
-```bash
-# Generate 2,000 synthetic traces and run all baselines + ablations
-python standalone_eval_v2.py --tasks 2000 --output ./eval_results_v2
-# Launch dashboard
-python dashboard.py --results ./eval_results_v2/baseline_results.json
-```
----
-## Key Results
-### Baseline Comparison
-| Baseline | Success | Cost/Success | False-DONE | Cheap Miss |
-|----------|---------|--------------|------------|------------|
-| always_frontier | 94.3% | $0.2907 | 1.9% | 9.3% |
-| always_cheap | 16.2% | $0.2531 | 1.9% | 1.7% |
-| static | 73.6% | $0.2462 | 1.9% | 5.1% |
-| cascade | 73.9% | $0.2984 | 1.9% | 11.0% |
-| **full_optimizer** | **94.3%** | **$0.2089** | **1.9%** | **1.7%** |
-### Ablation Study
-| Module Removed | Success Rate Change | Impact |
-|---------------|---------------------|--------|
-| Router | −20.7pp | Most critical for quality |
-| Tool Gate | −24.5pp | Second most critical |
-| Verifier | −23.2pp | Critical for safety |
-| Early Termination | −20.7pp | Key for cost control |
-| Context Budget | −20.7pp | Quality preserving |
-**No module is individually sufficient — they reinforce each other.**
 ---
-## Cost-Quality Frontier
-Pareto-optimal configurations:
-1. **full_optimizer**: 94.3% success at $0.2089/success ← **Best overall**
-2. **always_frontier**: 94.3% success at $0.2907/success ← Maximum quality, 28% more expensive
-3. **static**: 73.6% success at $0.2462/success ← Budget option
-`always_cheap` and `cascade` are **not Pareto-optimal** — dominated by `full_optimizer`.
----
-## Safety & Ethics
-- Legal/regulated tasks never downgraded below tier 4 without explicit override
-- Irreversible actions always escalate to frontier + verifier
-- All routing decisions include reasoning strings for audit
-- Cost-adjusted score penalizes cheap-model failures more than expensive successes
-- Doom detector prevents runaway costs on failing runs
-- Every module individually enable/disable via config
 ---
-## Citation
-```bibtex
-@software{agent_cost_optimizer_2025,
-  title={Agent Cost Optimizer: A Universal Control Layer for Cost-Effective Autonomous Agents},
-  author={ML Intern},
-  year={2025},
-  url={https://huggingface.co/narcolepticchicken/agent-cost-optimizer}
-}
-```
----
-## Next Steps
-1. **Train learned router** on 10K+ real traces (RouteLLM-style)
-2. **Interactive benchmark** against SWE-bench / BFCL with real model calls
-3. **Online learning** from live trace feedback
-4. **Verifier cascading** (cheap verifier → expensive verifier only on disagreement)
-5. **KV cache sharing** across concurrent agents via vLLM/SGLang
-6. **Cross-provider routing** (DeepSeek vs OpenAI at same tier)
 ---
-*Built autonomously by ML Intern on 2025-07-05.*
-<!-- ml-intern-provenance -->
-## Generated by ML Intern
-This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.
-- Try ML Intern: https://smolagents-ml-intern.hf.space
-- Source code: https://github.com/huggingface/ml-intern
-## Usage
 ```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-model_id = 'narcolepticchicken/agent-cost-optimizer'
-tokenizer = AutoTokenizer.from_pretrained(model_id)
-model = AutoModelForCausalLM.from_pretrained(model_id)
 ```
-For non-causal architectures, replace `AutoModelForCausalLM` with the appropriate `AutoModel` class.
-## Trained Router (NEW)
-The heuristic router has been replaced with a **trained XGBoost router** using the CARROT architecture.
-### Architecture: Difficulty-First + ML Confirmation + Safety Floors
 1. Map task_type to difficulty (1-5)
 2. Compute base_tier = min(difficulty + 1, 5)
-3. Apply safety floor per task_type (e.g., legal → tier 4)
-4. Use per-tier P(success|query) XGBoost classifiers to confirm or escalate
-5. If P(success@base_tier) < threshold, escalate one tier
-### Usage
-```python
-from aco.learned_router import TrainedRouter
-# Load from Hub
-router = TrainedRouter.from_pretrained("narcolepticchicken/agent-cost-optimizer")
-# Predict
-tier, confidence = router.predict(
-    "Write a Python function to reverse a linked list",
-    "coding",
-    difficulty=3,
-)
-print(f"Recommended: tier {tier} (confidence: {confidence:.2f})")
-```
-### Benchmark Results (2K eval traces)
-| Router | Success | AvgCost | Unsafe |
-|--------|---------|---------|--------|
-| trained (t=0.55) | 85.5% | 1.107 | 4.1% |
-| trained (t=0.65) | 91.9% | 1.365 | 1.5% |
-| always_frontier | 88.8% | 1.000 | 2.5% |
-| heuristic_diff+1 | 83.4% | 0.940 | 4.9% |
-| oracle | 99.8% | 0.486 | 0.0% |
-The trained router at t=0.65 **outperforms always-frontier on success rate** (91.9% vs 88.8%) with lower unsafe miss rate (1.5% vs 2.5%).
-### Training Data
-50,000 synthetic traces with ground-truth per-tier success labels. Each trace includes all 5 tier outcomes, enabling the per-tier classifiers to learn from balanced success/failure examples.
-See `docs/trained_router_report.md` for full details.

 # Agent Cost Optimizer (ACO)
 A universal control layer that reduces total cost of autonomous agent runs while **preserving task quality**.
 **Repository:** https://huggingface.co/narcolepticchicken/agent-cost-optimizer
+**Trained Router:** Hybrid heuristic + XGBoost safety net
 **License:** MIT
 ---
 ## What It Does
+Agent Cost Optimizer (ACO) bolts onto any agent harness and makes cost-aware decisions at every step:
+- **Which model to use** (tiny local to frontier)
+- **How much context to send** (keep, summarize, omit, retrieve)
 - **Which tools to call** (skip, batch, use cached result)
+- **When to verify** (only high-risk outputs)
 - **When to stop** (detect doomed runs before costs spiral)
 ---
+## Trained Router Results (N=2,000 eval traces)
+After 7 iterations of training (v1-v7), the best production router is a **hybrid heuristic + ML safety net**:
+| Router | Success | Cost Reduction | Unsafe Miss |
+|--------|---------|----------------|-------------|
+| v4 (t=0.65, safety-first) | **91.9%** | -36.5% | **1.5%** |
+| v7 (s=0.25, d=0.85, hybrid) | 83.8% | **9.2%** | 4.8% |
+| heuristic (diff+1) | 84.1% | 7.3% | 4.7% |
+| always_frontier | 89.3% | 0% | 2.3% |
+| oracle (perfect routing) | 99.8% | **52.3%** | 0.0% |
+### Key Findings
+1. **v4 at t=0.65 beats frontier on quality** (91.9% vs 89.3% success) with a lower unsafe-miss rate (1.5% vs 2.3%), but at 36.5% higher cost than always_frontier
+2. **v7 hybrid adds 1.9pp of cost reduction** over the heuristic (9.2% vs 7.3%) at a 0.3pp lower success rate (83.8% vs 84.1%)
+3. **Oracle routing reaches 52.3% savings**, so there is substantial headroom for improvement
+4. The ML safety net catches cases the heuristic misses; the cost saver identifies unnecessary escalation
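The Cost Reduction column can be read as savings relative to `always_frontier`, which is listed at 0%. The sketch below shows that arithmetic under one assumption: average cost is normalized so that `always_frontier` equals 1.000, as in the AvgCost column of the previous README's router table. The function name is hypothetical and is not part of the `aco` package.

```python
def cost_reduction_pct(avg_cost: float, frontier_cost: float = 1.0) -> float:
    """Percent saved relative to always_frontier; negative means more expensive."""
    if frontier_cost <= 0:
        raise ValueError("frontier_cost must be positive")
    return (1.0 - avg_cost / frontier_cost) * 100.0


# The previous table listed AvgCost 1.365 for the trained router at t=0.65.
print(round(cost_reduction_pct(1.365), 1))  # -36.5
```

Under this reading, a negative value such as -36.5% means the router spends more than `always_frontier`, not less.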
 ---
+## Architecture: 10 Modules + Trained Router
+ACO consists of 10 interlocking modules; module 3, the model router, is the trained router (heuristic plus XGBoost):
+| Module | Decision |
+|--------|----------|
+| 1. Cost Telemetry | Records every call, cost, failure |
+| 2. Task Classifier | Predicts risk, model tier needed |
+| 3. **Trained Router** | Hybrid heuristic + ML confirmation |
+| 4. Context Budgeter | Keeps what matters, omits rest |
+| 5. Cache Layout | Optimizes for prefix-cache reuse |
+| 6. Tool Gate | Skips unnecessary tool calls |
+| 7. Verifier Budgeter | Verifies only high-risk outputs |
+| 8. Retry Optimizer | Learns from failures |
+| 9. Meta-Tool Miner | Compresses repeated workflows |
+| 10. Doom Detector | Stops failing runs early |
 ---
+## Quick Start
 ```python
+from aco.learned_router import TrainedRouter
+router = TrainedRouter.from_pretrained("narcolepticchicken/agent-cost-optimizer")
+tier, confidence = router.predict(
+    "Write a Python function to reverse a linked list",
+    "coding", difficulty=3)
+print(f"Recommended: tier {tier} (confidence: {confidence:.2f})")
 ```
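`router.predict` returns a tier, not a model name. One possible way to turn that tier into a concrete model is to pick the cheapest configured model whose strength tier is at least the recommended one. This is a sketch, not the package's own selection logic: `ModelOption` and `cheapest_model_for_tier` are hypothetical names, and the tiers and prices are copied from the earlier Quick Start configuration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ModelOption:
    # Hypothetical container; mirrors the fields of the earlier ModelConfig example.
    model_id: str
    strength_tier: int
    cost_per_1k_input: float


MODELS = (
    ModelOption("gpt-4o-mini", strength_tier=2, cost_per_1k_input=0.00015),
    ModelOption("gpt-4o", strength_tier=4, cost_per_1k_input=0.0025),
)


def cheapest_model_for_tier(tier: int, models: Sequence[ModelOption] = MODELS) -> ModelOption:
    """Return the cheapest model at or above `tier`, else the strongest available."""
    eligible = [m for m in models if m.strength_tier >= tier]
    if not eligible:
        # No configured model reaches the requested tier: fall back to the strongest.
        return max(models, key=lambda m: m.strength_tier)
    return min(eligible, key=lambda m: m.cost_per_1k_input)


print(cheapest_model_for_tier(3).model_id)  # gpt-4o
```

Falling back to the strongest model when no tier matches keeps the choice on the safe side; a stricter deployment might prefer to raise an error instead.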
+---
+## What Makes The Trained Router Work
+**Architecture: Difficulty-First + ML Confirmation + Safety Floors**
 1. Map task_type to difficulty (1-5)
 2. Compute base_tier = min(difficulty + 1, 5)
+3. Apply safety floor (legal → tier 4)
+4. Check P(success@base_tier) with XGBoost; if it is low, escalate one tier
+5. Check P(success@base_tier-1), the next tier down; if it is high, downgrade (cost saver)
+**Training setup:** 50K synthetic traces, 5 per-tier XGBoost classifiers, isotonic regression calibration, 23 features.
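The five routing steps above can be sketched as a small function. This is an illustration under stated assumptions, not the code in `aco/learned_router.py`: the lookup tables are hypothetical, the calibrated per-tier classifiers are replaced by a caller-supplied `p_success(tier)` function, and the two thresholds assume that `s=0.25` and `d=0.85` in the v7 row denote the escalation and downgrade cutoffs.

```python
from typing import Callable, Dict

# Hypothetical lookup tables for illustration only.
TASK_DIFFICULTY: Dict[str, int] = {"chat": 1, "legal": 2, "coding": 3}
SAFETY_FLOOR: Dict[str, int] = {"legal": 4}  # legal tasks never go below tier 4
MAX_TIER = 5


def route(
    task_type: str,
    p_success: Callable[[int], float],
    escalate_below: float = 0.25,   # assumed meaning of s=0.25
    downgrade_above: float = 0.85,  # assumed meaning of d=0.85
) -> int:
    """Pick a model tier: difficulty first, then safety floor, then ML checks."""
    difficulty = TASK_DIFFICULTY.get(task_type, 3)   # step 1
    base_tier = min(difficulty + 1, MAX_TIER)        # step 2
    floor = SAFETY_FLOOR.get(task_type, 1)
    tier = max(base_tier, floor)                     # step 3
    if p_success(tier) < escalate_below:             # step 4: escalate one tier
        return min(tier + 1, MAX_TIER)
    lower = tier - 1
    if lower >= floor and p_success(lower) >= downgrade_above:  # step 5
        return lower
    return tier


print(route("coding", lambda tier: 0.9))  # 3
print(route("legal", lambda tier: 0.9))   # 4
```

In this sketch the downgrade in step 5 is never allowed to cross the safety floor, so a legal task stays at tier 4 even when the lower tier looks likely to succeed.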
+---
+## Next Steps
+1. **Execution feedback features**: Use first model output as routing signal
+2. **Confidence from generation**: Model entropy as escalation signal
+3. **Multi-step routing**: Route per-step, not per-task
+4. **Real agent traces**: Train on SWE-bench/BFCL execution data
+See `docs/trained_router_final_report.md` for full analysis.
+---
+*Built autonomously by ML Intern, 2025-07-05.*