# Agent Cost Optimizer - Deployment Guide

## Overview

The Agent Cost Optimizer (ACO) is a control layer that sits **in front of, around, or inside** any agent harness. It does not replace your agent; it optimizes how your agent runs.

## Installation

```bash
# Clone the repository
git clone https://huggingface.co/narcolepticchicken/agent-cost-optimizer
cd agent-cost-optimizer

# Install dependencies
pip install -e .

# Optional: Gradio dashboard
pip install gradio

# Optional: Trackio monitoring
pip install trackio
```

## Quick Start

```python
from aco import AgentCostOptimizer
from aco.config import ACOConfig, ModelConfig, RoutingPolicy
# ModelCall, ToolCall, and Outcome are used in steps 5-6 below; import them
# from wherever your ACO build exposes them (shown here as top-level exports).
from aco import ModelCall, ToolCall, Outcome

# 1. Define your available models with real pricing
config = ACOConfig(
    models={
        "gpt-4o-mini": ModelConfig(
            model_id="gpt-4o-mini", provider="openai",
            cost_per_1k_input=0.00015, cost_per_1k_output=0.0006,
            strength_tier=2, max_context=128000,
        ),
        "gpt-4o": ModelConfig(
            model_id="gpt-4o", provider="openai",
            cost_per_1k_input=0.0025, cost_per_1k_output=0.01,
            strength_tier=4, max_context=128000,
        ),
        "deepseek-chat": ModelConfig(
            model_id="deepseek-chat", provider="deepseek",
            cost_per_1k_input=0.00014, cost_per_1k_output=0.00028,
            strength_tier=3, max_context=64000,
            cache_discount_rate=0.5,
        ),
    },
    routing_policy=RoutingPolicy("cascade"),
)

# 2. Initialize the optimizer
optimizer = AgentCostOptimizer(config)

# 3. Before each agent step, call optimize()
request = "Write a Python function to reverse a linked list"
run_state = {
    "trace_id": "run-001",
    "planned_tools": [("file_read", {"path": "linked_list.py"})],
    "previous_tool_calls": [],
    "current_cost": 0.0,
    "step_number": 1,
    "total_steps": 3,
    "is_irreversible": False,
    "routing_mode": "cascade",
}

result = optimizer.optimize(request, run_state)

# 4. Use the decisions
print(f"Use model: {result.routing_decision.model_id}")
print(f"Max tokens: {result.routing_decision.max_tokens}")
print(f"Temperature: {result.routing_decision.temperature}")
print(f"Estimated cost: ${result.estimated_cost:.4f}")

# 5. After execution, record actual costs
optimizer.record_step(
    trace_id=result.trace_id,
    model_call=ModelCall(
        model_id=result.routing_decision.model_id,
        provider=result.routing_decision.provider,
        input_tokens=2000,
        output_tokens=800,
        latency_ms=1200,
    ),
    tool_calls=[ToolCall(tool_name="file_read", tool_input={"path": "linked_list.py"},
                         tool_cost=0.001, tool_latency_ms=300)],
    context_size_tokens=2500,
    step_outcome=Outcome.SUCCESS,
)

# 6. Finalize the trace
optimizer.finalize_trace(result.trace_id, outcome=Outcome.SUCCESS)
```

## Configuration

### Model Tiers

| | Tier | Typical Models | Cost | Strength | When to Use | |
| |------|---------------|------|----------|-------------| |
| | 1 | Local Qwen-0.5B, Phi-1 | Near-zero | 35% | Factual QA, simple extraction | |
| | 2 | GPT-4o-mini, Claude-3.5-Haiku, DeepSeek | $0.15/M tok | 55% | Drafting, classification, parsing | |
| | 3 | Claude-3.5-Sonnet, DeepSeek-V2 | $1.5-3/M tok | 80% | Coding, reasoning, research | |
| | 4 | GPT-4o, Claude-3-Opus | $2.5-5/M tok | 93% | Complex analysis, legal, creative | |
| | 5 | o1, o3-mini, specialist | $3-15/M tok | 97% | Math, safety-critical, adversarial | |
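
To illustrate how the table can drive selection, here is a small sketch (a hypothetical helper, not part of the ACO API) that picks the cheapest tier whose typical strength meets a task's required success rate:

```python
# Hypothetical helper (not the ACO API): choose the lowest tier whose typical
# strength, taken from the table above, meets a required success rate.
TIER_STRENGTH = {1: 0.35, 2: 0.55, 3: 0.80, 4: 0.93, 5: 0.97}

def minimum_tier(required_strength: float) -> int:
    """Return the cheapest tier expected to meet `required_strength`."""
    for tier in sorted(TIER_STRENGTH):
        if TIER_STRENGTH[tier] >= required_strength:
            return tier
    return max(TIER_STRENGTH)  # nothing qualifies, so fall back to the strongest tier
```

For example, a drafting task that tolerates roughly 55% first-pass success routes to tier 2, while an analysis requiring 93% routes to tier 4.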

### Routing Modes

- **`cheapest`**: Always use lowest-cost model (dangerous, only for internal tools)
- **`strongest`**: Always use frontier (expensive, maximum quality)
- **`cascade`**: Try cheap first, escalate on low confidence
- **`risk_based`**: Route by predicted task risk
- **`adaptive`**: Learn from trace history
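
The cascade idea can be sketched in a few lines; this is an illustrative standalone version, not ACO's actual router (which also weighs cost and risk):

```python
# Minimal cascade sketch: try models cheapest-first and escalate while the
# answer's confidence stays below a threshold. Illustrative only.
def cascade_route(models, answer_fn, confidence_threshold=0.7):
    """models: model ids ordered cheapest-first.
    answer_fn(model_id) -> (answer, confidence in [0, 1])."""
    answer = None
    for model_id in models:
        answer, confidence = answer_fn(model_id)
        if confidence >= confidence_threshold:
            return model_id, answer  # confident enough: stop escalating
    return models[-1], answer        # ladder exhausted: keep the strongest answer
```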

## Integration Patterns

### Pattern A: Front Proxy (Pre-Step)
```
User Request → ACO.optimize() → [Decisions] → Agent Harness → LLM API
```

### Pattern B: Around Wrapper (Pre + Post)
```
User Request → ACO.optimize() → Agent Step → ACO.record_step() → Next Step
```

### Pattern C: Inside Agent (Per-Step)
```
Agent Loop:
    if step == 0: ACO.optimize()
    else:         ACO.reassess()  # mid-run adjustment
```
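
Pattern B reduces to a plain loop. In this sketch, `optimize`, `execute_step`, and `record_step` are hypothetical callables standing in for `AgentCostOptimizer.optimize`, your agent harness, and `record_step`:

```python
# Sketch of Pattern B: optimize before each step, execute it, record the actual
# cost after, and feed the updated run state into the next routing decision.
def run_with_aco(request, run_state, optimize, execute_step, record_step, max_steps=10):
    result = None
    for step in range(max_steps):
        decision = optimize(request, run_state)   # pre-step routing decision
        result = execute_step(decision)           # your agent harness runs the step
        record_step(decision, result)             # post-step cost accounting
        run_state["current_cost"] += result["cost"]
        run_state["step_number"] = step + 2       # next step's 1-based index
        if result["done"]:
            break
    return result, run_state
```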

## Benchmarking Your Own Traces

```bash
# Generate benchmark
python -m aco.benchmark --tasks 1000 --output ./results

# Compare baselines
python -m aco.benchmark --compare always_frontier always_cheap cascade full_optimizer

# Run ablation study
python -m aco.benchmark --ablate all
```

## Dashboard

```bash
# Launch Gradio dashboard
python dashboard.py --results ./eval_results_v2/baseline_results.json
```

## Trackio Integration

```python
from aco.trackio_integration import ACOTrackioLogger

logger = ACOTrackioLogger(project="aco-production", space_id="your-space")

# Inside your agent loop
logger.log_decision(run_id, decision, cost, success)
logger.alert(run_id, "Cost spike", f"Step {step} cost ${cost:.3f}", "WARN")
```

## Multi-Provider Setup

```python
config = ACOConfig(
    models={
        "gpt-4o": ModelConfig(..., provider="openai", api_key_env="OPENAI_API_KEY"),
        "claude-3.5-sonnet": ModelConfig(..., provider="anthropic", api_key_env="ANTHROPIC_API_KEY"),
        "deepseek-chat": ModelConfig(..., provider="deepseek", api_key_env="DEEPSEEK_API_KEY"),
        "local-qwen": ModelConfig(..., provider="local", base_url="http://localhost:8000/v1"),
    }
)
```

## Safety Rules

1. **Legal/regulated tasks never go below tier 4** without explicit override
2. **Tool calls marked `requires_verification` always get a verifier**
3. **Irreversible actions trigger automatic frontier escalation**
4. **All routing decisions include reasoning strings for audit**
5. **Doom detector stops runs where cost exceeds 3x estimate**
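
Rules 1 and 3 amount to a tier floor. Here is a hedged sketch of that clamp (illustrative; ACO's internal enforcement may differ, and "frontier" is assumed to mean the top tier from the table above):

```python
# Sketch of safety rules 1 and 3 as a tier clamp. Not ACO internals.
REGULATED_MIN_TIER = 4  # rule 1: legal/regulated work never below tier 4
FRONTIER_TIER = 5       # rule 3: irreversible actions escalate to the frontier

def enforce_tier_floor(chosen_tier, is_regulated=False, is_irreversible=False,
                       regulated_override=False):
    """Raise the router's chosen tier to whatever floor the safety rules demand."""
    if is_irreversible:
        return FRONTIER_TIER
    if is_regulated and not regulated_override:
        return max(chosen_tier, REGULATED_MIN_TIER)
    return chosen_tier
```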

## Performance Tuning

| Parameter | Default | Tune When... |
|-----------|---------|--------------|
| `doom_max_cost_ratio` | 3.0 | Runs often terminate too early |
| `doom_no_progress_steps` | 5 | Long-horizon tasks get killed |
| `verifier_confidence_threshold` | 0.7 | Too many/few verifiers |
| `max_context_fraction` | 0.8 | Context truncation issues |
| `cache_prefix_max_tokens` | 8000 | Cache hit rate low |
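
These knobs are typically set at config time. A fragment using the table's defaults follows; the parameter names come from the table, but their exact placement as `ACOConfig` keyword arguments is an assumption, so verify against your build's config schema:

```python
# Assumed placement of the tuning knobs on ACOConfig; check your build's
# config schema before relying on these keyword names.
config = ACOConfig(
    models=models,                       # your ModelConfig map, as in Quick Start
    doom_max_cost_ratio=3.0,             # stop runs that exceed 3x the cost estimate
    doom_no_progress_steps=5,            # raise for long-horizon tasks
    verifier_confidence_threshold=0.7,   # lower it to insert more verifiers
    max_context_fraction=0.8,            # raise if context gets truncated too often
    cache_prefix_max_tokens=8000,        # raise if the cache hit rate is low
)
```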

## Monitoring

Track these metrics in production:

- Cost per successful task (primary)
- Cost per artifact (secondary)
- Task success rate by tier
- Cache hit rate
- Tool call efficiency (used vs called)
- Verifier pass rate
- Retry rate
- False-DONE rate
- Escalation rate
- Doom detector precision/recall
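
The primary metric can be computed directly from finalized traces. A sketch, assuming each trace reduces to a cost and a success flag:

```python
# Sketch: cost per successful task. Failed runs still count toward total cost,
# which is why this metric is stricter than plain cost per run.
def cost_per_successful_task(traces):
    """traces: iterable of {"cost": float, "success": bool} summaries."""
    traces = list(traces)
    successes = sum(1 for t in traces if t["success"])
    if successes == 0:
        return float("inf")  # no successes: the metric is unbounded
    return sum(t["cost"] for t in traces) / successes
```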