# Agent Cost Optimizer - Deployment Guide
## Overview
The Agent Cost Optimizer (ACO) is a control layer that sits **in front of, around, or inside** any agent harness. It does not replace your agent; it optimizes how your agent runs.
## Installation
```bash
# Clone the repository
git clone https://huggingface.co/narcolepticchicken/agent-cost-optimizer
cd agent-cost-optimizer
# Install dependencies
pip install -e .
# Optional: Gradio dashboard
pip install gradio
# Optional: Trackio monitoring
pip install trackio
```
## Quick Start
```python
from aco import AgentCostOptimizer
from aco.config import ACOConfig, ModelConfig, RoutingPolicy
# ModelCall, ToolCall, and Outcome are used in steps 5-6 below;
# adjust this import path to match where they live in the package.
from aco import ModelCall, ToolCall, Outcome

# 1. Define your available models with real pricing
config = ACOConfig(
    models={
        "gpt-4o-mini": ModelConfig(
            model_id="gpt-4o-mini", provider="openai",
            cost_per_1k_input=0.00015, cost_per_1k_output=0.0006,
            strength_tier=2, max_context=128000,
        ),
        "gpt-4o": ModelConfig(
            model_id="gpt-4o", provider="openai",
            cost_per_1k_input=0.0025, cost_per_1k_output=0.01,
            strength_tier=4, max_context=128000,
        ),
        "deepseek-chat": ModelConfig(
            model_id="deepseek-chat", provider="deepseek",
            cost_per_1k_input=0.00014, cost_per_1k_output=0.00028,
            strength_tier=3, max_context=64000,
            cache_discount_rate=0.5,
        ),
    },
    routing_policy=RoutingPolicy("cascade"),
)

# 2. Initialize the optimizer
optimizer = AgentCostOptimizer(config)

# 3. Before each agent step, call optimize()
request = "Write a Python function to reverse a linked list"
run_state = {
    "trace_id": "run-001",
    "planned_tools": [("file_read", {"path": "linked_list.py"})],
    "previous_tool_calls": [],
    "current_cost": 0.0,
    "step_number": 1,
    "total_steps": 3,
    "is_irreversible": False,
    "routing_mode": "cascade",
}
result = optimizer.optimize(request, run_state)

# 4. Use the decisions
print(f"Use model: {result.routing_decision.model_id}")
print(f"Max tokens: {result.routing_decision.max_tokens}")
print(f"Temperature: {result.routing_decision.temperature}")
print(f"Estimated cost: ${result.estimated_cost:.4f}")

# 5. After execution, record actual costs
optimizer.record_step(
    trace_id=result.trace_id,
    model_call=ModelCall(
        model_id=result.routing_decision.model_id,
        provider=result.routing_decision.provider,
        input_tokens=2000,
        output_tokens=800,
        latency_ms=1200,
    ),
    tool_calls=[ToolCall(tool_name="file_read", tool_input={"path": "linked_list.py"},
                         tool_cost=0.001, tool_latency_ms=300)],
    context_size_tokens=2500,
    step_outcome=Outcome.SUCCESS,
)

# 6. Finalize the trace
optimizer.finalize_trace(result.trace_id, outcome=Outcome.SUCCESS)
```
## Configuration
### Model Tiers
| Tier | Typical Models | Cost | Strength | When to Use |
|------|---------------|------|----------|-------------|
| 1 | Local Qwen-0.5B, Phi-1 | Near-zero | 35% | Factual QA, simple extraction |
| 2 | GPT-4o-mini, Claude-3.5-Haiku, DeepSeek | $0.15/M tok | 55% | Drafting, classification, parsing |
| 3 | Claude-3.5-Sonnet, DeepSeek-V2 | $1.5-3/M tok | 80% | Coding, reasoning, research |
| 4 | GPT-4o, Claude-3-Opus | $2.5-5/M tok | 93% | Complex analysis, legal, creative |
| 5 | o1, o3-mini, specialist | $3-15/M tok | 97% | Math, safety-critical, adversarial |
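The tier/cost trade-off above amounts to a simple selection rule: take the cheapest model whose strength tier meets the task's minimum requirement. The sketch below is illustrative only; `pick_model` is not part of the ACO API.

```python
# Illustrative only: cheapest model that meets a minimum strength tier.
def pick_model(models, min_tier):
    """models: dict of name -> (strength_tier, cost_per_1k_input)."""
    eligible = {n: m for n, m in models.items() if m[0] >= min_tier}
    if not eligible:
        raise ValueError(f"no model meets tier {min_tier}")
    # Among eligible models, pick the one with the lowest input cost.
    return min(eligible, key=lambda n: eligible[n][1])

models = {
    "gpt-4o-mini": (2, 0.00015),
    "deepseek-chat": (3, 0.00014),
    "gpt-4o": (4, 0.0025),
}
print(pick_model(models, 3))  # deepseek-chat: cheapest at tier >= 3
```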
### Routing Modes
- **`cheapest`**: Always use the lowest-cost model (risky; only for internal tools)
- **`strongest`**: Always use the frontier model (expensive; maximum quality)
- **`cascade`**: Try a cheap model first, escalate on low confidence
- **`risk_based`**: Route by predicted task risk
- **`adaptive`**: Learn from trace history
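The `cascade` mode can be sketched as a loop over tiers ordered cheap to strong. This is a simplified illustration, not the internals of `RoutingPolicy`; the confidence signal and threshold are assumptions.

```python
# Cascade sketch: try cheaper models first, escalate when confidence
# is below a threshold; the last (frontier) tier is always accepted.
def cascade(call, tiers, threshold=0.7):
    """call(model) -> (answer, confidence); tiers ordered cheap -> strong."""
    for model in tiers[:-1]:
        answer, confidence = call(model)
        if confidence >= threshold:
            return model, answer  # cheap model was confident enough
    model = tiers[-1]
    answer, _ = call(model)       # frontier fallback
    return model, answer

# Toy responses: only the frontier model is confident here.
fake = {"gpt-4o-mini": ("draft", 0.4), "gpt-4o": ("final", 0.95)}
model, answer = cascade(lambda m: fake[m], ["gpt-4o-mini", "gpt-4o"])
print(model, answer)  # gpt-4o final
```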
## Integration Patterns
### Pattern A: Front Proxy (Pre-Step)
```
User Request β†’ ACO.optimize() β†’ [Decisions] β†’ Agent Harness β†’ LLM API
```
### Pattern B: Around Wrapper (Pre + Post)
```
User Request β†’ ACO.optimize() β†’ Agent Step β†’ ACO.record_step() β†’ Next Step
```
### Pattern C: Inside Agent (Per-Step)
```
Agent Loop:
if step == 0: ACO.optimize()
else: ACO.reassess() # mid-run adjustment
```
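Pattern B, the most common integration, can be sketched as a wrapper loop around each agent step. The `StubOptimizer` and one-argument `record_step` below are minimal stand-ins for the real ACO and harness interfaces, kept small so the control flow is visible.

```python
# Pattern B sketch: optimize() before each step, record_step() after,
# finalize_trace() at the end. Stub objects stand in for the real API.
class StubResult:
    def __init__(self, trace_id): self.trace_id = trace_id

class StubOptimizer:
    def __init__(self): self.recorded = []
    def optimize(self, request, run_state): return StubResult("run-001")
    def record_step(self, trace_id, outcome): self.recorded.append(outcome)
    def finalize_trace(self, trace_id, outcome): self.final = outcome

def run_with_aco(optimizer, agent_step, request, run_state, steps=3):
    result = None
    for step in range(1, steps + 1):
        run_state["step_number"] = step
        result = optimizer.optimize(request, run_state)   # pre-step decisions
        outcome = agent_step(result)                      # agent does the work
        optimizer.record_step(result.trace_id, outcome)   # post-step accounting
    optimizer.finalize_trace(result.trace_id, outcome)
    return optimizer

opt = run_with_aco(StubOptimizer(), lambda r: "success", "task", {})
print(len(opt.recorded), opt.final)  # 3 success
```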
## Benchmarking Your Own Traces
```bash
# Generate benchmark
python -m aco.benchmark --tasks 1000 --output ./results
# Compare baselines
python -m aco.benchmark --compare always_frontier always_cheap cascade full_optimizer
# Run ablation study
python -m aco.benchmark --ablate all
```
## Dashboard
```bash
# Launch Gradio dashboard
python dashboard.py --results ./eval_results_v2/baseline_results.json
```
## Trackio Integration
```python
from aco.trackio_integration import ACOTrackioLogger
logger = ACOTrackioLogger(project="aco-production", space_id="your-space")
# Inside your agent loop
logger.log_decision(run_id, decision, cost, success)
logger.alert(run_id, "Cost spike", f"Step {step} cost ${cost:.3f}", "WARN")
```
## Multi-Provider Setup
```python
config = ACOConfig(
models={
"gpt-4o": ModelConfig(..., provider="openai", api_key_env="OPENAI_API_KEY"),
"claude-3.5-sonnet": ModelConfig(..., provider="anthropic", api_key_env="ANTHROPIC_API_KEY"),
"deepseek-chat": ModelConfig(..., provider="deepseek", api_key_env="DEEPSEEK_API_KEY"),
"local-qwen": ModelConfig(..., provider="local", base_url="http://localhost:8000/v1"),
}
)
```
## Safety Rules
1. **Legal/regulated tasks never go below tier 4** without explicit override
2. **Tool calls marked `requires_verification` always get a verifier**
3. **Irreversible actions trigger automatic frontier escalation**
4. **All routing decisions include reasoning strings for audit**
5. **Doom detector stops runs where cost exceeds 3x estimate**
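Rule 5 reduces to a one-line predicate over the run's cost state. The sketch below uses the `doom_max_cost_ratio` default from the tuning table; the function name and signature are illustrative, not the ACO doom-detector API.

```python
# Illustrative doom check: abort when accumulated cost exceeds
# doom_max_cost_ratio times the original estimate (default 3.0).
def should_abort(current_cost, estimated_cost, doom_max_cost_ratio=3.0):
    if estimated_cost <= 0:
        return False  # no usable estimate to compare against
    return current_cost > doom_max_cost_ratio * estimated_cost

print(should_abort(0.05, 0.01))  # True: 5x the estimate
print(should_abort(0.02, 0.01))  # False: only 2x
```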
## Performance Tuning
| Parameter | Default | Tune When... |
|-----------|---------|-------------|
| `doom_max_cost_ratio` | 3.0 | Runs often terminate too early |
| `doom_no_progress_steps` | 5 | Long-horizon tasks get killed |
| `verifier_confidence_threshold` | 0.7 | Too many/few verifiers |
| `max_context_fraction` | 0.8 | Context truncation issues |
| `cache_prefix_max_tokens` | 8000 | Cache hit rate low |
## Monitoring
Track these metrics in production:
- Cost per successful task (primary)
- Cost per artifact (secondary)
- Task success rate by tier
- Cache hit rate
- Tool call efficiency (used vs called)
- Verifier pass rate
- Retry rate
- False-DONE rate
- Escalation rate
- Doom detector precision/recall
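The primary metric, cost per successful task, divides *all* spend (including failed runs) by the number of successes, so failures still count against you. A minimal sketch, assuming a simple trace-dict shape rather than the ACO trace schema:

```python
# Cost per successful task over finalized traces (assumed dict shape).
def cost_per_successful_task(traces):
    successes = [t for t in traces if t["outcome"] == "success"]
    if not successes:
        return float("inf")  # no successes: cost per success is unbounded
    total_cost = sum(t["cost"] for t in traces)  # failures still cost money
    return total_cost / len(successes)

traces = [
    {"outcome": "success", "cost": 0.04},
    {"outcome": "failure", "cost": 0.02},
    {"outcome": "success", "cost": 0.06},
]
print(cost_per_successful_task(traces))  # ~0.06 (0.12 total / 2 successes)
```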