# Training Data Matters More Than Architecture: Lessons from Building an Agent Cost Optimizer
*What we learned from 11 iterations of router design, synthetic vs real training data, and why your routing model is only as good as the execution traces it learns from.*
---
## The Problem
Autonomous agents waste money. A coding agent that could solve roughly two-thirds of its tasks with a $0.01 call to a tiny model instead uses a $1.00 frontier model for everything. On 500 real SWE-bench tasks across 8 models, we found that **64.6% of tasks are solvable by the cheapest model**. That's massive waste.
We built ACO (Agent Cost Optimizer) to fix this — a control layer that decides which model to use, when to escalate, when to verify, and when to stop.
## The Surprising Finding
We expected the architecture to matter most. It didn't.
| Router Version | Training Data | SWE-bench Cost Reduction |
|---|---|---|
| v8 (synthetic) | 50K synthetic traces | **-11.6%** (costs MORE!) |
| v10 (real) | 500 real execution outcomes | **+23.3%** |
| v11 (combined) | 31K SPROUT + 500 SWE-Router | **+36.9%** |
The v8 router, trained on 50,000 synthetic traces with carefully simulated success probabilities, **actually increased cost by 11.6%** on real tasks. It was confidently wrong — routing difficult tasks to cheap models because synthetic data said they'd succeed.
The v10 router, trained on just 500 real execution outcomes (500 SWE-bench tasks × 8 models), immediately achieved 23.3% cost reduction. Same XGBoost architecture, same feature engineering. The only difference: the training data was real.
Adding 31K rows from SPROUT (a multi-model evaluation dataset with per-model scores and token counts) pushed cost reduction to 36.9%.
**The 34.9 percentage point swing came from one change: training data.**
## Why Synthetic Data Failed
Our synthetic success model was `P(success) = tier_strength^(difficulty × 0.6)`. This is clean, monotonic, and wrong. In reality:
- Cheap models sometimes succeed on hard tasks (10% of the time on difficulty-5 tasks)
- Frontier models sometimes fail on easy tasks (16% failure rate on difficulty-1 tasks)
- Real difficulty doesn't map cleanly from keyword counts
- Model capability varies by domain (a coding model fails at creative writing)
The synthetic model's smooth probability curve meant the router was well-calibrated on paper but poorly calibrated on reality. It routed with false confidence.
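To make the failure mode concrete, here is a minimal sketch of that synthetic model; the tier-strength values are illustrative assumptions, not the ones we used:

```python
# Minimal sketch of the v8 synthetic success model described above.
# The tier strengths (0.50 cheap, 0.95 frontier) are illustrative.

def synthetic_p_success(tier_strength: float, difficulty: int) -> float:
    """Smooth, monotonic success probability: clean on paper, wrong in practice."""
    return tier_strength ** (difficulty * 0.6)

for difficulty in range(1, 6):
    cheap = synthetic_p_success(0.50, difficulty)
    frontier = synthetic_p_success(0.95, difficulty)
    print(f"difficulty={difficulty}  cheap={cheap:.2f}  frontier={frontier:.2f}")

# The curve never lets a cheap model beat the odds on a hard task (~10% do on
# difficulty 5) or a frontier model fail an easy one (~16% do on difficulty 1),
# so a router trained on it inherits exactly that false confidence.
```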
## What Actually Worked
### 1. Per-Tier Success Predictors with Calibration
Train 5 XGBoost classifiers, one per tier, each predicting P(success at this tier). Calibrate with isotonic regression. Route to the cheapest tier where P(success) ≥ threshold.
On SPROUT (31K rows), CV F1 scores are 0.87-0.96 across all tiers. On SWE-bench, this produces calibrated probability ranges like [0.214, 1.000] for tier 1 and [0.154, 1.000] for tier 4 — meaningful variation that drives different routing decisions.
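Here is a minimal sketch of that setup, assuming a precomputed feature matrix and per-tier binary outcome labels; the hyperparameters, tier ordering, and 0.7 threshold are illustrative placeholders, not our tuned values:

```python
# Per-tier success predictors with isotonic calibration (illustrative sketch).
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from xgboost import XGBClassifier

def train_tier_models(X: np.ndarray, y_by_tier: list[np.ndarray]):
    """One calibrated classifier per tier, each predicting P(success at tier t)."""
    models = []
    for y in y_by_tier:  # y[i] = 1 if tier t solved task i, else 0
        base = XGBClassifier(n_estimators=200, max_depth=4)
        clf = CalibratedClassifierCV(base, method="isotonic", cv=5)
        clf.fit(X, y)
        models.append(clf)
    return models

def route(models, x: np.ndarray, threshold: float = 0.7) -> int:
    """Return the cheapest tier whose calibrated P(success) clears the threshold."""
    for tier, clf in enumerate(models):  # models ordered cheapest -> priciest
        if clf.predict_proba(x.reshape(1, -1))[0, 1] >= threshold:
            return tier
    return len(models) - 1  # nothing clears the bar: fall back to frontier
```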
### 2. Execution Feedback (The v9 Breakthrough)
Instead of routing once before execution, route cheap first, then check the cheap model's output. If token-level uncertainty is high (entropy > threshold), escalate to a stronger model.
On synthetic data, this matches frontier quality exactly (90.0% success) with a 2.1% cost reduction. On real data, it recovers most of the always-frontier success rate (74.8% vs 78.2%) at far lower cost by catching cheap-model failures and escalating.
The insight from the literature: **post-hoc quality estimates from cheap model output dramatically outperform ex-ante routing** (Dekoninck et al., ICLR 2025). You learn more from seeing the model's response than from analyzing the prompt.
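A sketch of the feedback loop, assuming a hypothetical `complete()` API that returns the response text plus per-token log-probability distributions; the 1.5-nat entropy threshold is a placeholder:

```python
# Execution feedback: run cheap first, escalate on high token-level entropy.
import math

def mean_token_entropy(token_logprob_dists: list[dict[str, float]]) -> float:
    """Average Shannon entropy (nats) over per-token logprob distributions.
    With top-k truncated logprobs this is an approximation."""
    total = 0.0
    for dist in token_logprob_dists:
        total -= sum(math.exp(lp) * lp for lp in dist.values())
    return total / max(len(token_logprob_dists), 1)

def route_with_feedback(task, cheap_model, frontier_model, threshold=1.5):
    text, logprob_dists = cheap_model.complete(task)   # assumed interface
    if mean_token_entropy(logprob_dists) > threshold:  # cheap model is unsure
        text, _ = frontier_model.complete(task)        # escalate
    return text
```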
### 3. Dynamic Difficulty Estimation
Not all coding tasks are difficulty 3. "Fix a typo in the README" should be tier 2, not tier 4. "Debug a critical production segfault NOW" should be tier 5.
Adding keyword-based difficulty adjustment (simple → -1, critical → +1) produces 3 routing decisions that diverge from the static heuristic, saving 25% on easy sub-tasks while escalating on critical ones.
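A minimal version of that adjustment; the keyword lists beyond "simple" and "critical" are illustrative additions:

```python
# Keyword-based difficulty adjustment (illustrative keyword lists).
EASY_KEYWORDS = {"typo", "simple", "rename", "comment"}
HARD_KEYWORDS = {"critical", "segfault", "security", "production"}

def adjust_difficulty(task_text: str, base_difficulty: int = 3) -> int:
    words = set(task_text.lower().split())
    delta = sum(w in HARD_KEYWORDS for w in words) \
          - sum(w in EASY_KEYWORDS for w in words)
    return max(1, min(5, base_difficulty + delta))  # clamp to the 1-5 scale

print(adjust_difficulty("Fix a typo in the README"))                  # -> 2
print(adjust_difficulty("Debug a critical production segfault NOW"))  # -> 5
```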
### 4. Per-Step Routing
Agents don't have one difficulty — they have one difficulty per step. Search steps are easy (tier 2). Edit steps on security-critical code are hard (tier 4-5). Verify steps depend on risk level.
Per-step routing reduces a typical coding agent run from $0.45 (medium task) to ~$0.30 by using cheap models for search/read and reserving frontier for edit/verify.
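A sketch of the idea with hypothetical step types and per-step prices; the real tier map and costs will differ:

```python
# Per-step routing: assign a tier per step type, not per task (illustrative).
STEP_TIER = {"search": 2, "read": 2, "edit": 4, "verify": 4}  # bump verify to 5 for high-risk code
TIER_COST = {1: 0.005, 2: 0.01, 3: 0.05, 4: 0.08, 5: 0.15}    # hypothetical $/step

def run_cost(steps: list[str]) -> float:
    return sum(TIER_COST[STEP_TIER[s]] for s in steps)

steps = ["search", "read", "search", "read", "edit", "edit", "verify"]
print(f"${run_cost(steps):.2f}")  # $0.28 -- in the ballpark of the ~$0.30 above
```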
## The Numbers
**SWE-bench (500 coding tasks, 8 models, real costs):**
| Policy | Success | Cost/Task | Savings |
|--------|---------|-----------|---------|
| Always frontier | 78.2% | $0.32 | baseline |
| v11 + feedback | 74.8% | $0.20 | 36.9% |
| v11 cascade | 67.4% | $0.12 | 62.5% |
| Oracle | 87.0% | $0.06 | 80.3% |
**BFCL v3 (82K function-calling traces, 108 models):**
- 84.1% of tasks solvable by cheaper models
- 82.5% need only the cheapest tier
## What's Next
The oracle shows 80.3% cost reduction is achievable. We're at 36.9%. The gap comes from:
1. **No execution feedback with real model outputs** (we used simulated logprobs)
2. **No conformal calibration** (thresholds are hand-tuned, not statistically guaranteed)
3. **No best-of-N cheap sampling** (generate 2-3 cheap responses and pick the best; see the sketch after this list)
4. **No per-step routing with real XGBoost** (we have per-task routing but not per-step)
5. **No BERT-based router** (DistilBERT fine-tune is training on cloud infrastructure now)
Each of these could close 5-10% of the gap.
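Of these, best-of-N sampling is the simplest to sketch; `cheap_model.complete` and `score_response` are assumed interfaces, not part of the released code:

```python
# Hypothetical best-of-N: sample a few cheap completions, keep the best-scoring one.
def best_of_n(task, cheap_model, score_response, n: int = 3) -> str:
    candidates = [cheap_model.complete(task) for _ in range(n)]
    return max(candidates, key=score_response)
```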
## Practical Takeaways
1. **Start with real execution data.** Even 500 rows beats 50K synthetic ones.
2. **Use execution feedback.** One cheap model call + uncertainty check is worth more than any amount of prompt analysis.
3. **Per-step routing matters.** Don't route the task — route each step.
4. **Safety floors prevent disasters.** Legal tasks always get tier 4+. No exceptions.
5. **Calibration > accuracy.** A well-calibrated P(success) of 0.70 is more useful than an overconfident 0.95.
## Links
- **Code & Models**: [narcolepticchicken/agent-cost-optimizer](https://huggingface.co/narcolepticchicken/agent-cost-optimizer)
- **Training Data**: [narcolepticchicken/agent-cost-traces](https://huggingface.co/datasets/narcolepticchicken/agent-cost-traces)
- **Dashboard**: [narcolepticchicken/aco-dashboard](https://huggingface.co/spaces/narcolepticchicken/aco-dashboard)