# Training Data Matters More Than Architecture: Lessons from Building an Agent Cost Optimizer

*What we learned from 11 iterations of router design, synthetic vs. real training data, and why your routing model is only as good as the execution traces it learns from.*

---

## The Problem

Autonomous agents waste money. A coding agent that could solve 67% of its tasks with a $0.01 tiny-model call instead uses a $1.00 frontier model for everything. On 500 real SWE-bench tasks across 8 models, we found that **64.6% of tasks are solvable by the cheapest model**. That's massive waste.

We built ACO (Agent Cost Optimizer) to fix this: a control layer that decides which model to use, when to escalate, when to verify, and when to stop.

## The Surprising Finding

We expected the architecture to matter most. It didn't.

| Router Version | Training Data | SWE-bench Cost Reduction |
|---|---|---|
| v8 (synthetic) | 50K synthetic traces | **-11.6%** (costs MORE!) |
| v10 (real) | 500 real execution outcomes | **+23.3%** |
| v11 (combined) | 31K SPROUT + 500 SWE-Router | **+36.9%** |

The v8 router, trained on 50,000 synthetic traces with carefully simulated success probabilities, **actually increased cost by 11.6%** on real tasks. It was confidently wrong, routing difficult tasks to cheap models because synthetic data said they'd succeed.

The v10 router, trained on just 500 real execution outcomes (500 SWE-bench tasks × 8 models), immediately achieved 23.3% cost reduction. Same XGBoost architecture, same feature engineering. The only difference: the training data was real.

Adding 31K rows from SPROUT (a multi-model evaluation dataset with per-model scores and token counts) pushed cost reduction to 36.9%.

**The 34.9-percentage-point swing from v8 to v10 came from one change: training data.**

## Why Synthetic Data Failed

Our synthetic success model was `P(success) = tier_strength^(difficulty × 0.6)`. This is clean, monotonic, and wrong. In reality:

- Cheap models sometimes succeed on hard tasks (10% of the time on difficulty-5 tasks)
- Frontier models sometimes fail on easy tasks (16% failure rate on difficulty-1 tasks)
- Real difficulty doesn't map cleanly from keyword counts
- Model capability varies by domain (a coding model fails at creative writing)

The synthetic model's smooth probability curve meant the router was well calibrated on paper but poorly calibrated on reality. It routed with false confidence.

## What Actually Worked

### 1. Per-Tier Success Predictors with Calibration

Train 5 XGBoost classifiers, one per tier, each predicting P(success at this tier). Calibrate with isotonic regression. Route to the cheapest tier where P(success) ≥ threshold.

On SPROUT (31K rows), CV F1 scores are 0.87-0.96 across all tiers. On SWE-bench, this produces calibrated probability ranges like [0.214, 1.000] for tier 1 and [0.154, 1.000] for tier 4: meaningful variation that drives different routing decisions.

### 2. Execution Feedback (The v9 Breakthrough)

Instead of routing once before execution, route cheap first, then check the cheap model's output. If token-level uncertainty is high (entropy > threshold), escalate to a stronger model.

On synthetic data, this matches frontier quality exactly (90.0% success) at 2.1% cost reduction. On real data, it recovers most of always-frontier's success (74.8% vs. 78.2%) at 36.9% lower cost by catching cheap-model failures and escalating.

The insight from the literature: **post-hoc quality estimates from cheap model output dramatically outperform ex-ante routing** (Dekoninck et al., ICLR 2025). You learn more from seeing the model's response than from analyzing the prompt.
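To make section 1 concrete, here's a minimal sketch of per-tier success predictors with isotonic calibration and cheapest-tier-first routing. The feature schema, hyperparameters, tier labels, and the 0.70 threshold are illustrative assumptions, not values from the ACO repo:

```python
# Hypothetical sketch: one calibrated success predictor per tier, route to the
# cheapest tier whose P(success) clears a threshold. Names and values are assumptions.
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from xgboost import XGBClassifier

TIERS = [1, 2, 3, 4, 5]        # cheapest -> most expensive
THRESHOLD = 0.70               # assumed routing threshold, not a tuned value

def train_tier_models(X, success_by_tier):
    """X: (n_tasks, n_features); success_by_tier: dict tier -> 0/1 labels per task."""
    models = {}
    for tier in TIERS:
        base = XGBClassifier(n_estimators=300, max_depth=4)
        # Isotonic regression turns raw scores into calibrated probabilities.
        models[tier] = CalibratedClassifierCV(base, method="isotonic", cv=5)
        models[tier].fit(X, success_by_tier[tier])
    return models

def route(task_features, models, threshold=THRESHOLD):
    """Return the cheapest tier whose calibrated P(success) >= threshold."""
    x = np.asarray(task_features, dtype=float).reshape(1, -1)
    for tier in TIERS:  # iterate cheapest-first
        p = models[tier].predict_proba(x)[0, 1]
        if p >= threshold:
            return tier, p
    # Nothing clears the bar: fall back to the frontier tier.
    return TIERS[-1], models[TIERS[-1]].predict_proba(x)[0, 1]
```

Routing cheapest-first against a single threshold is exactly why calibration matters more than raw accuracy: an overconfident 0.95 from an uncalibrated model would send hard tasks to tier 1.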
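Section 2's loop is equally small. A minimal sketch, assuming an API that returns per-token top logprobs; `call_model` and the entropy threshold are hypothetical stand-ins, not ACO's actual interface:

```python
# Hedged sketch of the cheap-first, escalate-on-uncertainty loop from section 2.
import math

ENTROPY_THRESHOLD = 1.5   # assumed: mean nats/token above this triggers escalation

def mean_token_entropy(top_logprobs_per_token):
    """Average Shannon entropy (nats) over the top-k logprobs returned per token."""
    entropies = []
    for top_logprobs in top_logprobs_per_token:
        probs = [math.exp(lp) for lp in top_logprobs]
        total = sum(probs) or 1.0
        probs = [p / total for p in probs]            # renormalize truncated top-k
        entropies.append(-sum(p * math.log(p) for p in probs if p > 0))
    return sum(entropies) / max(len(entropies), 1)

def route_with_feedback(prompt, call_model):
    """call_model(prompt, tier) -> (text, top_logprobs_per_token); hypothetical API."""
    text, logprobs = call_model(prompt, tier="cheap")
    if mean_token_entropy(logprobs) <= ENTROPY_THRESHOLD:
        return text, "cheap"                          # cheap output looks confident
    text, _ = call_model(prompt, tier="frontier")     # escalate on high uncertainty
    return text, "frontier"
```

Renormalizing the truncated top-k probabilities keeps the entropy estimate comparable across providers that return different numbers of logprobs per token.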
### 3. Dynamic Difficulty Estimation

Not all coding tasks are difficulty 3. "Fix a typo in the README" should be tier 2, not tier 4. "Debug a critical production segfault NOW" should be tier 5.

Adding keyword-based difficulty adjustment (simple → -1, critical → +1) produces 3 routing decisions that diverge from the static heuristic, saving 25% on easy sub-tasks while escalating on critical ones.

### 4. Per-Step Routing

Agents don't have one difficulty; they have one difficulty per step. Search steps are easy (tier 2). Edit steps on security-critical code are hard (tier 4-5). Verify steps depend on risk level.

Per-step routing reduces a typical coding agent run from $0.45 (medium task) to ~$0.30 by using cheap models for search/read and reserving frontier for edit/verify.

## The Numbers

**SWE-bench (500 coding tasks, 8 models, real costs):**

| Policy | Success | Cost/Task | Savings |
|--------|---------|-----------|---------|
| Always frontier | 78.2% | $0.32 | baseline |
| v11 + feedback | 74.8% | $0.20 | 36.9% |
| v11 cascade | 67.4% | $0.12 | 62.5% |
| Oracle | 87.0% | $0.06 | 80.3% |

**BFCL v3 (82K function-calling traces, 108 models):**

- 84.1% of tasks solvable by cheaper models
- 82.5% need only the cheapest tier

## What's Next

The oracle shows 80.3% cost reduction is achievable. We're at 36.9%. The gap comes from:

1. **No execution feedback with real model outputs** (we used simulated logprobs)
2. **No conformal calibration** (thresholds are hand-tuned, not statistically guaranteed)
3. **No best-of-N cheap sampling** (generate 2-3 cheap responses, pick the best)
4. **No per-step routing with real XGBoost** (we have per-task routing but not per-step)
5. **No BERT-based router** (a DistilBERT fine-tune is training on cloud infrastructure now)

Each of these could close 5-10% of the gap.

## Practical Takeaways

1. **Start with real execution data.** Even 500 rows beat 50K synthetic ones.
2. **Use execution feedback.** One cheap model call plus an uncertainty check is worth more than any amount of prompt analysis.
3. **Per-step routing matters.** Don't route the task; route each step.
4. **Safety floors prevent disasters.** Legal tasks always get tier 4+. No exceptions.
5. **Calibration > accuracy.** A well-calibrated P(success) of 0.70 is more useful than an overconfident 0.95.

## Links

- **Code & Models**: [narcolepticchicken/agent-cost-optimizer](https://huggingface.co/narcolepticchicken/agent-cost-optimizer)
- **Training Data**: [narcolepticchicken/agent-cost-traces](https://huggingface.co/datasets/narcolepticchicken/agent-cost-traces)
- **Dashboard**: [narcolepticchicken/aco-dashboard](https://huggingface.co/spaces/narcolepticchicken/aco-dashboard)