# Training Data Matters More Than Architecture: Lessons from Building an Agent Cost Optimizer
*What we learned from 11 iterations of router design, synthetic vs real training data, and why your routing model is only as good as the execution traces it learns from.*
---
## The Problem
Autonomous agents waste money. A coding agent that could solve roughly two-thirds of its tasks with a $0.01 call to a tiny model instead uses a $1.00 frontier model for everything. On 500 real SWE-bench tasks across 8 models, we found that **64.6% of tasks are solvable by the cheapest model**. That's massive waste.
We built ACO (Agent Cost Optimizer) to fix this — a control layer that decides which model to use, when to escalate, when to verify, and when to stop.
## The Surprising Finding
We expected the architecture to matter most. It didn't.
| Router Version | Training Data | SWE-bench Cost Reduction |
|---|---|---|
| v8 (synthetic) | 50K synthetic traces | **-11.6%** (costs MORE!) |
| v10 (real) | 500 real execution outcomes | **+23.3%** |
| v11 (combined) | 31K SPROUT + 500 SWE-Router | **+36.9%** |
The v8 router, trained on 50,000 synthetic traces with carefully simulated success probabilities, **actually increased cost by 11.6%** on real tasks. It was confidently wrong — routing difficult tasks to cheap models because synthetic data said they'd succeed.
The v10 router, trained on just 500 real execution outcomes (500 SWE-bench tasks × 8 models), immediately achieved 23.3% cost reduction. Same XGBoost architecture, same feature engineering. The only difference: the training data was real.
Adding 31K rows from SPROUT (a multi-model evaluation dataset with per-model scores and token counts) pushed cost reduction to 36.9%.
**The 34.9 percentage point swing came from one change: training data.**
## Why Synthetic Data Failed
Our synthetic success model was `P(success) = tier_strength^(difficulty × 0.6)`. This is clean, monotonic, and wrong. In reality:
- Cheap models sometimes succeed on hard tasks (10% of the time on difficulty-5 tasks)
- Frontier models sometimes fail on easy tasks (16% failure rate on difficulty-1 tasks)
- Real difficulty doesn't map cleanly from keyword counts
- Model capability varies by domain (a coding model fails at creative writing)
The synthetic model's smooth probability curve meant the router was well-calibrated on paper but poorly calibrated on reality. It routed with false confidence.
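To make the failure mode concrete, here is a minimal sketch of that synthetic model; the tier-strength values are illustrative assumptions, not the ones we used:

```python
# Minimal sketch of the v8 synthetic success model described above.
# The tier strengths (0.50 cheap, 0.95 frontier) are illustrative.

def synthetic_p_success(tier_strength: float, difficulty: int) -> float:
    """Smooth, monotonic success probability: clean on paper, wrong in practice."""
    return tier_strength ** (difficulty * 0.6)

for difficulty in range(1, 6):
    cheap = synthetic_p_success(0.50, difficulty)
    frontier = synthetic_p_success(0.95, difficulty)
    print(f"difficulty={difficulty}  cheap={cheap:.2f}  frontier={frontier:.2f}")

# The curve never lets a cheap model beat the odds on a hard task (~10% do on
# difficulty 5) or a frontier model fail an easy one (~16% do on difficulty 1),
# so a router trained on it inherits exactly that false confidence.
```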
## What Actually Worked
### 1. Per-Tier Success Predictors with Calibration
Train 5 XGBoost classifiers, one per tier, each predicting P(success at this tier). Calibrate with isotonic regression. Route to the cheapest tier where P(success) ≥ threshold.
On SPROUT (31K rows), CV F1 scores are 0.87-0.96 across all tiers. On SWE-bench, this produces calibrated probability ranges like [0.214, 1.000] for tier 1 and [0.154, 1.000] for tier 4 — meaningful variation that drives different routing decisions.
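Here is a minimal sketch of that setup, assuming a precomputed feature matrix and per-tier binary outcome labels; the hyperparameters, tier ordering, and 0.7 threshold are illustrative placeholders, not our tuned values:

```python
# Per-tier success predictors with isotonic calibration (illustrative sketch).
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from xgboost import XGBClassifier

def train_tier_models(X: np.ndarray, y_by_tier: list[np.ndarray]):
    """One calibrated classifier per tier, each predicting P(success at tier t)."""
    models = []
    for y in y_by_tier:  # y[i] = 1 if tier t solved task i, else 0
        base = XGBClassifier(n_estimators=200, max_depth=4)
        clf = CalibratedClassifierCV(base, method="isotonic", cv=5)
        clf.fit(X, y)
        models.append(clf)
    return models

def route(models, x: np.ndarray, threshold: float = 0.7) -> int:
    """Return the cheapest tier whose calibrated P(success) clears the threshold."""
    for tier, clf in enumerate(models):  # models ordered cheapest -> priciest
        if clf.predict_proba(x.reshape(1, -1))[0, 1] >= threshold:
            return tier
    return len(models) - 1  # nothing clears the bar: fall back to frontier
```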
### 2. Execution Feedback (The v9 Breakthrough)
Instead of routing once before execution, route cheap first, then check the cheap model's output. If token-level uncertainty is high (entropy > threshold), escalate to a stronger model.
On synthetic data, this matches frontier quality exactly (90.0% success) with a 2.1% cost reduction. On real data, it recovers most of the always-frontier success rate (74.8% vs 78.2%) at far lower cost by catching cheap-model failures and escalating.
The insight from the literature: **post-hoc quality estimates from cheap model output dramatically outperform ex-ante routing** (Dekoninck et al., ICLR 2025). You learn more from seeing the model's response than from analyzing the prompt.
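A sketch of the feedback loop, assuming a hypothetical `complete()` API that returns the response text plus per-token log-probability distributions; the 1.5-nat entropy threshold is a placeholder:

```python
# Execution feedback: run cheap first, escalate on high token-level entropy.
import math

def mean_token_entropy(token_logprob_dists: list[dict[str, float]]) -> float:
    """Average Shannon entropy (nats) over per-token logprob distributions.
    With top-k truncated logprobs this is an approximation."""
    total = 0.0
    for dist in token_logprob_dists:
        total -= sum(math.exp(lp) * lp for lp in dist.values())
    return total / max(len(token_logprob_dists), 1)

def route_with_feedback(task, cheap_model, frontier_model, threshold=1.5):
    text, logprob_dists = cheap_model.complete(task)   # assumed interface
    if mean_token_entropy(logprob_dists) > threshold:  # cheap model is unsure
        text, _ = frontier_model.complete(task)        # escalate
    return text
```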
### 3. Dynamic Difficulty Estimation
Not all coding tasks are difficulty 3. "Fix a typo in the README" should be tier 2, not tier 4. "Debug a critical production segfault NOW" should be tier 5.
Adding keyword-based difficulty adjustment (simple → -1, critical → +1) produces 3 routing decisions that diverge from the static heuristic, saving 25% on easy sub-tasks while escalating on critical ones.
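A minimal version of that adjustment; the keyword lists beyond "simple" and "critical" are illustrative additions:

```python
# Keyword-based difficulty adjustment (illustrative keyword lists).
EASY_KEYWORDS = {"typo", "simple", "rename", "comment"}
HARD_KEYWORDS = {"critical", "segfault", "security", "production"}

def adjust_difficulty(task_text: str, base_difficulty: int = 3) -> int:
    words = set(task_text.lower().split())
    delta = sum(w in HARD_KEYWORDS for w in words) \
          - sum(w in EASY_KEYWORDS for w in words)
    return max(1, min(5, base_difficulty + delta))  # clamp to the 1-5 scale

print(adjust_difficulty("Fix a typo in the README"))                  # -> 2
print(adjust_difficulty("Debug a critical production segfault NOW"))  # -> 5
```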
### 4. Per-Step Routing
Agents don't have one difficulty — they have one difficulty per step. Search steps are easy (tier 2). Edit steps on security-critical code are hard (tier 4-5). Verify steps depend on risk level.
Per-step routing reduces a typical coding agent run from $0.45 (medium task) to ~$0.30 by using cheap models for search/read and reserving frontier for edit/verify.
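A sketch of the idea with hypothetical step types and per-step prices; the real tier map and costs will differ:

```python
# Per-step routing: assign a tier per step type, not per task (illustrative).
STEP_TIER = {"search": 2, "read": 2, "edit": 4, "verify": 4}  # bump verify to 5 for high-risk code
TIER_COST = {1: 0.005, 2: 0.01, 3: 0.05, 4: 0.08, 5: 0.15}    # hypothetical $/step

def run_cost(steps: list[str]) -> float:
    return sum(TIER_COST[STEP_TIER[s]] for s in steps)

steps = ["search", "read", "search", "read", "edit", "edit", "verify"]
print(f"${run_cost(steps):.2f}")  # $0.28 -- in the ballpark of the ~$0.30 above
```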
## The Numbers
**SWE-bench (500 coding tasks, 8 models, real costs):**
| Policy | Success | Cost/Task | Savings |
|--------|---------|-----------|---------|
| Always frontier | 78.2% | $0.32 | baseline |
| v11 + feedback | 74.8% | $0.20 | 36.9% |
| v11 cascade | 67.4% | $0.12 | 62.5% |
| Oracle | 87.0% | $0.06 | 80.3% |
**BFCL v3 (82K function-calling traces, 108 models):**
- 84.1% of tasks solvable by cheaper models
- 82.5% need only the cheapest tier
## What's Next
The oracle shows 80.3% cost reduction is achievable. We're at 36.9%. The gap comes from:
1. **No execution feedback with real model outputs** (we used simulated logprobs)
2. **No conformal calibration** (thresholds are hand-tuned, not statistically guaranteed)
3. **No best-of-N cheap sampling** (generate 2-3 cheap responses and pick the best; see the sketch after this list)
4. **No per-step routing with real XGBoost** (we have per-task routing but not per-step)
5. **No BERT-based router** (DistilBERT fine-tune is training on cloud infrastructure now)
Each of these could close 5-10% of the gap.
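Of these, best-of-N sampling is the simplest to sketch; `cheap_model.complete` and `score_response` are assumed interfaces, not part of the released code:

```python
# Hypothetical best-of-N: sample a few cheap completions, keep the best-scoring one.
def best_of_n(task, cheap_model, score_response, n: int = 3) -> str:
    candidates = [cheap_model.complete(task) for _ in range(n)]
    return max(candidates, key=score_response)
```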
## Practical Takeaways
1. **Start with real execution data.** Even 500 rows beats 50K synthetic ones.
2. **Use execution feedback.** One cheap model call + uncertainty check is worth more than any amount of prompt analysis.
3. **Per-step routing matters.** Don't route the task — route each step.
4. **Safety floors prevent disasters.** Legal tasks always get tier 4+. No exceptions.
5. **Calibration > accuracy.** A well-calibrated P(success) of 0.70 is more useful than an overconfident 0.95.
## Links
- **Code & Models**: [narcolepticchicken/agent-cost-optimizer](https://huggingface.co/narcolepticchicken/agent-cost-optimizer)
- **Training Data**: [narcolepticchicken/agent-cost-traces](https://huggingface.co/datasets/narcolepticchicken/agent-cost-traces)
- **Dashboard**: [narcolepticchicken/aco-dashboard](https://huggingface.co/spaces/narcolepticchicken/aco-dashboard)