
Training Data Matters More Than Architecture: Lessons from Building an Agent Cost Optimizer

What we learned from 11 iterations of router design, synthetic vs real training data, and why your routing model is only as good as the execution traces it learns from.


The Problem

Autonomous agents waste money. A coding agent that could solve 67% of its tasks with a $0.01 call to a tiny model instead uses a $1.00 frontier model for everything. On 500 real SWE-bench tasks across 8 models, we found that 64.6% of tasks are solvable by the cheapest model. That's massive waste.

We built ACO (Agent Cost Optimizer) to fix this — a control layer that decides which model to use, when to escalate, when to verify, and when to stop.
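To make the control-layer idea concrete, here is a minimal sketch of the decision surface such a layer exposes. The names (`RoutingDecision`, `decide`) and fields are illustrative placeholders, not ACO's actual API.

```python
from dataclasses import dataclass

# Illustrative sketch of a cost-aware control layer's decision surface.
# Names and fields are hypothetical, not ACO's actual API.

@dataclass
class RoutingDecision:
    model_tier: int          # 1 = cheapest, 5 = frontier
    escalate_on_doubt: bool  # re-route to a stronger tier if the output looks uncertain
    verify: bool             # run an extra verification pass on risky steps
    stop: bool               # halt instead of spending more

def decide(task_features: dict) -> RoutingDecision:
    """Placeholder policy: start cheap, escalate on doubt, verify only risky work."""
    return RoutingDecision(
        model_tier=1,
        escalate_on_doubt=True,
        verify=task_features.get("risk", "low") == "high",
        stop=False,
    )
```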

The Surprising Finding

We expected the architecture to matter most. It didn't.

| Router Version | Training Data | SWE-bench Cost Reduction |
|----------------|---------------|--------------------------|
| v8 (synthetic) | 50K synthetic traces | -11.6% (costs MORE!) |
| v10 (real) | 500 real execution outcomes | +23.3% |
| v11 (combined) | 31K SPROUT + 500 SWE-Router | +36.9% |

The v8 router, trained on 50,000 synthetic traces with carefully simulated success probabilities, actually increased cost by 11.6% on real tasks. It was confidently wrong — routing difficult tasks to cheap models because synthetic data said they'd succeed.

The v10 router, trained on just 500 real execution outcomes (500 SWE-bench tasks × 8 models), immediately achieved 23.3% cost reduction. Same XGBoost architecture, same feature engineering. The only difference: the training data was real.

Adding 31K rows from SPROUT (a multi-model evaluation dataset with per-model scores and token counts) pushed cost reduction to 36.9%.

The 34.9-percentage-point swing from v8 to v10 came from one change: training data.

Why Synthetic Data Failed

Our synthetic success model was P(success) = tier_strength^(difficulty × 0.6). This is clean, monotonic, and wrong. In reality:

  • Cheap models sometimes succeed on hard tasks (10% of the time on difficulty-5 tasks)
  • Frontier models sometimes fail on easy tasks (16% failure rate on difficulty-1 tasks)
  • Real difficulty doesn't map cleanly from keyword counts
  • Model capability varies by domain (a coding model fails at creative writing)

The synthetic model's smooth probability curve meant the router was well-calibrated on paper but poorly calibrated in reality. It routed with false confidence.
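For reference, that synthetic model fits in a few lines. A minimal sketch, with illustrative placeholder tier strengths rather than the values we actually used:

```python
# Sketch of the synthetic success model described above:
#     P(success) = tier_strength ** (difficulty * 0.6)
# The tier_strength values below are illustrative placeholders.

TIER_STRENGTH = {1: 0.60, 2: 0.70, 3: 0.80, 4: 0.90, 5: 0.97}

def synthetic_p_success(tier: int, difficulty: int) -> float:
    return TIER_STRENGTH[tier] ** (difficulty * 0.6)

# Smooth and monotonic: harder tasks always look less solvable, stronger tiers
# always look more capable. Real execution traces contradict both assumptions
# (cheap models sometimes solve difficulty-5 tasks; frontier models sometimes
# fail difficulty-1 tasks).
for d in (1, 3, 5):
    print(d, [round(synthetic_p_success(t, d), 2) for t in range(1, 6)])
```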

What Actually Worked

1. Per-Tier Success Predictors with Calibration

Train 5 XGBoost classifiers, one per tier, each predicting P(success at this tier). Calibrate with isotonic regression. Route to the cheapest tier where P(success) ≥ threshold.
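A minimal sketch of this setup, assuming scikit-learn and xgboost; the features, tier labels, and the 0.70 threshold are illustrative placeholders, and the real feature engineering is omitted.

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.calibration import CalibratedClassifierCV

# Sketch: one calibrated success predictor per tier, then route to the
# cheapest tier whose calibrated P(success) clears a threshold.
# Tier list, threshold, and feature construction are illustrative.

TIERS = [1, 2, 3, 4, 5]   # 1 = cheapest, 5 = frontier
THRESHOLD = 0.70

def train_tier_models(X: np.ndarray, y_per_tier: dict) -> dict:
    """y_per_tier[t] is a 0/1 array: did tier t solve each training task?"""
    models = {}
    for t in TIERS:
        base = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
        # Isotonic regression turns raw scores into usable probabilities.
        models[t] = CalibratedClassifierCV(base, method="isotonic", cv=5)
        models[t].fit(X, y_per_tier[t])
    return models

def route(models: dict, x: np.ndarray) -> int:
    """Return the cheapest tier whose calibrated P(success) >= THRESHOLD."""
    for t in TIERS:
        p = models[t].predict_proba(x.reshape(1, -1))[0, 1]
        if p >= THRESHOLD:
            return t
    return TIERS[-1]  # nothing clears the bar: fall back to frontier
```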

On SPROUT (31K rows), CV F1 scores are 0.87-0.96 across all tiers. On SWE-bench, this produces calibrated probability ranges like [0.214, 1.000] for tier 1 and [0.154, 1.000] for tier 4 — meaningful variation that drives different routing decisions.

2. Execution Feedback (The v9 Breakthrough)

Instead of routing once before execution, route cheap first, then check the cheap model's output. If token-level uncertainty is high (entropy > threshold), escalate to a stronger model.
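A minimal sketch of that feedback loop, assuming the cheap model's API returns per-token log-probability distributions; the entropy threshold, model names, and `call_model` helper are illustrative placeholders.

```python
import math

# Sketch of the cascade: try the cheap model first, escalate only when its own
# token-level uncertainty is high. The entropy threshold and model names are
# illustrative, and call_model is a placeholder client assumed to return
# generated text plus per-token log-probability distributions.

ENTROPY_THRESHOLD = 1.5  # nats per token; hand-tuned placeholder

def mean_token_entropy(token_logprob_dists):
    """token_logprob_dists: one {token: logprob} dict per generated token
    (top-k logprobs, treated as the full distribution for simplicity)."""
    entropies = []
    for dist in token_logprob_dists:
        probs = [math.exp(lp) for lp in dist.values()]
        entropies.append(-sum(p * math.log(p) for p in probs if p > 0))
    return sum(entropies) / max(len(entropies), 1)

def route_with_feedback(prompt, call_model):
    cheap = call_model("tier-1-cheap", prompt, logprobs=True)
    if mean_token_entropy(cheap.token_logprob_dists) <= ENTROPY_THRESHOLD:
        return cheap.text                              # confident cheap answer: keep it
    return call_model("tier-5-frontier", prompt).text  # uncertain: escalate
```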

On synthetic data, this matches frontier quality exactly (90.0% success) at 2.1% cost reduction. On real data, it recovers most of the always-frontier success rate (74.8% vs 78.2%) at 36.9% lower cost by catching cheap-model failures and escalating.

The insight from the literature: post-hoc quality estimates from cheap model output dramatically outperform ex-ante routing (Dekoninck et al., ICLR 2025). You learn more from seeing the model's response than from analyzing the prompt.

3. Dynamic Difficulty Estimation

Not all coding tasks are difficulty 3. "Fix a typo in the README" should be tier 2, not tier 4. "Debug a critical production segfault NOW" should be tier 5.

Adding keyword-based difficulty adjustment ("simple" → -1, "critical" → +1) produces 3 routing decisions that diverge from the static heuristic, saving 25% on easy sub-tasks while escalating on critical ones.
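The adjustment itself is plain keyword matching. A sketch below, with illustrative keyword lists rather than the exact ones we use:

```python
# Sketch of keyword-based difficulty adjustment. The keyword lists are
# illustrative placeholders; the real lists are larger and tuned per domain.

EASY_HINTS = ("typo", "rename", "simple", "comment", "readme")
HARD_HINTS = ("segfault", "critical", "production", "security", "race condition")

def adjust_difficulty(base_difficulty: int, task_text: str) -> int:
    text = task_text.lower()
    delta = 0
    if any(k in text for k in EASY_HINTS):
        delta -= 1
    if any(k in text for k in HARD_HINTS):
        delta += 1
    return max(1, min(5, base_difficulty + delta))

# "Fix a typo in the README"                 -> difficulty drops by one
# "Debug a critical production segfault NOW" -> difficulty rises by one
```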

4. Per-Step Routing

Agents don't have one difficulty — they have one difficulty per step. Search steps are easy (tier 2). Edit steps on security-critical code are hard (tier 4-5). Verify steps depend on risk level.

Per-step routing reduces a typical coding agent run from $0.45 (medium task) to ~$0.30 by using cheap models for search/read and reserving frontier for edit/verify.
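A sketch of the per-step mapping; the step taxonomy and tier assignments are illustrative defaults, not the learned per-step policy:

```python
# Sketch of per-step routing: each agent step gets its own tier instead of
# inheriting one difficulty for the whole task. The table is an illustrative
# static default; a learned router would override it per step.

DEFAULT_STEP_TIERS = {
    "search": 2,   # retrieval / code search: cheap models are usually enough
    "read":   2,
    "plan":   3,
    "edit":   4,   # code changes are where failures get expensive
    "verify": 4,   # bumped to 5 for high-risk (e.g. security-critical) steps
}

def tier_for_step(step_type: str, risk: str = "low") -> int:
    tier = DEFAULT_STEP_TIERS.get(step_type, 3)
    if risk == "high":
        tier = min(5, tier + 1)
    return tier
```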

The Numbers

SWE-bench (500 coding tasks, 8 models, real costs):

| Policy | Success | Cost/Task | Savings |
|--------|---------|-----------|---------|
| Always frontier | 78.2% | $0.32 | baseline |
| v11 + feedback | 74.8% | $0.20 | 36.9% |
| v11 cascade | 67.4% | $0.12 | 62.5% |
| Oracle | 87.0% | $0.06 | 80.3% |

BFCL v3 (82K function-calling traces, 108 models):

  • 84.1% of tasks solvable by cheaper models
  • 82.5% need only the cheapest tier

What's Next

The oracle shows 80.3% cost reduction is achievable. We're at 36.9%. The gap comes from:

  1. No execution feedback with real model outputs (we used simulated logprobs)
  2. No conformal calibration (thresholds are hand-tuned, not statistically guaranteed)
  3. No best-of-N cheap sampling (generate 2-3 cheap responses, pick best)
  4. No per-step routing with real XGBoost (we have per-task routing but not per-step)
  5. No BERT-based router (DistilBERT fine-tune is training on cloud infrastructure now)

Each of these could close 5-10% of the gap.
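As one illustration of item 2, a threshold can be chosen on a held-out calibration set so that the failure rate among accepted tasks stays below a target level. The sketch below is a simplified risk-control recipe in the spirit of conformal calibration, not a full conformal guarantee, and the alpha value is a placeholder.

```python
import numpy as np

# Simplified, conformal-style threshold selection for one tier (item 2 above).
# Given held-out calibrated P(success) predictions and observed 0/1 outcomes,
# pick the lowest probability cutoff such that the empirical failure rate
# among accepted tasks stays below alpha. Alpha and the data are placeholders.

def acceptance_threshold(p_success: np.ndarray, solved: np.ndarray, alpha: float = 0.1) -> float:
    order = np.argsort(-p_success)              # most confident first
    p_sorted, y_sorted = p_success[order], solved[order]
    failures = np.cumsum(1 - y_sorted)          # failures within each confident prefix
    accepted = np.arange(1, len(y_sorted) + 1)  # size of each prefix
    ok = failures / accepted <= alpha           # prefixes meeting the failure budget
    if not ok.any():
        return 1.01                             # accept nothing at this tier
    k = int(np.where(ok)[0].max())              # largest prefix within budget
    return float(p_sorted[k])                   # route here only if P(success) >= this
```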

Practical Takeaways

  1. Start with real execution data. Even 500 real rows beat 50K synthetic ones.
  2. Use execution feedback. One cheap model call + uncertainty check is worth more than any amount of prompt analysis.
  3. Per-step routing matters. Don't route the task — route each step.
  4. Safety floors prevent disasters. Legal tasks always get tier 4+. No exceptions.
  5. Calibration > accuracy. A well-calibrated P(success) of 0.70 is more useful than an overconfident 0.95.

Links