ACO Final Report: Agent Cost Optimizer
Executive Summary
ACO is a universal control layer that reduces autonomous agent cost while preserving task quality. After 11 router iterations, trained variously on synthetic data, real execution data, and combined datasets, the key finding is:
Training on real execution data is the single most important lever. The router trained on synthetic data actually increased cost by 11.6% on real tasks, while the router trained on real SWE-bench execution data achieved a 36.9% cost reduction at comparable quality: a 48.5 percentage point swing in relative cost from a single change.
Required Answers
How much cost was saved at iso-quality?
On real SWE-bench tasks (500 coding tasks, 8 models), with the synthetic v9 result included for comparison:
| Comparison | Cost Reduction | Quality Delta |
|---|---|---|
| v11 feedback vs always-frontier | 36.9% | -3.4pp (74.8% vs 78.2%) |
| v11 cascade (thr=0.65) vs frontier | 62.5% | -10.8pp (67.4% vs 78.2%) |
| v9 feedback (synthetic) vs frontier | 2.1% | +0.1pp (90.0% vs 90.0%) |
On synthetic benchmarks, v9 with execution feedback matches frontier quality exactly (90.0% vs 90.0%) at 2.1% cost reduction. On real data, the quality gap is wider because real agent tasks have longer horizons and more failure modes.
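As a concrete illustration, here is a minimal sketch of the cascade rule at thr=0.65, assuming the router exposes a calibrated P(success) per tier; the tier names, relative prices, and the `predict_success` helper are illustrative stand-ins, not the actual ACO interface:

```python
CASCADE_THRESHOLD = 0.65  # the v11 cascade operating point from the table above

# Hypothetical tier ladder (name, relative cost); not the actual ACO model list.
TIERS = [("tier1-cheap", 0.02), ("tier2-mid", 0.10), ("tier3-strong", 0.35), ("frontier", 1.00)]

def route(task_features, predict_success):
    """predict_success(task_features, tier_name) -> calibrated P(success) in [0, 1]."""
    for tier, _cost in TIERS[:-1]:
        # Take the cheapest tier that clears the confidence threshold.
        if predict_success(task_features, tier) >= CASCADE_THRESHOLD:
            return tier
    # No cheap tier is confident enough: pay for the frontier model.
    return TIERS[-1][0]
```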
Which module saved the most?
Ablation results (SWE-bench, 500 tasks):
| Module | Success Delta | Cost Delta | Impact |
|---|---|---|---|
| Feedback escalation | -8.6pp | -$0.0027 | Highest quality impact |
| v10/v11 router (vs heuristic) | +14.8pp | -$0.024 | Highest cost impact |
| Execution-feedback (v9) | +6.2pp | +$0.063 | Matches frontier quality |
The model cascade router saves the most cost. The execution-feedback escalation preserves the most quality. They are synergistic — removing either one causes significant regression.
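For context, a sketch of how execution-feedback escalation composes with the cascade; `route`, `run_model`, and `tests_pass` are hypothetical stand-ins for the cascade router, the agent executor, and its feedback signal:

```python
def solve_with_feedback(task, route, run_model, tests_pass):
    """Try the routed cheap tier first; escalate to the frontier model only when
    execution feedback (e.g., failing tests) says the cheap attempt did not work."""
    tier = route(task)
    result = run_model(tier, task)
    if tests_pass(result):           # real execution signal, not a predicted probability
        return result, tier
    # Escalate: this task pays for both the cheap and the frontier attempt.
    return run_model("frontier", task), "frontier"
```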
Which module caused regressions?
- Aggressive v3 asymmetric router: Over-penalized underkill, causing over-escalation and 38% cost increase
- v8 synthetic-trained router: On real data, it actually increased cost by 11.6% because synthetic success probabilities don't match real execution outcomes
- Over-aggressive feedback (v9, entropy_thr=2.0): Escalates too often, paying for both the cheap and the frontier model, which yields a -53% cost reduction (i.e., 53% more expensive than always-frontier)
When should the optimizer use cheap models?
- Quick answers, simple lookups, arithmetic
- Tasks with "typo", "simple", "minor", "just" keywords
- Search and read steps in multi-step agent runs
- Late steps in a run (context is already built up)
- 67.4% of SWE-bench tasks succeed at tier 1
When should it force frontier models?
- Legal/regulated tasks (safety floor = tier 4)
- Critical production issues ("production", "urgent", "emergency" keywords)
- Edit/patch steps on security-critical code
- Verification of high-risk outputs
- When cheap model already failed (escalation)
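Read together, the two answers above amount to a single policy: default to cheap tiers unless a safety or risk rule forces a frontier floor. A minimal sketch under that reading follows; the keyword tuples mirror the lists above, while the step fields, the `step_index >= 5` cutoff for "late steps", and the `learned_route` callable are assumptions:

```python
CHEAP_HINTS = ("typo", "simple", "minor", "just")
CRITICAL_HINTS = ("production", "urgent", "emergency")

def choose_tier(task_text, step_kind, step_index, domain, prior_cheap_failure, learned_route):
    """Return a tier: 1 = cheapest, 4 = frontier (safety floor)."""
    text = task_text.lower()
    # Forced frontier: regulated domains, critical incidents, risky edits, prior cheap failure.
    if domain in ("legal", "regulated") or prior_cheap_failure:
        return 4
    if any(k in text for k in CRITICAL_HINTS):
        return 4
    if step_kind in ("edit", "patch") and domain == "security":
        return 4
    # Cheap by default: simple-task keywords, search/read steps, late steps with built-up context.
    if any(k in text for k in CHEAP_HINTS) or step_kind in ("search", "read") or step_index >= 5:
        return 1
    # Otherwise defer to the learned cascade router (previous sketch).
    return learned_route(task_text, step_kind)
```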
When should it call a verifier?
- High-risk tasks (legal, security)
- Low model confidence (P(success) < 0.70)
- Irreversible outputs
- Prior failures in the trace
- Final answer on hallucination-prone tasks
In practice, the verifier budgeter eliminated 88% of unnecessary verifications on synthetic tasks, issuing only 238 verifier calls out of 2,000 candidate verification points.
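A sketch of the verification gate these rules describe; only the 0.70 confidence floor comes from the list above, and the argument names are illustrative:

```python
def should_verify(risk_domain, p_success, irreversible, prior_failures,
                  is_final_answer, hallucination_prone):
    """Verifier gate sketch: call the verifier only when a risk trigger fires."""
    if risk_domain in ("legal", "security"):
        return True
    if p_success < 0.70:                       # low router confidence
        return True
    if irreversible:
        return True
    if prior_failures > 0:                     # earlier failures in the trace
        return True
    if is_final_answer and hallucination_prone:
        return True
    return False                               # everything else is skipped by the budgeter
```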
When should it stop a failing run?
- 3+ failed tool calls with no artifact progress
- Growing cost without new evidence
- Verifier disagreement on 2+ consecutive steps
- Approaching cost budget (>80% consumed)
- Repeated planning without action
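A stop-rule sketch over an illustrative run summary; the field names and the three-step cutoff for planning loops and stalled progress are assumptions, while the other thresholds come from the list above:

```python
from dataclasses import dataclass

@dataclass
class RunState:
    # Illustrative trace summary; not the real ACO trace schema.
    failed_tool_calls: int
    steps_since_new_artifact: int
    consecutive_verifier_disagreements: int
    cost_spent: float
    cost_budget: float
    consecutive_plan_only_steps: int

def should_stop(s: RunState) -> bool:
    """Stop a failing run when any stall or budget signal fires."""
    if s.failed_tool_calls >= 3 and s.steps_since_new_artifact >= 3:
        return True                            # repeated tool failures, no artifact progress
    if s.consecutive_verifier_disagreements >= 2:
        return True                            # verifier keeps rejecting recent steps
    if s.cost_spent >= 0.8 * s.cost_budget:
        return True                            # approaching the cost budget
    if s.consecutive_plan_only_steps >= 3:
        return True                            # planning loop with no actions
    return False
```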
How much did cache-aware prompt layout help?
Estimated 15-20% token reuse via stable prefix caching. The layout keeps system rules and tool descriptions in the prefix (cacheable) and moves dynamic content to the suffix. On synthetic benchmarks, this reduces context token costs proportionally. Real measurement requires provider-side cache metrics.
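A minimal sketch of the layout rule, assuming a plain string prompt; the helper name and argument names are illustrative:

```python
def build_prompt(system_rules, tool_descriptions, task, recent_turns):
    """Cache-aware layout sketch: keep stable, reusable text first so provider-side
    prefix caching can reuse it across steps; append per-step dynamic content last."""
    stable_prefix = "\n\n".join([system_rules, tool_descriptions])   # identical every step -> cacheable
    dynamic_suffix = "\n\n".join([task] + recent_turns)              # changes every step -> not cached
    return stable_prefix + "\n\n" + dynamic_suffix
```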
How much did meta-tool compression help?
Meta-tool mining identifies repeated workflow patterns (e.g., "search → read → edit → test") and compresses them into deterministic macros. Estimated 2-5 LLM calls saved per repeated workflow. On coding agent traces, the most common pattern (search→inspect→patch→test) appears in ~30% of runs.
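A sketch of the mining step, assuming traces are recorded as sequences of tool names; the pattern length and support threshold are illustrative (the 4-step example and the ~30% figure above suggest the ballpark):

```python
from collections import Counter

def mine_meta_tools(traces, length=4, min_support=0.3):
    """Find tool-call n-grams (e.g. search -> read -> edit -> test) that recur in
    enough runs to be worth compiling into a deterministic macro.
    `traces` is a list of tool-name sequences."""
    support = Counter()
    for trace in traces:
        seen = {tuple(trace[i:i + length]) for i in range(len(trace) - length + 1)}
        for pattern in seen:                   # count each pattern at most once per run
            support[pattern] += 1
    threshold = min_support * len(traces)
    return [p for p, count in support.items() if count >= threshold]
```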
What remains too risky to optimize?
- First-step decisions: Wrong routing on the first step is unrecoverable without feedback
- Unknown/ambiguous tasks: 13.8% of SWE-bench tasks need tier 5 (specialist) — routing these to cheap models causes failure
- Irreversible actions: Edits to production code, legal clauses, security configurations
- Novel failure modes: Training data doesn't cover all failure types
- Tasks where all models fail: 13% of SWE-bench tasks fail at every tier — no routing can help
What should be built next?
- Execution-feedback with real model outputs (use actual logprobs, not simulated)
- Conformal calibration of escalation thresholds for distribution-free quality guarantees
- Best-of-N cheap sampling (generate 2-3 cheap responses, pick best via reward model; see the sketch after this list)
- Per-step routing integrated with v11 XGBoost (route each step, not just the task)
- Fine-tuned BERT router (job in progress — replaces keyword features with learned representations)
- Real agent benchmark suite (SWE-bench + BFCL + WebArena)
- Cost-quality Pareto frontier visualization
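A sketch of the proposed best-of-N mechanism (not implemented in ACO yet); `run_cheap` and `reward_model` are hypothetical callables:

```python
def best_of_n_cheap(task, n, run_cheap, reward_model):
    """Draw a few cheap samples and keep the one the reward model scores highest.
    The caller can still escalate to a frontier model if the best score is poor."""
    candidates = [run_cheap(task) for _ in range(n)]          # n is 2-3 in the proposal
    scored = [(reward_model(task, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score
```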
Key Numbers
- v11 SWE-bench: 36.9% cost reduction, 74.8% success (with feedback)
- v11 SWE-bench: 62.5% cost reduction, 67.4% success (cascade only)
- v9 synthetic: 2.1% cost reduction, 90.0% success (matches frontier)
- Oracle on SWE-bench: 80.3% cost reduction, 87.0% success
- BFCL v3: 84.1% of function-calling tasks solvable cheaper
- Headroom: Oracle shows 80.3% is achievable; we're at 36.9% — significant room to improve
Cost-Adjusted Score Formula
```
cost_adjusted_score =
    task_success_score * 20
    + safety_bonus * 5
    - model_cost_penalty * 30
    - tool_cost_penalty * 10
    - latency_penalty * 2
    - retry_penalty * 5
    - unnecessary_verifier_penalty * 3
    - false_done_penalty * 50
    - unsafe_cheap_model_penalty * 100
    - missed_escalation_penalty * 50
```
Critical failures dominate: a single unsafe cheap-model failure (weight 100) outweighs more than 3.3 units of model-cost savings (weight 30 per unit).
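Written out as a plain function; the weights are the ones listed above, and the dict keys are illustrative:

```python
def cost_adjusted_score(m):
    """Compute the cost-adjusted score from a dict of per-run metric values `m`."""
    return (
        m["task_success_score"] * 20
        + m["safety_bonus"] * 5
        - m["model_cost_penalty"] * 30
        - m["tool_cost_penalty"] * 10
        - m["latency_penalty"] * 2
        - m["retry_penalty"] * 5
        - m["unnecessary_verifier_penalty"] * 3
        - m["false_done_penalty"] * 50
        - m["unsafe_cheap_model_penalty"] * 100
        - m["missed_escalation_penalty"] * 50
    )
```

For example, saving three units of model cost (3 x 30 = 90) is still outweighed by a single unsafe cheap-model failure (100), which is the dominance property the weights enforce.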