Trained Router Final Report

The Honest Answer

After seven iterations of router training (v1 through v7), this is the complete picture:

What Works

Router	Success rate	Avg cost (frontier = 1.0)	Cost reduction	Unsafe rate
always_frontier	89.3%	1.0000	0%	2.3%
v4_prod_t0.65	91.9%	1.3650	-36.5%	1.5%
heuristic_diff+1	84.1%	0.9272	7.3%	4.7%
hybrid_v6_s0.40_d0.75	81.8%	0.8222	17.8%	5.9%
v7_s0.25_d0.85	83.8%	0.9084	9.2%	4.8%
oracle	99.8%	0.4769	52.3%	0.0%
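
The cost-reduction column is consistent with one minus the average cost, where always_frontier is normalized to 1.0. The report does not state this formula, so the sketch below is an assumption that happens to reproduce every row; the helper name cost_reduction_pct is illustrative.

```python
# Average-cost values copied from the table above (always_frontier = 1.0).
avg_cost = {
    "always_frontier": 1.0000,
    "v4_prod_t0.65": 1.3650,
    "heuristic_diff+1": 0.9272,
    "hybrid_v6_s0.40_d0.75": 0.8222,
    "v7_s0.25_d0.85": 0.9084,
    "oracle": 0.4769,
}


def cost_reduction_pct(cost: float, baseline: float = 1.0) -> float:
    """Percent saved relative to the baseline; negative means more expensive."""
    return round((1.0 - cost / baseline) * 100.0, 1)


cost_red = {name: cost_reduction_pct(cost) for name, cost in avg_cost.items()}
```

Under this reading, the negative value for v4_prod_t0.65 means it costs more than always routing to the frontier model, not less.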

What The Data Shows

The heuristic (difficulty + 1) is already a strong baseline: 84.1% success at 7.3% cost reduction. The ML classifiers cannot consistently beat it because difficulty is the dominant predictive feature (12.2% importance), and the heuristic already uses it directly.
The ML safety net adds value in two ways:
- Escalation path: v4 at t=0.65 achieves 91.9% success, 2.6pp above always_frontier, by escalating when P(success) is low. This lowers the unsafe rate from 2.3% to 1.5%, but at 36.5% higher average cost than always_frontier.
- Cost-saving path: v7 at d=0.85 achieves a 9.2% cost reduction (vs. 7.3% for the heuristic) for a 0.3pp loss in success (83.8% vs. 84.1%).
The oracle shows that a 52.3% cost reduction is achievable at 99.8% success. The gap between the current routers and the oracle leaves substantial room for improvement, but closing it requires better features than text keywords and task type alone.

Why Pure ML Doesn't Beat The Heuristic

The per-tier P(success) classifiers have:

Tier 1: f1=0.48 (poor — success is only 22% of traces)
Tier 2: f1=0.56 (mediocre — success is 40% of traces)
Tier 3-5: f1=0.63-0.74 (decent — success is 70-95% of traces)

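The classifiers struggle at the low tiers because success there is inherently rare (the cheap models are weak): only 22% of tier 1 traces and 40% of tier 2 traces succeed. With so few positive examples, and only text-derived features to learn from, they cannot reliably identify the cases where a cheap model will succeed.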
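
One way to put the F1 scores above in context is to compare them with a trivial classifier that predicts success for every trace. The sketch below assumes the reported F1 is measured on the success class, which the report does not state; if the scores are macro-averaged or measured on the failure class, the comparison does not apply.

```python
def always_positive_f1(prevalence: float) -> float:
    """F1 of a classifier that predicts 'success' for every trace.

    Recall is 1 and precision equals the prevalence p, so F1 = 2p / (1 + p).
    """
    return 2.0 * prevalence / (1.0 + prevalence)


# Success rates quoted in the per-tier list above.
baselines = {p: round(always_positive_f1(p), 2) for p in (0.22, 0.40, 0.70, 0.95)}
```

Under that assumption, a per-tier classifier adds information only to the extent that its F1 exceeds the baseline at the same success rate.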

Recommendation: Use Hybrid v7_s0.25_d0.85

This configuration:

Starts with the heuristic tier (difficulty + 1)
Escalates if P(success) at that tier is below 0.25 (ML safety net)
Downgrades if P(success) one tier lower is at least 0.85 (ML cost saver)
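
The three rules above can be sketched as a single routing function. This is a minimal illustration, not the trained router: it assumes tiers are integers from 1 (cheapest) to 5, that escalation and downgrade each move exactly one tier, and that escalation is checked before downgrade. The p_success callable stands in for the per-tier classifiers, and the names route, escalate_below and downgrade_at are hypothetical.

```python
from typing import Callable

MIN_TIER, MAX_TIER = 1, 5  # assumption: tier 1 is cheapest, tier 5 is frontier


def route(
    difficulty: int,
    p_success: Callable[[int], float],
    escalate_below: float = 0.25,
    downgrade_at: float = 0.85,
) -> int:
    """Pick a tier: heuristic start, then at most one ML adjustment."""
    # Heuristic starting point: difficulty + 1, clamped to the valid range.
    tier = min(max(difficulty + 1, MIN_TIER), MAX_TIER)

    # Safety net: move up one tier when predicted success is too low.
    if tier < MAX_TIER and p_success(tier) < escalate_below:
        return tier + 1

    # Cost saver: move down one tier when the cheaper tier looks safe.
    if tier > MIN_TIER and p_success(tier - 1) >= downgrade_at:
        return tier - 1

    return tier
```

For example, route(2, lambda tier: 0.5) stays at the heuristic tier 3, because 0.5 is neither below the escalation threshold nor at the downgrade threshold.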

Results: 83.8% success and 9.2% cost reduction. Compared with the heuristic alone, that is 1.9pp more cost reduction (9.2% vs. 7.3%) for a 0.3pp loss in success (83.8% vs. 84.1%).

What Would Make The Router Significantly Better

Execution feedback features: Instead of predicting from text alone, use the first model call's output as a feature for subsequent routing. This is the approach of BAAR (2026): profile with a small model, then decide.
Confidence from generation: Use the model's own confidence (logprobs, entropy) as a routing signal. High entropy suggests that a stronger model is needed.
Retrieval-based features: Use retrieved similar-task traces as features. "Last time someone asked this, tier 3 failed, tier 4 succeeded."
Multi-step routing: Route per-step, not per-task. A task may start easy but get harder mid-execution.
Real agent traces: 50K synthetic traces don't capture real model behavior. Train on actual execution data from SWE-bench, BFCL, or production logs.
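
As an illustration of the "confidence from generation" idea above, the sketch below turns per-token top-k log-probabilities into a mean-entropy score and compares it with a threshold. It assumes the serving API returns top-k logprobs for each generated token. The function names and the threshold of 1.0 nat are hypothetical and would need tuning on real traces.

```python
import math
from typing import Sequence


def token_entropy(top_logprobs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one token's renormalized top-k distribution."""
    probs = [math.exp(lp) for lp in top_logprobs]
    total = sum(probs)
    if total <= 0.0:
        return 0.0
    return -sum((p / total) * math.log(p / total) for p in probs if p > 0.0)


def mean_entropy(per_token: Sequence[Sequence[float]]) -> float:
    """Average token entropy over a generation; 0.0 for an empty generation."""
    if not per_token:
        return 0.0
    return sum(token_entropy(t) for t in per_token) / len(per_token)


def needs_stronger_model(
    per_token: Sequence[Sequence[float]], threshold: float = 1.0
) -> bool:
    """Flag a generation for escalation when its mean entropy exceeds the threshold."""
    return mean_entropy(per_token) > threshold
```

Because only the top-k candidates are visible, this underestimates the true entropy of the full vocabulary distribution, so the threshold should be calibrated against the same k used in production.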