	# Trained Router Final Report

	## The Honest Answer

	After 7 iterations of router training (v1-v7), here is the complete picture:

	### What Works

| Router | Success | AvgCost | CostRed | Unsafe |
|--------|---------|---------|---------|--------|
| always_frontier | 89.3% | 1.0000 | 0% | 2.3% |
| v4_prod_t0.65 | 91.9% | 1.3650 | -36.5% | 1.5% |
| heuristic_diff+1 | 84.1% | 0.9272 | 7.3% | 4.7% |
| hybrid_v6_s0.40_d0.75 | 81.8% | 0.8222 | 17.8% | 5.9% |
| v7_s0.25_d0.85 | 83.8% | 0.9084 | 9.2% | 4.8% |
| oracle | 99.8% | 0.4769 | 52.3% | 0.0% |
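The CostRed column appears to be derived from AvgCost relative to always_frontier (AvgCost = 1.0). A minimal sketch that reproduces the column under that assumption; the function name `cost_reduction_pct` is illustrative and not part of the project:

```python
# AvgCost values copied from the table above (always_frontier is the 1.0 baseline).
avg_cost = {
    "always_frontier": 1.0000,
    "v4_prod_t0.65": 1.3650,
    "heuristic_diff+1": 0.9272,
    "hybrid_v6_s0.40_d0.75": 0.8222,
    "v7_s0.25_d0.85": 0.9084,
    "oracle": 0.4769,
}


def cost_reduction_pct(cost: float, baseline: float = 1.0) -> float:
    """Percent saved relative to the baseline; negative means more expensive."""
    return round((1.0 - cost / baseline) * 100.0, 1)


# Assumption: CostRed = (1 - AvgCost / baseline) * 100, rounded to one decimal.
cost_red = {name: cost_reduction_pct(cost) for name, cost in avg_cost.items()}

for name, pct in cost_red.items():
    print(f"{name:24s} {pct:+.1f}%")
```

A negative value, as for v4_prod_t0.65, means the router spends more on average than always routing to the frontier model.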

	### What The Data Shows

1. The heuristic (difficulty+1) is already a strong baseline: 84.1% success at 7.3% cost reduction. The ML classifiers cannot consistently beat it because difficulty, which the heuristic uses directly, is the dominant predictive feature (12.2% importance).

	2. The ML safety net adds value in two ways:
- Escalation path: v4 at t=0.65 achieves 91.9% success, 2.6pp above always_frontier, by escalating when P(success) is low. This prevents unsafe cheap-model failures (unsafe rate 1.5% vs 2.3%), but at 36.5% higher average cost than always_frontier.
	- Cost saving path: v7 at d=0.85 achieves 9.2% cost reduction with only 0.3pp success loss vs heuristic.

3. The oracle shows that up to 52.3% cost reduction is possible in principle. The gap between the current routers and the oracle leaves substantial room for improvement, but closing it requires better features than text keywords and task type.
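One way to quantify the gap to the oracle is the share of the oracle's cost reduction that each router captures. This is a derived illustration, not a metric reported in the evaluation, and the helper name `share_of_oracle` is hypothetical:

```python
# CostRed values (percent) copied from the results table.
ORACLE_COST_RED = 52.3
router_cost_red = {
    "heuristic_diff+1": 7.3,
    "v7_s0.25_d0.85": 9.2,
    "hybrid_v6_s0.40_d0.75": 17.8,
}


def share_of_oracle(cost_red_pct: float, oracle_pct: float = ORACLE_COST_RED) -> float:
    """Percent of the oracle's cost reduction that a router achieves."""
    return round(100.0 * cost_red_pct / oracle_pct, 1)


shares = {name: share_of_oracle(pct) for name, pct in router_cost_red.items()}

for name, share in shares.items():
    print(f"{name:24s} captures {share:.1f}% of the oracle's savings")
```

This ratio looks at cost only. hybrid_v6 captures the largest share here but also has the lowest success rate in the table (81.8%).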

	### Why Pure ML Doesn't Beat The Heuristic

	The per-tier P(success) classifiers have:
	- Tier 1: f1=0.48 (poor — success is only 22% of traces)
	- Tier 2: f1=0.56 (mediocre — success is 40% of traces)
	- Tier 3-5: f1=0.63-0.74 (decent — success is 70-95% of traces)

The classifiers struggle at low tiers because successes there are the minority class (22% of traces at tier 1, 40% at tier 2) and the available features carry little signal about when a cheap model will succeed.
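Because the success rate differs so much between tiers, raw F1 values are hard to compare across them. A rough reference point is the F1 of a trivial classifier that always predicts success. The sketch below assumes the reported F1 is computed on the success class, which the report does not state; if the figures are macro-averaged or computed differently, this comparison does not apply.

```python
def f1_always_positive(prevalence: float) -> float:
    """F1 of a classifier that predicts 'success' for every trace.

    Precision equals the success prevalence and recall is 1.0.
    """
    precision, recall = prevalence, 1.0
    return 2 * precision * recall / (precision + recall)


# Success prevalences quoted in the list above; the 0.70 and 0.95 entries
# are the two ends of the range given for tiers 3-5.
prevalences = {
    "tier 1": 0.22,
    "tier 2": 0.40,
    "tier 3-5 (low end)": 0.70,
    "tier 3-5 (high end)": 0.95,
}

baselines = {name: f1_always_positive(p) for name, p in prevalences.items()}

for name, f1 in baselines.items():
    print(f"{name:20s} always-positive F1 = {f1:.3f}")
```

The trivial baseline rises with prevalence, so each tier's F1 is more informative when read against its own baseline than against the other tiers.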

	### Recommendation: Use Hybrid v7_s0.25_d0.85

	This configuration:
	- Starts with the heuristic (difficulty + 1)
	- Escalates if P(success) < 0.25 (ML safety net)
	- Downgrades if P(success@tier-1) >= 0.85 (ML cost saver)
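The three rules above can be sketched as a single routing function. This is an illustration under several assumptions, not the project's implementation: tiers are numbered 1 to 5, `tier-1` is read as "one tier below the heuristic's choice", each rule moves by exactly one tier, and escalation is checked before downgrade. The names `route` and `p_success` are hypothetical.

```python
from typing import Callable

MIN_TIER, MAX_TIER = 1, 5  # assumption: five tiers, 1 = cheapest


def route(
    difficulty: int,
    p_success: Callable[[int], float],
    s: float = 0.25,
    d: float = 0.85,
) -> int:
    """Pick a tier using the heuristic plus the two ML adjustments.

    p_success(tier) is a hypothetical per-tier classifier returning the
    predicted probability that the given tier succeeds on this task.
    """
    # Rule 1: start from the heuristic (difficulty + 1), clamped to valid tiers.
    tier = max(MIN_TIER, min(MAX_TIER, difficulty + 1))

    # Rule 2 (safety net): escalate one tier if predicted success is too low.
    if tier < MAX_TIER and p_success(tier) < s:
        return tier + 1

    # Rule 3 (cost saver): downgrade one tier if the cheaper tier looks safe.
    if tier > MIN_TIER and p_success(tier - 1) >= d:
        return tier - 1

    return tier


# Toy probabilities standing in for the trained classifiers.
toy_probs = {1: 0.10, 2: 0.20, 3: 0.60, 4: 0.90, 5: 0.95}

print(route(1, toy_probs.get))  # heuristic tier 2, P=0.20 < 0.25, so escalate
print(route(4, toy_probs.get))  # heuristic tier 5, P(tier 4)=0.90 >= 0.85, so downgrade
```

Checking escalation first means a task is never downgraded when its heuristic tier already looks unsafe; a different ordering would trade safety for cost.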

Results: 83.8% success and 9.2% cost reduction. Compared with the heuristic alone, that is 1.9pp more cost reduction (9.2% vs 7.3%) for a 0.3pp loss in success rate.

	### What Would Make The Router Significantly Better

	1. Execution feedback features: Instead of predicting from text alone, use the first model call's output as a feature for subsequent routing. This is what BAAR (2026) does — profile with small model, then decide.

2. Confidence from generation: Get the model's own confidence (logprobs, entropy) as a routing signal. High entropy suggests the task needs a stronger model.

	3. Retrieval-based features: Use retrieved similar-task traces as features. "Last time someone asked this, tier 3 failed, tier 4 succeeded."

	4. Multi-step routing: Route per-step, not per-task. A task may start easy but get harder mid-execution.

	5. Real agent traces: 50K synthetic traces don't capture real model behavior. Train on actual execution data from SWE-bench, BFCL, or production logs.
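As an illustration of item 2, the sketch below turns per-token top-k logprobs into a mean-entropy signal and compares it with a threshold. The input format (one list of top-k logprobs per generated token), the renormalisation over the top-k candidates, and the threshold of 1.0 nats are all assumptions for this example; none of them comes from the evaluation above.

```python
import math
from typing import Sequence


def token_entropy(top_logprobs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one token's top-k distribution.

    The top-k probabilities are renormalised to sum to 1, so this is an
    approximation of the entropy over the full vocabulary.
    """
    probs = [math.exp(lp) for lp in top_logprobs]
    total = sum(probs)
    probs = [p / total for p in probs]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(per_token_logprobs: Sequence[Sequence[float]]) -> float:
    """Average token entropy over a generated sequence."""
    if not per_token_logprobs:
        raise ValueError("need at least one token")
    entropies = [token_entropy(lps) for lps in per_token_logprobs]
    return sum(entropies) / len(entropies)


def should_escalate(
    per_token_logprobs: Sequence[Sequence[float]],
    threshold: float = 1.0,  # hypothetical cut-off; would need tuning on real traces
) -> bool:
    """Flag a generation as uncertain when its mean entropy exceeds the threshold."""
    return mean_entropy(per_token_logprobs) > threshold


# A confident token (one candidate dominates) vs. an uncertain one (uniform).
confident = [[math.log(0.97), math.log(0.01), math.log(0.01), math.log(0.01)]]
uncertain = [[math.log(0.25)] * 4]

print(should_escalate(confident), should_escalate(uncertain))
```

The same signal could feed the P(success) classifiers as a feature instead of acting as a hard threshold.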
