# Trained Router Final Report

## The Honest Answer

After 7 iterations of router training (v1-v7), here is the complete picture:

### What Works

| Router | Success | AvgCost | CostRed | Unsafe |
|--------|---------|---------|---------|--------|
| **always_frontier** | 89.3% | 1.0000 | 0% | 2.3% |
| **v4_prod_t0.65** | **91.9%** | 1.3650 | -36.5% | **1.5%** |
| **heuristic_diff+1** | 84.1% | 0.9272 | 7.3% | 4.7% |
| **hybrid_v6_s0.40_d0.75** | 81.8% | 0.8222 | 17.8% | 5.9% |
| **v7_s0.25_d0.85** | 83.8% | 0.9084 | 9.2% | 4.8% |
| **oracle** | 99.8% | 0.4769 | 52.3% | 0.0% |
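
The report does not define the CostRed column, but its values are consistent with a reduction measured against always_frontier's average cost of 1.0000. The sketch below recomputes the column under that assumption; the function name `cost_reduction_pct` is made up for illustration.

```python
# AvgCost values copied from the table above.
AVG_COST = {
    "always_frontier": 1.0000,
    "v4_prod_t0.65": 1.3650,
    "heuristic_diff+1": 0.9272,
    "hybrid_v6_s0.40_d0.75": 0.8222,
    "v7_s0.25_d0.85": 0.9084,
    "oracle": 0.4769,
}


def cost_reduction_pct(avg_cost: float, baseline: float = 1.0) -> float:
    """Percent cost saved relative to the baseline (assumed: always_frontier).

    A negative result means the router costs more than the baseline.
    """
    return round((1.0 - avg_cost / baseline) * 100.0, 1)


if __name__ == "__main__":
    for name, cost in AVG_COST.items():
        print(f"{name:>24}: {cost_reduction_pct(cost):+.1f}%")
```

Under this reading, a negative CostRed such as v4_prod_t0.65's means the router spends more than always_frontier on average.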

### What The Data Shows

1. **The heuristic (difficulty+1) is already a strong baseline**: 84.1% success at 7.3% cost reduction. The ML classifiers cannot consistently beat it because difficulty is the dominant predictive feature (12.2% importance).

2. **The ML safety net adds value in two ways:**
   - **Escalation path**: v4 at t=0.65 achieves 91.9% success, 2.6pp above always_frontier, by escalating when P(success) is low. This prevents unsafe cheap-model failures (1.5% unsafe vs 2.3%), but average cost rises to 1.3650, which is 36.5% above always_frontier.
   - **Cost saving path**: v7 at d=0.85 achieves 9.2% cost reduction with only 0.3pp success loss vs the heuristic.

3. **The oracle shows 52.3% cost reduction is achievable**: the best trained router reaches 17.8% (hybrid_v6_s0.40_d0.75), so there is substantial room for improvement, but closing the gap requires better features than text keywords and task type alone.

### Why Pure ML Doesn't Beat The Heuristic

The per-tier P(success) classifiers have:
- Tier 1: f1=0.48 (poor; success is only 22% of traces)
- Tier 2: f1=0.56 (mediocre; success is 40% of traces)
- Tier 3-5: f1=0.63-0.74 (decent; success is 70-95% of traces)

The classifiers struggle at low tiers because success at tiers 1-2 is inherently rare (the cheap models are weak). That leaves a weak signal, so they cannot reliably predict when a cheap model will succeed.
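
One way to put these F1 scores in context is to compare them with a trivial classifier that predicts "success" for every trace. The sketch below computes that reference point from the success rates quoted above. It assumes the reported f1 values are positive-class (success) F1, which the report does not state; if they are macro or weighted averages, the comparison does not apply.

```python
def always_positive_f1(prevalence: float) -> float:
    """F1 of a classifier that labels every trace as a success.

    Precision equals the success rate (prevalence) and recall is 1.0.
    """
    if not 0.0 < prevalence <= 1.0:
        raise ValueError("prevalence must be in (0, 1]")
    precision = prevalence
    recall = 1.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Success rates taken from the list above (tier 1: 22%, tier 2: 40%).
    for tier, prevalence in [(1, 0.22), (2, 0.40)]:
        print(f"tier {tier}: always-positive F1 = {always_positive_f1(prevalence):.2f}")
```

Under that assumption the reference values are about 0.36 for tier 1 and 0.57 for tier 2, so a reported F1 is only informative to the extent that it clears the figure for its tier.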

### Recommendation: Use Hybrid v7_s0.25_d0.85

This configuration:
- Starts with the heuristic (difficulty + 1)
- Escalates if P(success) < 0.25 (ML safety net)
- Downgrades if P(success@tier-1) >= 0.85 (ML cost saver)

Results: **83.8% success, 9.2% cost reduction**. Compared with the heuristic alone (84.1% success, 7.3% cost reduction), this saves a further 1.9pp of cost for a 0.3pp loss in success.
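
The three rules above can be sketched as a small routing function. This is an illustration, not the report's implementation. It assumes integer tiers 1-5, reads "P(success@tier-1)" as the predicted success probability one tier below the heuristic's choice, moves by a single tier, and checks escalation before downgrade. The names `route_hybrid` and `p_success` are hypothetical.

```python
from typing import Callable

MIN_TIER, MAX_TIER = 1, 5  # assumed tier range; the report mentions tiers 1-5


def clamp(tier: int) -> int:
    """Keep a tier inside the assumed valid range."""
    return max(MIN_TIER, min(MAX_TIER, tier))


def route_hybrid(
    difficulty: int,
    p_success: Callable[[int], float],  # hypothetical: tier -> predicted P(success)
    escalate_below: float = 0.25,       # the "s0.25" threshold
    downgrade_at: float = 0.85,         # the "d0.85" threshold
) -> int:
    """Heuristic tier (difficulty + 1), adjusted by the per-tier classifiers."""
    tier = clamp(difficulty + 1)

    # Safety net: escalate one tier when success looks unlikely here.
    if p_success(tier) < escalate_below:
        return clamp(tier + 1)

    # Cost saver: drop one tier when the cheaper tier is very likely to succeed.
    if tier > MIN_TIER and p_success(tier - 1) >= downgrade_at:
        return tier - 1

    return tier


if __name__ == "__main__":
    # Toy probabilities for one task, invented for illustration only.
    toy = {1: 0.10, 2: 0.90, 3: 0.95, 4: 0.97, 5: 0.99}
    print(route_hybrid(2, lambda t: toy[t]))  # heuristic picks 3, tier 2 looks safe
```

Checking escalation first means the safety net takes priority over cost saving when both thresholds could apply; the report does not say which order the actual router uses.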

### What Would Make The Router Significantly Better

1. **Execution feedback features**: Instead of predicting from text alone, use the first model call's output as a feature for subsequent routing. This is what BAAR (2026) does: profile with a small model, then decide.

2. **Confidence from generation**: Use the model's own confidence (logprobs, entropy) as a routing signal. High entropy suggests a stronger model is needed.

3. **Retrieval-based features**: Use retrieved similar-task traces as features. "Last time someone asked this, tier 3 failed, tier 4 succeeded."

4. **Multi-step routing**: Route per-step, not per-task. A task may start easy but get harder mid-execution.

5. **Real agent traces**: 50K synthetic traces don't capture real model behavior. Train on actual execution data from SWE-bench, BFCL, or production logs.
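
As a rough illustration of item 2, the sketch below turns per-token log-probabilities into a mean entropy and compares it with a threshold. It assumes each generated token comes with log-probabilities for a handful of top candidates, which are renormalized here (an approximation, since the top candidates rarely cover the full vocabulary). The function names and the 1.0-nat threshold are hypothetical and would need tuning on real traces.

```python
import math
from typing import Sequence


def entropy_from_logprobs(logprobs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one token's renormalized top-k distribution."""
    probs = [math.exp(lp) for lp in logprobs]
    total = sum(probs)
    probs = [p / total for p in probs]  # renormalize the truncated distribution
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def needs_stronger_model(
    per_token_logprobs: Sequence[Sequence[float]],
    threshold: float = 1.0,  # hypothetical cut-off in nats
) -> bool:
    """Flag a generation whose mean per-token entropy exceeds the threshold."""
    if not per_token_logprobs:
        return False  # nothing generated, so there is no signal either way
    entropies = [entropy_from_logprobs(lps) for lps in per_token_logprobs]
    return sum(entropies) / len(entropies) > threshold


if __name__ == "__main__":
    confident = [[math.log(0.97), math.log(0.01), math.log(0.01), math.log(0.01)]]
    uncertain = [[math.log(0.25)] * 4]
    print(needs_stronger_model(confident), needs_stronger_model(uncertain))
```

A signal like this could be fed to the router as an additional feature or used directly as an escalation trigger after the first cheap-model call.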