Upload docs/trained_router_report.md with huggingface_hub
docs/trained_router_report.md
ADDED
@@ -0,0 +1,69 @@
# Trained Router Report

## Architecture

The trained router uses a **difficulty-first + ML confirmation + safety floor** architecture:

1. Map `task_type` → difficulty (1-5).
2. Compute `base_tier = min(difficulty + 1, 5)`.
3. Apply the safety floor for the `task_type` (e.g., `legal_regulated` → tier 4), so the tier is never below that floor.
4. Use the per-tier XGBoost P(success) classifiers to confirm the tier or escalate.
5. If P(success@base_tier) < threshold, escalate one tier at a time.

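
The five steps above can be sketched as a single function. This is a minimal illustration, not the shipped router: the lookup tables are hypothetical (only `legal_regulated` → tier 4 and the `quick_answer` task type come from this report), `p_success` stands in for the per-tier classifiers, and the stopping rule (first confident tier, or tier 5) is an assumption.

```python
from typing import Callable, Dict

# Hypothetical lookup tables. Only the `legal_regulated` floor of 4 is taken
# from the report; every other name and value here is illustrative.
DIFFICULTY: Dict[str, int] = {"quick_answer": 1, "code_review": 3, "legal_regulated": 2}
SAFETY_FLOOR: Dict[str, int] = {"legal_regulated": 4}
MAX_TIER = 5


def route(task_type: str,
          p_success: Callable[[int], float],
          threshold: float = 0.65) -> int:
    """Return the tier for one request.

    `p_success(tier)` is a stand-in for the tier's P(success) classifier.
    """
    difficulty = DIFFICULTY[task_type]                    # step 1
    tier = min(difficulty + 1, MAX_TIER)                  # step 2
    tier = max(tier, SAFETY_FLOOR.get(task_type, 1))      # step 3
    # Steps 4-5: confirm the tier, or escalate one tier at a time.
    # Assumption: stop at the first confident tier or at MAX_TIER.
    while tier < MAX_TIER and p_success(tier) < threshold:
        tier += 1
    return tier
```

With these illustrative tables, a confident `quick_answer` request stays at tier 2, while `legal_regulated` starts at tier 4 because the floor overrides the base tier of 3.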
### Per-Tier XGBoost Classifiers

Five binary classifiers, one per tier, each predicting P(task succeeds | query, tier=X).

Trained on 50,000 synthetic traces with ground-truth per-tier success labels.

Features: 23 (request text signals + task type one-hot + difficulty).

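
One way to picture the per-tier setup is to split each labeled trace into five binary examples, one per tier. The sketch below assumes a hypothetical trace layout (a feature list plus a `tier -> succeeded` mapping); the actual trace schema is not described in this report.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical trace layout: (feature vector, {tier: succeeded?}).
Trace = Tuple[Sequence[float], Dict[int, bool]]
Dataset = Tuple[List[List[float]], List[int]]


def per_tier_datasets(traces: Sequence[Trace],
                      tiers: Sequence[int] = (1, 2, 3, 4, 5)) -> Dict[int, Dataset]:
    """Build one binary (X, y) dataset per tier from labeled traces."""
    datasets: Dict[int, Dataset] = {tier: ([], []) for tier in tiers}
    for features, success_by_tier in traces:
        for tier in tiers:
            X, y = datasets[tier]
            X.append(list(features))              # same features for every tier
            y.append(int(success_by_tier[tier]))  # label: did this tier succeed?
    return datasets
```

Each `(X, y)` pair could then be fitted with a binary classifier such as `xgboost.XGBClassifier`; the hyperparameters used for the shipped models are not stated here.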
## Results (N=2,000 eval traces, seed=999)

| Router | Success | AvgCost | CostRed vs Frontier | Unsafe | F-DONE |
|--------|---------|---------|---------------------|--------|--------|
| oracle | 99.8% | 0.4862 | 51.4% | 0.0% | 0.3% |
| prod_t0.65 | 91.9% | 1.365 | -36.5% | 1.5% | 6.6% |
| prod_t0.60 | 90.7% | 1.316 | -31.6% | 1.8% | 7.4% |
| always_frontier | 88.8% | 1.000 | 0.0% | 2.5% | 8.7% |
| prod_t0.55 | 85.5% | 1.107 | -10.7% | 4.1% | 10.4% |
| heuristic_diff+1 | 83.4% | 0.940 | 6.0% | 4.9% | 11.7% |
| heuristic_floor | 59.7% | 0.501 | 49.9% | 27.8% | 12.6% |
| always_cheap | 20.9% | 0.050 | 95.0% | 79.0% | 0.0% |

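
The CostRed column appears consistent with `(1 - AvgCost / AvgCost_frontier) * 100`, where `always_frontier` has an average cost of 1.000. The short check below recomputes that column and the success gap from the table values; the function names are illustrative, and a negative reduction means the router costs more than always-frontier.

```python
# (success %, average cost) per router, copied from the results table.
RESULTS = {
    "oracle": (99.8, 0.4862),
    "prod_t0.65": (91.9, 1.365),
    "prod_t0.60": (90.7, 1.316),
    "always_frontier": (88.8, 1.000),
    "prod_t0.55": (85.5, 1.107),
    "heuristic_diff+1": (83.4, 0.940),
    "heuristic_floor": (59.7, 0.501),
    "always_cheap": (20.9, 0.050),
}


def cost_reduction_pct(avg_cost: float, frontier_cost: float = 1.0) -> float:
    """Cost reduction vs. the frontier baseline, in percent."""
    return round((1.0 - avg_cost / frontier_cost) * 100.0, 1)


def success_delta_pp(router: str, baseline: str = "always_frontier") -> float:
    """Success-rate gap to the baseline, in percentage points."""
    return round(RESULTS[router][0] - RESULTS[baseline][0], 1)


if __name__ == "__main__":
    for name, (success, avg_cost) in RESULTS.items():
        print(f"{name:18s} success={success:5.1f}%  "
              f"cost_red={cost_reduction_pct(avg_cost):+6.1f}%")
```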
## Key Findings

1. **The trained router at t=0.65 achieves 91.9% success, 3.1 pp higher than always-frontier (88.8%).**
2. The unsafe rate drops from 2.5% (always-frontier) to 1.5% (trained router at t=0.65).
3. Average cost is 36.5% higher than always-frontier because the ML classifiers are conservative and escalate more often.
4. The oracle shows that a 51.4% cost reduction is achievable with perfect routing.

## The Cost Problem

The trained router over-escalates because:

- The per-tier P(success) classifiers for tiers 1-2 perform poorly (F1 < 0.5).
- They underpredict success at low tiers, causing unnecessary escalation.
- This is a training data problem: success at low tiers is inherently rare (22%, 40%).

## Solutions (Ordered by Expected Impact)

1. **Calibrate classifier probabilities** (Platt scaling or isotonic regression on held-out data)
2. **Add more training data** for easy tasks (oversample `quick_answer` successes)
3. **Use difficulty as a direct feature**: it is already top-3 in feature importance
4. **Fine-tune the escalation threshold per task type** (lower for `quick_answer`, higher for legal)
5. **Retrain with asymmetric sample weights** (5x penalty for underkill examples)

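
As an illustration of the first item, the sketch below fits an isotonic (non-decreasing) step function from raw classifier scores to observed success rates using the pool-adjacent-violators algorithm. It is a dependency-free sketch under assumptions: the function names are hypothetical, the held-out scores and labels are supplied by the caller, and nothing here reflects how the shipped models are or will be calibrated.

```python
from bisect import bisect_right
from typing import Dict, List, Sequence, Tuple


def fit_isotonic(scores: Sequence[float],
                 labels: Sequence[int]) -> Tuple[List[float], List[float]]:
    """Fit a non-decreasing step function on held-out (score, label) pairs.

    Returns the lowest score of each block and the block's calibrated value.
    """
    # Pool identical scores first so ties share a single value.
    grouped: Dict[float, List[float]] = {}
    for score, label in zip(scores, labels):
        total = grouped.setdefault(score, [0.0, 0.0])
        total[0] += label
        total[1] += 1.0

    blocks: List[List[float]] = []  # each block: [lowest_score, label_sum, count]
    for score in sorted(grouped):
        label_sum, count = grouped[score]
        blocks.append([score, label_sum, count])
        # Merge backwards while adjacent block means violate monotonicity.
        while len(blocks) > 1 and (
            blocks[-2][1] / blocks[-2][2] > blocks[-1][1] / blocks[-1][2]
        ):
            _, merged_sum, merged_count = blocks.pop()
            blocks[-1][1] += merged_sum
            blocks[-1][2] += merged_count

    starts = [block[0] for block in blocks]
    values = [block[1] / block[2] for block in blocks]
    return starts, values


def calibrate(score: float, starts: List[float], values: List[float]) -> float:
    """Map a raw score to the calibrated value of the block it falls in."""
    idx = max(bisect_right(starts, score) - 1, 0)
    return values[idx]
```

If a low-tier classifier systematically underpredicts success, the calibrated value for a given raw score will sit above the raw score, so fewer requests would fall below the escalation threshold.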
## Current Recommendation

Use **prod_t0.55** as the default: 85.5% success, 10.7% cost increase vs frontier, 4.1% unsafe.
This is conservative (it prefers safety over savings), which is the right default for production.

For cost-sensitive deployments, use **heuristic_diff+1**: 83.4% success, 6.0% cost reduction vs frontier, 4.9% unsafe.

## Files

- `router_models/router_bundle.pkl`: pickled router with all 5 XGBoost classifiers
- `router_models/tier_{1-5}_success.json`: individual XGBoost model files
- `router_models/feat_keys.json`: feature key order
- `router_models/tier_config.json`: tier costs, strengths, task floors
- `training/`: all training scripts (v1-v4)
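
Because the feature key order matters at inference time, a consumer would typically build each feature vector from `feat_keys.json` and never from dictionary order. The sketch below assumes that file is a JSON array of feature names, which this report does not confirm; the helper names are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, List


def load_feature_keys(path: str) -> List[str]:
    """Read the feature order, assuming the file is a JSON array of strings."""
    keys = json.loads(Path(path).read_text(encoding="utf-8"))
    if not (isinstance(keys, list) and all(isinstance(k, str) for k in keys)):
        raise ValueError("expected a JSON array of feature names")
    return keys


def to_vector(features: Dict[str, float], keys: List[str]) -> List[float]:
    """Order features by `keys`; missing entries default to 0.0 (an assumption)."""
    return [float(features.get(key, 0.0)) for key in keys]
```

Note that `router_bundle.pkl` is a pickle, and unpickling can execute arbitrary code, so it should only be loaded from a trusted source.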