# Trained Router Report

## Architecture

The trained router uses a difficulty-first + ML confirmation + safety floor architecture:

1. Map `task_type` → difficulty (1-5)
2. Compute `base_tier = min(difficulty + 1, 5)`
3. Apply a safety floor per `task_type` (e.g., `legal_regulated` → tier 4)
4. Use per-tier XGBoost P(success) classifiers to confirm or escalate
5. If P(success@base_tier) < threshold, escalate one tier at a time
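The five steps above can be sketched as a small routing function. The difficulty map, safety floors, and `predict_success` stub below are illustrative assumptions, not the shipped implementation:

```python
# Sketch of the difficulty-first + ML confirmation + safety floor router.
# DIFFICULTY, SAFETY_FLOOR, and predict_success are illustrative stand-ins.

DIFFICULTY = {"quick_answer": 1, "legal_regulated": 4}  # task_type -> difficulty (1-5)
SAFETY_FLOOR = {"legal_regulated": 4}                   # per-task minimum tier

def predict_success(query: str, tier: int) -> float:
    """Stand-in for a per-tier XGBoost P(success) classifier."""
    return 0.7  # placeholder: pretend the base tier looks confident enough

def route(query: str, task_type: str, threshold: float = 0.65) -> int:
    difficulty = DIFFICULTY.get(task_type, 3)            # step 1
    base_tier = min(difficulty + 1, 5)                   # step 2
    tier = max(base_tier, SAFETY_FLOOR.get(task_type, 1))  # step 3: safety floor
    # steps 4-5: confirm with the classifier, escalating one tier at a time
    while tier < 5 and predict_success(query, tier) < threshold:
        tier += 1
    return tier

tier = route("What is 2+2?", "quick_answer")            # stays at base tier 2
floor_tier = route("Draft a GDPR clause", "legal_regulated")  # floored, then base -> 5
strict = route("What is 2+2?", "quick_answer", threshold=0.8)  # escalates to 5
```

Raising the threshold makes the router more conservative: a higher bar on P(success) means more escalations, which is exactly the cost/safety trade-off the `prod_t0.55`-`prod_t0.65` rows below explore.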

## Per-Tier XGBoost Classifiers

5 binary classifiers, each predicting P(task succeeds | query, tier=X).

Trained on 50,000 synthetic traces with ground-truth per-tier success labels.

Features: 23 (request text signals + task type one-hot + difficulty)
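The report does not enumerate the 23 features, but a feature vector of the described shape (text signals + task-type one-hot + difficulty) could look like the sketch below. The signal choices and task-type list are assumptions; `feat_keys.json` defines the real order:

```python
# Illustrative feature builder matching the described layout:
# request text signals + task-type one-hot + difficulty.
# Signal choices and TASK_TYPES are assumptions, not the real 23-feature set.

TASK_TYPES = ["quick_answer", "code_gen", "analysis", "legal_regulated"]

def build_features(query: str, task_type: str, difficulty: int) -> list[float]:
    text_signals = [
        float(len(query)),                       # raw character length
        float(len(query.split())),               # rough token count
        float("?" in query),                     # interrogative flag
        float(any(c.isdigit() for c in query)),  # contains numbers
    ]
    one_hot = [float(task_type == t) for t in TASK_TYPES]
    return text_signals + one_hot + [float(difficulty)]

vec = build_features("Summarize this contract", "legal_regulated", 4)
```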

## Results (N=2,000 eval traces, seed=999)

| Router | Success | AvgCost | CostRed vs Frontier | Unsafe | F-DONE |
|---|---:|---:|---:|---:|---:|
| oracle | 99.8% | 0.4862 | 51.4% | 0.0% | 0.3% |
| prod_t0.65 | 91.9% | 1.365 | -36.5% | 1.5% | 6.6% |
| prod_t0.60 | 90.7% | 1.316 | -31.6% | 1.8% | 7.4% |
| always_frontier | 88.8% | 1.000 | 0% | 2.5% | 8.7% |
| prod_t0.55 | 85.5% | 1.107 | -10.7% | 4.1% | 10.4% |
| heuristic_diff+1 | 83.4% | 0.940 | 6.0% | 4.9% | 11.7% |
| heuristic_floor | 59.7% | 0.501 | 49.9% | 27.8% | 12.6% |
| always_cheap | 20.9% | 0.050 | 95.0% | 79.0% | 0.0% |

AvgCost is normalized so `always_frontier` = 1.000; CostRed is the relative cost reduction against that baseline, so negative values mean the router costs more than always using the frontier model.
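The CostRed column follows directly from AvgCost; a quick check against a few rows from the table:

```python
# CostRed vs Frontier is relative to always_frontier's AvgCost of 1.000:
#   cost_reduction = (1 - avg_cost / frontier_cost) * 100
frontier_cost = 1.000
rows = {"oracle": 0.4862, "prod_t0.65": 1.365, "heuristic_diff+1": 0.940}

for name, avg_cost in rows.items():
    red = (1 - avg_cost / frontier_cost) * 100
    print(f"{name}: {red:+.1f}%")  # matches the table's 51.4%, -36.5%, 6.0%
```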

## Key Findings

1. The trained router at t=0.65 achieves 91.9% success, 3.1 pp **higher** than always_frontier (88.8%)
2. The unsafe rate drops from 2.5% (frontier) to 1.5% (trained)
3. Cost is higher (+36.5% vs frontier) because the ML classifiers are conservative and escalate more than necessary
4. The oracle shows a 51.4% cost reduction is achievable with perfect routing

## The Cost Problem

The trained router **over-escalates** because:

- The per-tier P(success) classifiers for tiers 1-2 have low accuracy (F1 < 0.5)
- They underpredict success at low tiers, causing unnecessary escalation
- This is a training-data problem: success at low tiers is inherently rare (22% at tier 1, 40% at tier 2)

## Solutions (Ordered by Expected Impact)

1. Calibrate classifier probabilities (Platt scaling or isotonic regression on held-out data)
2. Add more training data for easy tasks (oversample `quick_answer` successes)
3. Use difficulty as a direct feature (already top-3 in feature importance)
4. Fine-tune the escalation threshold per task type (lower for `quick_answer`, higher for legal)
5. Retrain with asymmetric sample weights (5x penalty for underkill examples, i.e. routing below the tier the task needed)
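Solution 1 is the cheapest to try. A hedged sketch on synthetic data, using scikit-learn's `CalibratedClassifierCV` with a `GradientBoostingClassifier` standing in for the per-tier XGBoost models; the data and class balance here are fabricated purely for illustration:

```python
# Sketch of solution 1: calibrate a per-tier P(success) classifier.
# GradientBoostingClassifier is a stand-in for the XGBoost models;
# the synthetic labels mimic rare success at low tiers (~25% positive).
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = (X[:, 0] + rng.normal(scale=2.0, size=2000) > 1.5).astype(int)

X_train, X_held, y_train, y_held = train_test_split(X, y, random_state=0)

# method="isotonic" (or "sigmoid" for Platt scaling); cv=3 fits the base
# model and the calibrator on internal folds of the training split.
calibrated = CalibratedClassifierCV(
    GradientBoostingClassifier(), method="isotonic", cv=3
).fit(X_train, y_train)

p = calibrated.predict_proba(X_held)[:, 1]
print(f"mean predicted P(success): {p.mean():.3f}, base rate: {y_held.mean():.3f}")
```

If the uncalibrated tier-1/2 classifiers systematically underpredict success, calibration on held-out data should pull the mean predicted probability toward the true base rate and reduce spurious escalations without retraining the models.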

## Current Recommendation

Use `prod_t0.55` as the default: 85.5% success, 10.7% cost increase vs frontier, 4.1% unsafe. This is conservative (it prefers safety over savings), which is the right default for production.

For cost-sensitive deployments, use `heuristic_diff+1`: 83.4% success, 6% cost savings.

## Files

- `router_models/router_bundle.pkl`: pickled router with all 5 XGBoost classifiers
- `router_models/tier_{1-5}_success.json`: individual XGBoost model files
- `router_models/feat_keys.json`: feature key order
- `router_models/tier_config.json`: tier costs, strengths, task floors
- `training/`: all training scripts (v1-v4)
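Loading the bundle could look like the sketch below; the internal structure of the unpickled object is an assumption here, so inspect it (or the training scripts) for the real interface:

```python
# Hedged sketch of loading the router artifacts listed above.
# The bundle's internal structure is NOT documented; only the file names
# come from the report.
import json
import pickle
from pathlib import Path

def load_router(models_dir: str):
    """Load the pickled router bundle plus its JSON sidecar files."""
    root = Path(models_dir)
    with open(root / "router_bundle.pkl", "rb") as f:
        bundle = pickle.load(f)  # router with all 5 XGBoost classifiers
    feat_keys = json.loads((root / "feat_keys.json").read_text())    # feature order
    tier_config = json.loads((root / "tier_config.json").read_text())  # costs, floors
    return bundle, feat_keys, tier_config
```

As with any pickle, only load `router_bundle.pkl` from a source you trust, since unpickling can execute arbitrary code.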