# Trained Router Report

## Architecture

The trained router uses a difficulty-first + ML confirmation + safety floor architecture:

1. Map `task_type` → difficulty (1-5)
2. Compute `base_tier = min(difficulty + 1, 5)`
3. Apply a safety floor per `task_type` (e.g., `legal_regulated` → tier 4)
4. Use per-tier XGBoost P(success) classifiers to confirm or escalate
5. If P(success@base_tier) < threshold, escalate one tier at a time
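The five steps above can be sketched as a small routing function. The difficulty map, safety floors, and `predict_success` stub below are illustrative assumptions, not the shipped implementation:

```python
# Sketch of the difficulty-first + ML confirmation + safety floor router.
# DIFFICULTY, SAFETY_FLOOR, and predict_success are illustrative stand-ins.

DIFFICULTY = {"quick_answer": 1, "legal_regulated": 4}  # task_type -> difficulty (1-5)
SAFETY_FLOOR = {"legal_regulated": 4}                   # per-task minimum tier

def predict_success(query: str, tier: int) -> float:
    """Stand-in for a per-tier XGBoost P(success) classifier."""
    return 0.7  # placeholder: pretend the base tier looks confident enough

def route(query: str, task_type: str, threshold: float = 0.65) -> int:
    difficulty = DIFFICULTY.get(task_type, 3)            # step 1
    base_tier = min(difficulty + 1, 5)                   # step 2
    tier = max(base_tier, SAFETY_FLOOR.get(task_type, 1))  # step 3: safety floor
    # steps 4-5: confirm with the classifier, escalating one tier at a time
    while tier < 5 and predict_success(query, tier) < threshold:
        tier += 1
    return tier

tier = route("What is 2+2?", "quick_answer")            # stays at base tier 2
floor_tier = route("Draft a GDPR clause", "legal_regulated")  # floored, then base -> 5
strict = route("What is 2+2?", "quick_answer", threshold=0.8)  # escalates to 5
```

Raising the threshold makes the router more conservative: a higher bar on P(success) means more escalations, which is exactly the cost/safety trade-off the `prod_t0.55`-`prod_t0.65` rows below explore.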

## Per-Tier XGBoost Classifiers

5 binary classifiers, each predicting P(task succeeds | query, tier=X).

Trained on 50,000 synthetic traces with ground-truth per-tier success labels.

Features: 23 (request text signals + task type one-hot + difficulty)
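The report does not enumerate the 23 features, but a feature vector of the described shape (text signals + task-type one-hot + difficulty) could look like the sketch below. The signal choices and task-type list are assumptions; `feat_keys.json` defines the real order:

```python
# Illustrative feature builder matching the described layout:
# request text signals + task-type one-hot + difficulty.
# Signal choices and TASK_TYPES are assumptions, not the real 23-feature set.

TASK_TYPES = ["quick_answer", "code_gen", "analysis", "legal_regulated"]

def build_features(query: str, task_type: str, difficulty: int) -> list[float]:
    text_signals = [
        float(len(query)),                       # raw character length
        float(len(query.split())),               # rough token count
        float("?" in query),                     # interrogative flag
        float(any(c.isdigit() for c in query)),  # contains numbers
    ]
    one_hot = [float(task_type == t) for t in TASK_TYPES]
    return text_signals + one_hot + [float(difficulty)]

vec = build_features("Summarize this contract", "legal_regulated", 4)
```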

## Results (N=2,000 eval traces, seed=999)

| Router | Success | AvgCost | CostRed vs Frontier | Unsafe | F-DONE |
|---|---:|---:|---:|---:|---:|
| oracle | 99.8% | 0.4862 | 51.4% | 0.0% | 0.3% |
| prod_t0.65 | 91.9% | 1.365 | -36.5% | 1.5% | 6.6% |
| prod_t0.60 | 90.7% | 1.316 | -31.6% | 1.8% | 7.4% |
| always_frontier | 88.8% | 1.000 | 0% | 2.5% | 8.7% |
| prod_t0.55 | 85.5% | 1.107 | -10.7% | 4.1% | 10.4% |
| heuristic_diff+1 | 83.4% | 0.940 | 6.0% | 4.9% | 11.7% |
| heuristic_floor | 59.7% | 0.501 | 49.9% | 27.8% | 12.6% |
| always_cheap | 20.9% | 0.050 | 95.0% | 79.0% | 0.0% |

AvgCost is normalized so `always_frontier` = 1.000; CostRed is the relative cost reduction against that baseline, so negative values mean the router costs more than always using the frontier model.
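The CostRed column follows directly from AvgCost; a quick check against a few rows from the table:

```python
# CostRed vs Frontier is relative to always_frontier's AvgCost of 1.000:
#   cost_reduction = (1 - avg_cost / frontier_cost) * 100
frontier_cost = 1.000
rows = {"oracle": 0.4862, "prod_t0.65": 1.365, "heuristic_diff+1": 0.940}

for name, avg_cost in rows.items():
    red = (1 - avg_cost / frontier_cost) * 100
    print(f"{name}: {red:+.1f}%")  # matches the table's 51.4%, -36.5%, 6.0%
```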

## Key Findings

1. The trained router at t=0.65 achieves 91.9% success, 3.1 pp **higher** than always_frontier (88.8%)
2. The unsafe rate drops from 2.5% (frontier) to 1.5% (trained)
3. Cost is higher (+36.5% vs frontier) because the ML classifiers are conservative and escalate more than necessary
4. The oracle shows a 51.4% cost reduction is achievable with perfect routing

## The Cost Problem

The trained router **over-escalates** because:

- The per-tier P(success) classifiers for tiers 1-2 have low accuracy (F1 < 0.5)
- They underpredict success at low tiers, causing unnecessary escalation
- This is a training-data problem: success at low tiers is inherently rare (22% at tier 1, 40% at tier 2)

## Solutions (Ordered by Expected Impact)

1. Calibrate classifier probabilities (Platt scaling or isotonic regression on held-out data)
2. Add more training data for easy tasks (oversample `quick_answer` successes)
3. Use difficulty as a direct feature (already top-3 in feature importance)
4. Fine-tune the escalation threshold per task type (lower for `quick_answer`, higher for legal)
5. Retrain with asymmetric sample weights (5x penalty for underkill examples, i.e. routing below the tier the task needed)
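Solution 1 is the cheapest to try. A hedged sketch on synthetic data, using scikit-learn's `CalibratedClassifierCV` with a `GradientBoostingClassifier` standing in for the per-tier XGBoost models; the data and class balance here are fabricated purely for illustration:

```python
# Sketch of solution 1: calibrate a per-tier P(success) classifier.
# GradientBoostingClassifier is a stand-in for the XGBoost models;
# the synthetic labels mimic rare success at low tiers (~25% positive).
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = (X[:, 0] + rng.normal(scale=2.0, size=2000) > 1.5).astype(int)

X_train, X_held, y_train, y_held = train_test_split(X, y, random_state=0)

# method="isotonic" (or "sigmoid" for Platt scaling); cv=3 fits the base
# model and the calibrator on internal folds of the training split.
calibrated = CalibratedClassifierCV(
    GradientBoostingClassifier(), method="isotonic", cv=3
).fit(X_train, y_train)

p = calibrated.predict_proba(X_held)[:, 1]
print(f"mean predicted P(success): {p.mean():.3f}, base rate: {y_held.mean():.3f}")
```

If the uncalibrated tier-1/2 classifiers systematically underpredict success, calibration on held-out data should pull the mean predicted probability toward the true base rate and reduce spurious escalations without retraining the models.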

## Current Recommendation

Use `prod_t0.55` as the default: 85.5% success, 10.7% cost increase vs frontier, 4.1% unsafe. This is conservative (it prefers safety over savings), which is the right default for production.

For cost-sensitive deployments, use `heuristic_diff+1`: 83.4% success, 6% cost savings.

## Files

- `router_models/router_bundle.pkl`: pickled router with all 5 XGBoost classifiers
- `router_models/tier_{1-5}_success.json`: individual XGBoost model files
- `router_models/feat_keys.json`: feature key order
- `router_models/tier_config.json`: tier costs, strengths, task floors
- `training/`: all training scripts (v1-v4)
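Loading the bundle could look like the sketch below; the internal structure of the unpickled object is an assumption here, so inspect it (or the training scripts) for the real interface:

```python
# Hedged sketch of loading the router artifacts listed above.
# The bundle's internal structure is NOT documented; only the file names
# come from the report.
import json
import pickle
from pathlib import Path

def load_router(models_dir: str):
    """Load the pickled router bundle plus its JSON sidecar files."""
    root = Path(models_dir)
    with open(root / "router_bundle.pkl", "rb") as f:
        bundle = pickle.load(f)  # router with all 5 XGBoost classifiers
    feat_keys = json.loads((root / "feat_keys.json").read_text())    # feature order
    tier_config = json.loads((root / "tier_config.json").read_text())  # costs, floors
    return bundle, feat_keys, tier_config
```

As with any pickle, only load `router_bundle.pkl` from a source you trust, since unpickling can execute arbitrary code.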