# BERT Router Evaluation Results
## Setup
- BERT: `DistilBertForSequenceClassification` (`num_labels=2`, binary)
- Trained on SPROUT (31K rows, 13 models) for binary success/fail prediction
- Evaluated on SWE-Router (500 tasks, 8 models)
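
A minimal sketch of this classifier setup, for reference; the base checkpoint name and the label convention are assumptions, not confirmed from the actual training config:

```python
from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast

# Assumed base checkpoint; the actual SPROUT fine-tune may differ.
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,  # binary head: assumed label 0 = fail, 1 = success
)
```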
## Results
| Policy | Success | Cost Reduction |
|---|---|---|
| Oracle | 87.0% | 80.3% |
| v10+feedback | 81.4% | -35.6% |
| bert+feedback | 81.0% | -35.3% |
| Frontier | 78.2% | baseline |
| v10 XGBoost | 56.2% | -7.4% |
| BERT | 49.2% | -0.4% |
| Always cheap | 63.2% | 95.5% |
## Diagnosis: BERT Router is Broken
Root cause: BERT is a binary classifier (`num_labels=2`) trained to predict success/fail on SPROUT. When used for per-tier routing by prepending `[Tier X]` to the input, it ignores the tier prefix and predicts P(success) ≈ 89.5% for ALL tiers.
Evidence:
- All 100 sampled tasks routed to Tier 1
- P(success) is nearly identical across all tiers: 0.895 ± 0.001
- The tier prefix `[Tier X]` has no effect on BERT's predictions
Why: BERT was trained on SPROUT data where the input was just the problem statement, not `[Tier X] problem_statement`. The model never saw the tier prefix during training, so it ignores it.
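
A sketch of the prefix-sensitivity check behind this diagnosis, assuming the `model` and `tokenizer` from the setup above and a hypothetical task string:

```python
import torch

problem = "Fix the off-by-one error in the pagination helper."  # hypothetical task

for tier in range(1, 6):
    inputs = tokenizer(f"[Tier {tier}] {problem}", return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    p_success = torch.softmax(logits, dim=-1)[0, 1].item()  # index 1 = success (assumed)
    print(f"Tier {tier}: P(success) = {p_success:.4f}")

# Observed behavior: the printout is ~0.895 for every tier, i.e. the
# prefix has no effect on the prediction.
```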
## Fix Required
To make BERT work for tier routing, we need one of:
### Option A: Retrain as a 5-class model
- Change `num_labels` from 2 to 5
- Labels: the optimal tier (1-5) for each task
- This directly predicts the best tier from the problem statement
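
A minimal sketch of the Option A change, assuming optimal-tier labels are available per task (checkpoint name assumed):

```python
from transformers import DistilBertForSequenceClassification

# Five output classes, one per tier; training label = optimal_tier - 1.
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=5,
)
# At routing time, the predicted tier is argmax(logits) + 1.
```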
### Option B: Retrain with tier-prefixed inputs
- Keep binary classification
- Augment the training data: for each task, create 5 examples: `[Tier 1] problem`, `[Tier 2] problem`, etc.
- Label = success/fail at that tier
- This teaches the model to condition on the tier prefix
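
A sketch of the Option B augmentation; the record fields (`problem_statement`, `success_by_tier`) are hypothetical names for the per-tier outcomes assumed to exist in the training data:

```python
def augment(task: dict) -> list[dict]:
    """Expand one task into five tier-prefixed binary examples."""
    return [
        {
            "text": f"[Tier {tier}] {task['problem_statement']}",
            "label": int(task["success_by_tier"][tier]),  # 1 = success at this tier
        }
        for tier in range(1, 6)
    ]
```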
### Option C: Use BERT for feature extraction only
- Use BERT's `[CLS]` embedding as features for the XGBoost router
- This replaces hand-crafted keyword features with learned representations
- No need to change BERT's training
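
A sketch of the Option C feature extraction, using the bare DistilBERT encoder (checkpoint name assumed); the resulting 768-dimensional vectors would replace the keyword features currently fed to XGBoost:

```python
import numpy as np
import torch
from transformers import DistilBertModel, DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
encoder = DistilBertModel.from_pretrained("distilbert-base-uncased").eval()

def cls_features(texts: list[str]) -> np.ndarray:
    """Return the [CLS] embedding for each text as an XGBoost feature row."""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (batch, seq_len, 768)
    return hidden[:, 0, :].numpy()  # position 0 is the [CLS] token
```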
Recommended: Option C — it's the least risky and provides immediate value by upgrading the XGBoost feature extraction.
## v10 XGBoost Performance Issue
The `v10_fixed` model also underperforms expectations here (56.2% direct success vs. 76.6% in the previous eval). Likely causes:
- The v10_fixed model has only 14 features (fewer than the full model)
- The threshold of 0.65 may not be well calibrated for this model (see the sweep sketch below)
- The safety floor enforcement may be too weak
This needs investigation — the original v10 eval used a different model bundle.
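
One quick check on the calibration point above, as a hedged sketch: sweep the decision threshold on held-out routing data. `probs` and `labels` are assumed validation arrays of predicted P(success) and observed 0/1 outcomes:

```python
import numpy as np

def sweep_thresholds(probs: np.ndarray, labels: np.ndarray) -> None:
    """Print accuracy of the success/fail cut at candidate thresholds."""
    for t in np.arange(0.40, 0.91, 0.05):
        preds = (probs >= t).astype(int)
        print(f"threshold={t:.2f}  accuracy={(preds == labels).mean():.3f}")
```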