# BERT Router Evaluation Results
## Setup
- BERT: `DistilBertForSequenceClassification` (`num_labels=2`, binary)
- Trained on SPROUT (31K rows, 13 models) for binary success/fail prediction
- Evaluated on SWE-Router (500 tasks, 8 models)
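
A minimal sketch of this classifier setup, for reference; the base checkpoint name and the label convention are assumptions, not confirmed from the actual training config:

```python
from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast

# Assumed base checkpoint; the actual SPROUT fine-tune may differ.
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,  # binary head: assumed label 0 = fail, 1 = success
)
```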
## Results
| Policy | Success | Cost Reduction |
|---|---|---|
| Oracle | 87.0% | 80.3% |
| v10+feedback | 81.4% | -35.6% |
| bert+feedback | 81.0% | -35.3% |
| Frontier | 78.2% | baseline |
| v10 XGBoost | 56.2% | -7.4% |
| BERT | 49.2% | -0.4% |
| Always cheap | 63.2% | 95.5% |
## Diagnosis: BERT Router is Broken
Root cause: BERT is a binary classifier (`num_labels=2`) trained to predict success/fail on SPROUT. When used for per-tier routing by prepending `[Tier X]` to the input, it ignores the tier prefix and predicts P(success) ≈ 89.5% for ALL tiers.
Evidence:
- All 100 sampled tasks routed to Tier 1
- P(success) is nearly identical across all tiers: 0.895 ± 0.001
- The tier prefix `[Tier X]` has no effect on BERT's predictions
Why: BERT was trained on SPROUT data where the input was just the problem statement, not `[Tier X] problem_statement`. The model never saw the tier prefix during training, so it ignores it.
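
A sketch of the prefix-sensitivity check behind this diagnosis, assuming the `model` and `tokenizer` from the setup above and a hypothetical task string:

```python
import torch

problem = "Fix the off-by-one error in the pagination helper."  # hypothetical task

for tier in range(1, 6):
    inputs = tokenizer(f"[Tier {tier}] {problem}", return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    p_success = torch.softmax(logits, dim=-1)[0, 1].item()  # index 1 = success (assumed)
    print(f"Tier {tier}: P(success) = {p_success:.4f}")

# Observed behavior: the printout is ~0.895 for every tier, i.e. the
# prefix has no effect on the prediction.
```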
## Fix Required
To make BERT work for tier routing, we need one of:
### Option A: Retrain as a 5-class model
- Change `num_labels` from 2 to 5
- Labels: the optimal tier (1-5) for each task
- This directly predicts the best tier from the problem statement
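
A minimal sketch of the Option A change, assuming optimal-tier labels are available per task (checkpoint name assumed):

```python
from transformers import DistilBertForSequenceClassification

# Five output classes, one per tier; training label = optimal_tier - 1.
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=5,
)
# At routing time, the predicted tier is argmax(logits) + 1.
```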
### Option B: Retrain with tier-prefixed inputs
- Keep binary classification
- Augment the training data: for each task, create 5 examples: `[Tier 1] problem`, `[Tier 2] problem`, etc.
- Label = success/fail at that tier
- This teaches the model to condition on the tier prefix
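
A sketch of the Option B augmentation; the record fields (`problem_statement`, `success_by_tier`) are hypothetical names for the per-tier outcomes assumed to exist in the training data:

```python
def augment(task: dict) -> list[dict]:
    """Expand one task into five tier-prefixed binary examples."""
    return [
        {
            "text": f"[Tier {tier}] {task['problem_statement']}",
            "label": int(task["success_by_tier"][tier]),  # 1 = success at this tier
        }
        for tier in range(1, 6)
    ]
```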
### Option C: Use BERT for feature extraction only
- Use BERT's `[CLS]` embedding as features for the XGBoost router
- This replaces hand-crafted keyword features with learned representations
- No need to change BERT's training
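
A sketch of the Option C feature extraction, using the bare DistilBERT encoder (checkpoint name assumed); the resulting 768-dimensional vectors would replace the keyword features currently fed to XGBoost:

```python
import numpy as np
import torch
from transformers import DistilBertModel, DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
encoder = DistilBertModel.from_pretrained("distilbert-base-uncased").eval()

def cls_features(texts: list[str]) -> np.ndarray:
    """Return the [CLS] embedding for each text as an XGBoost feature row."""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (batch, seq_len, 768)
    return hidden[:, 0, :].numpy()  # position 0 is the [CLS] token
```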
Recommended: Option C — it's the least risky and provides immediate value by upgrading the XGBoost feature extraction.
## v10 XGBoost Performance Issue
The `v10_fixed` model also underperforms expectations here (56.2% direct success vs. 76.6% in the previous eval). Likely causes:
- The v10_fixed model has only 14 features (fewer than the full model)
- The threshold of 0.65 may not be well calibrated for this model (see the sweep sketch below)
- The safety floor enforcement may be too weak
This needs investigation — the original v10 eval used a different model bundle.
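
One quick check on the calibration point above, as a hedged sketch: sweep the decision threshold on held-out routing data. `probs` and `labels` are assumed validation arrays of predicted P(success) and observed 0/1 outcomes:

```python
import numpy as np

def sweep_thresholds(probs: np.ndarray, labels: np.ndarray) -> None:
    """Print accuracy of the success/fail cut at candidate thresholds."""
    for t in np.arange(0.40, 0.91, 0.05):
        preds = (probs >= t).astype(int)
        print(f"threshold={t:.2f}  accuracy={(preds == labels).mean():.3f}")
```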