# BERT Router Evaluation Results

## Setup
- BERT: `DistilBertForSequenceClassification` (`num_labels=2`, binary)
- Trained on SPROUT (31K rows, 13 models) for binary success/fail prediction
- Evaluated on SWE-Router (500 tasks, 8 models)

## Results

| Policy | Success | Cost reduction |
|--------|---------|----------------|
| Oracle | 87.0% | 80.3% |
| v10+feedback | 81.4% | -35.6% |
| bert+feedback | 81.0% | -35.3% |
| Frontier | 78.2% | baseline |
| v10 XGBoost | 56.2% | -7.4% |
| BERT | 49.2% | -0.4% |
| Always cheap | 63.2% | 95.5% |

## Diagnosis: BERT Router is Broken

**Root cause**: BERT is a binary classifier (`num_labels=2`) trained to predict success/fail on SPROUT. When it is used for per-tier routing by prepending `[Tier X]` to the input, it ignores the tier prefix and predicts P(success) ≈ 89.5% for all tiers.

**Evidence**:
- All 100 sampled tasks were routed to Tier 1
- P(success) is nearly identical across all tiers: 89.5% ± 0.001
- The tier prefix `[Tier X]` has no measurable effect on BERT's predictions
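
A check along the following lines would reproduce this evidence. It is a minimal sketch, not the evaluation code used here: `score_fn` is a hypothetical stand-in for the fine-tuned model's P(success) on an input string, the tier numbering (1 = cheapest) is assumed, and the routing rule (pick the cheapest tier whose P(success) clears a threshold) is an assumption about how the per-tier scores are consumed.

```python
from typing import Callable, Dict, Optional

TIERS = [1, 2, 3, 4, 5]  # assumed tier numbering, 1 = cheapest


def tier_scores(score_fn: Callable[[str], float], problem: str) -> Dict[int, float]:
    """Score the same problem once per tier using the `[Tier X]` prefix."""
    return {t: score_fn(f"[Tier {t}] {problem}") for t in TIERS}


def prefix_spread(scores: Dict[int, float]) -> float:
    """Max minus min P(success) across tiers; ~0 means the prefix is ignored."""
    return max(scores.values()) - min(scores.values())


def route(scores: Dict[int, float], threshold: float) -> Optional[int]:
    """Assumed rule: cheapest tier whose P(success) clears the threshold."""
    for t in sorted(scores):
        if scores[t] >= threshold:
            return t
    return None


# Hypothetical scorer that mimics the observed failure: it never looks at the prefix.
def prefix_blind(text: str) -> float:
    return 0.895


scores = tier_scores(prefix_blind, "Fix the off-by-one error in the paginator")
spread = prefix_spread(scores)
chosen = route(scores, threshold=0.65)  # illustrative threshold
```

With a prefix-blind scorer the spread is zero, and under the assumed rule every task lands on Tier 1 for any threshold at or below 0.895, which is consistent with all 100 sampled tasks being routed to Tier 1.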

**Why**: BERT was trained on SPROUT inputs that consisted of the problem statement alone, not `[Tier X] problem_statement`. The model never saw the tier prefix during training, so it never learned to make its prediction depend on it.

## Fix Required

To make BERT work for tier routing, we need one of the following:

### Option A: Retrain as 5-class model
- Change `num_labels` from 2 to 5
- Labels: optimal tier (1-5) for each task
- The model then predicts the best tier directly from the problem statement
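
One way to derive the 5-class labels is sketched below, under assumptions that are not specified above: "optimal tier" is taken to mean the cheapest tier at which the task succeeded, tiers are numbered 1 (cheapest) to 5, and tasks that fail at every tier are dropped. The `outcomes` layout and the example tasks are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def optimal_tier(outcomes: Dict[int, bool]) -> Optional[int]:
    """Cheapest tier (lowest number) that succeeded, or None if every tier failed."""
    winners = [tier for tier, ok in outcomes.items() if ok]
    return min(winners) if winners else None


def build_tier_dataset(
    tasks: List[Tuple[str, Dict[int, bool]]],
) -> List[Tuple[str, int]]:
    """Return (problem, class_id) pairs with class_id in 0..4 for num_labels=5."""
    rows = []
    for problem, outcomes in tasks:
        tier = optimal_tier(outcomes)
        if tier is None:
            continue  # assumption: drop tasks that no tier can solve
        rows.append((problem, tier - 1))  # class ids are zero-based
    return rows


# Made-up example data, for illustration only
example_tasks = [
    ("Rename a variable", {1: True, 2: True, 3: True, 4: True, 5: True}),
    ("Refactor the scheduler", {1: False, 2: False, 3: True, 4: True, 5: True}),
    ("Unsolved task", {1: False, 2: False, 3: False, 4: False, 5: False}),
]
tier_dataset = build_tier_dataset(example_tasks)
```

The class ids are shifted to 0..4 because a 5-way classification head trained with cross-entropy expects labels in the range `0` to `num_labels - 1`. How to treat tasks that no tier solves (drop them, or assign the top tier) is a design choice that changes the label distribution and should be decided explicitly.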

### Option B: Retrain with tier-prefixed inputs
- Keep binary classification
- Augment training data: for each task, create 5 examples (`[Tier 1] problem`, `[Tier 2] problem`, etc.)
- Label = success/fail at that tier
- This teaches the model to condition on the tier prefix
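
The augmentation step could look like the sketch below. It assumes that a success/fail outcome is available for every tier of each training task, which is not stated above; the function name and the sample outcomes are hypothetical.

```python
from typing import Dict, List, Tuple


def tier_prefixed_examples(
    problem: str, outcomes: Dict[int, bool]
) -> List[Tuple[str, int]]:
    """One (text, label) pair per tier; label 1 = success at that tier, 0 = fail."""
    return [
        (f"[Tier {tier}] {problem}", int(ok))
        for tier, ok in sorted(outcomes.items())
    ]


# Made-up outcomes for one task, for illustration only
augmented = tier_prefixed_examples(
    "Refactor the scheduler",
    {1: False, 2: False, 3: True, 4: True, 5: True},
)
```

Each task contributes five rows that share the same problem text and differ only in the prefix and label, so the prefix is the only signal the model can use to separate them.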

### Option C: Use BERT for feature extraction only
- Use BERT's `[CLS]` embedding as features for the XGBoost router
- This replaces hand-crafted keyword features with learned representations
- BERT's training does not need to change
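
The feature-assembly side of this option might be sketched as follows. To keep the example runnable without model weights, `embed_fn` is a hypothetical callable; in practice it would wrap the encoder and return the hidden state at the first (`[CLS]`) token position. `extra_fn` is likewise an assumed hook for keeping any existing features alongside the embedding.

```python
from typing import Callable, List, Sequence


def build_features(
    problems: Sequence[str],
    embed_fn: Callable[[str], List[float]],
    extra_fn: Callable[[str], List[float]] = lambda _p: [],
) -> List[List[float]]:
    """One feature row per problem: embedding first, then any extra features."""
    rows = []
    for problem in problems:
        rows.append(list(embed_fn(problem)) + list(extra_fn(problem)))
    widths = {len(r) for r in rows}
    if len(widths) > 1:
        raise ValueError(f"inconsistent feature widths: {sorted(widths)}")
    return rows


# Toy stand-in for the encoder: a 3-dimensional "embedding" built from string stats.
def toy_embed(text: str) -> List[float]:
    return [float(len(text)), float(text.count(" ")), float(text.isupper())]


feature_rows = build_features(["fix bug", "add retry logic"], toy_embed)
```

The resulting rows would serve as the feature matrix for the XGBoost router. The width check guards against a mismatch between embedding size and the feature count the router was trained on.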

**Recommended**: Option C. It is the least risky and provides immediate value by upgrading the XGBoost feature extraction.

## v10 XGBoost Performance Issue

The v10_fixed model also underperforms expectations here (56.2% direct vs 76.6% in the previous eval). Likely causes:
1. The v10_fixed model has only 14 features (fewer than the full model)
2. The threshold of 0.65 may not be well calibrated for this model
3. The safety floor enforcement may be too weak

This needs investigation: the original v10 eval used a different model bundle.
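
For the calibration question (cause 2), one possible starting point is to compare predicted probabilities with observed outcomes, both per probability bin and for everything at or above the 0.65 threshold. This is a sketch under assumptions: the function names are hypothetical, the bins are equal-width, and the sample data is made up rather than taken from this evaluation.

```python
from typing import List, Optional, Tuple


def reliability_table(
    probs: List[float], outcomes: List[bool], n_bins: int = 5
) -> List[Tuple[float, float, float, int]]:
    """Rows of (bin_low, mean_predicted, observed_rate, count) per non-empty bin."""
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for p, ok in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, ok))
    rows = []
    for idx, members in enumerate(bins):
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        observed = sum(ok for _, ok in members) / len(members)
        rows.append((idx / n_bins, mean_pred, observed, len(members)))
    return rows


def success_rate_above(
    probs: List[float], outcomes: List[bool], threshold: float
) -> Optional[float]:
    """Observed success rate among predictions at or above the threshold."""
    kept = [ok for p, ok in zip(probs, outcomes) if p >= threshold]
    return sum(kept) / len(kept) if kept else None


# Made-up predictions and outcomes, for illustration only
probs = [0.1, 0.7, 0.7, 0.9]
outcomes = [False, True, False, True]
table = reliability_table(probs, outcomes)
rate_at_065 = success_rate_above(probs, outcomes, threshold=0.65)
```

If the observed rate in a bin sits well below its mean predicted probability, the model is overconfident in that range and a fixed threshold such as 0.65 will admit more failures than intended.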