BERT Router Evaluation Results

Setup

  • BERT: DistilBertForSequenceClassification (num_labels=2, binary)
  • Trained on SPROUT (31K rows, 13 models) for binary success/fail prediction
  • Evaluated on SWE-Router (500 tasks, 8 models)

Results

| Policy | Success | Cost reduction |
|---|---|---|
| Oracle | 87.0% | 80.3% |
| v10+feedback | 81.4% | -35.6% |
| bert+feedback | 81.0% | -35.3% |
| Frontier | 78.2% | baseline |
| v10 XGBoost | 56.2% | -7.4% |
| BERT | 49.2% | -0.4% |
| Always cheap | 63.2% | 95.5% |

Diagnosis: BERT Router is Broken

Root cause: BERT is a binary classifier (num_labels=2) trained to predict success/fail on SPROUT. When used for per-tier routing by prepending [Tier X] to the input, it ignores the tier prefix and predicts P(success) ≈ 89.5% for ALL tiers.

Evidence:

  • All 100 sampled tasks routed to Tier 1
  • P(success) is nearly identical across all tiers: 89.5% ± 0.001
  • The tier prefix [Tier X] has no effect on BERT's predictions

Why: BERT was trained on SPROUT data where the input was the problem statement alone, not [Tier X] problem_statement. The model never saw the tier prefix during training, so it never learned to condition on it.
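The evidence above can be reproduced with a small sensitivity check. The sketch below assumes a hypothetical `predict_success(text)` callable that wraps the classifier and returns P(success) as a float; that name, and the five-tier range, are illustrative assumptions rather than the project's actual interface. A mean spread near zero indicates that the tier prefix is being ignored.

```python
from statistics import mean
from typing import Callable, Dict, Iterable

def tier_sensitivity(
    problems: Iterable[str],
    predict_success: Callable[[str], float],  # hypothetical wrapper around the classifier
    n_tiers: int = 5,                         # assumed number of tiers
) -> Dict[str, float]:
    """Measure how much P(success) moves when only the tier prefix changes."""
    spreads = []
    for problem in problems:
        # Score the same problem once per tier prefix.
        probs = [predict_success(f"[Tier {t}] {problem}") for t in range(1, n_tiers + 1)]
        spreads.append(max(probs) - min(probs))
    return {"mean_spread": mean(spreads), "max_spread": max(spreads)}
```

Running this on a sample of evaluation tasks before and after any retraining gives a quick regression check: a tier-aware model should show a clearly non-zero spread.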

Fix Required

To make BERT work for tier routing, we need one of:

Option A: Retrain as 5-class model

  • Change num_labels from 2 to 5
  • Labels: optimal tier (1-5) for each task
  • This directly predicts the best tier from the problem statement
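One way to derive the Option A labels is sketched below. It assumes that Tier 1 is the cheapest tier, that "optimal" means the cheapest tier at which the task succeeds, and that tasks failing at every tier are assigned the top tier; none of these choices is specified in this report. The function name `optimal_tier_label` is hypothetical. Labels are returned 0-based because a 5-class classification head expects class indices 0 to 4.

```python
from typing import Dict

def optimal_tier_label(outcomes: Dict[int, bool], n_tiers: int = 5) -> int:
    """Map per-tier success flags to a 0-based class index.

    outcomes: {tier: succeeded} with tiers numbered 1..n_tiers (assumed
    cheapest first). Missing tiers are treated as failures.
    """
    for tier in range(1, n_tiers + 1):
        if outcomes.get(tier, False):
            return tier - 1          # cheapest succeeding tier, shifted to 0-based
    return n_tiers - 1               # assumption: unsolved tasks go to the top tier
```

For example, a task that fails at Tier 1 but succeeds from Tier 2 upward gets class index 1, which corresponds to Tier 2.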

Option B: Retrain with tier-prefixed inputs

  • Keep binary classification
  • Augment training data: for each task, create 5 examples [Tier 1] problem, [Tier 2] problem, etc.
  • Label = success/fail at that tier
  • This teaches the model to condition on the tier prefix
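The augmentation for Option B can be sketched as follows. It assumes per-tier outcomes are available for each task, for instance by mapping SPROUT's 13 models onto tiers; that mapping is not defined in this report, and the function name `tier_prefixed_examples` is hypothetical.

```python
from typing import Dict, List

def tier_prefixed_examples(problem: str, outcomes: Dict[int, bool]) -> List[dict]:
    """Expand one task into one binary training example per tier.

    outcomes: {tier: succeeded}. Each example carries the tier prefix in its
    text so the classifier can learn to condition on it.
    """
    return [
        {"text": f"[Tier {tier}] {problem}", "label": int(succeeded)}
        for tier, succeeded in sorted(outcomes.items())
    ]
```

The prefix format must match the one used at inference time exactly, otherwise the model is again scored on inputs it never saw during training.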

Option C: Use BERT for feature extraction only

  • Use BERT's [CLS] embedding as features for the XGBoost router
  • This replaces hand-crafted keyword features with learned representations
  • No need to change BERT's training
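A minimal sketch of Option C follows, using the Hugging Face `transformers` and PyTorch APIs. The default checkpoint `distilbert-base-uncased` is a public stand-in for the project's fine-tuned model, and `cls_embeddings` and `combine_features` are hypothetical helper names. Whether the embeddings replace the hand-crafted features or are concatenated with them is a design choice; concatenation is shown here only as an illustration.

```python
from typing import List, Sequence

def cls_embeddings(
    texts: Sequence[str],
    model_name: str = "distilbert-base-uncased",  # stand-in; swap in the fine-tuned checkpoint
) -> List[List[float]]:
    """Return the [CLS] hidden state of the encoder for each text."""
    # Heavy dependencies are imported lazily so combine_features works without them.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    enc = tokenizer(list(texts), padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state  # shape: (batch, seq_len, hidden)
    return hidden[:, 0, :].tolist()              # position 0 is the [CLS] token

def combine_features(
    cls_vecs: Sequence[Sequence[float]],
    handcrafted: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Concatenate each embedding with that task's existing feature row."""
    if len(cls_vecs) != len(handcrafted):
        raise ValueError("one hand-crafted feature row is required per embedding")
    return [list(c) + list(h) for c, h in zip(cls_vecs, handcrafted)]
```

The resulting rows can be passed to the XGBoost router as its feature matrix. The encoder stays frozen, so only the XGBoost model needs retraining.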

Recommended: Option C. It is the least risky because BERT needs no retraining, and it provides immediate value by upgrading the XGBoost feature extraction.

v10 XGBoost Performance Issue

The v10_fixed model also underperforms expectations here (56.2% in direct routing vs. 76.6% in the previous eval). Likely causes:

  1. The v10_fixed model has only 14 features (fewer than the full model)
  2. The threshold of 0.65 may not be well-calibrated for this model
  3. The safety floor enforcement may be too weak
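To make the roles of the threshold and the safety floor concrete, here is a sketch of one plausible routing rule: choose the cheapest tier at or above the floor whose predicted P(success) meets the threshold, and fall back to the top tier otherwise. This rule is an assumption for illustration; the report does not specify the router's actual decision logic, and the `route` function is hypothetical.

```python
from typing import Dict

def route(tier_probs: Dict[int, float], threshold: float = 0.65, safety_floor: int = 1) -> int:
    """Pick a tier from per-tier P(success) estimates (assumed rule, see lead-in).

    tier_probs: {tier: P(success)}, tiers assumed ordered cheapest first.
    safety_floor: lowest tier the router is allowed to choose.
    """
    eligible = sorted(t for t in tier_probs if t >= safety_floor)
    if not eligible:
        raise ValueError("no tier at or above the safety floor")
    for tier in eligible:
        if tier_probs[tier] >= threshold:
            return tier              # cheapest tier that clears the threshold
    return eligible[-1]              # nothing clears it: escalate to the top tier
```

Under this rule, a model that outputs roughly 0.895 for every tier always selects the lowest allowed tier, which matches the all-Tier-1 behavior observed for BERT. It also shows why a miscalibrated threshold or a floor set too low would push traffic toward cheap tiers.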

This needs investigation: the original v10 eval used a different model bundle.