narcolepticchicken committed on
Commit
96c57cf
·
verified ·
1 Parent(s): 5c39bf9

Upload docs/bert_eval_report.md

  1. docs/bert_eval_report.md +60 -0
docs/bert_eval_report.md ADDED
# BERT Router Evaluation Results

## Setup
- BERT: `DistilBertForSequenceClassification` (num_labels=2, binary)
- Trained on SPROUT (31K rows, 13 models) for binary success/fail prediction
- Evaluated on SWE-Router (500 tasks, 8 models)

## Results

| Policy | Success rate | Cost reduction vs. Frontier |
|--------|--------------|-----------------------------|
| Oracle | 87.0% | 80.3% |
| v10+feedback | 81.4% | -35.6% |
| bert+feedback | 81.0% | -35.3% |
| Frontier | 78.2% | baseline |
| v10 XGBoost | 56.2% | -7.4% |
| BERT | 49.2% | -0.4% |
| Always cheap | 63.2% | 95.5% |

Negative cost-reduction values mean the policy costs more than the Frontier baseline.

## Diagnosis: BERT Router is Broken

**Root cause**: BERT is a binary classifier (num_labels=2) trained to predict success/fail on SPROUT. When used for per-tier routing by prepending `[Tier X]` to the input, it ignores the tier prefix and predicts P(success) ≈ 89.5% for ALL tiers.

**Evidence**:
- All 100 sampled tasks routed to Tier 1
- P(success) is nearly identical across all tiers: 89.5% ± 0.001
- The tier prefix `[Tier X]` has no effect on BERT's predictions

**Why**: BERT was trained on SPROUT data where the input was just the problem statement, not `[Tier X] problem_statement`. The model never saw the tier prefix during training, so it ignores it.

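The evidence above can be reproduced with a small sensitivity check: score the same problem under every tier prefix and see how far P(success) moves. The sketch below is illustrative only. `tier_sensitivity`, `predict_success`, and `prefix_blind_model` are hypothetical names, and the stub stands in for the real classifier, which would need to be wrapped in a function returning a probability.

```python
from statistics import mean, pstdev
from typing import Callable, Dict, List

# Assumption: five tiers, matching the "tier (1-5)" wording used in this report.
TIERS = [1, 2, 3, 4, 5]


def tier_sensitivity(
    predict_success: Callable[[str], float],
    problems: List[str],
    tiers: List[int] = TIERS,
) -> Dict[str, float]:
    """Measure how much P(success) moves when only the tier prefix changes."""
    spreads = []
    all_probs = []
    for problem in problems:
        # Same problem text, different prefix: any difference is due to the prefix.
        probs = [predict_success(f"[Tier {t}] {problem}") for t in tiers]
        spreads.append(max(probs) - min(probs))
        all_probs.extend(probs)
    return {
        "mean_prob": mean(all_probs),
        "std_prob": pstdev(all_probs),
        "max_spread": max(spreads),
    }


def prefix_blind_model(text: str) -> float:
    # Hypothetical stub: returns the same score regardless of the tier prefix.
    return 0.895


report = tier_sensitivity(
    prefix_blind_model,
    ["fix the failing test", "add a CLI flag"],
)
print(report)
```

A `max_spread` near zero across a sample of tasks would indicate that the model is not conditioning on the prefix at all.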
## Fix Required

To make BERT work for tier routing, we need one of:

### Option A: Retrain as 5-class model
- Change num_labels from 2 to 5
- Labels: optimal tier (1-5) for each task
- This directly predicts the best tier from the problem statement

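Option A needs a per-task "optimal tier" label, which the report does not define. One possible definition, shown here purely as an assumption, is the lowest-numbered tier that solved the task (this presumes lower tier numbers are cheaper). `optimal_tier` and `to_class_index` are hypothetical helpers, and tasks that no tier solved are returned as `None` so the caller can decide whether to drop or relabel them.

```python
from typing import Dict, Optional


def optimal_tier(success_by_tier: Dict[int, bool]) -> Optional[int]:
    """Return the lowest-numbered tier that solved the task, or None if none did.

    Assumption: lower tier numbers are cheaper, so the lowest successful
    tier is treated as the "optimal" one.
    """
    solved = [tier for tier, ok in success_by_tier.items() if ok]
    return min(solved) if solved else None


def to_class_index(tier: int) -> int:
    # A 5-way classification head expects labels 0..4, so tiers 1..5 shift down by one.
    return tier - 1


example = {1: False, 2: True, 3: True, 4: True, 5: True}
label = optimal_tier(example)
print(label, to_class_index(label))
```

How unsolved tasks are handled changes the class balance, so that choice is worth recording alongside the retrained model.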
### Option B: Retrain with tier-prefixed inputs
- Keep binary classification
- Augment training data: for each task, create 5 examples `[Tier 1] problem`, `[Tier 2] problem`, etc.
- Label = success/fail at that tier
- This teaches the model to condition on the tier prefix

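The augmentation in Option B can be sketched as a small expansion step. This is a minimal illustration, assuming per-tier outcomes are available as a mapping from tier number to success. `expand_with_tier_prefix` is a hypothetical name and the sample task is made up.

```python
from typing import Dict, List, Tuple


def expand_with_tier_prefix(
    problem: str, success_by_tier: Dict[int, bool]
) -> List[Tuple[str, int]]:
    """Turn one task into one (text, label) pair per tier.

    The text carries the `[Tier X]` prefix and the label is 1 for success
    and 0 for failure at that tier.
    """
    return [
        (f"[Tier {tier}] {problem}", int(ok))
        for tier, ok in sorted(success_by_tier.items())
    ]


rows = expand_with_tier_prefix(
    "fix the failing test",
    {1: False, 2: False, 3: True, 4: True, 5: True},
)
for text, label in rows:
    print(label, text)
```

Because every task now appears once per tier, train/validation splits should be made by task rather than by row to avoid leaking the same problem text across splits.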
### Option C: Use BERT for feature extraction only
- Use BERT's `[CLS]` embedding as features for the XGBoost router
- This replaces hand-crafted keyword features with learned representations
- No need to change BERT's training

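A rough sketch of Option C follows, assuming the Hugging Face `transformers` and `torch` packages are installed. The checkpoint name `distilbert-base-uncased` is a placeholder for whichever checkpoint is actually used, and `cls_embeddings` and `combine_features` are hypothetical helpers. The example appends the embedding to existing features; dropping the hand-crafted features entirely, as the option describes, would mean passing only the embedding.

```python
from typing import List, Sequence


def cls_embeddings(
    texts: Sequence[str], model_name: str = "distilbert-base-uncased"
) -> List[List[float]]:
    """Return one [CLS] vector per input text.

    Assumption: `model_name` is a placeholder checkpoint, not the one used
    in this evaluation.
    """
    # Imported lazily so combine_features() works without torch installed.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    batch = tokenizer(list(texts), padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # shape: (batch, seq_len, hidden)
    return hidden[:, 0, :].tolist()  # position 0 holds the [CLS] token


def combine_features(
    cls_vectors: Sequence[Sequence[float]],
    handcrafted: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Concatenate hand-crafted features with the [CLS] vector, row by row."""
    if len(cls_vectors) != len(handcrafted):
        raise ValueError("feature sets must have the same number of rows")
    return [list(h) + list(c) for h, c in zip(handcrafted, cls_vectors)]


# Toy vectors stand in for real embeddings so this runs without a model download.
features = combine_features([[0.1, 0.2]], [[1.0]])
print(features)
```

The resulting rows can be passed to the XGBoost router as a plain feature matrix. Embeddings can be computed once and cached, since the encoder is frozen in this option.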
**Recommended**: Option C. It is the least risky and provides immediate value by upgrading the XGBoost feature extraction.

## v10 XGBoost Performance Issue

The v10_fixed model also underperforms expectations here (56.2% with direct routing vs. 76.6% in the previous eval). This is likely because:
1. The v10_fixed model has only 14 features (fewer than the full model)
2. The threshold of 0.65 may not be well-calibrated for this model
3. The safety floor enforcement may be too weak

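The calibration concern in item 2 can be checked with a simple reliability table that compares predicted probability with observed success per bin. This is a generic sketch, not the evaluation code used here. `reliability_table` is a hypothetical helper and the sample probabilities and outcomes are made up.

```python
from typing import Dict, List


def reliability_table(
    probs: List[float], outcomes: List[int], n_bins: int = 10
) -> List[Dict[str, float]]:
    """Bucket predictions into equal-width bins and compare them with outcomes."""
    bins: List[List[tuple]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))

    rows = []
    for i, items in enumerate(bins):
        if not items:
            continue  # skip empty bins
        rows.append(
            {
                "bin_low": i / n_bins,
                "count": len(items),
                "mean_pred": sum(p for p, _ in items) / len(items),
                "observed": sum(y for _, y in items) / len(items),
            }
        )
    return rows


# Made-up predictions and outcomes, for illustration only.
table = reliability_table([0.75, 0.75, 0.95, 0.95], [1, 0, 1, 1])
for row in table:
    print(row)
```

If the observed success rate in the bins around 0.65 is well below the mean prediction, a fixed 0.65 cutoff would be accepting tasks that the model is overconfident about.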
This needs investigation: the original v10 eval used a different model bundle.