agent-cost-optimizer / docs /bert_eval_report.md

narcolepticchicken

Upload docs/bert_eval_report.md

96c57cf verified about 12 hours ago

	# BERT Router Evaluation Results

	## Setup
	- BERT: `DistilBertForSequenceClassification` (num_labels=2, binary)
	- Trained on SPROUT (31K rows, 13 models) for binary success/fail prediction
	- Evaluated on SWE-Router (500 tasks, 8 models)

	## Results

	| Policy | Success | Cost reduction (vs. Frontier) |
	|--------|---------|-------------------------------|
	| Oracle | 87.0% | 80.3% |
	| v10+feedback | 81.4% | -35.6% |
	| bert+feedback | 81.0% | -35.3% |
	| Frontier | 78.2% | baseline |
	| v10 XGBoost | 56.2% | -7.4% |
	| BERT | 49.2% | -0.4% |
	| Always cheap | 63.2% | 95.5% |
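
For reference, the two columns could be computed from per-task records roughly as follows. This is a minimal sketch, not the evaluation code: the `TaskResult` fields and the definition of cost reduction as `1 - policy_cost / frontier_cost` are assumptions. Under that definition, a negative value means the policy spent more than the Frontier baseline.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskResult:
    # Hypothetical record layout; the real eval may store different fields.
    success: bool  # did the routed model solve the task?
    cost: float    # cost of the run in arbitrary units


def success_rate(results: List[TaskResult]) -> float:
    """Fraction of tasks solved by the policy."""
    return sum(r.success for r in results) / len(results)


def cost_reduction(policy: List[TaskResult], baseline: List[TaskResult]) -> float:
    """Assumed definition: 1 - (total policy cost / total baseline cost)."""
    policy_cost = sum(r.cost for r in policy)
    baseline_cost = sum(r.cost for r in baseline)
    return 1.0 - policy_cost / baseline_cost


# Toy data: a cheap policy and a policy that escalates after failures.
frontier = [TaskResult(True, 10.0), TaskResult(False, 10.0)]
cheap = [TaskResult(True, 1.0), TaskResult(False, 1.0)]
escalating = [TaskResult(True, 10.0), TaskResult(True, 20.0)]
```

With the toy data, `cost_reduction(cheap, frontier)` is positive and `cost_reduction(escalating, frontier)` is negative, which matches the sign pattern of the feedback rows in the table.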

	## Diagnosis: BERT Router is Broken

	Root cause: BERT is a binary classifier (num_labels=2) trained to predict success/fail on SPROUT. When used for per-tier routing by prepending `[Tier X]` to the input, it ignores the tier prefix and predicts P(success) ≈ 89.5% for ALL tiers.

	Evidence:
	- All 100 sampled tasks routed to Tier 1
	- P(success) is nearly identical across all tiers: 89.5% ± 0.001
	- The tier prefix `[Tier X]` has no effect on BERT's predictions

	Why: BERT was trained on SPROUT data where the input was just the problem statement, not `[Tier X] problem_statement`. The model never saw the tier prefix during training, so it ignores it.
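
The tier-insensitivity check described above can be reproduced with a small probe. The sketch below assumes a hypothetical `predict` callable that maps an input string to P(success); in practice it would wrap the trained classifier. It prepends each tier prefix to the same task and reports the spread between the highest and lowest prediction.

```python
from typing import Callable, Dict, Iterable, Tuple


def tier_spread(
    predict: Callable[[str], float],
    task: str,
    tiers: Iterable[int] = range(1, 6),
) -> Tuple[Dict[int, float], float]:
    """Return per-tier P(success) and the max-min spread for one task."""
    probs = {t: predict(f"[Tier {t}] {task}") for t in tiers}
    spread = max(probs.values()) - min(probs.values())
    return probs, spread


def prefix_blind(text: str) -> float:
    # Stand-in for a model that ignores the tier prefix entirely.
    return 0.895


def prefix_aware(text: str) -> float:
    # Stand-in for a model whose prediction rises with the tier number.
    tier = int(text.split("]")[0].removeprefix("[Tier "))
    return 0.5 + 0.1 * tier


blind_probs, blind_spread = tier_spread(prefix_blind, "Fix the failing unit test")
aware_probs, aware_spread = tier_spread(prefix_aware, "Fix the failing unit test")
```

A spread near zero across many tasks, as with `prefix_blind`, is the signature of the failure reported here.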

	## Fix Required

	To make BERT work for tier routing, we need one of:

	### Option A: Retrain as 5-class model
	- Change num_labels from 2 to 5
	- Labels: optimal tier (1-5) for each task
	- This directly predicts the best tier from the problem statement
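
One way to derive the Option A labels is sketched below. It assumes that lower tier numbers are cheaper, that the "optimal" tier is the cheapest one that succeeded, and that tasks no tier solved fall back to the highest tier; none of these choices are specified in this report. The helper names are hypothetical.

```python
from typing import Dict


def optimal_tier(outcomes: Dict[int, bool]) -> int:
    """Return the lowest tier that succeeded, or the highest tier if none did.

    `outcomes` maps tier number -> whether that tier solved the task.
    """
    winners = [tier for tier, ok in outcomes.items() if ok]
    if winners:
        return min(winners)
    return max(outcomes)  # assumed fallback: escalate to the top tier


def to_class_index(tier: int) -> int:
    """Map tiers 1-5 to class indices 0-4 for a num_labels=5 classifier."""
    return tier - 1


example = {1: False, 2: True, 3: True, 4: True, 5: True}
label = to_class_index(optimal_tier(example))
```

The fallback matters for class balance: if many tasks are unsolved at every tier, the top-tier class absorbs all of them.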

	### Option B: Retrain with tier-prefixed inputs
	- Keep binary classification
	- Augment training data: for each task, create 5 examples `[Tier 1] problem`, `[Tier 2] problem`, etc.
	- Label = success/fail at that tier
	- This teaches the model to condition on the tier prefix
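
The augmentation for Option B could look like the following minimal sketch. It assumes per-tier outcomes are available for each task (which requires mapping SPROUT's 13 models onto tiers, a step not described here); the function name and record layout are hypothetical.

```python
from typing import Dict, List


def tier_prefixed_examples(problem: str, outcomes: Dict[int, bool]) -> List[dict]:
    """Expand one task into one binary training example per tier.

    `outcomes` maps tier number -> whether that tier solved the task.
    """
    return [
        {"text": f"[Tier {tier}] {problem}", "label": int(ok)}
        for tier, ok in sorted(outcomes.items())
    ]


rows = tier_prefixed_examples(
    "Fix the off-by-one error in pagination",
    {1: False, 2: False, 3: True, 4: True, 5: True},
)
```

Because the same problem text now appears with different labels, the prefix is the only signal that separates them, which is what forces the model to attend to it.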

	### Option C: Use BERT for feature extraction only
	- Use BERT's [CLS] embedding as features for the XGBoost router
	- This replaces hand-crafted keyword features with learned representations
	- No need to change BERT's training
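
A rough sketch of Option C follows. `embedding_features` builds a feature matrix from any text-to-vector function, and `distilbert_cls_embedder` shows one way to obtain the first-token (`[CLS]`) hidden state with the `transformers` library. The function names and the `model_name` argument are hypothetical, and the heavy imports are deferred so the first helper can be used without them.

```python
from typing import Callable, List, Sequence


def embedding_features(
    texts: Sequence[str],
    embed: Callable[[str], Sequence[float]],
) -> List[List[float]]:
    """Build one feature row per text from an embedding function."""
    return [list(embed(text)) for text in texts]


def distilbert_cls_embedder(model_name: str) -> Callable[[str], List[float]]:
    """Return a function mapping text -> [CLS] hidden state.

    Requires `torch` and `transformers`; imported lazily on first call.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()

    def embed(text: str) -> List[float]:
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state
        # Position 0 of the sequence is the [CLS] token.
        return hidden[0, 0].tolist()

    return embed


def toy_embed(text: str) -> List[float]:
    # Stand-in embedder for illustration: text length plus a bias term.
    return [float(len(text)), 1.0]


X = embedding_features(["ab", "abc"], toy_embed)
```

The resulting rows can be passed to the existing XGBoost training code in place of the keyword features, for example as the `X` argument of `xgboost.XGBClassifier.fit(X, y)`.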

	Recommended: Option C. It is the least risky, since it requires no BERT retraining, and it provides immediate value by upgrading the XGBoost feature extraction.

	## v10 XGBoost Performance Issue

	The v10_fixed model also underperforms expectations here: 56.2% success when used directly, versus 76.6% in the previous eval. This is likely because:
	1. The v10_fixed model has only 14 features (fewer than the full model)
	2. The threshold of 0.65 may not be well-calibrated for this model
	3. The safety floor enforcement may be too weak

	This needs investigation. Note that the original v10 eval used a different model bundle, so the two results are not directly comparable.
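
To check whether the 0.65 threshold or the safety floor explains the gap, a simple sweep may help. The sketch below assumes a routing rule that this report does not spell out: pick the lowest tier at or above a minimum tier (`floor`) whose predicted P(success) meets the threshold, and fall back to the highest tier otherwise. Both `route` and `sweep` are hypothetical helpers.

```python
from typing import Dict, Iterable, List, Tuple

Task = Tuple[Dict[int, float], Dict[int, bool]]  # (predicted probs, true outcomes)


def route(probs: Dict[int, float], threshold: float = 0.65, floor: int = 1) -> int:
    """Assumed rule: lowest tier >= floor with P(success) >= threshold."""
    for tier in sorted(probs):
        if tier >= floor and probs[tier] >= threshold:
            return tier
    return max(probs)  # nothing cleared the threshold: use the top tier


def sweep(tasks: List[Task], thresholds: Iterable[float], floor: int = 1) -> Dict[float, float]:
    """Success rate of the routed tier for each candidate threshold."""
    report = {}
    for threshold in thresholds:
        hits = sum(outcomes[route(probs, threshold, floor)] for probs, outcomes in tasks)
        report[threshold] = hits / len(tasks)
    return report


toy_tasks = [
    ({1: 0.7, 2: 0.9}, {1: False, 2: True}),
    ({1: 0.6, 2: 0.8}, {1: False, 2: True}),
]
result = sweep(toy_tasks, [0.65, 0.75])
```

In the toy data the model is overconfident at Tier 1, so raising the threshold improves the success rate; the same sweep on held-out tasks would show whether 0.65 is a reasonable operating point for v10_fixed.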