# Speculative Tool Actions: Ablation Report (v2)

## Experiment Design

We test whether speculative decoding can be adapted from token prediction to agent action prediction.

**Task:** Given a conversation context (multi-turn user/assistant/tool), predict the next agent action from 9 classes: `tool_call`, `retrieval`, `file_read`, `file_write`, `repair`, `verifier`, `ask_clarification`, `final_answer`, `BLOCKED`.

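For concreteness, a single eval example might look like the following sketch (field names are illustrative; the actual dataset schema is not shown in this report):

```python
example = {
    "messages": [  # multi-turn conversation context
        {"role": "user", "content": "Can you check what's in config.yaml?"},
        {"role": "assistant", "content": "Sure, let me open it."},
    ],
    "label": "file_read",  # one of the 9 action classes above
}
```
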
**Models:**
- **Cheap proposer:** Qwen3-1.7B + LoRA, fine-tuned for 2 epochs on synthetic action-prediction data (96.7% train accuracy)
- **Strong model:** Qwen3-8B, frozen (zero-shot)
- **Reward model:** Qwen3-4B + LoRA, trained on preference pairs over which actions are correct (70% train accuracy)

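A minimal sketch of the proposer's LoRA setup with `peft` (rank, alpha, and target modules are illustrative assumptions; the report does not state them):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Hyperparameters below are assumptions, not the values used in the report.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B")
proposer = get_peft_model(base, lora_cfg)  # only the adapter weights train
```
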
**Configurations:**

| Config | Description | Expected Cost (normalized) |
|--------|-------------|----------------------------|
| A | Strong 8B model only | 1.00 |
| B | Cheap 1.7B proposer only | 0.15 |
| C | Cheap proposes → 8B verifies (ACCEPT/REJECT) | 0.25–1.25 |
| D | Cheap proposes → 4B reward model scores | 0.25 |
| E | Cheap generates 3 proposals → reward model picks best | 0.75 |

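The routing logic behind configs C and D, sketched below (`predict_action`, `verify`, and `score` are hypothetical wrappers around the three models; per-step costs are read off the table above):

```python
def config_c(context, cheap, strong):
    """Config C: cheap proposer, strong 8B ACCEPT/REJECT verifier."""
    proposal = cheap.predict_action(context)    # ~0.15x cost
    verdict = strong.verify(context, proposal)  # short verification pass
    if verdict == "ACCEPT":
        return proposal                         # best case: ~0.25x total
    return strong.predict_action(context)       # worst case: ~1.25x total

def config_d(context, cheap, reward, threshold=-1.0):
    """Config D: cheap proposer, 4B reward model gate (flat ~0.25x cost).

    The report does not state the fallback on rejection; the flat cost and
    accuracy identical to Config B suggest it stays with the cheap model,
    which is what this sketch assumes.
    """
    proposal = cheap.predict_action(context)
    if reward.score(context, proposal) > threshold:
        return proposal
    return cheap.predict_action(context)  # assumed cheap fallback (re-sample)
```
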
## Results (200 eval examples, dual A10G)

| Config | Accuracy | Cost | ×Random (0.111) | ×Majority (0.240) | Accept Rate |
|--------|----------|------|-----------------|-------------------|-------------|
| A (8B) | **0.400** | 1.000 | 3.6× | 1.7× | — |
| B (1.7B) | **0.510** | 0.150 | 4.6× | 2.1× | — |
| C (cheap+strong) | **0.400** | 1.250 | 3.6× | 1.7× | 0.000 |
| D (cheap+reward) | **0.510** | 0.250 | 4.6× | 2.1× | 0.215 |
| E (multi) | pending | — | — | — | — |

The ×Random and ×Majority columns divide accuracy by the random baseline (1/9 ≈ 0.111) and the majority-class baseline (0.240); e.g. 0.510 / 0.111 ≈ 4.6×.

### Key Findings

**1. Cheap model outperforms strong model (B > A: 51% vs 40%)**

The fine-tuned 1.7B proposer learned the task from synthetic data. The frozen 8B model, despite having 4.7× the parameters, had zero training on the task and defaults to generic completions. **Fine-tuning matters more than scale for this task.**

**2. Verifier (C): 0% accept rate — completely useless**

The 8B ACCEPT/REJECT verifier rejected EVERY proposal from the cheap model, then fell back to its own (worse) predictions. Result: same accuracy as A (40%) but higher cost (1.25 vs 1.00). The verifier is either:
- Telling the truth (it genuinely believes all proposals are wrong), or
- Collapsed to always outputting "REJECT".

Either way, it provides zero benefit.

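One way to separate those two hypotheses (a sketch; `verify` is the same hypothetical wrapper as in the config sketch above) is to probe the verifier with gold actions that are known to be correct:

```python
def gold_accept_rate(verifier, eval_set):
    """Probe whether the verifier discriminates or has collapsed to REJECT.

    eval_set: list of (context, gold_action) pairs with known-correct labels.
    A rate near 0 even on gold actions means the verifier has collapsed;
    a high rate on gold but 0 on the proposer's outputs means it is
    (perhaps over-harshly) discriminating.
    """
    accepts = sum(
        verifier.verify(context, gold) == "ACCEPT"
        for context, gold in eval_set
    )
    return accepts / len(eval_set)
```
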
**3. Reward model (D): 21.5% accept rate, all scores negative (mean -1.52)**

The 4B reward model scores all proposals negatively, with only ~1/5 above the acceptance threshold of -1.0. But since Configs B and D have identical accuracy (51%), the reward model is not effectively filtering: it just adds cost without changing outcomes. The accepted proposals are no more accurate than the rejected ones.

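The check behind that last claim, as a sketch (the per-example fields are an assumed schema, not the repo's actual results format):

```python
def accuracy_by_acceptance(records, threshold=-1.0):
    """Split proposals by reward score and compare their accuracies.

    records: list of dicts with keys "score", "proposal", "gold" (assumed).
    If accepted and rejected proposals are about equally accurate, the
    reward model's scores carry no signal and the gate is pure overhead.
    """
    def acc(rows):
        return sum(r["proposal"] == r["gold"] for r in rows) / max(len(rows), 1)

    accepted = [r for r in records if r["score"] > threshold]
    rejected = [r for r in records if r["score"] <= threshold]
    return acc(accepted), acc(rejected)
```
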
**4. Cost-quality frontier is degenerate**

The only sensible point on the frontier is **Config B**: 51% accuracy at 0.15× cost. Configs A, C, and D are all dominated — they cost more and deliver equal or worse accuracy.

## Why This Task Is Challenging

The eval dataset uses **real conversational messages** (from synthetic templates, but in natural chat format), while the proposer was fine-tuned on the **`Action: <type>\n<reason>`** format. The training/eval distribution mismatch (illustrated in the sketch after this list) means:

- The proposer learned to generate `Action: tool_call\nReason: ...` when given `"Predict the next action for:\n\nuser: ... assistant: ..."`
- The 8B model had no such training and produces natural-language continuations instead
- The verifier was trained on preference pairs, not on ACCEPT/REJECT decisions

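To make the mismatch concrete, the two formats side by side (the strings are illustrative reconstructions from the description above, not dumps from the dataset):

```python
# Training format: plain completion-style prompt and target.
train_prompt = "Predict the next action for:\n\nuser: ... assistant: ..."
train_target = "Action: tool_call\nReason: ..."

# Eval format: the same conversation as natural chat messages.
eval_messages = [
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."},
]
# Rendered through tokenizer.apply_chat_template(eval_messages, tokenize=False),
# this becomes special-token-delimited chat markup the proposer never saw in training.
```
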
**The speculative decoding architecture cannot fix a weak-signal problem.** When the task itself is poorly learned, adding complex routing (verification, reranking) only adds cost without improving accuracy.

## Recommendations

1. **Fix the training data:** Generate SFT data using the same chat-template format as the eval dataset. Use `tokenizer.apply_chat_template()` consistently (see the sketch after this list).
2. **Train the strong model:** Fine-tune Qwen3-8B on the same SFT data for a fair comparison of proposer size.
3. **Train the verifier properly:** The ACCEPT/REJECT verifier needs training examples that pair correct proposals with ACCEPT decisions.
4. **Use a real agent benchmark:** The synthetic action-prediction task has limited ecological validity. A task like BFCL (Berkeley Function Calling Leaderboard) would provide realistic tool-calling traces.

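A minimal sketch of recommendation 1, assuming SFT examples stored as message lists (the `to_sft_example` helper and its input schema are illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")

def to_sft_example(messages, action_label):
    """Render a training example with the same chat template used at eval time.

    messages: list of {"role": ..., "content": ...} dicts (assumed schema).
    The completion keeps the `Action: <type>` target format.
    """
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,  # end with the assistant-turn opener
    )
    return {"prompt": prompt, "completion": f"Action: {action_label}"}
```
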
## Deliverables

- **Dataset:** `narcolepticchicken/speculative-actions-eval` (500 conversational examples with action labels)
- **Proposer:** `narcolepticchicken/speculative-proposer-qwen3-1.7b` (LoRA adapter)
- **Verifier:** `narcolepticchicken/speculative-verifier-qwen3-4b` (LoRA adapter for the reward model)
- **Eval runner:** `narcolepticchicken/speculative-tool-actions` (eval scripts + results)
- **Results:** `narcolepticchicken/speculative-tool-actions/eval_results_v2.json`