narcolepticchicken committed
Commit 02af344 · verified · 1 Parent(s): ca60718

Upload ABLATION_REPORT_v2.md

Files changed (1):
  ABLATION_REPORT_v2.md +35 -58
ABLATION_REPORT_v2.md CHANGED
@@ -1,79 +1,56 @@
- # Speculative Tool Actions: Ablation Report (v2)
-
- ## Experiment Design
-
- We test whether speculative decoding can be adapted from token prediction to agent action prediction.
-
- **Task:** Given a conversation context (multi-turn user/assistant/tool), predict the next agent action from 9 classes: `tool_call`, `retrieval`, `file_read`, `file_write`, `repair`, `verifier`, `ask_clarification`, `final_answer`, `BLOCKED`.
-
- **Models:**
- - **Cheap proposer:** Qwen3-1.7B + LoRA fine-tuned for 2 epochs on synthetic action-prediction data (96.7% train accuracy)
- - **Strong model:** Qwen3-8B frozen (zero-shot)
- - **Reward model:** Qwen3-4B + LoRA trained on preference pairs over which actions are correct (70% train accuracy)
-
- **Configurations:**
-
- | Config | Description | Expected Cost (normalized) |
- |--------|-------------|----------------------------|
- | A | Strong 8B model only | 1.00 |
- | B | Cheap 1.7B proposer only | 0.15 |
- | C | Cheap proposes → 8B verifies (ACCEPT/REJECT) | 0.25–1.25 |
- | D | Cheap proposes → 4B reward model scores | 0.25 |
- | E | Cheap generates 3 proposals → reward model picks best | 0.75 |
-
- ## Results (200 eval examples, dual A10G)
-
- | Config | Accuracy | Cost | xRandom (0.111) | xMajority (0.240) | Accept Rate |
- |--------|----------|------|-----------------|-------------------|-------------|
- | A (8B) | **0.400** | 1.000 | 3.6× | 1.7× | n/a |
- | B (1.7B) | **0.510** | 0.150 | 4.6× | 2.1× | n/a |
- | C (C+S) | **0.400** | 1.250 | 3.6× | 1.7× | 0.000 |
- | D (C+R) | **0.510** | 0.250 | 4.6× | 2.1× | 0.215 |
- | E (multi) | pending | n/a | n/a | n/a | n/a |
-
- ### Key Findings
-
- **1. Cheap model outperforms strong model (B > A: 51% vs 40%)**
-
- The fine-tuned 1.7B proposer learned the task from synthetic data. The frozen 8B model, despite having 4.7× more parameters, had zero training on the task and defaults to generic completions. **Fine-tuning matters more than scale for this task.**
-
- **2. Verifier (C): 0% accept rate, completely useless**
-
- The 8B ACCEPT/REJECT verifier rejected every proposal from the cheap model, then fell back to its own (worse) predictions. Result: the same accuracy as A (40%) at higher cost (1.25 vs 1.00). The verifier is either:
- - telling the truth (it genuinely believes all proposals are wrong), or
- - collapsed to always outputting "REJECT".
-
- Either way, it provides zero benefit.
-
- **3. Reward model (D): 21.5% accept rate, all scores negative (mean -1.52)**
-
- The 4B reward model scores all proposals negatively, with only ~1/5 above the acceptance threshold of -1.0. Since Configs B and D have identical accuracy (51%), the reward model is not effectively filtering; it just adds cost without changing outcomes. The accepted proposals are no more accurate than the rejected ones.
-
- **4. Cost-quality frontier is degenerate**
-
- The only sensible point on the frontier is **Config B**: 51% accuracy at 0.15× cost. Configs A, C, and D are all dominated: they cost more and deliver equal or worse accuracy.
-
- ## Why This Task Is Challenging
-
- The eval dataset uses **real conversational messages** (from synthetic templates, but in natural chat format), while the proposer was fine-tuned on the **"Action: <type>\n<reason>" format**. The training/eval distribution mismatch means:
-
- - The proposer learned to generate `Action: tool_call\nReason: ...` when given `"Predict the next action for:\n\nuser: ... assistant: ..."`
- - The 8B model had no such training and produces natural-language continuations instead
- - The verifier was trained on preference pairs, not on ACCEPT/REJECT decisions
-
- **The speculative decoding architecture cannot fix a weak-signal problem.** When the task itself is poorly learned, adding complex routing (verification, reranking) only adds cost without improving accuracy.
-
- ## Recommendations
-
- 1. **Fix the training data:** Generate SFT data using the same chat-template format as the eval dataset. Use `tokenizer.apply_chat_template()` consistently.
- 2. **Train the strong model:** Fine-tune Qwen3-8B on the same SFT data for a fair comparison of proposer size.
- 3. **Train the verifier properly:** The ACCEPT/REJECT verifier needs training examples showing correct proposals paired with ACCEPT decisions.
- 4. **Use a real agent benchmark:** The synthetic action-prediction task has limited ecological validity. A task like BFCL (Berkeley Function Calling Leaderboard) would provide realistic tool-calling traces.
-
  ## Deliverables

- - **Dataset:** `narcolepticchicken/speculative-actions-eval` (500 conversational examples with action labels)
- - **Proposer:** `narcolepticchicken/speculative-proposer-qwen3-1.7b` (LoRA adapter)
- - **Verifier:** `narcolepticchicken/speculative-verifier-qwen3-4b` (LoRA adapter for reward model)
- - **Eval runner:** `narcolepticchicken/speculative-tool-actions` (eval scripts + results)
- - **Results:** `narcolepticchicken/speculative-tool-actions/eval_results_v2.json`

+ # Speculative Tool Actions: Final Results
+
+ ## Summary
+
+ We tested whether speculative decoding transfers from token prediction to agent action prediction. A cheap model (Qwen3-1.7B + LoRA) proposes the next agent action; verifiers decide whether to use the proposal or fall back to an expensive 8B model.
+
+ **Finding:** The speculative architecture failed to improve on the cheap proposer alone. Config B (1.7B cheap only, 51.0% accuracy, 0.15× cost) dominates every other configuration.
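+
+ For reference, a minimal sketch of the Config C routing logic under test. The `propose`, `verify`, and `strong` callables stand in for the actual model calls, and the cost split (0.15 propose + 0.10 verify) is inferred from the cost table below, not stated in the repo:
+
+ ```python
+ CHEAP_COST, VERIFY_COST, STRONG_COST = 0.15, 0.10, 1.00  # normalized costs (assumed split)
+
+ def speculative_step(context, propose, verify, strong):
+     """Cheap model proposes; strong model verifies; fall back to strong on REJECT."""
+     proposal = propose(context)                  # 1.7B + LoRA proposer
+     cost = CHEAP_COST + VERIFY_COST
+     if verify(context, proposal):                # 8B ACCEPT/REJECT check
+         return proposal, cost                    # best case: 0.25x cost
+     return strong(context), cost + STRONG_COST   # worst case: 1.25x cost
+ ```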
+
+ ## Full Results
+
+ | Config | Accuracy | Cost | xRand | xMaj | Notes |
+ |--------|----------|------|-------|------|-------|
+ | A | 0.400 | 1.000 | 3.6× | 1.7× | 8B strong (frozen, zero-shot) |
+ | B | **0.510** | **0.150** | **4.6×** | **2.1×** | 1.7B cheap (LoRA fine-tuned) |
+ | C | 0.400 | 1.250 | 3.6× | 1.7× | cheap + 8B verifier → 0% accept |
+ | D | 0.510 | 0.250 | 4.6× | 2.1× | cheap + 4B RM → 21.5% accept |
+ | E | 0.420 | 0.750 | 3.8× | 1.8× | multi-proposal (n=3) + RM |
+
+ Random baseline: 11.1% | Majority baseline (final_answer): 24.0%
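+
+ The xRand/xMaj columns are simply accuracy divided by the corresponding baseline. A worked check for Config B:
+
+ ```python
+ acc_b = 0.510
+ x_rand = acc_b / (1 / 9)   # 9 action classes -> random baseline 0.111
+ x_maj  = acc_b / 0.240     # always predicting final_answer
+ print(f"{x_rand:.1f}x {x_maj:.1f}x")  # 4.6x 2.1x
+ ```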
+
+ ## Cost-Quality Frontier
+
+ ```
+ Config B: cost=0.150  acc=0.510  ★ PARETO OPTIMAL
+ Config D: cost=0.250  acc=0.510  (dominated by B: same accuracy, higher cost)
+ Config E: cost=0.750  acc=0.420  (dominated by B on both axes)
+ Config A: cost=1.000  acc=0.400  (dominated by B on both axes)
+ Config C: cost=1.250  acc=0.400  (dominated by B on both axes)
+ ```
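+
+ The dominance claims can be checked mechanically. A small sketch, with the (cost, accuracy) pairs copied from the table above:
+
+ ```python
+ results = {"A": (1.000, 0.400), "B": (0.150, 0.510), "C": (1.250, 0.400),
+            "D": (0.250, 0.510), "E": (0.750, 0.420)}  # config -> (cost, accuracy)
+
+ def pareto_front(results):
+     """A config survives if no other config is at least as cheap AND at least as
+     accurate, with at least one strict improvement."""
+     return [a for a, (ca, xa) in results.items()
+             if not any(cb <= ca and xb >= xa and (cb, xb) != (ca, xa)
+                        for b, (cb, xb) in results.items() if b != a)]
+
+ print(pareto_front(results))  # ['B']
+ ```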
+
+ ## Why It Failed
+
+ 1. **Training-eval mismatch:** The proposer was fine-tuned on the `"Action: <type>\n<reason>"` format, while the eval uses natural conversational `messages`. The models haven't learned the real task structure (see the format sketch after this list).
+
+ 2. **8B verifier (Config C) rejected every proposal:** It never learned to discriminate; it says REJECT to everything and falls back to its own (worse) zero-shot predictions.
+
+ 3. **4B reward model (Configs D/E) scores everything negatively:** Mean score -1.52. It weakly prefers some proposals (21.5% above the threshold), but the thresholding is arbitrary: accepted and rejected proposals have identical downstream accuracy.
+
+ 4. **Multi-proposal (Config E) was worse than single proposal (Config B):** 42% vs 51%. Sampling at temperature 0.8 didn't add useful diversity: the model generates the same action most of the time, so the "diverse" proposals collapse to 1-2 options, and the RM can't reliably pick the right one.
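+
+ To make the mismatch in item 1 concrete, here is a sketch of the two formats. The field names and message contents are illustrative, reconstructed from this report rather than copied from the dataset:
+
+ ```python
+ # Training format: one flat string in, "Action: <type>\n<reason>" out.
+ train_input = ("Predict the next action for:\n\n"
+                "user: Find the latest commit on main.\n"
+                "assistant: Let me check the repository.")
+ train_target = "Action: tool_call\nReason: requires a repository lookup"
+
+ # Eval format: natural chat-template messages plus an action label.
+ eval_row = {
+     "messages": [
+         {"role": "user", "content": "Find the latest commit on main."},
+         {"role": "assistant", "content": "Let me check the repository."},
+     ],
+     "label": "tool_call",
+ }
+ ```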
+
+ ## What Would Fix This
+
+ 1. **Train on chat-template format:** Generate SFT data using `tokenizer.apply_chat_template()` with real conversational turns, not the flat "Action: X" format. A sketch follows this list.
+
+ 2. **Train the 8B model:** Fine-tune Qwen3-8B on the same SFT data for a fair scale comparison (the current "strong model" is frozen, zero-shot).
+
+ 3. **Train the verifier properly:** The ACCEPT/REJECT verifier needs positive examples of correct proposals being accepted. Train it as a binary classifier on (proposal, context) → ACCEPT/REJECT with balanced labels, as sketched below.
+
+ 4. **Use a real agent benchmark:** A standard eval like BFCL (Berkeley Function Calling Leaderboard) would provide realistic and challenging action-prediction traces.
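+
+ A minimal sketch of recommendation 1, assuming the eval rows carry the `messages`/`label` fields shown earlier; `apply_chat_template` is the standard Transformers call, the rest is illustrative:
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")
+
+ def to_sft_example(row):
+     """Render an eval-style row into the exact prompt the proposer will see at eval time."""
+     prompt = tok.apply_chat_template(
+         row["messages"],             # same conversational turns as the eval set
+         tokenize=False,
+         add_generation_prompt=True,  # append the assistant header for generation
+     )
+     return {"prompt": prompt, "completion": row["label"]}  # target = bare action class
+ ```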
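+
+ And for recommendation 3, one way to build balanced ACCEPT/REJECT pairs from the same rows (again a sketch, not code from this repo):
+
+ ```python
+ import random
+
+ ACTIONS = ["tool_call", "retrieval", "file_read", "file_write", "repair",
+            "verifier", "ask_clarification", "final_answer", "BLOCKED"]
+
+ def make_verifier_pairs(rows):
+     """Each row yields one ACCEPT (the gold action) and one REJECT (a random wrong
+     action), so the verifier sees both decisions in equal proportion."""
+     pairs = []
+     for row in rows:
+         gold = row["label"]
+         wrong = random.choice([a for a in ACTIONS if a != gold])
+         pairs.append({"context": row["messages"], "proposal": gold,  "label": "ACCEPT"})
+         pairs.append({"context": row["messages"], "proposal": wrong, "label": "REJECT"})
+     return pairs
+ ```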

  ## Deliverables

+ - **Dataset:** https://huggingface.co/datasets/narcolepticchicken/speculative-actions-eval (500 examples)
+ - **Proposer:** https://huggingface.co/narcolepticchicken/speculative-proposer-qwen3-1.7b
+ - **Verifier:** https://huggingface.co/narcolepticchicken/speculative-verifier-qwen3-4b
+ - **Code + results:** https://huggingface.co/narcolepticchicken/speculative-tool-actions