# Speculative Tool Actions: Final Results

## Summary

We tested whether speculative decoding transfers from token prediction to agent action prediction. A cheap model (Qwen3-1.7B + LoRA) proposes the next agent action; verifiers decide whether to use the proposal or fall back to an expensive 8B model.
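
This propose-then-verify loop is the whole architecture under test. A minimal sketch, with hypothetical `propose`, `verify`, and `strong_predict` callables standing in for the real model calls; the cost constants are inferred from the report's cost column (B = 0.15, A = 1.00, C's accepted path = 0.25):

```python
# Speculative action routing (the Config C/D shape): keep the cheap
# proposal if the verifier accepts it, otherwise pay for the strong model.
# propose / verify / strong_predict are hypothetical stand-ins for the
# 1.7B proposer, the verifier or reward model, and the frozen 8B model.
CHEAP_COST, VERIFY_COST, STRONG_COST = 0.15, 0.10, 1.00

def next_action(context, propose, verify, strong_predict):
    proposal = propose(context)           # cheap 1.7B proposal
    cost = CHEAP_COST + VERIFY_COST
    if verify(context, proposal):         # ACCEPT -> use the cheap action
        return proposal, cost
    return strong_predict(context), cost + STRONG_COST  # REJECT -> 8B fallback
```

With a 0% accept rate (Config C) every query pays the full 1.25; with 100% acceptance the loop would cost 0.25, which is the best case this architecture can reach.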

**Finding:** The speculative architecture failed to improve on the cheap proposer alone. Config B (1.7B cheap only, 51.0% accuracy, 0.15× cost) dominates every other configuration.

## Setup

- **Cheap proposer:** Qwen3-1.7B + LoRA, fine-tuned for 2 epochs on synthetic action-prediction data (96.7% train accuracy)
- **Strong model:** Qwen3-8B, frozen (zero-shot)
- **Reward model:** Qwen3-4B + LoRA, trained on preference pairs over which actions are correct (70% train accuracy)

## Full Results

| Config | Accuracy | Cost | xRand | xMaj | Notes |
|--------|----------|------|-------|------|-------|
| A | 0.400 | 1.000 | 3.6× | 1.7× | 8B strong (frozen, zero-shot) |
| B | **0.510** | **0.150** | **4.6×** | **2.1×** | 1.7B cheap (LoRA fine-tuned) |
| C | 0.400 | 1.250 | 3.6× | 1.7× | cheap + 8B verifier → 0% accept |
| D | 0.510 | 0.250 | 4.6× | 2.1× | cheap + 4B RM → 21.5% accept |
| E | 0.420 | 0.750 | 3.8× | 1.8× | multi-proposal (n=3) + RM |

Cost is normalized to Config A = 1.000. xRand and xMaj are accuracy multiples of the random baseline (11.1%) and the majority-class baseline (always predict `final_answer`, 24.0%).
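
The multiplier columns are plain ratios and can be reproduced directly; a quick check in Python:

```python
# Reproduce the xRand / xMaj columns: accuracy divided by the random
# (11.1%) and majority-class (24.0%) baselines, rounded to one decimal.
RAND, MAJ = 0.111, 0.240
acc = {"A": 0.400, "B": 0.510, "C": 0.400, "D": 0.510, "E": 0.420}
for cfg, a in acc.items():
    print(f"{cfg}: xRand={a / RAND:.1f}x  xMaj={a / MAJ:.1f}x")
# e.g. B: xRand=4.6x  xMaj=2.1x, matching the table.
```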

## Cost-Quality Frontier

```
Config B: cost=0.150 acc=0.510 ← PARETO OPTIMAL
Config D: cost=0.250 acc=0.510 (dominated by B: same accuracy, higher cost)
Config E: cost=0.750 acc=0.420 (dominated by B on both axes)
Config A: cost=1.000 acc=0.400 (dominated by B on both axes)
Config C: cost=1.250 acc=0.400 (dominated by B on both axes)
```
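
The frontier is easy to verify mechanically. A minimal dominance check (standard Pareto logic, not code from the repo):

```python
# A config is dominated if some other config is at least as good on both
# axes (lower-or-equal cost, higher-or-equal accuracy) and strictly
# better on at least one of them.
points = {"A": (1.000, 0.400), "B": (0.150, 0.510), "C": (1.250, 0.400),
          "D": (0.250, 0.510), "E": (0.750, 0.420)}

def dominated(name):
    c, a = points[name]
    return any(c2 <= c and a2 >= a and (c2 < c or a2 > a)
               for other, (c2, a2) in points.items() if other != name)

for name in sorted(points, key=lambda n: points[n][0]):
    print(name, "dominated" if dominated(name) else "PARETO OPTIMAL")
# Only B survives: nothing is both cheaper and at least as accurate.
```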

## Why It Failed

1. **Training-eval mismatch:** The proposer was fine-tuned on the bare `"Action: <type>\n<reason>"` format, while the eval uses natural conversational `messages`. The models never learned the real task structure (the mismatch is sketched after this list).

2. **The 8B verifier (Config C) rejected every proposal:** It never learned to discriminate; it answers REJECT to everything and falls back to its own (worse) zero-shot predictions.

3. **The 4B reward model (Configs D and E) scores everything negatively:** The mean score is -1.52. It weakly prefers some proposals (21.5% land above the threshold), but the threshold is arbitrary: accepted and rejected proposals have identical downstream accuracy.

4. **Multi-proposal (Config E) was worse than single-proposal (Config B):** 42% vs. 51%. Sampling at temperature 0.8 added no useful diversity: the model generates the same action most of the time, so the "diverse" proposals collapse to one or two options, and the reward model cannot reliably pick the right one.
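
**The speculative decoding architecture cannot fix a weak-signal problem.** When the task itself is poorly learned, adding complex routing (verification, reranking) only adds cost without improving accuracy.

To make the mismatch in point 1 concrete, the two distributions look roughly like this (the `...` placeholders stand for real turns; strings paraphrased from the report, not copied from the training script):

```python
# Training distribution: a flat completion prompt with an "Action:" target.
train_example = (
    "Predict the next action for:\n\n"
    "user: ...\nassistant: ...\n\n"
    "Action: tool_call\nReason: ..."
)

# Eval distribution: natural chat-format messages, as the dataset stores them.
eval_example = [
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."},
]
```

A model tuned only on the first shape has no reason to produce well-formed actions when prompted with the second.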

## What Would Fix This

1. **Train on chat-template format:** Generate SFT data using `tokenizer.apply_chat_template()` with real conversational turns, not the bare "Action: X" format (see the sketch after this list).

2. **Train the 8B model:** Fine-tune Qwen3-8B on the same SFT data for a fair scale comparison; the current "strong model" is frozen and zero-shot.

3. **Train the verifier properly:** The ACCEPT/REJECT verifier needs positive examples of correct proposals being accepted. Train it as a binary classifier on (proposal, context) → ACCEPT/REJECT with balanced labels.

4. **Use a real agent benchmark:** The synthetic action-prediction task has limited ecological validity; a standard eval like BFCL (Berkeley Function Calling Leaderboard) would provide realistic, challenging tool-calling traces.
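
A sketch of fixes 1 and 3, assuming each dataset row carries a `messages` list, a gold `action` string, and a pool of `wrong_actions`; `apply_chat_template` is the real `transformers` API, everything else here is illustrative:

```python
import random
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")

def to_sft_text(row):
    """Render one SFT example in the same chat-template format the eval
    uses, with the gold action as the completion target."""
    prompt = tok.apply_chat_template(
        row["messages"], tokenize=False, add_generation_prompt=True
    )
    return prompt + row["action"]

def to_verifier_pairs(row):
    """Build balanced ACCEPT/REJECT examples so the verifier sees
    correct proposals being accepted, not only rejections."""
    return [
        {"messages": row["messages"], "proposal": row["action"],
         "label": "ACCEPT"},
        {"messages": row["messages"],
         "proposal": random.choice(row["wrong_actions"]),
         "label": "REJECT"},
    ]
```

Training on text produced this way puts the proposer, the verifier, and the eval set on the same distribution, which is the precondition for the speculative loop to pay off.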

## Deliverables

- **Dataset:** https://huggingface.co/datasets/narcolepticchicken/speculative-actions-eval (500 examples)
- **Proposer:** https://huggingface.co/narcolepticchicken/speculative-proposer-qwen3-1.7b
- **Verifier:** https://huggingface.co/narcolepticchicken/speculative-verifier-qwen3-4b
- **Code + results:** https://huggingface.co/narcolepticchicken/speculative-tool-actions (`eval_results_v2.json`)
|