# Speculative Tool Actions: Ablation Report (v2)
## Experiment Design

We test whether speculative decoding can be adapted from token-level prediction to agent action prediction.

**Task:** Given a multi-turn conversation context (user/assistant/tool messages), predict the next agent action from 9 classes: `tool_call`, `retrieval`, `file_read`, `file_write`, `repair`, `verifier`, `ask_clarification`, `final_answer`, `BLOCKED`.

**Models:**
- **Cheap proposer:** Qwen3-1.7B + LoRA, fine-tuned for 2 epochs on synthetic action-prediction data (96.7% train accuracy)
- **Strong model:** Qwen3-8B, frozen (zero-shot)
- **Reward model:** Qwen3-4B + LoRA, trained on preference pairs over which actions are correct (70% train accuracy)

**Configurations:**

| Config | Description | Expected cost (normalized) |
|--------|-------------|----------------------------|
| A | Strong 8B model only | 1.00 |
| B | Cheap 1.7B proposer only | 0.15 |
| C | Cheap proposes → 8B verifies (ACCEPT/REJECT) | 0.25–1.25 |
| D | Cheap proposes → 4B reward model scores | 0.25 |
| E | Cheap generates 3 proposals → reward model picks the best | 0.75 |
## Results (200 eval examples, dual A10G)

| Config | Accuracy | Cost | vs. random (0.111) | vs. majority (0.240) | Accept rate |
|--------|----------|------|--------------------|--------------------|-------------|
| A (8B only) | **0.400** | 1.000 | 3.6× | 1.7× | — |
| B (1.7B only) | **0.510** | 0.150 | 4.6× | 2.1× | — |
| C (cheap + strong verifier) | **0.400** | 1.250 | 3.6× | 1.7× | 0.000 |
| D (cheap + reward model) | **0.510** | 0.250 | 4.6× | 2.1× | 0.215 |
| E (multi-proposal) | pending | — | — | — | — |
### Key Findings

**1. The cheap model outperforms the strong model (B > A: 51% vs. 40%)**

The fine-tuned 1.7B proposer learned the task from synthetic data. The frozen 8B model, despite having 4.7× more parameters, had no training on the task and defaults to generic completions. **For this task, fine-tuning matters more than scale.**

**2. Verifier (C): 0% accept rate, completely useless**

The 8B ACCEPT/REJECT verifier rejected every proposal from the cheap model, then fell back to its own (worse) predictions. The result is the same accuracy as A (40%) at higher cost (1.25 vs. 1.00). The verifier is either:

- telling the truth (it genuinely judges every proposal wrong), or
- collapsed to always outputting "REJECT".

Either way, it provides zero benefit.

**3. Reward model (D): 21.5% accept rate, all scores negative (mean -1.52)**

The 4B reward model scores every proposal negatively, with only about one in five above the acceptance threshold of -1.0. But since Configs B and D have identical accuracy (51%), the reward model is not filtering effectively: it adds cost without changing outcomes, and the accepted proposals are no more accurate than the rejected ones.

**4. The cost-quality frontier is degenerate**

The only sensible point on the frontier is **Config B**: 51% accuracy at 0.15× cost. Configs A, C, and D are all dominated: each costs more and delivers equal or worse accuracy.
## Why This Task Is Challenging

The eval dataset uses **real conversational messages** (generated from synthetic templates but rendered in natural chat format), while the proposer was fine-tuned on an **`Action: <type>\n<reason>` format**. This train/eval distribution mismatch means:

- the proposer learned to emit `Action: tool_call\nReason: ...` when given `"Predict the next action for:\n\nuser: ... assistant: ..."`;
- the 8B model had no such training and produces natural-language continuations instead;
- the verifier was trained on preference pairs, not on ACCEPT/REJECT decisions.

**The speculative decoding architecture cannot fix a weak-signal problem.** When the task itself is poorly learned, adding complex routing (verification, reranking) only adds cost without improving accuracy.
## Recommendations

1. **Fix the training data:** Generate SFT data in the same chat-template format as the eval dataset; use `tokenizer.apply_chat_template()` consistently.
2. **Train the strong model:** Fine-tune Qwen3-8B on the same SFT data for a fair comparison across proposer sizes.
3. **Train the verifier properly:** The ACCEPT/REJECT verifier needs training examples that pair correct proposals with ACCEPT decisions (and incorrect ones with REJECT).
4. **Use a real agent benchmark:** The synthetic action-prediction task has limited ecological validity; a benchmark like BFCL (Berkeley Function Calling Leaderboard) would provide realistic tool-calling traces.
## Deliverables

- **Dataset:** `narcolepticchicken/speculative-actions-eval` (500 conversational examples with action labels)
- **Proposer:** `narcolepticchicken/speculative-proposer-qwen3-1.7b` (LoRA adapter)
- **Verifier:** `narcolepticchicken/speculative-verifier-qwen3-4b` (LoRA adapter for the reward model)
- **Eval runner:** `narcolepticchicken/speculative-tool-actions` (eval scripts + results)
- **Results:** `narcolepticchicken/speculative-tool-actions/eval_results_v2.json`