narcolepticchicken committed
Commit 02af344 · verified · 1 Parent(s): ca60718

Upload ABLATION_REPORT_v2.md

Files changed (1):
  ABLATION_REPORT_v2.md +35 -58
ABLATION_REPORT_v2.md CHANGED
@@ -1,79 +1,56 @@
- # Speculative Tool Actions: Ablation Report (v2)
-
- ## Experiment Design
-
- We test whether speculative decoding can be adapted from token prediction to agent action prediction.
-
- **Task:** Given a conversation context (multi-turn user/assistant/tool), predict the next agent action from 9 classes: `tool_call`, `retrieval`, `file_read`, `file_write`, `repair`, `verifier`, `ask_clarification`, `final_answer`, `BLOCKED`.
-
- **Models:**
- - **Cheap proposer:** Qwen3-1.7B + LoRA fine-tuned for 2 epochs on synthetic action-prediction data (96.7% train accuracy)
- - **Strong model:** Qwen3-8B frozen (zero-shot)
- - **Reward model:** Qwen3-4B + LoRA trained on preference pairs over which actions are correct (70% train accuracy)
-
- **Configurations:**
-
- | Config | Description | Expected Cost (normalized) |
- |--------|-------------|----------------------------|
- | A | Strong 8B model only | 1.00 |
- | B | Cheap 1.7B proposer only | 0.15 |
- | C | Cheap proposes → 8B verifies (ACCEPT/REJECT) | 0.25–1.25 |
- | D | Cheap proposes → 4B reward model scores | 0.25 |
- | E | Cheap generates 3 proposals → reward model picks best | 0.75 |
-
- ## Results (200 eval examples, dual A10G)
-
- | Config | Accuracy | Cost | xRandom (0.111) | xMajority (0.240) | Accept Rate |
- |--------|----------|------|-----------------|-------------------|-------------|
- | A (8B) | **0.400** | 1.000 | 3.6× | 1.7× | n/a |
- | B (1.7B) | **0.510** | 0.150 | 4.6× | 2.1× | n/a |
- | C (C+S) | **0.400** | 1.250 | 3.6× | 1.7× | 0.000 |
- | D (C+R) | **0.510** | 0.250 | 4.6× | 2.1× | 0.215 |
- | E (multi) | pending | n/a | n/a | n/a | n/a |
-
- ### Key Findings
-
- **1. Cheap model outperforms strong model (B > A: 51% vs 40%)**
-
- The fine-tuned 1.7B proposer learned the task from synthetic data. The frozen 8B model, despite having 4.7× more parameters, had zero training on the task and defaults to generic completions. **Fine-tuning matters more than scale for this task.**
-
- **2. Verifier (C): 0% accept rate, completely useless**
-
- The 8B ACCEPT/REJECT verifier rejected every proposal from the cheap model, then fell back to its own (worse) predictions. Result: the same accuracy as A (40%) at higher cost (1.25 vs 1.00). The verifier is either:
- - telling the truth (it genuinely believes all proposals are wrong), or
- - collapsed to always outputting "REJECT".
-
- Either way, it provides zero benefit.
-
- **3. Reward model (D): 21.5% accept rate, all scores negative (mean -1.52)**
-
- The 4B reward model scores all proposals negatively, with only ~1/5 above the acceptance threshold of -1.0. Since Configs B and D have identical accuracy (51%), the reward model is not effectively filtering; it just adds cost without changing outcomes. The accepted proposals are no more accurate than the rejected ones.
-
- **4. Cost-quality frontier is degenerate**
-
- The only sensible point on the frontier is **Config B**: 51% accuracy at 0.15× cost. Configs A, C, and D are all dominated: they cost more and deliver equal or worse accuracy.
-
- ## Why This Task Is Challenging
-
- The eval dataset uses **real conversational messages** (from synthetic templates, but in natural chat format), while the proposer was fine-tuned on the **"Action: <type>\n<reason>" format**. The training/eval distribution mismatch means:
-
- - The proposer learned to generate `Action: tool_call\nReason: ...` when given `"Predict the next action for:\n\nuser: ... assistant: ..."`
- - The 8B model had no such training and produces natural-language continuations instead
- - The verifier was trained on preference pairs, not on ACCEPT/REJECT decisions
-
- **The speculative decoding architecture cannot fix a weak-signal problem.** When the task itself is poorly learned, adding complex routing (verification, reranking) only adds cost without improving accuracy.
-
- ## Recommendations
-
- 1. **Fix the training data:** Generate SFT data using the same chat-template format as the eval dataset. Use `tokenizer.apply_chat_template()` consistently.
- 2. **Train the strong model:** Fine-tune Qwen3-8B on the same SFT data for a fair comparison of proposer size.
- 3. **Train the verifier properly:** The ACCEPT/REJECT verifier needs training examples showing correct proposals paired with ACCEPT decisions.
- 4. **Use a real agent benchmark:** The synthetic action-prediction task has limited ecological validity. A task like BFCL (Berkeley Function Calling Leaderboard) would provide realistic tool-calling traces.
-
  ## Deliverables

- - **Dataset:** `narcolepticchicken/speculative-actions-eval` (500 conversational examples with action labels)
- - **Proposer:** `narcolepticchicken/speculative-proposer-qwen3-1.7b` (LoRA adapter)
- - **Verifier:** `narcolepticchicken/speculative-verifier-qwen3-4b` (LoRA adapter for reward model)
- - **Eval runner:** `narcolepticchicken/speculative-tool-actions` (eval scripts + results)
- - **Results:** `narcolepticchicken/speculative-tool-actions/eval_results_v2.json`

+ # Speculative Tool Actions: Final Results
+
+ ## Summary
+
+ We tested whether speculative decoding transfers from token prediction to agent action prediction. A cheap model (Qwen3-1.7B + LoRA) proposes the next agent action; verifiers decide whether to use the proposal or fall back to an expensive 8B model.
+
+ **Finding:** The speculative architecture failed to improve on the cheap proposer alone. Config B (1.7B cheap only, 51.0% accuracy, 0.15× cost) dominates every other configuration.
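+
+ For reference, a minimal sketch of the Config C routing logic under test. The `propose`, `verify`, and `strong` callables stand in for the actual model calls, and the cost split (0.15 propose + 0.10 verify) is inferred from the cost table below, not stated in the repo:
+
+ ```python
+ CHEAP_COST, VERIFY_COST, STRONG_COST = 0.15, 0.10, 1.00  # normalized costs (assumed split)
+
+ def speculative_step(context, propose, verify, strong):
+     """Cheap model proposes; strong model verifies; fall back to strong on REJECT."""
+     proposal = propose(context)                  # 1.7B + LoRA proposer
+     cost = CHEAP_COST + VERIFY_COST
+     if verify(context, proposal):                # 8B ACCEPT/REJECT check
+         return proposal, cost                    # best case: 0.25x cost
+     return strong(context), cost + STRONG_COST   # worst case: 1.25x cost
+ ```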
+
+ ## Full Results
+
+ | Config | Accuracy | Cost | xRand | xMaj | Notes |
+ |--------|----------|------|-------|------|-------|
+ | A | 0.400 | 1.000 | 3.6× | 1.7× | 8B strong (frozen, zero-shot) |
+ | B | **0.510** | **0.150** | **4.6×** | **2.1×** | 1.7B cheap (LoRA fine-tuned) |
+ | C | 0.400 | 1.250 | 3.6× | 1.7× | cheap + 8B verifier → 0% accept |
+ | D | 0.510 | 0.250 | 4.6× | 2.1× | cheap + 4B RM → 21.5% accept |
+ | E | 0.420 | 0.750 | 3.8× | 1.8× | multi-proposal (n=3) + RM |
+
+ Random baseline: 11.1% | Majority baseline (final_answer): 24.0%
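+
+ The xRand/xMaj columns are simply accuracy divided by the corresponding baseline. A worked check for Config B:
+
+ ```python
+ acc_b = 0.510
+ x_rand = acc_b / (1 / 9)   # 9 action classes -> random baseline 0.111
+ x_maj  = acc_b / 0.240     # always predicting final_answer
+ print(f"{x_rand:.1f}x {x_maj:.1f}x")  # 4.6x 2.1x
+ ```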
+
+ ## Cost-Quality Frontier
+
+ ```
+ Config B: cost=0.150  acc=0.510  ★ PARETO OPTIMAL
+ Config D: cost=0.250  acc=0.510  (dominated by B: same accuracy, higher cost)
+ Config E: cost=0.750  acc=0.420  (dominated by B on both axes)
+ Config A: cost=1.000  acc=0.400  (dominated by B on both axes)
+ Config C: cost=1.250  acc=0.400  (dominated by B on both axes)
+ ```
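+
+ The dominance claims can be checked mechanically. A small sketch, with the (cost, accuracy) pairs copied from the table above:
+
+ ```python
+ results = {"A": (1.000, 0.400), "B": (0.150, 0.510), "C": (1.250, 0.400),
+            "D": (0.250, 0.510), "E": (0.750, 0.420)}  # config -> (cost, accuracy)
+
+ def pareto_front(results):
+     """A config survives if no other config is at least as cheap AND at least as
+     accurate, with at least one strict improvement."""
+     return [a for a, (ca, xa) in results.items()
+             if not any(cb <= ca and xb >= xa and (cb, xb) != (ca, xa)
+                        for b, (cb, xb) in results.items() if b != a)]
+
+ print(pareto_front(results))  # ['B']
+ ```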
+
+ ## Why It Failed
+
+ 1. **Training-eval mismatch:** The proposer was fine-tuned on the `"Action: <type>\n<reason>"` format, while the eval uses natural conversational `messages`. The models haven't learned the real task structure (see the format sketch after this list).
+
+ 2. **8B verifier (Config C) rejected every proposal:** It never learned to discriminate; it says REJECT to everything and falls back to its own (worse) zero-shot predictions.
+
+ 3. **4B reward model (Configs D/E) scores everything negatively:** Mean score -1.52. It weakly prefers some proposals (21.5% above the threshold), but the thresholding is arbitrary: accepted and rejected proposals have identical downstream accuracy.
+
+ 4. **Multi-proposal (Config E) was worse than single proposal (Config B):** 42% vs 51%. Sampling at temperature 0.8 didn't add useful diversity: the model generates the same action most of the time, so the "diverse" proposals collapse to 1-2 options, and the RM can't reliably pick the right one.
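+
+ To make the mismatch in item 1 concrete, here is a sketch of the two formats. The field names and message contents are illustrative, reconstructed from this report rather than copied from the dataset:
+
+ ```python
+ # Training format: one flat string in, "Action: <type>\n<reason>" out.
+ train_input = ("Predict the next action for:\n\n"
+                "user: Find the latest commit on main.\n"
+                "assistant: Let me check the repository.")
+ train_target = "Action: tool_call\nReason: requires a repository lookup"
+
+ # Eval format: natural chat-template messages plus an action label.
+ eval_row = {
+     "messages": [
+         {"role": "user", "content": "Find the latest commit on main."},
+         {"role": "assistant", "content": "Let me check the repository."},
+     ],
+     "label": "tool_call",
+ }
+ ```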
+
+ ## What Would Fix This
+
+ 1. **Train on chat-template format:** Generate SFT data using `tokenizer.apply_chat_template()` with real conversational turns, not the flat "Action: X" format. A sketch follows this list.
+
+ 2. **Train the 8B model:** Fine-tune Qwen3-8B on the same SFT data for a fair scale comparison (the current "strong model" is frozen, zero-shot).
+
+ 3. **Train the verifier properly:** The ACCEPT/REJECT verifier needs positive examples of correct proposals being accepted. Train it as a binary classifier on (proposal, context) → ACCEPT/REJECT with balanced labels, as sketched below.
+
+ 4. **Use a real agent benchmark:** A standard eval like BFCL (Berkeley Function Calling Leaderboard) would provide realistic and challenging action-prediction traces.
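+
+ A minimal sketch of recommendation 1, assuming the eval rows carry the `messages`/`label` fields shown earlier; `apply_chat_template` is the standard Transformers call, the rest is illustrative:
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")
+
+ def to_sft_example(row):
+     """Render an eval-style row into the exact prompt the proposer will see at eval time."""
+     prompt = tok.apply_chat_template(
+         row["messages"],             # same conversational turns as the eval set
+         tokenize=False,
+         add_generation_prompt=True,  # append the assistant header for generation
+     )
+     return {"prompt": prompt, "completion": row["label"]}  # target = bare action class
+ ```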
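+
+ And for recommendation 3, one way to build balanced ACCEPT/REJECT pairs from the same rows (again a sketch, not code from this repo):
+
+ ```python
+ import random
+
+ ACTIONS = ["tool_call", "retrieval", "file_read", "file_write", "repair",
+            "verifier", "ask_clarification", "final_answer", "BLOCKED"]
+
+ def make_verifier_pairs(rows):
+     """Each row yields one ACCEPT (the gold action) and one REJECT (a random wrong
+     action), so the verifier sees both decisions in equal proportion."""
+     pairs = []
+     for row in rows:
+         gold = row["label"]
+         wrong = random.choice([a for a in ACTIONS if a != gold])
+         pairs.append({"context": row["messages"], "proposal": gold,  "label": "ACCEPT"})
+         pairs.append({"context": row["messages"], "proposal": wrong, "label": "REJECT"})
+     return pairs
+ ```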

  ## Deliverables

+ - **Dataset:** https://huggingface.co/datasets/narcolepticchicken/speculative-actions-eval (500 examples)
+ - **Proposer:** https://huggingface.co/narcolepticchicken/speculative-proposer-qwen3-1.7b
+ - **Verifier:** https://huggingface.co/narcolepticchicken/speculative-verifier-qwen3-4b
+ - **Code + results:** https://huggingface.co/narcolepticchicken/speculative-tool-actions