# Speculative Tool Actions: Final Results

## Summary

We tested whether speculative decoding transfers from token prediction to agent action prediction. A cheap model (Qwen3-1.7B + LoRA) proposes the next agent action; verifiers decide whether to use the proposal or fall back to an expensive 8B model.

**Finding:** The speculative architecture failed to improve on the cheap proposer alone. Config B (1.7B cheap only, 51.0% accuracy, 0.15× cost) dominates every other configuration.
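
The decision flow described above can be sketched as follows. This is a minimal illustration, not the repo's code: `propose_cheap`, `verify`, and `predict_strong` are hypothetical stand-ins for the 1.7B proposer, the verifier, and the 8B model.

```python
def speculative_step(context, propose_cheap, verify, predict_strong):
    """One speculative action step: cheap proposal, verifier gate, strong fallback."""
    proposal = propose_cheap(context)   # cheap 1.7B model proposes the next action
    if verify(context, proposal):       # verifier accepts -> keep the cheap proposal
        return proposal, "cheap"
    return predict_strong(context), "strong"  # otherwise fall back to the 8B model
```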

## Full Results

| Config | Accuracy | Cost | vs Random | vs Majority | Notes |
|--------|----------|------|-----------|-------------|-------|
| A | 0.400 | 1.000 | 3.6× | 1.7× | 8B strong (frozen, zero-shot) |
| B | **0.510** | **0.150** | **4.6×** | **2.1×** | 1.7B cheap (LoRA fine-tuned) |
| C | 0.400 | 1.250 | 3.6× | 1.7× | cheap + 8B verifier (0% accept) |
| D | 0.510 | 0.250 | 4.6× | 2.1× | cheap + 4B RM (21.5% accept) |
| E | 0.420 | 0.750 | 3.8× | 1.8× | multi-proposal (n=3) + RM |

Random baseline: 11.1%. Majority baseline (always predicting `final_answer`): 24.0%. The multiplier columns report accuracy as a multiple of these baselines.

## Cost-Quality Frontier

```
Config B: cost=0.150 acc=0.510 <-- PARETO OPTIMAL
Config D: cost=0.250 acc=0.510 (dominated by B: same accuracy, higher cost)
Config E: cost=0.750 acc=0.420 (dominated by B on both axes)
Config A: cost=1.000 acc=0.400 (dominated by B on both axes)
Config C: cost=1.250 acc=0.400 (dominated by B on both axes)
```
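
The frontier can be checked mechanically. A small sketch (the `pareto_optimal` helper is illustrative; the numbers come from the results table):

```python
def pareto_optimal(points):
    """Return names of configs not dominated on (cost: lower better, acc: higher better)."""
    def dominates(b, a):
        # b dominates a: no worse on either axis, strictly better on at least one
        return (b["cost"] <= a["cost"] and b["acc"] >= a["acc"]
                and (b["cost"] < a["cost"] or b["acc"] > a["acc"]))
    return [a["name"] for a in points
            if not any(dominates(b, a) for b in points if b is not a)]

configs = [
    {"name": "A", "cost": 1.000, "acc": 0.400},
    {"name": "B", "cost": 0.150, "acc": 0.510},
    {"name": "C", "cost": 1.250, "acc": 0.400},
    {"name": "D", "cost": 0.250, "acc": 0.510},
    {"name": "E", "cost": 0.750, "acc": 0.420},
]
```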

## Why It Failed

1. **Training-eval mismatch:** The proposer was fine-tuned on an `"Action: <type>\n<reason>"` completion format, while evaluation uses natural conversational `messages`. The models never learned the real task structure.

2. **The 8B verifier (Config C) rejected every proposal:** It never learned to discriminate; it says REJECT to everything and falls back to its own (worse) zero-shot predictions.

3. **The 4B reward model (Config D/E) scores everything negatively:** Mean score: -1.52. It weakly prefers some proposals (21.5% clear the threshold), but the threshold is arbitrary: accepted and rejected proposals have identical downstream accuracy.

4. **Multi-proposal (Config E) was worse than single proposal (Config B):** 42% vs. 51%. Sampling at temperature 0.8 added no useful diversity: the model generates the same action most of the time, so the three proposals collapse to one or two options, and the RM cannot reliably pick the right one.

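For intuition on why an always-rejecting verifier inflates cost: under the usual speculative-decoding accounting (an assumption on our part; the table's exact cost model may differ, and the 0.10 verifier cost below is illustrative), every rejection pays for the proposal, the verification, *and* the strong fallback:

```python
def expected_cost(c_cheap, c_verify, c_strong, p_accept):
    # Acceptance keeps only the cheap proposal + verification; rejection
    # additionally runs the strong model, so a low accept rate costs more
    # than running the strong model alone.
    return c_cheap + c_verify + (1.0 - p_accept) * c_strong
```

With the assumed 0.10 verifier cost and Config C's 0% acceptance, this gives 0.15 + 0.10 + 1.00 = 1.25, more than Config A's 1.00: the pipeline pays for both models on every step.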
## What Would Fix This

1. **Train on the chat-template format:** Generate SFT data using `tokenizer.apply_chat_template()` with real conversational turns, not just the bare `"Action: X"` format.

2. **Train the 8B model:** Fine-tune Qwen3-8B on the same SFT data for a fair scale comparison (the current "strong model" is frozen zero-shot).

3. **Train the verifier properly:** The ACCEPT/REJECT verifier needs positive examples of correct proposals being accepted. Train it as a binary classifier mapping (proposal, context) → ACCEPT/REJECT with balanced labels.

4. **Use a real agent benchmark:** An established eval such as BFCL (Berkeley Function Calling Leaderboard) would provide realistic, challenging action-prediction traces.

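The chat-template fix (item 1) could be sketched like this. A minimal illustration: the schema (`history`, `gold_action`, `gold_reason`) is hypothetical, and in practice the resulting `messages` list would be serialized with `tokenizer.apply_chat_template(messages, tokenize=False)`:

```python
def build_sft_example(history, gold_action, gold_reason):
    """Build one conversational SFT example as a chat-format `messages` list."""
    messages = [{"role": "system",
                 "content": "You are a tool-using agent. Predict the next action."}]
    messages.extend(history)  # prior turns, already tagged with roles
    # The target completion is the assistant turn, so the proposer learns the
    # same conversational structure it will see at eval time.
    messages.append({"role": "assistant",
                     "content": f"Action: {gold_action}\n{gold_reason}"})
    return messages
```
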
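The balanced verifier dataset (item 3) might be built as below: a sketch assuming examples are `(context, gold_action)` pairs, with REJECT negatives made by swapping in a wrong action from another example so the labels stay balanced:

```python
import random

def build_verifier_dataset(examples, seed=0):
    """Balanced ACCEPT/REJECT pairs for a binary (proposal, context) classifier."""
    rng = random.Random(seed)
    actions = [gold for _, gold in examples]
    data = []
    for context, gold in examples:
        # Positive: the correct action should be accepted.
        data.append({"context": context, "proposal": gold, "label": "ACCEPT"})
        # Negative: a wrong action sampled from the other examples' labels.
        wrong = rng.choice([a for a in actions if a != gold])
        data.append({"context": context, "proposal": wrong, "label": "REJECT"})
    return data
```
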
## Deliverables

- **Dataset:** https://huggingface.co/datasets/narcolepticchicken/speculative-actions-eval (500 examples)
- **Proposer:** https://huggingface.co/narcolepticchicken/speculative-proposer-qwen3-1.7b
- **Verifier:** https://huggingface.co/narcolepticchicken/speculative-verifier-qwen3-4b
- **Code + results:** https://huggingface.co/narcolepticchicken/speculative-tool-actions