# Speculative Tool Actions: Final Results

## Summary

We tested whether speculative decoding transfers from token prediction to agent action prediction. A cheap model (Qwen3-1.7B + LoRA) proposes the next agent action; verifiers decide whether to use the proposal or fall back to an expensive 8B model.
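The propose-then-verify control flow can be sketched as follows; the model interfaces (`predict`, `score`) are hypothetical placeholders, not the repo's actual API:

```python
def speculative_action(context, proposer, verifier, strong_model, threshold=0.0):
    """Propose with the cheap model; fall back to the strong model on rejection.

    `proposer`, `verifier`, and `strong_model` are hypothetical wrappers
    around the 1.7B proposer, the verifier/reward model, and the frozen
    8B model.
    """
    proposal = proposer.predict(context)       # cheap 1.7B forward pass
    score = verifier.score(context, proposal)  # verifier or reward-model score
    if score >= threshold:
        return proposal                        # accept: pay only the cheap cost
    return strong_model.predict(context)       # reject: expensive fallback
```

Config C uses the 8B model itself as the verifier; Configs D and E swap in the 4B reward model.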

**Finding:** The speculative architecture failed to improve on the cheap proposer alone. Option B (1.7B cheap only, 51.0% accuracy, 0.15× cost) dominates every other configuration.

## Full Results

| Config | Accuracy | Cost | × Random | × Majority | Notes |
|--------|----------|------|----------|------------|-------|
| A | 0.400 | 1.000 | 3.6× | 1.7× | 8B strong (frozen, zero-shot) |
| B | **0.510** | **0.150** | **4.6×** | **2.1×** | 1.7B cheap (LoRA fine-tuned) |
| C | 0.400 | 1.250 | 3.6× | 1.7× | cheap + 8B verifier → 0% accept |
| D | 0.510 | 0.250 | 4.6× | 2.1× | cheap + 4B RM → 21.5% accept |
| E | 0.420 | 0.750 | 3.8× | 1.8× | multi-proposal (n=3) + RM |

Random baseline: 11.1% | Majority baseline (final_answer): 24.0%

## Cost-Quality Frontier

```
Config B: cost=0.150 acc=0.510   ★ PARETO OPTIMAL
Config D: cost=0.250 acc=0.510   (dominated by B — same acc, higher cost)
Config E: cost=0.750 acc=0.420   (dominated by B on both axes)
Config A: cost=1.000 acc=0.400   (dominated by B on both axes)
Config C: cost=1.250 acc=0.400   (dominated by B on both axes)
```
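The dominance labels above can be checked mechanically with a small helper (illustrative, not part of the repo):

```python
def dominates(a, b):
    """True if config `a` Pareto-dominates `b`: no worse on both axes,
    strictly better on at least one. Each config is (cost, accuracy)."""
    cost_a, acc_a = a
    cost_b, acc_b = b
    no_worse = cost_a <= cost_b and acc_a >= acc_b
    strictly_better = cost_a < cost_b or acc_a > acc_b
    return no_worse and strictly_better

# (cost, accuracy) pairs from the results table
configs = {
    "A": (1.000, 0.400),
    "B": (0.150, 0.510),
    "C": (1.250, 0.400),
    "D": (0.250, 0.510),
    "E": (0.750, 0.420),
}

pareto = [name for name, v in configs.items()
          if not any(dominates(w, v) for w in configs.values() if w != v)]
# Only Config B survives the dominance check.
```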

## Why It Failed

1. **Training-eval mismatch:** The proposer was fine-tuned on the `"Action: <type>\n<reason>"` format, while evaluation uses natural conversational `messages`, so the proposer never learned the real task structure.

2. **8B verifier (Config C) rejected every proposal:** Having never been trained to discriminate, it outputs REJECT for everything and falls back to its own (worse) zero-shot predictions.

3. **4B reward model (Config D/E) scores everything negatively:** Mean score -1.52. It weakly prefers some proposals (21.5% above threshold) but the thresholding is arbitrary — accepted and rejected proposals have identical downstream accuracy.

4. **Multi-proposal (Config E) was worse than single proposal (Config B):** 42% vs 51%. Sampling at temperature 0.8 added little real diversity: the model generates the same action most of the time, so the "diverse" proposals collapse to one or two options, and the RM can't reliably pick the right one.
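The failure mode in point 3, a threshold that moves the accept rate without separating good from bad proposals, can be checked with a diagnostic like this (toy helper, not the repo's eval code):

```python
import statistics

def accept_stats(scores, correct, threshold):
    """Split proposals by a reward-model score threshold and compare
    downstream accuracy of the accepted vs rejected groups.

    `scores`: RM score per proposal; `correct`: 1 if the proposed action
    matched the label, else 0. If the two accuracies are (near) equal,
    the threshold is not doing useful work.
    """
    accepted = [c for s, c in zip(scores, correct) if s >= threshold]
    rejected = [c for s, c in zip(scores, correct) if s < threshold]
    return {
        "accept_rate": len(accepted) / len(scores),
        "acc_accepted": statistics.mean(accepted) if accepted else None,
        "acc_rejected": statistics.mean(rejected) if rejected else None,
    }
```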

## What Would Fix This

1. **Train on chat-template format:** Generate SFT data using `tokenizer.apply_chat_template()` with real conversational turns, not just "Action: X" format.

2. **Train the 8B model:** Fine-tune Qwen3-8B on the same SFT data for a fair scale comparison (current "strong model" is frozen zero-shot).

3. **Train the verifier properly:** The ACCEPT/REJECT verifier needs positive examples of correct proposals being accepted. Train it as a binary classifier on (proposal, context) → ACCEPT/REJECT with balanced labels.

4. **Use a real agent benchmark:** A rigorous benchmark such as BFCL (Berkeley Function Calling Leaderboard) would provide realistic and challenging action-prediction traces.
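Fix 1 amounts to emitting SFT examples in the model's chat format rather than as raw `"Action: X"` strings. A minimal sketch, assuming a simple (task, action, reason) record schema that may differ from the actual dataset:

```python
def to_chat_example(task: str, action: str, reason: str) -> list[dict]:
    """Wrap one SFT record as chat messages so it can be rendered with
    tokenizer.apply_chat_template(). The record schema here is an
    assumption, not the dataset's actual one."""
    return [
        {"role": "user", "content": task},
        {"role": "assistant", "content": f"Action: {action}\n{reason}"},
    ]

# Rendering then needs the real tokenizer (requires a model download):
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")
# text = tok.apply_chat_template(to_chat_example(...), tokenize=False)
```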
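Fix 3's balanced-label construction can be sketched as follows; the field names (`context`, `gold`, `actions`) are assumptions about the record schema, not the repo's actual format:

```python
import random

def build_verifier_examples(records, rng=None):
    """Build balanced ACCEPT/REJECT training pairs for the verifier.

    For each record, the gold action is labeled ACCEPT and one sampled
    wrong action is labeled REJECT, giving a 50/50 label balance so the
    verifier sees positive examples of correct proposals being accepted.
    """
    rng = rng or random.Random(0)
    examples = []
    for rec in records:
        wrong = [a for a in rec["actions"] if a != rec["gold"]]
        examples.append((rec["context"], rec["gold"], "ACCEPT"))
        examples.append((rec["context"], rng.choice(wrong), "REJECT"))
    return examples
```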

## Deliverables

- **Dataset:** https://huggingface.co/datasets/narcolepticchicken/speculative-actions-eval (500 examples)
- **Proposer:** https://huggingface.co/narcolepticchicken/speculative-proposer-qwen3-1.7b
- **Verifier:** https://huggingface.co/narcolepticchicken/speculative-verifier-qwen3-4b
- **Code + results:** https://huggingface.co/narcolepticchicken/speculative-tool-actions