Speculative Tool Actions – Complete Project Report
Goal: Test whether speculative decoding transfers from token prediction to agent action prediction.
Hypothesis: A cheap model (1.7B) proposes the next agent action; a verifier decides whether to use the cheap proposal or fall back to an expensive model (8B). If most proposals are accepted, we save cost without sacrificing accuracy.
Architecture
```
Conversation Context
         │
         ▼
┌─────────────────────┐     ┌───────────────────────────┐
│ Cheap Proposer      │     │ Strong Fallback (8B)      │
│ (Qwen3-1.7B LoRA)   │─────│ Only called on REJECT     │
│ ~15% cost           │     │ ~100% cost                │
└─────────────────────┘     └───────────────────────────┘
         │ proposal                      ▲
         ▼                               │
┌─────────────────────┐                  │
│ Verifier            │──── REJECT ──────┘
│ (Qwen3-4B LoRA)     │
│ ACCEPT/REJECT       │
│ ~5% cost            │
└─────────────────────┘
         │
         ▼ ACCEPT
  Use cheap proposal
```
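A minimal sketch of this routing loop, assuming hypothetical `propose_cheap`, `verify`, and `propose_strong` wrappers around the three models (names and signatures are illustrative, not taken from the project code):

```python
# Speculative action routing: cheap proposal -> verifier -> strong fallback.
def next_action(context: str,
                propose_cheap,    # 1.7B LoRA proposer, ~0.15x cost (assumed wrapper)
                verify,           # 4B LoRA verifier, returns "ACCEPT"/"REJECT"
                propose_strong):  # 8B fallback, ~1.00x cost (assumed wrapper)
    proposal = propose_cheap(context)
    if verify(context, proposal) == "ACCEPT":
        return proposal              # cheap path: ~0.20x total cost
    return propose_strong(context)   # fallback path: ~1.20x total cost
```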
Five Configurations
| Config | Name | Description |
|--------|------|-------------|
| A | Always Strong | 8B fine-tuned proposer on every turn |
| B | Cheap Only | 1.7B LoRA proposer on every turn |
| C | Cheap + Strong Verifier | 1.7B proposes → 8B accepts/rejects → fallback to 8B on REJECT |
| D | Cheap + 4B Verifier | 1.7B proposes → 4B LoRA verifier → fallback to 8B on REJECT |
| E | Multi-Proposal Rerank | 1.7B generates N=3 diverse proposals → verifier picks best |
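Config E can be sketched as temperature-sampled candidates scored by the verifier; `sample_cheap`, `score`, and the temperature value below are hypothetical stand-ins for the project's actual code:

```python
def rerank_action(context: str, sample_cheap, score, n: int = 3):
    """Config E: sample n diverse cheap proposals, keep the verifier's favorite.

    sample_cheap(context, temperature) -> str   (1.7B proposer, assumed)
    score(context, proposal) -> float           (verifier score, assumed)
    """
    candidates = [sample_cheap(context, temperature=0.8) for _ in range(n)]
    return max(candidates, key=lambda c: score(context, c))
```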
Cost Model (relative to 8B inference)
| Model | Cost | Notes |
|-------|------|-------|
| 8B (strong) | 1.00 | Baseline |
| 1.7B (cheap) | 0.15 | LoRA adapter, bf16 |
| 4B (verifier) | 0.05 | Lightweight, ~5 tokens generated |
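Under this cost model, the expected per-turn cost of the verified pipeline (configs C/D, where every REJECT triggers the 8B fallback) depends only on the accept rate. The arithmetic below follows directly from the table, with the accept rate left as a free variable:

```python
def expected_cost_config_d(accept_rate: float) -> float:
    """Expected relative cost per turn for config D (cheap + 4B verifier).

    Always pay the proposer (0.15) and verifier (0.05); pay the 8B
    fallback (1.00) only on REJECT.
    """
    return 0.15 + 0.05 + (1.0 - accept_rate) * 1.00

for a in (0.2, 0.5, 0.9):
    print(f"accept_rate={a:.0%}: expected cost ~ {expected_cost_config_d(a):.2f}")
# accept_rate=20%: ~1.00 (break-even with always-strong)
# accept_rate=50%: ~0.70
# accept_rate=90%: ~0.30
```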
Models
Trained (have LoRA weights on Hub)
Required to Complete
- narcolepticchicken/speculative-verifier-v3-4b – needs training (script: train_verifier_v3.py)
- narcolepticchicken/speculative-proposer-v3-8b – needs training (script: train_sft_v3.py)
Datasets
v3 (Current – Chat-template format)
- speculative-sft-v3-main – 8,744 examples (8,306 train / 438 test)
- speculative-verifier-v3-main – 17,488 ACCEPT/REJECT pairs, 50/50 balanced
- speculative-eval-v3-main – 1,000 eval examples
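The 17,488 verifier pairs are exactly twice the 8,744 SFT examples, consistent with one ACCEPT and one REJECT record per SFT example. A plausible construction sketch follows; the field names and the negative-sampling strategy are assumptions, not confirmed from the dataset card:

```python
def make_verifier_pairs(sft_example, wrong_action):
    """Turn one SFT record into one ACCEPT and one REJECT verifier example.

    sft_example: assumed dict with "context" and gold "action" fields.
    wrong_action: any incorrect action, e.g. a bad cheap-model proposal.
    """
    ctx = sft_example["context"]
    return [
        {"context": ctx, "proposal": sft_example["action"], "label": "ACCEPT"},
        {"context": ctx, "proposal": wrong_action,          "label": "REJECT"},
    ]
```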
v2 Results (Measured – April 2026)

| Config | Accuracy | Cost | Notes |
|--------|----------|------|-------|
| A: 8B frozen | 0.400 | 1.00 | 8B zero-shot, never fine-tuned |
| B: 1.7B cheap | 0.510 | 0.15 | Pareto optimal |
| C: cheap + 8B rejector | 0.400 | 1.25 | 0% accept rate |
| D: cheap + 4B RM | 0.510 | 0.25 | 21.5% accept, all scores negative |
| E: multi-proposal n=3 | 0.420 | 0.75 | RM picks best of 3 |
Key Failures
- Training/eval format mismatch: trained to emit `Action: X\n<reason>` strings, evaluated with conversational messages
- Unequal comparison: 1.7B fine-tuned, 8B frozen zero-shot
- Verifier was reward model, not ACCEPT/REJECT classifier
- No positive ACCEPT examples in verifier training
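For concreteness, the v2 format mismatch looked roughly like this; both snippets are reconstructions from the bullet above, with invented tool names and content, not verbatim project data:

```python
# What the v2 proposer was trained to emit (plain string target):
train_target = "Action: search_web\n<reason>user asked for current data</reason>"

# What v2 evaluation actually fed and expected (chat-template messages):
eval_messages = [
    {"role": "user", "content": "What's the weather in Paris right now?"},
    {"role": "assistant", "content": '{"tool": "search_web", "args": {}}'},
]
# The proposer never saw the chat template at train time, which is the
# root of the "format mismatch" failure listed above.
```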
v3 Redesign
- Chat-template training format matches the eval format exactly (see the sketch after this list)
- Both 1.7B and 8B fine-tuned identically (same LoRA config, same data)
- Verifier as binary ACCEPT/REJECT SFT classifier on balanced data
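A minimal sketch of what a v3-style chat-template record might look like; the role structure follows standard chat formatting, and the tool call itself is an invented placeholder:

```python
# One v3-style training example: identical structure at train and eval time.
example = {
    "messages": [
        {"role": "system", "content": "You are a tool-using agent."},
        {"role": "user", "content": "Find the cheapest flight to Tokyo."},
        # The assistant turn is the training target: the next agent action.
        {"role": "assistant",
         "content": '{"tool": "flight_search", "args": {"destination": "Tokyo"}}'},
    ]
}
```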
How to Complete
Step 1: Train missing models (A100-large, ~2h total)
```
python train_verifier_v3.py
python train_sft_v3.py --model Qwen/Qwen3-8B --hub-id speculative-proposer-v3-8b
```
Step 2: Evaluate (a10g-largex2 or A100-large)
```
python eval_runner_v3.py
```
Step 3: Results
Output: eval_results_v3.json on narcolepticchicken/speculative-tool-actions
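Once eval_runner_v3.py finishes, the results file can be inspected with a few lines of Python; the JSON schema below is an assumption, so adjust the keys to whatever eval_runner_v3.py actually writes:

```python
import json

# Assumed schema: {"<config>": {"accuracy": float, "cost": float}, ...}
with open("eval_results_v3.json") as f:
    results = json.load(f)

for config, metrics in sorted(results.items()):
    print(f"{config}: accuracy={metrics['accuracy']:.3f} "
          f"cost={metrics['cost']:.2f}")
```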
Deliverables Checklist