# Speculative Tool Actions: Complete Project Report

**Goal:** Test whether speculative decoding transfers from token prediction to agent action prediction.

**Hypothesis:** A cheap model (1.7B) proposes the next agent action; a verifier decides whether to use the cheap proposal or fall back to an expensive model (8B). If most proposals are accepted, we save cost without sacrificing accuracy.

## Architecture

```
   Conversation Context
            │
            ▼
┌───────────────────────┐      ┌────────────────────────────┐
│  Cheap Proposer       │      │  Strong Fallback (8B)      │
│  (Qwen3-1.7B LoRA)    │      │  Only called on REJECT     │
│  ~15% cost            │      │  ~100% cost                │
└───────────────────────┘      └────────────────────────────┘
         │ proposal                          ▲
         ▼                                   │
┌───────────────────────┐                    │
│  Verifier             │───── REJECT ───────┘
│  (Qwen3-4B LoRA)      │
│  ACCEPT/REJECT        │
│  ~5% cost             │
└───────────────────────┘
         │
         ▼ ACCEPT
   Use cheap proposal
```
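
The routing above can be sketched in a few lines. `cheap_propose`, `verify`, and `strong_propose` are hypothetical stand-ins for the actual model inference calls, and the cost constants come from the cost model section; this is a sketch of the control flow, not the project's eval runner.

```python
from typing import Callable


def route_action(
    context: str,
    cheap_propose: Callable[[str], str],
    verify: Callable[[str, str], str],   # returns "ACCEPT" or "REJECT"
    strong_propose: Callable[[str], str],
) -> tuple[str, float]:
    """Run one agent turn through the speculative pipeline.

    Returns the chosen action and its relative cost (8B inference = 1.00).
    """
    proposal = cheap_propose(context)    # 1.7B proposer, ~0.15 relative cost
    cost = 0.15 + 0.05                   # cheap proposer + 4B verifier
    if verify(context, proposal) == "ACCEPT":
        return proposal, cost            # fast path: ~0.20 total
    # REJECT: fall back to the strong 8B model for this turn.
    return strong_propose(context), cost + 1.00
```

On ACCEPT the turn costs roughly 0.20 relative units; on REJECT the verifier overhead is paid on top of the full 8B call.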

## Five Configurations

| Config | Name | Description |
|--------|------|-------------|
| A | Always Strong | 8B fine-tuned proposer on every turn |
| B | Cheap Only | 1.7B LoRA proposer on every turn |
| C | Cheap + Strong Verifier | 1.7B proposes → 8B accepts/rejects → fallback to 8B on REJECT |
| D | Cheap + 4B Verifier | 1.7B proposes → 4B LoRA verifier → fallback to 8B on REJECT |
| E | Multi-Proposal Rerank | 1.7B generates N=3 diverse proposals → verifier picks best |

## Cost Model (relative to 8B inference)

| Model | Cost | Notes |
|-------|------|-------|
| 8B (strong) | 1.00 | Baseline |
| 1.7B (cheap) | 0.15 | LoRA adapter, bf16 |
| 4B (verifier) | 0.05 | Lightweight, ~5 tokens generated |
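
These numbers admit a simple blended-cost estimate for the verified pipeline (configs C/D): the cheap proposer and verifier always run, and the strong model runs only on the rejected fraction of turns. This is an illustrative model, not the project's exact accounting; for instance, at the v2 config D accept rate of 21.5% it would predict a cost near 0.99 rather than the reported 0.25, so v2's fallback behavior or accounting evidently differed from this idealization.

```python
def expected_cost(p_accept: float,
                  c_cheap: float = 0.15,
                  c_verify: float = 0.05,
                  c_strong: float = 1.00) -> float:
    """Blended relative cost per turn: proposer + verifier always run,
    strong fallback runs on the (1 - p_accept) rejected fraction."""
    return c_cheap + c_verify + (1.0 - p_accept) * c_strong

# Break-even vs. always calling the strong model:
# c_cheap + c_verify + (1 - p) * c_strong < c_strong
#   <=>  p > (c_cheap + c_verify) / c_strong = 0.20
```

Under this model the pipeline beats always-strong once the accept rate exceeds 20%, and at 100% acceptance the per-turn cost bottoms out at 0.20.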

## Models

### Trained (have LoRA weights on Hub)

| Model | Size | Role | Hub URL | Status |
|-------|------|------|---------|--------|
| v2 Proposer | 1.7B LoRA | Cheap proposer | [speculative-proposer-qwen3-1.7b](https://hf.co/narcolepticchicken/speculative-proposer-qwen3-1.7b) | 2.3GB adapter |
| v2 Verifier | 4B LoRA | Reward model | [speculative-verifier-qwen3-4b](https://hf.co/narcolepticchicken/speculative-verifier-qwen3-4b) | 45MB adapter |
| v3 Proposer 1.7B | 1.7B LoRA | Cheap proposer | [speculative-proposer-v3-1.7b](https://hf.co/narcolepticchicken/speculative-proposer-v3-1.7b) | 66.5MB adapter |
| v3 Verifier 4B | 4B LoRA | ACCEPT/REJECT | [speculative-verifier-v3-4b](https://hf.co/narcolepticchicken/speculative-verifier-v3-4b) | MISSING: no adapter weights |
| v3 Proposer 8B | 8B LoRA | Strong proposer | – | Not yet trained |

### Required to Complete

- **narcolepticchicken/speculative-verifier-v3-4b** – needs training (script: `train_verifier_v3.py`)
- **narcolepticchicken/speculative-proposer-v3-8b** – needs training (script: `train_sft_v3.py`)

## Datasets

### v3 (Current – chat-template format)

- **speculative-sft-v3-main** – 8,744 examples (8,306 train / 438 test)
- **speculative-verifier-v3-main** – 17,488 ACCEPT/REJECT pairs, 50/50 balanced
- **speculative-eval-v3-main** – 1,000 eval examples

## v2 Results (Measured – April 2026)

| Config | Accuracy | Cost | Notes |
|--------|----------|------|-------|
| A: 8B frozen | 0.400 | 1.00 | 8B zero-shot, never fine-tuned |
| B: 1.7B cheap | **0.510** | **0.15** | Pareto optimal |
| C: cheap + 8B rejector | 0.400 | 1.25 | 0% accept rate |
| D: cheap + 4B RM | 0.510 | 0.25 | 21.5% accept, all scores negative |
| E: multi-proposal n=3 | 0.420 | 0.75 | RM picks best of 3 |

### Key Failures

1. Training/eval format mismatch: `"Action: X\n<reason>"` vs conversational `messages`
2. Unequal comparison: 1.7B fine-tuned, 8B frozen zero-shot
3. Verifier was a reward model, not an ACCEPT/REJECT classifier
4. No positive ACCEPT examples in verifier training

## v3 Redesign

1. Chat-template training format = eval format
2. Both 1.7B and 8B fine-tuned identically (same LoRA config, same data)
3. Verifier trained as a binary ACCEPT/REJECT SFT classifier on balanced data
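
The balanced-data point can be made concrete. Below is a minimal sketch of building 50/50 ACCEPT/REJECT pairs, assuming each SFT example carries a conversation context and a gold action; the field names and helper are illustrative, not the actual schema or builder for speculative-verifier-v3-main. One ACCEPT plus one REJECT pair per example is consistent with 17,488 pairs from 8,744 SFT examples.

```python
import random


def build_verifier_pairs(examples, action_pool):
    """Expand each (context, gold action) example into one ACCEPT pair and
    one REJECT pair, yielding a 50/50 balanced classification dataset."""
    pairs = []
    for ex in examples:
        # Positive: the gold action should be ACCEPTed.
        pairs.append({"context": ex["context"],
                      "proposal": ex["action"],
                      "label": "ACCEPT"})
        # Negative: a different action from the pool should be REJECTed.
        wrong = random.choice([a for a in action_pool if a != ex["action"]])
        pairs.append({"context": ex["context"],
                      "proposal": wrong,
                      "label": "REJECT"})
    return pairs
```

Ensuring both labels appear for every context directly addresses v2's key failure #4 (no positive ACCEPT examples in verifier training).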

## How to Complete

### Step 1: Train missing models (A100-large, ~2h total)

```bash
python train_verifier_v3.py  # 4B verifier, ~30min
python train_sft_v3.py --model Qwen/Qwen3-8B --hub-id speculative-proposer-v3-8b  # ~1.5h
```

### Step 2: Evaluate (a10g-largex2 or A100-large)

```bash
python eval_runner_v3.py
```

### Step 3: Results

Output: `eval_results_v3.json` on narcolepticchicken/speculative-tool-actions

## Deliverables Checklist

- [x] Datasets: speculative-sft-v3-main, speculative-verifier-v3-main, speculative-eval-v3-main
- [x] Proposer 1.7B: speculative-proposer-v3-1.7b (trained)
- [ ] Verifier 4B: speculative-verifier-v3-4b (needs training)
- [ ] Proposer 8B: speculative-proposer-v3-8b (needs training)
- [x] Training scripts + eval runner
- [ ] Full ablation report + cost-quality frontier