# Speculative Tool Actions: Complete Project Report

**Goal:** Test whether speculative decoding transfers from token prediction to agent action prediction.

**Hypothesis:** A cheap model (1.7B) proposes the next agent action; a verifier decides whether to use the cheap proposal or fall back to an expensive model (8B). If most proposals are accepted, we save cost without sacrificing accuracy.

## Architecture

```
Conversation Context
          │
          ▼
┌─────────────────────┐      ┌──────────────────────────┐
│ Cheap Proposer      │      │ Strong Fallback (8B)     │
│ (Qwen3-1.7B LoRA)   │──────│ Only called on REJECT    │
│ ~15% cost           │      │ ~100% cost               │
└─────────────────────┘      └──────────────────────────┘
          │ proposal                 ▲
          ▼                          │
┌─────────────────────┐              │
│ Verifier            │──── REJECT ──┘
│ (Qwen3-4B LoRA)     │
│ ACCEPT/REJECT       │
│ ~5% cost            │
└─────────────────────┘
          │
          ▼ ACCEPT
  Use cheap proposal
```

## Five Configurations

| Config | Name | Description |
|--------|------|-------------|
| A | Always Strong | 8B fine-tuned proposer on every turn |
| B | Cheap Only | 1.7B LoRA proposer on every turn |
| C | Cheap + Strong Verifier | 1.7B proposes → 8B accepts/rejects → fallback to 8B on REJECT |
| D | Cheap + 4B Verifier | 1.7B proposes → 4B LoRA verifier → fallback to 8B on REJECT |
| E | Multi-Proposal Rerank | 1.7B generates N=3 diverse proposals → verifier picks best |

## Cost Model (relative to 8B inference)

| Model | Cost | Notes |
|-------|------|-------|
| 8B (strong) | 1.00 | Baseline |
| 1.7B (cheap) | 0.15 | LoRA adapter, bf16 |
| 4B (verifier) | 0.05 | Lightweight, ~5 tokens generated |

## Models

### Trained (have LoRA weights on Hub)

| Model | Size | Role | Hub URL | Status |
|-------|------|------|---------|--------|
| v2 Proposer | 1.7B LoRA | Cheap proposer | [speculative-proposer-qwen3-1.7b](https://hf.co/narcolepticchicken/speculative-proposer-qwen3-1.7b) | 2.3GB adapter |
| v2 Verifier | 4B LoRA | Reward model | [speculative-verifier-qwen3-4b](https://hf.co/narcolepticchicken/speculative-verifier-qwen3-4b) | 45MB adapter |
| v3 Proposer 1.7B | 1.7B LoRA | Cheap proposer | [speculative-proposer-v3-1.7b](https://hf.co/narcolepticchicken/speculative-proposer-v3-1.7b) | 66.5MB adapter |
| v3 Verifier 4B | 4B LoRA | ACCEPT/REJECT | [speculative-verifier-v3-4b](https://hf.co/narcolepticchicken/speculative-verifier-v3-4b) | MISSING: no adapter weights |
| v3 Proposer 8B | 8B LoRA | Strong proposer | (none) | Not yet trained |

### Required to Complete

- **narcolepticchicken/speculative-verifier-v3-4b**: needs training (script: `train_verifier_v3.py`)
- **narcolepticchicken/speculative-proposer-v3-8b**: needs training (script: `train_sft_v3.py`)

## Datasets

### v3 (Current, chat-template format)

- **speculative-sft-v3-main**: 8,744 examples (8,306 train / 438 test)
- **speculative-verifier-v3-main**: 17,488 ACCEPT/REJECT pairs, 50/50 balanced
- **speculative-eval-v3-main**: 1,000 eval examples

## v2 Results (Measured, April 2026)

| Config | Accuracy | Cost | Notes |
|--------|----------|------|-------|
| A: 8B frozen | 0.400 | 1.00 | 8B zero-shot, never fine-tuned |
| B: 1.7B cheap | **0.510** | **0.15** | Pareto optimal |
| C: cheap + 8B rejector | 0.400 | 1.25 | 0% accept rate |
| D: cheap + 4B RM | 0.510 | 0.25 | 21.5% accept, all scores negative |
| E: multi-proposal n=3 | 0.420 | 0.75 | RM picks best of 3 |

### Key Failures

1. Training-eval format mismatch: `"Action: X\n"` vs conversational `messages`
2. Unequal comparison: 1.7B fine-tuned, 8B frozen zero-shot
3. Verifier was reward model, not ACCEPT/REJECT classifier
4. No positive ACCEPT examples in verifier training

## v3 Redesign

1. Chat-template training = eval format
2. Both 1.7B and 8B fine-tuned identically (same LoRA config, same data)
3. Verifier as binary ACCEPT/REJECT SFT classifier on balanced data

## How to Complete

### Step 1: Train missing models (A100-large, ~2h total)

```bash
python train_verifier_v3.py  # 4B verifier, ~30min
python train_sft_v3.py --model Qwen/Qwen3-8B --hub-id speculative-proposer-v3-8b  # ~1.5h
```

### Step 2: Evaluate (a10g-largex2 or A100-large)

```bash
python eval_runner_v3.py
```

### Step 3: Results

Output: `eval_results_v3.json` on narcolepticchicken/speculative-tool-actions

## Deliverables Checklist

- [x] Dataset: speculative-sft-v3-main, speculative-verifier-v3-main, speculative-eval-v3-main
- [x] Proposer 1.7B: speculative-proposer-v3-1.7b (trained)
- [ ] Verifier 4B: speculative-verifier-v3-4b (NEEDS TRAINING)
- [ ] Proposer 8B: speculative-proposer-v3-8b (NEEDS TRAINING)
- [x] Training scripts + eval runner
- [ ] Full ablation report + cost-quality frontier
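
The propose, verify, fall back flow from the Architecture section (configs C and D) can be sketched in a few lines of Python. This is a minimal illustration, not the project's eval runner: the three models are abstracted as plain callables, the names `speculative_step` and `Decision` are hypothetical, and the default costs are the relative figures from the Cost Model table, assumed here to add linearly.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical abstractions: each model is just a callable.
Proposer = Callable[[str], str]        # context -> proposed action
Verifier = Callable[[str, str], bool]  # (context, proposal) -> True means ACCEPT


@dataclass
class Decision:
    action: str     # the action that will be executed
    accepted: bool  # True if the cheap proposal was used
    cost: float     # relative cost, in units of one 8B call


def speculative_step(
    context: str,
    cheap: Proposer,
    verifier: Verifier,
    strong: Proposer,
    cheap_cost: float = 0.15,     # from the Cost Model table
    verifier_cost: float = 0.05,  # from the Cost Model table
    strong_cost: float = 1.00,    # from the Cost Model table
) -> Decision:
    """Run one turn: cheap proposal, verifier check, strong fallback on REJECT."""
    proposal = cheap(context)
    cost = cheap_cost + verifier_cost  # always paid (assumed additive)
    if verifier(context, proposal):
        return Decision(action=proposal, accepted=True, cost=cost)
    # REJECT: discard the cheap proposal and pay for the strong model too.
    return Decision(action=strong(context), accepted=False, cost=cost + strong_cost)


# Toy usage with stub models standing in for the real ones.
decision = speculative_step(
    "user asks for the weather",
    cheap=lambda ctx: "call_tool:weather",
    verifier=lambda ctx, proposal: proposal.startswith("call_tool:"),
    strong=lambda ctx: "call_tool:search",
)
print(decision.action, decision.accepted)
```

Keeping the models behind callables means the same routing function covers config C (8B as verifier) and config D (4B verifier) by swapping the `verifier` argument and its cost.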
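
To make the hypothesis ("if most proposals are accepted, we save cost") concrete, the sketch below computes the expected per-turn cost of the cascade as a function of accept rate, using the Cost Model table. It assumes that the cheap proposer and verifier are paid on every turn and that the 8B fallback is paid once per REJECT. That is an assumption for illustration and is not necessarily the accounting behind the cost column of the v2 results table. The function names are hypothetical.

```python
def expected_cost(
    accept_rate: float,
    cheap_cost: float = 0.15,
    verifier_cost: float = 0.05,
    strong_cost: float = 1.00,
) -> float:
    """Expected relative cost per turn for a propose/verify/fallback cascade.

    Assumes costs add linearly and the strong model runs only on REJECT.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    return cheap_cost + verifier_cost + (1.0 - accept_rate) * strong_cost


def break_even_accept_rate(
    cheap_cost: float = 0.15,
    verifier_cost: float = 0.05,
    strong_cost: float = 1.00,
) -> float:
    """Accept rate above which the cascade is cheaper than always calling 8B.

    Solves cheap + verifier + (1 - a) * strong = strong for a.
    """
    return (cheap_cost + verifier_cost) / strong_cost


for rate in (0.0, 0.5, 0.9):
    print(f"accept_rate={rate:.1f} -> cost={expected_cost(rate):.2f}")
print(f"break-even accept rate: {break_even_accept_rate():.2f}")
```

Under these assumptions the cascade only beats config A on cost once the verifier accepts more than roughly 20% of proposals, and the saving grows linearly with the accept rate beyond that point.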
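
Config E (multi-proposal rerank) differs from C and D in that the verifier selects among several cheap proposals rather than gating a single one. A minimal sketch follows, assuming the verifier exposes a numeric score per proposal. The `sample` and `score` callables and the `rerank` name are hypothetical stand-ins, and how the real verifier scores proposals is not specified here.

```python
from typing import Callable, List


def rerank(
    context: str,
    sample: Callable[[str], str],        # hypothetical: draws one diverse proposal
    score: Callable[[str, str], float],  # hypothetical: verifier score, higher is better
    n: int = 3,                          # N=3 in config E
) -> str:
    """Draw n proposals from the cheap model and return the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    proposals: List[str] = [sample(context) for _ in range(n)]
    # max() returns the first proposal among ties, so ordering is deterministic.
    return max(proposals, key=lambda p: score(context, p))


# Toy usage: a fixed stream of proposals and a length-based stand-in score.
stream = iter(["noop", "call_tool:search", "ask_user"])
best = rerank("ctx", sample=lambda ctx: next(stream), score=lambda ctx, p: len(p))
print(best)
```

Note that this variant has no fallback to the 8B model, so its quality is bounded by the best of the N cheap proposals.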