# Speculative Tool Actions – Complete Project Report
**Goal:** Test whether speculative decoding transfers from token prediction to agent action prediction.
**Hypothesis:** A cheap model (1.7B) proposes the next agent action; a verifier decides whether to use the cheap proposal or fall back to an expensive model (8B). If most proposals are accepted, we save cost without sacrificing accuracy.
## Architecture
```
Conversation Context
         │
         ▼
┌──────────────────────┐    ┌────────────────────────┐
│ Cheap Proposer       │    │ Strong Fallback (8B)   │
│ (Qwen3-1.7B LoRA)    │    │ Only called on REJECT  │
│ ~15% cost            │    │ ~100% cost             │
└──────────────────────┘    └────────────────────────┘
         │ proposal                     ▲
         ▼                              │
┌──────────────────────┐                │
│ Verifier             │── REJECT ──────┘
│ (Qwen3-4B LoRA)      │
│ ACCEPT/REJECT        │
│ ~5% cost             │
└──────────────────────┘
         │
         ▼ ACCEPT
  Use cheap proposal
```
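The routing logic in the diagram can be sketched as follows; `propose_cheap`, `verify`, and `propose_strong` are hypothetical stand-ins for the actual model calls, so this is a sketch of the control flow and cost accounting, not the project's implementation:

```python
from dataclasses import dataclass
from typing import Callable

# Relative inference costs from the cost model table below
CHEAP_COST, VERIFIER_COST, STRONG_COST = 0.15, 0.05, 1.00

@dataclass
class RoutedAction:
    action: str
    cost: float
    accepted: bool

def route(context: str,
          propose_cheap: Callable[[str], str],
          verify: Callable[[str, str], bool],
          propose_strong: Callable[[str], str]) -> RoutedAction:
    """One turn of the verified pipeline (Config D): the 1.7B proposes,
    the 4B verifier gates, and the 8B is called only on REJECT."""
    proposal = propose_cheap(context)
    cost = CHEAP_COST + VERIFIER_COST          # always paid
    if verify(context, proposal):              # ACCEPT -> keep cheap proposal
        return RoutedAction(proposal, cost, True)
    fallback = propose_strong(context)         # REJECT -> pay for the 8B
    return RoutedAction(fallback, cost + STRONG_COST, False)
```

On an ACCEPT the turn costs 0.20; on a REJECT it costs 1.20, which is why the accept rate drives everything in the results below.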
## Five Configurations
| Config | Name | Description |
|--------|------|-------------|
| A | Always Strong | 8B fine-tuned proposer on every turn |
| B | Cheap Only | 1.7B LoRA proposer on every turn |
| C | Cheap + Strong Verifier | 1.7B proposes → 8B accepts/rejects → fallback to 8B on REJECT |
| D | Cheap + 4B Verifier | 1.7B proposes → 4B LoRA verifier → fallback to 8B on REJECT |
| E | Multi-Proposal Rerank | 1.7B generates N=3 diverse proposals → verifier picks best |
## Cost Model (relative to 8B inference)
| Model | Cost | Notes |
|-------|------|-------|
| 8B (strong) | 1.00 | Baseline |
| 1.7B (cheap) | 0.15 | LoRA adapter, bf16 |
| 4B (verifier) | 0.05 | Lightweight, ~5 tokens generated |
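Given these relative costs, the expected per-turn cost of the verified pipeline depends only on the accept rate; a back-of-envelope sketch of that arithmetic (not the accounting used in the results tables):

```python
def expected_cost(accept_rate: float,
                  cheap: float = 0.15,
                  verifier: float = 0.05,
                  strong: float = 1.00) -> float:
    """Every turn pays the cheap proposer and the verifier;
    REJECTed turns additionally pay the strong fallback."""
    return cheap + verifier + (1.0 - accept_rate) * strong

# Break-even vs. always calling the 8B (cost 1.00):
# 0.20 + (1 - a) * 1.00 <= 1.00  <=>  a >= 0.20,
# i.e. the pipeline only pays off above roughly a 20% accept rate.
```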
## Models
### Trained (have LoRA weights on Hub)
| Model | Size | Role | Hub URL | Status |
|-------|------|------|---------|--------|
| v2 Proposer | 1.7B LoRA | Cheap proposer | [speculative-proposer-qwen3-1.7b](https://hf.co/narcolepticchicken/speculative-proposer-qwen3-1.7b) | 2.3GB adapter |
| v2 Verifier | 4B LoRA | Reward model | [speculative-verifier-qwen3-4b](https://hf.co/narcolepticchicken/speculative-verifier-qwen3-4b) | 45MB adapter |
| v3 Proposer 1.7B | 1.7B LoRA | Cheap proposer | [speculative-proposer-v3-1.7b](https://hf.co/narcolepticchicken/speculative-proposer-v3-1.7b) | 66.5MB adapter |
| v3 Verifier 4B | 4B LoRA | ACCEPT/REJECT | [speculative-verifier-v3-4b](https://hf.co/narcolepticchicken/speculative-verifier-v3-4b) | MISSING – no adapter weights |
| v3 Proposer 8B | 8B LoRA | Strong proposer | – | Not yet trained |
### Required to Complete
- **narcolepticchicken/speculative-verifier-v3-4b** – needs training (script: `train_verifier_v3.py`)
- **narcolepticchicken/speculative-proposer-v3-8b** – needs training (script: `train_sft_v3.py`)
## Datasets
### v3 (Current – Chat-template format)
- **speculative-sft-v3-main** – 8,744 examples (8,306 train / 438 test)
- **speculative-verifier-v3-main** – 17,488 ACCEPT/REJECT pairs, 50/50 balanced
- **speculative-eval-v3-main** – 1,000 eval examples
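The 17,488 balanced pairs are exactly two per SFT example, which suggests each gold action was paired with one wrong action; a hypothetical sketch of such a pairing (the field names and sampling scheme are illustrative, not the actual dataset schema):

```python
import random

def build_verifier_pairs(sft_examples: list[dict], seed: int = 0) -> list[dict]:
    """For each SFT example, emit one ACCEPT pair (the gold action) and
    one REJECT pair (a wrong action borrowed from another example)."""
    rng = random.Random(seed)
    gold_actions = [ex["action"] for ex in sft_examples]
    pairs = []
    for ex in sft_examples:
        pairs.append({"context": ex["context"],
                      "proposal": ex["action"], "label": "ACCEPT"})
        wrong = rng.choice([a for a in gold_actions if a != ex["action"]])
        pairs.append({"context": ex["context"],
                      "proposal": wrong, "label": "REJECT"})
    return pairs
```

With one ACCEPT and one REJECT per example, 8,744 SFT examples would yield the 17,488 pairs at an exact 50/50 balance.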
## v2 Results (Measured – April 2026)
| Config | Accuracy | Cost | Notes |
|--------|----------|------|-------|
| A: 8B frozen | 0.400 | 1.00 | 8B zero-shot, never fine-tuned |
| B: 1.7B cheap | **0.510** | **0.15** | Pareto optimal |
| C: cheap + 8B verifier | 0.400 | 1.25 | 0% accept rate |
| D: cheap + 4B RM | 0.510 | 0.25 | 21.5% accept, all scores negative |
| E: multi-proposal n=3 | 0.420 | 0.75 | RM picks best of 3 |
### Key Failures
1. Training-eval format mismatch: training used `"Action: X\n<reason>"` strings while eval used conversational `messages`
2. Unequal comparison: 1.7B fine-tuned, 8B frozen zero-shot
3. Verifier was reward model, not ACCEPT/REJECT classifier
4. No positive ACCEPT examples in verifier training
## v3 Redesign
1. Chat-template training = eval format
2. Both 1.7B and 8B fine-tuned identically (same LoRA config, same data)
3. Verifier as binary ACCEPT/REJECT SFT classifier on balanced data
## How to Complete
### Step 1: Train missing models (A100-large, ~2h total)
```bash
python train_verifier_v3.py # 4B verifier, ~30min
python train_sft_v3.py --model Qwen/Qwen3-8B --hub-id speculative-proposer-v3-8b # ~1.5h
```
### Step 2: Evaluate (a10g-largex2 or A100-large)
```bash
python eval_runner_v3.py
```
### Step 3: Results
Output: `eval_results_v3.json` on narcolepticchicken/speculative-tool-actions
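Once `eval_results_v3.json` exists, the cost-quality frontier in the checklist below can be read off by filtering dominated configs; a sketch assuming a simple `{config: {"accuracy": ..., "cost": ...}}` schema (the actual JSON layout may differ):

```python
import json

def pareto_frontier(results: dict[str, dict]) -> list[str]:
    """Return the configs that no other config dominates, i.e. none is
    at least as accurate AND at most as costly, with a strict
    improvement on at least one axis."""
    frontier = []
    for name, r in results.items():
        dominated = any(
            o["accuracy"] >= r["accuracy"] and o["cost"] <= r["cost"]
            and (o["accuracy"] > r["accuracy"] or o["cost"] < r["cost"])
            for other, o in results.items() if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier)

# Usage once Step 2 has produced the file:
# frontier = pareto_frontier(json.load(open("eval_results_v3.json")))
```

Applied to the v2 numbers above, this returns only Config B, matching the "Pareto optimal" note in that table.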
## Deliverables Checklist
- [x] Dataset: speculative-sft-v3-main, speculative-verifier-v3-main, speculative-eval-v3-main
- [x] Proposer 1.7B: speculative-proposer-v3-1.7b (trained)
- [ ] Verifier 4B: speculative-verifier-v3-4b (NEEDS TRAINING)
- [ ] Proposer 8B: speculative-proposer-v3-8b (NEEDS TRAINING)
- [x] Training scripts + eval runner
- [ ] Full ablation report + cost-quality frontier