# Speculative Tool Actions — Complete Project Report

**Goal:** Test whether speculative decoding transfers from token prediction to agent action prediction.

**Hypothesis:** A cheap model (1.7B) proposes the next agent action; a verifier decides whether to use the cheap proposal or fall back to an expensive model (8B). If most proposals are accepted, we save cost without sacrificing accuracy.

## Architecture

```
Conversation Context
    │
    ▼
┌─────────────────────┐      ┌──────────────────────────┐
│  Cheap Proposer     │      │  Strong Fallback (8B)    │
│  (Qwen3-1.7B LoRA)  │──────│  Only called on REJECT   │
│  ~15% cost          │      │  ~100% cost              │
└─────────────────────┘      └──────────────────────────┘
    │ proposal                        ▲
    ▼                                 │
┌─────────────────────┐               │
│  Verifier           │──── REJECT ───┘
│  (Qwen3-4B LoRA)    │
│  ACCEPT/REJECT      │
│  ~5% cost           │
└─────────────────────┘
    │
    ▼ ACCEPT
   Use cheap proposal
```
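The diagram above can be sketched as a single dispatch function. This is a minimal sketch, not the project's actual code: `propose_cheap`, `verify`, and `propose_strong` are hypothetical stand-ins for the LoRA inference calls, and the cost constants mirror the cost model below.

```python
# Relative per-call costs from the cost model (8B inference = 1.00).
COST_CHEAP, COST_VERIFY, COST_STRONG = 0.15, 0.05, 1.00

def next_action(context, propose_cheap, verify, propose_strong):
    """Return (action, relative_cost) for one agent turn."""
    proposal = propose_cheap(context)          # 1.7B LoRA proposer
    cost = COST_CHEAP + COST_VERIFY
    if verify(context, proposal) == "ACCEPT":  # 4B LoRA verifier
        return proposal, cost
    # REJECT: fall back to the strong 8B model for this turn.
    return propose_strong(context), cost + COST_STRONG
```

An accepted turn costs 0.20; a rejected turn costs 1.20, slightly more than calling the 8B model directly — which is why the accept rate drives the whole cost story.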

## Five Configurations

| Config | Name | Description |
|--------|------|-------------|
| A | Always Strong | 8B fine-tuned proposer on every turn |
| B | Cheap Only | 1.7B LoRA proposer on every turn |
| C | Cheap + Strong Verifier | 1.7B proposes → 8B accepts/rejects → fallback to 8B on REJECT |
| D | Cheap + 4B Verifier | 1.7B proposes → 4B LoRA verifier → fallback to 8B on REJECT |
| E | Multi-Proposal Rerank | 1.7B generates N=3 diverse proposals → verifier picks best |
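Config E differs from C/D in that there is no fallback: the verifier reranks rather than gates. A minimal sketch, with `sample_cheap` and `score` as hypothetical stand-ins for the 1.7B sampler and the verifier's scoring call:

```python
def rerank_best(context, sample_cheap, score, n=3):
    """Config E: sample n diverse proposals from the cheap model,
    return the one the verifier scores highest."""
    proposals = [sample_cheap(context) for _ in range(n)]
    return max(proposals, key=lambda p: score(context, p))
```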

## Cost Model (relative to 8B inference)

| Model | Cost | Notes |
|-------|------|-------|
| 8B (strong) | 1.00 | Baseline |
| 1.7B (cheap) | 0.15 | LoRA adapter, bf16 |
| 4B (verifier) | 0.05 | Lightweight, ~5 tokens generated |
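One plausible reading of this table is that the verified configurations' expected per-turn cost is a linear function of the accept rate `p` (the measured v2 costs below may use a different accounting, so treat this as a back-of-envelope model):

```python
def expected_cost(p_accept, c_cheap=0.15, c_verify=0.05, c_strong=1.00):
    """Expected per-turn cost: always pay proposer + verifier,
    pay the strong model only on the (1 - p_accept) rejected turns."""
    return c_cheap + c_verify + (1.0 - p_accept) * c_strong
```

Under these constants the pipeline ranges from 0.20 (everything accepted) to 1.20 (everything rejected), and beats always-calling-the-8B only when the accept rate exceeds 20%.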

## Models

### Trained (have LoRA weights on Hub)

| Model | Size | Role | Hub URL | Status |
|-------|------|------|---------|--------|
| v2 Proposer | 1.7B LoRA | Cheap proposer | [speculative-proposer-qwen3-1.7b](https://hf.co/narcolepticchicken/speculative-proposer-qwen3-1.7b) | 2.3GB adapter |
| v2 Verifier | 4B LoRA | Reward model | [speculative-verifier-qwen3-4b](https://hf.co/narcolepticchicken/speculative-verifier-qwen3-4b) | 45MB adapter |
| v3 Proposer 1.7B | 1.7B LoRA | Cheap proposer | [speculative-proposer-v3-1.7b](https://hf.co/narcolepticchicken/speculative-proposer-v3-1.7b) | 66.5MB adapter |
| v3 Verifier 4B | 4B LoRA | ACCEPT/REJECT | [speculative-verifier-v3-4b](https://hf.co/narcolepticchicken/speculative-verifier-v3-4b) | MISSING — no adapter weights |
| v3 Proposer 8B | 8B LoRA | Strong proposer | — | Not yet trained |

### Required to Complete
- **narcolepticchicken/speculative-verifier-v3-4b** — needs training (script: `train_verifier_v3.py`)
- **narcolepticchicken/speculative-proposer-v3-8b** — needs training (script: `train_sft_v3.py`)

## Datasets

### v3 (Current — Chat-template format)

- **speculative-sft-v3-main** — 8,744 examples (8,306 train / 438 test)
- **speculative-verifier-v3-main** — 17,488 ACCEPT/REJECT pairs, 50/50 balanced
- **speculative-eval-v3-main** — 1,000 eval examples

## v2 Results (Measured — April 2026)

| Config | Accuracy | Cost | Notes |
|--------|----------|------|-------|
| A: 8B frozen | 0.400 | 1.00 | 8B zero-shot, never fine-tuned |
| B: 1.7B cheap | **0.510** | **0.15** | Pareto optimal |
| C: cheap + 8B rejector | 0.400 | 1.25 | 0% accept rate |
| D: cheap + 4B RM | 0.510 | 0.25 | 21.5% accept, all scores negative |
| E: multi-proposal n=3 | 0.420 | 0.75 | RM picks best of 3 |

### Key Failures
1. Training-eval format mismatch: trained on `"Action: X\n<reason>"` completions, but evaluated on conversational `messages`
2. Unequal comparison: the 1.7B proposer was fine-tuned while the 8B baseline ran frozen, zero-shot
3. The verifier was trained as a reward model, not as an ACCEPT/REJECT classifier
4. Verifier training contained no positive ACCEPT examples
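To make failure 1 concrete, here is what the mismatch looks like side by side. The field contents are invented for illustration; only the two formats are from the report:

```python
# v2 trained on a flat completion string, but evaluated through a chat
# template -- the model never saw the eval format during training.
v2_train_target = "Action: search_flights\n<reason>user asked for flights</reason>"

# v3 trains and evaluates on the same chat-template messages structure.
v3_example = {
    "messages": [
        {"role": "user", "content": "Find me a flight to Tokyo."},
        {"role": "assistant", "content": "Action: search_flights"},
    ]
}
```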

## v3 Redesign
1. Chat-template training = eval format
2. Both 1.7B and 8B fine-tuned identically (same LoRA config, same data)
3. Verifier as binary ACCEPT/REJECT SFT classifier on balanced data
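One way the balanced verifier data could be built (the structure is assumed, not taken from the actual `train_verifier_v3.py`): each SFT example yields one ACCEPT pair from the gold action and one REJECT pair from a wrong action, giving the 50/50 split and exactly 2x the SFT count, consistent with 8,744 → 17,488 above.

```python
def make_verifier_pairs(context, gold_action, wrong_action):
    """Build one ACCEPT and one REJECT chat-template example per SFT row."""
    def pair(action, label):
        return {"messages": [
            {"role": "user", "content": f"{context}\nProposed action: {action}"},
            {"role": "assistant", "content": label},
        ]}
    return [pair(gold_action, "ACCEPT"), pair(wrong_action, "REJECT")]
```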

## How to Complete

### Step 1: Train missing models (A100-large, ~2h total)
```bash
python train_verifier_v3.py  # 4B verifier, ~30min
python train_sft_v3.py --model Qwen/Qwen3-8B --hub-id speculative-proposer-v3-8b  # ~1.5h
```

### Step 2: Evaluate (a10g-largex2 or A100-large)
```bash
python eval_runner_v3.py
```

### Step 3: Results
Output: `eval_results_v3.json` on narcolepticchicken/speculative-tool-actions

## Deliverables Checklist

- [x] Dataset: speculative-sft-v3-main, speculative-verifier-v3-main, speculative-eval-v3-main
- [x] Proposer 1.7B: speculative-proposer-v3-1.7b (trained)
- [ ] Verifier 4B: speculative-verifier-v3-4b (NEEDS TRAINING)
- [ ] Proposer 8B: speculative-proposer-v3-8b (NEEDS TRAINING)
- [x] Training scripts + eval runner
- [ ] Full ablation report + cost-quality frontier