# Speculative Tool Actions — Complete Project Report

**Goal:** Test whether speculative decoding transfers from token prediction to agent action prediction.

**Hypothesis:** A cheap model (1.7B) proposes the next agent action; a verifier decides whether to use the cheap proposal or fall back to an expensive model (8B). If most proposals are accepted, we save cost without sacrificing accuracy.

## Architecture

```
Conversation Context
    │
    ▼
┌─────────────────────┐      ┌──────────────────────────┐
│  Cheap Proposer     │      │  Strong Fallback (8B)     │
│  (Qwen3-1.7B LoRA)  │──────│  Only called on REJECT    │
│  ~15% cost          │      │  ~100% cost               │
└─────────────────────┘      └──────────────────────────┘
    │ proposal                        ▲
    ▼                                 │
┌─────────────────────┐               │
│  Verifier            │──── REJECT ──┘
│  (Qwen3-4B LoRA)     │
│  ACCEPT/REJECT       │
│  ~5% cost            │
└─────────────────────┘
    │
    ▼ ACCEPT
   Use cheap proposal
```

## Five Configurations

| Config | Name | Description |
|--------|------|-------------|
| A | Always Strong | 8B fine-tuned proposer on every turn |
| B | Cheap Only | 1.7B LoRA proposer on every turn |
| C | Cheap + Strong Verifier | 1.7B proposes → 8B accepts/rejects → fallback to 8B on REJECT |
| D | Cheap + 4B Verifier | 1.7B proposes → 4B LoRA verifier → fallback to 8B on REJECT |
| E | Multi-Proposal Rerank | 1.7B generates N=3 diverse proposals → verifier picks best |

## Cost Model (relative to 8B inference)

| Model | Cost | Notes |
|-------|------|-------|
| 8B (strong) | 1.00 | Baseline |
| 1.7B (cheap) | 0.15 | LoRA adapter, bf16 |
| 4B (verifier) | 0.05 | Lightweight, ~5 tokens generated |

## Models

### Trained (have LoRA weights on Hub)

| Model | Size | Role | Hub URL | Status |
|-------|------|------|---------|--------|
| v2 Proposer | 1.7B LoRA | Cheap proposer | [speculative-proposer-qwen3-1.7b](https://hf.co/narcolepticchicken/speculative-proposer-qwen3-1.7b) | 2.3GB adapter |
| v2 Verifier | 4B LoRA | Reward model | [speculative-verifier-qwen3-4b](https://hf.co/narcolepticchicken/speculative-verifier-qwen3-4b) | 45MB adapter |
| v3 Proposer 1.7B | 1.7B LoRA | Cheap proposer | [speculative-proposer-v3-1.7b](https://hf.co/narcolepticchicken/speculative-proposer-v3-1.7b) | 66.5MB adapter |
| v3 Verifier 4B | 4B LoRA | ACCEPT/REJECT | [speculative-verifier-v3-4b](https://hf.co/narcolepticchicken/speculative-verifier-v3-4b) | MISSING — no adapter weights |
| v3 Proposer 8B | 8B LoRA | Strong proposer | — | Not yet trained |

### Required to Complete
- **narcolepticchicken/speculative-verifier-v3-4b** — needs training (script: `train_verifier_v3.py`)
- **narcolepticchicken/speculative-proposer-v3-8b** — needs training (script: `train_sft_v3.py`)

## Datasets

### v3 (Current — Chat-template format)

- **speculative-sft-v3-main** — 8,744 examples (8,306 train / 438 test)
- **speculative-verifier-v3-main** — 17,488 ACCEPT/REJECT pairs, 50/50 balanced
- **speculative-eval-v3-main** — 1,000 eval examples

## v2 Results (Measured — April 2026)

| Config | Accuracy | Cost | Notes |
|--------|----------|------|-------|
| A: 8B frozen | 0.400 | 1.00 | 8B zero-shot, never fine-tuned |
| B: 1.7B cheap | **0.510** | **0.15** | Pareto optimal |
| C: cheap + 8B rejector | 0.400 | 1.25 | 0% accept rate |
| D: cheap + 4B RM | 0.510 | 0.25 | 21.5% accept, all scores negative |
| E: multi-proposal n=3 | 0.420 | 0.75 | RM picks best of 3 |

### Key Failures
1. Training-eval format mismatch: `"Action: X\n<reason>"` vs conversational `messages`
2. Unequal comparison: 1.7B fine-tuned, 8B frozen zero-shot
3. Verifier was reward model, not ACCEPT/REJECT classifier
4. No positive ACCEPT examples in verifier training

## v3 Redesign
1. Chat-template training = eval format
2. Both 1.7B and 8B fine-tuned identically (same LoRA config, same data)
3. Verifier as binary ACCEPT/REJECT SFT classifier on balanced data

## How to Complete

### Step 1: Train missing models (A100-large, ~2h total)
```bash
python train_verifier_v3.py  # 4B verifier, ~30min
python train_sft_v3.py --model Qwen/Qwen3-8B --hub-id speculative-proposer-v3-8b  # ~1.5h
```

### Step 2: Evaluate (a10g-largex2 or A100-large)
```bash
python eval_runner_v3.py
```

### Step 3: Results
Output: `eval_results_v3.json` on narcolepticchicken/speculative-tool-actions

## Deliverables Checklist

- [x] Dataset: speculative-sft-v3-main, speculative-verifier-v3-main, speculative-eval-v3-main
- [x] Proposer 1.7B: speculative-proposer-v3-1.7b (trained)
- [ ] Verifier 4B: speculative-verifier-v3-4b (NEEDS TRAINING)
- [ ] Proposer 8B: speculative-proposer-v3-8b (NEEDS TRAINING)
- [x] Training scripts + eval runner
- [ ] Full ablation report + cost-quality frontier