# Speculative Tool Actions: Complete Project Report
**Goal:** Test whether speculative decoding transfers from token prediction to agent action prediction.
**Hypothesis:** A cheap model (1.7B) proposes the next agent action; a verifier decides whether to use the cheap proposal or fall back to an expensive model (8B). If most proposals are accepted, we save cost without sacrificing accuracy.
## Architecture
```
Conversation Context
          │
          ▼
┌─────────────────────┐      ┌──────────────────────────┐
│  Cheap Proposer     │      │  Strong Fallback (8B)    │
│  (Qwen3-1.7B LoRA)  │      │  Only called on REJECT   │
│  ~15% cost          │      │  ~100% cost              │
└─────────────────────┘      └──────────────────────────┘
          │ proposal                      ▲
          ▼                               │
┌─────────────────────┐                   │
│  Verifier           │──── REJECT ───────┘
│  (Qwen3-4B LoRA)    │
│  ACCEPT/REJECT      │
│  ~5% cost           │
└─────────────────────┘
          │
          ▼ ACCEPT
  Use cheap proposal
```
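The verified pipeline above (Config D in the table below) can be sketched as a single dispatch step. This is a minimal illustration, not the project's code: `propose_cheap`, `verify`, and `propose_strong` are hypothetical stand-ins for calls to the 1.7B proposer, 4B verifier, and 8B fallback.

```python
# Relative costs from the report's cost model.
COST_CHEAP, COST_VERIFIER, COST_STRONG = 0.15, 0.05, 1.00

def speculative_step(context, propose_cheap, verify, propose_strong):
    """Return (action, cost) for one agent turn.

    The cheap proposer and verifier always run; the strong model
    is invoked only when the verifier rejects the proposal.
    """
    proposal = propose_cheap(context)
    cost = COST_CHEAP + COST_VERIFIER
    if verify(context, proposal):          # verifier says ACCEPT
        return proposal, cost
    return propose_strong(context), cost + COST_STRONG  # REJECT -> fallback
```

An accepted turn costs 0.20 relative units; a rejected turn costs 1.20, slightly more than calling the strong model directly.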
## Five Configurations
| Config | Name | Description |
|--------|------|-------------|
| A | Always Strong | 8B fine-tuned proposer on every turn |
| B | Cheap Only | 1.7B LoRA proposer on every turn |
| C | Cheap + Strong Verifier | 1.7B proposes → 8B accepts/rejects → fallback to 8B on REJECT |
| D | Cheap + 4B Verifier | 1.7B proposes → 4B LoRA verifier → fallback to 8B on REJECT |
| E | Multi-Proposal Rerank | 1.7B generates N=3 diverse proposals → verifier picks best |
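Config E differs from C/D in that the verifier ranks rather than gates. An illustrative sketch; the function names and the scalar-score interface are assumptions, not the project's actual API:

```python
def rerank_step(context, propose_diverse, score, n=3):
    """Sample n diverse proposals from the cheap model;
    return the one the verifier scores highest (no strong fallback)."""
    proposals = propose_diverse(context, n)
    return max(proposals, key=lambda p: score(context, p))
```

Note that E never falls back to the 8B model, so its cost is bounded by n cheap calls plus verification, but its accuracy is capped by what the 1.7B proposer can produce.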
## Cost Model (relative to 8B inference)
| Model | Cost | Notes |
|-------|------|-------|
| 8B (strong) | 1.00 | Baseline |
| 1.7B (cheap) | 0.15 | LoRA adapter, bf16 |
| 4B (verifier) | 0.05 | Lightweight, ~5 tokens generated |
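Under this cost model, the expected per-turn cost of a verified pipeline (Configs C/D) is the always-paid cheap and verifier costs plus the strong cost weighted by the reject rate. A sketch of that accounting (an assumption about how costs compose, not a measured quantity):

```python
def expected_cost(accept_rate, c_cheap=0.15, c_verifier=0.05, c_strong=1.00):
    """Expected per-turn cost: proposer and verifier always run;
    the strong model runs only on the (1 - accept_rate) rejected turns."""
    return c_cheap + c_verifier + (1 - accept_rate) * c_strong
```

With these defaults the pipeline beats always-strong (cost 1.00) exactly when the accept rate exceeds 0.20, since 0.15 + 0.05 + (1 - a) * 1.00 < 1.00 requires a > 0.20.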
## Models
### Trained (have LoRA weights on Hub)
| Model | Size | Role | Hub URL | Status |
|-------|------|------|---------|--------|
| v2 Proposer | 1.7B LoRA | Cheap proposer | [speculative-proposer-qwen3-1.7b](https://hf.co/narcolepticchicken/speculative-proposer-qwen3-1.7b) | 2.3GB adapter |
| v2 Verifier | 4B LoRA | Reward model | [speculative-verifier-qwen3-4b](https://hf.co/narcolepticchicken/speculative-verifier-qwen3-4b) | 45MB adapter |
| v3 Proposer 1.7B | 1.7B LoRA | Cheap proposer | [speculative-proposer-v3-1.7b](https://hf.co/narcolepticchicken/speculative-proposer-v3-1.7b) | 66.5MB adapter |
| v3 Verifier 4B | 4B LoRA | ACCEPT/REJECT | [speculative-verifier-v3-4b](https://hf.co/narcolepticchicken/speculative-verifier-v3-4b) | MISSING (no adapter weights) |
| v3 Proposer 8B | 8B LoRA | Strong proposer | – | Not yet trained |
### Required to Complete
- **narcolepticchicken/speculative-verifier-v3-4b**: needs training (script: `train_verifier_v3.py`)
- **narcolepticchicken/speculative-proposer-v3-8b**: needs training (script: `train_sft_v3.py`)
## Datasets
### v3 (Current: chat-template format)
- **speculative-sft-v3-main**: 8,744 examples (8,306 train / 438 test)
- **speculative-verifier-v3-main**: 17,488 ACCEPT/REJECT pairs, 50/50 balanced
- **speculative-eval-v3-main**: 1,000 eval examples
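For concreteness, a verifier training record might look like the following. The field names and label encoding here are illustrative assumptions, not the actual dataset schema:

```python
# Hypothetical shape of one ACCEPT/REJECT pair from the balanced verifier set.
example = {
    "messages": [
        # Context plus the cheap model's proposed action, as one user turn.
        {"role": "user", "content": "<conversation context>\nProposed action: search(query)"},
        # The verifier's SFT target is the literal token ACCEPT or REJECT.
        {"role": "assistant", "content": "ACCEPT"},
    ],
    "label": 1,  # hypothetical convenience field: 1 = ACCEPT, 0 = REJECT
}
```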
## v2 Results (Measured, April 2026)
| Config | Accuracy | Cost | Notes |
|--------|----------|------|-------|
| A: 8B frozen | 0.400 | 1.00 | 8B zero-shot, never fine-tuned |
| B: 1.7B cheap | **0.510** | **0.15** | Pareto optimal |
| C: cheap + 8B rejector | 0.400 | 1.25 | 0% accept rate |
| D: cheap + 4B RM | 0.510 | 0.25 | 21.5% accept, all scores negative |
| E: multi-proposal n=3 | 0.420 | 0.75 | RM picks best of 3 |
### Key Failures
1. Training-eval format mismatch: `"Action: X\n<reason>"` vs conversational `messages`
2. Unequal comparison: 1.7B fine-tuned, 8B frozen zero-shot
3. Verifier was reward model, not ACCEPT/REJECT classifier
4. No positive ACCEPT examples in verifier training
## v3 Redesign
1. Training uses the same chat-template format as eval (fixes the v2 format mismatch)
2. Both 1.7B and 8B fine-tuned identically (same LoRA config, same data)
3. Verifier trained as a binary ACCEPT/REJECT SFT classifier on balanced data
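The first fix can be shown in miniature: train and eval share one `messages` structure, instead of v2's `"Action: X\n<reason>"` plain string at train time versus conversational `messages` at eval time. The helper name below is hypothetical:

```python
def to_messages(context, action=None):
    """Build one chat-template example; the same function serves
    training (with an assistant target) and eval (prompt only)."""
    msgs = [{"role": "user", "content": context}]
    if action is not None:
        msgs.append({"role": "assistant", "content": action})
    return msgs

train_example = to_messages("ctx", "search(query)")  # SFT pair
eval_prompt = to_messages("ctx")                     # identical prefix, no target
```

Because the eval prompt is a strict prefix of the training example, the proposer sees the same formatting at inference as it did during fine-tuning.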
## How to Complete
### Step 1: Train missing models (A100-large, ~2h total)
```bash
python train_verifier_v3.py # 4B verifier, ~30min
python train_sft_v3.py --model Qwen/Qwen3-8B --hub-id speculative-proposer-v3-8b # ~1.5h
```
### Step 2: Evaluate (a10g-largex2 or A100-large)
```bash
python eval_runner_v3.py
```
### Step 3: Results
Output: `eval_results_v3.json` on narcolepticchicken/speculative-tool-actions
## Deliverables Checklist
- [x] Dataset: speculative-sft-v3-main, speculative-verifier-v3-main, speculative-eval-v3-main
- [x] Proposer 1.7B: speculative-proposer-v3-1.7b (trained)
- [ ] Verifier 4B: speculative-verifier-v3-4b (NEEDS TRAINING)
- [ ] Proposer 8B: speculative-proposer-v3-8b (NEEDS TRAINING)
- [x] Training scripts + eval runner
- [ ] Full ablation report + cost-quality frontier