# Speculative Tool Actions: Complete Project Report
**Goal:** Test whether speculative decoding transfers from token prediction to agent action prediction.
**Hypothesis:** A cheap model (1.7B) proposes the next agent action; a verifier decides whether to use the cheap proposal or fall back to an expensive model (8B). If most proposals are accepted, we save cost without sacrificing accuracy.
## Architecture
```
Conversation Context
          │
          ▼
┌─────────────────────┐      ┌──────────────────────────┐
│  Cheap Proposer     │      │  Strong Fallback (8B)    │
│  (Qwen3-1.7B LoRA)  │      │  Only called on REJECT   │
│  ~15% cost          │      │  ~100% cost              │
└─────────────────────┘      └──────────────────────────┘
          │ proposal                      ▲
          ▼                               │
┌─────────────────────┐                   │
│  Verifier           │──── REJECT ───────┘
│  (Qwen3-4B LoRA)    │
│  ACCEPT/REJECT      │
│  ~5% cost           │
└─────────────────────┘
          │
          ▼ ACCEPT
  Use cheap proposal
```
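The verified pipeline above (Config D in the table below) can be sketched as a single dispatch step. This is a minimal illustration, not the project's code: `propose_cheap`, `verify`, and `propose_strong` are hypothetical stand-ins for calls to the 1.7B proposer, 4B verifier, and 8B fallback.

```python
# Relative costs from the report's cost model.
COST_CHEAP, COST_VERIFIER, COST_STRONG = 0.15, 0.05, 1.00

def speculative_step(context, propose_cheap, verify, propose_strong):
    """Return (action, cost) for one agent turn.

    The cheap proposer and verifier always run; the strong model
    is invoked only when the verifier rejects the proposal.
    """
    proposal = propose_cheap(context)
    cost = COST_CHEAP + COST_VERIFIER
    if verify(context, proposal):          # verifier says ACCEPT
        return proposal, cost
    return propose_strong(context), cost + COST_STRONG  # REJECT -> fallback
```

An accepted turn costs 0.20 relative units; a rejected turn costs 1.20, slightly more than calling the strong model directly.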
## Five Configurations
| Config | Name | Description |
|--------|------|-------------|
| A | Always Strong | 8B fine-tuned proposer on every turn |
| B | Cheap Only | 1.7B LoRA proposer on every turn |
| C | Cheap + Strong Verifier | 1.7B proposes → 8B accepts/rejects → fallback to 8B on REJECT |
| D | Cheap + 4B Verifier | 1.7B proposes → 4B LoRA verifier → fallback to 8B on REJECT |
| E | Multi-Proposal Rerank | 1.7B generates N=3 diverse proposals → verifier picks best |
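Config E differs from C/D in that the verifier ranks rather than gates. An illustrative sketch; the function names and the scalar-score interface are assumptions, not the project's actual API:

```python
def rerank_step(context, propose_diverse, score, n=3):
    """Sample n diverse proposals from the cheap model;
    return the one the verifier scores highest (no strong fallback)."""
    proposals = propose_diverse(context, n)
    return max(proposals, key=lambda p: score(context, p))
```

Note that E never falls back to the 8B model, so its cost is bounded by n cheap calls plus verification, but its accuracy is capped by what the 1.7B proposer can produce.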
## Cost Model (relative to 8B inference)
| Model | Cost | Notes |
|-------|------|-------|
| 8B (strong) | 1.00 | Baseline |
| 1.7B (cheap) | 0.15 | LoRA adapter, bf16 |
| 4B (verifier) | 0.05 | Lightweight, ~5 tokens generated |
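Under this cost model, the expected per-turn cost of a verified pipeline (Configs C/D) is the always-paid cheap and verifier costs plus the strong cost weighted by the reject rate. A sketch of that accounting (an assumption about how costs compose, not a measured quantity):

```python
def expected_cost(accept_rate, c_cheap=0.15, c_verifier=0.05, c_strong=1.00):
    """Expected per-turn cost: proposer and verifier always run;
    the strong model runs only on the (1 - accept_rate) rejected turns."""
    return c_cheap + c_verifier + (1 - accept_rate) * c_strong
```

With these defaults the pipeline beats always-strong (cost 1.00) exactly when the accept rate exceeds 0.20, since 0.15 + 0.05 + (1 - a) * 1.00 < 1.00 requires a > 0.20.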
## Models
### Trained (have LoRA weights on Hub)
| Model | Size | Role | Hub URL | Status |
|-------|------|------|---------|--------|
| v2 Proposer | 1.7B LoRA | Cheap proposer | [speculative-proposer-qwen3-1.7b](https://hf.co/narcolepticchicken/speculative-proposer-qwen3-1.7b) | 2.3GB adapter |
| v2 Verifier | 4B LoRA | Reward model | [speculative-verifier-qwen3-4b](https://hf.co/narcolepticchicken/speculative-verifier-qwen3-4b) | 45MB adapter |
| v3 Proposer 1.7B | 1.7B LoRA | Cheap proposer | [speculative-proposer-v3-1.7b](https://hf.co/narcolepticchicken/speculative-proposer-v3-1.7b) | 66.5MB adapter |
| v3 Verifier 4B | 4B LoRA | ACCEPT/REJECT | [speculative-verifier-v3-4b](https://hf.co/narcolepticchicken/speculative-verifier-v3-4b) | MISSING (no adapter weights) |
| v3 Proposer 8B | 8B LoRA | Strong proposer | – | Not yet trained |
### Required to Complete
- **narcolepticchicken/speculative-verifier-v3-4b**: needs training (script: `train_verifier_v3.py`)
- **narcolepticchicken/speculative-proposer-v3-8b**: needs training (script: `train_sft_v3.py`)
## Datasets
### v3 (Current: chat-template format)
- **speculative-sft-v3-main**: 8,744 examples (8,306 train / 438 test)
- **speculative-verifier-v3-main**: 17,488 ACCEPT/REJECT pairs, 50/50 balanced
- **speculative-eval-v3-main**: 1,000 eval examples
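For concreteness, a verifier training record might look like the following. The field names and label encoding here are illustrative assumptions, not the actual dataset schema:

```python
# Hypothetical shape of one ACCEPT/REJECT pair from the balanced verifier set.
example = {
    "messages": [
        # Context plus the cheap model's proposed action, as one user turn.
        {"role": "user", "content": "<conversation context>\nProposed action: search(query)"},
        # The verifier's SFT target is the literal token ACCEPT or REJECT.
        {"role": "assistant", "content": "ACCEPT"},
    ],
    "label": 1,  # hypothetical convenience field: 1 = ACCEPT, 0 = REJECT
}
```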
## v2 Results (Measured, April 2026)
| Config | Accuracy | Cost | Notes |
|--------|----------|------|-------|
| A: 8B frozen | 0.400 | 1.00 | 8B zero-shot, never fine-tuned |
| B: 1.7B cheap | **0.510** | **0.15** | Pareto optimal |
| C: cheap + 8B rejector | 0.400 | 1.25 | 0% accept rate |
| D: cheap + 4B RM | 0.510 | 0.25 | 21.5% accept, all scores negative |
| E: multi-proposal n=3 | 0.420 | 0.75 | RM picks best of 3 |
### Key Failures
1. Training-eval format mismatch: `"Action: X\n<reason>"` vs conversational `messages`
2. Unequal comparison: 1.7B fine-tuned, 8B frozen zero-shot
3. Verifier was reward model, not ACCEPT/REJECT classifier
4. No positive ACCEPT examples in verifier training
## v3 Redesign
1. Training uses the same chat-template format as eval (fixes the v2 format mismatch)
2. Both 1.7B and 8B fine-tuned identically (same LoRA config, same data)
3. Verifier trained as a binary ACCEPT/REJECT SFT classifier on balanced data
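The first fix can be shown in miniature: train and eval share one `messages` structure, instead of v2's `"Action: X\n<reason>"` plain string at train time versus conversational `messages` at eval time. The helper name below is hypothetical:

```python
def to_messages(context, action=None):
    """Build one chat-template example; the same function serves
    training (with an assistant target) and eval (prompt only)."""
    msgs = [{"role": "user", "content": context}]
    if action is not None:
        msgs.append({"role": "assistant", "content": action})
    return msgs

train_example = to_messages("ctx", "search(query)")  # SFT pair
eval_prompt = to_messages("ctx")                     # identical prefix, no target
```

Because the eval prompt is a strict prefix of the training example, the proposer sees the same formatting at inference as it did during fine-tuning.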
## How to Complete
### Step 1: Train missing models (A100-large, ~2h total)
```bash
python train_verifier_v3.py # 4B verifier, ~30min
python train_sft_v3.py --model Qwen/Qwen3-8B --hub-id speculative-proposer-v3-8b # ~1.5h
```
### Step 2: Evaluate (a10g-largex2 or A100-large)
```bash
python eval_runner_v3.py
```
### Step 3: Results
Output: `eval_results_v3.json` on narcolepticchicken/speculative-tool-actions
## Deliverables Checklist
- [x] Dataset: speculative-sft-v3-main, speculative-verifier-v3-main, speculative-eval-v3-main
- [x] Proposer 1.7B: speculative-proposer-v3-1.7b (trained)
- [ ] Verifier 4B: speculative-verifier-v3-4b (NEEDS TRAINING)
- [ ] Proposer 8B: speculative-proposer-v3-8b (NEEDS TRAINING)
- [x] Training scripts + eval runner
- [ ] Full ablation report + cost-quality frontier