
Speculative Tool Actions: Complete Project Report

Goal: Test whether speculative decoding transfers from token prediction to agent action prediction.

Hypothesis: A cheap model (1.7B) proposes the next agent action; a verifier decides whether to use the cheap proposal or fall back to an expensive model (8B). If most proposals are accepted, we save cost without sacrificing accuracy.

Architecture

```
Conversation Context
    │
    ▼
┌─────────────────────┐      ┌──────────────────────────┐
│  Cheap Proposer     │      │  Strong Fallback (8B)    │
│  (Qwen3-1.7B LoRA)  │──────│  Only called on REJECT   │
│  ~15% cost          │      │  ~100% cost              │
└─────────────────────┘      └──────────────────────────┘
    │ proposal                        ▲
    ▼                                 │
┌─────────────────────┐               │
│  Verifier           │──── REJECT ──┘
│  (Qwen3-4B LoRA)    │
│  ACCEPT/REJECT      │
│  ~5% cost           │
└─────────────────────┘
    │
    ▼ ACCEPT
   Use cheap proposal
```
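One decision turn of this pipeline can be sketched as follows. This is a minimal sketch: `propose_cheap`, `verify`, and `propose_strong` are hypothetical stand-ins for the 1.7B proposer, 4B verifier, and 8B fallback calls, and the cost constants are the relative figures from the cost model table.

```python
def speculative_turn(context, propose_cheap, verify, propose_strong):
    """Run one agent turn: use the cheap proposal if the verifier accepts it,
    otherwise fall back to the strong model. Returns (action, relative_cost)."""
    proposal = propose_cheap(context)           # 1.7B proposer, ~0.15x cost
    if verify(context, proposal) == "ACCEPT":   # 4B verifier, ~0.05x cost
        return proposal, 0.15 + 0.05
    # On REJECT, the strong model is paid on top of the wasted cheap + verify calls.
    return propose_strong(context), 0.15 + 0.05 + 1.00
```

Note that the cheap and verifier costs are sunk either way, so the design only pays off when REJECTs are rare.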

Five Configurations

| Config | Name | Description |
|--------|------|-------------|
| A | Always Strong | 8B fine-tuned proposer on every turn |
| B | Cheap Only | 1.7B LoRA proposer on every turn |
| C | Cheap + Strong Verifier | 1.7B proposes → 8B accepts/rejects → fallback to 8B on REJECT |
| D | Cheap + 4B Verifier | 1.7B proposes → 4B LoRA verifier → fallback to 8B on REJECT |
| E | Multi-Proposal Rerank | 1.7B generates N=3 diverse proposals → verifier picks best |
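Config E replaces the hard accept/reject gate with a rerank over several cheap samples. A minimal sketch, assuming the verifier can be queried for a scalar score per (context, proposal) pair (`sample_proposals` and `score` are hypothetical helpers):

```python
def rerank_turn(context, sample_proposals, score, n=3):
    """Config E sketch: draw n diverse cheap proposals and return the one
    the verifier scores highest. No strong-model fallback in this config."""
    proposals = sample_proposals(context, n)  # e.g. temperature-sampled from the 1.7B
    return max(proposals, key=lambda p: score(context, p))
```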

Cost Model (relative to 8B inference)

| Model | Cost | Notes |
|-------|------|-------|
| 8B (strong) | 1.00 | Baseline |
| 1.7B (cheap) | 0.15 | LoRA adapter, bf16 |
| 4B (verifier) | 0.05 | Lightweight, ~5 tokens generated |
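For the verified configurations (C and D), these constants compose into an expected per-turn cost: proposer plus verifier always run, and the strong model runs whenever the verifier rejects. A sketch, assuming every REJECT triggers exactly one fallback call (config C's 8B verifier would raise the 0.05 term):

```python
def expected_cost(p_accept, c_cheap=0.15, c_verify=0.05, c_strong=1.00):
    """Expected relative cost per turn: cheap + verify always run,
    the strong model only runs on the (1 - p_accept) rejected turns."""
    return c_cheap + c_verify + (1.0 - p_accept) * c_strong
```

With these defaults the floor is 0.20 at a 100% accept rate, and the pipeline undercuts always calling the 8B whenever more than 20% of proposals are accepted.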

Models

Trained (have LoRA weights on Hub)

| Model | Size | Role | Hub URL | Status |
|-------|------|------|---------|--------|
| v2 Proposer | 1.7B LoRA | Cheap proposer | speculative-proposer-qwen3-1.7b | 2.3GB adapter |
| v2 Verifier | 4B LoRA | Reward model | speculative-verifier-qwen3-4b | 45MB adapter |
| v3 Proposer 1.7B | 1.7B LoRA | Cheap proposer | speculative-proposer-v3-1.7b | 66.5MB adapter |
| v3 Verifier 4B | 4B LoRA | ACCEPT/REJECT | speculative-verifier-v3-4b | MISSING (no adapter weights) |
| v3 Proposer 8B | 8B LoRA | Strong proposer | - | Not yet trained |

Required to Complete

  • narcolepticchicken/speculative-verifier-v3-4b: needs training (script: train_verifier_v3.py)
  • narcolepticchicken/speculative-proposer-v3-8b: needs training (script: train_sft_v3.py)

Datasets

v3 (Current: chat-template format)

  • speculative-sft-v3-main: 8,744 examples (8,306 train / 438 test)
  • speculative-verifier-v3-main: 17,488 ACCEPT/REJECT pairs, 50/50 balanced
  • speculative-eval-v3-main: 1,000 eval examples

v2 Results (Measured, April 2026)

| Config | Accuracy | Cost | Notes |
|--------|----------|------|-------|
| A: 8B frozen | 0.400 | 1.00 | 8B zero-shot, never fine-tuned |
| B: 1.7B cheap | 0.510 | 0.15 | Pareto optimal |
| C: cheap + 8B rejector | 0.400 | 1.25 | 0% accept rate |
| D: cheap + 4B RM | 0.510 | 0.25 | 21.5% accept, all scores negative |
| E: multi-proposal n=3 | 0.420 | 0.75 | RM picks best of 3 |

Key Failures

  1. Training-eval format mismatch: the proposer was trained on "Action: X\n<reason>" strings but evaluated on conversational messages
  2. Unequal comparison: the 1.7B was fine-tuned while the 8B ran frozen, zero-shot
  3. The verifier was trained as a reward model, not as an ACCEPT/REJECT classifier
  4. Its training data contained no positive ACCEPT examples

v3 Redesign

  1. Training and evaluation share the same chat-template format
  2. Both the 1.7B and the 8B are fine-tuned identically (same LoRA config, same data)
  3. The verifier is trained as a binary ACCEPT/REJECT SFT classifier on balanced data
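Concretely, one verifier training example in this format might look like the following; the field names, prompt wording, and tool call are illustrative assumptions, not the actual dataset schema:

```python
# Hypothetical shape of one balanced ACCEPT/REJECT example: the conversation
# so far plus a candidate action, labeled with the token the 4B should emit.
example = {
    "messages": [
        {"role": "user", "content": "Book a table for two at 7pm."},
        {"role": "assistant",
         "content": "Proposed action: search_restaurants(time='19:00', party_size=2)"},
    ],
    "label": "ACCEPT",  # exactly half of the v3 verifier set is REJECT
}
```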

How to Complete

Step 1: Train missing models (A100-large, ~2h total)

```
python train_verifier_v3.py  # 4B verifier, ~30min
python train_sft_v3.py --model Qwen/Qwen3-8B --hub-id speculative-proposer-v3-8b  # ~1.5h
```

Step 2: Evaluate (a10g-largex2 or A100-large)

```
python eval_runner_v3.py
```

Step 3: Results

Output: eval_results_v3.json on narcolepticchicken/speculative-tool-actions

Deliverables Checklist

  • Dataset: speculative-sft-v3-main, speculative-verifier-v3-main, speculative-eval-v3-main
  • Proposer 1.7B: speculative-proposer-v3-1.7b (trained)
  • Verifier 4B: speculative-verifier-v3-4b (NEEDS TRAINING)
  • Proposer 8B: speculative-proposer-v3-8b (NEEDS TRAINING)
  • Training scripts + eval runner
  • Full ablation report + cost-quality frontier