speculative-tool-actions / ABLATION_REPORT_v2.md
# Speculative Tool Actions: Final Results
## Summary
We tested whether speculative decoding transfers from token prediction to agent action prediction. A cheap model (Qwen3-1.7B + LoRA) proposes the next agent action; verifiers decide whether to use the proposal or fall back to an expensive 8B model.
**Finding:** The speculative architecture failed to improve on the cheap proposer alone. Config B (1.7B cheap only, 51.0% accuracy, 0.15× cost) dominates every other configuration.
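The propose/verify/fallback control flow described above can be sketched in a few lines (the function name and callable interfaces are illustrative, not the repo's actual API):

```python
def next_action(context, proposer, verifier, strong_model):
    """Speculative action prediction: try the cheap proposer first,
    pay for the strong model only when the verifier rejects."""
    proposal = proposer(context)      # cheap 1.7B + LoRA forward pass
    if verifier(context, proposal):   # ACCEPT: keep the cheap proposal
        return proposal
    return strong_model(context)      # REJECT: fall back to the 8B model
```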
## Full Results
| Config | Accuracy | Cost (rel. to A) | ×Random | ×Majority | Notes |
|--------|----------|------|-------|------|-------|
| A | 0.400 | 1.000 | 3.6× | 1.7× | 8B strong (frozen, zero-shot) |
| B | **0.510** | **0.150** | **4.6×** | **2.1×** | 1.7B cheap (LoRA fine-tuned) |
| C | 0.400 | 1.250 | 3.6× | 1.7× | cheap + 8B verifier → 0% accept |
| D | 0.510 | 0.250 | 4.6× | 2.1× | cheap + 4B RM → 21.5% accept |
| E | 0.420 | 0.750 | 3.8× | 1.8× | multi-proposal (n=3) + RM |
Random baseline: 11.1% | Majority baseline (final_answer): 24.0%
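The two baseline-multiplier columns are just accuracy divided by the corresponding baseline; for Config B:

```python
RANDOM_BASELINE = 0.111    # random baseline from the table
MAJORITY_BASELINE = 0.240  # always predicting final_answer

acc_b = 0.510                      # Config B accuracy
x_rand = acc_b / RANDOM_BASELINE   # ~4.6, matching the table
x_maj = acc_b / MAJORITY_BASELINE  # 2.125, reported as 2.1x
```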
## Cost-Quality Frontier
```
Config B: cost=0.150 acc=0.510 ★ PARETO OPTIMAL
Config D: cost=0.250 acc=0.510 (dominated by B: same acc, higher cost)
Config E: cost=0.750 acc=0.420 (dominated by B on both axes)
Config A: cost=1.000 acc=0.400 (dominated by B on both axes)
Config C: cost=1.250 acc=0.400 (dominated by B on both axes)
```
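The dominance labels above follow from a standard Pareto check over (cost, accuracy); a quick sketch that reproduces them:

```python
# (cost, accuracy) per config, from the results table
configs = {
    "A": (1.000, 0.400),
    "B": (0.150, 0.510),
    "C": (1.250, 0.400),
    "D": (0.250, 0.510),
    "E": (0.750, 0.420),
}

def dominates(a, b):
    """a dominates b if a is no worse on both axes and strictly better on one."""
    (ca, aa), (cb, ab) = a, b
    return ca <= cb and aa >= ab and (ca < cb or aa > ab)

pareto = [k for k, v in configs.items()
          if not any(dominates(w, v) for x, w in configs.items() if x != k)]
# only Config B survives the check
```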
## Why It Failed
1. **Training-eval mismatch:** The proposer was fine-tuned on the `"Action: <type>\n<reason>"` format, while eval uses natural conversational `messages`, so the proposer never learned the real task structure.
2. **8B verifier (Config C) rejected every proposal:** It never learned to discriminate; it outputs REJECT for everything and falls back to its own (worse) zero-shot predictions.
3. **4B reward model (Config D/E) scores everything negatively:** Mean score -1.52. It weakly prefers some proposals (21.5% above threshold), but the threshold is arbitrary: accepted and rejected proposals have identical downstream accuracy.
4. **Multi-proposal (Config E) was worse than single proposal (Config B):** 42% vs. 51%. Sampling at temperature 0.8 added no useful diversity: the model generates the same action most of the time, so the "diverse" proposals collapse to one or two options, and the RM can't reliably pick the right one.
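The reward-model gating in Configs D/E reduces to a scalar cutoff over uniformly negative scores; a sketch of why that is fragile (the threshold value here is illustrative, not the repo's setting):

```python
def rm_gate(scores, threshold=-1.0):
    """Accept proposals whose RM score clears the threshold.
    When all scores are negative (mean ~ -1.52 in Config D) and score
    doesn't correlate with correctness, the cutoff just selects an
    arbitrary slice of proposals rather than the good ones."""
    return [s >= threshold for s in scores]

scores = [-1.9, -0.8, -1.6, -1.4, -0.9]  # all negative, like the observed RM
accepted = rm_gate(scores)               # accepts the least-negative slice
```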
## What Would Fix This
1. **Train on chat-template format:** Generate SFT data using `tokenizer.apply_chat_template()` with real conversational turns, not just "Action: X" format.
2. **Train the 8B model:** Fine-tune Qwen3-8B on the same SFT data for a fair scale comparison (current "strong model" is frozen zero-shot).
3. **Train the verifier properly:** The ACCEPT/REJECT verifier needs positive examples of correct proposals being accepted. Train it as a binary classifier on (proposal, context) β†’ ACCEPT/REJECT with balanced labels.
4. **Use a real agent benchmark:** An established eval such as BFCL (Berkeley Function Calling Leaderboard) would provide realistic and challenging action-prediction traces.
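For fix 1, each SFT example would be rendered through the tokenizer's own chat template rather than the bespoke `Action: X` string; a minimal sketch using the `transformers` `apply_chat_template` API (the helper name and message contents are illustrative):

```python
def render_sft_example(tokenizer, user_turn, assistant_turn):
    """Render one training example with the model's chat template so the
    SFT text matches the conversational `messages` format used at eval."""
    messages = [
        {"role": "user", "content": user_turn},
        {"role": "assistant", "content": assistant_turn},
    ]
    # tokenize=False returns the templated string, ready for SFT data files
    return tokenizer.apply_chat_template(messages, tokenize=False)

# usage: tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")
```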
## Deliverables
- **Dataset:** https://huggingface.co/datasets/narcolepticchicken/speculative-actions-eval (500 examples)
- **Proposer:** https://huggingface.co/narcolepticchicken/speculative-proposer-qwen3-1.7b
- **Verifier:** https://huggingface.co/narcolepticchicken/speculative-verifier-qwen3-4b
- **Code + results:** https://huggingface.co/narcolepticchicken/speculative-tool-actions