Speculative Tool Actions – Related Work
Foundational
SpecInfer (2023) – Accelerating Generative LLM Serving with Tree-based Speculative Inference (arxiv:2305.09781). Introduced token-tree speculative decoding: multiple draft tokens from small models are verified in parallel by the large model. Basis for subsequent tree-based speculative decoding work.
SuffixDecoding (2024) – SuffixDecoding: Speeding Up Large Language Model Inference with Tree-structured Suffix-based Drafting (arxiv:2411.04975). Applied speculative decoding to agentic workloads by caching action suffixes and reusing them as drafts. Achieved up to 5.3× speedup on tool-calling benchmarks.
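The suffix-reuse idea can be sketched as a small cache (class and method names here are ours for illustration, not SuffixDecoding's actual API): index the suffixes of completed action trajectories, then draft the most frequent continuation of the longest suffix that matches the current prefix.

```python
from collections import defaultdict

class SuffixDraftCache:
    """Illustrative suffix-based draft cache (not SuffixDecoding's implementation)."""

    def __init__(self, max_suffix_len=4):
        self.max_suffix_len = max_suffix_len
        # maps a suffix tuple -> list of actions observed to follow it
        self.continuations = defaultdict(list)

    def record(self, actions):
        """Index every suffix (up to max_suffix_len) of a completed trajectory."""
        for i in range(len(actions) - 1):
            for k in range(1, self.max_suffix_len + 1):
                if i + 1 - k < 0:
                    break
                suffix = tuple(actions[i + 1 - k : i + 1])
                self.continuations[suffix].append(actions[i + 1])

    def draft(self, prefix):
        """Draft from the longest matching suffix; None if nothing matches."""
        for k in range(min(self.max_suffix_len, len(prefix)), 0, -1):
            suffix = tuple(prefix[-k:])
            if suffix in self.continuations:
                nexts = self.continuations[suffix]
                return max(set(nexts), key=nexts.count)  # most frequent continuation
        return None
```

For example, after recording a trajectory like `["search", "visit", "visit", "summarize"]`, a prefix ending in `["search", "visit"]` drafts `"visit"`.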
Heterogeneous Speculation for Agents (2025-2026)
DualSpec (Mar 2026) – DualSpec: Accelerating Deep Research Agents via Dual-Process Action Speculation (arxiv:2603.07416). Closest work to ours. Uses heterogeneous speculation: large model handles high-entropy "Search" actions; small model drafts low-entropy "Visit" actions. Semantic verifier (prompt-based, not trained) accepts/rejects drafts. 1.33-3.28× end-to-end speedup with no pass@1 loss on GAIA, XBench-DeepSearch, Seal-0. Key insight: entropy-based action partitioning – some actions need System 2, others are fine with System 1.
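The entropy-based partitioning insight can be illustrated with a minimal sketch (the threshold value and routing interface are assumptions, not DualSpec's implementation): near-deterministic action distributions go to the cheap drafter, high-entropy ones go straight to the large model.

```python
import math

def entropy(probs):
    """Shannon entropy (bits) of an action probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def route_action(action_probs, high_entropy_threshold=1.0):
    """Route high-entropy (System 2) steps to the large model; the rest
    can be drafted by the small model. Threshold is illustrative."""
    return "large" if entropy(action_probs) > high_entropy_threshold else "small"
```

A sharply peaked distribution such as `[0.95, 0.03, 0.02]` routes to `"small"`, while a flat one such as `[0.4, 0.35, 0.25]` routes to `"large"`.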
Our contribution vs DualSpec: we train a separate verifier model (SFT on ACCEPT/REJECT pairs) instead of using prompt-based critics; we expect this to be both faster (a single classifier forward pass instead of a full critic generation) and more accurate. We also use a single cheap model for all actions rather than partitioning by type.
DSP (Aug 2025) – Dynamic Speculative Agent Planning (arxiv:2509.01920). Uses online RL to predict how many speculative steps the drafter can produce correctly. Models optimal k as varying from 1 to 5 even within a single task (mean variance 1.46). ~2× latency reduction with 30% lower total cost.
Our contribution vs DSP: DSP uses exact action matching; we train a learned verifier. DSP predicts k dynamically; we use single-action proposal + verify (k=1 by design). The cost breakdown analysis in DSP directly motivates our approach: draft prompt tokens dominate waste cost.
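Our single-action propose-then-verify loop (k=1) reduces to a few lines; this is a sketch with hypothetical function names, not our actual implementation: the drafter proposes one action, the trained verifier classifies it in one forward pass, and the large model is invoked only on REJECT.

```python
def next_action(state, drafter, verifier, large_model):
    """Propose-then-verify with k=1 (illustrative interfaces)."""
    draft = drafter(state)                  # cheap single-action proposal
    if verifier(state, draft) == "ACCEPT":  # one classifier forward pass
        return draft, "drafted"
    return large_model(state), "fallback"   # pay full cost only on reject
```

Because only one action is drafted per step, a rejection wastes at most one draft prompt, which is the cost DSP's breakdown identifies as dominant.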
SpecEyes (Mar 2026) – SpecEyes: Accelerating Agentic Multimodal LLMs (arxiv:2603.23483). For multimodal agents: lightweight MLLM drafts answers; "cognitive gating" via answer separability score decides accept/fallback. 1.1-3.35× speedup with +6.7% accuracy gain.
Small Models for Tool Use
TinyAgent (Sep 2024) – TinyAgent: Function Calling at the Edge (arxiv:2409.00608). TinyLlama-1.1B and Wizard-2-7B match or surpass GPT-4-Turbo on function calling via SFT on LLMCompiler-format traces + Tool RAG. Demonstrates that small models can be extremely capable for tool use when trained properly.
SLM for Agentic Systems Survey (Oct 2025) – Small Language Models for Agentic Systems (arxiv:2510.03847). Formalizes the SLM-default/LLM-fallback pattern with uncertainty-aware routing and verifier cascades. Recommends schema-first prompting, type-safe function registries, and LoRA adaptation.
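The SLM-default/LLM-fallback pattern with uncertainty-aware routing reduces to a short cascade; the threshold and model stubs below are illustrative assumptions, not the survey's reference design.

```python
def cascade(query, slm, llm, confidence_threshold=0.8):
    """Illustrative SLM-default / LLM-fallback cascade.
    slm returns (call, confidence); llm returns a call."""
    call, confidence = slm(query)           # small model answers first
    if confidence >= confidence_threshold:  # uncertainty-aware routing
        return call, "slm"
    return llm(query), "llm"                # escalate when unsure
```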
Learned Routers and Verifiers
RouteLLM (Jun 2024) – RouteLLM: Learning to Route LLMs with Preference Data (arxiv:2406.18665). Trains routers (BERT, causal LLM, matrix factorization) on Chatbot Arena preference data. >2× cost reduction with no quality degradation. BERT router achieves best cost-quality tradeoff.
Internal Representation Hallucination Detection (Jan 2026) – Internal Representations as Indicators of Hallucinations in Agent Tool Selection (arxiv:2601.05214). 86.4% accuracy detecting tool-calling hallucinations in a single forward pass via 2-layer MLP on final hidden states. Validates the approach of using separate classifiers to verify tool-call quality.
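The single-forward-pass probe idea can be sketched in pure Python (dimensions and random weights below are illustrative; the paper trains the probe on labeled tool calls): a 2-layer MLP maps the final hidden state to a hallucination probability.

```python
import math
import random

class HiddenStateProbe:
    """Illustrative 2-layer MLP over a final hidden state.
    Weights are random here; the actual probe would be trained."""

    def __init__(self, hidden_dim, probe_dim=8, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.gauss(0, 0.1) for _ in range(probe_dim)]
                   for _ in range(hidden_dim)]
        self.w2 = [rng.gauss(0, 0.1) for _ in range(probe_dim)]

    def score(self, hidden_state):
        """Hallucination probability from one forward pass (sigmoid output)."""
        # hidden layer with ReLU
        h = [max(0.0, sum(x * w for x, w in zip(hidden_state, col)))
             for col in zip(*self.w1)]
        logit = sum(hi * w for hi, w in zip(h, self.w2))
        return 1.0 / (1.0 + math.exp(-logit))
```

The appeal for our setting is the cost profile: one matrix-vector pass over an already-computed hidden state, versus generating a natural-language critique.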
How Our Work Differs
| Aspect | Prior Work | Our Work |
|---|---|---|
| Speculation target | Tokens (SpecInfer), actions+reasoning (DualSpec), multi-step plans (DSP) | Single next action type |
| Verifier type | Exact match (DSP), prompt-based critic (DualSpec), confidence heuristic (SpecEyes) | Trained SFT classifier |
| Model sizes | 72B+8B or 32B+4B asymmetric pairs | 8B + 1.7B + 4B verifier |
| Training data | Proprietary (DualSpec, DSP) or synthesized (TinyAgent) | ToolBench-derived, open |
| Safety focus | None | Explicit BLOCKED action + unsafe-action avoidance metric |
Novel contribution: to our knowledge, the first system to use a trained SFT verifier for speculative tool-action proposal verification. All prior work uses exact matching, prompt-based critics, or confidence heuristics. By design, a single verifier forward pass should give a better accuracy-to-latency tradeoff than generating a critique with a prompted LLM.