SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding
Abstract
Speculative Decoding performance is data-dependent, so accurate evaluation requires diverse, representative workloads that existing benchmarks lack; SPEED-Bench provides standardized assessment across semantic domains and serving regimes.
Speculative Decoding (SD) has emerged as a critical technique for accelerating Large Language Model (LLM) inference. Unlike deterministic system optimizations, SD performance is inherently data-dependent, meaning that diverse and representative workloads are essential for accurately measuring its effectiveness. Existing benchmarks suffer from limited task diversity, inadequate support for throughput-oriented evaluation, and a reliance on high-level implementations that fail to reflect production environments. To address this, we introduce SPEED-Bench, a comprehensive suite designed to standardize SD evaluation across diverse semantic domains and realistic serving regimes. SPEED-Bench offers a carefully curated Qualitative data split, selected by prioritizing semantic diversity across the data samples. Additionally, it includes a Throughput data split, allowing speedup evaluation across a range of concurrencies, from latency-sensitive low-batch settings to throughput-oriented high-load scenarios. By integrating with production engines like vLLM and TensorRT-LLM, SPEED-Bench allows practitioners to analyze system behaviors often masked by other benchmarks. We highlight this by quantifying how synthetic inputs overestimate real-world throughput, identifying batch-size dependent optimal draft lengths and biases in low-diversity data, and analyzing the caveats of vocabulary pruning in state-of-the-art drafters. We release SPEED-Bench to establish a unified evaluation standard for practical comparisons of SD algorithms.
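To make the "batch-size dependent optimal draft lengths" claim concrete, the sketch below uses the standard analytical model of SD speedup from the speculative-decoding literature (not code from this paper): with an i.i.d. per-token acceptance rate `alpha` and draft length `k`, a verification step yields `(1 - alpha**(k+1)) / (1 - alpha)` target tokens in expectation. The cost ratio `c` (draft pass vs. target pass) is a hypothetical parameter standing in for the serving regime.

```python
# Illustrative model of speculative-decoding speedup (assumption: the
# standard i.i.d. acceptance model; not SPEED-Bench's measurement code).

def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens accepted per target verification pass."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def speedup(alpha: float, k: int, c: float) -> float:
    """Idealized speedup: k draft passes (each costing c relative to the
    target model) plus one target verification pass per step."""
    return expected_tokens(alpha, k) / (k * c + 1)

def best_draft_length(alpha: float, c: float, k_max: int = 16) -> int:
    """Draft length that maximizes the idealized speedup."""
    return max(range(1, k_max + 1), key=lambda k: speedup(alpha, k, c))
```

Under this toy model, raising the relative draft cost `c` (as happens at high batch sizes, where spare accelerator compute shrinks) pushes the optimal draft length down, which is the qualitative effect the abstract attributes to batch size.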
Community
SPEED-Bench (SPEculative Evaluation Dataset) is a unified benchmark designed to evaluate speculative decoding (SD) across diverse semantic domains and realistic serving regimes, using production-grade inference engines. It measures both acceptance-rate characteristics and end-to-end throughput, enabling fair, reproducible, and robust comparisons between SD strategies. SPEED-Bench introduces a benchmarking ecosystem for SD. It combines two purpose-built dataset splits and a unified measurement framework, each designed to capture a different aspect of SD behavior:
A "Qualitative" data split, optimized for semantic diversity and designed to measure speculation quality (drafter accuracy) across domains.
A "Throughput" data split, constructed to evaluate system-level performance across various input sequence lengths and high concurrency.
A unified measurement framework, integrated with production inference engines, that standardizes evaluation across systems.
Together, these components enable practitioners and researchers to analyze SD behavior that is often masked by existing benchmarks.
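A throughput-oriented evaluation of the kind described above boils down to sweeping concurrency levels against a serving engine and recording requests per second. The following is a minimal, self-contained sketch of that loop; the engine call is a stub (a sleep), whereas SPEED-Bench drives production engines such as vLLM and TensorRT-LLM.

```python
# Toy concurrency-sweep harness (illustrative; `generate_stub` is a
# placeholder for a real request to an inference server).
import time
from concurrent.futures import ThreadPoolExecutor

def generate_stub(prompt: str) -> str:
    time.sleep(0.01)  # stand-in for one engine round trip
    return prompt.upper()

def measure_throughput(prompts, concurrency: int) -> float:
    """Requests completed per second at a given concurrency level."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(generate_stub, prompts))
    return len(prompts) / (time.perf_counter() - start)

prompts = ["hello"] * 32
sweep = {c: measure_throughput(prompts, c) for c in (1, 4, 8)}
```

Plotting such a sweep from the latency-sensitive low-concurrency end to the high-load end is exactly the axis along which the Throughput split is meant to expose system behavior.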