SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding
Abstract
Speculative Decoding performance is data-dependent, so accurate evaluation requires diverse, representative workloads that existing benchmarks lack; SPEED-Bench provides standardized assessment across semantic domains and serving regimes.
Speculative Decoding (SD) has emerged as a critical technique for accelerating Large Language Model (LLM) inference. Unlike deterministic system optimizations, SD performance is inherently data-dependent, meaning that diverse and representative workloads are essential for accurately measuring its effectiveness. Existing benchmarks suffer from limited task diversity, inadequate support for throughput-oriented evaluation, and a reliance on high-level implementations that fail to reflect production environments. To address this, we introduce SPEED-Bench, a comprehensive suite designed to standardize SD evaluation across diverse semantic domains and realistic serving regimes. SPEED-Bench offers a carefully curated Qualitative data split, selected by prioritizing semantic diversity across the data samples. Additionally, it includes a Throughput data split, allowing speedup evaluation across a range of concurrencies, from latency-sensitive low-batch settings to throughput-oriented high-load scenarios. By integrating with production engines like vLLM and TensorRT-LLM, SPEED-Bench allows practitioners to analyze system behaviors often masked by other benchmarks. We highlight this by quantifying how synthetic inputs overestimate real-world throughput, identifying batch-size dependent optimal draft lengths and biases in low-diversity data, and analyzing the caveats of vocabulary pruning in state-of-the-art drafters. We release SPEED-Bench to establish a unified evaluation standard for practical comparisons of SD algorithms.
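To make the "batch-size dependent optimal draft lengths" claim concrete, the sketch below uses the standard analytical model of SD speedup from the speculative-decoding literature (not code from this paper): with an i.i.d. per-token acceptance rate `alpha` and draft length `k`, a verification step yields `(1 - alpha**(k+1)) / (1 - alpha)` target tokens in expectation. The cost ratio `c` (draft pass vs. target pass) is a hypothetical parameter standing in for the serving regime.

```python
# Illustrative model of speculative-decoding speedup (assumption: the
# standard i.i.d. acceptance model; not SPEED-Bench's measurement code).

def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens accepted per target verification pass."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def speedup(alpha: float, k: int, c: float) -> float:
    """Idealized speedup: k draft passes (each costing c relative to the
    target model) plus one target verification pass per step."""
    return expected_tokens(alpha, k) / (k * c + 1)

def best_draft_length(alpha: float, c: float, k_max: int = 16) -> int:
    """Draft length that maximizes the idealized speedup."""
    return max(range(1, k_max + 1), key=lambda k: speedup(alpha, k, c))
```

Under this toy model, raising the relative draft cost `c` (as happens at high batch sizes, where spare accelerator compute shrinks) pushes the optimal draft length down, which is the qualitative effect the abstract attributes to batch size.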
Community
SPEED-Bench (SPEculative Evaluation Dataset) is a unified benchmark designed to evaluate speculative decoding (SD) across diverse semantic domains and realistic serving regimes, using production-grade inference engines. It measures both acceptance-rate characteristics and end-to-end throughput, enabling fair, reproducible, and robust comparisons between SD strategies. SPEED-Bench introduces a benchmarking ecosystem for SD. It combines two purpose-built dataset splits and a unified measurement framework, each designed to capture a different aspect of SD behavior:
A "Qualitative" data split, optimized for semantic diversity and designed to measure speculation quality (drafter accuracy) across domains.
A "Throughput" data split, constructed to evaluate system-level performance across various input sequence lengths and high concurrency.
A unified measurement framework, integrated with production inference engines, that standardizes evaluation across systems.
Together, these components enable practitioners and researchers to analyze SD behavior that is often masked by existing benchmarks.
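A throughput-oriented evaluation of the kind described above boils down to sweeping concurrency levels against a serving engine and recording requests per second. The following is a minimal, self-contained sketch of that loop; the engine call is a stub (a sleep), whereas SPEED-Bench drives production engines such as vLLM and TensorRT-LLM.

```python
# Toy concurrency-sweep harness (illustrative; `generate_stub` is a
# placeholder for a real request to an inference server).
import time
from concurrent.futures import ThreadPoolExecutor

def generate_stub(prompt: str) -> str:
    time.sleep(0.01)  # stand-in for one engine round trip
    return prompt.upper()

def measure_throughput(prompts, concurrency: int) -> float:
    """Requests completed per second at a given concurrency level."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(generate_stub, prompts))
    return len(prompts) / (time.perf_counter() - start)

prompts = ["hello"] * 32
sweep = {c: measure_throughput(prompts, c) for c in (1, 4, 8)}
```

Plotting such a sweep from the latency-sensitive low-concurrency end to the high-load end is exactly the axis along which the Throughput split is meant to expose system behavior.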