arxiv:2605.10453

SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding

Published on May 11 · Submitted by Aleksandr Samarin on May 12

Abstract

SlimSpec improves speculative decoding efficiency by using low-rank parameterization to compress the drafter's language model head while maintaining full vocabulary support and achieving significant speedup with minimal pipeline changes.

AI-generated summary

Speculative decoding speeds up autoregressive generation in Large Language Models (LLMs) through a two-step procedure in which a lightweight draft model proposes tokens that the target model then verifies in a single forward pass. Although the drafter network is small in modern architectures, its LM-head still projects onto a large vocabulary, which makes it one of the major computational bottlenecks. Prior work has predominantly addressed this issue via static or dynamic vocabulary truncation. While these methods mitigate the bottleneck, they introduce extra complexity, such as special vocabulary curation, sophisticated inference-time logic, or modifications to the training setup. In this paper, we propose SlimSpec, a low-rank parameterization of the drafter's LM-head that compresses the inner representation rather than the output, preserving full vocabulary support. We evaluate our method with the EAGLE-3 drafter across three target models and diverse benchmarks in both latency- and throughput-bound inference regimes. SlimSpec achieves 4-5× acceleration over the standard LM-head architecture while maintaining a competitive acceptance length, surpassing existing methods by up to 8-9% in end-to-end speedup. Our method requires minimal adjustments to training and inference pipelines. Combined with these speedup improvements, this makes SlimSpec a strong alternative across a wide variety of draft LM-head architectures.
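To make the idea of compressing the inner representation rather than the output concrete, here is a minimal pure-Python sketch. It assumes the low-rank head is a product of two projections (hidden size to a small rank, then rank to the full vocabulary); the abstract does not spell out the exact parameterization, so the class `LowRankHead`, the helper functions, and all sizes below are hypothetical illustrations rather than the authors' implementation.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Build a rows x cols matrix with uniform random entries in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def dense_head_macs(hidden_dim, vocab_size):
    """Multiply-accumulates per token for a standard hidden -> vocab projection."""
    return hidden_dim * vocab_size


def low_rank_head_macs(hidden_dim, vocab_size, rank):
    """Multiply-accumulates per token for hidden -> rank -> vocab."""
    return rank * (hidden_dim + vocab_size)


class LowRankHead:
    """Hypothetical two-stage head: compress the hidden state, then score every token."""

    def __init__(self, hidden_dim, vocab_size, rank, seed=0):
        rng = random.Random(seed)
        self.down = random_matrix(rank, hidden_dim, rng)  # rank x hidden_dim
        self.up = random_matrix(vocab_size, rank, rng)    # vocab_size x rank

    def logits(self, hidden_state):
        compressed = matvec(self.down, hidden_state)  # inner representation shrinks to `rank`
        return matvec(self.up, compressed)            # output still covers the full vocabulary


# Toy forward pass: the vocabulary is not truncated, only the inner width is.
head = LowRankHead(hidden_dim=8, vocab_size=20, rank=2)
toy_logits = head.logits([0.5] * 8)
print(len(toy_logits))

# Assumed, illustrative sizes (not taken from the paper).
HIDDEN_DIM, VOCAB_SIZE, RANK = 4096, 128_000, 256
dense = dense_head_macs(HIDDEN_DIM, VOCAB_SIZE)
low_rank = low_rank_head_macs(HIDDEN_DIM, VOCAB_SIZE, RANK)
print(f"dense: {dense} MACs, low-rank: {low_rank} MACs")
```

The operation counts are a back-of-the-envelope comparison of the head alone under the assumed sizes. They are not a measurement and should not be read as the speedups reported in the abstract, which come from the authors' own evaluation.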

Community

Paper submitter

For questions contact astrlrd@nebius.com


