---
title: SpecPrefill on Unified Memory
emoji: 📄
colorFrom: blue
colorTo: purple
sdk: static
pinned: false
license: mit
---

# SpecPrefill on Unified Memory: Cross-Architecture Sparse Prefill for Large Language Models on Apple Silicon

**Author:** David Green (@Thump604)

**Paper:** specprefill-v2.pdf | specprefill.pdf | **Source:** specprefill.tex

**DOI:** 10.5281/zenodo.19120919

**Related:** vllm-mlx PR #180 (merged upstream)

## Abstract

Long-context prefill is the dominant latency bottleneck for local LLM inference: a 64K-token prompt on Qwen3.5-122B takes 7 minutes before the first token appears. SpecPrefill -- attention-based sparse prefill using a draft model -- reduces TTFT by 3.71-5.45x across 8K-128K tokens on Apple Silicon unified memory, cutting 128K prefill from 19.3 minutes to 3.5 minutes with a 1.4 GB draft model.