---
title: SpecPrefill on Unified Memory
emoji: "\U0001F4C4"
colorFrom: blue
colorTo: purple
sdk: static
pinned: false
license: mit
---
# SpecPrefill on Unified Memory: Cross-Architecture Sparse Prefill for Large Language Models on Apple Silicon
**Author:** David Green ([@Thump604](https://github.com/Thump604))
**Paper:** [specprefill-v2.pdf](specprefill-v2.pdf) | [specprefill.pdf](specprefill.pdf) | **Source:** [specprefill.tex](specprefill.tex)
**DOI:** [10.5281/zenodo.19120919](https://doi.org/10.5281/zenodo.19120919)
**Related:** [vllm-mlx PR #180](https://github.com/waybarrios/vllm-mlx/pull/180) (merged upstream)
## Abstract
Long-context prefill is the dominant latency bottleneck for local LLM inference: a 64K-token prompt on Qwen3.5-122B takes 7 minutes before the first token appears. SpecPrefill, attention-based sparse prefill guided by a draft model, reduces time-to-first-token (TTFT) by 3.71-5.45x across 8K-128K tokens on Apple Silicon unified memory, cutting 128K prefill from 19.3 minutes to 3.5 minutes with a 1.4 GB draft model.
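
The core idea is that a small draft model's attention identifies which prompt positions matter, so the large target model only prefills those. Below is a minimal sketch of that selection step, assuming the draft model exposes per-head attention weights over the prompt; the function and array names are hypothetical and not the paper's or vllm-mlx's API.

```python
# Sketch of attention-based token selection for sparse prefill.
# Assumption: `attn` holds the draft model's attention weights; this is an
# illustration of the technique, not the implementation from the paper.
import numpy as np

def select_prefill_tokens(attn: np.ndarray, keep_ratio: float = 0.25) -> np.ndarray:
    """Pick the prompt positions that receive the most attention mass.

    attn: (heads, query_len, key_len) attention weights from the draft model.
    Returns sorted key-position indices to keep for the target model's prefill.
    """
    # Total attention each key position receives, summed over heads and queries.
    importance = attn.sum(axis=(0, 1))              # shape: (key_len,)
    k = max(1, int(keep_ratio * importance.shape[0]))
    keep = np.argpartition(importance, -k)[-k:]     # indices of the top-k positions
    return np.sort(keep)                            # keep original token order

# Toy usage: random weights stand in for the draft model's attention.
rng = np.random.default_rng(0)
attn = rng.random((8, 64, 64))
attn /= attn.sum(axis=-1, keepdims=True)            # normalize per query
print(select_prefill_tokens(attn, keep_ratio=0.25))
```

The target model then runs prefill over only the selected positions, which is where the reported TTFT reduction comes from.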