---
title: SpecPrefill on Unified Memory
emoji: 📄
colorFrom: blue
colorTo: purple
sdk: static
pinned: false
license: mit
---
# SpecPrefill on Unified Memory: Cross-Architecture Sparse Prefill for Large Language Models on Apple Silicon

**Author:** David Green ([@Thump604](https://github.com/Thump604))

**Paper:** [specprefill-v2.pdf](specprefill-v2.pdf) | [specprefill.pdf](specprefill.pdf) | **Source:** [specprefill.tex](specprefill.tex)

**DOI:** [10.5281/zenodo.19120919](https://doi.org/10.5281/zenodo.19120919)

**Related:** [vllm-mlx PR #180](https://github.com/waybarrios/vllm-mlx/pull/180) (merged upstream)

## Abstract
Long-context prefill is the dominant latency bottleneck for local LLM inference: a 64K-token prompt on Qwen3.5-122B takes 7 minutes before the first token appears. SpecPrefill, attention-based sparse prefill guided by a draft model, reduces time to first token (TTFT) by 3.71-5.45x across 8K-128K tokens on Apple Silicon unified memory, cutting 128K prefill from 19.3 minutes to 3.5 minutes with a 1.4 GB draft model.
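The idea behind the speedup, in brief: a small draft model runs a cheap prefill first, and the attention mass it assigns to each prompt token is used to pick the subset of tokens the large target model actually prefills. Below is a minimal sketch of that selection step; the function name, the summing over heads and queries, and the `keep_ratio` parameter are illustrative assumptions, not the exact heuristic from the paper or the vllm-mlx integration.

```python
# Minimal sketch of attention-based sparse prefill token selection.
# Hypothetical names: `draft_attn` stands in for the draft model's
# attention probabilities; the real SpecPrefill pipeline and its keep
# ratio are defined in the paper and the vllm-mlx PR.
import numpy as np

def select_prefill_tokens(draft_attn: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Rank prompt positions by the attention mass the draft model
    assigns to them, and keep the top `keep_ratio` fraction.

    draft_attn: (num_heads, num_queries, num_keys) attention probabilities
                from the draft model's prefill pass.
    Returns the indices of prompt tokens the target model should prefill,
    sorted back into prompt order.
    """
    # Total attention each key token receives, summed over heads and queries.
    importance = draft_attn.sum(axis=(0, 1))            # shape: (num_keys,)
    num_keep = max(1, int(keep_ratio * importance.shape[0]))
    kept = np.argsort(importance)[-num_keep:]           # highest-mass positions
    return np.sort(kept)                                # preserve prompt order

# Toy usage: 4 heads, 8 queries, a 64-token prompt; keep 25% of tokens.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8, 64))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
print(select_prefill_tokens(attn, keep_ratio=0.25))
```

Returning the kept indices in prompt order matters: the target model's prefill over the retained tokens still uses their original positions, so positional encodings stay consistent with the full prompt.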