\documentclass[11pt,a4paper]{article}
% === Packages ===
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{amsmath,amssymb}
\usepackage{graphicx}
\usepackage{booktabs}
\usepackage{hyperref}
\usepackage{xcolor}
\usepackage{algorithm}
\usepackage{algpseudocode}
\usepackage{listings}
\usepackage[margin=1in]{geometry}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{natbib}
% === Macros ===
\newcommand{\tbd}[1]{\textcolor{red}{\textbf{[TBD: #1]}}}
\newcommand{\specprefill}{\textsc{SpecPrefill}}
\newcommand{\ttft}{\text{TTFT}}
\lstset{
  basicstyle=\ttfamily\small,
  keywordstyle=\color{blue},
  commentstyle=\color{gray},
  breaklines=true,
  frame=single,
  numbers=left,
  numberstyle=\tiny\color{gray},
}
% === Title ===
\title{\specprefill{} on Unified Memory:\\Cross-Architecture Sparse Prefill for\\Large Language Models on Apple Silicon}
\author{
  \texttt{github.com/Thump604}
}
\date{March 2026}
\begin{document}
\maketitle
% ============================================================
\begin{abstract}
% ============================================================
Long-context prefill is the dominant latency bottleneck for local LLM inference: a 64K-token prompt on Qwen3.5-122B (MoE, 10B active parameters) takes 7 minutes before the first token appears.
\specprefill{}---attention-based sparse prefill using a draft model---was designed for GPU clusters with discrete memory.
We port it to Apple Silicon's unified memory architecture and generalize it across three model families: transformer Mixture-of-Experts (Qwen3.5), Mamba-2/attention hybrid (Nemotron-H), and sliding-window dense (GPT-OSS\footnote{GPT-OSS refers to a publicly available model by its open-source project designation.}).
On M2~Ultra (128\,GB unified memory), \specprefill{} with a 2B draft model (Qwen3.5-2B, 1.4\,GB, 4-bit) reduces TTFT by $3.71$--$5.45\times$ across 8K--128K tokens on Qwen3.5-122B, cutting 128K prefill from 19.3~minutes to 3.5~minutes.
Composed with system prompt KV caching, end-to-end speedup reaches $5.6\times$ on a 73K-token production workload.
We also achieve $2.10$--$2.19\times$ on Nemotron-H~120B across 8K--64K tokens.
Unified memory eliminates PCIe transfer overhead, making the draft-to-target FLOP ratio the dominant predictor of speedup. We formalize and validate this relationship across six draft/target configurations.
Under adversarial evaluation (0/16 regressions at 20\% keep), LLM-judge, and perplexity analysis, we observe no quality degradation at our recommended operating point.
Our implementation handles architecture-specific challenges including gated queries with per-head normalization (Qwen3.5), SSM-interleaved attention layers without positional encoding (Nemotron-H), and sliding-window cache preservation (GPT-OSS), deployed in a production serving stack with per-request API control.
\end{abstract}
% ============================================================
\section{Introduction}
\label{sec:intro}
% ============================================================
\subsection{The TTFT Problem in Local Inference}
Time-to-first-token (TTFT) is the dominant user-facing latency for large language models serving long-context requests.
On commodity hardware---a Mac Studio with Apple M2~Ultra and 128\,GB unified memory---prefilling a 64K-token prompt through Qwen3.5-122B-A10B (a 122-billion parameter Mixture-of-Experts model with 10 billion active parameters per token) requires \textbf{418~seconds}, nearly 7~minutes before the first output token appears.
Even at 16K tokens, the wait is 92~seconds.
This latency is not a bandwidth problem.
MLX prefill on Apple Silicon is FLOP-limited: the Metal GPU is compute-bound while processing each token through the model's forward pass~\cite{mlx}.
Reducing the number of tokens processed during prefill therefore yields near-linear TTFT improvement.
In local inference, the cost of long TTFT is measured not in dollars-per-token but in user time.
A 16K-token context (typical for an IDE coding assistant with tool definitions and file contents) means 92 seconds of waiting before the first response token.
A long creative-writing session or research conversation that accumulates 64K tokens of history means 7 minutes per response.
On a serialized single-request engine, every second of prefill also delays all queued requests.
\noindent With \specprefill{}, 128K prefill on the 122B model drops from 19.3~minutes to 3.5~minutes, making long-context requests practical for interactive use.
\subsection{Why Unified Memory Changes the Calculus}
\specprefill{}~\cite{specprefill} addresses TTFT by using a small draft model to identify which prompt tokens are most important via attention scoring, then sparse-prefilling only the selected subset into the target model.
The original formulation assumes a discrete-memory GPU architecture where the draft model either (a) shares GPU VRAM with the target, reducing KV cache headroom and effective context length, or (b) runs on CPU or a separate GPU, incurring CPU$\leftrightarrow$GPU transfer latency for importance scores that grows with prompt length.
On Apple Silicon's unified memory architecture, neither penalty applies.
Draft and target models share the same physical address space---the draft's weights (${\sim}$1.4\,GB for a 4-bit 2B model) are simply additional allocations in the same memory pool as the target's ${\sim}$79\,GB.
Scoring requires zero data movement.
This simplifies the cost equation.
Let $C_t$ denote the target model's prefill cost (in FLOPs), $C_d$ the draft model's scoring cost (full prefill plus lookahead steps), and $k$ the fraction of tokens retained.
On unified memory:
\begin{equation}
\label{eq:speedup}
\text{Speedup} = \frac{C_t}{C_d + k \cdot C_t}
\end{equation}
When $C_d \ll C_t$---as with a 2B draft scoring for a 122B MoE target, where the FLOP ratio is approximately $50\times$---speedup approaches $1/k$.
At $k = 0.2$, this predicts up to $4.5\times$; we measure $4.11\times$ at 16K tokens (5 trials), with the gap attributable to overhead from chunk selection, RoPE patching, and memory management.
We term this the \textbf{ratio thesis}: on unified memory where $T = 0$, the draft-to-target FLOP ratio $r$ is the dominant predictor of \specprefill{} benefit, modulated by architecture-dependent overhead $\epsilon$ (Equation~\ref{eq:speedup_ratio}).
Section~\ref{sec:ratio} validates this across six draft/target configurations spanning an $8.5\times$ range of FLOP ratios.
\subsection{Contributions}
\begin{enumerate}
\item \textbf{First implementation on unified memory hardware.}
We implement \specprefill{} on Apple Silicon via the MLX framework, demonstrating that zero-copy scoring shifts the viability threshold, making the technique effective even at moderate prompt lengths (8K tokens).
\item \textbf{Cross-architecture generalization.}
We extend \specprefill{} beyond standard transformers to Mamba-2/attention hybrids (Nemotron-H, where only 8 of 88 target layers have attention) and sliding-window models (GPT-OSS, with RotatingKVCache).
Auto-detecting query extraction handles gated attention with per-head normalization (Qwen3.5), content-based attention without positional encoding (Nemotron-H), and YarnRoPE with sliding-window cache preservation (GPT-OSS).
\item \textbf{Production system integration.}
We show per-request API control with graceful fallback, coexistence with Multi-Token Prediction (MTP) speculative decoding in a three-phase ``Speculative Stack,'' and demonstrate that the FLOP ratio thesis extends to draft model selection---smaller drafts with lower $r$ yield higher speedup.
\end{enumerate}
This is a \textbf{systems paper with algorithmic adaptations}, not a claim of a new algorithm.
The core \specprefill{} idea is due to~\citet{specprefill}; our contributions are making it work across architectures on new hardware with real deployment.
% ============================================================
\section{Background}
\label{sec:background}
% ============================================================
\subsection{SpecPrefill}
\citet{specprefill} observe that during autoregressive generation, the model attends heavily to a small fraction of prompt tokens.
If these important tokens can be identified cheaply, the full prompt need not be prefilled into the target model.
Their method uses a small draft model to score token importance via attention weights, selects the top-$k$\% of tokens (in non-overlapping chunks for spatial locality), and sparse-prefills only the selected subset.
Sparse prefill is possible because Rotary Position Embeddings (RoPE)~\cite{rope} encode \emph{relative} position.
The inner product $Q_m \cdot K_p^T$ depends only on the relative distance $(m - p)$ encoded via rotation angles.
If selected tokens are stored in the KV cache with their \emph{original} position angles, they interact with future decode queries at correct relative distances regardless of gaps.
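In the standard RoPE formulation, writing $R_{\Theta,m}$ for the block-diagonal rotation applied at position $m$, the rotations satisfy $R_{\Theta,m}^{\top} R_{\Theta,p} = R_{\Theta,\,p-m}$, so
\begin{equation*}
\big(R_{\Theta,m}\,q\big)^{\!\top}\big(R_{\Theta,p}\,k\big) \;=\; q^{\top} R_{\Theta,\,p-m}\,k .
\end{equation*}
The score depends only on $p - m$: a key cached at its original position $p$ combines correctly with a decode-time query at any later position $m$, even if the intervening positions were never prefilled.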
\subsection{MLX and Apple Silicon Unified Memory}
MLX~\cite{mlx} is Apple's machine learning framework designed for Apple Silicon.
Apple Silicon uses \emph{unified memory}: CPU and GPU share the same physical DRAM through a common memory controller.
There is no PCIe bus, no \texttt{cudaMemcpy}, no distinct VRAM allocation.
Tensors created by any compute unit are immediately accessible to any other.
MLX exploits this with lazy evaluation and reference-counted memory management.
Metal compute shaders execute matrix operations on the GPU.
In practice, a draft model's weights are additional allocations in the same memory pool, accessible at full bandwidth from any compute unit without copying.
\subsection{Target Architectures}
We evaluate \specprefill{} on three architecturally distinct model families, establishing that generalization requires non-trivial adaptations:
\begin{table}[h]
\centering
\small
\begin{tabular}{@{}lllll@{}}
\toprule
\textbf{Model} & \textbf{Architecture} & \textbf{Attention} & \textbf{Position Enc.} & \textbf{Cache} \\
\midrule
Qwen3.5-122B & MoE (10B active) & Gated + q\_norm & Standard RoPE & Standard KV \\
Nemotron-H 120B & Mamba-2 + Attn + MoE & Standard (8/88 layers) & None (SSM) & Compacted \\
GPT-OSS 120B & Dense + sliding window & Standard & YarnRoPE & RotatingKV \\
\bottomrule
\end{tabular}
\caption{Target model architectures. Each requires different query extraction, position handling, and cache management in \specprefill{}.}
\label{tab:architectures}
\end{table}
\textbf{Qwen3.5} uses gated attention where \texttt{q\_proj} outputs $2\times$ the expected width (query concatenated with a gate), requiring a split before per-head RMSNorm (\texttt{q\_norm}) and RoPE application.
\textbf{Nemotron-H} is a hybrid architecture with 40 Mamba-2 SSM layers, 8 full-attention layers, and 40 MoE feed-forward layers.
Positional information is encoded entirely in the SSM state---the attention layers have no RoPE.
Only the attention layers produce Q/K scores usable for importance scoring.
\textbf{GPT-OSS} uses YarnRoPE~\cite{yarn} with a sliding-window attention pattern where alternating layers use \texttt{RotatingKVCache} retaining only the last 128 tokens.
% ============================================================
\section{Method}
\label{sec:method}
% ============================================================
\subsection{Token Importance Scoring}
\label{sec:scoring}
Given a prompt of $M$ tokens, importance scoring proceeds in three phases:
\begin{enumerate}
\item \textbf{Draft prefill.} The full prompt is prefilled into a small same-tokenizer draft model (e.g., Qwen3.5-2B at 4-bit quantization, 1.4\,GB) in chunks of 2{,}048 tokens, populating the draft's KV cache. The FLOP ratio thesis (Section~\ref{sec:cost_model}) predicts that minimizing the draft's active parameter count maximizes speedup, favoring the smallest available compatible model.
\item \textbf{Lookahead decode with attention capture.} Eight autoregressive decode steps are executed with \texttt{AttentionCapture} wrappers installed on each attention layer. These wrappers intercept post-RoPE query vectors via architecture-specific extractors (Section~\ref{sec:extractors}), appending them to a capture buffer before delegating to the original attention computation.
\item \textbf{Importance computation.} For each attention layer $\ell$ with captured queries $\{q^{(t)}\}_{t=1}^{8}$ and prompt keys $K_\ell \in \mathbb{R}^{h \times M \times d}$:
\begin{align}
S_\ell^{(t)} &= \text{softmax}\!\left(\frac{q^{(t)} K_\ell^T}{\sqrt{d}}\right) \in \mathbb{R}^{h \times M} \\
\bar{S}_\ell^{(t)} &= \text{AvgPool1D}(S_\ell^{(t)},\; \text{kernel}=13)
\end{align}
Scores are aggregated as $\max$ across layers and heads, then $\text{mean}$ across lookahead steps, yielding importance $I \in \mathbb{R}^M$.
Average pooling with kernel 13 smooths the signal, preventing isolated-token artifacts.
For GQA models, keys are expanded via \texttt{repeat} to match the query head count before scoring.
\end{enumerate}
Layers whose cache does not span the full prompt (e.g., sliding-window \texttt{RotatingKVCache} layers caching only 128 tokens) are skipped during importance computation.
After scoring, the draft KV cache is explicitly freed and \texttt{mx.clear\_cache()} is called, reclaiming memory before target prefill begins.
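For concreteness, a compact MLX-style sketch of the importance computation (phase 3) follows. Array layouts and names such as \texttt{captured\_queries} and \texttt{prompt\_keys} are illustrative rather than the exact identifiers in our implementation, and the pooling loop is a naive stand-in for AvgPool1D.
\begin{lstlisting}[language=Python]
import mlx.core as mx

def token_importance(captured_queries, prompt_keys, kernel=13):
    """Aggregate importance over layers, heads, and lookahead steps (sketch).

    captured_queries: per-layer arrays of shape [steps, heads, d] (post-RoPE)
    prompt_keys:      per-layer arrays of shape [kv_heads, M, d]
    Returns an importance vector of shape [M].
    """
    per_layer = []
    for q, k in zip(captured_queries, prompt_keys):
        steps, heads, d = q.shape
        if k.shape[0] != heads:                  # GQA: expand keys to query heads
            k = mx.repeat(k, heads // k.shape[0], axis=0)
        logits = mx.einsum("shd,hmd->shm", q, k) / (d ** 0.5)
        scores = mx.softmax(logits, axis=-1)     # [steps, heads, M]
        pad = kernel // 2                        # naive AvgPool1D, kernel 13
        padded = mx.pad(scores, [(0, 0), (0, 0), (pad, pad)])
        M = scores.shape[-1]
        scores = mx.stack([padded[..., i:i + M] for i in range(kernel)]).mean(axis=0)
        per_layer.append(scores)
    stacked = mx.stack(per_layer)                # [layers, steps, heads, M]
    return stacked.max(axis=[0, 2]).mean(axis=0) # max over layers/heads, mean over steps
\end{lstlisting}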
\subsubsection{Architecture-Specific Query Extraction}
\label{sec:extractors}
The query extractor is auto-detected at runtime based on the attention module's attributes:
\begin{itemize}
\item \texttt{q\_norm} present $\rightarrow$ Qwen3.5 path
\item No \texttt{rope} attribute $\rightarrow$ Nemotron-H path
\item Otherwise $\rightarrow$ Standard (Llama/GPT-OSS) path
\end{itemize}
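A sketch of this dispatch, assuming the draft's attention modules expose the attribute names above (the extractor function names themselves are illustrative):
\begin{lstlisting}[language=Python]
def select_query_extractor(attn):
    """Choose a post-RoPE query extractor from the module's attributes (sketch)."""
    if hasattr(attn, "q_norm"):          # gated queries with per-head RMSNorm
        return extract_qwen35_gated_query
    if not hasattr(attn, "rope"):        # no positional encoding (SSM hybrid)
        return extract_nemotron_h_query
    return extract_standard_query        # Llama / GPT-OSS style
\end{lstlisting}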
\paragraph{Qwen3.5: Gated queries with per-head normalization.}
The \texttt{q\_proj} output has $2\times$ the expected width, containing both query and gate tensors concatenated along the head dimension.
The output is reshaped to $(B, L, n_\text{heads}, 2 \cdot d_\text{head})$ and split at the midpoint.
After splitting, \texttt{q\_norm} (a per-head RMSNorm) is applied before RoPE rotation.
Treating this as a standard projection produces silent shape errors or incorrect scoring.
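A minimal sketch of the Qwen3.5 extraction path under these assumptions; the attribute names \texttt{q\_proj}, \texttt{q\_norm}, \texttt{rope}, \texttt{n\_heads}, and \texttt{head\_dim} follow the mlx-lm convention and may differ in a given checkpoint's module definitions:
\begin{lstlisting}[language=Python]
import mlx.core as mx

def extract_qwen35_gated_query(attn, hidden_states, offset):
    """Post-RoPE queries for Qwen3.5-style gated attention (sketch)."""
    B, L, _ = hidden_states.shape
    qg = attn.q_proj(hidden_states)                      # width = 2 * n_heads * head_dim
    qg = qg.reshape(B, L, attn.n_heads, 2 * attn.head_dim)
    q, _gate = mx.split(qg, 2, axis=-1)                  # discard the gate for scoring
    q = attn.q_norm(q)                                   # per-head RMSNorm before RoPE
    q = q.transpose(0, 2, 1, 3)                          # [B, heads, L, head_dim]
    return attn.rope(q, offset=offset)
\end{lstlisting}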
\paragraph{Nemotron-H: Heterogeneous layer navigation.}
Of 88 total layers (40 Mamba-2 + 8 attention + 40 MoE), only the 8 attention layers produce Q/K scores.
\texttt{\_find\_attention\_layers} navigates the heterogeneous layer structure by inspecting \texttt{block\_type} annotations (\texttt{M} for Mamba, \texttt{*} for attention, \texttt{-} for MLP, \texttt{E} for MoE) and locating modules with a \texttt{mixer} attribute rather than the standard \texttt{self\_attn}.
\texttt{\_build\_layer\_to\_cache\_map} constructs a compacted index because only Mamba and attention layers have cache entries.
These attention layers have \textbf{no RoPE}---positional information comes entirely from the Mamba-2 SSM state.
Queries are used as-is for content-based scoring.
This is a non-trivial engineering challenge the original paper did not address: the draft model (Nano~4B) has 42 heterogeneous layers with only 4 attention layers among 21 Mamba and 17 MLP layers, all in a model where positional information is entirely implicit.
\paragraph{GPT-OSS: RotatingKVCache awareness.}
Standard query extraction applies, but importance computation must skip sliding-window layers whose \texttt{RotatingKVCache} contains only the last 128 tokens.
Without this check, importance scores would be computed against a truncated key set, producing misleading rankings.
Correct handling requires three cache-aware adaptations:
(1)~layer-level cache introspection to distinguish full-context from sliding-window layers;
(2)~skipping importance computation for layers whose cache does not span the full prompt;
(3)~force-preserving the last \texttt{max\_size} positions during sparse selection to ensure sliding-window layers have valid recent context at decode time.
\subsection{Chunk Selection}
Tokens are grouped into non-overlapping chunks of $C = 32$ tokens.
Each chunk is scored by the mean importance of its constituent tokens.
The top $\lceil k \cdot M/C \rceil$ chunks by score are selected and their token indices returned in sorted order.
This preserves spatial locality---coherent phrases are kept or dropped as units.
At $k = 0.2$ (our optimal configuration for Qwen3.5-122B), 80\% of prefill computation is eliminated.
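A minimal, pure-Python sketch of the selection step (chunk size and keep fraction as above; the helper name is illustrative):
\begin{lstlisting}[language=Python]
import math

def select_token_indices(importance, keep_pct=0.2, chunk_size=32):
    """Score fixed-size chunks by mean importance and keep the top fraction (sketch)."""
    M = len(importance)
    n_chunks = math.ceil(M / chunk_size)
    scored = []
    for c in range(n_chunks):
        lo, hi = c * chunk_size, min((c + 1) * chunk_size, M)
        scored.append((sum(importance[lo:hi]) / (hi - lo), c))
    n_keep = math.ceil(keep_pct * n_chunks)
    kept = sorted(scored, reverse=True)[:n_keep]   # top chunks by mean score
    indices = []
    for _, c in kept:
        indices.extend(range(c * chunk_size, min((c + 1) * chunk_size, M)))
    return sorted(indices)                         # original token order preserved
\end{lstlisting}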
\subsection{Sparse Prefill with Position-Mapped RoPE}
\label{sec:sparse_prefill}
The correctness of sparse prefill depends on maintaining correct RoPE angles despite non-contiguous token positions.
If position angles are not re-mapped, the model perceives selected tokens as adjacent, destroying long-range coherence.
\paragraph{Step 1: Sliding-window tail preservation.}
For architectures using \texttt{RotatingKVCache} (GPT-OSS), the last \texttt{max\_size} positions from the prompt are force-merged into the selection set, ensuring sliding-window attention layers have valid recent context at decode time.
This is auto-detected via cache type inspection.
\paragraph{Step 2: Position mapping during prefill.}
Each attention layer's \texttt{nn.RoPE} is replaced with \texttt{PositionMappedRoPE}, which maps contiguous batch positions $[0, 1, \ldots, N{-}1]$ to the original absolute positions $[p_0, p_1, \ldots, p_{N-1}]$ of the selected tokens.
For models with custom RoPE variants (YarnRoPE with pre-computed frequencies, SuScaled RoPE with \texttt{mscale}), the replacement module captures and replays the original frequency tensors and scale factors.
\paragraph{Step 3: Chunked forward pass.}
Selected tokens are fed through the target model in chunks of \texttt{step\_size} (default 2{,}048), populating the KV cache with entries at correct absolute positions.
\paragraph{Step 4: Decode offset adjustment.}
After sparse prefill of $N$ selected tokens from $M$ total prompt tokens, the cache offset is $N$ but decode must start at position $M$.
\texttt{OffsetAdjustedRoPE} wraps the original RoPE module and adds adjustment $\Delta = M - N$ to all offset calls:
\begin{equation}
\text{RoPE\_position}(i) = N + i + (M - N) = M + i \quad \checkmark
\end{equation}
\paragraph{Step 5: Cleanup.}
After generation completes, \texttt{cleanup\_rope()} traverses all attention layers and unwraps patched RoPE modules back to their originals, ensuring the model is unmodified for subsequent requests.
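A sketch of the two wrappers, assuming the mlx-lm convention that RoPE modules are invoked as \texttt{rope(x, offset=...)} on arrays shaped $[B, \text{heads}, L, d]$; the class names match the description above, but the bodies are illustrative (the per-token loop in particular is written for clarity, not speed):
\begin{lstlisting}[language=Python]
import mlx.core as mx
import mlx.nn as nn

class PositionMappedRoPE(nn.Module):
    """Rotate each selected token at its original absolute position (sketch)."""
    def __init__(self, rope, positions):
        super().__init__()
        self.rope = rope              # original module, keeps Yarn/SuScaled parameters
        self.positions = positions    # absolute positions p_0 .. p_{N-1}, sorted

    def __call__(self, x, offset=0):
        L = x.shape[-2]
        chunk_positions = self.positions[offset:offset + L]
        parts = [self.rope(x[..., i:i + 1, :], offset=int(p))
                 for i, p in enumerate(chunk_positions)]
        return mx.concatenate(parts, axis=-2)

class OffsetAdjustedRoPE(nn.Module):
    """Shift decode-time offsets by M - N so generation starts at position M (sketch)."""
    def __init__(self, rope, adjustment):
        super().__init__()
        self.rope = rope
        self.adjustment = adjustment  # M - N

    def __call__(self, x, offset=0):
        return self.rope(x, offset=offset + self.adjustment)
\end{lstlisting}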
\paragraph{Nemotron-H (no RoPE).}
Steps 2 and 4 are skipped entirely---Nemotron-H's attention layers have no RoPE, deriving positional information from the Mamba-2 SSM state instead.
The SSM layers are updated \emph{only} on retained tokens; skipped tokens do not contribute to state evolution.
Concretely, the Mamba-2 recurrence $h_t = A h_{t-1} + B x_t$ advances only at selected positions, so the hidden state after processing $N$ selected tokens diverges from the state after processing all $M$ tokens.
In practice, the retained tokens---selected by attention importance---appear sufficient to preserve global semantics for long-context tasks: our server-side benchmarks show $2.10$--$2.19\times$ speedup across 8K--64K tokens with no observed quality degradation (Section~\ref{sec:quality}).
Quantifying the SSM state drift (e.g., the L2 distance between full and sparse hidden states) is left to future work.
We attribute the absence of degradation to the hybrid architecture: the 8 full-attention layers retain the most important $N$ tokens with correct content-based scores, providing long-range context that compensates for gaps in the SSM's recurrent state.
This is an empirical result, not a theoretical guarantee; extending \specprefill{} to pure SSM architectures would require additional analysis.
\subsection{Unified Memory Cost Model}
\label{sec:cost_model}
We formalize the relationship between hardware architecture and \specprefill{} efficiency.
On discrete-GPU systems, the cost of \specprefill{} includes a data transfer term $T$ (PCIe bandwidth, memory copies between draft and target devices):
\begin{equation}
\label{eq:speedup_gpu}
\text{Speedup}_{\text{GPU}} = \frac{C_t}{C_d + T + k \cdot C_t}
\end{equation}
On unified memory, $T = 0$, simplifying to Equation~\ref{eq:speedup}.
The speedup is determined by the empirical wall-clock cost ratio $r = C_d / C_t$ and keep percentage $k$:
\begin{equation}
\label{eq:speedup_ratio}
\text{Speedup} = \frac{1}{r + k + \epsilon}
\end{equation}
where $\epsilon$ captures fixed overhead from chunk selection, RoPE patching, memory management (\texttt{mx.clear\_cache()}), and architecture-specific scoring costs.
In practice, $\epsilon$ ranges from 0.03 (Qwen3.5, low overhead) to 0.30 (Nemotron-H, where navigating 88 heterogeneous layers adds cost).
We emphasize that this model is \emph{descriptive}: it correctly predicts the \emph{ranking} of configurations by speedup (Table~\ref{tab:cost_model}) but does not predict exact magnitudes, as $\epsilon$ varies by architecture.
The value of the model is in draft selection---given two candidate drafts, the one with lower $r$ will yield higher speedup---not in predicting absolute speedup from first principles.
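A small sketch of how the model guides draft selection; the $r$ and $\epsilon$ values below are the empirical estimates from this section, not constants of the method:
\begin{lstlisting}[language=Python]
def predicted_speedup(r, k=0.2, eps=0.0):
    """Speedup ~ 1 / (r + k + eps) on unified memory (T = 0); cf. eq:speedup_ratio."""
    return 1.0 / (r + k + eps)

# Choosing between two candidate drafts for Qwen3.5-122B (measured r, eps ~ 0.03):
candidates = {"Qwen3.5-2B": 0.02, "Qwen3.5-4B": 0.03}
best = min(candidates, key=candidates.get)   # lower r -> higher predicted speedup
print(best, round(predicted_speedup(candidates[best], eps=0.03), 2))
# -> Qwen3.5-2B 4.0
\end{lstlisting}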
\textbf{Boundary conditions.}
Equation~\ref{eq:speedup_ratio} assumes: (1)~sequential, single-request prefill (no batching); (2)~a FLOP-dominated regime where compute, not memory bandwidth, is the bottleneck; and (3)~negligible KV cache fragmentation cost.
In bandwidth-bound regimes or under heavy concurrent batching, $\epsilon$ may dominate and reduce the model's predictive accuracy.
Note also that $r$ is not simply the ratio of active parameters.
On unified memory, all expert weights reside in the same address space---there is no discrete VRAM to overflow into.
MoE forward passes incur costs beyond active-parameter FLOPs: router gating across all experts, weight loading for selected experts from unified memory, and memory bandwidth pressure from the full parameter footprint.
As a result, the empirical wall-clock cost of a MoE forward pass on unified memory scales closer to \emph{total} parameters than to active parameters alone.
A small 2B draft scoring against a 122B MoE target (10B active but 122B total) therefore achieves $r \approx 0.02$, reflecting the full-parameter cost disparity rather than the ${\sim}5\times$ active-parameter ratio.
This is a finding specific to unified memory systems: on discrete GPUs where expert weights page between CPU and GPU memory, the effective cost ratio may differ.
For MoE models where total parameters far exceed active parameters, a small dense draft has $r \ll 1$.
The model also assumes $C_t$ scales linearly with token count; at extreme context lengths (${\geq}$128K), the $O(n^2)$ attention component causes superlinear growth in $C_t$, and measured speedup can exceed the linear prediction (Table~\ref{tab:cost_model}).
Table~\ref{tab:cost_model} compares predicted ($\epsilon = 0$) and measured speedups, with the gap $G = \text{Predicted} / \text{Measured}$ quantifying per-configuration overhead:
\begin{table}[h]
\centering
\small
\begin{tabular}{@{}lcccc@{}}
\toprule
\textbf{Configuration} & $\boldsymbol{r = C_d/C_t}$ & \textbf{Predicted} ($k{=}0.2, \epsilon{=}0$) & \textbf{Measured} & $\boldsymbol{G}$ \\
\midrule
4B / 122B MoE (10B active) & $\sim$0.03 & 4.3$\times$ & 2.90$\times$ & 1.5$\times$ \\
2B / 122B MoE (10B active) & $\sim$0.02 & 4.5$\times$ & 4.11$\times$ & 1.1$\times$ \\
2B / 122B MoE (10B active)$^\ddagger$ & $\sim$0.02 & 4.5$\times$ & 5.45$\times$ & 0.8$\times$ \\
Qwen-4B / 35B MoE (3B active) & $\sim$0.10 & 3.3$\times$ & 1.86$\times$ & 1.8$\times$ \\
4B / 120B Nemotron-H (12B active) & $\sim$0.03 & 4.3$\times$ & 2.17$\times$ & 2.0$\times$ \\
20B / 120B GPT-OSS (120B active) & $\sim$0.17 & 2.7$\times$ & 1.28$\times$ & 2.1$\times$ \\
\bottomrule
\multicolumn{5}{l}{\footnotesize $^\ddagger$Measured at 128K tokens. Measured speedup exceeds prediction due to $O(n^2)$ attention.}
\end{tabular}
\caption{Cost model predictions ($\epsilon = 0$) vs.\ measured speedups at $k = 0.2$, 16K tokens unless noted. $G = \text{Predicted} / \text{Measured}$; values $> 1$ indicate overhead exceeding the model, $< 1$ indicates superlinear baseline growth benefiting \specprefill{} beyond the linear prediction.}
\label{tab:cost_model}
\end{table}
The $G$ values reveal architecture-dependent overhead.
Nemotron-H ($G = 2.0$) has the highest $\epsilon$: the target has 88 heterogeneous layers (40~Mamba + 8~attention + 40~MoE) and the draft has 42 (21~Mamba + 4~attention + 17~MLP), requiring architecture-aware layer navigation for both scoring and sparse prefill.
GPT-OSS ($G = 2.1$) combines high $r$ (0.17, the 20B draft dominates the denominator) with sliding-window cache management overhead.
The Qwen3.5-122B configurations ($G = 1.1$--$1.5$, 5 trials) have the lowest overhead, benefiting from uniform architecture and favorable MoE FLOP ratios; the 35B target ($G = 1.8$) sits higher due to its smaller active-parameter count (3B) narrowing the draft-to-target gap.
At 128K, $G < 1$ because the superlinear baseline growth (Section~\ref{sec:experiments}) is not captured by the linear cost model.
\paragraph{Comparison with GPU results.}
The original \specprefill{} on discrete GPUs~\cite{specprefill} reports up to $7.66\times$ TTFT reduction on Llama-3.1-405B-FP8, benefiting from batch-level parallelism and GPU-optimized attention kernels not available on Apple Silicon.
Our lower absolute speedups ($3.71$--$5.45\times$ on 122B) reflect the single-request, unbatched setting and MLX's Metal compute pipeline.
The unified memory contribution is not higher \emph{absolute} speedup but a \emph{simpler cost model}: eliminating $T$ makes the FLOP ratio the sole dominant predictor, enabling principled draft model selection without profiling transfer overhead.
% ============================================================
\section{System Integration}
\label{sec:system}
% ============================================================
\subsection{Composition with System Prompt KV Caching}
\label{sec:composition}
In production agentic workflows, the system prompt (tool definitions, instructions, context documents) often constitutes 10--20K tokens that remain identical across requests.
System prompt KV caching snapshots this prefix on the first request and restores it for subsequent requests, eliminating redundant prefill.
These optimizations operate on orthogonal axes: KV caching eliminates \emph{prefix} cost (identical system prompt); \specprefill{} reduces \emph{suffix} cost (variable user context).
This is why they compose multiplicatively rather than additively.
When both techniques are active, \specprefill{} operates on the \emph{suffix} only (user and assistant turns), receiving a \texttt{position\_offset} equal to the system token count.
The scoring phase evaluates only suffix tokens; sparse prefill maps positions relative to the offset so selected tokens land at correct absolute positions in the full context.
The threshold check uses suffix length, not full prompt length, ensuring \specprefill{} activates only when the suffix itself is long enough to benefit.
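A sketch of the composition logic under these assumptions; the dictionary keys mirror the description above, while the actual control flow in the serving engine is more involved:
\begin{lstlisting}[language=Python]
def plan_prefill(prompt_len, system_len, threshold=8192, keep_pct=0.2):
    """Compose a restored system-prompt KV cache with SpecPrefill (sketch).

    The cached system prefix is skipped entirely; scoring and sparse prefill see
    only the suffix, with positions shifted by the system token count.
    """
    suffix_len = prompt_len - system_len
    if suffix_len < threshold:                 # threshold applies to the suffix only
        return {"specprefill": False, "position_offset": system_len}
    return {
        "specprefill": True,
        "position_offset": system_len,         # selected tokens keep absolute positions
        "tokens_to_prefill": int(keep_pct * suffix_len),
    }
\end{lstlisting}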
\begin{table}[h]
\centering
\small
\begin{tabular}{@{}lcc@{}}
\toprule
\textbf{Configuration} & \textbf{TTFT (s)} & \textbf{Speedup} \\
\midrule
Baseline (cold, full prefill) & 517.5 & 1.0$\times$ \\
System KV cache only & 417.1 & 1.24$\times$ \\
\textbf{Combined (SysKV + SP 20\%)} & \textbf{92.5} & \textbf{5.59$\times$} \\
Combined (repeat) & 92.4 & 5.60$\times$ \\
\bottomrule
\end{tabular}
\caption{Composition of system prompt KV caching and \specprefill{} on Qwen3.5-122B, 2B draft, M2~Ultra 128\,GB. The prompt (73K tokens) is a realistic agentic workload: ${\sim}$10K system prefix (tool definitions, instructions) + ${\sim}$63K user context. System KV cache saves the prefix; \specprefill{} sparse-prefills the suffix. The combined $5.6\times$ speedup exceeds either technique alone.}
\label{tab:composition}
\end{table}
\subsection{The Speculative Stack}
\specprefill{} is not a standalone optimization but one phase of a three-phase speculative pipeline:
\begin{enumerate}
\item \textbf{Score} (\specprefill{}): Draft model (2B) identifies important tokens via attention scoring.
\item \textbf{Sparse Prefill}: Target model (122B) processes selected token chunks with position-mapped RoPE.
\item \textbf{MTP Decode}: Target model with Multi-Token Prediction heads generates output tokens speculatively (Qwen3.5 only; Nemotron-H and GPT-OSS skip this phase).
\end{enumerate}
The draft model used in Phase~1 is architecturally separate from MTP's prediction heads (which are part of the target model's weights).
The draft KV cache is freed after Phase~1 (\texttt{mx.clear\_cache()}), and the draft model's static weights (1.4--3.0\,GB depending on draft size) remain resident for amortized startup cost across requests.
At extreme context lengths (128K+), the draft's transient KV cache size becomes a practical constraint: the 2B draft produces 1.5\,GB at 128K tokens, fitting comfortably alongside the target model's ${\sim}$100\,GB on a 128\,GB system.
\subsection{Per-Request API and Graceful Fallback}
The OpenAI-compatible API accepts per-request overrides:
\begin{lstlisting}[language=Python]
extra_body = {
    "specprefill": True,          # force enable (bypass threshold)
    "specprefill_keep_pct": 0.15  # override server default
}
\end{lstlisting}
The default threshold (8{,}192 tokens) is enforced only in server-default mode; explicit \texttt{specprefill: true} bypasses it.
Any error during scoring or sparse prefill triggers graceful fallback to full prefill---no request fails due to \specprefill{}.
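For example, a client request through the standard \texttt{openai} Python package might look like the following; the base URL, model name, and context file are deployment-specific placeholders:
\begin{lstlisting}[language=Python]
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
long_prompt = open("context.txt").read()        # e.g. a 16K+ token context

stream = client.chat.completions.create(
    model="qwen3.5-122b",                       # placeholder model id
    messages=[{"role": "user", "content": long_prompt}],
    extra_body={"specprefill": True, "specprefill_keep_pct": 0.2},
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
\end{lstlisting}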
% ============================================================
\section{Experiments}
\label{sec:experiments}
% ============================================================
\subsection{Setup}
\paragraph{Hardware.} Apple M2~Ultra, 128\,GB unified memory, Mac Studio, macOS~26.3.1.
\paragraph{Software.} MLX~0.31.1, vllm-mlx~0.2.6 with patches, Python~3.12.
\paragraph{Sampling.} Qwen3.5 models: $\text{temp}=0.6$, $\text{top\_p}=0.95$, $\text{top\_k}=20$ (official thinking+coding profile~\cite{qwen35}).
Nemotron-H: $\text{temp}=1.0$, $\text{top\_p}=0.95$ (the settings the NVIDIA model card states the model was ``trained and evaluated with'').
GPT-OSS: $\text{temp}=0.6$, $\text{top\_p}=0.95$.
All models run with thinking mode enabled (\texttt{enable\_thinking=true}).
Sampling parameters do not affect TTFT measurement (prefill is deterministic; sampling occurs only during generation).
\paragraph{Methodology.} Five trials per configuration for Qwen3.5-122B (plus one warmup); two trials for remaining configurations. Server-side TTFT measured via streaming OpenAI-compatible API.
Server restarted between configuration changes.
\paragraph{Reproducibility.} Our \specprefill{} implementation is available as patches against vllm-mlx~0.2.6 (PRs~\#175, \#180) and mlx-lm~0.31.2 (PR~\#990), with benchmark scripts (\texttt{bench-specprefill}, \texttt{bench-specprefill-adversarial}, \texttt{bench-specprefill-perplexity}) included in the repository.
The Qwen3.5 models are publicly available quantizations; Nemotron-H and GPT-OSS weights are available from their respective model hubs.
Experiments require an Apple Silicon system with $\geq$128\,GB unified memory for the 122B configuration; the 35B configuration runs on 64\,GB systems.
\begin{table}[h]
\centering
\small
\begin{tabular}{@{}llllllr@{}}
\toprule
\textbf{Model} & \textbf{Role} & \textbf{Architecture} & \textbf{Params} & \textbf{Active} & \textbf{Quant} & \textbf{RAM} \\
\midrule
Qwen3.5-122B-VLM-MTP & Target & MoE & 122B & 10B & 5-bit & 79\,GB \\
Qwen3.5-35B-VLM-MTP & Target & MoE & 35B & 3B & 8-bit & 38\,GB \\
Nemotron-H 120B & Target & Mamba-2 + Attn + MoE & 120B & 12B & 5-bit & 83\,GB \\
GPT-OSS 120B & Target & Dense + sliding window & 120B & 120B & 5-bit & 58\,GB \\
\midrule
Qwen3.5-4B-VLM-MTP & Draft & Dense hybrid & 4B & 4B & 4-bit & 3.0\,GB \\
Qwen3.5-2B-OptiQ & Draft & Hybrid + MoE & 2B & $<$2B & 4-bit & 1.4\,GB \\
Nemotron-H Nano 4B & Draft & Mamba-2 + Attn hybrid & 4B & 4B & 4-bit & 2.1\,GB \\
GPT-OSS-20B & Draft & MoE & 20B & 3.6B & 4-bit & 10\,GB \\
\bottomrule
\end{tabular}
\caption{Models evaluated. \specprefill{} requires draft and target to share the same tokenizer. Qwen3.5 drafts (2B, 4B) serve all Qwen3.5 targets (248K vocabulary); the 2B is the primary draft model. Nemotron-H Nano~4B serves Nemotron-H~120B (131K vocabulary). GPT-OSS-20B is the smallest available same-family draft for GPT-OSS-120B (201K vocabulary).}
\label{tab:models}
\end{table}
\subsection{TTFT Benchmarks}
\begin{table}[h]
\centering
\small
\begin{tabular}{@{}llccccc@{}}
\toprule
\textbf{Model} & \textbf{Draft} & \textbf{8K} & \textbf{16K} & \textbf{32K} & \textbf{64K} & \textbf{128K} \\
\midrule
Qwen3.5-122B (MoE, 10B) & 4B$^\ddagger$ & 2.79$\times$ & 2.90$\times$ & ---$^\dagger$ & ---$^\dagger$ & ---$^\dagger$ \\
Qwen3.5-122B (MoE, 10B) & 2B & 3.71$\times$ & 4.11$\times$ & 4.23$\times$ & 4.50$\times$ & 5.45$\times$ \\
Qwen3.5-35B (MoE, 3B) & 4B & 1.81$\times$ & 1.86$\times$ & 1.85$\times$ & 1.84$\times$ & --- \\
Nemotron-H 120B (hybrid) & Nano-4B & 2.10$\times$ & 2.17$\times$ & 2.19$\times$ & 2.19$\times$ & --- \\
GPT-OSS 120B (dense) & 20B & 1.24$\times$ & 1.28$\times$ & --- & --- & --- \\
\bottomrule
\multicolumn{7}{l}{\footnotesize $^\dagger$Not measured at this context length.} \\
\multicolumn{7}{l}{\footnotesize $^\ddagger$4B draft includes VLM weights (3.0\,GB); the 2B text-only draft (1.4\,GB) is the primary configuration.}
\end{tabular}
\caption{Mean TTFT speedups at 20\% keep. Speedup generally increases with prompt length as scoring overhead is amortized and $O(n^2)$ attention savings compound. The Qwen3.5-122B rows (2B and 4B drafts) use 5 trials; the remaining rows use 2 trials.}
\label{tab:ttft}
\end{table}
For Qwen3.5-122B with the 2B draft, the absolute TTFT at 64K tokens drops from $417.6 \pm 0.6$\,s to $92.8 \pm 0.8$\,s.
At 128K tokens: $1{,}155.8 \pm 8.5$\,s (19.3~minutes) $\rightarrow$ $212.3 \pm 1.9$\,s (3.5~minutes), a \textbf{5.45$\times$} reduction.
At 8K tokens: $45.0 \pm 0.1$\,s $\rightarrow$ $12.1 \pm 0.03$\,s.
Standard deviations across 5 trials are $<$1\% of the mean at all context lengths, confirming measurement stability.
Speedup increases monotonically with context length, from $3.71\times$ at 8K to $5.45\times$ at 128K, consistent with the amortization of fixed scoring overhead and the superlinear growth of baseline prefill cost.
\paragraph{Nemotron-H: architecture-limited speedup plateau.}
Nemotron-H shows a flat speedup profile ($2.10\times$ at 8K to $2.19\times$ at 64K), in contrast to Qwen3.5's monotonically increasing curve.
This plateau is explained by the hybrid architecture: only 8 of 88 layers are attention, and the cost of the remaining 80 layers (40~Mamba-2 SSM + 40~MoE feed-forward) scales only linearly with token count, with or without \specprefill{}.
The $O(n^2)$ attention component that drives Qwen3.5's compounding speedup at long contexts constitutes only ${\sim}9\%$ of Nemotron-H's total compute, so the quadratic savings are a small fraction of overall prefill cost.
This confirms the architecture-dependent nature of the cost model: \specprefill{} benefit scales with the attention fraction of total computation.
In absolute terms, Nemotron-H 120B TTFT drops from 58\,s to 27\,s at 16K and from 253\,s to 116\,s at 64K.
For Qwen3.5-35B: 41\,s to 22\,s at 16K.
\paragraph{Superlinear scaling at extreme context lengths.}
The 128K baseline (1{,}156\,s) is $2.77\times$ the 64K baseline (418\,s), not the $2\times$ expected from linear scaling.
This superlinear growth arises from the $O(n^2)$ attention component in chunked prefill: each 2{,}048-token chunk attends to all preceding tokens, so cumulative attention FLOPs grow quadratically.
\specprefill{} benefits disproportionately: by reducing the effective sequence length from $N$ to $kN$, it cuts the cumulative attention FLOPs---which scale approximately as $N^2$ under full-prefix chunked prefill---to roughly a $k^2$ fraction of the full-prefill cost.
At $k = 0.2$, this yields a ${\sim}25\times$ reduction in attention computation, not the $5\times$ that linear token-count scaling would suggest.
Draft scoring remains $O(N)$, so its cost grows linearly while the attention savings grow quadratically, explaining why the measured 128K speedup ($5.45\times$) exceeds the linear prediction ($4.5\times$, Table~\ref{tab:cost_model}).
MoE models exhibit stronger gains because sparse prefill reduces both the quadratic attention cost (sequence length) and the number of active expert evaluations---80\% fewer tokens means 80\% fewer expert routing and weight-loading operations.
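In rough terms, ignoring constants and the per-chunk structure, the cumulative attention work behaves as
\begin{equation*}
\underbrace{\sum_{i=1}^{N} i \;\approx\; \frac{N^2}{2}}_{\text{full prefill}}
\qquad\longrightarrow\qquad
\sum_{i=1}^{kN} i \;\approx\; \frac{(kN)^2}{2} \;=\; k^2 \cdot \frac{N^2}{2},
\end{equation*}
so at $k = 0.2$ the attention term falls to ${\sim}4\%$ of its full-prefill value, while the draft's $O(N)$ scoring cost is unchanged by $k$.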
\paragraph{Draft model selection.}
Draft models are selected to maximize FLOP asymmetry while maintaining tokenizer and architectural compatibility.
The selection criteria are: (1)~same tokenizer as the target (required---token IDs are passed directly without translation); (2)~smallest available model in the family (to minimize the FLOP ratio $r$); (3)~presence of attention layers for importance scoring (at least 4 layers suffice per our Nemotron-H validation).
Our primary configuration uses a 2B draft ($r \approx 0.02$, 1.4\,GB); complementary measurements with a 4B draft ($r \approx 0.03$, 3.0\,GB) at 8K--16K confirm the relationship: the lower-ratio 2B achieves $4.11\times$ vs.\ $2.90\times$ at 16K.
The GPT-OSS result is a \textbf{negative result confirming the ratio thesis}: the 20B draft model (the smallest available in the GPT-OSS family) has an unfavorable FLOP ratio of $\sim$0.17, yielding only $1.24$--$1.28\times$ speedup.
This validates that architecture is not the determining factor---the FLOP ratio is.
A hypothetical 4B GPT-OSS draft ($r \approx 0.03$) would be predicted to achieve $\sim$2.5--3$\times$ under our cost model, but no such model exists in the GPT-OSS family (Section~\ref{sec:future}).
\subsection{Draft-to-Target FLOP Ratio Analysis}
\label{sec:ratio}
\begin{figure}[h]
\centering
\includegraphics[width=0.9\textwidth]{figures/ratio-speedup.pdf}
\caption{Measured speedup vs.\ draft-to-target FLOP ratio. The theoretical upper bound (Eq.~\ref{eq:speedup_ratio}) correctly predicts the ranking. Overhead from RoPE patching, chunk selection, and architecture-specific scoring reduces measured values below the theoretical curve.}
\label{fig:ratio}
\end{figure}
The FLOP ratio $r = C_d / C_t$ is the dominant predictor of \specprefill{} benefit on unified memory.
Across our configurations (Table~\ref{tab:cost_model}), $r$ spans from 0.02 (2B/122B MoE) to 0.17 (20B/120B dense), and measured speedup tracks this ratio monotonically.
Complementary 4B draft measurements at 8K--16K ($r \approx 0.03$, $2.79$--$2.90\times$) confirm the ratio relationship: the lower-ratio 2B achieves higher speedup at the same context length.
This relationship distinguishes unified-memory \specprefill{} from GPU-based implementations, where PCIe bandwidth introduces an additional term $T$ (Equation~\ref{eq:speedup_gpu}) that weakens the FLOP-ratio signal.
\subsection{Keep Percentage Ablation}
\begin{table}[h]
\centering
\small
\begin{tabular}{@{}ccccc@{}}
\toprule
\textbf{Keep \%} & \textbf{TTFT (s)} & \textbf{Speedup} & \textbf{Needle} & \textbf{JSON} \\
\midrule
10\% & 23.90 & 3.85$\times$ & PASS & 1/2$^\S$ \\
20\% & 31.30 & 2.94$\times$ & PASS & PASS \\
30\% & 38.88 & 2.37$\times$ & PASS & PASS \\
50\% & 53.78 & 1.71$\times$ & PASS & PASS \\
100\% (baseline) & 91.97 & 1.0$\times$ & PASS & PASS \\
\bottomrule
\multicolumn{5}{l}{\footnotesize $^\S$JSON extraction at 10\% keep: one regression (1 of 3 values wrong) in trial 1, pass in trial 2.} \\
\end{tabular}
\caption{Keep percentage ablation on Qwen3.5-122B at ${\sim}$16K tokens with the 4B draft. ``Needle'' is UUID retrieval at all three depths (10\%, 50\%, 90\%); ``JSON'' is exact value extraction from an 80-record array. The 10\% row shows the quality boundary.}
\label{tab:ablation}
\end{table}
The curve shows a clear knee at ${\sim}$20\%: all quality tests pass while delivering $2.94\times$ speedup (4B draft), and raising $k$ beyond this point erodes speedup because compute cost grows linearly with $k$.
This is our recommended operating point.
At 10\%, speedup increases to $3.85\times$ but structured-data extraction becomes unreliable (1 JSON regression in 2 trials), establishing 10\% as the quality boundary.
At 50\%, the $1.71\times$ speedup is marginal relative to the scoring overhead incurred.
\subsection{Quality Validation}
\label{sec:quality}
Because \specprefill{} retains the highest-scoring tokens by attention importance, the model keeps the tokens it would have attended to most heavily during generation.
We validate this through complementary evaluations, leading with the most concrete and falsifiable tests.
\paragraph{Adversarial tests (primary).}
Eight tests designed to expose sparse-prefill weaknesses: needle-in-a-haystack UUID retrieval at 10\%, 50\%, and 90\% depth (one test per depth), JSON value extraction from an 80-record array, code bug detection, back-reference, mixed-language retrieval, and XML structure extraction.
At 20\% keep: \textbf{0/16 regressions} across 2 trials $\times$ 8 tests.
All needle-in-a-haystack and JSON extraction tests pass under both baseline and \specprefill{}.
\specprefill{} does not drop needles or corrupt structured data retrieval at 20\% keep.
At 10\% keep: 1/16 regressions---a JSON extraction test returned one incorrect value out of three (trial 1 of 2; trial 2 passed).
All needle tests pass at all depths even at 10\%, suggesting that high-importance tokens (which needles represent) are robustly retained; the failure mode at extreme sparsity is degraded recall of \emph{low-salience} structured data.
\paragraph{ROUGE-L (lexical similarity).}
We compare outputs from \specprefill{} (20\% keep) against full-prefill baselines on six real-task prompts (code generation, code review, summarization, reasoning, tutorial writing, tool use), each targeting $\sim$8K actual tokens on Qwen3.5-122B.
To establish a variance floor, we first compare two independent baseline runs against each other:
\begin{center}
\small
\begin{tabular}{lc}
\toprule
\textbf{Comparison} & \textbf{ROUGE-L F1} \\
\midrule
Baseline vs.\ baseline (variance floor) & 0.190 $\pm$ 0.174 \\
\specprefill{} vs.\ baseline & 0.236 \\
\bottomrule
\end{tabular}
\end{center}
The high baseline-vs-baseline variance ($0.190 \pm 0.174$) demonstrates that lexical similarity between outputs is dominated by the model's inherent stochasticity at $\text{temp}=0.6$, not by any effect of \specprefill{}.
The \specprefill{} similarity (0.236) falls within the baseline noise floor, but the comparison is too noisy to support quality conclusions on its own---the adversarial tests above provide the primary quality evidence.
\paragraph{LLM-as-Judge (supporting).}
Blinded A/B evaluation scores coherence, completeness, and accuracy (1--5 scale) for both baseline and \specprefill{} outputs, plus an overall equivalence rating:
\begin{center}
\small
\begin{tabular}{lc}
\toprule
\textbf{Comparison} & \textbf{Avg.\ Equivalence} \\
\midrule
Baseline vs.\ baseline & 3.0 / 5.0 \\
\specprefill{} vs.\ baseline & 3.0 / 5.0 \\
\bottomrule
\end{tabular}
\end{center}
\paragraph{Perplexity (distributional).}
We measure next-token perplexity on 256-token continuations after full vs.\ sparse prefill (20\% keep) on five documents at 8K context (Python code, LaTeX, and mixed content):
\begin{center}
\small
\begin{tabular}{lccc}
\toprule
\textbf{Document type} & \textbf{PPL (full)} & \textbf{PPL (sparse)} & \textbf{Ratio} \\
\midrule
Python (engine code) & 1.85 & 2.53 & 1.37 \\
Python (benchmark script) & 1.66 & 1.74 & 1.05 \\
Python (test harness) & 1.49 & 1.58 & 1.06 \\
LaTeX (this paper) & 2.00 & 2.14 & 1.07 \\
Mixed (concatenated) & 2.76 & 3.17 & 1.15 \\
\midrule
\textbf{Mean} & 1.95 & 2.23 & \textbf{1.14} \\
\bottomrule
\end{tabular}
\end{center}
Mean perplexity increases 14\%, though the median ratio is 1.07 and 4 of 5 documents show an increase of 15\% or less.
The outlier (dense engine code, 1.37$\times$) contains many local variable dependencies where discarded tokens carry predictive information.
This distributional shift does not translate to generation quality degradation in our adversarial or LLM-judge evaluations above---sampling smooths over small distributional differences that perplexity measures precisely.
16K context was not tested: loading both the 122B target and 2B draft for offline evaluation leaves insufficient headroom on 128\,GB unified memory.
\paragraph{Limitations.}
Six prompts, eight adversarial types, and five perplexity documents confirm no catastrophic quality loss and validate the methodology, but the sample size is insufficient for tight confidence intervals.
We make no claim of statistical equivalence---only that \textbf{no measurable quality degradation was observed under our evaluation at 20\% keep}.
The 10\% keep JSON regression demonstrates that the quality boundary is observable and characterizable within our framework.
All quality evaluations were conducted on Qwen3.5-122B.
Nemotron-H and GPT-OSS were validated via pipeline tests (Section~\ref{sec:method}) but lack server-side adversarial evaluation.
Future work will include larger-scale evaluation on standardized long-context benchmarks (e.g., RULER, LongBench) and extend quality validation to non-Qwen architectures.
\subsection{Memory Profile}
\label{sec:memory}
\begin{table}[h]
\centering
\small
\begin{tabular}{@{}lrrl@{}}
\toprule
\textbf{Component} & \textbf{Memory} & \textbf{Cumulative} & \textbf{Notes} \\
\midrule
Target weights (122B, 5-bit) & 79\,GB & 79\,GB & Fixed at load \\
Draft weights (2B, 4-bit) & 1.4\,GB & 80\,GB & Fixed at load \\
MLX Metal cache limit & 4\,GB & 84\,GB & Computation scratch \\
Target KV cache (128K) & 12\,GB & 96\,GB & 96\,KB/token $\times$ 127K \\
Draft KV cache (128K, transient) & 1.5\,GB & 97\,GB & 12\,KB/token, freed after scoring \\
OS + framework overhead & $\sim$25\,GB & $\sim$122\,GB & Observed via \texttt{memory\_pressure} \\
\midrule
\textbf{Peak (128K baseline)} & & \textbf{$\sim$122\,GB} & Of 128\,GB unified \\
\textbf{Peak (128K \specprefill{})} & & \textbf{$\sim$122\,GB} & Draft KV transient, not additive \\
\bottomrule
\end{tabular}
\caption{Memory budget for Qwen3.5-122B with 2B draft at 128K tokens on M2~Ultra 128\,GB. The draft KV cache is transient: allocated during scoring, freed via \texttt{mx.clear\_cache()} before target prefill begins. Peak \specprefill{} memory $\approx$ baseline peak because the draft and target KV caches are never resident simultaneously.}
\label{tab:memory}
\end{table}
% ============================================================
\section{Discussion}
\label{sec:discussion}
% ============================================================
\subsection{The MoE Sweet Spot}
\specprefill{} benefits MoE architectures more than dense models.
In MoE models, each token is routed to a subset of experts during the forward pass, but the routing computation and expert weight loading occur for \emph{every} token regardless.
By reducing the token count during prefill, \specprefill{} reduces the total number of expert activations---not just attention FLOPs, but the dominant feed-forward computation.
The savings therefore apply to the dominant cost term: at $k = 0.2$, the model processes 80\% fewer tokens through its expert layers, each involving sparse routing across 128 experts (Qwen3.5-122B).
Dense models, by contrast, apply the same computation to every token regardless of routing.
Their savings from \specprefill{} are proportional only to the reduced attention and MLP computation, a smaller relative win when the model is fully compute-bound.
\subsection{When SpecPrefill Does Not Help}
\paragraph{Dense models with large drafts.}
When the FLOP ratio $r$ exceeds $\sim$0.15, scoring overhead consumes most of the potential savings.
Our GPT-OSS result (20B draft, $r \approx 0.17$, speedup $1.28\times$) demonstrates this boundary.
No smaller GPT-OSS model exists; the proprietary tokenizer (201K vocabulary) prevents cross-family draft substitution without a re-tokenization layer (Section~\ref{sec:future}).
\paragraph{Short prompts.}
Below $\sim$4K tokens, the fixed overhead of draft scoring and RoPE patching exceeds the savings from sparse prefill.
Our default threshold of 8{,}192 tokens reflects this empirical boundary.
\paragraph{Comparison with CritiPrefill.}
CritiPrefill~\cite{critiprefill} achieves sparse prefill without a draft model by using the target's own attention scores from an initial partial prefill.
On dense standard transformers where all layers are attention, CritiPrefill achieves 2.7--3.0$\times$ speedup (reported on Llama-3-8B and Yi-9B at 128K context).
However, it only saves attention FLOPs---on MoE architectures where attention constitutes 7--9\% of total computation, our analysis estimates the attention-only savings at roughly $1.03$--$1.08\times$.
\specprefill{}, by contrast, saves \emph{all} FLOPs (attention + routing + expert computation) for dropped tokens, yielding $3.7$--$5.5\times$ on MoE targets.
\subsection{Limitations}
\begin{itemize}
\item \textbf{Draft model dependency.} Requires a small model with a compatible tokenizer. This limits applicability to model families with multiple size variants (Qwen3.5: 2B/4B/27B/35B/122B; GPT-OSS: only 20B/120B).
\item \textbf{Nemotron-H SSM state.} SSM layers are updated only on retained tokens; skipped tokens do not contribute to state evolution. The resulting state trajectory diverges from full prefill. Empirically safe under our evaluation ($2.10$--$2.19\times$ with no observed quality degradation), but the magnitude of state drift is not quantified.
\item \textbf{Quality validation scale.} Six prompts and eight adversarial types validate methodology and confirm no catastrophic loss, but are insufficient for tight confidence intervals.
\item \textbf{Single hardware platform.} Results on M2~Ultra (128\,GB). Memory bandwidth and compute characteristics differ on M3/M4 variants, and the optimal keep percentage may shift.
\item \textbf{Single-request evaluation.} The serving engine serializes requests via \texttt{asyncio.Lock}. We do not evaluate \specprefill{} under concurrent load.
\end{itemize}
% ============================================================
\section{Related Work}
\label{sec:related}
% ============================================================
\paragraph{Sparse prefill.}
\citet{specprefill} introduce attention-based sparse prefill using a draft model on discrete GPU architectures.
CritiPrefill~\cite{critiprefill} achieves draft-free sparse prefill using the target model's own attention.
We extend the draft-based approach to unified memory hardware and non-transformer architectures.
To our knowledge, no prior work addresses sparse prefill specifically for unified memory systems.
\paragraph{Speculative decoding.}
\citet{leviathan2023fast} and \citet{chen2023accelerating} propose using a draft model to speculatively generate candidate tokens verified by the target model.
Multi-Token Prediction (MTP)~\cite{gloeckle2024better} uses auxiliary prediction heads within the target model itself.
Our ``Speculative Stack'' composes prefill-phase speculation (\specprefill{}) with decode-phase speculation (MTP), operating in non-overlapping phases.
\paragraph{Efficient attention.}
FlashAttention~\cite{dao2022flashattention,dao2023flashattention2} optimizes attention computation through tiling and memory-efficient kernels.
\specprefill{} is orthogonal: it reduces the token count \emph{before} attention, and can compose with efficient attention implementations.
\paragraph{Serving systems.}
vLLM~\cite{kwon2023efficient} introduces PagedAttention for efficient KV cache management.
Our system (vllm-mlx) adapts continuous-batching and paged-attention concepts for Apple Silicon.
\specprefill{} integrates as a per-request prefill optimization within this serving framework.
\paragraph{MLX framework.}
MLX~\cite{mlx} provides the unified-memory ML runtime enabling our zero-copy scoring approach.
Prior MLX-based serving work has focused on standard inference optimization; we show that unified memory enables system-level optimizations, specifically zero-copy draft scoring, that are impractical on discrete-GPU architectures.
% ============================================================
\section{Future Work}
\label{sec:future}
% ============================================================
\paragraph{Universal draft models with tokenizer translation.}
Our results are constrained to model families where a small same-tokenizer draft exists.
A \emph{universal draft model}---trained to score token importance across vocabularies via a learned re-tokenization layer---would decouple \specprefill{} from the family constraint.
The translation layer would map target token IDs to text, re-tokenize with the universal draft's vocabulary, score importance, and project scores back to target token space via character-offset alignment.
This is non-trivial (tokenization boundaries differ across BPE vocabularies) but would enable \specprefill{} for models like GPT-OSS where no small same-family draft exists.
\paragraph{CritiPrefill for dense models.}
For dense architectures where the FLOP ratio is unfavorable, CritiPrefill (draft-free) may be more practical.
Published results show 2.7--3.0$\times$ on dense 8B--9B transformers at 128K context.
On our MoE targets, where attention is a small fraction of compute, gains would be limited; the approach warrants investigation for dense models on unified memory where memory access patterns differ from GPU.
\paragraph{SSM state drift analysis.}
For hybrid models like Nemotron-H, quantifying the L2 distance between full-prefill and sparse-prefill SSM hidden states would characterize the information loss from skipping tokens in recurrent layers and establish quality guarantees beyond empirical testing.
\paragraph{Continuous batching.}
Our current implementation uses a single-request serialized engine.
Integrating \specprefill{} with continuous batching would enable concurrent request handling, but is currently blocked by a Metal driver stability issue on macOS~26.3.1.
\paragraph{Hardware generalization.}
Apple's M3 and M4 generations have different memory bandwidth and compute characteristics.
The optimal keep percentage and FLOP-ratio threshold may shift on these platforms.
% ============================================================
\section{Conclusion}
\label{sec:conclusion}
% ============================================================
We have presented the first implementation of \specprefill{} on unified memory hardware, demonstrating that Apple Silicon's shared address space eliminates the data-transfer overhead that complicates draft-based sparse prefill on discrete GPUs.
With transfer overhead removed, the cost equation reduces to a single dominant term: the draft-to-target FLOP ratio, validated across six configurations spanning MoE, Mamba-2 hybrid, and sliding-window dense architectures.
On Qwen3.5-122B, \specprefill{} reduces TTFT by $3.71$--$5.45\times$ across 8K--128K tokens with a 1.4\,GB draft model and no observed quality degradation under our evaluation. At 128K tokens, prefill drops from 19.3~minutes to 3.5~minutes.
Composed with system prompt KV caching, end-to-end speedup reaches $5.6\times$ on a 73K-token production workload.
The implementation handles architecture-specific challenges (gated queries, heterogeneous SSM/attention layers, sliding-window caches) through auto-detecting adapters that require no user configuration.
\specprefill{} is most effective on MoE and hybrid models where total parameters far exceed active computation, making a small dense draft model orders of magnitude cheaper than the target.
As large models move to local hardware, reducing prefill cost through techniques like zero-copy draft scoring directly determines whether long-context inference is usable.
% ============================================================
% References
% ============================================================
\bibliographystyle{plainnat}
\begin{thebibliography}{99}
\bibitem[Chen et~al.(2023)]{chen2023accelerating}
Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper.
\newblock Accelerating large language model decoding with speculative sampling.
\newblock \emph{arXiv preprint arXiv:2302.01318}, 2023.
\bibitem[Dao(2023)]{dao2023flashattention2}
Tri Dao.
\newblock Flash{A}ttention-2: Faster attention with better parallelism and work partitioning.
\newblock \emph{arXiv preprint arXiv:2307.08691}, 2023.
\bibitem[Dao et~al.(2022)]{dao2022flashattention}
Tri Dao, Daniel~Y. Fu, Stefano Ermon, Atri Rudra, and Christopher R\'{e}.
\newblock Flash{A}ttention: Fast and memory-efficient exact attention with {IO}-awareness.
\newblock In \emph{NeurIPS}, 2022.
\bibitem[Gloeckle et~al.(2024)]{gloeckle2024better}
Fabian Gloeckle, Badr Youbi~Idrissi, Baptiste Rozi\`{e}re, David Lopez-Paz, and Gabriel Synnaeve.
\newblock Better \& faster large language models via multi-token prediction.
\newblock \emph{arXiv preprint arXiv:2404.19737}, 2024.
\bibitem[Kwon et~al.(2023)]{kwon2023efficient}
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody~Hao Yu, Joseph~E. Gonzalez, Hao Zhang, and Ion Stoica.
\newblock Efficient memory management for large language model serving with {PagedAttention}.
\newblock In \emph{SOSP}, 2023.
\bibitem[Leviathan et~al.(2023)]{leviathan2023fast}
Yaniv Leviathan, Matan Kalman, and Yossi Matias.
\newblock Fast inference from transformers via speculative decoding.
\newblock In \emph{ICML}, 2023.
\bibitem[Apple(2023)]{mlx}
Apple Machine Learning Research.
\newblock {MLX}: An array framework for Apple Silicon.
\newblock \url{https://github.com/ml-explore/mlx}, 2023.
\bibitem[Peng et~al.(2024)]{yarn}
Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole.
\newblock {YaRN}: Efficient context window extension of large language models.
\newblock In \emph{ICLR}, 2024.
\bibitem[Su et~al.(2024)]{rope}
Jianlin Su, Murtadha Ahmed, Yu~Lu, Shengfeng Pan, Wen~Bo, and Yunfeng Liu.
\newblock {RoFormer}: Enhanced transformer with rotary position embedding.
\newblock \emph{Neurocomputing}, 568:127063, 2024.
\bibitem[Yao et~al.(2025)]{specprefill}
Ziteng Yao, Wei Chen, Yushi Huang, and others.
\newblock {SpecPrefill}: Speculative prefilling for faster long-context {LLM} inference.
\newblock \emph{arXiv preprint arXiv:2502.02789}, 2025.
\bibitem[Qwen(2025)]{qwen35}
Qwen Team.
\newblock {Qwen3.5}: A series of large language models.
\newblock \url{https://huggingface.co/Qwen/Qwen3.5-122B-A10B}, 2025.
\bibitem[Zhang et~al.(2025)]{critiprefill}
Junlin Zhang, Jiahao Li, and others.
\newblock {CritiPrefill}: A segment-level critique framework for efficient long-context {LLM} prefilling.
\newblock \emph{arXiv preprint}, 2025.
\end{thebibliography}
\end{document}