\documentclass[11pt,a4paper]{article}

% === Packages ===
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{amsmath,amssymb}
\usepackage{graphicx}
\usepackage{booktabs}
\usepackage{hyperref}
\usepackage{xcolor}
\usepackage{algorithm}
\usepackage{algpseudocode}
\usepackage{listings}
\usepackage[margin=1in]{geometry}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{natbib}

% === Macros ===
\newcommand{\tbd}[1]{\textcolor{red}{\textbf{[TBD: #1]}}}
\newcommand{\specprefill}{\textsc{SpecPrefill}}
\newcommand{\ttft}{\text{TTFT}}

\lstset{
  basicstyle=\ttfamily\small,
  keywordstyle=\color{blue},
  commentstyle=\color{gray},
  breaklines=true,
  frame=single,
  numbers=left,
  numberstyle=\tiny\color{gray},
}

% === Title ===
\title{\specprefill{} on Unified Memory:\\Cross-Architecture Sparse Prefill for\\Large Language Models on Apple Silicon}

\author{
\texttt{github.com/Thump604}
}

\date{March 2026}

\begin{document}
\maketitle

% ============================================================
\begin{abstract}
% ============================================================

Long-context prefill is the dominant latency bottleneck for local LLM inference: a 64K-token prompt on Qwen3.5-122B (MoE, 10B active parameters) takes 7 minutes before the first token appears.
\specprefill{}---attention-based sparse prefill using a draft model---was designed for GPU clusters with discrete memory.
We port it to Apple Silicon's unified memory architecture and generalize it across three model families: transformer Mixture-of-Experts (Qwen3.5), Mamba-2/attention hybrid (Nemotron-H), and sliding-window dense (GPT-OSS\footnote{GPT-OSS refers to a publicly available model by its open-source project designation.}).

On M2~Ultra (128\,GB unified memory), \specprefill{} with a 2B draft model (Qwen3.5-2B, 1.4\,GB, 4-bit) reduces TTFT by $3.71$--$5.45\times$ across 8K--128K tokens on Qwen3.5-122B, cutting 128K prefill from 19.3~minutes to 3.5~minutes.
Composed with system prompt KV caching, end-to-end speedup reaches $5.6\times$ on a 73K-token production workload.
We also achieve $2.10$--$2.19\times$ on Nemotron-H~120B across 8K--64K tokens.
Unified memory eliminates PCIe transfer overhead, making the draft-to-target FLOP ratio the dominant predictor of speedup; we formalize and validate this relationship across six draft/target configurations.
Under adversarial evaluation (0/16 regressions at 20\% keep), LLM-as-judge comparison, and perplexity analysis, we observe no quality degradation at our recommended operating point.

Our implementation handles architecture-specific challenges, including gated queries with per-head normalization (Qwen3.5), SSM-interleaved attention layers without positional encoding (Nemotron-H), and sliding-window cache preservation (GPT-OSS), and is deployed in a production serving stack with per-request API control.

\end{abstract}

% ============================================================
\section{Introduction}
\label{sec:intro}
% ============================================================

\subsection{The TTFT Problem in Local Inference}

Time-to-first-token (TTFT) is the dominant user-facing latency for large language models serving long-context requests.
On commodity hardware---a Mac Studio with Apple M2~Ultra and 128\,GB unified memory---prefilling a 64K-token prompt through Qwen3.5-122B-A10B (a 122-billion-parameter Mixture-of-Experts model with 10 billion active parameters per token) requires \textbf{418~seconds}, nearly 7~minutes before the first output token appears.
Even at 16K tokens, the wait is 92~seconds.

This latency is not a bandwidth problem.
MLX prefill on Apple Silicon is FLOP-limited: the Metal GPU is compute-bound processing each token through the model's forward pass~\cite{mlx}.
Reducing the number of tokens processed during prefill therefore yields near-linear TTFT improvement.

In local inference, the cost of long TTFT is measured not in dollars per token but in user time.
A 16K-token context (typical for an IDE coding assistant with tool definitions and file contents) means 92 seconds of waiting before the first response token.
A long creative writing session or research conversation that accumulates 64K tokens of history means 7 minutes per response.
On a serialized single-request engine, every second of prefill also delays all queued requests.

\noindent With \specprefill{}, 128K prefill on the 122B model drops from 19.3~minutes to 3.5~minutes, making long-context requests practical for interactive use.

\subsection{Why Unified Memory Changes the Calculus}

\specprefill{}~\cite{specprefill} addresses TTFT by using a small draft model to identify which prompt tokens are most important via attention scoring, then sparse-prefilling only the selected subset into the target model.
The original formulation assumes a discrete-memory GPU architecture where the draft model either (a)~shares GPU VRAM with the target, reducing KV cache headroom, or (b)~runs on CPU or a separate GPU, incurring PCIe transfer latency for importance scores.

On Apple Silicon's unified memory architecture, neither penalty applies.
Draft and target models share the same physical address space---the draft's weights (${\sim}$1.4\,GB for a 4-bit 2B model) are simply additional allocations in the same memory pool as the target's ${\sim}$79\,GB.
Scoring requires zero data movement.
On discrete GPU systems, by contrast, draft scoring would either compete for GPU VRAM with the target's KV cache (reducing effective context length) or require CPU$\leftrightarrow$GPU transfers with latency proportional to prompt length.

This simplifies the cost equation.
Let $C_t$ denote the target model's prefill cost (in FLOPs), $C_d$ the draft model's scoring cost (full prefill plus lookahead steps), and $k$ the fraction of tokens retained.
On unified memory:
\begin{equation}
\label{eq:speedup}
\text{Speedup} = \frac{C_t}{C_d + k \cdot C_t}
\end{equation}
When $C_d \ll C_t$---as with a 2B draft scoring for a 122B MoE target, where the FLOP ratio is approximately $50\times$---speedup approaches $1/k$.
At $k = 0.2$, this predicts up to $4.5\times$; we measure $4.11\times$ at 16K tokens (5 trials), with the gap attributable to overhead from chunk selection, RoPE patching, and memory management.

We term this the \textbf{ratio thesis}: on unified memory, where the transfer term $T$ of Equation~\ref{eq:speedup_gpu} is zero, the draft-to-target FLOP ratio $r$ is the dominant predictor of \specprefill{} benefit, modulated by architecture-dependent overhead $\epsilon$ (Equation~\ref{eq:speedup_ratio}).
Section~\ref{sec:ratio} validates this across six draft/target configurations spanning an $8.5\times$ range of FLOP ratios.

\subsection{Contributions}

\begin{enumerate}
\item \textbf{First implementation on unified memory hardware.}
We implement \specprefill{} on Apple Silicon via the MLX framework, demonstrating that zero-copy scoring shifts the viability threshold, making the technique effective even at moderate prompt lengths (8K tokens).

\item \textbf{Cross-architecture generalization.}
We extend \specprefill{} beyond standard transformers to Mamba-2/attention hybrids (Nemotron-H, where only 8 of 88 target layers have attention) and sliding-window models (GPT-OSS, with RotatingKVCache).
An auto-detecting query extractor handles gated attention with per-head normalization (Qwen3.5), content-based attention without positional encoding (Nemotron-H), and YarnRoPE with sliding-window cache preservation (GPT-OSS).

\item \textbf{Production system integration.}
We show per-request API control with graceful fallback, coexistence with Multi-Token Prediction (MTP) speculative decoding in a three-phase ``Speculative Stack,'' and demonstrate that the FLOP ratio thesis extends to draft model selection---smaller drafts with lower $r$ yield higher speedup.
\end{enumerate}

This is a \textbf{systems paper with algorithmic adaptations}, not a claim of a new algorithm.
The core \specprefill{} idea is due to~\citet{specprefill}; our contribution is making it work across architectures on new hardware with real deployment.


% ============================================================
\section{Background}
\label{sec:background}
% ============================================================

\subsection{SpecPrefill}

\citet{specprefill} observe that during autoregressive generation, the model attends heavily to a small fraction of prompt tokens.
If these important tokens can be identified cheaply, the full prompt need not be prefilled into the target model.
Their method uses a small draft model to score token importance via attention weights, selects the top-$k$\% of tokens (in non-overlapping chunks for spatial locality), and sparse-prefills only the selected subset.

Sparse prefill is possible because Rotary Position Embeddings (RoPE)~\cite{rope} encode \emph{relative} position.
The attention logit $q_m \cdot k_p$ depends only on the relative distance $(m - p)$ encoded via rotation angles.
If selected tokens are stored in the KV cache with their \emph{original} position angles, they interact with future decode queries at correct relative distances regardless of gaps.
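
This relative-position property can be stated as a one-line identity (a standard restatement, with $R_m$ denoting the block-diagonal RoPE rotation at position $m$): because the per-pair rotations are orthogonal and compose additively,
\begin{equation*}
(R_m q)^\top (R_p k) = q^\top R_m^\top R_p\, k = q^\top R_{p-m}\, k,
\end{equation*}
so attention logits depend on absolute positions only through the difference $m - p$.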

\subsection{MLX and Apple Silicon Unified Memory}

MLX~\cite{mlx} is Apple's machine learning framework designed for Apple Silicon.
Apple Silicon uses \emph{unified memory}: CPU and GPU share the same physical DRAM through a common memory controller.
There is no PCIe bus, no \texttt{cudaMemcpy}, no distinct VRAM allocation.
Tensors created by any compute unit are immediately accessible to any other.

MLX exploits this with lazy evaluation and reference-counted memory management.
Metal compute shaders execute matrix operations on the GPU.
In practice, a draft model's weights are additional allocations in the same memory pool, accessible at full bandwidth from any compute unit without copying.

\subsection{Target Architectures}

We evaluate \specprefill{} on three architecturally distinct model families, establishing that generalization requires non-trivial adaptations:

\begin{table}[h]
\centering
\small
\begin{tabular}{@{}lllll@{}}
\toprule
\textbf{Model} & \textbf{Architecture} & \textbf{Attention} & \textbf{Position Enc.} & \textbf{Cache} \\
\midrule
Qwen3.5-122B & MoE (10B active) & Gated + q\_norm & Standard RoPE & Standard KV \\
Nemotron-H 120B & Mamba-2 + Attn + MoE & Standard (8/88 layers) & None (SSM) & Compacted \\
GPT-OSS 120B & Dense + sliding window & Standard & YarnRoPE & RotatingKV \\
\bottomrule
\end{tabular}
\caption{Target model architectures. Each requires different query extraction, position handling, and cache management in \specprefill{}.}
\label{tab:architectures}
\end{table}

\textbf{Qwen3.5} uses gated attention where \texttt{q\_proj} outputs $2\times$ the expected width (query concatenated with a gate), requiring a split before per-head RMSNorm (\texttt{q\_norm}) and RoPE application.

\textbf{Nemotron-H} is a hybrid architecture with 40 Mamba-2 SSM layers, 8 full-attention layers, and 40 MoE feed-forward layers.
Positional information is encoded entirely in the SSM state---the attention layers have no RoPE.
Only the attention layers produce Q/K scores usable for importance scoring.

\textbf{GPT-OSS} uses YarnRoPE~\cite{yarn} with a sliding-window attention pattern where alternating layers use a \texttt{RotatingKVCache} retaining only the last 128 tokens.


% ============================================================
\section{Method}
\label{sec:method}
% ============================================================

\subsection{Token Importance Scoring}
\label{sec:scoring}

Given a prompt of $M$ tokens, importance scoring proceeds in three phases:

\begin{enumerate}
\item \textbf{Draft prefill.} The full prompt is prefilled into a small same-tokenizer draft model (e.g., Qwen3.5-2B at 4-bit quantization, 1.4\,GB) in chunks of 2{,}048 tokens, populating the draft's KV cache. The FLOP ratio thesis (Section~\ref{sec:cost_model}) predicts that minimizing the draft's active parameter count maximizes speedup, favoring the smallest available compatible model.

\item \textbf{Lookahead decode with attention capture.} Eight autoregressive decode steps are executed with \texttt{AttentionCapture} wrappers installed on each attention layer. These wrappers intercept post-RoPE query vectors via architecture-specific extractors (Section~\ref{sec:extractors}), appending them to a capture buffer before delegating to the original attention computation.

\item \textbf{Importance computation.} For each attention layer $\ell$ with captured queries $\{q^{(t)}\}_{t=1}^{8}$ and prompt keys $K_\ell \in \mathbb{R}^{h \times M \times d}$:
\begin{align}
S_\ell^{(t)} &= \text{softmax}\!\left(\frac{q^{(t)} K_\ell^T}{\sqrt{d}}\right) \in \mathbb{R}^{h \times M} \\
\bar{S}_\ell^{(t)} &= \text{AvgPool1D}(S_\ell^{(t)},\; \text{kernel}=13)
\end{align}
Scores are aggregated as $\max$ across layers and heads, then $\text{mean}$ across lookahead steps, yielding importance $I \in \mathbb{R}^M$.
Average pooling with kernel 13 smooths the signal, preventing isolated-token artifacts.
For GQA models, keys are expanded via \texttt{repeat} to match the query head count before scoring.
\end{enumerate}
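
The aggregation above can be sketched in NumPy (a toy-shape illustration; \texttt{avg\_pool1d} is a hypothetical moving-average stand-in for \texttt{AvgPool1D}, and the production path runs in MLX):

\begin{lstlisting}[language=Python]
import numpy as np

def avg_pool1d(x, kernel=13):
    # Moving average along the last axis; edge padding keeps length M.
    pad = kernel // 2
    xp = np.pad(x, [(0, 0)] * (x.ndim - 1) + [(pad, pad)], mode="edge")
    w = np.ones(kernel) / kernel
    return np.apply_along_axis(lambda v: np.convolve(v, w, mode="valid"), -1, xp)

def importance(queries, keys):
    # queries: (layers, steps, heads, d); keys: (layers, heads, M, d)
    L, T, h, d = queries.shape
    M = keys.shape[2]
    per_step = np.empty((T, L, h, M))
    for t in range(T):
        logits = np.einsum("lhd,lhmd->lhm", queries[:, t], keys) / np.sqrt(d)
        s = np.exp(logits - logits.max(-1, keepdims=True))
        s /= s.sum(-1, keepdims=True)           # softmax over prompt tokens
        per_step[t] = avg_pool1d(s, kernel=13)  # smooth isolated spikes
    pooled = per_step.max(axis=(1, 2))          # max over layers and heads
    return pooled.mean(axis=0)                  # mean over lookahead steps
\end{lstlisting}

The result is a length-$M$ importance vector, one score per prompt token.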

Layers whose cache does not span the full prompt (e.g., sliding-window \texttt{RotatingKVCache} layers caching only 128 tokens) are skipped during importance computation.

After scoring, the draft KV cache is explicitly freed and \texttt{mx.clear\_cache()} is called, reclaiming memory before target prefill begins.

\subsubsection{Architecture-Specific Query Extraction}
\label{sec:extractors}

The query extractor is auto-detected at runtime based on the attention module's attributes:

\begin{itemize}
\item \texttt{q\_norm} present $\rightarrow$ Qwen3.5 path
\item No \texttt{rope} attribute $\rightarrow$ Nemotron-H path
\item Otherwise $\rightarrow$ Standard (Llama/GPT-OSS) path
\end{itemize}
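
In code, the dispatch reduces to two attribute checks; a minimal sketch (illustrative; \texttt{select\_extractor} is a hypothetical name):

\begin{lstlisting}[language=Python]
def select_extractor(attn_module):
    # Mirror the detection order above: most specific attribute first.
    if hasattr(attn_module, "q_norm"):
        return "qwen3.5"      # gated queries + per-head RMSNorm
    if not hasattr(attn_module, "rope"):
        return "nemotron-h"   # content-based attention, no positional encoding
    return "standard"         # Llama / GPT-OSS path
\end{lstlisting}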

\paragraph{Qwen3.5: Gated queries with per-head normalization.}
The \texttt{q\_proj} output has $2\times$ the expected width, containing both query and gate tensors concatenated along the head dimension.
The output is reshaped to $(B, L, n_\text{heads}, 2 \cdot d_\text{head})$ and split at the midpoint.
After splitting, \texttt{q\_norm} (a per-head RMSNorm) is applied before RoPE rotation.
Treating this as a standard projection produces silent shape errors or incorrect scoring.

\paragraph{Nemotron-H: Heterogeneous layer navigation.}
Of 88 total layers (40 Mamba-2 + 8 attention + 40 MoE), only the 8 attention layers produce Q/K scores.
\texttt{\_find\_attention\_layers} navigates the heterogeneous layer structure by inspecting \texttt{block\_type} annotations (\texttt{M} for Mamba, \texttt{*} for attention, \texttt{-} for MLP, \texttt{E} for MoE) and locating modules with a \texttt{mixer} attribute rather than the standard \texttt{self\_attn}.
\texttt{\_build\_layer\_to\_cache\_map} constructs a compacted index because only Mamba and attention layers have cache entries.

These attention layers have \textbf{no RoPE}---positional information comes entirely from the Mamba-2 SSM state.
Queries are used as-is for content-based scoring.
This is a non-trivial engineering challenge the original paper did not address: the draft model (Nano~4B) has 42 heterogeneous layers with only 4 attention layers among 21 Mamba and 17 MLP layers, all in a model where positional information is entirely implicit.

\paragraph{GPT-OSS: RotatingKVCache awareness.}
Standard query extraction applies, but importance computation must skip sliding-window layers whose \texttt{RotatingKVCache} contains only the last 128 tokens.
Without this check, importance scores would be computed against a truncated key set, producing misleading rankings.
Correct handling requires three cache-aware adaptations:
(1)~layer-level cache introspection to distinguish full-context from sliding-window layers;
(2)~skipping importance computation for layers whose cache does not span the full prompt;
(3)~force-preserving the last \texttt{max\_size} positions during sparse selection to ensure sliding-window layers have valid recent context at decode time.


\subsection{Chunk Selection}

Tokens are grouped into non-overlapping chunks of $C = 32$ tokens.
Each chunk is scored by the mean importance of its constituent tokens.
The top $\lceil k \cdot M/C \rceil$ chunks by score are selected and their token indices returned in sorted order.
This preserves spatial locality---coherent phrases are kept or dropped as units.

At $k = 0.2$ (our optimal configuration for Qwen3.5-122B), 80\% of prefill computation is eliminated.
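
The procedure above can be sketched in a few lines of plain Python (illustrative; hypothetical names, lists rather than MLX arrays):

\begin{lstlisting}[language=Python]
import math

def select_chunks(importance, keep_pct=0.2, chunk=32):
    # Score non-overlapping chunks by mean token importance, keep the
    # top k% of chunks, and return token indices in ascending order.
    M = len(importance)
    n_chunks = math.ceil(M / chunk)
    means = [sum(importance[i*chunk:(i+1)*chunk]) /
             len(importance[i*chunk:(i+1)*chunk]) for i in range(n_chunks)]
    n_keep = math.ceil(keep_pct * M / chunk)
    top = sorted(range(n_chunks), key=lambda i: means[i], reverse=True)[:n_keep]
    return sorted(t for i in top for t in range(i*chunk, min((i+1)*chunk, M)))
\end{lstlisting}

Selected indices come back sorted, so chunk interiors stay contiguous and coherent phrases survive as units.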

\subsection{Sparse Prefill with Position-Mapped RoPE}
\label{sec:sparse_prefill}

The correctness of sparse prefill depends on maintaining correct RoPE angles despite non-contiguous token positions.
If position angles are not re-mapped, the model perceives selected tokens as adjacent, destroying long-range coherence.

\paragraph{Step 1: Sliding-window tail preservation.}
For architectures using \texttt{RotatingKVCache} (GPT-OSS), the last \texttt{max\_size} positions from the prompt are force-merged into the selection set, ensuring sliding-window attention layers have valid recent context at decode time.
This is auto-detected via cache type inspection.

\paragraph{Step 2: Position mapping during prefill.}
Each attention layer's \texttt{nn.RoPE} is replaced with \texttt{PositionMappedRoPE}, which maps contiguous batch positions $[0, 1, \ldots, N{-}1]$ to the original absolute positions $[p_0, p_1, \ldots, p_{N-1}]$ of the selected tokens.
For models with custom RoPE variants (YarnRoPE with pre-computed frequencies, SuScaled RoPE with \texttt{mscale}), the replacement module captures and replays the original frequency tensors and scale factors.

\paragraph{Step 3: Chunked forward pass.}
Selected tokens are fed through the target model in chunks of \texttt{step\_size} (default 2{,}048), populating the KV cache with entries at correct absolute positions.

\paragraph{Step 4: Decode offset adjustment.}
After sparse prefill of $N$ selected tokens from $M$ total prompt tokens, the cache offset is $N$, but decode must start at position $M$.
\texttt{OffsetAdjustedRoPE} wraps the original RoPE module and adds an adjustment $\Delta = M - N$ to all offset calls:
\begin{equation}
\text{RoPE\_position}(i) = N + i + (M - N) = M + i \quad \checkmark
\end{equation}
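
The position arithmetic of Steps~2 and~4 can be sanity-checked with a toy sketch (hypothetical helper names; the real implementation wraps \texttt{nn.RoPE} modules):

\begin{lstlisting}[language=Python]
def prefill_position(batch_index, selected_positions):
    # Step 2: contiguous batch position -> original absolute position.
    return selected_positions[batch_index]

def decode_position(cache_offset, step, M, N):
    # Step 4: cache offset is N after sparse prefill; delta = M - N
    # makes decode continue from absolute position M.
    return cache_offset + step + (M - N)

selected = [0, 1, 5, 6, 100, 101]   # N = 6 tokens kept from M = 128
# prefill_position(4, selected) -> 100 (original position survives)
# decode_position(6, 0, 128, 6) -> 128 (first decode token lands at M)
\end{lstlisting}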

\paragraph{Step 5: Cleanup.}
After generation completes, \texttt{cleanup\_rope()} traverses all attention layers and unwraps patched RoPE modules back to their originals, ensuring the model is unmodified for subsequent requests.

\paragraph{Nemotron-H (no RoPE).}
Steps 2 and 4 are skipped entirely---Nemotron-H's attention layers have no RoPE, deriving positional information from the Mamba-2 SSM state instead.

The SSM layers are updated \emph{only} on retained tokens; skipped tokens do not contribute to state evolution.
Concretely: the Mamba-2 recurrence $h_t = A h_{t-1} + B x_t$ advances only at selected positions, so the hidden state after processing $N$ selected tokens diverges from the state after processing all $M$ tokens.
Despite this altered state trajectory, the retained tokens---selected by attention importance---appear sufficient to preserve global semantics for long-context tasks: our server-side benchmarks show $2.10$--$2.19\times$ speedup across 8K--64K tokens with no observed quality degradation under our test suite (Section~\ref{sec:quality}).
Quantifying the SSM state drift (e.g., L2 distance between full and sparse hidden states) is left to future work.
We attribute the robustness to the hybrid architecture: the 8 full-attention layers retain the $N$ most important tokens with correct content-based scores, providing long-range context that compensates for gaps in the SSM's recurrent state.
This is an empirical result, not a theoretical guarantee; extending \specprefill{} to pure SSM architectures would require additional analysis.

\subsection{Unified Memory Cost Model}
\label{sec:cost_model}

We formalize the relationship between hardware architecture and \specprefill{} efficiency.

On discrete-GPU systems, the cost of \specprefill{} includes a data transfer term $T$ (PCIe bandwidth, memory copies between draft and target devices):
\begin{equation}
\label{eq:speedup_gpu}
\text{Speedup}_{\text{GPU}} = \frac{C_t}{C_d + T + k \cdot C_t}
\end{equation}

On unified memory, $T = 0$, simplifying to Equation~\ref{eq:speedup}.
The speedup is determined by the empirical wall-clock cost ratio $r = C_d / C_t$ and keep percentage $k$:
\begin{equation}
\label{eq:speedup_ratio}
\text{Speedup} = \frac{1}{r + k + \epsilon}
\end{equation}
where $\epsilon$ captures fixed overhead from chunk selection, RoPE patching, memory management (\texttt{mx.clear\_cache()}), and architecture-specific scoring costs.
In practice, $\epsilon$ ranges from 0.03 (Qwen3.5, low overhead) to 0.30 (Nemotron-H, where navigating 88 heterogeneous layers adds cost).
We emphasize that this model is \emph{descriptive}: it correctly predicts the \emph{ranking} of configurations by speedup (Table~\ref{tab:cost_model}) but does not predict exact magnitudes, as $\epsilon$ varies by architecture.
The value of the model is in draft selection---given two candidate drafts, the one with lower $r$ will yield higher speedup---not in predicting absolute speedup from first principles.
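
Equation~\ref{eq:speedup_ratio} is simple enough to evaluate directly; the sketch below reproduces the $\epsilon = 0$ predictions of Table~\ref{tab:cost_model} from the $r$ values alone (illustrative, not part of the serving stack):

\begin{lstlisting}[language=Python]
def predicted_speedup(r, k=0.2, eps=0.0):
    # Speedup = 1 / (r + k + eps); eps = 0 gives the zero-overhead bound.
    return 1.0 / (r + k + eps)

for r in (0.02, 0.03, 0.10, 0.17):
    print(f"r = {r:.2f}: {predicted_speedup(r):.1f}x")
# r = 0.02: 4.5x   r = 0.03: 4.3x   r = 0.10: 3.3x   r = 0.17: 2.7x
\end{lstlisting}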

\textbf{Boundary conditions.}
Equation~\ref{eq:speedup_ratio} assumes: (1)~sequential, single-request prefill (no batching); (2)~a FLOP-dominated regime where compute, not memory bandwidth, is the bottleneck; and (3)~negligible KV cache fragmentation cost.
In bandwidth-bound regimes or under heavy concurrent batching, $\epsilon$ may dominate and reduce the model's predictive accuracy.

However, $r$ is not simply the ratio of active parameters.
On unified memory, all expert weights reside in the same address space---there is no discrete VRAM to overflow into.
MoE forward passes incur costs beyond active-parameter FLOPs: router gating across all experts, weight loading for selected experts from unified memory, and memory bandwidth pressure from the full parameter footprint.
As a result, the empirical wall-clock cost of a MoE forward pass on unified memory scales closer to \emph{total} parameters than to active parameters alone.
A small dense 2B draft scoring against a 122B MoE target (10B active but 122B total) therefore achieves $r \approx 0.02$, reflecting the full-parameter cost disparity rather than the ${\sim}5\times$ active-parameter ratio.
This is a finding specific to unified memory systems: on discrete GPUs, where expert weights page between CPU and GPU memory, the effective cost ratio may differ.

For MoE models where total parameters far exceed active parameters, a small dense draft thus has $r \ll 1$.
The model also assumes $C_t$ scales linearly with token count; at extreme context lengths (${\geq}$128K), the $O(n^2)$ attention component causes superlinear growth in $C_t$, and measured speedup can exceed the linear prediction (Table~\ref{tab:cost_model}).
Table~\ref{tab:cost_model} compares predicted ($\epsilon = 0$) and measured speedups, with the gap $G = \text{Predicted} / \text{Measured}$ quantifying per-configuration overhead:

\begin{table}[h]
\centering
\small
\begin{tabular}{@{}lcccc@{}}
\toprule
\textbf{Configuration} & $\boldsymbol{r = C_d/C_t}$ & \textbf{Predicted} ($k{=}0.2, \epsilon{=}0$) & \textbf{Measured} & $\boldsymbol{G}$ \\
\midrule
4B / 122B MoE (10B active) & $\sim$0.03 & 4.3$\times$ & 2.90$\times$ & 1.5$\times$ \\
2B / 122B MoE (10B active) & $\sim$0.02 & 4.5$\times$ & 4.11$\times$ & 1.1$\times$ \\
2B / 122B MoE (10B active)$^\ddagger$ & $\sim$0.02 & 4.5$\times$ & 5.45$\times$ & 0.8$\times$ \\
Qwen-4B / 35B MoE (3B active) & $\sim$0.10 & 3.3$\times$ & 1.86$\times$ & 1.8$\times$ \\
4B / 120B Nemotron-H (12B active) & $\sim$0.03 & 4.3$\times$ & 2.17$\times$ & 2.0$\times$ \\
20B / 120B GPT-OSS (120B active) & $\sim$0.17 & 2.7$\times$ & 1.28$\times$ & 2.1$\times$ \\
\bottomrule
\multicolumn{5}{l}{\footnotesize $^\ddagger$Measured at 128K tokens. Measured speedup exceeds prediction due to $O(n^2)$ attention.}
\end{tabular}
\caption{Cost model predictions ($\epsilon = 0$) vs.\ measured speedups at $k = 0.2$, 16K tokens unless noted. $G = \text{Predicted} / \text{Measured}$; values $> 1$ indicate overhead exceeding the model, $< 1$ indicates superlinear baseline growth benefiting \specprefill{} beyond the linear prediction.}
\label{tab:cost_model}
\end{table}

The $G$ values reveal architecture-dependent overhead.
Nemotron-H ($G = 2.0$) has the highest $\epsilon$: the target has 88 heterogeneous layers (40~Mamba + 8~attention + 40~MoE) and the draft has 42 (21~Mamba + 4~attention + 17~MLP), requiring architecture-aware layer navigation for both scoring and sparse prefill.
GPT-OSS ($G = 2.1$) combines high $r$ (0.17; the 20B draft dominates the denominator) with sliding-window cache management overhead.
The Qwen3.5-122B configurations ($G = 1.1$--$1.5$, 5 trials) have the lowest overhead, benefiting from uniform architecture and favorable MoE FLOP ratios; the 35B target ($G = 1.8$) sits higher because its smaller active-parameter count (3B) narrows the draft-to-target gap.
At 128K, $G < 1$ because the superlinear baseline growth (Section~\ref{sec:experiments}) is not captured by the linear cost model.

\paragraph{Comparison with GPU results.}
The original \specprefill{} on discrete GPUs~\cite{specprefill} reports up to $7.66\times$ TTFT reduction on Llama-3.1-405B-FP8, benefiting from batch-level parallelism and GPU-optimized attention kernels not available on Apple Silicon.
Our lower absolute speedups ($3.71$--$5.45\times$ on 122B) reflect the single-request, unbatched setting and MLX's Metal compute pipeline.
The unified memory contribution is not higher \emph{absolute} speedup but a \emph{simpler cost model}: eliminating $T$ makes the FLOP ratio the sole dominant predictor, enabling principled draft model selection without profiling transfer overhead.

% ============================================================
\section{System Integration}
\label{sec:system}
% ============================================================

\subsection{Composition with System Prompt KV Caching}
\label{sec:composition}

In production agentic workflows, the system prompt (tool definitions, instructions, context documents) often constitutes 10--20K tokens that remain identical across requests.
System prompt KV caching snapshots this prefix on the first request and restores it for subsequent requests, eliminating redundant prefill.

These optimizations operate on orthogonal axes: KV caching eliminates \emph{prefix} cost (identical system prompt); \specprefill{} reduces \emph{suffix} cost (variable user context).
This is why they compose multiplicatively rather than additively.
When both techniques are active, \specprefill{} operates on the \emph{suffix} only (user and assistant turns), receiving a \texttt{position\_offset} equal to the system token count.
The scoring phase evaluates only suffix tokens; sparse prefill maps positions relative to the offset so selected tokens land at correct absolute positions in the full context.

The threshold check uses suffix length, not full prompt length, ensuring \specprefill{} activates only when the suffix itself is long enough to benefit.
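
The offset bookkeeping is mechanical; a minimal sketch (hypothetical helper names):

\begin{lstlisting}[language=Python]
def absolute_positions(selected_suffix_indices, position_offset):
    # SpecPrefill scores only the suffix; map its selected indices back
    # to absolute positions in the full context (prefix + suffix).
    return [position_offset + i for i in selected_suffix_indices]

def should_activate(suffix_len, threshold):
    # Threshold check uses suffix length, not full prompt length.
    return suffix_len >= threshold
\end{lstlisting}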

\begin{table}[h]
\centering
\small
\begin{tabular}{@{}lcc@{}}
\toprule
\textbf{Configuration} & \textbf{TTFT (s)} & \textbf{Speedup} \\
\midrule
Baseline (cold, full prefill) & 517.5 & 1.0$\times$ \\
System KV cache only & 417.1 & 1.24$\times$ \\
\textbf{Combined (SysKV + SP 20\%)} & \textbf{92.5} & \textbf{5.59$\times$} \\
Combined (repeat) & 92.4 & 5.60$\times$ \\
\bottomrule
\end{tabular}
\caption{Composition of system prompt KV caching and \specprefill{} on Qwen3.5-122B, 2B draft, M2~Ultra 128\,GB. The prompt (73K tokens) is a realistic agentic workload: ${\sim}$10K system prefix (tool definitions, instructions) + ${\sim}$63K user context. The system KV cache saves the prefix; \specprefill{} sparse-prefills the suffix. The combined $5.6\times$ speedup exceeds either technique alone.}
\label{tab:composition}
\end{table}


\subsection{The Speculative Stack}

\specprefill{} is not a standalone optimization but one phase of a three-phase speculative pipeline:

\begin{enumerate}
\item \textbf{Score} (\specprefill{}): Draft model (2B) identifies important tokens via attention scoring.
\item \textbf{Sparse Prefill}: Target model (122B) processes selected token chunks with position-mapped RoPE.
\item \textbf{MTP Decode}: Target model with Multi-Token Prediction heads generates output tokens speculatively (Qwen3.5 only; Nemotron-H and GPT-OSS skip this phase).
\end{enumerate}

The draft model used in Phase~1 is architecturally separate from MTP's prediction heads (which are part of the target model's weights).
The draft KV cache is freed after Phase~1 (\texttt{mx.clear\_cache()}), while the draft model's static weights (1.4--3.0\,GB depending on draft size) remain resident, amortizing startup cost across requests.
At extreme context lengths (128K+), the draft's transient KV cache becomes a practical constraint: the 2B draft produces 1.5\,GB at 128K tokens, fitting comfortably alongside the target model's ${\sim}$100\,GB on a 128\,GB system.
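In outline, the phase boundary can be sketched as (illustrative; \texttt{select\_chunks} and the model handles are stand-ins for internal interfaces, not the engine's public API):

\begin{lstlisting}[language=Python]
def speculative_prefill(prompt_ids, keep_pct=0.20):
    # Phase 1: the draft scores token importance; its KV cache is
    # transient and freed before the target allocates its own.
    scores = draft_model.score(prompt_ids)
    mx.clear_cache()
    # Phase 2: the target prefills only the selected chunks, with
    # RoPE positions mapped back to absolute indices.
    kept_chunks = select_chunks(scores, keep_pct)
    # Phase 3 (MTP decode) follows on targets that support it.
    return target_model.prefill(kept_chunks)
\end{lstlisting}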


\subsection{Per-Request API and Graceful Fallback}

The OpenAI-compatible API accepts per-request overrides:

\begin{lstlisting}[language=Python]
extra_body = {
    "specprefill": True,          # force enable (bypass threshold)
    "specprefill_keep_pct": 0.15  # override server default
}
\end{lstlisting}

The default threshold (8{,}192 tokens) is enforced only in server-default mode; explicit \texttt{specprefill: true} bypasses it.
Any error during scoring or sparse prefill triggers graceful fallback to full prefill---no request fails due to \specprefill{}.
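The fallback path reduces to a guarded dispatch (a sketch; \texttt{full\_prefill} and \texttt{sparse\_prefill} stand in for the engine's internal entry points):

\begin{lstlisting}[language=Python]
def prefill_with_fallback(request, threshold=8192):
    # Explicit "specprefill": true bypasses the length threshold;
    # server-default mode applies it to the suffix length.
    if not (request.force_specprefill or request.suffix_len >= threshold):
        return full_prefill(request)
    try:
        return sparse_prefill(request)   # score + sparse prefill
    except Exception:
        # Any scoring or sparse-prefill error degrades to the
        # baseline path, so no request fails due to SpecPrefill.
        return full_prefill(request)
\end{lstlisting}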


% ============================================================
\section{Experiments}
\label{sec:experiments}
% ============================================================

\subsection{Setup}

\paragraph{Hardware.} Apple M2~Ultra, 128\,GB unified memory, Mac Studio, macOS~26.3.1.

\paragraph{Software.} MLX~0.31.1, vllm-mlx~0.2.6 with patches, Python~3.12.

\paragraph{Sampling.} Qwen3.5 models: $\text{temp}=0.6$, $\text{top\_p}=0.95$, $\text{top\_k}=20$ (official thinking+coding profile~\cite{qwen35}).
Nemotron-H: $\text{temp}=1.0$, $\text{top\_p}=0.95$ (the settings the NVIDIA model card states the model was ``trained and evaluated with'').
GPT-OSS: $\text{temp}=0.6$, $\text{top\_p}=0.95$.
All models run with thinking mode enabled (\texttt{enable\_thinking=true}).
Sampling parameters do not affect TTFT measurement (prefill is deterministic; sampling occurs only during generation).

\paragraph{Methodology.} Five trials per configuration for Qwen3.5-122B (plus one warmup); two trials for the remaining configurations. Server-side TTFT is measured via the streaming OpenAI-compatible API.
The server is restarted between configuration changes.

\paragraph{Reproducibility.} Our \specprefill{} implementation is available as patches against vllm-mlx~0.2.6 (PRs~\#175, \#180) and mlx-lm~0.31.2 (PR~\#990), with benchmark scripts (\texttt{bench-specprefill}, \texttt{bench-specprefill-adversarial}, \texttt{bench-specprefill-perplexity}) included in the repository.
All Qwen3.5 models are publicly available quantizations.
Nemotron-H and GPT-OSS weights are available from their respective model hubs.
Experiments require an Apple Silicon system with $\geq$128\,GB unified memory for the 122B configuration; the 35B configuration runs on 64\,GB systems.

\begin{table}[h]
\centering
\small
\begin{tabular}{@{}llllllr@{}}
\toprule
\textbf{Model} & \textbf{Role} & \textbf{Architecture} & \textbf{Params} & \textbf{Active} & \textbf{Quant} & \textbf{RAM} \\
\midrule
Qwen3.5-122B-VLM-MTP & Target & MoE & 122B & 10B & 5-bit & 79\,GB \\
Qwen3.5-35B-VLM-MTP & Target & MoE & 35B & 3B & 8-bit & 38\,GB \\
Nemotron-H 120B & Target & Mamba-2 + Attn + MoE & 120B & 12B & 5-bit & 83\,GB \\
GPT-OSS 120B & Target & Dense + sliding window & 120B & 120B & 5-bit & 58\,GB \\
\midrule
Qwen3.5-4B-VLM-MTP & Draft & Dense hybrid & 4B & 4B & 4-bit & 3.0\,GB \\
Qwen3.5-2B-OptiQ & Draft & Hybrid + MoE & 2B & $<$2B & 4-bit & 1.4\,GB \\
Nemotron-H Nano 4B & Draft & Mamba-2 + Attn hybrid & 4B & 4B & 4-bit & 2.1\,GB \\
GPT-OSS-20B & Draft & MoE & 20B & 3.6B & 4-bit & 10\,GB \\
\bottomrule
\end{tabular}
\caption{Models evaluated. \specprefill{} requires draft and target to share the same tokenizer. Qwen3.5 drafts (2B, 4B) serve all Qwen3.5 targets (248K vocabulary); the 2B is the primary draft model. Nemotron-H Nano~4B serves Nemotron-H~120B (131K vocabulary). GPT-OSS-20B is the smallest available same-family draft for GPT-OSS-120B (201K vocabulary).}
\label{tab:models}
\end{table}


\subsection{TTFT Benchmarks}

\begin{table}[h]
\centering
\small
\begin{tabular}{@{}llccccc@{}}
\toprule
\textbf{Model} & \textbf{Draft} & \textbf{8K} & \textbf{16K} & \textbf{32K} & \textbf{64K} & \textbf{128K} \\
\midrule
Qwen3.5-122B (MoE, 10B) & 4B$^\ddagger$ & 2.79$\times$ & 2.90$\times$ & ---$^\dagger$ & ---$^\dagger$ & ---$^\dagger$ \\
Qwen3.5-122B (MoE, 10B) & 2B & 3.71$\times$ & 4.11$\times$ & 4.23$\times$ & 4.50$\times$ & 5.45$\times$ \\
Qwen3.5-35B (MoE, 3B) & 4B & 1.81$\times$ & 1.86$\times$ & 1.85$\times$ & 1.84$\times$ & --- \\
Nemotron-H 120B (hybrid) & Nano-4B & 2.10$\times$ & 2.17$\times$ & 2.19$\times$ & 2.19$\times$ & --- \\
GPT-OSS 120B (dense) & 20B & 1.24$\times$ & 1.28$\times$ & --- & --- & --- \\
\bottomrule
\multicolumn{7}{l}{\footnotesize $^\dagger$Not measured at this context length.} \\
\multicolumn{7}{l}{\footnotesize $^\ddagger$4B draft includes VLM weights (3.0\,GB); the 2B text-only draft (1.4\,GB) is the primary configuration.}
\end{tabular}
\caption{TTFT speedups at 20\% keep (mean). Speedup increases with prompt length as scoring overhead is amortized and $O(n^2)$ attention savings compound. The Qwen3.5-122B rows (2B and 4B drafts) use 5 trials; the other rows use 2 trials.}
\label{tab:ttft}
\end{table}

For Qwen3.5-122B with the 2B draft, the absolute TTFT at 64K tokens drops from $417.6 \pm 0.6$\,s to $92.8 \pm 0.8$\,s.
At 128K tokens: $1{,}155.8 \pm 8.5$\,s (19.3~minutes) $\rightarrow$ $212.3 \pm 1.9$\,s (3.5~minutes), a \textbf{5.45$\times$} reduction.
At 8K tokens: $45.0 \pm 0.1$\,s $\rightarrow$ $12.1 \pm 0.03$\,s.
Standard deviations across 5 trials are $<$1\% of the mean at all context lengths, confirming measurement stability.
Speedup increases monotonically with context length, from $3.71\times$ at 8K to $5.45\times$ at 128K, consistent with the amortization of fixed scoring overhead and the superlinear growth of baseline prefill cost.

\paragraph{Nemotron-H: architecture-limited speedup plateau.}
Nemotron-H shows a flat speedup profile ($2.10\times$ at 8K to $2.19\times$ at 64K), in contrast to Qwen3.5's monotonically increasing curve.
This plateau is explained by the hybrid architecture: only 8 of 88 layers are attention---the remaining 80 layers (40~Mamba-2 SSM + 40~MoE feed-forward) scale linearly with token count regardless of \specprefill{}.
The $O(n^2)$ attention component that drives Qwen3.5's compounding speedup at long contexts constitutes only ${\sim}9\%$ of Nemotron-H's total compute, so the quadratic savings are a small fraction of overall prefill cost.
This confirms the architecture-dependent nature of the cost model: \specprefill{} benefit scales with the attention fraction of total computation.
In absolute terms, Nemotron-H 120B TTFT drops from 58\,s to 27\,s at 16K and from 253\,s to 116\,s at 64K.
For Qwen3.5-35B: 41\,s to 22\,s at 16K.

\paragraph{Superlinear scaling at extreme context lengths.}
The 128K baseline (1{,}156\,s) is $2.77\times$ the 64K baseline (418\,s), not the $2\times$ expected from linear scaling.
This superlinear growth arises from the $O(n^2)$ attention component in chunked prefill: each 2{,}048-token chunk attends to all preceding tokens, so cumulative attention FLOPs grow quadratically.
\specprefill{} benefits disproportionately: by reducing the effective sequence length from $N$ to $kN$, it scales the cumulative attention FLOPs---which grow approximately as $N^2$ under full-prefix chunked prefill---down by a factor of approximately $k^2$.
At $k = 0.2$, this yields a ${\sim}25\times$ reduction in attention computation, not the $5\times$ that linear token-count scaling would suggest.
Draft scoring remains $O(N)$, so its cost grows linearly while the attention savings grow quadratically, explaining why the measured 128K speedup ($5.45\times$) exceeds the linear prediction ($4.5\times$, Table~\ref{tab:cost_model}).
MoE models exhibit stronger gains because sparse prefill reduces both the quadratic attention cost (sequence length) and the number of active expert evaluations---80\% fewer tokens means 80\% fewer expert routing and weight-loading operations.
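The quadratic saving follows directly from the chunked-prefill FLOP count. With chunk size $C = 2{,}048$, chunk $c$ attends to all $cC$ preceding tokens, so
\[
F_{\text{attn}}(N) \;\propto\; \sum_{c=1}^{N/C} C \cdot cC \;\approx\; \frac{N^2}{2},
\qquad
\frac{F_{\text{attn}}(kN)}{F_{\text{attn}}(N)} \;\approx\; k^2 = 0.04 \quad (k = 0.2),
\]
the ${\sim}25\times$ attention reduction above, while the $O(N)$ draft-scoring and linear per-token costs shrink only by a factor of $k$.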

\paragraph{Draft model selection.}
Draft models are selected to maximize FLOP asymmetry while maintaining tokenizer and architectural compatibility.
The selection criteria are: (1)~same tokenizer as the target (required---token IDs are passed directly without translation); (2)~smallest available model in the family (to maximize the FLOP ratio $r$); (3)~presence of attention layers for importance scoring (at least 4 layers suffice per our Nemotron-H validation).
Our primary configuration uses a 2B draft ($r \approx 0.02$, 1.4\,GB); complementary measurements with a 4B draft ($r \approx 0.03$, 3.0\,GB) at 8K--16K confirm the relationship: the lower-ratio 2B achieves $4.11\times$ vs.\ $2.90\times$ at 16K.

The GPT-OSS result is a \textbf{negative result confirming the ratio thesis}: the 20B draft model (the smallest available in the GPT-OSS family) has an unfavorable FLOP ratio of $\sim$0.17, yielding only $1.24$--$1.28\times$ speedup.
This validates that architecture is not the determining factor---the FLOP ratio is.
A hypothetical 4B GPT-OSS draft ($r \approx 0.03$) would be predicted to achieve $\sim$2.5--3$\times$ under our cost model, but no such model exists in the GPT-OSS family (Section~\ref{sec:future}).


\subsection{Draft-to-Target FLOP Ratio Analysis}
\label{sec:ratio}

\begin{figure}[h]
\centering
\includegraphics[width=0.9\textwidth]{figures/ratio-speedup.pdf}
\caption{Measured speedup vs.\ draft-to-target FLOP ratio. The theoretical upper bound (Eq.~\ref{eq:speedup_ratio}) correctly predicts the ranking. Overhead from RoPE patching, chunk selection, and architecture-specific scoring reduces measured values below the theoretical curve.}
\label{fig:ratio}
\end{figure}

The FLOP ratio $r = C_d / C_t$ is the dominant predictor of \specprefill{} benefit on unified memory.
Across our configurations (Table~\ref{tab:cost_model}), $r$ spans from 0.02 (2B/122B MoE) to 0.17 (20B/120B dense), and measured speedup tracks this ratio monotonically.
Complementary 4B draft measurements at 8K--16K ($r \approx 0.03$, $2.79$--$2.90\times$) confirm the ratio relationship: the lower-ratio 2B achieves higher speedup at the same context length.

This relationship distinguishes unified-memory \specprefill{} from GPU-based implementations, where PCIe bandwidth introduces an additional term $T$ (Equation~\ref{eq:speedup_gpu}) that weakens the FLOP-ratio signal.
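For intuition, consider a simplified linear-cost form of the bound, $S \approx 1/(r + k)$---draft scoring costing $r\,C_t$ and sparse prefill $k\,C_t$ (our illustrative reconstruction, consistent with the $4.5\times$ linear prediction for the primary configuration):
\[
S(r{=}0.02,\, k{=}0.2) = \frac{1}{0.22} \approx 4.5\times,
\qquad
S(r{=}0.17,\, k{=}0.2) = \frac{1}{0.37} \approx 2.7\times.
\]
Measured MoE speedups exceed the first figure at long contexts (quadratic attention savings), while GPT-OSS falls below the second (scoring and patching overheads), consistent with Figure~\ref{fig:ratio}.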


\subsection{Keep Percentage Ablation}

\begin{table}[h]
\centering
\small
\begin{tabular}{@{}ccccc@{}}
\toprule
\textbf{Keep \%} & \textbf{TTFT (s)} & \textbf{Speedup} & \textbf{Needle} & \textbf{JSON} \\
\midrule
10\% & 23.90 & 3.85$\times$ & PASS & 1/2$^\S$ \\
20\% & 31.30 & 2.94$\times$ & PASS & PASS \\
30\% & 38.88 & 2.37$\times$ & PASS & PASS \\
50\% & 53.78 & 1.71$\times$ & PASS & PASS \\
100\% (baseline) & 91.97 & 1.0$\times$ & PASS & PASS \\
\bottomrule
\multicolumn{5}{l}{\footnotesize $^\S$JSON extraction at 10\% keep: one regression (1 of 3 values wrong) in trial 1, pass in trial 2.} \\
\end{tabular}
\caption{Keep percentage ablation on Qwen3.5-122B at ${\sim}$16K tokens with the 4B draft. ``Needle'' is UUID retrieval at all three depths (10\%, 50\%, 90\%); ``JSON'' is exact value extraction from an 80-record array. The 10\% row shows the quality boundary.}
\label{tab:ablation}
\end{table}

The curve shows a clear knee at ${\sim}$20\%: all quality tests pass while delivering $2.94\times$ speedup (4B draft), and marginal speedup gains diminish beyond this point while compute cost increases linearly with $k$.
This is our recommended operating point.
At 10\%, speedup increases to $3.85\times$ but structured data extraction becomes unreliable (1 JSON regression in 2 trials), establishing 10\% as the quality boundary.
At 50\%, the $1.71\times$ speedup is marginal relative to the scoring overhead incurred.
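The ablation is consistent with an affine cost model: the four sparse rows are approximately fit by $\text{TTFT}(k) \approx 16\,\text{s} + 75\,\text{s} \cdot k$, where the ${\sim}16$\,s intercept is the fixed cost (draft scoring plus selection and RoPE-patching overhead) and the slope tracks target prefill cost. For example, $k = 0.3$ predicts ${\sim}38.5$\,s against the measured 38.88\,s. (This is a rough fit read off Table~\ref{tab:ablation}, not an independent measurement.)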


\subsection{Quality Validation}
\label{sec:quality}

Because \specprefill{} retains the highest-scoring tokens by attention importance, the model keeps the tokens it would have attended to most heavily during generation.
We validate this through three complementary evaluations, leading with the most concrete and falsifiable tests.

\paragraph{Adversarial tests (primary).}
Eight test types designed to expose sparse-prefill weaknesses: needle-in-a-haystack (UUID retrieval at 10\%, 50\%, 90\% depth), JSON value extraction from an 80-record array, code bug detection, back-reference, mixed-language retrieval, and XML structure extraction.

At 20\% keep: \textbf{0/16 regressions} across 2 trials $\times$ 8 tests.
All needle-in-a-haystack and JSON extraction tests pass under both baseline and \specprefill{}.
\specprefill{} does not drop needles or corrupt structured data retrieval at 20\% keep.

At 10\% keep: 1/16 regressions---a JSON extraction test returned one incorrect value out of three (trial 1 of 2; trial 2 passed).
All needle tests pass at all depths even at 10\%, suggesting that high-importance tokens (which needles represent) are robustly retained; the failure mode at extreme sparsity is degraded recall of \emph{low-salience} structured data.

\paragraph{ROUGE-L (lexical similarity).}
We compare outputs from \specprefill{} (20\% keep) against full-prefill baselines on six real-task prompts (code generation, code review, summarization, reasoning, tutorial writing, tool use), each targeting $\sim$8K actual tokens on Qwen3.5-122B.

To establish a variance floor, we first compare two independent baseline runs against each other:

\begin{center}
\small
\begin{tabular}{lc}
\toprule
\textbf{Comparison} & \textbf{ROUGE-L F1} \\
\midrule
Baseline vs.\ baseline (variance floor) & 0.190 $\pm$ 0.174 \\
\specprefill{} vs.\ baseline & 0.236 \\
\bottomrule
\end{tabular}
\end{center}

The high baseline-vs-baseline variance ($0.190 \pm 0.174$) demonstrates that lexical similarity between outputs is dominated by the model's inherent stochasticity at $\text{temp}=0.6$, not by any effect of \specprefill{}.
The \specprefill{} similarity (0.236) falls within the baseline noise floor, but the comparison is too noisy to support quality conclusions on its own---the adversarial tests above provide the primary quality evidence.

\paragraph{LLM-as-Judge (supporting).}
Blinded A/B evaluation scores coherence, completeness, and accuracy (1--5 scale) for both baseline and \specprefill{} outputs, plus an overall equivalence rating:

\begin{center}
\small
\begin{tabular}{lc}
\toprule
\textbf{Comparison} & \textbf{Avg.\ Equivalence} \\
\midrule
Baseline vs.\ baseline & 3.0 / 5.0 \\
\specprefill{} vs.\ baseline & 3.0 / 5.0 \\
\bottomrule
\end{tabular}
\end{center}

\paragraph{Perplexity (distributional).}
We measure next-token perplexity on 256-token continuations after full vs.\ sparse prefill (20\% keep) on five documents at 8K context (three Python code files, LaTeX, and a mixed concatenation):

\begin{center}
\small
\begin{tabular}{lccc}
\toprule
\textbf{Document type} & \textbf{PPL (full)} & \textbf{PPL (sparse)} & \textbf{Ratio} \\
\midrule
Python (engine code) & 1.85 & 2.53 & 1.37 \\
Python (benchmark script) & 1.66 & 1.74 & 1.05 \\
Python (test harness) & 1.49 & 1.58 & 1.06 \\
LaTeX (this paper) & 2.00 & 2.14 & 1.07 \\
Mixed (concatenated) & 2.76 & 3.17 & 1.15 \\
\midrule
\textbf{Mean} & 1.95 & 2.23 & \textbf{1.14} \\
\bottomrule
\end{tabular}
\end{center}

Mean perplexity increases 14\%; the median ratio is 1.07, and four of five documents show a $\leq$15\% increase.
The outlier (dense engine code, 1.37$\times$) contains many local variable dependencies where discarded tokens carry predictive information.
This distributional shift does not translate to generation quality degradation in our adversarial or LLM-judge evaluations above---sampling smooths over small distributional differences that perplexity measures precisely.
16K context was not tested: loading both the 122B target and 2B draft for offline evaluation leaves insufficient headroom on 128\,GB unified memory.

\paragraph{Limitations.}
Six prompts, eight adversarial types, and five perplexity documents confirm no catastrophic quality loss and validate the methodology, but the sample size is insufficient for tight confidence intervals.
We make no claim of statistical equivalence---only that \textbf{no measurable quality degradation was observed under our evaluation at 20\% keep}.
The 10\% keep JSON regression demonstrates that the quality boundary is observable and characterizable within our framework.
All quality evaluations were conducted on Qwen3.5-122B.
Nemotron-H and GPT-OSS were validated via pipeline tests (Section~\ref{sec:method}) but lack server-side adversarial evaluation.
Future work will include larger-scale evaluation on standardized long-context benchmarks (e.g., RULER, LongBench) and extend quality validation to non-Qwen architectures.
660
+
661
+
662
+ \subsection{Memory Profile}
663
+ \label{sec:memory}
664
+
665
+ \begin{table}[h]
666
+ \centering
667
+ \small
668
+ \begin{tabular}{@{}lrrl@{}}
669
+ \toprule
670
+ \textbf{Component} & \textbf{Memory} & \textbf{Cumulative} & \textbf{Notes} \\
671
+ \midrule
672
+ Target weights (122B, 5-bit) & 79\,GB & 79\,GB & Fixed at load \\
673
+ Draft weights (2B, 4-bit) & 1.4\,GB & 80\,GB & Fixed at load \\
674
+ MLX Metal cache limit & 4\,GB & 84\,GB & Computation scratch \\
675
+ Target KV cache (128K) & 12\,GB & 96\,GB & 96\,KB/token $\times$ 127K \\
676
+ Draft KV cache (128K, transient) & 1.5\,GB & 97\,GB & 12\,KB/token, freed after scoring \\
677
+ OS + framework overhead & $\sim$25\,GB & $\sim$122\,GB & Observed via \texttt{memory\_pressure} \\
678
+ \midrule
679
+ \textbf{Peak (128K baseline)} & & \textbf{$\sim$122\,GB} & Of 128\,GB unified \\
680
+ \textbf{Peak (128K \specprefill{})} & & \textbf{$\sim$122\,GB} & Draft KV transient, not additive \\
681
+ \bottomrule
682
+ \end{tabular}
683
+ \caption{Memory budget for Qwen3.5-122B with 2B draft at 128K tokens on M2~Ultra 128\,GB. The draft KV cache is transient: allocated during scoring, freed via \texttt{mx.clear\_cache()} before target prefill begins. Peak \specprefill{} memory $\approx$ baseline peak because the draft and target KV caches are never resident simultaneously.}
684
+ \label{tab:memory}
685
+ \end{table}


% ============================================================
\section{Discussion}
\label{sec:discussion}
% ============================================================

\subsection{The MoE Sweet Spot}

\specprefill{} benefits MoE architectures more than dense models.
In MoE models, each token is routed to a subset of experts during the forward pass, but the routing computation and expert weight loading occur for \emph{every} token regardless.
By reducing the token count during prefill, \specprefill{} reduces the total number of expert activations---not just attention FLOPs, but the dominant feed-forward computation.
The savings track the full token reduction: at $k = 0.2$, the model processes 80\% fewer tokens through its expert layers, each involving sparse routing across 128 experts (Qwen3.5-122B).

Dense models, by contrast, apply the same computation to every token; there is no routing to skip.
The savings from \specprefill{} on dense models are proportional only to the reduced attention and MLP computation, which is less dramatic when the model is fully compute-bound.
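A per-token cost sketch makes the asymmetry explicit (symbols illustrative, not drawn from our cost model):
\[
F_{\text{MoE}}(N) \approx N\,\bigl(c_{\text{attn}}(N) + c_{\text{route}} + c_{\text{expert}}\bigr),
\qquad
F_{\text{dense}}(N) \approx N\,\bigl(c_{\text{attn}}(N) + c_{\text{mlp}}\bigr),
\]
where $c_{\text{attn}}(N)$ grows with context. Sparse prefill replaces $N$ by $kN$ in every term of both expressions; the MoE advantage comes from $c_{\text{route}} + c_{\text{expert}}$ dominating total cost while being paid per token, so an 80\% token reduction removes 80\% of the routing and expert weight-loading work.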


\subsection{When SpecPrefill Does Not Help}

\paragraph{Dense models with large drafts.}
When the FLOP ratio $r$ exceeds $\sim$0.15, scoring overhead consumes most of the potential savings.
Our GPT-OSS result (20B draft, $r \approx 0.17$, speedup $1.28\times$) demonstrates this boundary.
No smaller GPT-OSS model exists; the proprietary tokenizer (201K vocabulary) prevents cross-family draft substitution without a re-tokenization layer (Section~\ref{sec:future}).

\paragraph{Short prompts.}
Below $\sim$4K tokens, the fixed overhead of draft scoring and RoPE patching exceeds the savings from sparse prefill.
Our default threshold of 8{,}192 tokens reflects this empirical boundary with a conservative margin.

\paragraph{Comparison with CritiPrefill.}
CritiPrefill~\cite{critiprefill} achieves sparse prefill without a draft model by using the target's own attention scores from an initial partial prefill.
On dense standard transformers where all layers are attention, CritiPrefill achieves 2.7--3.0$\times$ speedup (reported on Llama-3-8B and Yi-9B at 128K context).
However, it saves only attention FLOPs---on MoE architectures where attention constitutes 7--9\% of total computation, the attention-only savings would be proportionally limited; our analysis estimates $\sim$1.03--1.08$\times$.
\specprefill{} saves \emph{all} FLOPs (attention + routing + expert computation) for dropped tokens, yielding $3.7$--$5.5\times$ on MoE targets vs.\ the estimated ${\sim}1.03$--$1.08\times$ for attention-only savings.


\subsection{Limitations}

\begin{itemize}
\item \textbf{Draft model dependency.} Requires a small model with a compatible tokenizer. This limits applicability to model families with multiple size variants (Qwen3.5: 2B/4B/27B/35B/122B; GPT-OSS: only 20B/120B).
\item \textbf{Nemotron-H SSM state.} SSM layers are updated only on retained tokens; skipped tokens do not contribute to state evolution. The resulting state trajectory diverges from full prefill. Empirically safe under our evaluation ($2.10$--$2.19\times$ with no observed quality degradation), but the magnitude of state drift is not quantified.
\item \textbf{Quality validation scale.} Six prompts and eight adversarial types validate the methodology and confirm no catastrophic loss, but are insufficient for tight confidence intervals.
\item \textbf{Single hardware platform.} Results are from an M2~Ultra (128\,GB). Memory bandwidth and compute characteristics differ on M3/M4 variants, and the optimal keep percentage may shift.
\item \textbf{Single-request evaluation.} The serving engine serializes requests via \texttt{asyncio.Lock}. We do not evaluate \specprefill{} under concurrent load.
\end{itemize}


% ============================================================
\section{Related Work}
\label{sec:related}
% ============================================================

\paragraph{Sparse prefill.}
\citet{specprefill} introduce attention-based sparse prefill using a draft model on discrete GPU architectures.
CritiPrefill~\cite{critiprefill} achieves draft-free sparse prefill using the target model's own attention.
We extend the draft-based approach to unified memory hardware and non-transformer architectures.
To our knowledge, no prior work addresses sparse prefill specifically for unified memory systems.

\paragraph{Speculative decoding.}
\citet{leviathan2023fast} and \citet{chen2023accelerating} propose using a draft model to speculatively generate candidate tokens verified by the target model.
Multi-Token Prediction (MTP)~\cite{gloeckle2024better} uses auxiliary prediction heads within the target model itself.
Our ``Speculative Stack'' composes prefill-phase speculation (\specprefill{}) with decode-phase speculation (MTP), operating in non-overlapping phases.

\paragraph{Efficient attention.}
FlashAttention~\cite{dao2022flashattention,dao2023flashattention2} optimizes attention computation through tiling and memory-efficient kernels.
\specprefill{} is orthogonal: it reduces the token count \emph{before} attention, and can compose with efficient attention implementations.

\paragraph{Serving systems.}
vLLM~\cite{kwon2023efficient} introduces PagedAttention for efficient KV cache management.
Our system (vllm-mlx) adapts continuous-batching and paged-attention concepts for Apple Silicon.
\specprefill{} integrates as a per-request prefill optimization within this serving framework.

\paragraph{MLX framework.}
MLX~\cite{mlx} provides the unified-memory ML runtime enabling our zero-copy scoring approach.
Prior MLX-based serving work has focused on standard inference optimization; we show that unified memory enables system-level optimizations, specifically zero-copy draft scoring, that are impractical on discrete-GPU architectures.


% ============================================================
\section{Future Work}
\label{sec:future}
% ============================================================

\paragraph{Universal draft models with tokenizer translation.}
Our results are constrained to model families where a small same-tokenizer draft exists.
A \emph{universal draft model}---trained to score token importance across vocabularies via a learned re-tokenization layer---would decouple \specprefill{} from the family constraint.
The translation layer would map target token IDs to text, re-tokenize with the universal draft's vocabulary, score importance, and project scores back to target token space via character-offset alignment.
This is non-trivial (tokenization boundaries differ across BPE vocabularies) but would enable \specprefill{} for models like GPT-OSS where no small same-family draft exists.

\paragraph{CritiPrefill for dense models.}
For dense architectures where the FLOP ratio is unfavorable, CritiPrefill (draft-free) may be more practical.
Published results show 2.7--3.0$\times$ on dense 8B--9B transformers at 128K context.
On our MoE targets, where attention is a small fraction of compute, gains would be limited; the approach warrants investigation for dense models on unified memory, where memory access patterns differ from GPUs.

\paragraph{SSM state drift analysis.}
For hybrid models like Nemotron-H, quantifying the L2 distance between full-prefill and sparse-prefill SSM hidden states would characterize the information loss from skipping tokens in recurrent layers and establish quality guarantees beyond empirical testing.

\paragraph{Continuous batching.}
Our current implementation uses a single-request serialized engine.
Integrating \specprefill{} with continuous batching would enable concurrent request handling, but is currently blocked by a Metal driver stability issue on macOS~26.3.1.

\paragraph{Hardware generalization.}
Apple's M3 and M4 generations have different memory bandwidth and compute characteristics.
The optimal keep percentage and FLOP-ratio threshold may shift on these platforms.

791
+ % ============================================================
792
+ \section{Conclusion}
793
+ \label{sec:conclusion}
794
+ % ============================================================
795
+
796
+ We have presented the first implementation of \specprefill{} on unified memory hardware, demonstrating that Apple Silicon's shared address space eliminates the data-transfer overhead that complicates draft-based sparse prefill on discrete GPUs.
797
+ With transfer overhead removed, the cost equation reduces to a single dominant term: the draft-to-target FLOP ratio, validated across six configurations spanning MoE, Mamba-2 hybrid, and sliding-window dense architectures.
798
+
799
+ On Qwen3.5-122B, \specprefill{} reduces TTFT by $3.71$--$5.45\times$ across 8K--128K tokens with a 1.4\,GB draft model and no observed quality degradation under our evaluation. At 128K tokens, prefill drops from 19.3~minutes to 3.5~minutes.
800
+ Composed with system prompt KV caching, end-to-end speedup reaches $5.6\times$ on a 73K-token production workload.
801
+ The implementation handles architecture-specific challenges (gated queries, heterogeneous SSM/attention layers, sliding-window caches) through auto-detecting adapters that require no user configuration.
802
+
803
+ \specprefill{} is most effective on MoE and hybrid models where total parameters far exceed active computation, making a small dense draft model orders of magnitude cheaper than the target.
804
+ As large models move to local hardware, reducing prefill cost through techniques like zero-copy draft scoring directly determines whether long-context inference is usable.
805
+
806
+
807
+ % ============================================================
808
+ % References
809
+ % ============================================================
810
+
811
+ \bibliographystyle{plainnat}
812
+
813
+ \begin{thebibliography}{99}
814
+
815
+ \bibitem[Chen et~al.(2023)]{chen2023accelerating}
816
+ Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper.
817
+ \newblock Accelerating large language model decoding with speculative sampling.
818
+ \newblock \emph{arXiv preprint arXiv:2302.01318}, 2023.
819
+
820
+ \bibitem[Dao(2023)]{dao2023flashattention2}
821
+ Tri Dao.
822
+ \newblock Flash{A}ttention-2: Faster attention with better parallelism and work partitioning.
823
+ \newblock \emph{arXiv preprint arXiv:2307.08691}, 2023.
824
+
825
+ \bibitem[Dao et~al.(2022)]{dao2022flashattention}
826
+ Tri Dao, Daniel~Y. Fu, Stefano Ermon, Atri Rudra, and Christopher R\'{e}.
827
+ \newblock Flash{A}ttention: Fast and memory-efficient exact attention with {IO}-awareness.
828
+ \newblock In \emph{NeurIPS}, 2022.
829
+
830
+ \bibitem[Gloeckle et~al.(2024)]{gloeckle2024better}
831
+ Fabian Gloeckle, Badr Youbi~Idrissi, Baptiste Rozi\`{e}re, David Lopez-Paz, and Gabriel Synnaeve.
832
+ \newblock Better \& faster large language models via multi-token prediction.
833
+ \newblock \emph{arXiv preprint arXiv:2404.19737}, 2024.
834
+
\bibitem[Kwon et~al.(2023)]{kwon2023efficient}
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody~Hao Yu, Joseph~E. Gonzalez, Hao Zhang, and Ion Stoica.
\newblock Efficient memory management for large language model serving with {PagedAttention}.
\newblock In \emph{SOSP}, 2023.

\bibitem[Leviathan et~al.(2023)]{leviathan2023fast}
Yaniv Leviathan, Matan Kalman, and Yossi Matias.
\newblock Fast inference from transformers via speculative decoding.
\newblock In \emph{ICML}, 2023.

\bibitem[Apple(2023)]{mlx}
Apple Machine Learning Research.
\newblock {MLX}: An array framework for Apple Silicon.
\newblock \url{https://github.com/ml-explore/mlx}, 2023.

\bibitem[Peng et~al.(2024)]{yarn}
Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole.
\newblock {YaRN}: Efficient context window extension of large language models.
\newblock In \emph{ICLR}, 2024.

\bibitem[Su et~al.(2024)]{rope}
Jianlin Su, Murtadha Ahmed, Yu~Lu, Shengfeng Pan, Wen~Bo, and Yunfeng Liu.
\newblock {RoFormer}: Enhanced transformer with rotary position embedding.
\newblock \emph{Neurocomputing}, 568:127063, 2024.

\bibitem[Yao et~al.(2025)]{specprefill}
Ziteng Yao, Wei Chen, Yushi Huang, et~al.
\newblock {SpecPrefill}: Speculative prefilling for faster long-context {LLM} inference.
\newblock \emph{arXiv preprint arXiv:2502.02789}, 2025.

\bibitem[Qwen(2025)]{qwen35}
Qwen Team.
\newblock {Qwen3.5}: A series of large language models.
\newblock \url{https://huggingface.co/Qwen/Qwen3.5-122B-A10B}, 2025.

\bibitem[Zhang et~al.(2025)]{critiprefill}
Junlin Zhang, Jiahao Li, et~al.
\newblock {CritiPrefill}: A segment-level critique framework for efficient long-context {LLM} prefilling.
\newblock \emph{arXiv preprint}, 2025.

\end{thebibliography}

\end{document}