\documentclass[11pt,a4paper]{article} % === Packages === \usepackage[utf8]{inputenc} \usepackage[T1]{fontenc} \usepackage{amsmath,amssymb} \usepackage{graphicx} \usepackage{booktabs} \usepackage{hyperref} \usepackage{xcolor} \usepackage{algorithm} \usepackage{algpseudocode} \usepackage{listings} \usepackage[margin=1in]{geometry} \usepackage{caption} \usepackage{subcaption} \usepackage{natbib} % === Macros === \newcommand{\tbd}[1]{\textcolor{red}{\textbf{[TBD: #1]}}} \newcommand{\specprefill}{\textsc{SpecPrefill}} \newcommand{\ttft}{\text{TTFT}} \lstset{ basicstyle=\ttfamily\small, keywordstyle=\color{blue}, commentstyle=\color{gray}, breaklines=true, frame=single, numbers=left, numberstyle=\tiny\color{gray}, } % === Title === \title{\specprefill{} on Unified Memory:\\Cross-Architecture Sparse Prefill for\\Large Language Models on Apple Silicon} \author{ \texttt{github.com/Thump604} } \date{March 2026} \begin{document} \maketitle % ============================================================ \begin{abstract} % ============================================================ Long-context prefill is the dominant latency bottleneck for local LLM inference: a 64K-token prompt on Qwen3.5-122B (MoE, 10B active parameters) takes 7 minutes before the first token appears. \specprefill{} -- attention-based sparse prefill using a draft model -- was designed for GPU clusters with discrete memory. We port it to Apple Silicon's unified memory architecture and generalize it across three model families: transformer Mixture-of-Experts (Qwen3.5), Mamba-2/attention hybrid (Nemotron-H), and sliding-window dense (GPT-OSS\footnote{GPT-OSS refers to a publicly available model by its open-source project designation.}). On M2~Ultra (128\,GB unified memory), \specprefill{} with a 2B draft model (Qwen3.5-2B, 1.4\,GB, 4-bit) reduces TTFT by $3.71$--$5.45\times$ across 8K--128K tokens on Qwen3.5-122B, cutting 128K prefill from 19.3~minutes to 3.5~minutes. Composed with system prompt KV caching, end-to-end speedup reaches $5.6\times$ on a 73K-token production workload. We also achieve $2.10$--$2.19\times$ on Nemotron-H~120B across 8K--64K tokens. Unified memory eliminates PCIe transfer overhead, making the draft-to-target FLOP ratio the dominant predictor of speedup. We formalize and validate this relationship across six draft/target configurations. Quality evaluation reveals a sharp boundary: QA, summarization, and retrieval tasks show no degradation at 20\% keep (LongBench 9/11 tasks within noise, adversarial 0/16 regressions), but tasks requiring exhaustive context coverage fail catastrophically (RULER common-word extraction drops from 96.6\% to 11.8\%). We provide per-request API control so callers can disable sparse prefill for full-coverage workloads. Our implementation handles architecture-specific challenges including gated queries with per-head normalization (Qwen3.5), SSM-interleaved attention layers without positional encoding (Nemotron-H), and sliding-window cache preservation (GPT-OSS), deployed in a production serving stack with per-request API control. \end{abstract} % ============================================================ \section{Introduction} \label{sec:intro} % ============================================================ \subsection{The TTFT Problem in Local Inference} Time-to-first-token (TTFT) is the dominant user-facing latency for large language models serving long-context requests. On commodity hardware -- a Mac Studio with Apple M2~Ultra and 128\,GB unified memory -- prefilling a 64K-token prompt through Qwen3.5-122B-A10B (a 122-billion parameter Mixture-of-Experts model with 10 billion active parameters per token) requires \textbf{418~seconds}, nearly 7~minutes before the first output token appears. Even at 16K tokens, the wait is 92~seconds. This latency is not a bandwidth problem. MLX prefill on Apple Silicon is FLOP-limited: the Metal GPU is compute-bound processing each token through the model's forward pass~\cite{mlx}. Reducing the number of tokens processed during prefill therefore yields near-linear TTFT improvement. For a coding assistant with 16K tokens of context (tool definitions, file contents, conversation history), 92 seconds of silence before the first response token breaks the interaction loop. At 64K tokens -- a long research session -- the wait is 7 minutes per response. With \specprefill{}, 128K prefill drops from 19.3~minutes to 3.5~minutes, making long-context requests practical for interactive use. \subsection{Why Unified Memory Changes the Calculus} \specprefill{}~\cite{specprefill} addresses TTFT by using a small draft model to identify which prompt tokens are most important via attention scoring, then sparse-prefilling only the selected subset into the target model. The original formulation assumes a discrete-memory GPU architecture where the draft model either (a) shares GPU VRAM with the target, reducing KV cache headroom, or (b) runs on CPU or a separate GPU, incurring PCIe transfer latency for importance scores. On Apple Silicon's unified memory architecture, neither penalty applies. Draft and target models share the same physical address space -- the draft's weights (${\sim}$1.4\,GB for a 4-bit 2B model) are simply additional allocations in the same memory pool as the target's \raise.17ex\hbox{$\scriptstyle\sim$}79\,GB. Scoring requires zero data movement. On discrete GPU systems, draft scoring would either compete for GPU VRAM with the target's KV cache (reducing effective context length) or require CPU$\leftrightarrow$GPU transfers with latency proportional to prompt length -- neither penalty exists on unified memory. This simplifies the cost equation. Let $C_t$ denote the target model's prefill cost (in FLOPs), $C_d$ the draft model's scoring cost (full prefill plus lookahead steps), and $k$ the fraction of tokens retained. On unified memory: \begin{equation} \label{eq:speedup} \text{Speedup} = \frac{C_t}{C_d + k \cdot C_t} \end{equation} When $C_d \ll C_t$ -- as with a 2B draft scoring for a 122B MoE target, where the FLOP ratio is approximately $50\times$ -- speedup approaches $1/k$. At $k = 0.2$, this predicts up to $4.5\times$; we measure $4.11\times$ at 16K tokens (5 trials), with the gap attributable to overhead from chunk selection, RoPE patching, and memory management. We term this the \textbf{ratio thesis}: on unified memory where $T = 0$, the draft-to-target FLOP ratio $r$ is the dominant predictor of \specprefill{} benefit, modulated by architecture-dependent overhead $\epsilon$ (Equation~\ref{eq:speedup_ratio}). Section~\ref{sec:ratio} validates this across six draft/target configurations spanning an $8.5\times$ range of FLOP ratios. \subsection{Contributions} \begin{enumerate} \item \textbf{First implementation on unified memory hardware.} We implement \specprefill{} on Apple Silicon via the MLX framework, demonstrating that zero-copy scoring shifts the viability threshold, making the technique effective even at moderate prompt lengths (8K tokens). \item \textbf{Cross-architecture generalization.} We extend \specprefill{} beyond standard transformers to Mamba-2/attention hybrids (Nemotron-H, where only 8 of 88 target layers have attention) and sliding-window models (GPT-OSS, with RotatingKVCache). Auto-detecting query extraction handles gated attention with per-head normalization (Qwen3.5), content-based attention without positional encoding (Nemotron-H), and YarnRoPE with sliding-window cache preservation (GPT-OSS). \item \textbf{Production system integration.} We show per-request API control with graceful fallback, coexistence with Multi-Token Prediction (MTP) speculative decoding in a three-phase ``Speculative Stack,'' and demonstrate that the FLOP ratio thesis extends to draft model selection -- smaller drafts with lower $r$ yield higher speedup. \end{enumerate} The core \specprefill{} algorithm is due to~\citet{specprefill}; our contributions are porting it to new hardware, generalizing it across architectures, and deploying it in production. % ============================================================ \section{Background} \label{sec:background} % ============================================================ \subsection{SpecPrefill} \citet{specprefill} observe that during autoregressive generation, the model attends heavily to a small fraction of prompt tokens. If these important tokens can be identified cheaply, the full prompt need not be prefilled into the target model. Their method uses a small draft model to score token importance via attention weights, selects the top-$k$\% of tokens (in non-overlapping chunks for spatial locality), and sparse-prefills only the selected subset. Sparse prefill is possible because Rotary Position Embeddings (RoPE)~\cite{rope} encode \emph{relative} position. The inner product $Q_m \cdot K_p^T$ depends only on the relative distance $(m - p)$ encoded via rotation angles. If selected tokens are stored in the KV cache with their \emph{original} position angles, they interact with future decode queries at correct relative distances regardless of gaps. \subsection{MLX and Apple Silicon Unified Memory} MLX~\cite{mlx} is Apple's machine learning framework for Apple Silicon. On this architecture, CPU and GPU share the same physical DRAM -- no PCIe bus, no \texttt{cudaMemcpy}, no separate VRAM allocation. A draft model's 1.4\,GB weights are simply additional allocations in the same memory pool as the target's 79\,GB, accessible at full bandwidth without copying. \subsection{Target Architectures} We evaluate \specprefill{} on three architecturally distinct model families, each requiring different adaptations for query extraction, position handling, and cache management: \begin{table}[h] \centering \small \begin{tabular}{@{}lllll@{}} \toprule \textbf{Model} & \textbf{Architecture} & \textbf{Attention} & \textbf{Position Enc.} & \textbf{Cache} \\ \midrule Qwen3.5-122B & MoE (10B active) & Gated + q\_norm & Standard RoPE & Standard KV \\ Nemotron-H 120B & Mamba-2 + Attn + MoE & Standard (8/88 layers) & None (SSM) & Compacted \\ GPT-OSS 120B & Dense + sliding window & Standard & YarnRoPE & RotatingKV \\ \bottomrule \end{tabular} \caption{Target model architectures. Each requires different query extraction, position handling, and cache management in \specprefill{}.} \label{tab:architectures} \end{table} \textbf{Qwen3.5} uses gated attention where \texttt{q\_proj} outputs $2\times$ the expected width (query concatenated with a gate), requiring a split before per-head RMSNorm (\texttt{q\_norm}) and RoPE application. \textbf{Nemotron-H} is a hybrid architecture with 40 Mamba-2 SSM layers, 8 full-attention layers, and 40 MoE feed-forward layers. Positional information is encoded entirely in the SSM state -- the attention layers have no RoPE. Only the attention layers produce Q/K scores usable for importance scoring. \textbf{GPT-OSS} uses YarnRoPE~\cite{yarn} with a sliding-window attention pattern where alternating layers use \texttt{RotatingKVCache} retaining only the last 128 tokens. % ============================================================ \section{Method} \label{sec:method} % ============================================================ \subsection{Token Importance Scoring} \label{sec:scoring} Given a prompt of $M$ tokens, importance scoring proceeds in three phases: \begin{enumerate} \item \textbf{Draft prefill.} The full prompt is prefilled into a small same-tokenizer draft model (e.g., Qwen3.5-2B at 4-bit quantization, 1.4\,GB) in chunks of 2{,}048 tokens, populating the draft's KV cache. The FLOP ratio thesis (Section~\ref{sec:cost_model}) predicts that minimizing the draft's active parameter count maximizes speedup, favoring the smallest available compatible model. \item \textbf{Lookahead decode with attention capture.} Eight autoregressive decode steps are executed with \texttt{AttentionCapture} wrappers installed on each attention layer. These wrappers intercept post-RoPE query vectors via architecture-specific extractors (Section~\ref{sec:extractors}), appending them to a capture buffer before delegating to the original attention computation. \item \textbf{Importance computation.} For each attention layer $\ell$ with captured queries $\{q^{(t)}\}_{t=1}^{8}$ and prompt keys $K_\ell \in \mathbb{R}^{h \times M \times d}$: \begin{align} S_\ell^{(t)} &= \text{softmax}\!\left(\frac{q^{(t)} K_\ell^T}{\sqrt{d}}\right) \in \mathbb{R}^{h \times M} \\ \bar{S}_\ell^{(t)} &= \text{AvgPool1D}(S_\ell^{(t)},\; \text{kernel}=13) \end{align} Scores are aggregated as $\max$ across layers and heads, then $\text{mean}$ across lookahead steps, yielding importance $I \in \mathbb{R}^M$. Average pooling with kernel 13 smooths the signal, preventing isolated-token artifacts. For GQA models, keys are expanded via \texttt{repeat} to match the query head count before scoring. \end{enumerate} Layers whose cache does not span the full prompt (e.g., sliding-window \texttt{RotatingKVCache} layers caching only 128 tokens) are skipped during importance computation. After scoring, the draft KV cache is explicitly freed and \texttt{mx.clear\_cache()} is called, reclaiming memory before target prefill begins. \subsubsection{Architecture-Specific Query Extraction} \label{sec:extractors} The query extractor is auto-detected at runtime based on the attention module's attributes: \begin{itemize} \item \texttt{q\_norm} present $\rightarrow$ Qwen3.5 path \item No \texttt{rope} attribute $\rightarrow$ Nemotron-H path \item Otherwise $\rightarrow$ Standard (Llama/GPT-OSS) path \end{itemize} \paragraph{Qwen3.5: Gated queries with per-head normalization.} The \texttt{q\_proj} output has $2\times$ the expected width, containing both query and gate tensors concatenated along the head dimension. The output is reshaped to $(B, L, n_\text{heads}, 2 \cdot d_\text{head})$ and split at the midpoint. After splitting, \texttt{q\_norm} (a per-head RMSNorm) is applied before RoPE rotation. Treating this as a standard projection produces silent shape errors or incorrect scoring. \paragraph{Nemotron-H: Heterogeneous layer navigation.} Of 88 total layers (40 Mamba-2 + 8 attention + 40 MoE), only the 8 attention layers produce Q/K scores. \texttt{\_find\_attention\_layers} navigates the heterogeneous layer structure by inspecting \texttt{block\_type} annotations (\texttt{M} for Mamba, \texttt{*} for attention, \texttt{-} for MLP, \texttt{E} for MoE) and locating modules with a \texttt{mixer} attribute rather than the standard \texttt{self\_attn}. \texttt{\_build\_layer\_to\_cache\_map} constructs a compacted index because only Mamba and attention layers have cache entries. These attention layers have \textbf{no RoPE} -- positional information comes entirely from the Mamba-2 SSM state. Queries are used as-is for content-based scoring. The original paper did not address this case. The draft model (Nano~4B) has 42 heterogeneous layers with only 4 attention layers among 21 Mamba and 17 MLP layers, all in a model where positional information is entirely implicit. \paragraph{GPT-OSS: RotatingKVCache awareness.} Standard query extraction applies, but importance computation must skip sliding-window layers whose \texttt{RotatingKVCache} contains only the last 128 tokens. Without this check, importance scores would be computed against a truncated key set, producing misleading rankings. Correct handling requires three cache-aware adaptations: (1)~layer-level cache introspection to distinguish full-context from sliding-window layers; (2)~skipping importance computation for layers whose cache does not span the full prompt; (3)~force-preserving the last \texttt{max\_size} positions during sparse selection to ensure sliding-window layers have valid recent context at decode time. \subsection{Chunk Selection} Tokens are grouped into non-overlapping chunks of $C = 32$ tokens. Each chunk is scored by the mean importance of its constituent tokens. The top $\lceil k \cdot M/C \rceil$ chunks by score are selected and their token indices returned in sorted order. This preserves spatial locality -- coherent phrases are kept or dropped as units. At $k = 0.2$ (our optimal configuration for Qwen3.5-122B), 80\% of prefill computation is eliminated. \subsection{Sparse Prefill with Position-Mapped RoPE} \label{sec:sparse_prefill} The correctness of sparse prefill depends on maintaining correct RoPE angles despite non-contiguous token positions. If position angles are not re-mapped, the model perceives selected tokens as adjacent, destroying long-range coherence. \paragraph{Step 1: Sliding-window tail preservation.} For architectures using \texttt{RotatingKVCache} (GPT-OSS), the last \texttt{max\_size} positions from the prompt are force-merged into the selection set, ensuring sliding-window attention layers have valid recent context at decode time. This is auto-detected via cache type inspection. \paragraph{Step 2: Position mapping during prefill.} Each attention layer's \texttt{nn.RoPE} is replaced with \texttt{PositionMappedRoPE}, which maps contiguous batch positions $[0, 1, \ldots, N{-}1]$ to the original absolute positions $[p_0, p_1, \ldots, p_{N-1}]$ of the selected tokens. For models with custom RoPE variants (YarnRoPE with pre-computed frequencies, SuScaled RoPE with \texttt{mscale}), the replacement module captures and replays the original frequency tensors and scale factors. \paragraph{Step 3: Chunked forward pass.} Selected tokens are fed through the target model in chunks of \texttt{step\_size} (default 2{,}048), populating the KV cache with entries at correct absolute positions. \paragraph{Step 4: Decode offset adjustment.} After sparse prefill of $N$ selected tokens from $M$ total prompt tokens, the cache offset is $N$ but decode must start at position $M$. \texttt{OffsetAdjustedRoPE} wraps the original RoPE module and adds adjustment $\Delta = M - N$ to all offset calls: \begin{equation} \text{RoPE\_position}(i) = N + i + (M - N) = M + i \quad \checkmark \end{equation} \paragraph{Step 5: Cleanup.} After generation completes, \texttt{cleanup\_rope()} traverses all attention layers and unwraps patched RoPE modules back to their originals, ensuring the model is unmodified for subsequent requests. \paragraph{Nemotron-H (no RoPE).} Steps 2 and 4 are skipped entirely -- Nemotron-H's attention layers have no RoPE, deriving positional information from the Mamba-2 SSM state instead. The SSM layers are updated \emph{only} on retained tokens; skipped tokens do not contribute to state evolution. Concretely: the Mamba-2 recurrence $h_t = A h_{t-1} + B x_t$ advances only at selected positions, so the hidden state after processing $N$ selected tokens diverges from the state after processing all $M$ tokens. This alters the underlying state trajectory. In practice, this divergence does not produce observable quality loss: server-side benchmarks show $2.10$--$2.19\times$ speedup across 8K--64K tokens with no regressions under our evaluation (Section~\ref{sec:quality}). We attribute this to the hybrid design: the 8 full-attention layers retain the most important $N$ tokens with correct content-based scores, providing long-range context that compensates for gaps in the SSM state. This is an empirical finding, not a theoretical guarantee -- extending \specprefill{} to pure SSM architectures would require quantifying the state drift (e.g., L2 distance between full and sparse hidden states). \subsection{Unified Memory Cost Model} \label{sec:cost_model} We formalize the relationship between hardware architecture and \specprefill{} efficiency. On discrete-GPU systems, the cost of \specprefill{} includes a data transfer term $T$ (PCIe bandwidth, memory copies between draft and target devices): \begin{equation} \label{eq:speedup_gpu} \text{Speedup}_{\text{GPU}} = \frac{C_t}{C_d + T + k \cdot C_t} \end{equation} On unified memory, $T = 0$, simplifying to Equation~\ref{eq:speedup}. The speedup is determined by the empirical wall-clock cost ratio $r = C_d / C_t$ and keep percentage $k$: \begin{equation} \label{eq:speedup_ratio} \text{Speedup} = \frac{1}{r + k + \epsilon} \end{equation} where $\epsilon$ captures fixed overhead from chunk selection, RoPE patching, memory management (\texttt{mx.clear\_cache()}), and architecture-specific scoring costs. In practice, $\epsilon$ ranges from 0.03 (Qwen3.5, low overhead) to 0.30 (Nemotron-H, where navigating 88 heterogeneous layers adds cost). We emphasize that this model is \emph{descriptive}: it correctly predicts the \emph{ranking} of configurations by speedup (Table~\ref{tab:cost_model}) but does not predict exact magnitudes, as $\epsilon$ varies by architecture. The value of the model is in draft selection -- given two candidate drafts, the one with lower $r$ will yield higher speedup -- not in predicting absolute speedup from first principles. \textbf{Boundary conditions.} Equation~\ref{eq:speedup_ratio} assumes: (1)~single-request prefill (the model holds for continuous batching with serialized prefill, as in our production BatchedEngine); (2)~a FLOP-dominated regime where compute, not memory bandwidth, is the bottleneck; and (3)~negligible KV cache fragmentation cost. In bandwidth-bound regimes or under heavy concurrent batching with parallel prefill, $\epsilon$ may dominate and reduce the model's predictive accuracy. On unified memory, $r$ reflects \emph{total} parameters, not active parameters. All expert weights reside in the same address space -- router gating, weight loading, and memory bandwidth pressure scale with the full 122B footprint even though only 10B parameters activate per token. A 2B dense draft against this 122B MoE target achieves $r \approx 0.02$: the wall-clock ratio tracks total parameters, not the ${\sim}5\times$ active-parameter ratio. On discrete GPUs, where expert weights page between CPU and GPU memory, the effective ratio may differ. For MoE models where total parameters far exceed active parameters, a small dense draft has $r \ll 1$. This model assumes $C_t$ scales linearly with token count; at extreme context lengths (${\geq}$128K), the $O(n^2)$ attention component causes superlinear growth in $C_t$, and measured speedup can exceed the linear prediction (Table~\ref{tab:cost_model}). Table~\ref{tab:cost_model} compares predicted ($\epsilon = 0$) and measured speedups, with the gap $G = \text{Predicted} / \text{Measured}$ quantifying per-configuration overhead: \begin{table}[h] \centering \small \begin{tabular}{@{}lcccc@{}} \toprule \textbf{Configuration} & $\boldsymbol{r = C_d/C_t}$ & \textbf{Predicted} ($k{=}0.2, \epsilon{=}0$) & \textbf{Measured} & $\boldsymbol{G}$ \\ \midrule 4B / 122B MoE (10B active) & $\sim$0.03 & 4.3$\times$ & 2.90$\times$ & 1.5$\times$ \\ 2B / 122B MoE (10B active) & $\sim$0.02 & 4.5$\times$ & 4.11$\times$ & 1.1$\times$ \\ 2B / 122B MoE (10B active)$^\ddagger$ & $\sim$0.02 & 4.5$\times$ & 5.45$\times$ & 0.8$\times$ \\ Qwen-4B / 35B MoE (3B active) & $\sim$0.10 & 3.3$\times$ & 1.86$\times$ & 1.8$\times$ \\ 4B / 120B Nemotron-H (12B active) & $\sim$0.03 & 4.3$\times$ & 2.17$\times$ & 2.0$\times$ \\ 20B / 120B GPT-OSS (120B active) & $\sim$0.17 & 2.7$\times$ & 1.28$\times$ & 2.1$\times$ \\ \bottomrule \multicolumn{5}{l}{\footnotesize $^\ddagger$Measured at 128K tokens. Measured speedup exceeds prediction due to $O(n^2)$ attention.} \end{tabular} \caption{Cost model predictions ($\epsilon = 0$) vs.\ measured speedups at $k = 0.2$, 16K tokens unless noted. $G = \text{Predicted} / \text{Measured}$; values $> 1$ indicate overhead exceeding the model, $< 1$ indicates superlinear baseline growth benefiting \specprefill{} beyond the linear prediction.} \label{tab:cost_model} \end{table} The $G$ values reveal architecture-dependent overhead. Nemotron-H ($G = 2.0$) has the highest $\epsilon$: the target has 88 heterogeneous layers (40~Mamba + 8~attention + 40~MoE) and the draft has 42 (21~Mamba + 4~attention + 17~MLP), requiring architecture-aware layer navigation for both scoring and sparse prefill. GPT-OSS ($G = 2.1$) combines high $r$ (0.17, the 20B draft dominates the denominator) with sliding-window cache management overhead. The Qwen3.5-122B configurations ($G = 1.1$--$1.5$, 5 trials) have the lowest overhead, benefiting from uniform architecture and favorable MoE FLOP ratios; the 35B target ($G = 1.8$) sits higher due to its smaller active-parameter count (3B) narrowing the draft-to-target gap. At 128K, $G < 1$ because the superlinear baseline growth (Section~\ref{sec:experiments}) is not captured by the linear cost model. \paragraph{Comparison with GPU results.} The original \specprefill{} on discrete GPUs~\cite{specprefill} reports up to $7.66\times$ TTFT reduction on Llama-3.1-405B-FP8, benefiting from batch-level parallelism and GPU-optimized attention kernels not available on Apple Silicon. Our lower absolute speedups ($3.71$--$5.45\times$ on 122B) reflect the single-request, unbatched setting and MLX's Metal compute pipeline. The unified memory contribution is not higher \emph{absolute} speedup but a \emph{simpler cost model}: eliminating $T$ makes the FLOP ratio the sole dominant predictor, enabling principled draft model selection without profiling transfer overhead. % ============================================================ \section{System Integration} \label{sec:system} % ============================================================ \subsection{Composition with Prefix Caching} \label{sec:composition} In production agentic workflows, the system prompt (tool definitions, instructions, context documents) often constitutes 10--20K tokens that remain identical across requests. Prefix caching (via vllm-mlx's native paged prefix cache in BatchedEngine) snapshots common prefixes on the first request and restores them for subsequent requests, eliminating redundant prefill. In multi-turn conversations, this extends beyond the system prompt: each turn reuses the KV cache from all prior turns, so TTFT depends only on the new tokens added since the last turn. These optimizations operate on orthogonal axes: prefix caching eliminates \emph{prefix} cost (shared prompt prefix); \specprefill{} reduces \emph{suffix} cost (variable user context). This is why they compose multiplicatively rather than additively. When both techniques are active, \specprefill{} operates on the \emph{suffix} only (tokens beyond the cached prefix), receiving a \texttt{position\_offset} equal to the cached token count. The scoring phase evaluates only suffix tokens; sparse prefill maps positions relative to the offset so selected tokens land at correct absolute positions in the full context. The threshold check uses suffix length, not full prompt length, ensuring \specprefill{} activates only when the suffix itself is long enough to benefit. \begin{table}[h] \centering \small \begin{tabular}{@{}lcc@{}} \toprule \textbf{Configuration} & \textbf{TTFT (s)} & \textbf{Speedup} \\ \midrule Baseline (cold, full prefill) & 517.5 & 1.0$\times$ \\ Prefix cache only & 417.1 & 1.24$\times$ \\ \textbf{Combined (Prefix + SP 20\%)} & \textbf{92.5} & \textbf{5.59$\times$} \\ Combined (repeat) & 92.4 & 5.60$\times$ \\ \bottomrule \end{tabular} \caption{Composition of prefix caching and \specprefill{} on Qwen3.5-122B, 2B draft, M2~Ultra 128\,GB. The prompt (73K tokens) is a realistic agentic workload: ${\sim}$10K system prefix (tool definitions, instructions) + ${\sim}$63K user context. Prefix cache saves the shared prefix; \specprefill{} sparse-prefills the suffix. The combined $5.6\times$ speedup exceeds either technique alone.} \label{tab:composition} \end{table} In multi-turn conversations, prefix caching compounds: each turn reuses the KV cache from all prior turns. A 15-turn benchmark on Qwen3.5-122B (M2~Ultra, BatchedEngine, thinking enabled) shows TTFT plateauing at 25--26\,s for heavy turns ($>$1500 chars content) and dropping to ${\sim}$5\,s for short responses, with no degradation across turns (growth factor 0.68$\times$). This confirms that prefix caching and \specprefill{} compose well in production multi-turn workloads. \subsection{The Speculative Stack} \specprefill{} is not a standalone optimization but one phase of a three-phase speculative pipeline: \begin{enumerate} \item \textbf{Score} (\specprefill{}): Draft model (2B) identifies important tokens via attention scoring. \item \textbf{Sparse Prefill}: Target model (122B) processes selected token chunks with position-mapped RoPE. \item \textbf{MTP Decode}: Target model with Multi-Token Prediction heads generates output tokens speculatively (Qwen3.5 only; Nemotron-H and GPT-OSS skip this phase). \end{enumerate} The draft model used in Phase~1 is architecturally separate from MTP's prediction heads (which are part of the target model's weights). The draft KV cache is freed after Phase~1 (\texttt{mx.clear\_cache()}), and the draft model's static weights (1.4--3.0\,GB depending on draft size) remain resident for amortized startup cost across requests. At extreme context lengths (128K+), the draft's transient KV cache size becomes a practical constraint: the 2B draft produces 1.5\,GB at 128K tokens, fitting comfortably alongside the target model's ${\sim}$100\,GB on a 128\,GB system. \subsection{Per-Request API and Graceful Fallback} The OpenAI-compatible API accepts per-request overrides: \begin{lstlisting}[language=Python] extra_body = { "specprefill": True, # force enable (bypass threshold) "specprefill_keep_pct": 0.15 # override server default } \end{lstlisting} The default threshold (8{,}192 tokens) is enforced only in server-default mode; explicit \texttt{specprefill: true} bypasses it. Any error during scoring or sparse prefill triggers graceful fallback to full prefill -- no request fails due to \specprefill{}. % ============================================================ \section{Experiments} \label{sec:experiments} % ============================================================ \subsection{Setup} \paragraph{Hardware.} Apple M2~Ultra, 128\,GB unified memory, Mac Studio, macOS~26.3.1. \paragraph{Software.} MLX~0.31.2-dev (with chunked SDPA for 128K+ context support), vllm-mlx~0.2.6 with patches, Python~3.12. \paragraph{Sampling.} Qwen3.5 models: $\text{temp}=0.6$, $\text{top\_p}=0.95$, $\text{top\_k}=20$ (official thinking+coding profile~\cite{qwen35}). Nemotron-H: $\text{temp}=1.0$, $\text{top\_p}=0.95$ (NVIDIA model card: ``trained and evaluated with''). GPT-OSS: $\text{temp}=0.6$, $\text{top\_p}=0.95$. All models run with thinking mode enabled (\texttt{enable\_thinking=true}). Sampling parameters do not affect TTFT measurement (prefill is deterministic; sampling occurs only during generation). \paragraph{Methodology.} Five trials per configuration for Qwen3.5-122B (plus one warmup); two trials for remaining configurations. Server-side TTFT measured via streaming OpenAI-compatible API. Server restarted between configuration changes. \paragraph{Reproducibility.} Our \specprefill{} implementation is available as patches against vllm-mlx~0.2.6 (PR~\#180, merged upstream) and mlx-lm~0.31.2 (PR~\#990), with benchmark scripts (\texttt{bench-specprefill}, \texttt{bench-specprefill-adversarial}, \texttt{bench-specprefill-perplexity}) included in the repository. All models are publicly available Qwen3.5 quantizations. Nemotron-H and GPT-OSS weights are available from their respective model hubs. Experiments require an Apple Silicon system with $\geq$128\,GB unified memory for the 122B configuration; the 35B configuration runs on 64\,GB systems. \begin{table}[h] \centering \small \begin{tabular}{@{}llllllr@{}} \toprule \textbf{Model} & \textbf{Role} & \textbf{Architecture} & \textbf{Params} & \textbf{Active} & \textbf{Quant} & \textbf{RAM} \\ \midrule Qwen3.5-122B-VLM-MTP & Target & MoE & 122B & 10B & 5-bit & 79\,GB \\ Qwen3.5-35B-VLM-MTP & Target & MoE & 35B & 3B & 8-bit & 38\,GB \\ Nemotron-H 120B & Target & Mamba-2 + Attn + MoE & 120B & 12B & 5-bit & 83\,GB \\ GPT-OSS 120B & Target & Dense + sliding window & 120B & 120B & 5-bit & 58\,GB \\ \midrule Qwen3.5-4B-VLM-MTP & Draft & Dense hybrid & 4B & 4B & 4-bit & 3.0\,GB \\ Qwen3.5-2B-OptiQ & Draft & Hybrid + MoE & 2B & $<$2B & 4-bit & 1.4\,GB \\ Nemotron-H Nano 4B & Draft & Mamba-2 + Attn hybrid & 4B & 4B & 4-bit & 2.1\,GB \\ GPT-OSS-20B & Draft & MoE & 20B & 3.6B & 4-bit & 10\,GB \\ \bottomrule \end{tabular} \caption{Models evaluated. \specprefill{} requires draft and target to share the same tokenizer. Qwen3.5 drafts (2B, 4B) serve all Qwen3.5 targets (248K vocabulary); the 2B is the primary draft model. Nemotron-H Nano~4B serves Nemotron-H~120B (131K vocabulary). GPT-OSS-20B is the smallest available same-family draft for GPT-OSS-120B (201K vocabulary).} \label{tab:models} \end{table} \subsection{TTFT Benchmarks} \begin{table}[h] \centering \small \begin{tabular}{@{}llccccc@{}} \toprule \textbf{Model} & \textbf{Draft} & \textbf{8K} & \textbf{16K} & \textbf{32K} & \textbf{64K} & \textbf{128K} \\ \midrule Qwen3.5-122B (MoE, 10B) & 4B$^\ddagger$ & 2.79$\times$ & 2.90$\times$ & ---$^\dagger$ & ---$^\dagger$ & ---$^\dagger$ \\ Qwen3.5-122B (MoE, 10B) & 2B & 3.71$\times$ & 4.11$\times$ & 4.23$\times$ & 4.50$\times$ & 5.45$\times$ \\ Qwen3.5-35B (MoE, 3B) & 4B & 1.81$\times$ & 1.86$\times$ & 1.85$\times$ & 1.84$\times$ & --- \\ Nemotron-H 120B (hybrid) & Nano-4B & 2.10$\times$ & 2.17$\times$ & 2.19$\times$ & 2.19$\times$ & --- \\ GPT-OSS 120B (dense) & 20B & 1.24$\times$ & 1.28$\times$ & --- & --- & --- \\ \bottomrule \multicolumn{7}{l}{\footnotesize $^\dagger$Not measured at this context length.} \\ \multicolumn{7}{l}{\footnotesize $^\ddagger$4B draft includes VLM weights (3.0\,GB); the 2B text-only draft (1.4\,GB) is the primary configuration.} \end{tabular} \caption{TTFT speedups at 20\% keep, 5 trials (mean). Speedup increases with prompt length as scoring overhead is amortized and $O(n^2)$ attention savings compound. Qwen3.5-122B and 4B rows use 5 trials; other rows use 2 trials.} \label{tab:ttft} \end{table} For Qwen3.5-122B with the 2B draft, the absolute TTFT at 64K tokens drops from $417.6 \pm 0.6$\,s to $92.8 \pm 0.8$\,s. At 128K tokens: $1{,}155.8 \pm 8.5$\,s (19.3~minutes) $\rightarrow$ $212.3 \pm 1.9$\,s (3.5~minutes), a \textbf{5.45$\times$} reduction. At 8K tokens: $45.0 \pm 0.1$\,s $\rightarrow$ $12.1 \pm 0.03$\,s. Measurements are stable ($<$1\% variance across 5 trials). Speedup grows with context length -- $3.71\times$ at 8K to $5.45\times$ at 128K -- as fixed scoring overhead amortizes and the $O(n^2)$ baseline cost compounds. \paragraph{Nemotron-H: architecture-limited speedup plateau.} Nemotron-H shows a flat speedup profile ($2.10\times$ at 8K to $2.19\times$ at 64K), in contrast to Qwen3.5's monotonically increasing curve. This plateau is explained by the hybrid architecture: only 8 of 88 layers are attention -- the remaining 80 layers (40~Mamba-2 SSM + 40~MoE feed-forward) scale linearly with token count regardless of \specprefill{}. The $O(n^2)$ attention component that drives Qwen3.5's compounding speedup at long contexts constitutes only ${\sim}9\%$ of Nemotron-H's total compute, so the quadratic savings are a small fraction of overall prefill cost. This confirms the architecture-dependent nature of the cost model: \specprefill{} benefit scales with the attention fraction of total computation. In absolute terms, Nemotron-H 120B TTFT drops from 58\,s to 27\,s at 16K and from 253\,s to 116\,s at 64K. For Qwen3.5-35B: 41\,s to 22\,s at 16K. \paragraph{Superlinear scaling at extreme context lengths.} The 128K baseline (1{,}156\,s) is $2.77\times$ the 64K baseline (418\,s), not the $2\times$ expected from linear scaling. This superlinear growth arises from the $O(n^2)$ attention component in chunked prefill: each 2{,}048-token chunk attends to all preceding tokens, so cumulative attention FLOPs grow quadratically. \specprefill{} benefits disproportionately: by reducing the effective sequence length from $N$ to $kN$, it reduces the cumulative attention FLOPs -- which scale approximately as $N^2$ under full-prefix chunked prefill -- by approximately $k^2$. At $k = 0.2$, this yields a ${\sim}25\times$ reduction in attention computation, not the $5\times$ that linear token-count scaling would suggest. Draft scoring remains $O(N)$, so its cost grows linearly while the attention savings grow quadratically, explaining why the measured 128K speedup ($5.45\times$) exceeds the linear prediction ($4.5\times$, Table~\ref{tab:cost_model}). MoE models exhibit stronger gains because sparse prefill reduces both the quadratic attention cost (sequence length) and the number of active expert evaluations -- 80\% fewer tokens means 80\% fewer expert routing and weight-loading operations. \paragraph{Draft model selection.} Draft models are selected to maximize FLOP asymmetry while maintaining tokenizer and architectural compatibility. The selection criteria are: (1)~same tokenizer as the target (required -- token IDs are passed directly without translation); (2)~smallest available model in the family (to maximize the FLOP ratio $r$); (3)~presence of attention layers for importance scoring (at least 4 layers suffice per our Nemotron-H validation). Our primary configuration uses a 2B draft ($r \approx 0.02$, 1.4\,GB); complementary measurements with a 4B draft ($r \approx 0.03$, 3.0\,GB) at 8K--16K confirm the relationship: the lower-ratio 2B achieves $4.11\times$ vs.\ $2.90\times$ at 16K. The GPT-OSS result is a \textbf{negative result confirming the ratio thesis}: the 20B draft model (the smallest available in the GPT-OSS family) has an unfavorable FLOP ratio of $\sim$0.17, yielding only $1.24$--$1.28\times$ speedup. This validates that architecture is not the determining factor -- the FLOP ratio is. A hypothetical 4B GPT-OSS draft ($r \approx 0.03$) would be predicted to achieve $\sim$2.5--3$\times$ under our cost model, but no such model exists in the GPT-OSS family (Section~\ref{sec:future}). \subsection{Draft-to-Target FLOP Ratio Analysis} \label{sec:ratio} \begin{figure}[h] \centering \includegraphics[width=0.9\textwidth]{figures/ratio-speedup.pdf} \caption{Measured speedup vs.\ draft-to-target FLOP ratio. The theoretical upper bound (Eq.~\ref{eq:speedup_ratio}) correctly predicts the ranking. Overhead from RoPE patching, chunk selection, and architecture-specific scoring reduces measured values below the theoretical curve.} \label{fig:ratio} \end{figure} The FLOP ratio $r = C_d / C_t$ is the dominant predictor of \specprefill{} benefit on unified memory. Across our configurations (Table~\ref{tab:cost_model}), $r$ spans from 0.02 (2B/122B MoE) to 0.17 (20B/120B dense), and measured speedup tracks this ratio monotonically. Complementary 4B draft measurements at 8K--16K ($r \approx 0.03$, $2.79$--$2.90\times$) confirm the ratio relationship: the lower-ratio 2B achieves higher speedup at the same context length. This relationship distinguishes unified-memory \specprefill{} from GPU-based implementations, where PCIe bandwidth introduces an additional term $T$ (Equation~\ref{eq:speedup_gpu}) that weakens the FLOP-ratio signal. \subsection{Keep Percentage Ablation} \begin{table}[h] \centering \small \begin{tabular}{@{}ccccc@{}} \toprule \textbf{Keep \%} & \textbf{TTFT (s)} & \textbf{Speedup} & \textbf{Needle} & \textbf{JSON} \\ \midrule 10\% & 23.90 & 3.85$\times$ & PASS & 1/2$^\S$ \\ 20\% & 31.30 & 2.94$\times$ & PASS & PASS \\ 30\% & 38.88 & 2.37$\times$ & PASS & PASS \\ 50\% & 53.78 & 1.71$\times$ & PASS & PASS \\ 100\% (baseline) & 91.97 & 1.0$\times$ & PASS & PASS \\ \bottomrule \multicolumn{5}{l}{\footnotesize $^\S$JSON extraction at 10\% keep: one regression (1 of 3 values wrong) in trial 1, pass in trial 2.} \\ \end{tabular} \caption{Keep percentage ablation on Qwen3.5-122B at ${\sim}$16K tokens with the 4B draft. ``Needle'' is UUID retrieval at all three depths (10\%, 50\%, 90\%); ``JSON'' is exact value extraction from an 80-record array. The 10\% row shows the quality boundary.} \label{tab:ablation} \end{table} The curve shows a clear knee at ${\sim}$20\%: all quality tests pass while delivering $2.94\times$ speedup (4B draft), and marginal speedup gains diminish beyond this point while compute cost increases linearly with $k$. This is our recommended operating point. At 10\%, speedup increases to $3.85\times$ but structured data extraction becomes unreliable (1 JSON regression in 2 trials), establishing 10\% as the quality boundary. At 50\%, the $1.71\times$ speedup is marginal relative to the scoring overhead incurred. \subsection{Quality Validation} \label{sec:quality} We validate quality through three complementary evaluations: targeted adversarial probes, standardized benchmarks, and distributional analysis. \paragraph{Adversarial tests (primary).} Eight test types designed to expose sparse-prefill weaknesses: needle-in-a-haystack (UUID retrieval at 10\%, 50\%, 90\% depth), JSON value extraction from an 80-record array, code bug detection, back-reference, mixed-language retrieval, and XML structure extraction. At 20\% keep: \textbf{0/16 regressions} across 2 trials $\times$ 8 tests. All needle-in-a-haystack and JSON extraction tests pass under both baseline and \specprefill{}. At 20\% keep, no needles are dropped and no structured data is corrupted. At 10\% keep: 1/16 regressions -- a JSON extraction test returned one incorrect value out of three (trial 1 of 2; trial 2 passed). All needle tests pass at all depths even at 10\%, suggesting that high-importance tokens (which needles represent) are reliably retained; the failure mode at extreme sparsity is degraded recall of \emph{low-salience} structured data. \paragraph{ROUGE-L (lexical similarity).} We compare outputs from \specprefill{} (20\% keep) against full-prefill baselines on six real-task prompts (code generation, code review, summarization, reasoning, tutorial writing, tool use), each targeting $\sim$8K actual tokens on Qwen3.5-122B. To establish a variance floor, we first compare two independent baseline runs against each other: \begin{center} \small \begin{tabular}{lc} \toprule \textbf{Comparison} & \textbf{ROUGE-L F1} \\ \midrule Baseline vs.\ baseline (variance floor) & 0.190 $\pm$ 0.174 \\ \specprefill{} vs.\ baseline & 0.236 \\ \bottomrule \end{tabular} \end{center} Baseline-vs-baseline variance ($0.190 \pm 0.174$) is high: at $\text{temp}=0.6$, output similarity is dominated by sampling stochasticity, not by any effect of \specprefill{}. The \specprefill{} ROUGE-L (0.236) falls within the baseline noise floor, confirming that lexical overlap is not a useful quality signal at non-zero temperature. \paragraph{LLM-as-Judge (supporting).} Blinded A/B evaluation scores coherence, completeness, and accuracy (1--5 scale) for both baseline and \specprefill{} outputs, plus an overall equivalence rating: \begin{center} \small \begin{tabular}{lcc} \toprule \textbf{Comparison} & \textbf{Avg.\ Equivalence} \\ \midrule Baseline vs.\ baseline & 3.0 / 5.0 \\ \specprefill{} vs.\ baseline & 3.0 / 5.0 \\ \bottomrule \end{tabular} \end{center} \paragraph{Perplexity (distributional).} We measure next-token perplexity on 256-token continuations after full vs.\ sparse prefill (20\% keep) on five documents at 8K context (code, documentation, LaTeX, mixed): \begin{center} \small \begin{tabular}{lccc} \toprule \textbf{Document type} & \textbf{PPL (full)} & \textbf{PPL (sparse)} & \textbf{Ratio} \\ \midrule Python (engine code) & 1.85 & 2.53 & 1.37 \\ Python (benchmark script) & 1.66 & 1.74 & 1.05 \\ Python (test harness) & 1.49 & 1.58 & 1.06 \\ LaTeX (this paper) & 2.00 & 2.14 & 1.07 \\ Mixed (concatenated) & 2.76 & 3.17 & 1.15 \\ \midrule \textbf{Mean} & 1.95 & 2.23 & \textbf{1.14} \\ \bottomrule \end{tabular} \end{center} Mean perplexity increases 14\%, though 4 of 5 documents show $\leq$7\% increase (median ratio 1.07). The outlier (dense engine code, 1.37$\times$) contains many local variable dependencies where discarded tokens carry predictive information. This distributional shift does not produce visible generation quality loss -- sampling smooths over the small differences that perplexity measures precisely. 16K context was not tested: loading both the 122B target and 2B draft for offline evaluation leaves insufficient headroom on 128\,GB unified memory. \paragraph{LongBench (standardized benchmark).} To complement the above evaluations with a standardized benchmark, we evaluate on LongBench~\cite{longbench} using the \texttt{lm-evaluation-harness}~\cite{eval-harness} framework. We compare baseline (full prefill) against \specprefill{} (20\% keep, 2B draft) on 11 tasks covering question answering (6 tasks), summarization (3 tasks), and code completion (2 tasks), with 20 samples per task and greedy decoding ($\text{temp}=0$). \begin{table}[h] \centering \small \begin{tabular}{@{}llccr@{}} \toprule \textbf{Task} & \textbf{Metric} & \textbf{Baseline} & \textbf{SpecPrefill} & \textbf{$\Delta$} \\ \midrule 2WikiMQA & QA F1 & 0.653 & 0.650 & $-$0.003 \\ HotpotQA & QA F1 & 0.590 & 0.583 & $-$0.008 \\ MuSiQue & QA F1 & 0.685 & 0.603 & $-$0.082 \\ MultifieldQA & QA F1 & 0.604 & 0.621 & +0.018 \\ NarrativeQA & QA F1 & 0.424 & 0.433 & +0.009 \\ QASPER & QA F1 & 0.518 & 0.510 & $-$0.008 \\ \midrule Gov Report & ROUGE & 0.322 & 0.320 & $-$0.002 \\ QMSum & ROUGE & 0.227 & 0.238 & +0.011 \\ Multi News & ROUGE & 0.230 & 0.230 & +0.000 \\ \midrule LCC & Code Sim & 0.578 & 0.528 & $-$0.050 \\ RepoBench-P & Code Sim & 0.196 & 0.089 & $-$0.107 \\ \bottomrule \end{tabular} \caption{LongBench quality comparison: baseline vs.\ \specprefill{} (20\% keep, 2B draft) on Qwen3.5-122B, $N=20$ per task, greedy decoding.} \label{tab:longbench} \end{table} Nine of 11 tasks show deltas within one standard error of the baseline (median $\Delta = -0.005$). The two code completion tasks show larger drops: LCC ($-0.050$) and RepoBench-P ($-0.107$, 1.4 SE). RepoBench-P requires completing repository-level code where discarded tokens may contain import paths and type signatures that are individually low-salience but collectively necessary; this is the same failure mode observed in the perplexity outlier above. At $N=20$ per task, these results show that \textbf{\specprefill{} at 20\% keep preserves quality on QA and summarization tasks, while code completion with distributed dependencies shows measurable regression}, warranting per-request override to higher keep percentages. The RULER benchmark below further sharpens this boundary. \paragraph{RULER benchmark.} The evaluations above (perplexity, LongBench, adversarial) test generation quality on tasks where the answer depends on localized passages. To test whether \specprefill{} preserves performance on tasks that require exhaustive scanning of the full context, we evaluate on the RULER benchmark~\cite{ruler2024} at 16K context ($N=50$ per task). The results reveal a sharp quality boundary. \begin{table}[h] \centering \small \begin{tabular}{@{}llccr@{}} \toprule \textbf{Task} & \textbf{Type} & \textbf{Baseline} & \textbf{\specprefill{}} & \textbf{$\Delta$} \\ \midrule FWE & Frequent words & 0.993 & 0.973 & $-$0.020 \\ QA-HotpotQA & Multi-hop QA & 0.480 & 0.380 & $-$0.100 \\ QA-SQuAD & Extractive QA & 0.247 & 0.187 & $-$0.060 \\ \midrule CWE & Common words & 0.966 & \textbf{0.118} & $-$0.848 \\ VT & Variable tracking & 0.348 & \textbf{0.076} & $-$0.272 \\ \bottomrule \end{tabular} \caption{RULER benchmark at 16K context: baseline vs.\ \specprefill{} (20\% keep, 2B draft) on Qwen3.5-122B, $N=50$ per task. The top group (FWE, QA) shows moderate or no degradation. The bottom group (CWE, VT) shows catastrophic failure: these tasks require exhaustive context coverage that sparse prefill cannot provide.} \label{tab:ruler} \end{table} The pattern is clear. QA tasks (HotpotQA, SQuAD) degrade moderately because the answer typically lives in a concentrated region that the draft model's attention scores identify and preserve. Frequent word extraction (FWE) survives because high-frequency words appear often enough that 20\% of the context still contains sufficient instances. Common word extraction (CWE) and variable tracking (VT) fail catastrophically. CWE requires counting word frequencies across the \emph{entire} context; dropping 80\% of tokens removes 80\% of the evidence. VT requires tracking sequential state changes; dropped tokens break the chain of updates. These tasks have no locality for the draft model to exploit: every token matters equally. This result, combined with the RepoBench-P regression (Table~\ref{tab:longbench}), precisely characterizes the quality boundary: \begin{itemize} \item \textbf{Safe}: QA, summarization, retrieval, code understanding. The answer depends on localized passages. Attention scores reliably identify the relevant tokens. \item \textbf{Unsafe}: Exhaustive extraction, frequency counting, sequential state tracking, code completion with distributed dependencies. These tasks require uniform coverage of the full context. Sparse prefill discards exactly the tokens they need. \end{itemize} \paragraph{Practical guidance.} The per-request API (\texttt{specprefill\_keep\_pct}) allows callers to adjust or disable sparse prefill based on the task. For interactive coding agents and QA workloads, the default 20\% keep provides $3$--$5\times$ TTFT reduction with acceptable quality. For batch analysis, log scanning, or any task that must process every token, callers should set \texttt{specprefill: false} or \texttt{specprefill\_keep\_pct: 1.0}. The threshold parameter (default 8{,}192 tokens) provides an additional safety margin: short prompts bypass specprefill entirely. \paragraph{Limitations.} Six adversarial prompt types, five perplexity documents, 11 LongBench tasks, and five RULER tasks provide a detailed quality profile across diverse evaluation methodologies, though sample sizes remain modest ($N=20$ for LongBench, $N=50$ for RULER). The RULER results demonstrate that \textbf{\specprefill{} at 20\% keep preserves quality on retrieval and QA tasks but causes catastrophic degradation on tasks requiring exhaustive context coverage}. This is a fundamental property of attention-based token selection, not a tunable parameter: tasks with uniformly distributed information cannot benefit from sparse prefill at any keep percentage below 100\%. All quality evaluations were conducted on Qwen3.5-122B. Nemotron-H and GPT-OSS were validated via pipeline tests (Section~\ref{sec:method}) but lack server-side quality evaluation. \subsection{Memory Profile} \label{sec:memory} \begin{table}[h] \centering \small \begin{tabular}{@{}lrrl@{}} \toprule \textbf{Component} & \textbf{Memory} & \textbf{Cumulative} & \textbf{Notes} \\ \midrule Target weights (122B, 5-bit) & 79\,GB & 79\,GB & Fixed at load \\ Draft weights (2B, 4-bit) & 1.4\,GB & 80\,GB & Fixed at load \\ MLX Metal cache limit & 4\,GB & 84\,GB & Computation scratch \\ Target KV cache (128K) & 12\,GB & 96\,GB & 96\,KB/token $\times$ 127K \\ Draft KV cache (128K, transient) & 1.5\,GB & 97\,GB & 12\,KB/token, freed after scoring \\ OS + framework overhead & $\sim$25\,GB & $\sim$122\,GB & Observed via \texttt{memory\_pressure} \\ \midrule \textbf{Peak (128K baseline)} & & \textbf{$\sim$122\,GB} & Of 128\,GB unified \\ \textbf{Peak (128K \specprefill{})} & & \textbf{$\sim$122\,GB} & Draft KV transient, not additive \\ \bottomrule \end{tabular} \caption{Memory budget for Qwen3.5-122B with 2B draft at 128K tokens on M2~Ultra 128\,GB. The draft KV cache is transient: allocated during scoring, freed via \texttt{mx.clear\_cache()} before target prefill begins. Peak \specprefill{} memory $\approx$ baseline peak because the draft and target KV caches are never resident simultaneously.} \label{tab:memory} \end{table} % ============================================================ \section{Discussion} \label{sec:discussion} % ============================================================ \subsection{The MoE Sweet Spot} \specprefill{} benefits MoE architectures disproportionately. Sparse prefill eliminates not just attention FLOPs but expert routing and weight loading for every dropped token -- at $k = 0.2$, 80\% fewer tokens pass through 128-expert routing (Qwen3.5-122B). Dense models see proportionally smaller gains: compute scales linearly with token count regardless, and the FLOP ratio between draft and target is less favorable. \subsection{When SpecPrefill Does Not Help} \paragraph{Dense models with large drafts.} When the FLOP ratio $r$ exceeds $\sim$0.15, scoring overhead consumes most of the potential savings. Our GPT-OSS result (20B draft, $r \approx 0.17$, speedup $1.28\times$) demonstrates this boundary. No smaller GPT-OSS model exists; the proprietary tokenizer (201K vocabulary) prevents cross-family draft substitution without a re-tokenization layer (Section~\ref{sec:future}). \paragraph{Short prompts.} Below $\sim$4K tokens, the fixed overhead of draft scoring and RoPE patching exceeds the savings from sparse prefill. Our default threshold of 8{,}192 tokens reflects this empirical boundary. \paragraph{Comparison with CritiPrefill.} CritiPrefill~\cite{critiprefill} achieves sparse prefill without a draft model by using the target's own attention scores from an initial partial prefill. On dense standard transformers where all layers are attention, CritiPrefill achieves 2.7--3.0$\times$ speedup (reported on Llama-3-8B and Yi-9B at 128K context). However, it only saves attention FLOPs -- on MoE architectures where attention constitutes 7--9\% of total computation, the attention-only savings would be proportionally limited; our analysis estimates $\sim$1.03--1.08$\times$. \specprefill{} saves \emph{all} FLOPs (attention + routing + expert computation) for dropped tokens, yielding $3.7$--$5.5\times$ on MoE targets vs.\ the estimated ${\sim}1.03$--$1.08\times$ for attention-only savings. \subsection{Limitations} \begin{itemize} \item \textbf{Draft model dependency.} Requires a small model with a compatible tokenizer. This limits applicability to model families with multiple size variants (Qwen3.5: 2B/4B/27B/35B/122B; GPT-OSS: only 20B/120B). \item \textbf{Nemotron-H SSM state.} SSM layers are updated only on retained tokens; skipped tokens do not contribute to state evolution. The resulting state trajectory diverges from full prefill. Empirically safe under our evaluation ($2.10$--$2.19\times$ with no observed quality degradation), but the magnitude of state drift is not quantified. \item \textbf{Task-dependent quality.} RULER evaluation (Section~\ref{sec:quality}) reveals catastrophic failure on exhaustive extraction (CWE: $-$85\%) and variable tracking (VT: $-$27\%), while QA tasks degrade moderately. Callers must evaluate whether their workload requires uniform context coverage and disable \specprefill{} accordingly. \item \textbf{Quality validation scale.} $N=50$ per RULER task and $N=20$ per LongBench task validate the quality boundary but are insufficient for tight confidence intervals. \item \textbf{Single hardware platform.} Results on M2~Ultra (128\,GB). Memory bandwidth and compute characteristics differ on M3/M4 variants, and the optimal keep percentage may shift. \item \textbf{Concurrent evaluation.} All TTFT measurements use a single-request baseline. Our production system uses continuous batching (BatchedEngine) with serialized prefill, where the cost model holds. Under concurrent batching with parallel prefill, contention may shift the effective speedup profile. \end{itemize} % ============================================================ \section{Related Work} \label{sec:related} % ============================================================ \paragraph{Sparse prefill.} \citet{specprefill} introduce attention-based sparse prefill using a draft model on discrete GPU architectures. CritiPrefill~\cite{critiprefill} achieves draft-free sparse prefill using the target model's own attention. We extend the draft-based approach to unified memory hardware and non-transformer architectures. To our knowledge, no prior work addresses sparse prefill specifically for unified memory systems. \paragraph{Speculative decoding.} \citet{leviathan2023fast} and \citet{chen2023accelerating} propose using a draft model to speculatively generate candidate tokens verified by the target model. Multi-Token Prediction (MTP)~\cite{gloeckle2024better} uses auxiliary prediction heads within the target model itself. Our ``Speculative Stack'' composes prefill-phase speculation (\specprefill{}) with decode-phase speculation (MTP), operating in non-overlapping phases. \paragraph{Efficient attention.} FlashAttention~\cite{dao2022flashattention,dao2023flashattention2} optimizes attention computation through tiling and memory-efficient kernels. \specprefill{} is orthogonal: it reduces the token count \emph{before} attention, and can compose with efficient attention implementations. \paragraph{Serving systems.} vLLM~\cite{kwon2023efficient} introduces PagedAttention for efficient KV cache management. Our system (vllm-mlx) adapts continuous-batching and paged-attention concepts for Apple Silicon. \specprefill{} integrates as a per-request prefill optimization within this serving framework. \paragraph{MLX framework.} MLX~\cite{mlx} provides the unified-memory ML runtime enabling our zero-copy scoring approach. Prior MLX-based serving work has focused on standard inference optimization; we show that unified memory enables system-level optimizations, specifically zero-copy draft scoring, that are impractical on discrete-GPU architectures. % ============================================================ \section{Future Work} \label{sec:future} % ============================================================ \paragraph{Universal draft models with tokenizer translation.} Our results are constrained to model families where a small same-tokenizer draft exists. A \emph{universal draft model} -- trained to score token importance across vocabularies via a learned re-tokenization layer -- would decouple \specprefill{} from the family constraint. The translation layer would map target token IDs to text, re-tokenize with the universal draft's vocabulary, score importance, and project scores back to target token space via character-offset alignment. This is hard (tokenization boundaries differ across BPE vocabularies) but would enable \specprefill{} for models like GPT-OSS where no small same-family draft exists. \paragraph{CritiPrefill for dense models.} For dense architectures where the FLOP ratio is unfavorable, CritiPrefill (draft-free) may be more practical. Published results show 2.7--3.0$\times$ on dense 8B--9B transformers at 128K context. On our MoE targets, where attention is a small fraction of compute, gains would be limited; the approach warrants investigation for dense models on unified memory where memory access patterns differ from GPU. \paragraph{SSM state drift analysis.} For hybrid models like Nemotron-H, quantifying the L2 distance between full-prefill and sparse-prefill SSM hidden states would characterize the information loss from skipping tokens in recurrent layers and establish quality guarantees beyond empirical testing. \paragraph{Scheduler-aware scoring.} Our current implementation runs draft scoring as a pre-scheduler step, serialized before the request enters the batch queue. Integrating scoring into the scheduler would enable overlapping one request's draft scoring with another's generation, improving throughput under concurrent load. \paragraph{Hardware generalization.} Apple's M3 and M4 generations have different memory bandwidth and compute characteristics. The optimal keep percentage and FLOP-ratio threshold may shift on these platforms. % ============================================================ \section{Conclusion} \label{sec:conclusion} % ============================================================ We have shown that unified memory eliminates the transfer overhead that complicates draft-based sparse prefill on discrete GPUs, reducing the cost equation to a single term: the draft-to-target FLOP ratio. This ratio predicts the ranking of all six configurations we tested, spanning MoE, Mamba-2 hybrid, and sliding-window dense architectures. The practical result: 128K-token prefill on a 122B MoE model drops from 19.3~minutes to 3.5~minutes with a 1.4\,GB draft model. The technique is most effective where total parameters far exceed active computation (the MoE sweet spot), making a small dense draft orders of magnitude cheaper than the target. Quality evaluation establishes a clear boundary. QA, summarization, and retrieval tasks are preserved at 20\% keep. Tasks requiring exhaustive context coverage (word frequency counting, variable tracking, repository-level code completion) fail: dropping 80\% of tokens removes exactly the information these tasks need. This is a structural property of attention-based token selection, not a tuning problem. The per-request API allows callers to match the optimization to their workload: sparse prefill for interactive QA and coding, full prefill for batch analysis and log scanning. As large models move to local hardware, whether long-context inference is usable depends on prefill latency. Zero-copy draft scoring on unified memory makes that latency tractable for the workloads where it applies. % ============================================================ % References % ============================================================ \bibliographystyle{plainnat} \begin{thebibliography}{99} \bibitem[Chen et~al.(2023)]{chen2023accelerating} Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. \newblock Accelerating large language model decoding with speculative sampling. \newblock \emph{arXiv preprint arXiv:2302.01318}, 2023. \bibitem[Dao(2023)]{dao2023flashattention2} Tri Dao. \newblock Flash{A}ttention-2: Faster attention with better parallelism and work partitioning. \newblock \emph{arXiv preprint arXiv:2307.08691}, 2023. \bibitem[Dao et~al.(2022)]{dao2022flashattention} Tri Dao, Daniel~Y. Fu, Stefano Ermon, Atri Rudra, and Christopher R\'{e}. \newblock Flash{A}ttention: Fast and memory-efficient exact attention with {IO}-awareness. \newblock In \emph{NeurIPS}, 2022. \bibitem[Gloeckle et~al.(2024)]{gloeckle2024better} Fabian Gloeckle, Badr Youbi~Idrissi, Baptiste Rozi\`{e}re, David Lopez-Paz, and Gabriel Synnaeve. \newblock Better \& faster large language models via multi-token prediction. \newblock \emph{arXiv preprint arXiv:2404.19737}, 2024. \bibitem[Kwon et~al.(2023)]{kwon2023efficient} Woosuk Kwon, Zhuohan Li, Sicheng Zhuang, Ying Sheng, Lianmin Zheng, Cody~Hao Yu, Joseph~E. Gonzalez, Hao Zhang, and Ion Stoica. \newblock Efficient memory management for large language model serving with {PagedAttention}. \newblock In \emph{SOSP}, 2023. \bibitem[Leviathan et~al.(2023)]{leviathan2023fast} Yaniv Leviathan, Matan Kalman, and Yossi Matias. \newblock Fast inference from transformers via speculative decoding. \newblock In \emph{ICML}, 2023. \bibitem[Apple(2023)]{mlx} Apple Machine Learning Research. \newblock {MLX}: An array framework for Apple Silicon. \newblock \url{https://github.com/ml-explore/mlx}, 2023. \bibitem[Peng et~al.(2024)]{yarn} Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. \newblock {YaRN}: Efficient context window extension of large language models. \newblock In \emph{ICLR}, 2024. \bibitem[Su et~al.(2024)]{rope} Jianlin Su, Murtadha Ahmed, Yu~Lu, Shengfeng Pan, Wen Liu, and Bo~Liu. \newblock {RoFormer}: Enhanced transformer with rotary position embedding. \newblock \emph{Neurocomputing}, 568:127063, 2024. \bibitem[Yao et~al.(2025)]{specprefill} Ziteng Yao, Wei Chen, Yushi Huang, and others. \newblock {SpecPrefill}: Speculative prefilling for faster long-context {LLM} inference. \newblock \emph{arXiv preprint arXiv:2502.02789}, 2025. \bibitem[Qwen(2025)]{qwen35} Qwen Team. \newblock {Qwen3.5}: A series of large language models. \newblock \url{https://huggingface.co/Qwen/Qwen3.5-122B-A10B}, 2025. \bibitem[Zhang et~al.(2025)]{critiprefill} Junlin Zhang, Jiahao Li, and others. \newblock {CritiPrefill}: A segment-level critique framework for efficient long-context {LLM} prefilling. \newblock \emph{arXiv preprint}, 2025. \bibitem[Bai et~al.(2024)]{longbench} Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lian, Jiankai Sun, and others. \newblock {LongBench}: A bilingual, multitask benchmark for long context understanding. \newblock \emph{ACL}, 2024. \bibitem[Hsieh et~al.(2024)]{ruler2024} Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. \newblock {RULER}: What's the real context size of your long-context language models? \newblock \emph{arXiv preprint arXiv:2404.06654}, 2024. \bibitem[Gao et~al.(2024)]{eval-harness} Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, and others. \newblock A framework for few-shot language model evaluation. \newblock \emph{Zenodo}, 2024. \url{https://doi.org/10.5281/zenodo.10256836}. \end{thebibliography} \end{document}