\documentclass[11pt,a4paper]{article}
% === Packages ===
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{amsmath,amssymb}
\usepackage{graphicx}
\usepackage{booktabs}
\usepackage{hyperref}
\usepackage{xcolor}
\usepackage{algorithm}
\usepackage{algpseudocode}
\usepackage{listings}
\usepackage[margin=1in]{geometry}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{natbib}
% === Macros ===
\newcommand{\tbd}[1]{\textcolor{red}{\textbf{[TBD: #1]}}}
\newcommand{\specprefill}{\textsc{SpecPrefill}}
\newcommand{\ttft}{\text{TTFT}}
\lstset{
  basicstyle=\ttfamily\small,
  keywordstyle=\color{blue},
  commentstyle=\color{gray},
  breaklines=true,
  frame=single,
  numbers=left,
  numberstyle=\tiny\color{gray},
}
% === Title ===
\title{\specprefill{} on Unified Memory:\\Cross-Architecture Sparse Prefill for\\Large Language Models on Apple Silicon}
\author{
  \texttt{github.com/Thump604}
}
\date{March 2026}
\begin{document}
\maketitle
% ============================================================
\begin{abstract}
% ============================================================
Long-context prefill is the dominant latency bottleneck for local LLM inference: a 64K-token prompt on Qwen3.5-122B (MoE, 10B active parameters) takes 7 minutes before the first token appears.
\specprefill{}---attention-based sparse prefill using a draft model---was designed for GPU clusters with discrete memory.
We port it to Apple Silicon's unified memory architecture and generalize it across three model families: transformer Mixture-of-Experts (Qwen3.5), Mamba-2/attention hybrid (Nemotron-H), and sliding-window dense (GPT-OSS\footnote{GPT-OSS refers to a publicly available model by its open-source project designation.}).
On M2~Ultra (128\,GB unified memory), \specprefill{} with a 2B draft model (Qwen3.5-2B, 1.4\,GB, 4-bit) reduces TTFT by $3.71$--$5.45\times$ across 8K--128K tokens on Qwen3.5-122B, cutting 128K prefill from 19.3~minutes to 3.5~minutes.
Composed with system prompt KV caching, end-to-end speedup reaches $5.6\times$ on a 73K-token production workload.
We also achieve $2.10$--$2.19\times$ on Nemotron-H~120B across 8K--64K tokens.
Unified memory eliminates PCIe transfer overhead, making the draft-to-target FLOP ratio the dominant predictor of speedup. We formalize and validate this relationship across six draft/target configurations.
Under adversarial evaluation (0/16 regressions at 20\% keep), LLM-judge, and perplexity analysis, we observe no quality degradation at our recommended operating point.
Our implementation handles architecture-specific challenges including gated queries with per-head normalization (Qwen3.5), SSM-interleaved attention layers without positional encoding (Nemotron-H), and sliding-window cache preservation (GPT-OSS), deployed in a production serving stack with per-request API control.
\end{abstract}
% ============================================================
\section{Introduction}
\label{sec:intro}
% ============================================================
\subsection{The TTFT Problem in Local Inference}
Time-to-first-token (TTFT) is the dominant user-facing latency for large language models serving long-context requests.
On commodity hardware---a Mac Studio with Apple M2~Ultra and 128\,GB unified memory---prefilling a 64K-token prompt through Qwen3.5-122B-A10B (a 122-billion parameter Mixture-of-Experts model with 10 billion active parameters per token) requires \textbf{418~seconds}, nearly 7~minutes before the first output token appears.
Even at 16K tokens, the wait is 92~seconds.
This latency is not a bandwidth problem.
MLX prefill on Apple Silicon is FLOP-limited: the Metal GPU is compute-bound while processing each token through the model's forward pass~\cite{mlx}.
Reducing the number of tokens processed during prefill therefore yields near-linear TTFT improvement.
In local inference, the cost of long TTFT is measured not in dollars-per-token but in user time.
A 16K-token context (typical for an IDE coding assistant with tool definitions and file contents) means 92 seconds of waiting before the first response token.
A long creative-writing session or research conversation that accumulates 64K tokens of history means 7 minutes per response.
On a serialized single-request engine, every second of prefill also delays all queued requests.
\noindent With \specprefill{}, 128K prefill on the 122B model drops from 19.3~minutes to 3.5~minutes, making long-context requests practical for interactive use.
\subsection{Why Unified Memory Changes the Calculus}
\specprefill{}~\cite{specprefill} addresses TTFT by using a small draft model to identify which prompt tokens are most important via attention scoring, then sparse-prefilling only the selected subset into the target model.
The original formulation assumes a discrete-memory GPU architecture where the draft model either (a) shares GPU VRAM with the target, reducing KV cache headroom and effective context length, or (b) runs on CPU or a separate GPU, incurring CPU$\leftrightarrow$GPU transfer latency for importance scores that grows with prompt length.
On Apple Silicon's unified memory architecture, neither penalty applies.
Draft and target models share the same physical address space---the draft's weights (${\sim}$1.4\,GB for a 4-bit 2B model) are simply additional allocations in the same memory pool as the target's ${\sim}$79\,GB.
Scoring requires zero data movement.
This simplifies the cost equation.
Let $C_t$ denote the target model's prefill cost (in FLOPs), $C_d$ the draft model's scoring cost (full prefill plus lookahead steps), and $k$ the fraction of tokens retained.
On unified memory:
\begin{equation}
\label{eq:speedup}
\text{Speedup} = \frac{C_t}{C_d + k \cdot C_t}
\end{equation}
When $C_d \ll C_t$---as with a 2B draft scoring for a 122B MoE target, where the FLOP ratio is approximately $50\times$---speedup approaches $1/k$.
At $k = 0.2$, this predicts up to $4.5\times$; we measure $4.11\times$ at 16K tokens (5 trials), with the gap attributable to overhead from chunk selection, RoPE patching, and memory management.
We term this the \textbf{ratio thesis}: on unified memory where $T = 0$, the draft-to-target FLOP ratio $r$ is the dominant predictor of \specprefill{} benefit, modulated by architecture-dependent overhead $\epsilon$ (Equation~\ref{eq:speedup_ratio}).
Section~\ref{sec:ratio} validates this across six draft/target configurations spanning an $8.5\times$ range of FLOP ratios.
\subsection{Contributions}
\begin{enumerate}
\item \textbf{First implementation on unified memory hardware.}
We implement \specprefill{} on Apple Silicon via the MLX framework, demonstrating that zero-copy scoring shifts the viability threshold, making the technique effective even at moderate prompt lengths (8K tokens).
\item \textbf{Cross-architecture generalization.}
We extend \specprefill{} beyond standard transformers to Mamba-2/attention hybrids (Nemotron-H, where only 8 of 88 target layers have attention) and sliding-window models (GPT-OSS, with RotatingKVCache).
Auto-detecting query extraction handles gated attention with per-head normalization (Qwen3.5), content-based attention without positional encoding (Nemotron-H), and YarnRoPE with sliding-window cache preservation (GPT-OSS).
\item \textbf{Production system integration.}
We show per-request API control with graceful fallback, coexistence with Multi-Token Prediction (MTP) speculative decoding in a three-phase ``Speculative Stack,'' and demonstrate that the FLOP ratio thesis extends to draft model selection---smaller drafts with lower $r$ yield higher speedup.
\end{enumerate}
This is a \textbf{systems paper with algorithmic adaptations}, not a claim of a new algorithm.
The core \specprefill{} idea is due to~\citet{specprefill}; our contributions are making it work across architectures on new hardware with real deployment.
% ============================================================
\section{Background}
\label{sec:background}
% ============================================================
\subsection{SpecPrefill}
\citet{specprefill} observe that during autoregressive generation, the model attends heavily to a small fraction of prompt tokens.
If these important tokens can be identified cheaply, the full prompt need not be prefilled into the target model.
Their method uses a small draft model to score token importance via attention weights, selects the top-$k$\% of tokens (in non-overlapping chunks for spatial locality), and sparse-prefills only the selected subset.
Sparse prefill is possible because Rotary Position Embeddings (RoPE)~\cite{rope} encode \emph{relative} position.
The inner product $Q_m \cdot K_p^T$ depends only on the relative distance $(m - p)$ encoded via rotation angles.
If selected tokens are stored in the KV cache with their \emph{original} position angles, they interact with future decode queries at correct relative distances regardless of gaps.
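In the standard RoPE formulation, writing $R_{\Theta,m}$ for the block-diagonal rotation applied at position $m$, the rotations satisfy $R_{\Theta,m}^{\top} R_{\Theta,p} = R_{\Theta,\,p-m}$, so
\begin{equation*}
\big(R_{\Theta,m}\,q\big)^{\!\top}\big(R_{\Theta,p}\,k\big) \;=\; q^{\top} R_{\Theta,\,p-m}\,k .
\end{equation*}
The score depends only on $p - m$: a key cached at its original position $p$ combines correctly with a decode-time query at any later position $m$, even if the intervening positions were never prefilled.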
\subsection{MLX and Apple Silicon Unified Memory}
MLX~\cite{mlx} is Apple's machine learning framework designed for Apple Silicon.
Apple Silicon uses \emph{unified memory}: CPU and GPU share the same physical DRAM through a common memory controller.
There is no PCIe bus, no \texttt{cudaMemcpy}, no distinct VRAM allocation.
Tensors created by any compute unit are immediately accessible to any other.
MLX exploits this with lazy evaluation and reference-counted memory management.
Metal compute shaders execute matrix operations on the GPU.
In practice, a draft model's weights are additional allocations in the same memory pool, accessible at full bandwidth from any compute unit without copying.
\subsection{Target Architectures}
We evaluate \specprefill{} on three architecturally distinct model families, establishing that generalization requires non-trivial adaptations:
\begin{table}[h]
\centering
\small
\begin{tabular}{@{}lllll@{}}
\toprule
\textbf{Model} & \textbf{Architecture} & \textbf{Attention} & \textbf{Position Enc.} & \textbf{Cache} \\
\midrule
Qwen3.5-122B & MoE (10B active) & Gated + q\_norm & Standard RoPE & Standard KV \\
Nemotron-H 120B & Mamba-2 + Attn + MoE & Standard (8/88 layers) & None (SSM) & Compacted \\
GPT-OSS 120B & Dense + sliding window & Standard & YarnRoPE & RotatingKV \\
\bottomrule
\end{tabular}
\caption{Target model architectures. Each requires different query extraction, position handling, and cache management in \specprefill{}.}
\label{tab:architectures}
\end{table}
\textbf{Qwen3.5} uses gated attention where \texttt{q\_proj} outputs $2\times$ the expected width (query concatenated with a gate), requiring a split before per-head RMSNorm (\texttt{q\_norm}) and RoPE application.
\textbf{Nemotron-H} is a hybrid architecture with 40 Mamba-2 SSM layers, 8 full-attention layers, and 40 MoE feed-forward layers.
Positional information is encoded entirely in the SSM state---the attention layers have no RoPE.
Only the attention layers produce Q/K scores usable for importance scoring.
\textbf{GPT-OSS} uses YarnRoPE~\cite{yarn} with a sliding-window attention pattern where alternating layers use \texttt{RotatingKVCache} retaining only the last 128 tokens.
% ============================================================
\section{Method}
\label{sec:method}
% ============================================================
\subsection{Token Importance Scoring}
\label{sec:scoring}
Given a prompt of $M$ tokens, importance scoring proceeds in three phases:
\begin{enumerate}
\item \textbf{Draft prefill.} The full prompt is prefilled into a small same-tokenizer draft model (e.g., Qwen3.5-2B at 4-bit quantization, 1.4\,GB) in chunks of 2{,}048 tokens, populating the draft's KV cache. The FLOP ratio thesis (Section~\ref{sec:cost_model}) predicts that minimizing the draft's active parameter count maximizes speedup, favoring the smallest available compatible model.
\item \textbf{Lookahead decode with attention capture.} Eight autoregressive decode steps are executed with \texttt{AttentionCapture} wrappers installed on each attention layer. These wrappers intercept post-RoPE query vectors via architecture-specific extractors (Section~\ref{sec:extractors}), appending them to a capture buffer before delegating to the original attention computation.
\item \textbf{Importance computation.} For each attention layer $\ell$ with captured queries $\{q^{(t)}\}_{t=1}^{8}$ and prompt keys $K_\ell \in \mathbb{R}^{h \times M \times d}$:
\begin{align}
S_\ell^{(t)} &= \text{softmax}\!\left(\frac{q^{(t)} K_\ell^T}{\sqrt{d}}\right) \in \mathbb{R}^{h \times M} \\
\bar{S}_\ell^{(t)} &= \text{AvgPool1D}(S_\ell^{(t)},\; \text{kernel}=13)
\end{align}
Scores are aggregated as $\max$ across layers and heads, then $\text{mean}$ across lookahead steps, yielding importance $I \in \mathbb{R}^M$.
Average pooling with kernel 13 smooths the signal, preventing isolated-token artifacts.
For GQA models, keys are expanded via \texttt{repeat} to match the query head count before scoring.
\end{enumerate}
Layers whose cache does not span the full prompt (e.g., sliding-window \texttt{RotatingKVCache} layers caching only 128 tokens) are skipped during importance computation.
After scoring, the draft KV cache is explicitly freed and \texttt{mx.clear\_cache()} is called, reclaiming memory before target prefill begins.
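For concreteness, a compact MLX-style sketch of the importance computation (phase 3) follows. Array layouts and names such as \texttt{captured\_queries} and \texttt{prompt\_keys} are illustrative rather than the exact identifiers in our implementation, and the pooling loop is a naive stand-in for AvgPool1D.
\begin{lstlisting}[language=Python]
import mlx.core as mx

def token_importance(captured_queries, prompt_keys, kernel=13):
    """Aggregate importance over layers, heads, and lookahead steps (sketch).

    captured_queries: per-layer arrays of shape [steps, heads, d] (post-RoPE)
    prompt_keys:      per-layer arrays of shape [kv_heads, M, d]
    Returns an importance vector of shape [M].
    """
    per_layer = []
    for q, k in zip(captured_queries, prompt_keys):
        steps, heads, d = q.shape
        if k.shape[0] != heads:                  # GQA: expand keys to query heads
            k = mx.repeat(k, heads // k.shape[0], axis=0)
        logits = mx.einsum("shd,hmd->shm", q, k) / (d ** 0.5)
        scores = mx.softmax(logits, axis=-1)     # [steps, heads, M]
        pad = kernel // 2                        # naive AvgPool1D, kernel 13
        padded = mx.pad(scores, [(0, 0), (0, 0), (pad, pad)])
        M = scores.shape[-1]
        scores = mx.stack([padded[..., i:i + M] for i in range(kernel)]).mean(axis=0)
        per_layer.append(scores)
    stacked = mx.stack(per_layer)                # [layers, steps, heads, M]
    return stacked.max(axis=[0, 2]).mean(axis=0) # max over layers/heads, mean over steps
\end{lstlisting}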
\subsubsection{Architecture-Specific Query Extraction}
\label{sec:extractors}
The query extractor is auto-detected at runtime based on the attention module's attributes:
\begin{itemize}
\item \texttt{q\_norm} present $\rightarrow$ Qwen3.5 path
\item No \texttt{rope} attribute $\rightarrow$ Nemotron-H path
\item Otherwise $\rightarrow$ Standard (Llama/GPT-OSS) path
\end{itemize}
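A sketch of this dispatch, assuming the draft's attention modules expose the attribute names above (the extractor function names themselves are illustrative):
\begin{lstlisting}[language=Python]
def select_query_extractor(attn):
    """Choose a post-RoPE query extractor from the module's attributes (sketch)."""
    if hasattr(attn, "q_norm"):          # gated queries with per-head RMSNorm
        return extract_qwen35_gated_query
    if not hasattr(attn, "rope"):        # no positional encoding (SSM hybrid)
        return extract_nemotron_h_query
    return extract_standard_query        # Llama / GPT-OSS style
\end{lstlisting}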
\paragraph{Qwen3.5: Gated queries with per-head normalization.}
The \texttt{q\_proj} output has $2\times$ the expected width, containing both query and gate tensors concatenated along the head dimension.
The output is reshaped to $(B, L, n_\text{heads}, 2 \cdot d_\text{head})$ and split at the midpoint.
After splitting, \texttt{q\_norm} (a per-head RMSNorm) is applied before RoPE rotation.
Treating this as a standard projection produces silent shape errors or incorrect scoring.
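A minimal sketch of the Qwen3.5 extraction path under these assumptions; the attribute names \texttt{q\_proj}, \texttt{q\_norm}, \texttt{rope}, \texttt{n\_heads}, and \texttt{head\_dim} follow the mlx-lm convention and may differ in a given checkpoint's module definitions:
\begin{lstlisting}[language=Python]
import mlx.core as mx

def extract_qwen35_gated_query(attn, hidden_states, offset):
    """Post-RoPE queries for Qwen3.5-style gated attention (sketch)."""
    B, L, _ = hidden_states.shape
    qg = attn.q_proj(hidden_states)                      # width = 2 * n_heads * head_dim
    qg = qg.reshape(B, L, attn.n_heads, 2 * attn.head_dim)
    q, _gate = mx.split(qg, 2, axis=-1)                  # discard the gate for scoring
    q = attn.q_norm(q)                                   # per-head RMSNorm before RoPE
    q = q.transpose(0, 2, 1, 3)                          # [B, heads, L, head_dim]
    return attn.rope(q, offset=offset)
\end{lstlisting}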
\paragraph{Nemotron-H: Heterogeneous layer navigation.}
Of 88 total layers (40 Mamba-2 + 8 attention + 40 MoE), only the 8 attention layers produce Q/K scores.
\texttt{\_find\_attention\_layers} navigates the heterogeneous layer structure by inspecting \texttt{block\_type} annotations (\texttt{M} for Mamba, \texttt{*} for attention, \texttt{-} for MLP, \texttt{E} for MoE) and locating modules with a \texttt{mixer} attribute rather than the standard \texttt{self\_attn}.
\texttt{\_build\_layer\_to\_cache\_map} constructs a compacted index because only Mamba and attention layers have cache entries.
These attention layers have \textbf{no RoPE}---positional information comes entirely from the Mamba-2 SSM state.
Queries are used as-is for content-based scoring.
This is a non-trivial engineering challenge the original paper did not address: the draft model (Nano~4B) has 42 heterogeneous layers with only 4 attention layers among 21 Mamba and 17 MLP layers, all in a model where positional information is entirely implicit.
\paragraph{GPT-OSS: RotatingKVCache awareness.}
Standard query extraction applies, but importance computation must skip sliding-window layers whose \texttt{RotatingKVCache} contains only the last 128 tokens.
Without this check, importance scores would be computed against a truncated key set, producing misleading rankings.
Correct handling requires three cache-aware adaptations:
(1)~layer-level cache introspection to distinguish full-context from sliding-window layers;
(2)~skipping importance computation for layers whose cache does not span the full prompt;
(3)~force-preserving the last \texttt{max\_size} positions during sparse selection to ensure sliding-window layers have valid recent context at decode time.
\subsection{Chunk Selection}
Tokens are grouped into non-overlapping chunks of $C = 32$ tokens.
Each chunk is scored by the mean importance of its constituent tokens.
The top $\lceil k \cdot M/C \rceil$ chunks by score are selected and their token indices returned in sorted order.
This preserves spatial locality---coherent phrases are kept or dropped as units.
At $k = 0.2$ (our optimal configuration for Qwen3.5-122B), 80\% of prefill computation is eliminated.
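A minimal, pure-Python sketch of the selection step (chunk size and keep fraction as above; the helper name is illustrative):
\begin{lstlisting}[language=Python]
import math

def select_token_indices(importance, keep_pct=0.2, chunk_size=32):
    """Score fixed-size chunks by mean importance and keep the top fraction (sketch)."""
    M = len(importance)
    n_chunks = math.ceil(M / chunk_size)
    scored = []
    for c in range(n_chunks):
        lo, hi = c * chunk_size, min((c + 1) * chunk_size, M)
        scored.append((sum(importance[lo:hi]) / (hi - lo), c))
    n_keep = math.ceil(keep_pct * n_chunks)
    kept = sorted(scored, reverse=True)[:n_keep]   # top chunks by mean score
    indices = []
    for _, c in kept:
        indices.extend(range(c * chunk_size, min((c + 1) * chunk_size, M)))
    return sorted(indices)                         # original token order preserved
\end{lstlisting}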
\subsection{Sparse Prefill with Position-Mapped RoPE}
\label{sec:sparse_prefill}
The correctness of sparse prefill depends on maintaining correct RoPE angles despite non-contiguous token positions.
If position angles are not re-mapped, the model perceives selected tokens as adjacent, destroying long-range coherence.
\paragraph{Step 1: Sliding-window tail preservation.}
For architectures using \texttt{RotatingKVCache} (GPT-OSS), the last \texttt{max\_size} positions from the prompt are force-merged into the selection set, ensuring sliding-window attention layers have valid recent context at decode time.
This is auto-detected via cache type inspection.
\paragraph{Step 2: Position mapping during prefill.}
Each attention layer's \texttt{nn.RoPE} is replaced with \texttt{PositionMappedRoPE}, which maps contiguous batch positions $[0, 1, \ldots, N{-}1]$ to the original absolute positions $[p_0, p_1, \ldots, p_{N-1}]$ of the selected tokens.
For models with custom RoPE variants (YarnRoPE with pre-computed frequencies, SuScaled RoPE with \texttt{mscale}), the replacement module captures and replays the original frequency tensors and scale factors.
\paragraph{Step 3: Chunked forward pass.}
Selected tokens are fed through the target model in chunks of \texttt{step\_size} (default 2{,}048), populating the KV cache with entries at correct absolute positions.
\paragraph{Step 4: Decode offset adjustment.}
After sparse prefill of $N$ selected tokens from $M$ total prompt tokens, the cache offset is $N$ but decode must start at position $M$.
\texttt{OffsetAdjustedRoPE} wraps the original RoPE module and adds adjustment $\Delta = M - N$ to all offset calls:
\begin{equation}
\text{RoPE\_position}(i) = N + i + (M - N) = M + i \quad \checkmark
\end{equation}
\paragraph{Step 5: Cleanup.}
After generation completes, \texttt{cleanup\_rope()} traverses all attention layers and unwraps patched RoPE modules back to their originals, ensuring the model is unmodified for subsequent requests.
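A sketch of the two wrappers, assuming the mlx-lm convention that RoPE modules are invoked as \texttt{rope(x, offset=...)} on arrays shaped $[B, \text{heads}, L, d]$; the class names match the description above, but the bodies are illustrative (the per-token loop in particular is written for clarity, not speed):
\begin{lstlisting}[language=Python]
import mlx.core as mx
import mlx.nn as nn

class PositionMappedRoPE(nn.Module):
    """Rotate each selected token at its original absolute position (sketch)."""
    def __init__(self, rope, positions):
        super().__init__()
        self.rope = rope              # original module, keeps Yarn/SuScaled parameters
        self.positions = positions    # absolute positions p_0 .. p_{N-1}, sorted

    def __call__(self, x, offset=0):
        L = x.shape[-2]
        chunk_positions = self.positions[offset:offset + L]
        parts = [self.rope(x[..., i:i + 1, :], offset=int(p))
                 for i, p in enumerate(chunk_positions)]
        return mx.concatenate(parts, axis=-2)

class OffsetAdjustedRoPE(nn.Module):
    """Shift decode-time offsets by M - N so generation starts at position M (sketch)."""
    def __init__(self, rope, adjustment):
        super().__init__()
        self.rope = rope
        self.adjustment = adjustment  # M - N

    def __call__(self, x, offset=0):
        return self.rope(x, offset=offset + self.adjustment)
\end{lstlisting}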
\paragraph{Nemotron-H (no RoPE).}
Steps 2 and 4 are skipped entirely---Nemotron-H's attention layers have no RoPE, deriving positional information from the Mamba-2 SSM state instead.
The SSM layers are updated \emph{only} on retained tokens; skipped tokens do not contribute to state evolution.
Concretely, the Mamba-2 recurrence $h_t = A h_{t-1} + B x_t$ advances only at selected positions, so the hidden state after processing $N$ selected tokens diverges from the state after processing all $M$ tokens.
In practice, the retained tokens---selected by attention importance---appear sufficient to preserve global semantics for long-context tasks: our server-side benchmarks show $2.10$--$2.19\times$ speedup across 8K--64K tokens with no observed quality degradation (Section~\ref{sec:quality}).
Quantifying the SSM state drift (e.g., the L2 distance between full and sparse hidden states) is left to future work.
We attribute the absence of degradation to the hybrid architecture: the 8 full-attention layers retain the most important $N$ tokens with correct content-based scores, providing long-range context that compensates for gaps in the SSM's recurrent state.
This is an empirical result, not a theoretical guarantee; extending \specprefill{} to pure SSM architectures would require additional analysis.
\subsection{Unified Memory Cost Model}
\label{sec:cost_model}
We formalize the relationship between hardware architecture and \specprefill{} efficiency.
On discrete-GPU systems, the cost of \specprefill{} includes a data transfer term $T$ (PCIe bandwidth, memory copies between draft and target devices):
\begin{equation}
\label{eq:speedup_gpu}
\text{Speedup}_{\text{GPU}} = \frac{C_t}{C_d + T + k \cdot C_t}
\end{equation}
On unified memory, $T = 0$, simplifying to Equation~\ref{eq:speedup}.
The speedup is determined by the empirical wall-clock cost ratio $r = C_d / C_t$ and keep percentage $k$:
\begin{equation}
\label{eq:speedup_ratio}
\text{Speedup} = \frac{1}{r + k + \epsilon}
\end{equation}
where $\epsilon$ captures fixed overhead from chunk selection, RoPE patching, memory management (\texttt{mx.clear\_cache()}), and architecture-specific scoring costs.
In practice, $\epsilon$ ranges from 0.03 (Qwen3.5, low overhead) to 0.30 (Nemotron-H, where navigating 88 heterogeneous layers adds cost).
We emphasize that this model is \emph{descriptive}: it correctly predicts the \emph{ranking} of configurations by speedup (Table~\ref{tab:cost_model}) but does not predict exact magnitudes, as $\epsilon$ varies by architecture.
The value of the model is in draft selection---given two candidate drafts, the one with lower $r$ will yield higher speedup---not in predicting absolute speedup from first principles.
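A small sketch of how the model guides draft selection; the $r$ and $\epsilon$ values below are the empirical estimates from this section, not constants of the method:
\begin{lstlisting}[language=Python]
def predicted_speedup(r, k=0.2, eps=0.0):
    """Speedup ~ 1 / (r + k + eps) on unified memory (T = 0); cf. eq:speedup_ratio."""
    return 1.0 / (r + k + eps)

# Choosing between two candidate drafts for Qwen3.5-122B (measured r, eps ~ 0.03):
candidates = {"Qwen3.5-2B": 0.02, "Qwen3.5-4B": 0.03}
best = min(candidates, key=candidates.get)   # lower r -> higher predicted speedup
print(best, round(predicted_speedup(candidates[best], eps=0.03), 2))
# -> Qwen3.5-2B 4.0
\end{lstlisting}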
\textbf{Boundary conditions.}
Equation~\ref{eq:speedup_ratio} assumes: (1)~sequential, single-request prefill (no batching); (2)~a FLOP-dominated regime where compute, not memory bandwidth, is the bottleneck; and (3)~negligible KV cache fragmentation cost.
In bandwidth-bound regimes or under heavy concurrent batching, $\epsilon$ may dominate and reduce the model's predictive accuracy.
Note also that $r$ is not simply the ratio of active parameters.
On unified memory, all expert weights reside in the same address space---there is no discrete VRAM to overflow into.
MoE forward passes incur costs beyond active-parameter FLOPs: router gating across all experts, weight loading for selected experts from unified memory, and memory bandwidth pressure from the full parameter footprint.
As a result, the empirical wall-clock cost of a MoE forward pass on unified memory scales closer to \emph{total} parameters than to active parameters alone.
A small 2B draft scoring against a 122B MoE target (10B active but 122B total) therefore achieves $r \approx 0.02$, reflecting the full-parameter cost disparity rather than the ${\sim}5\times$ active-parameter ratio.
This is a finding specific to unified memory systems: on discrete GPUs where expert weights page between CPU and GPU memory, the effective cost ratio may differ.
For MoE models where total parameters far exceed active parameters, a small dense draft has $r \ll 1$.
The model also assumes $C_t$ scales linearly with token count; at extreme context lengths (${\geq}$128K), the $O(n^2)$ attention component causes superlinear growth in $C_t$, and measured speedup can exceed the linear prediction (Table~\ref{tab:cost_model}).
Table~\ref{tab:cost_model} compares predicted ($\epsilon = 0$) and measured speedups, with the gap $G = \text{Predicted} / \text{Measured}$ quantifying per-configuration overhead:
\begin{table}[h]
\centering
\small
\begin{tabular}{@{}lcccc@{}}
\toprule
\textbf{Configuration} & $\boldsymbol{r = C_d/C_t}$ & \textbf{Predicted} ($k{=}0.2, \epsilon{=}0$) & \textbf{Measured} & $\boldsymbol{G}$ \\
\midrule
4B / 122B MoE (10B active) & $\sim$0.03 & 4.3$\times$ & 2.90$\times$ & 1.5$\times$ \\
2B / 122B MoE (10B active) & $\sim$0.02 & 4.5$\times$ & 4.11$\times$ & 1.1$\times$ \\
2B / 122B MoE (10B active)$^\ddagger$ & $\sim$0.02 & 4.5$\times$ & 5.45$\times$ & 0.8$\times$ \\
Qwen-4B / 35B MoE (3B active) & $\sim$0.10 & 3.3$\times$ & 1.86$\times$ & 1.8$\times$ \\
4B / 120B Nemotron-H (12B active) & $\sim$0.03 & 4.3$\times$ & 2.17$\times$ & 2.0$\times$ \\
20B / 120B GPT-OSS (120B active) & $\sim$0.17 & 2.7$\times$ & 1.28$\times$ & 2.1$\times$ \\
\bottomrule
\multicolumn{5}{l}{\footnotesize $^\ddagger$Measured at 128K tokens. Measured speedup exceeds prediction due to $O(n^2)$ attention.}
\end{tabular}
\caption{Cost model predictions ($\epsilon = 0$) vs.\ measured speedups at $k = 0.2$, 16K tokens unless noted. $G = \text{Predicted} / \text{Measured}$; values $> 1$ indicate overhead exceeding the model, $< 1$ indicates superlinear baseline growth benefiting \specprefill{} beyond the linear prediction.}
\label{tab:cost_model}
\end{table}
The $G$ values reveal architecture-dependent overhead.
Nemotron-H ($G = 2.0$) has the highest $\epsilon$: the target has 88 heterogeneous layers (40~Mamba + 8~attention + 40~MoE) and the draft has 42 (21~Mamba + 4~attention + 17~MLP), requiring architecture-aware layer navigation for both scoring and sparse prefill.
GPT-OSS ($G = 2.1$) combines high $r$ (0.17, the 20B draft dominates the denominator) with sliding-window cache management overhead.
The Qwen3.5-122B configurations ($G = 1.1$--$1.5$, 5 trials) have the lowest overhead, benefiting from uniform architecture and favorable MoE FLOP ratios; the 35B target ($G = 1.8$) sits higher due to its smaller active-parameter count (3B) narrowing the draft-to-target gap.
At 128K, $G < 1$ because the superlinear baseline growth (Section~\ref{sec:experiments}) is not captured by the linear cost model.
\paragraph{Comparison with GPU results.}
The original \specprefill{} on discrete GPUs~\cite{specprefill} reports up to $7.66\times$ TTFT reduction on Llama-3.1-405B-FP8, benefiting from batch-level parallelism and GPU-optimized attention kernels not available on Apple Silicon.
Our lower absolute speedups ($3.71$--$5.45\times$ on 122B) reflect the single-request, unbatched setting and MLX's Metal compute pipeline.
The unified memory contribution is not higher \emph{absolute} speedup but a \emph{simpler cost model}: eliminating $T$ makes the FLOP ratio the sole dominant predictor, enabling principled draft model selection without profiling transfer overhead.
% ============================================================
\section{System Integration}
\label{sec:system}
% ============================================================
\subsection{Composition with System Prompt KV Caching}
\label{sec:composition}
In production agentic workflows, the system prompt (tool definitions, instructions, context documents) often constitutes 10--20K tokens that remain identical across requests.
System prompt KV caching snapshots this prefix on the first request and restores it for subsequent requests, eliminating redundant prefill.
These optimizations operate on orthogonal axes: KV caching eliminates \emph{prefix} cost (identical system prompt); \specprefill{} reduces \emph{suffix} cost (variable user context).
This is why they compose multiplicatively rather than additively.
When both techniques are active, \specprefill{} operates on the \emph{suffix} only (user and assistant turns), receiving a \texttt{position\_offset} equal to the system token count.
The scoring phase evaluates only suffix tokens; sparse prefill maps positions relative to the offset so selected tokens land at correct absolute positions in the full context.
The threshold check uses suffix length, not full prompt length, ensuring \specprefill{} activates only when the suffix itself is long enough to benefit.
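A sketch of the composition logic under these assumptions; the dictionary keys mirror the description above, while the actual control flow in the serving engine is more involved:
\begin{lstlisting}[language=Python]
def plan_prefill(prompt_len, system_len, threshold=8192, keep_pct=0.2):
    """Compose a restored system-prompt KV cache with SpecPrefill (sketch).

    The cached system prefix is skipped entirely; scoring and sparse prefill see
    only the suffix, with positions shifted by the system token count.
    """
    suffix_len = prompt_len - system_len
    if suffix_len < threshold:                 # threshold applies to the suffix only
        return {"specprefill": False, "position_offset": system_len}
    return {
        "specprefill": True,
        "position_offset": system_len,         # selected tokens keep absolute positions
        "tokens_to_prefill": int(keep_pct * suffix_len),
    }
\end{lstlisting}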
\begin{table}[h]
\centering
\small
\begin{tabular}{@{}lcc@{}}
\toprule
\textbf{Configuration} & \textbf{TTFT (s)} & \textbf{Speedup} \\
\midrule
Baseline (cold, full prefill) & 517.5 & 1.0$\times$ \\
System KV cache only & 417.1 & 1.24$\times$ \\
\textbf{Combined (SysKV + SP 20\%)} & \textbf{92.5} & \textbf{5.59$\times$} \\
Combined (repeat) & 92.4 & 5.60$\times$ \\
\bottomrule
\end{tabular}
\caption{Composition of system prompt KV caching and \specprefill{} on Qwen3.5-122B, 2B draft, M2~Ultra 128\,GB. The prompt (73K tokens) is a realistic agentic workload: ${\sim}$10K system prefix (tool definitions, instructions) + ${\sim}$63K user context. System KV cache saves the prefix; \specprefill{} sparse-prefills the suffix. The combined $5.6\times$ speedup exceeds either technique alone.}
\label{tab:composition}
\end{table}
\subsection{The Speculative Stack}
\specprefill{} is not a standalone optimization but one phase of a three-phase speculative pipeline:
\begin{enumerate}
\item \textbf{Score} (\specprefill{}): Draft model (2B) identifies important tokens via attention scoring.
\item \textbf{Sparse Prefill}: Target model (122B) processes selected token chunks with position-mapped RoPE.
\item \textbf{MTP Decode}: Target model with Multi-Token Prediction heads generates output tokens speculatively (Qwen3.5 only; Nemotron-H and GPT-OSS skip this phase).
\end{enumerate}
The draft model used in Phase~1 is architecturally separate from MTP's prediction heads (which are part of the target model's weights).
The draft KV cache is freed after Phase~1 (\texttt{mx.clear\_cache()}), and the draft model's static weights (1.4--3.0\,GB depending on draft size) remain resident for amortized startup cost across requests.
At extreme context lengths (128K+), the draft's transient KV cache size becomes a practical constraint: the 2B draft produces 1.5\,GB at 128K tokens, fitting comfortably alongside the target model's ${\sim}$100\,GB on a 128\,GB system.
\subsection{Per-Request API and Graceful Fallback}
The OpenAI-compatible API accepts per-request overrides:
\begin{lstlisting}[language=Python]
extra_body = {
    "specprefill": True,          # force enable (bypass threshold)
    "specprefill_keep_pct": 0.15  # override server default
}
\end{lstlisting}
The default threshold (8{,}192 tokens) is enforced only in server-default mode; explicit \texttt{specprefill: true} bypasses it.
Any error during scoring or sparse prefill triggers graceful fallback to full prefill---no request fails due to \specprefill{}.
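For example, a client request through the standard \texttt{openai} Python package might look like the following; the base URL, model name, and context file are deployment-specific placeholders:
\begin{lstlisting}[language=Python]
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
long_prompt = open("context.txt").read()        # e.g. a 16K+ token context

stream = client.chat.completions.create(
    model="qwen3.5-122b",                       # placeholder model id
    messages=[{"role": "user", "content": long_prompt}],
    extra_body={"specprefill": True, "specprefill_keep_pct": 0.2},
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
\end{lstlisting}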
% ============================================================
\section{Experiments}
\label{sec:experiments}
% ============================================================
\subsection{Setup}
\paragraph{Hardware.} Apple M2~Ultra, 128\,GB unified memory, Mac Studio, macOS~26.3.1.
\paragraph{Software.} MLX~0.31.1, vllm-mlx~0.2.6 with patches, Python~3.12.
\paragraph{Sampling.} Qwen3.5 models: $\text{temp}=0.6$, $\text{top\_p}=0.95$, $\text{top\_k}=20$ (official thinking+coding profile~\cite{qwen35}).
Nemotron-H: $\text{temp}=1.0$, $\text{top\_p}=0.95$ (the settings the NVIDIA model card states the model was ``trained and evaluated with'').
GPT-OSS: $\text{temp}=0.6$, $\text{top\_p}=0.95$.
All models run with thinking mode enabled (\texttt{enable\_thinking=true}).
Sampling parameters do not affect TTFT measurement (prefill is deterministic; sampling occurs only during generation).
\paragraph{Methodology.} Five trials per configuration for Qwen3.5-122B (plus one warmup); two trials for remaining configurations. Server-side TTFT measured via streaming OpenAI-compatible API.
Server restarted between configuration changes.
\paragraph{Reproducibility.} Our \specprefill{} implementation is available as patches against vllm-mlx~0.2.6 (PRs~\#175, \#180) and mlx-lm~0.31.2 (PR~\#990), with benchmark scripts (\texttt{bench-specprefill}, \texttt{bench-specprefill-adversarial}, \texttt{bench-specprefill-perplexity}) included in the repository.
The Qwen3.5 models are publicly available quantizations; Nemotron-H and GPT-OSS weights are available from their respective model hubs.
Experiments require an Apple Silicon system with $\geq$128\,GB unified memory for the 122B configuration; the 35B configuration runs on 64\,GB systems.
\begin{table}[h]
\centering
\small
\begin{tabular}{@{}llllllr@{}}
\toprule
\textbf{Model} & \textbf{Role} & \textbf{Architecture} & \textbf{Params} & \textbf{Active} & \textbf{Quant} & \textbf{RAM} \\
\midrule
Qwen3.5-122B-VLM-MTP & Target & MoE & 122B & 10B & 5-bit & 79\,GB \\
Qwen3.5-35B-VLM-MTP & Target & MoE & 35B & 3B & 8-bit & 38\,GB \\
Nemotron-H 120B & Target & Mamba-2 + Attn + MoE & 120B & 12B & 5-bit & 83\,GB \\
GPT-OSS 120B & Target & Dense + sliding window & 120B & 120B & 5-bit & 58\,GB \\
\midrule
Qwen3.5-4B-VLM-MTP & Draft & Dense hybrid & 4B & 4B & 4-bit & 3.0\,GB \\
Qwen3.5-2B-OptiQ & Draft & Hybrid + MoE & 2B & $<$2B & 4-bit & 1.4\,GB \\
Nemotron-H Nano 4B & Draft & Mamba-2 + Attn hybrid & 4B & 4B & 4-bit & 2.1\,GB \\
GPT-OSS-20B & Draft & MoE & 20B & 3.6B & 4-bit & 10\,GB \\
\bottomrule
\end{tabular}
\caption{Models evaluated. \specprefill{} requires draft and target to share the same tokenizer. Qwen3.5 drafts (2B, 4B) serve all Qwen3.5 targets (248K vocabulary); the 2B is the primary draft model. Nemotron-H Nano~4B serves Nemotron-H~120B (131K vocabulary). GPT-OSS-20B is the smallest available same-family draft for GPT-OSS-120B (201K vocabulary).}
\label{tab:models}
\end{table}
\subsection{TTFT Benchmarks}
\begin{table}[h]
\centering
\small
\begin{tabular}{@{}llccccc@{}}
\toprule
\textbf{Model} & \textbf{Draft} & \textbf{8K} & \textbf{16K} & \textbf{32K} & \textbf{64K} & \textbf{128K} \\
\midrule
Qwen3.5-122B (MoE, 10B) & 4B$^\ddagger$ & 2.79$\times$ & 2.90$\times$ & ---$^\dagger$ & ---$^\dagger$ & ---$^\dagger$ \\
Qwen3.5-122B (MoE, 10B) & 2B & 3.71$\times$ & 4.11$\times$ & 4.23$\times$ & 4.50$\times$ & 5.45$\times$ \\
Qwen3.5-35B (MoE, 3B) & 4B & 1.81$\times$ & 1.86$\times$ & 1.85$\times$ & 1.84$\times$ & --- \\
Nemotron-H 120B (hybrid) & Nano-4B & 2.10$\times$ & 2.17$\times$ & 2.19$\times$ & 2.19$\times$ & --- \\
GPT-OSS 120B (dense) & 20B & 1.24$\times$ & 1.28$\times$ & --- & --- & --- \\
\bottomrule
\multicolumn{7}{l}{\footnotesize $^\dagger$Not measured at this context length.} \\
\multicolumn{7}{l}{\footnotesize $^\ddagger$4B draft includes VLM weights (3.0\,GB); the 2B text-only draft (1.4\,GB) is the primary configuration.}
\end{tabular}
\caption{Mean TTFT speedups at 20\% keep. Speedup generally increases with prompt length as scoring overhead is amortized and $O(n^2)$ attention savings compound. The Qwen3.5-122B rows (2B and 4B drafts) use 5 trials; the remaining rows use 2 trials.}
\label{tab:ttft}
\end{table}
For Qwen3.5-122B with the 2B draft, the absolute TTFT at 64K tokens drops from $417.6 \pm 0.6$\,s to $92.8 \pm 0.8$\,s.
At 128K tokens: $1{,}155.8 \pm 8.5$\,s (19.3~minutes) $\rightarrow$ $212.3 \pm 1.9$\,s (3.5~minutes), a \textbf{5.45$\times$} reduction.
At 8K tokens: $45.0 \pm 0.1$\,s $\rightarrow$ $12.1 \pm 0.03$\,s.
Standard deviations across 5 trials are $<$1\% of the mean at all context lengths, confirming measurement stability.
Speedup increases monotonically with context length, from $3.71\times$ at 8K to $5.45\times$ at 128K, consistent with the amortization of fixed scoring overhead and the superlinear growth of baseline prefill cost.
\paragraph{Nemotron-H: architecture-limited speedup plateau.}
Nemotron-H shows a flat speedup profile ($2.10\times$ at 8K to $2.19\times$ at 64K), in contrast to Qwen3.5's monotonically increasing curve.
This plateau is explained by the hybrid architecture: only 8 of 88 layers are attention, and the cost of the remaining 80 layers (40~Mamba-2 SSM + 40~MoE feed-forward) scales only linearly with token count, with or without \specprefill{}.
The $O(n^2)$ attention component that drives Qwen3.5's compounding speedup at long contexts constitutes only ${\sim}9\%$ of Nemotron-H's total compute, so the quadratic savings are a small fraction of overall prefill cost.
This confirms the architecture-dependent nature of the cost model: \specprefill{} benefit scales with the attention fraction of total computation.
In absolute terms, Nemotron-H 120B TTFT drops from 58\,s to 27\,s at 16K and from 253\,s to 116\,s at 64K.
For Qwen3.5-35B: 41\,s to 22\,s at 16K.
\paragraph{Superlinear scaling at extreme context lengths.}
The 128K baseline (1{,}156\,s) is $2.77\times$ the 64K baseline (418\,s), not the $2\times$ expected from linear scaling.
This superlinear growth arises from the $O(n^2)$ attention component in chunked prefill: each 2{,}048-token chunk attends to all preceding tokens, so cumulative attention FLOPs grow quadratically.
\specprefill{} benefits disproportionately: by reducing the effective sequence length from $N$ to $kN$, it cuts the cumulative attention FLOPs---which scale approximately as $N^2$ under full-prefix chunked prefill---to roughly a $k^2$ fraction of the full-prefill cost.
At $k = 0.2$, this yields a ${\sim}25\times$ reduction in attention computation, not the $5\times$ that linear token-count scaling would suggest.
Draft scoring remains $O(N)$, so its cost grows linearly while the attention savings grow quadratically, explaining why the measured 128K speedup ($5.45\times$) exceeds the linear prediction ($4.5\times$, Table~\ref{tab:cost_model}).
MoE models exhibit stronger gains because sparse prefill reduces both the quadratic attention cost (sequence length) and the number of active expert evaluations---80\% fewer tokens means 80\% fewer expert routing and weight-loading operations.
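In rough terms, ignoring constants and the per-chunk structure, the cumulative attention work behaves as
\begin{equation*}
\underbrace{\sum_{i=1}^{N} i \;\approx\; \frac{N^2}{2}}_{\text{full prefill}}
\qquad\longrightarrow\qquad
\sum_{i=1}^{kN} i \;\approx\; \frac{(kN)^2}{2} \;=\; k^2 \cdot \frac{N^2}{2},
\end{equation*}
so at $k = 0.2$ the attention term falls to ${\sim}4\%$ of its full-prefill value, while the draft's $O(N)$ scoring cost is unchanged by $k$.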
\paragraph{Draft model selection.}
Draft models are selected to maximize FLOP asymmetry while maintaining tokenizer and architectural compatibility.
The selection criteria are: (1)~same tokenizer as the target (required---token IDs are passed directly without translation); (2)~smallest available model in the family (to minimize the FLOP ratio $r$); (3)~presence of attention layers for importance scoring (at least 4 layers suffice per our Nemotron-H validation).
Our primary configuration uses a 2B draft ($r \approx 0.02$, 1.4\,GB); complementary measurements with a 4B draft ($r \approx 0.03$, 3.0\,GB) at 8K--16K confirm the relationship: the lower-ratio 2B achieves $4.11\times$ vs.\ $2.90\times$ at 16K.
The GPT-OSS result is a \textbf{negative result confirming the ratio thesis}: the 20B draft model (the smallest available in the GPT-OSS family) has an unfavorable FLOP ratio of $\sim$0.17, yielding only $1.24$--$1.28\times$ speedup.
This validates that architecture is not the determining factor---the FLOP ratio is.
A hypothetical 4B GPT-OSS draft ($r \approx 0.03$) would be predicted to achieve $\sim$2.5--3$\times$ under our cost model, but no such model exists in the GPT-OSS family (Section~\ref{sec:future}).
\subsection{Draft-to-Target FLOP Ratio Analysis}
\label{sec:ratio}
\begin{figure}[h]
\centering
\includegraphics[width=0.9\textwidth]{figures/ratio-speedup.pdf}
\caption{Measured speedup vs.\ draft-to-target FLOP ratio. The theoretical upper bound (Eq.~\ref{eq:speedup_ratio}) correctly predicts the ranking. Overhead from RoPE patching, chunk selection, and architecture-specific scoring reduces measured values below the theoretical curve.}
\label{fig:ratio}
\end{figure}
The FLOP ratio $r = C_d / C_t$ is the dominant predictor of \specprefill{} benefit on unified memory.
Across our configurations (Table~\ref{tab:cost_model}), $r$ spans from 0.02 (2B/122B MoE) to 0.17 (20B/120B dense), and measured speedup tracks this ratio monotonically.
Complementary 4B draft measurements at 8K--16K ($r \approx 0.03$, $2.79$--$2.90\times$) confirm the ratio relationship: the lower-ratio 2B achieves higher speedup at the same context length.
This relationship distinguishes unified-memory \specprefill{} from GPU-based implementations, where PCIe bandwidth introduces an additional term $T$ (Equation~\ref{eq:speedup_gpu}) that weakens the FLOP-ratio signal.
\subsection{Keep Percentage Ablation}
\begin{table}[h]
\centering
\small
\begin{tabular}{@{}ccccc@{}}
\toprule
\textbf{Keep \%} & \textbf{TTFT (s)} & \textbf{Speedup} & \textbf{Needle} & \textbf{JSON} \\
\midrule
10\% & 23.90 & 3.85$\times$ & PASS & 1/2$^\S$ \\
20\% & 31.30 & 2.94$\times$ & PASS & PASS \\
30\% & 38.88 & 2.37$\times$ & PASS & PASS \\
50\% & 53.78 & 1.71$\times$ & PASS & PASS \\
100\% (baseline) & 91.97 & 1.0$\times$ & PASS & PASS \\
\bottomrule
\multicolumn{5}{l}{\footnotesize $^\S$JSON extraction at 10\% keep: one regression (1 of 3 values wrong) in trial 1, pass in trial 2.} \\
\end{tabular}
\caption{Keep percentage ablation on Qwen3.5-122B at ${\sim}$16K tokens with the 4B draft. ``Needle'' is UUID retrieval at all three depths (10\%, 50\%, 90\%); ``JSON'' is exact value extraction from an 80-record array. The 10\% row shows the quality boundary.}
\label{tab:ablation}
\end{table}
The curve shows a clear knee at ${\sim}$20\%: all quality tests pass while delivering $2.94\times$ speedup (4B draft), and raising $k$ beyond this point erodes speedup because compute cost grows linearly with $k$.
This is our recommended operating point.
At 10\%, speedup increases to $3.85\times$ but structured-data extraction becomes unreliable (1 JSON regression in 2 trials), establishing 10\% as the quality boundary.
At 50\%, the $1.71\times$ speedup is marginal relative to the scoring overhead incurred.
\subsection{Quality Validation}
\label{sec:quality}
Because \specprefill{} retains the highest-scoring tokens by attention importance, the model keeps the tokens it would have attended to most heavily during generation.
We validate this through complementary evaluations, leading with the most concrete and falsifiable tests.
\paragraph{Adversarial tests (primary).}
Eight tests designed to expose sparse-prefill weaknesses: needle-in-a-haystack UUID retrieval at 10\%, 50\%, and 90\% depth (one test per depth), JSON value extraction from an 80-record array, code bug detection, back-reference, mixed-language retrieval, and XML structure extraction.
At 20\% keep: \textbf{0/16 regressions} across 2 trials $\times$ 8 tests.
All needle-in-a-haystack and JSON extraction tests pass under both baseline and \specprefill{}.
\specprefill{} does not drop needles or corrupt structured data retrieval at 20\% keep.
At 10\% keep: 1/16 regressions---a JSON extraction test returned one incorrect value out of three (trial 1 of 2; trial 2 passed).
All needle tests pass at all depths even at 10\%, suggesting that high-importance tokens (which needles represent) are robustly retained; the failure mode at extreme sparsity is degraded recall of \emph{low-salience} structured data.
\paragraph{ROUGE-L (lexical similarity).}
We compare outputs from \specprefill{} (20\% keep) against full-prefill baselines on six real-task prompts (code generation, code review, summarization, reasoning, tutorial writing, tool use), each targeting $\sim$8K actual tokens on Qwen3.5-122B.
To establish a variance floor, we first compare two independent baseline runs against each other:
\begin{center}
\small
\begin{tabular}{lc}
\toprule
\textbf{Comparison} & \textbf{ROUGE-L F1} \\
\midrule
Baseline vs.\ baseline (variance floor) & 0.190 $\pm$ 0.174 \\
\specprefill{} vs.\ baseline & 0.236 \\
\bottomrule
\end{tabular}
\end{center}
The high baseline-vs-baseline variance ($0.190 \pm 0.174$) demonstrates that lexical similarity between outputs is dominated by the model's inherent stochasticity at $\text{temp}=0.6$, not by any effect of \specprefill{}.
The \specprefill{} similarity (0.236) falls within the baseline noise floor, but the comparison is too noisy to support quality conclusions on its own---the adversarial tests above provide the primary quality evidence.
\paragraph{LLM-as-Judge (supporting).}
Blinded A/B evaluation scores coherence, completeness, and accuracy (1--5 scale) for both baseline and \specprefill{} outputs, plus an overall equivalence rating:
\begin{center}
\small
\begin{tabular}{lc}
\toprule
\textbf{Comparison} & \textbf{Avg.\ Equivalence} \\
\midrule
Baseline vs.\ baseline & 3.0 / 5.0 \\
\specprefill{} vs.\ baseline & 3.0 / 5.0 \\
\bottomrule
\end{tabular}
\end{center}
\paragraph{Perplexity (distributional).}
We measure next-token perplexity on 256-token continuations after full vs.\ sparse prefill (20\% keep) on five documents at 8K context (Python code, LaTeX, and mixed content):
\begin{center}
\small
\begin{tabular}{lccc}
\toprule
\textbf{Document type} & \textbf{PPL (full)} & \textbf{PPL (sparse)} & \textbf{Ratio} \\
\midrule
Python (engine code) & 1.85 & 2.53 & 1.37 \\
Python (benchmark script) & 1.66 & 1.74 & 1.05 \\
Python (test harness) & 1.49 & 1.58 & 1.06 \\
LaTeX (this paper) & 2.00 & 2.14 & 1.07 \\
Mixed (concatenated) & 2.76 & 3.17 & 1.15 \\
\midrule
\textbf{Mean} & 1.95 & 2.23 & \textbf{1.14} \\
\bottomrule
\end{tabular}
\end{center}
Mean perplexity increases 14\%, though the median ratio is 1.07 and 4 of 5 documents show an increase of 15\% or less.
The outlier (dense engine code, 1.37$\times$) contains many local variable dependencies where discarded tokens carry predictive information.
This distributional shift does not translate to generation quality degradation in our adversarial or LLM-judge evaluations above---sampling smooths over small distributional differences that perplexity measures precisely.
16K context was not tested: loading both the 122B target and 2B draft for offline evaluation leaves insufficient headroom on 128\,GB unified memory.
\paragraph{Limitations.}
Six prompts, eight adversarial types, and five perplexity documents confirm no catastrophic quality loss and validate the methodology, but the sample size is insufficient for tight confidence intervals.
We make no claim of statistical equivalence---only that \textbf{no measurable quality degradation was observed under our evaluation at 20\% keep}.
The 10\% keep JSON regression demonstrates that the quality boundary is observable and characterizable within our framework.
All quality evaluations were conducted on Qwen3.5-122B.
Nemotron-H and GPT-OSS were validated via pipeline tests (Section~\ref{sec:method}) but lack server-side adversarial evaluation.
Future work will include larger-scale evaluation on standardized long-context benchmarks (e.g., RULER, LongBench) and extend quality validation to non-Qwen architectures.
\subsection{Memory Profile}
\label{sec:memory}
\begin{table}[h]
\centering
\small
\begin{tabular}{@{}lrrl@{}}
\toprule
\textbf{Component} & \textbf{Memory} & \textbf{Cumulative} & \textbf{Notes} \\
\midrule
Target weights (122B, 5-bit) & 79\,GB & 79\,GB & Fixed at load \\
Draft weights (2B, 4-bit) & 1.4\,GB & 80\,GB & Fixed at load \\
MLX Metal cache limit & 4\,GB & 84\,GB & Computation scratch \\
Target KV cache (128K) & 12\,GB & 96\,GB & 96\,KB/token $\times$ 127K \\
Draft KV cache (128K, transient) & 1.5\,GB & 97\,GB & 12\,KB/token, freed after scoring \\
OS + framework overhead & $\sim$25\,GB & $\sim$122\,GB & Observed via \texttt{memory\_pressure} \\
\midrule
\textbf{Peak (128K baseline)} & & \textbf{$\sim$122\,GB} & Of 128\,GB unified \\
\textbf{Peak (128K \specprefill{})} & & \textbf{$\sim$122\,GB} & Draft KV transient, not additive \\
\bottomrule
\end{tabular}
\caption{Memory budget for Qwen3.5-122B with 2B draft at 128K tokens on M2~Ultra 128\,GB. The draft KV cache is transient: allocated during scoring, freed via \texttt{mx.clear\_cache()} before target prefill begins. Peak \specprefill{} memory $\approx$ baseline peak because the draft and target KV caches are never resident simultaneously.}
\label{tab:memory}
\end{table}
% ============================================================
\section{Discussion}
\label{sec:discussion}
% ============================================================
\subsection{The MoE Sweet Spot}
\specprefill{} benefits MoE architectures more than dense models.
In MoE models, each token is routed to a subset of experts during the forward pass, but the routing computation and expert weight loading occur for \emph{every} token regardless.
By reducing the token count during prefill, \specprefill{} reduces the total number of expert activations---not just attention FLOPs, but the dominant feed-forward computation.
The savings therefore apply to the dominant cost term: at $k = 0.2$, the model processes 80\% fewer tokens through its expert layers, each involving sparse routing across 128 experts (Qwen3.5-122B).
Dense models, by contrast, apply the same computation to every token regardless of routing.
Their savings from \specprefill{} are proportional only to the reduced attention and MLP computation, a smaller relative win when the model is fully compute-bound.
\subsection{When SpecPrefill Does Not Help}
\paragraph{Dense models with large drafts.}
When the FLOP ratio $r$ exceeds $\sim$0.15, scoring overhead consumes most of the potential savings.
Our GPT-OSS result (20B draft, $r \approx 0.17$, speedup $1.28\times$) demonstrates this boundary.
No smaller GPT-OSS model exists; the proprietary tokenizer (201K vocabulary) prevents cross-family draft substitution without a re-tokenization layer (Section~\ref{sec:future}).
\paragraph{Short prompts.}
Below $\sim$4K tokens, the fixed overhead of draft scoring and RoPE patching exceeds the savings from sparse prefill.
Our default threshold of 8{,}192 tokens reflects this empirical boundary.
\paragraph{Comparison with CritiPrefill.}
CritiPrefill~\cite{critiprefill} achieves sparse prefill without a draft model by using the target's own attention scores from an initial partial prefill.
On dense standard transformers where all layers are attention, CritiPrefill achieves 2.7--3.0$\times$ speedup (reported on Llama-3-8B and Yi-9B at 128K context).
However, it only saves attention FLOPs---on MoE architectures where attention constitutes 7--9\% of total computation, our analysis estimates the attention-only savings at roughly $1.03$--$1.08\times$.
\specprefill{}, by contrast, saves \emph{all} FLOPs (attention + routing + expert computation) for dropped tokens, yielding $3.7$--$5.5\times$ on MoE targets.
\subsection{Limitations}
\begin{itemize}
\item \textbf{Draft model dependency.} Requires a small model with a compatible tokenizer. This limits applicability to model families with multiple size variants (Qwen3.5: 2B/4B/27B/35B/122B; GPT-OSS: only 20B/120B).
\item \textbf{Nemotron-H SSM state.} SSM layers are updated only on retained tokens; skipped tokens do not contribute to state evolution. The resulting state trajectory diverges from full prefill. Empirically safe under our evaluation ($2.10$--$2.19\times$ with no observed quality degradation), but the magnitude of state drift is not quantified.
\item \textbf{Quality validation scale.} Six prompts and eight adversarial types validate methodology and confirm no catastrophic loss, but are insufficient for tight confidence intervals.
\item \textbf{Single hardware platform.} Results on M2~Ultra (128\,GB). Memory bandwidth and compute characteristics differ on M3/M4 variants, and the optimal keep percentage may shift.
\item \textbf{Single-request evaluation.} The serving engine serializes requests via \texttt{asyncio.Lock}. We do not evaluate \specprefill{} under concurrent load.
\end{itemize}
% ============================================================
\section{Related Work}
\label{sec:related}
% ============================================================
\paragraph{Sparse prefill.}
\citet{specprefill} introduce attention-based sparse prefill using a draft model on discrete GPU architectures.
CritiPrefill~\cite{critiprefill} achieves draft-free sparse prefill using the target model's own attention.
We extend the draft-based approach to unified memory hardware and non-transformer architectures.
To our knowledge, no prior work addresses sparse prefill specifically for unified memory systems.
\paragraph{Speculative decoding.}
\citet{leviathan2023fast} and \citet{chen2023accelerating} propose using a draft model to speculatively generate candidate tokens verified by the target model.
Multi-Token Prediction (MTP)~\cite{gloeckle2024better} uses auxiliary prediction heads within the target model itself.
Our ``Speculative Stack'' composes prefill-phase speculation (\specprefill{}) with decode-phase speculation (MTP), operating in non-overlapping phases.
\paragraph{Efficient attention.}
FlashAttention~\cite{dao2022flashattention,dao2023flashattention2} optimizes attention computation through tiling and memory-efficient kernels.
\specprefill{} is orthogonal: it reduces the token count \emph{before} attention, and can compose with efficient attention implementations.
\paragraph{Serving systems.}
vLLM~\cite{kwon2023efficient} introduces PagedAttention for efficient KV cache management.
Our system (vllm-mlx) adapts continuous-batching and paged-attention concepts for Apple Silicon.
\specprefill{} integrates as a per-request prefill optimization within this serving framework.
\paragraph{MLX framework.}
MLX~\cite{mlx} provides the unified-memory ML runtime enabling our zero-copy scoring approach.
Prior MLX-based serving work has focused on standard inference optimization; we show that unified memory enables system-level optimizations, specifically zero-copy draft scoring, that are impractical on discrete-GPU architectures.
% ============================================================
\section{Future Work}
\label{sec:future}
% ============================================================
\paragraph{Universal draft models with tokenizer translation.}
Our results are constrained to model families where a small same-tokenizer draft exists.
A \emph{universal draft model}---trained to score token importance across vocabularies via a learned re-tokenization layer---would decouple \specprefill{} from the family constraint.
The translation layer would map target token IDs to text, re-tokenize with the universal draft's vocabulary, score importance, and project scores back to target token space via character-offset alignment.
This is non-trivial (tokenization boundaries differ across BPE vocabularies) but would enable \specprefill{} for models like GPT-OSS where no small same-family draft exists.
\paragraph{CritiPrefill for dense models.}
For dense architectures where the FLOP ratio is unfavorable, CritiPrefill (draft-free) may be more practical.
Published results show 2.7--3.0$\times$ on dense 8B--9B transformers at 128K context.
On our MoE targets, where attention is a small fraction of compute, gains would be limited; the approach warrants investigation for dense models on unified memory where memory access patterns differ from GPU.
\paragraph{SSM state drift analysis.}
For hybrid models like Nemotron-H, quantifying the L2 distance between full-prefill and sparse-prefill SSM hidden states would characterize the information loss from skipping tokens in recurrent layers and establish quality guarantees beyond empirical testing.
\paragraph{Continuous batching.}
Our current implementation uses a single-request serialized engine.
Integrating \specprefill{} with continuous batching would enable concurrent request handling, but is currently blocked by a Metal driver stability issue on macOS~26.3.1.
\paragraph{Hardware generalization.}
Apple's M3 and M4 generations have different memory bandwidth and compute characteristics.
The optimal keep percentage and FLOP-ratio threshold may shift on these platforms.
% ============================================================
\section{Conclusion}
\label{sec:conclusion}
% ============================================================
We have presented the first implementation of \specprefill{} on unified memory hardware, demonstrating that Apple Silicon's shared address space eliminates the data-transfer overhead that complicates draft-based sparse prefill on discrete GPUs.
With transfer overhead removed, the cost equation reduces to a single dominant term: the draft-to-target FLOP ratio, validated across six configurations spanning MoE, Mamba-2 hybrid, and sliding-window dense architectures.
On Qwen3.5-122B, \specprefill{} reduces TTFT by $3.71$--$5.45\times$ across 8K--128K tokens with a 1.4\,GB draft model and no observed quality degradation under our evaluation. At 128K tokens, prefill drops from 19.3~minutes to 3.5~minutes.
Composed with system prompt KV caching, end-to-end speedup reaches $5.6\times$ on a 73K-token production workload.
The implementation handles architecture-specific challenges (gated queries, heterogeneous SSM/attention layers, sliding-window caches) through auto-detecting adapters that require no user configuration.
\specprefill{} is most effective on MoE and hybrid models where total parameters far exceed active computation, making a small dense draft model orders of magnitude cheaper than the target.
As large models move to local hardware, reducing prefill cost through techniques like zero-copy draft scoring directly determines whether long-context inference is usable.
% ============================================================
% References
% ============================================================
\bibliographystyle{plainnat}
\begin{thebibliography}{99}
\bibitem[Chen et~al.(2023)]{chen2023accelerating}
Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper.
\newblock Accelerating large language model decoding with speculative sampling.
\newblock \emph{arXiv preprint arXiv:2302.01318}, 2023.
\bibitem[Dao(2023)]{dao2023flashattention2}
Tri Dao.
\newblock Flash{A}ttention-2: Faster attention with better parallelism and work partitioning.
\newblock \emph{arXiv preprint arXiv:2307.08691}, 2023.
\bibitem[Dao et~al.(2022)]{dao2022flashattention}
Tri Dao, Daniel~Y. Fu, Stefano Ermon, Atri Rudra, and Christopher R\'{e}.
\newblock Flash{A}ttention: Fast and memory-efficient exact attention with {IO}-awareness.
\newblock In \emph{NeurIPS}, 2022.
\bibitem[Gloeckle et~al.(2024)]{gloeckle2024better}
Fabian Gloeckle, Badr Youbi~Idrissi, Baptiste Rozi\`{e}re, David Lopez-Paz, and Gabriel Synnaeve.
\newblock Better \& faster large language models via multi-token prediction.
\newblock \emph{arXiv preprint arXiv:2404.19737}, 2024.
\bibitem[Kwon et~al.(2023)]{kwon2023efficient}
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody~Hao Yu, Joseph~E. Gonzalez, Hao Zhang, and Ion Stoica.
\newblock Efficient memory management for large language model serving with {PagedAttention}.
\newblock In \emph{SOSP}, 2023.
\bibitem[Leviathan et~al.(2023)]{leviathan2023fast}
Yaniv Leviathan, Matan Kalman, and Yossi Matias.
\newblock Fast inference from transformers via speculative decoding.
\newblock In \emph{ICML}, 2023.
\bibitem[Apple(2023)]{mlx}
Apple Machine Learning Research.
\newblock {MLX}: An array framework for Apple Silicon.
\newblock \url{https://github.com/ml-explore/mlx}, 2023.
\bibitem[Peng et~al.(2024)]{yarn}
Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole.
\newblock {YaRN}: Efficient context window extension of large language models.
\newblock In \emph{ICLR}, 2024.
\bibitem[Su et~al.(2024)]{rope}
Jianlin Su, Murtadha Ahmed, Yu~Lu, Shengfeng Pan, Wen~Bo, and Yunfeng Liu.
\newblock {RoFormer}: Enhanced transformer with rotary position embedding.
\newblock \emph{Neurocomputing}, 568:127063, 2024.
\bibitem[Yao et~al.(2025)]{specprefill}
Ziteng Yao, Wei Chen, Yushi Huang, and others.
\newblock {SpecPrefill}: Speculative prefilling for faster long-context {LLM} inference.
\newblock \emph{arXiv preprint arXiv:2502.02789}, 2025.
\bibitem[Qwen(2025)]{qwen35}
Qwen Team.
\newblock {Qwen3.5}: A series of large language models.
\newblock \url{https://huggingface.co/Qwen/Qwen3.5-122B-A10B}, 2025.
\bibitem[Zhang et~al.(2025)]{critiprefill}
Junlin Zhang, Jiahao Li, and others.
\newblock {CritiPrefill}: A segment-level critique framework for efficient long-context {LLM} prefilling.
\newblock \emph{arXiv preprint}, 2025.
\end{thebibliography}
\end{document}