Heads-up for Ampere users: tq-t4nc recipe + MTP doesn't work with CUDA graphs
Thanks for the great quant, and especially for bundling the BF16 mtp.fc. On 2× RTX 3090 (Ampere, SM 8.6) we're getting 80 TPS on code single-card
and 92 TPS on code at TP=2 following your card (great numbers; they beat your quoted RTX 5090 figures on our 3090s). Wanted to flag a caveat for other
Ampere users who try the `--kv-cache-dtype tq-t4nc` recipe:
On Ampere, TurboQuant KV combined with `--speculative-config method=mtp` plus `--enable-chunked-prefill` crashes vLLM at engine warmup:

```
turboquant_attn.py:570: qsl = query_start_loc.tolist()
RuntimeError: Cannot copy between CPU and CUDA tensors during CUDA graph capture
unless the CPU tensor is pinned.
```
The `.tolist()` in the continuation-prefill branch forces a GPU→CPU sync, which is illegal inside CUDA graph capture. vLLM PR #40092 (2026-04-23)
unblocked the fast path but left the continuation-prefill branch untouched, so any spec-decode + chunked-prefill combo still hits this on
Ampere. This is distinct from the hybrid-model `NotImplementedError` at `arg_utils.py:1652` (covered by the
Sandermage/genesis-vllm-patches monkey-patcher).
Filed upstream as vllm-project/vllm#40807.
Your card's suggested workaround (`--compilation-config.cudagraph_mode none`) boots, but it costs ~55% of short-prompt TPS and ends up slower than
llama.cpp mainline at long context, so it's net-negative on this hardware today. We wrote a small disk-edit patch that wraps both `.tolist()`
sites with `torch.cuda.is_current_stream_capturing()` guards: it falls back to the graph-safe fast path only during capture, so real inference is
unchanged. Source + reproducible recipe at
github.com/noonghunna/qwen36-27b-single-3090.
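For anyone who wants to hand-roll the same fix before #40807 lands, the guard logic is tiny. A minimal sketch of the shape of the patch (not the actual vLLM code: a plain list stands in for the CUDA tensor, and an explicit `is_capturing` flag stands in for `torch.cuda.is_current_stream_capturing()`, so this runs without a GPU):

```python
# Sketch of the guard applied at each .tolist() site in
# turboquant_attn.py. In real vLLM, `is_capturing` would be
# torch.cuda.is_current_stream_capturing() and `query_start_loc`
# a CUDA tensor; this stand-in just shows the control flow.

def query_start_loc_list(query_start_loc, is_capturing):
    if is_capturing:
        # A .tolist() here would force a GPU->CPU sync, which is
        # illegal during CUDA graph capture. Return None so the
        # caller takes the graph-safe fast path instead.
        return None
    return list(query_start_loc)  # stands in for .tolist()

# Outside capture: normal behavior, host-side list as before.
assert query_start_loc_list([0, 4, 9], is_capturing=False) == [0, 4, 9]
# During capture: no sync; caller falls back to the fast path.
assert query_start_loc_list([0, 4, 9], is_capturing=True) is None
```

The point is that the guard only changes behavior while a graph is being captured; every real decode step still sees the exact same values it did before.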
What worked for us on 2× RTX 3090 (benched against your recipe):

- `--kv-cache-dtype turboquant_3bit_nc` instead of `tq-t4nc` → smaller bytes-per-token, lets us fit 125K context on a single card (KV pool of 198K tokens with vision enabled)
- `--speculative-config '{"method":"mtp","num_speculative_tokens":3}'` → n=3 beats your recommended n=1 by ~30% on code (79.7 vs ~60 TPS single-card, 92.2 TPS TP=2) with mean acceptance length 3.4 and 67–92% per-position acceptance. n=4 regresses (position-4 acceptance collapses to ~8%).
- `--gpu-memory-utilization 0.97`, `--max-num-seqs 1`, `--max-num-batched-tokens 4128`
- Vision enabled (no `--language-model-only`): 54 TPS on image+text, only 18% slower than pure-text narrative
- CUDA graphs ON via our patch → keeping this is what takes us from ~30 TPS (the eager workaround) to 85 TPS sustained / 106 peak
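Putting the flags above together, our launch looks roughly like this (the model path is a placeholder for wherever you pulled the quant; add `--tensor-parallel-size 2` for the TP=2 numbers):

```shell
vllm serve <path-to-quant> \
  --kv-cache-dtype turboquant_3bit_nc \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}' \
  --enable-chunked-prefill \
  --gpu-memory-utilization 0.97 \
  --max-num-seqs 1 \
  --max-num-batched-tokens 4128
```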
Once the upstream `.tolist()` sync gets pinned or precomputed (pending review at #40807), your `tq-t4nc` recipe should unlock 262K context on
Ampere too: we confirmed the Genesis-patched hybrid path gives a GPU KV cache of 874,368 tokens (5× fp8) on this model. Standing by to
retest when upstream lands.
Minor note: your card recommends `--tool-call-parser qwen3_xml`, but `qwen3_coder` worked better for us with OpenAI-compat clients (Open WebUI,
LM Studio, Cline). YMMV.
Thanks again for the great release — the BF16 mtp.fc design decision is what makes single-card MTP possible at all.