Heads-up for Ampere users: tq-t4nc recipe + MTP doesn't work with CUDA graphs

#2
by wasifb - opened

Thanks for the great quant, and especially for bundling the BF16 mtp.fc. On 2× RTX 3090 (Ampere, SM 86) we're getting 80 TPS on code single-card
and 92 TPS on code at TP=2 following your card (great numbers, beating your quoted RTX 5090 figures on our 3090s). Wanted to flag a caveat for
other Ampere users who try the --kv-cache-dtype tq-t4nc recipe:

On Ampere, TurboQuant KV combined with --speculative-config method=mtp + --enable-chunked-prefill crashes vLLM at engine warmup:

turboquant_attn.py:570: qsl = query_start_loc.tolist()
RuntimeError: Cannot copy between CPU and CUDA tensors during CUDA graph capture
unless the CPU tensor is pinned.

The .tolist() in the continuation-prefill branch forces a GPU→CPU sync that's illegal inside CUDA graph capture. vLLM PR #40092 (2026-04-23)
unblocked the fast path but left the continuation-prefill branch untouched, so any spec-decode + chunked-prefill combo still hits this on
Ampere. This is distinct from the hybrid-model NotImplementedError at arg_utils.py:1652 (covered by the
Sandermage/genesis-vllm-patches monkey-patcher).
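
To see the failure mode in isolation, here is a minimal plain-PyTorch sketch (not vLLM code): any GPU→CPU copy such as .tolist() raises a
RuntimeError once CUDA graph capture has started.

    import torch

    x = torch.arange(8, device="cuda")

    # Warm up on a side stream before capture, as the CUDA graph docs recommend.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        _ = x * 2
    torch.cuda.current_stream().wait_stream(s)

    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        _ = x * 2          # graph-safe: stays on the GPU
        vals = x.tolist()  # GPU->CPU copy + sync, fails during capture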

Filed upstream as vllm-project/vllm#40807.

Your card's suggested workaround (--compilation-config.cudagraph_mode none) boots, but it drops short-prompt TPS by ~55% and ends up slower
than llama.cpp mainline at long context, so it's net-negative on this hardware today. We wrote a small disk-edit patch that wraps both
.tolist() sites in torch.cuda.is_current_stream_capturing() guards: it falls back to the graph-safe fast path only during capture, and real
inference is unchanged. Source + reproducible recipe at
github.com/noonghunna/qwen36-27b-single-3090.
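
The shape of the guard is roughly this (a simplified sketch, not the exact diff; query_start_loc_as_list is a made-up helper name, the real
patch edits the two call sites in turboquant_attn.py in place):

    from typing import List, Optional
    import torch

    def query_start_loc_as_list(query_start_loc: torch.Tensor) -> Optional[List[int]]:
        # A GPU->CPU copy is illegal while a CUDA graph is being captured, so
        # skip the host-side list and let the caller take the graph-safe fast path.
        if torch.cuda.is_current_stream_capturing():
            return None
        # Outside capture (real inference, profiling) behavior is unchanged.
        return query_start_loc.tolist()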

What worked for us on 2× RTX 3090 (benched against your recipe):

  • --kv-cache-dtype turboquant_3bit_nc instead of tq-t4nc → smaller bytes-per-token, lets us fit 125K context on a single card (KV pool 198K
    tokens with vision enabled)
  • --speculative-config '{"method":"mtp","num_speculative_tokens":3}': n=3 beats your recommended n=1 by ~30% on code (79.7 vs ~60 TPS
    single-card, 92.2 TPS TP=2), with mean AL 3.4 and 67–92% per-position acceptance. n=4 regresses (position-4 acceptance collapses to ~8%);
    see the sketch after this list.
  • --gpu-memory-utilization 0.97, --max-num-seqs 1, --max-num-batched-tokens 4128
  • Vision enabled (no --language-model-only): 54 TPS on image+text, only 18% slower than pure-text narrative generation
  • CUDA graphs ON via our patch → keeping this is what takes us from ~30 TPS (eager workaround) to 85 TPS sustained / 106 peak
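
As a rough back-of-the-envelope for the n=3 vs n=4 choice: assuming each draft position is accepted independently and acceptance stops at the
first rejection, expected tokens per step is 1 + p1 + p1·p2 + ... The rates below are illustrative picks from our 67–92% range, not exact
per-position measurements.

    def mean_acceptance_length(per_position_accept):
        # One token from the target model each step, plus drafted tokens that survive.
        expected, survive = 1.0, 1.0
        for p in per_position_accept:
            survive *= p
            expected += survive
        return expected

    print(mean_acceptance_length([0.92, 0.78, 0.67]))        # ~3.12 with n=3
    print(mean_acceptance_length([0.92, 0.78, 0.67, 0.08]))  # ~3.16 with n=4: position 4 adds ~0.04
                                                             # tokens while every step pays for the extra draft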

Once the upstream .tolist() sync gets pinned or precomputed (pending review at #40807), your tq-t4nc recipe should unlock 262K context on
Ampere too — we confirmed the Genesis-patched hybrid path gives a GPU KV cache of 874,368 tokens (5× fp8) on this model. Standing by to
retest when upstream lands.

Minor note: your card recommends --tool-call-parser qwen3_xml but qwen3_coder worked better for us with OpenAI-compat clients (Open WebUI,
LM Studio, Cline). YMMV.

Thanks again for the great release — the BF16 mtp.fc design decision is what makes single-card MTP possible at all.

wasifb changed discussion status to closed
wasifb changed discussion status to open

@wasifb can you please share your patch_tolist_cudagraph.py file?

The patch is now available on GitHub; links in the post above have been updated.
