Fails to load on Ampere (sm_86) at TP=2: Marlin kernel rejects 32-dim weight slice

#3
by wasifb - opened

Hi, and thanks for publishing this quant. Reporting an Ampere-specific loading failure so other users don't hit the same wall, and suggesting a possible source-side fix.

Summary

On 2× NVIDIA RTX 3090 (Ampere, compute capability 8.6, PCIe-only), this model fails to load at tensor-parallel-size=2 on both vLLM 0.19.1 and
SGLang 0.5.10.post1. Loading at TP=1 works fine and benches at ~156 TPS short-prompt decode on a single 3090. Your Gemma-4 AutoRound quant
(Intel/gemma-4-31B-it-int4-AutoRound) loads cleanly at TP=2, so this appears specific to a tensor dim in this Qwen3.6 quant.

Reproduction

docker run --rm --gpus all --ipc host --shm-size 16g -p 8000:8000 \
  -v /path/to/hf:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model /root/.cache/huggingface/Intel/Qwen3.6-35B-A3B-int4-AutoRound \
  --tensor-parallel-size 2 --max-model-len 32768 \
  --gpu-memory-utilization 0.92 --disable-custom-all-reduce \
  --trust-remote-code

Same failure on vllm/vllm-openai:gemma4-cu130.

Error (vLLM kernel fallback chain)

ValueError: Failed to find a kernel that can implement the WNA16 linear layer. Reasons:
CutlassW4A8LinearKernel requires capability 90, current compute capability is 86
MacheteLinearKernel requires capability 90, current compute capability is 86
AllSparkLinearKernel cannot implement due to: For Ampere GPU, AllSpark does not support group_size = 128. Only group_size = -1 are supported.
MarlinLinearKernel cannot implement due to: Weight output_size_per_partition = 32 is not divisible by min_thread_n = 64. Consider reducing
tensor_parallel_size or running with --quantization gptq.
ConchLinearKernel cannot implement due to: conch-triton-kernels is not installed, please install it via pip install conch-triton-kernels
and try again!
ExllamaLinearKernel cannot implement due to: Exllama only supports float16 activations

SGLang emits an analogous error from gptq_marlin_repack.cuh:309: size_n = 32 is not divisible by tile_n_size = 64.

Root cause (from the stack trace)

The failure occurs in vllm/model_executor/layers/mamba/gdn_linear_attn.py β†’ MergedColumnParallelLinear β†’ gptq_marlin.create_weights β†’
choose_mp_linear_kernel. One tensor in this quant has a natural output dim of 64, which halves to 32 per rank at TP=2, and Marlin's Ampere
kernel requires the output dim ≥ 64 (its thread tile width).
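
For intuition, here is a tiny standalone sketch of that arithmetic (illustrative only: the 64 mirrors the min_thread_n reported in the error, and the real check lives inside vLLM's kernel selection, not this snippet):

MARLIN_MIN_THREAD_N = 64  # tile width Marlin requires on the output dim (from the error above)

def per_rank_output_dim(output_size: int, tp_size: int) -> int:
    # Column-parallel layers split the output dim evenly across TP ranks.
    return output_size // tp_size

for tp in (1, 2):
    n = per_rank_output_dim(64, tp)  # the offending tensor's output dim is 64
    print(f"TP={tp}: output_size_per_partition={n}, "
          f"Marlin-compatible={n % MARLIN_MIN_THREAD_N == 0}")
# TP=1: output_size_per_partition=64, Marlin-compatible=True
# TP=2: output_size_per_partition=32, Marlin-compatible=False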

Why other paths don't help on Ampere

  • Machete / CutlassW4A8 – Hopper sm_90+ only
  • AllSpark – requires group_size = -1 (this quant uses 128)
  • Conch-triton – not included in standard vLLM/SGLang images
  • Exllama – fp16-activations only, not applicable to W4A16

So Ampere users currently have no supported kernel at TP=2 without building a custom image with conch-triton-kernels.
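
For completeness, the same rejections can be restated as simple checks. This is just an illustrative paraphrase of the messages quoted above, not vLLM's actual selection logic, and the activation dtype is my assumption (bf16 being what rules Exllama out):

# Illustrative paraphrase of the rejection reasons above; not vLLM's code.
capability = 86          # RTX 3090 (sm_86)
group_size = 128         # this quant's group size
n_per_rank = 32          # the 64-wide tensor split across TP=2
act_dtype = "bfloat16"   # assumption; Exllama only takes float16

checks = [
    ("CutlassW4A8 / Machete", capability >= 90),
    ("AllSpark (Ampere path)", group_size == -1),
    ("Marlin", n_per_rank % 64 == 0),
    ("Exllama", act_dtype == "float16"),
]
for name, usable in checks:
    print(f"{name}: {'usable' if usable else 'rejected'}")
# Conch is ruled out separately: the package just isn't in the stock images.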

Suggested fix (source-side)

Pad the offending tensor's output dim to a multiple of 128 during quantization: since the per-rank slice must stay divisible by Marlin's 64, padding to 128 covers TP=2, 256 covers TP=4, and 512 covers TP=8 as well. The extra compute is negligible, and it would let the existing Marlin kernel handle TP=2 on Ampere without any user-side workaround.
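
To make that concrete, here is a rough sketch of the padding idea (the helper name is hypothetical; the real change would live wherever the checkpoint's weights are prepared before AutoRound quantization, and the consuming dims in the model config would need to be bumped to match):

import torch

def pad_output_dim(weight: torch.Tensor, multiple: int = 128) -> torch.Tensor:
    # Zero-pad the output (row) dimension up to the next multiple, e.g. 64 -> 128.
    out_features, in_features = weight.shape
    padded = -(-out_features // multiple) * multiple  # ceiling to the next multiple
    if padded == out_features:
        return weight
    new_weight = weight.new_zeros(padded, in_features)
    new_weight[:out_features] = weight  # extra rows stay zero and contribute nothing
    return new_weight

# Sanity check of which padded sizes keep the per-rank slice divisible by 64:
for total in (64, 128, 256, 512):
    print(total, [tp for tp in (1, 2, 4, 8) if (total // tp) % 64 == 0])
# 64  [1]
# 128 [1, 2]
# 256 [1, 2, 4]
# 512 [1, 2, 4, 8]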

Alternatively, a note in the model card listing "TP=1 only on Ampere sm_86 until padded" would save future users the debugging time.

Context for reference

  • Hardware: 2× RTX 3090 (24 GB each), PCIe Gen4 x16, no NVLink
  • Driver: NVIDIA 595.58.03 / CUDA 13.2
  • vLLM 0.19.1 (also tested :gemma4-cu130)
  • SGLang 0.5.10.post1
  • TP=1 fallback works and is fast (156 TPS short-prompt decode), so this is not blocking for single-card users. But dual-card users on consumer
    Ampere (3090/3090 Ti/A5000) can't use the quant at TP=2.

Thanks!

Intel org

Thanks for reporting the issue. However, it's difficult for us to address this since the model is used across different tensor parallel (TP) configurations. For example, padding enough to cover TP=8 would introduce unnecessary overhead for TP=1. Given this trade-off, I'd suggest raising an issue with vLLM and asking whether padding can be handled during the inference phase.

Thanks for the prompt response. I'll raise a ticket with them. That said, we'd still appreciate TP=2 support out of the box as a minimum; TP=1 leaves us thin on context size.

P.S. I've raised the case here: https://github.com/vllm-project/vllm/issues/40354, along with a PR. With padding enabled in vLLM, I'm able to hit ~170 TPS across 2× 3090s.
