Unable to run on 2x RTX Pro 6000 (DEEP_GEMM problem)
This model at first glance seems like a perfect fit for a 192GB of VRAM setup thanks to the FP4 MoE.
I followed the vLLM recipe (https://recipes.vllm.ai/deepseek-ai/DeepSeek-V4-Flash), but unfortunately it doesn't seem to support SM120 GPUs. The module vllm/utils/deep_gemm.py throws an assertion error: Unsupported architecture.
I tried turning off DeepGEMM with VLLM_USE_DEEP_GEMM=0, but it doesn't help.
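For reference, roughly how I'm launching it (flags reproduced from memory based on the recipe, so the exact set may be off):
# Recipe-style launch with DeepGEMM disabled via the env var -- still hits the assertion.
VLLM_USE_DEEP_GEMM=0 vllm serve deepseek-ai/DeepSeek-V4-Flash \
  --tensor-parallel-size 2 \
  --trust-remote-code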
Does anybody know of any way to get it working on SM120?
Cheers!
It was teased a lot that Huawei support would come first, so probably just wait?
Same here, doesn't work with the recipe.
I wouldn't mind those Huawei 300i Duos - I've seriously considered buying a few more than once, but every time I can't find enough information to assure me I won't just end up with 10k in paperweights.
Same error here: (EngineCore_DP0 pid=256) ERROR 04-24 15:20:17 [core.py:1110] RuntimeError: Worker failed with error 'Assertion error (/workspace/.deps/deepgemm-src/csrc/apis/hyperconnection.hpp:56): Unsupported architecture', please check the stack trace above for the root cause
Seems like a problem with the MHC arch. I mean, it handles attention signals fundamentally differently, so no wonder there's no support yet. Too bad, because it's always DeepSeek who's hindered by that (MoE, MTP, etc.). Though I heard they wrote custom kernels for MHC or n-gram, so maybe they have a fix. Idk though, you may just have to wait.
It doesn't support SM120. The only supported GPUs are B300 from Blackwell, plus A100, H100, and H200.
I had to get Claude to knock out a few Triton kernels to patch the vLLM cu130 docker image. It took a few hours and I ended up with this monstrosity:
#!/bin/bash
# Run DeepSeek V4 Flash on SM120 (RTX PRO 6000 Blackwell) GPUs.
#
# SM120 issues and workarounds:
# 1. DeepGEMM's C++ kernels reject SM120 at runtime → patched deep_gemm.py
#    dispatches to Triton kernels for lightning indexer (paged + non-paged)
#    and keeps PyTorch fallbacks for tf32_hc_prenorm_gemm / fp8_einsum.
# 2. CUTLASS block-scaled FP8 kernel can't handle e8m0fnu scales through
#    the stable C API → disabled so Triton kernel is used instead.
# 3. Triton doesn't know the float8_e8m0fnu dtype → patched triton.py converts
#    e8m0fnu scales to bfloat16 on-the-fly (lossless, same exponent range;
#    see the sketch after this script).
# 4. fused_inv_rope_fp8_quant produces packed INT32 UE8M0 scales on SM100+ →
#    patched deepseek_v4_attention.py forces SM90-style FP32 scales so the
#    PyTorch fp8_einsum fallback can dequantize them.
# 5. FlashMLA sparse_decode_fwd / sparse_prefill_fwd reject SM120 → patched
#    flash_mla_interface.py dispatches to our Triton kernels.
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
PATCH_DIR="$SCRIPT_DIR/sm120-patches"
docker run --gpus all \
  --ipc=host -p 8000:8000 \
  -e VLLM_DISABLED_KERNELS=CutlassFp8BlockScaledMMKernel \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v "$PATCH_DIR/deep_gemm.py":/usr/local/lib/python3.12/dist-packages/vllm/utils/deep_gemm.py:ro \
  -v "$PATCH_DIR/triton.py":/usr/local/lib/python3.12/dist-packages/vllm/model_executor/kernels/linear/scaled_mm/triton.py:ro \
  -v "$PATCH_DIR/deepseek_v4_attention.py":/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/deepseek_v4_attention.py:ro \
  -v "$PATCH_DIR/flash_mla_interface.py":/usr/local/lib/python3.12/dist-packages/vllm/third_party/flashmla/flash_mla_interface.py:ro \
  -v "$PATCH_DIR/triton_kernels":/usr/local/lib/python3.12/dist-packages/sm120_triton_kernels:ro \
  -v "$PATCH_DIR/chat_template.jinja":/patches/chat_template.jinja:ro \
  vllm/vllm-openai:deepseekv4-cu130 deepseek-ai/DeepSeek-V4-Flash \
  --trust-remote-code \
  --kv-cache-dtype fp8 \
  --block-size 256 \
  --tensor-parallel-size 2 \
  --enable-expert-parallel \
  --max-model-len 200000 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.92 \
  --tokenizer-mode deepseek_v4 \
  --tool-call-parser deepseek_v4 \
  --compilation-config='{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [1, 2, 4, 8, 16]}' \
  --enable-auto-tool-choice \
  --default-chat-template-kwargs '{"thinking": true, "reasoning_effort": "high"}' \
  --reasoning-parser deepseek_v4 \
  --chat-template /patches/chat_template.jinja \
  --attention-backend FLASHMLA_SPARSE
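In case anyone wants the gist of item 3 without digging through the patch: float8_e8m0fnu stores only an 8-bit biased exponent, and bfloat16 covers the same exponent range, so widening the scales is exact. A rough standalone sketch of the idea (the helper name and shapes are mine, not what's in the actual patch):
import torch

def e8m0_scales_to_bf16(scale_bits: torch.Tensor) -> torch.Tensor:
    # scale_bits: raw uint8 bit patterns of float8_e8m0fnu scale factors.
    # An e8m0 value is 2**(bits - 127); bfloat16 represents the whole
    # 2**-127 .. 2**127 range, so the remap is lossless. 0xFF (e8m0's NaN
    # encoding) overflows to +inf here, which real scale factors never hit.
    exponent = scale_bits.to(torch.int32) - 127
    ones = torch.ones_like(exponent, dtype=torch.float32)
    return torch.ldexp(ones, exponent).to(torch.bfloat16)

# e.g. block scales coming out of an FP8 quantization step
scales = torch.randint(100, 140, (128,), dtype=torch.uint8)
bf16_scales = e8m0_scales_to_bf16(scales)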
You're probably better off waiting for official support
Currently, DeepGEMM has no plans for SM120 compatibility. When will that be possible?
I'm crying as a 4090 48GB user - that's sm_89, which may never work with DeepGEMM.
Community fork of vLLM that supports SM120:
https://github.com/jasl/vllm/tree/ds4-sm120
https://github.com/vllm-project/vllm/pull/40991
Was anyone able to make it work with 2x6000 Pros??
Patiently waiting with my 2x rtx 6000 pro
Same here