Unable to run on 2x RTX Pro 6000 (DEEP_GEMM problem)

#15
by stev236 - opened

At first glance this model seems like a perfect fit for a 192 GB VRAM setup thanks to the FP4 MoE.
I followed the vLLM recipe (https://recipes.vllm.ai/deepseek-ai/DeepSeek-V4-Flash), but unfortunately DeepGEMM doesn't seem to support SM120 GPUs: the module vllm/utils/deep_gemm.py throws an assertion error, "Unsupported architecture".
I tried turning DeepGEMM off with VLLM_USE_DEEP_GEMM=0, but it doesn't help.
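For reference, here's roughly what I tried. This is only a sketch of the recipe's docker invocation; the image tag and serving flags are illustrative rather than exact:

# Sketch of the recipe's docker run with DeepGEMM disabled via the env var.
# Image tag and serving flags are illustrative, not exact.
docker run --gpus all --ipc=host -p 8000:8000 \
  -e VLLM_USE_DEEP_GEMM=0 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:deepseekv4-cu130 \
  deepseek-ai/DeepSeek-V4-Flash \
  --trust-remote-code \
  --tensor-parallel-size 2
# Startup still aborts in vllm/utils/deep_gemm.py with "Unsupported architecture".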

Does anybody know of any way to get it working on SM120?
Cheers!

It was teased a lot that Huawei support would come first, so probably just wait?

Same here, doesn't work with the recipe.

I wouldn't mind those Huawei 300i Duo cards. I've seriously considered buying a few more than once, but every time I can't find enough information to assure me I won't just end up with 10k in paperweights.

Same error here: (EngineCore_DP0 pid=256) ERROR 04-24 15:20:17 [core.py:1110] RuntimeError: Worker failed with error 'Assertion error (/workspace/.deps/deepgemm-src/csrc/apis/hyperconnection.hpp:56): Unsupported architecture', please check the stack trace above for the root cause

Seems like a problem with the MHC arch. It handles attention signals fundamentally differently, so it's no wonder there's no support yet. Too bad, because it's always DeepSeek who's hindered by that (MoE, MTP, etc.). Though I heard they wrote custom kernels for MHC or the N-gram stuff, so maybe they have a fix. Idk though, you may just have to wait.

It doesn't support SM120. The only supported architectures are the B300 from Blackwell, plus the A100, H100, and H200.
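If you're not sure what your card reports, you can query the compute capability directly; this assumes a driver recent enough to support the compute_cap query field. The RTX PRO 6000 Blackwell shows up as 12.0 (SM120):

# Print each GPU's name and compute capability (SM120 reports as 12.0).
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader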

I had to get Claude to knock out a few Triton kernels to patch the vLLM cu130 docker image. It took a few hours and I ended up with this monstrosity:

#!/bin/bash
# Run DeepSeek V4 Flash on SM120 (RTX PRO 6000 Blackwell) GPUs.
#
# SM120 issues and workarounds:
# 1. DeepGEMM's C++ kernels reject SM120 at runtime → patched deep_gemm.py
#    dispatches to Triton kernels for lightning indexer (paged + non-paged)
#    and keeps PyTorch fallbacks for tf32_hc_prenorm_gemm / fp8_einsum.
# 2. CUTLASS block-scaled FP8 kernel can't handle e8m0fnu scales through
#    the stable C API → disabled so Triton kernel is used instead.
# 3. Triton doesn't know float8_e8m0fnu dtype → patched triton.py converts
#    e8m0fnu scales to bfloat16 on-the-fly (lossless, same exponent range).
# 4. fused_inv_rope_fp8_quant produces packed INT32 UE8M0 scales on SM100+ →
#    patched deepseek_v4_attention.py forces SM90-style FP32 scales so the
#    PyTorch fp8_einsum fallback can dequantize them.
# 5. FlashMLA sparse_decode_fwd / sparse_prefill_fwd reject SM120 → patched
#    flash_mla_interface.py dispatches to our Triton kernels.

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
PATCH_DIR="$SCRIPT_DIR/sm120-patches"

docker run --gpus all \
  --ipc=host -p 8000:8000 \
  -e VLLM_DISABLED_KERNELS=CutlassFp8BlockScaledMMKernel \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v "$PATCH_DIR/deep_gemm.py":/usr/local/lib/python3.12/dist-packages/vllm/utils/deep_gemm.py:ro \
  -v "$PATCH_DIR/triton.py":/usr/local/lib/python3.12/dist-packages/vllm/model_executor/kernels/linear/scaled_mm/triton.py:ro \
  -v "$PATCH_DIR/deepseek_v4_attention.py":/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/deepseek_v4_attention.py:ro \
  -v "$PATCH_DIR/flash_mla_interface.py":/usr/local/lib/python3.12/dist-packages/vllm/third_party/flashmla/flash_mla_interface.py:ro \
  -v "$PATCH_DIR/triton_kernels":/usr/local/lib/python3.12/dist-packages/sm120_triton_kernels:ro \
  -v "$PATCH_DIR/chat_template.jinja":/patches/chat_template.jinja:ro \
  vllm/vllm-openai:deepseekv4-cu130 deepseek-ai/DeepSeek-V4-Flash \
  --trust-remote-code \
  --kv-cache-dtype fp8 \
  --block-size 256 \
  --tensor-parallel-size 2 \
  --enable-expert-parallel \
  --max-model-len 200000 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.92 \
  --tokenizer-mode deepseek_v4 \
  --tool-call-parser deepseek_v4 \
  --compilation-config='{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [1, 2, 4, 8, 16]}' \
  --enable-auto-tool-choice \
  --default-chat-template-kwargs '{"thinking": true, "reasoning_effort": "high"}' \
  --reasoning-parser deepseek_v4 \
  --chat-template /patches/chat_template.jinja \
  --attention-backend FLASHMLA_SPARSE
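
If you do get it running, a quick sanity check is an ordinary OpenAI-compatible request against port 8000; the model name below matches what the script serves, so adjust it if yours differs:

# Minimal smoke test against the OpenAI-compatible endpoint started above.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-V4-Flash",
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
        "max_tokens": 64
      }'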

You're probably better off waiting for official support

Currently, DeepGEMM has no plan to support SM120. When will that become possible?

I'm crying as a 4090 48GB user; that card is SM89 and may never work with DeepGEMM.

@bash99 code it in CUDA! They accept community contributions. Someone already vibe-coded SM120 support all the way into vLLM lol

Was anyone able to make it work with 2x6000 Pros??

Patiently waiting with my 2x rtx 6000 pro

Same here
