Not working on DGX Spark
docker run --runtime nvidia --gpus all -p 8000:8000 -v "$HOME/.cache/huggingface/:/root/.cache/huggingface" --ipc=host vllm/vllm-openai:nightly --model Firworks/GLM-4.5-Air-nvfp4 --dtype auto --max-model-len 32768
raises:
ptxas fatal: Value 'sm_121a' is not defined for option 'gpu-name'
I don't have a Spark so I can't test or troubleshoot this myself, but I think on the Spark you have to use Nvidia's own container images for vLLM. Maybe try swapping the standard vLLM container for nvcr.io/nvidia/vllm:25.09-py3?
That appears to be what they suggest here:
https://build.nvidia.com/spark/vllm/instructions
Using nvcr.io/nvidia/vllm:25.09-py3 raises this error:
(EngineCore_0 pid=302) ERROR 12-01 07:56:47 [core.py:700] RuntimeError: Failed to initialize GEMM
Can you try rerunning it with this additional environment variable set?
-e VLLM_USE_FLASHINFER_MOE_FP4=1
I updated the docker command on the model card with that. It was needed to get INTELLECT-3 running, and since INTELLECT-3 is based on GLM-4.5-Base I'm wondering if the same might be needed here. The model ran without the flag on the B200 I tested it with, but the B200 targets a slightly older CUDA architecture (sm_100), and that environment variable might be needed for sm_120+, including the RTX Pro 6000 Blackwell, the 5090, and the Spark.
I found this related issue on vLLM's GitHub:
https://github.com/vllm-project/vllm/issues/28110
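A quick way to see which side of that sm_100 vs sm_120+ split a given GPU falls on: on reasonably recent drivers, `nvidia-smi --query-gpu=compute_cap --format=csv,noheader` prints the compute capability, and a tiny helper (my own, purely for illustration) maps it to the sm_XY name ptxas uses (the `a` suffix in errors like `sm_121a` denotes the architecture-specific variant of that target):

```python
# Hypothetical helper (illustration only): map the compute capability string
# reported by `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`
# to the sm_XY architecture name used by ptxas.
def compute_cap_to_sm(cap: str) -> str:
    major, minor = cap.strip().split(".")
    return f"sm_{major}{minor}"

print(compute_cap_to_sm("10.0"))  # B200 -> sm_100
print(compute_cap_to_sm("12.1"))  # GB10 in the Spark -> sm_121
```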
Hey, I tried running this model on my Spark with the nvcr.io/nvidia/vllm:25.09-py3 container, using -e VLLM_USE_FLASHINFER_MOE_FP4=1.
The VLLM_USE_FLASHINFER_MOE_FP4=1 flag doesn't help on the GB10 (SM 12.0/12.1). The log shows WARNING [nvfp4_moe_support.py:40] FlashInfer kernels unavailable for CompressedTensorsW4A4MoeMethod on current platform; it still falls back to the CUTLASS cutlass_fp4_group_mm path and crashes with RuntimeError: Failed to initialize GEMM.
Before setting -e VLLM_USE_FLASHINFER_MOE_FP4=1, I was getting errors similar to birolkyumcu's: failure to initialize GEMM.
Let me know if there's any additional information I can supply; I'm happy to iterate and would love to get this working on my Spark!
philip
I actually finally bought a spark. I'll mess with it tonight and see if I can figure out how to get it running.
Let me know! excited to see it working!
So I got it "working" two different ways.
With the original Nvidia VLLM container:
sudo docker run --gpus all --network host --ipc=host -e HF_TOKEN=$HF_TOKEN -v ~/.cache/huggingface:/root/.cache/huggingface nvcr.io/nvidia/vllm:26.02-py3 vllm serve Firworks/GLM-4.5-Air-nvfp4 --dtype auto --max-model-len 32768 --enforce-eager
This crashes eventually, though. It works for a while, but I'm not sure whether it's the number of inferences or the context length that eventually brings it down.
Then I tried the eugr docker repo, and using its script I ran it like this:
./launch-cluster.sh --solo exec vllm serve Firworks/GLM-4.5-Air-nvfp4 --port 8000 --host 0.0.0.0 --gpu-memory-utilization 0.7 --load-format fastsafetensors --enforce-eager
That also works for a while but eventually crashes too. From reading around, it seems NVFP4 MoE models just don't work well on the Spark currently. Before buying the Spark I ran all of my quants on the RTX Pro 6000 Blackwell, which seems to be much better supported; I'm not really sure why. You'd think Nvidia would be focused on supporting the Spark.
Interesting. The one NVFP4 model I have working (not an MoE, and it seems to work well: two days of uptime as a system service) is nm-testing/DeepSeek-R1-Distill-Qwen-32B-NVFP4.
Here's how I'm running it:
ExecStart=/usr/bin/docker run --rm --gpus all \
  --name vllm-verifier \
  -p 8102:8000 \
  -v /home/user/.cache/huggingface:/root/.cache/huggingface \
  -v /home/user/.cache/vllm:/root/.cache/vllm \
  --ipc=host \
  nvcr.io/nvidia/vllm:25.09-py3 \
  vllm serve nm-testing/DeepSeek-R1-Distill-Qwen-32B-NVFP4 --tensor-parallel-size 1 --dtype auto --gpu-memory-utilization 0.40 --max-model-len 32768
The 'playbook' also lists a number of NVFP4 models which I haven't tested yet, which include Nemotron MoE models: https://github.com/NVIDIA/dgx-spark-playbooks/tree/main/nvidia/vllm
I'm a complete noob just trying to move from openrouter to locally hosted tools. If you figure something out or have other ideas I'm all ears!
Yeah, specifically I think the problem is with the MoE kernels on the Spark. I do see those Nemotron MoE models in that playbook, but I haven't tried them myself, and I'm not sure if Nvidia did anything special with them. I've had no trouble running dense models on the Spark so far.
It definitely seems like NVFP4 is very experimental and flaky, to the point that it's not even clear whether I should be making these quants with llm-compressor or ModelOpt, or whether I should be producing "NVFP4" or "MXFP4" quantizations. A lot of this is just guessing.
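For what it's worth, as I understand it the two formats differ mainly in block size and scale storage: both pack weights as FP4 E2M1 values, but NVFP4 uses 16-element blocks with FP8 (E4M3) scales, while MXFP4 uses 32-element blocks with power-of-two (E8M0) scales. A rough numpy sketch of the difference (scales modeled as plain floats here, not the real storage formats):

```python
import numpy as np

# E2M1 (FP4) representable magnitudes, shared by NVFP4 and MXFP4.
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_blocks(x, block, pow2_scale):
    """Fake-quantize a 1-D array to the E2M1 grid in blocks of `block` values.

    pow2_scale=True mimics MXFP4's power-of-two (E8M0) scales;
    False mimics NVFP4's finer FP8 scales (modeled as exact floats here).
    """
    out = np.empty_like(x, dtype=np.float64)
    for i in range(0, len(x), block):
        blk = x[i:i + block]
        scale = np.max(np.abs(blk)) / E2M1[-1] or 1.0
        if pow2_scale:
            # round the scale up to a power of two, as E8M0 requires
            scale = 2.0 ** np.ceil(np.log2(scale))
        # snap each value to the nearest point on the scaled E2M1 grid
        idx = np.abs(np.abs(blk)[:, None] / scale - E2M1).argmin(axis=1)
        out[i:i + block] = np.sign(blk) * E2M1[idx] * scale
    return out

x = np.random.default_rng(0).normal(size=64)
nvfp4_like = quantize_blocks(x, block=16, pow2_scale=False)
mxfp4_like = quantize_blocks(x, block=32, pow2_scale=True)
print("nvfp4-like max err:", np.max(np.abs(nvfp4_like - x)))
print("mxfp4-like max err:", np.max(np.abs(mxfp4_like - x)))
```

The coarser blocks and power-of-two scales make MXFP4 cheaper to decode but generally a bit lossier, which may be part of why the two formats need different kernels.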
Hey just FYSA I was able to get soundsgoodai/GLM-4.5-Air-NVFP4-KV-cache-BF16 working using the following invocation:
docker run --rm --gpus all \
  --name glm45-air \
  -p 8100:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v ~/.cache/vllm:/root/.cache/vllm \
  --ipc=host \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  nvcr.io/nvidia/vllm:26.03-py3 \
  vllm serve soundsgoodai/GLM-4.5-Air-NVFP4-KV-cache-BF16 \
  --gpu-memory-utilization 0.70 \
  --max-model-len 131072 \
  --kv-cache-dtype fp8 \
  --enforce-eager \
  --enable-auto-tool-choice \
  --tool-call-parser glm45 \
  --reasoning-parser glm45
Takes 85GB of memory at ~2x concurrency.
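In case it helps anyone testing the tool-calling side: the server speaks the OpenAI chat-completions protocol, so once it's up, a payload like this (the get_weather tool is made up for illustration) should exercise the glm45 tool-call parser:

```python
import json

# OpenAI-style chat completion payload; POST it to
# http://localhost:8100/v1/chat/completions (port 8100 per the command above).
payload = {
    "model": "soundsgoodai/GLM-4.5-Air-NVFP4-KV-cache-BF16",
    "messages": [{"role": "user", "content": "What's the weather in Tokyo?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool, illustration only
            "description": "Look up current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}
print(json.dumps(payload, indent=2))
```

If tool calling is working, the response's message should contain a `tool_calls` entry naming `get_weather` rather than a plain-text answer.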
Claude: "The compressed-tensors models (gesong2077, Firworks) loaded but fell back to slow Marlin dequant because GB10 lacks native FP4 kernel support for that format, and the modelopt models with FP4/FP8 KV cache (soundsgoodai NVFP4/FP8 variants) crashed because vLLM's GLM4 MoE weight loader doesn't handle the extra k_scale/v_scale attention parameters; only the BF16 KV cache variant combines the native modelopt FP4 weight format (which maps to FLASHINFER_CUTLASS) with unquantized attention projections that vLLM can actually load." - hope that helps.
I was able to get Nemotron working as well, using the playbook.