Not working on DGX Spark
docker run --runtime nvidia --gpus all -p 8000:8000 -v "$HOME/.cache/huggingface/:/root/.cache/huggingface" --ipc=host vllm/vllm-openai:nightly --model Firworks/GLM-4.5-Air-nvfp4 --dtype auto --max-model-len 32768
raises:
ptxas fatal: Value 'sm_121a' is not defined for option 'gpu-name'
I don't have a Spark so I can't test or troubleshoot this myself, but I think on the Spark you have to use Nvidia's own container images for vLLM. Maybe try swapping the standard vLLM container for nvcr.io/nvidia/vllm:25.09-py3?
That appears to be what they suggest here:
https://build.nvidia.com/spark/vllm/instructions
Using nvcr.io/nvidia/vllm:25.09-py3 raises this error:
(EngineCore_0 pid=302) ERROR 12-01 07:56:47 [core.py:700] RuntimeError: Failed to initialize GEMM
Can you try rerunning it with this additional environment variable set?
-e VLLM_USE_FLASHINFER_MOE_FP4=1
I updated the docker command on the model card with that. It was needed to get INTELLECT-3 running, and since INTELLECT-3 is based on GLM-4.5-Base I'm wondering if the same might be needed here. The model ran without the flag on the B200 I tested it with, but the B200 targets a slightly older CUDA architecture (sm_100), and that environment variable might be needed for sm_120+, including the RTX Pro 6000 Blackwell, the 5090, and the Spark.
I found this related issue on vLLM's GitHub:
https://github.com/vllm-project/vllm/issues/28110
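A quick way to see which side of that sm_100 vs sm_120+ split a given GPU falls on: on reasonably recent drivers, `nvidia-smi --query-gpu=compute_cap --format=csv,noheader` prints the compute capability, and a tiny helper (my own, purely for illustration) maps it to the sm_XY name ptxas uses (the `a` suffix in errors like `sm_121a` denotes the architecture-specific variant of that target):

```python
# Hypothetical helper (illustration only): map the compute capability string
# reported by `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`
# to the sm_XY architecture name used by ptxas.
def compute_cap_to_sm(cap: str) -> str:
    major, minor = cap.strip().split(".")
    return f"sm_{major}{minor}"

print(compute_cap_to_sm("10.0"))  # B200 -> sm_100
print(compute_cap_to_sm("12.1"))  # GB10 in the Spark -> sm_121
```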
Hey, I tried running this model on my Spark with the nvcr.io/nvidia/vllm:25.09-py3 container, using -e VLLM_USE_FLASHINFER_MOE_FP4=1.
The VLLM_USE_FLASHINFER_MOE_FP4=1 flag doesn't help on the GB10 (SM 12.0/12.1). The log shows WARNING [nvfp4_moe_support.py:40] FlashInfer kernels unavailable for CompressedTensorsW4A4MoeMethod on current platform; it still falls back to the CUTLASS cutlass_fp4_group_mm path and crashes with RuntimeError: Failed to initialize GEMM.
Before setting -e VLLM_USE_FLASHINFER_MOE_FP4=1, I was getting errors similar to birolkyumcu's: failure to initialize GEMM.
Let me know if there's any additional information I can supply; I'm happy to iterate and would love to get this working on my Spark!
philip
I actually finally bought a spark. I'll mess with it tonight and see if I can figure out how to get it running.
Let me know! excited to see it working!
So I got it "working" two different ways.
With the original Nvidia VLLM container:
sudo docker run --gpus all --network host --ipc=host -e HF_TOKEN=$HF_TOKEN -v ~/.cache/huggingface:/root/.cache/huggingface nvcr.io/nvidia/vllm:26.02-py3 vllm serve Firworks/GLM-4.5-Air-nvfp4 --dtype auto --max-model-len 32768 --enforce-eager
This crashes eventually, though. It works for a while, but I'm not sure whether it's the number of inferences or the context length that eventually brings it down.
Then I tried the eugr docker repo, and using its script I ran it like this:
./launch-cluster.sh --solo exec vllm serve Firworks/GLM-4.5-Air-nvfp4 --port 8000 --host 0.0.0.0 --gpu-memory-utilization 0.7 --load-format fastsafetensors --enforce-eager
That also works for a while but eventually crashes too. From reading around, it seems NVFP4 MoE models just don't work well on the Spark currently. Before buying the Spark I ran all of my quants on the RTX Pro 6000 Blackwell, which seems to be much better supported; I'm not really sure why. You'd think Nvidia would be focused on supporting the Spark.
Interesting. The one NVFP4 model I have working (not an MoE, and it seems to work well: two days of uptime as a system service) is nm-testing/DeepSeek-R1-Distill-Qwen-32B-NVFP4.
Here's how I'm running it:
ExecStart=/usr/bin/docker run --rm --gpus all \
  --name vllm-verifier \
  -p 8102:8000 \
  -v /home/user/.cache/huggingface:/root/.cache/huggingface \
  -v /home/user/.cache/vllm:/root/.cache/vllm \
  --ipc=host \
  nvcr.io/nvidia/vllm:25.09-py3 \
  vllm serve nm-testing/DeepSeek-R1-Distill-Qwen-32B-NVFP4 --tensor-parallel-size 1 --dtype auto --gpu-memory-utilization 0.40 --max-model-len 32768
The 'playbook' also lists a number of NVFP4 models which I haven't tested yet, which include Nemotron MoE models: https://github.com/NVIDIA/dgx-spark-playbooks/tree/main/nvidia/vllm
I'm a complete noob just trying to move from openrouter to locally hosted tools. If you figure something out or have other ideas I'm all ears!
Yeah, specifically I think the problem is with the MoE kernels on the Spark. I do see those Nemotron MoE models in that playbook, but I haven't tried them myself, and I'm not sure if Nvidia did anything special with them. I've had no trouble running dense models on the Spark so far.
It definitely seems like NVFP4 is very experimental and flaky, to the point that it's not even clear whether I should be making these quants with llm-compressor or ModelOpt, or whether I should be producing "NVFP4" or "MXFP4" quantizations. A lot of this is just guessing.
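For what it's worth, as I understand it the two formats differ mainly in block size and scale storage: both pack weights as FP4 E2M1 values, but NVFP4 uses 16-element blocks with FP8 (E4M3) scales, while MXFP4 uses 32-element blocks with power-of-two (E8M0) scales. A rough numpy sketch of the difference (scales modeled as plain floats here, not the real storage formats):

```python
import numpy as np

# E2M1 (FP4) representable magnitudes, shared by NVFP4 and MXFP4.
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_blocks(x, block, pow2_scale):
    """Fake-quantize a 1-D array to the E2M1 grid in blocks of `block` values.

    pow2_scale=True mimics MXFP4's power-of-two (E8M0) scales;
    False mimics NVFP4's finer FP8 scales (modeled as exact floats here).
    """
    out = np.empty_like(x, dtype=np.float64)
    for i in range(0, len(x), block):
        blk = x[i:i + block]
        scale = np.max(np.abs(blk)) / E2M1[-1] or 1.0
        if pow2_scale:
            # round the scale up to a power of two, as E8M0 requires
            scale = 2.0 ** np.ceil(np.log2(scale))
        # snap each value to the nearest point on the scaled E2M1 grid
        idx = np.abs(np.abs(blk)[:, None] / scale - E2M1).argmin(axis=1)
        out[i:i + block] = np.sign(blk) * E2M1[idx] * scale
    return out

x = np.random.default_rng(0).normal(size=64)
nvfp4_like = quantize_blocks(x, block=16, pow2_scale=False)
mxfp4_like = quantize_blocks(x, block=32, pow2_scale=True)
print("nvfp4-like max err:", np.max(np.abs(nvfp4_like - x)))
print("mxfp4-like max err:", np.max(np.abs(mxfp4_like - x)))
```

The coarser blocks and power-of-two scales make MXFP4 cheaper to decode but generally a bit lossier, which may be part of why the two formats need different kernels.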
Hey just FYSA I was able to get soundsgoodai/GLM-4.5-Air-NVFP4-KV-cache-BF16 working using the following invocation:
docker run --rm --gpus all \
  --name glm45-air \
  -p 8100:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v ~/.cache/vllm:/root/.cache/vllm \
  --ipc=host \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  nvcr.io/nvidia/vllm:26.03-py3 \
  vllm serve soundsgoodai/GLM-4.5-Air-NVFP4-KV-cache-BF16 \
  --gpu-memory-utilization 0.70 \
  --max-model-len 131072 \
  --kv-cache-dtype fp8 \
  --enforce-eager \
  --enable-auto-tool-choice \
  --tool-call-parser glm45 \
  --reasoning-parser glm45
Takes 85GB of memory at ~2x concurrency.
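In case it helps anyone testing the tool-calling side: the server speaks the OpenAI chat-completions protocol, so once it's up, a payload like this (the get_weather tool is made up for illustration) should exercise the glm45 tool-call parser:

```python
import json

# OpenAI-style chat completion payload; POST it to
# http://localhost:8100/v1/chat/completions (port 8100 per the command above).
payload = {
    "model": "soundsgoodai/GLM-4.5-Air-NVFP4-KV-cache-BF16",
    "messages": [{"role": "user", "content": "What's the weather in Tokyo?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool, illustration only
            "description": "Look up current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}
print(json.dumps(payload, indent=2))
```

If tool calling is working, the response's message should contain a `tool_calls` entry naming `get_weather` rather than a plain-text answer.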
Claude: "The compressed-tensors models (gesong2077, Firworks) loaded but fell back to slow Marlin dequant because GB10 lacks native FP4 kernel support for that format, and the modelopt models with FP4/FP8 KV cache (soundsgoodai NVFP4/FP8 variants) crashed because vLLM's GLM4 MoE weight loader doesn't handle the extra k_scale/v_scale attention parameters; only the BF16 KV cache variant combines the native modelopt FP4 weight format (which maps to FLASHINFER_CUTLASS) with unquantized attention projections that vLLM can actually load." - hope that helps.
I was able to get Nemotron working as well, using the playbook.