Working on NVIDIA DGX Spark

#1
by bkmtech - opened

Here is my docker command to run this model on a single NVIDIA DGX Spark with a 128k context (`--max-model-len 131072`). Use the Hugging Face CLI to download the model and set the mount path accordingly.
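A sketch of the download step, assuming you have `huggingface_hub` installed (the target directory here is an example; point the `-v` mount below at wherever you place the weights):

```shell
# Download the model weights locally with the Hugging Face CLI,
# then mount that directory into the container as /model.
huggingface-cli download Banana-Bae/Qwen3-235B-A22B-Instruct-2507-REAP-nvfp4 \
  --local-dir ~/Documents/models/Qwen3-235B-A22B-Instruct-2507-REAP-nvfp4
```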

docker run -d --gpus all --network host \
  --name qwen3-235b-a22b-instruct-2507-reap-nvfp4-256k \
  -v /home/username/Documents/models/Qwen3-235B-A22B-Instruct-2507-REAP-nvfp4:/model \
  -v ~/.cache/flashinfer:/root/.cache/flashinfer \
  --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  --shm-size=64g \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  --entrypoint python3 \
  nvcr.io/nvidia/vllm:26.02-py3 \
  -m vllm.entrypoints.openai.api_server \
  --model /model \
  --tokenizer Banana-Bae/Qwen3-235B-A22B-Instruct-2507-REAP-nvfp4 \
  --dtype auto \
  --port 8000 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 131072 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --trust-remote-code \
  --enforce-eager \
  --kv-cache-dtype fp8 \
  --max-num-seqs 1 \
  --disable-custom-all-reduce
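Once the container is up, vLLM serves an OpenAI-compatible API on port 8000 (reachable on localhost thanks to `--network host`). A minimal stdlib-only sketch of a chat completion call; the `"/model"` model name matches the `--model` path passed to vLLM above:

```python
import json
import urllib.request

def build_chat_request(prompt, model="/model", max_tokens=256):
    """Build the JSON body for a /v1/chat/completions request."""
    return json.dumps({
        "model": model,  # matches the --model path given to vLLM
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode("utf-8")

def chat(prompt, url="http://localhost:8000/v1/chat/completions"):
    """Send a chat completion request and return the assistant's reply text."""
    req = urllib.request.Request(
        url,
        data=build_chat_request(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

Since `--max-num-seqs 1` is set, the server handles one sequence at a time, so keep client requests sequential.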
