Getting nonsense output on dual DGX Sparks

by eugreugr - opened Dec 24, 2025

Dec 24, 2025

Running latest vLLM nightly on my dual DGX Spark cluster.

Test request: "Tell me a short story, one paragraph max"
Result:

The user is asking for a short story of one paragraph max. This doesn't involve web searches, code context gathering, or browser automation. It's a simple across writing.Unable مستانت isasına unravel incididunt midnight scroll herein are with expose.rules一夜 fathers.i“There减速 verder 康岡атίζει uninterrupted “나

Launching with the following parameters:

vllm serve cyankiwi/GLM-4.7-AWQ-4bit --tool-call-parser glm47  \
       --reasoning-parser glm45   \
      --enable-auto-tool-choice   \
      -tp 2   \
      --gpu-memory-utilization 0.9   \
      --max-model-len 32000   \
      --distributed-executor-backend ray

Another quant here, Salyut1/GLM-4.7-NVFP4, works without any issues.

Also tried with expert parallel enabled, same thing.
Any ideas?

aidendle94

Dec 24, 2025

same

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment