Getting nonsense output on dual DGX Sparks

#2
by eugreugr - opened

Running latest vLLM nightly on my dual DGX Spark cluster.

Test request: "Tell me a short story, one paragraph max"
Result:

The user is asking for a short story of one paragraph max. This doesn't involve web searches, code context gathering, or browser automation. It's a simple across writing.Unable مستانت isasına unravel incididunt midnight scroll herein are with expose.rules一夜 fathers.i“There减速 verder 康岡атίζει uninterrupted “나

Launching with the following parameters:

vllm serve cyankiwi/GLM-4.7-AWQ-4bit --tool-call-parser glm47  \
       --reasoning-parser glm45   \
      --enable-auto-tool-choice   \
      -tp 2   \
      --gpu-memory-utilization 0.9   \
      --max-model-len 32000   \
      --distributed-executor-backend ray

Another quant here, Salyut1/GLM-4.7-NVFP4, works without any issues.

Also tried with expert parallel enabled, same thing.
Any ideas?

same

Sign up or log in to comment