Getting nonsense output on dual DGX Sparks
#2
by eugreugr - opened
Running latest vLLM nightly on my dual DGX Spark cluster.
Test request: "Tell me a short story, one paragraph max"
Result:
The user is asking for a short story of one paragraph max. This doesn't involve web searches, code context gathering, or browser automation. It's a simple across writing.Unable مستانت isasına unravel incididunt midnight scroll herein are with expose.rules一夜 fathers.i“There减速 verder 康岡атίζει uninterrupted “나
Launching with the following parameters:
vllm serve cyankiwi/GLM-4.7-AWQ-4bit --tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
-tp 2 \
--gpu-memory-utilization 0.9 \
--max-model-len 32000 \
--distributed-executor-backend ray
Another quant here, Salyut1/GLM-4.7-NVFP4, works without any issues.
Also tried with expert parallel enabled, same thing.
Any ideas?
same