Garbled output when model responds
#2
by slappa - opened
Hi, I'm having this issue with this model
Example output:
#### 5. **Deployment Example**!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
The chat hangs with the model producing '!!!' output. Am I missing a setting? My vllm serve args are as follows:
docker run -d --rm \
--name "${CONTAINER}" \
--group-add=video \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-v "${HF_CACHE_HOST}:/root/.cache/huggingface" \
--env "HF_TOKEN=${HF_TOKEN:-}" \
--env "GCN_ARCH_NAME=gfx1100" \
--env "HSA_ENABLE_IPC_MODE_LEGACY=0" \
--env PYTORCH_ALLOC_CONF=graph_capture_record_stream_reuse:True,expandable_segments:True \
--env FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE \
--env VLLM_ROCM_USE_AITER=1 \
-p "${HOST_PORT}:8000" \
--ipc=host \
"${IMAGE}" \
--model "cyankiwi/Qwen3.5-27B-AWQ-BF16-INT4" \
--tensor-parallel-size 2 \
--max-model-len 65536 \
--gpu-memory-utilization 0.94 \
--enable-prompt-tokens-details \
--dtype float16 \
--enable-auto-tool-choice \
--no-enable-prefix-caching \
--tool-call-parser qwen3_xml \
--reasoning-parser qwen3 \
--language-model-only \
--mamba-cache-mode align \
--quantization compressed-tensors
Running on dual 7900XTX
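In case it helps anyone reproduce: a minimal chat-completions request against the mapped port looks roughly like this. The host, port, and prompt here are placeholders I've filled in (not from the report above), so adjust them to match your `-p` mapping.

```shell
# Placeholder port: match the ${HOST_PORT} used in the docker run above.
HOST_PORT=8000

# Minimal OpenAI-compatible chat request body; prompt text is arbitrary.
PAYLOAD='{"model": "cyankiwi/Qwen3.5-27B-AWQ-BF16-INT4", "messages": [{"role": "user", "content": "Show me a deployment example."}], "max_tokens": 256}'
echo "$PAYLOAD"

# Send it to the server; --max-time keeps a hung generation from blocking forever.
curl -s --max-time 30 "http://localhost:${HOST_PORT}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD"
```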
Thanks for letting me know. Which vLLM Docker version are you using? I would recommend using the latest nightly or building from source.
NP.
Using 0.18.1rc1.dev218+gfafca38ad, which is a recent nightly from within the past 24 hours or so.
ROCm version is 7.2
Did some playing around. I think the difference maker was explicitly forcing the TRITON_ATTN attention backend. Testing with these settings, I'm not seeing any issues now:
echo "[step] starting container"
docker run -d --rm \
--name "${CONTAINER}" \
--group-add=video \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-v "${HF_CACHE_HOST}:/root/.cache/huggingface" \
--env "HF_TOKEN=${HF_TOKEN:-}" \
--env "GCN_ARCH_NAME=gfx1100" \
--env "HSA_ENABLE_IPC_MODE_LEGACY=0" \
--env PYTORCH_ALLOC_CONF=graph_capture_record_stream_reuse:True,expandable_segments:True \
-p "${HOST_PORT}:8000" \
--ipc=host \
"${IMAGE}" \
--model "cyankiwi/Qwen3.5-27B-AWQ-BF16-INT4" \
--dtype float16 \
--tensor-parallel-size 2 \
--max-model-len 8192 \
--max-num-seqs 8 \
--block-size 32 \
--max-num-batched-tokens 2048 \
--gpu-memory-utilization 0.90 \
--attention-backend TRITON_ATTN \
--enable-prefix-caching \
--enable-prompt-tokens-details \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--chat-template-content-format string \
--language-model-only \
--mamba-cache-mode align \
--quantization compressed-tensors
Will see how I progress here as I try to increase the context length.
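A cheap smoke test while bumping the context: send a short request and grep the reply for the runaway '!!!' pattern from the original post. This is just a sketch; the port is a placeholder to match your `-p` mapping, and it assumes `jq` is installed on the host.

```shell
# Placeholder port: match the ${HOST_PORT} used in the docker run above.
HOST_PORT=8000

# Ask for a short completion.
RESPONSE=$(curl -s --max-time 30 "http://localhost:${HOST_PORT}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "cyankiwi/Qwen3.5-27B-AWQ-BF16-INT4", "messages": [{"role": "user", "content": "Say hello in one sentence."}], "max_tokens": 64}')

# Classify the result: no reply, garbled ('!' runs), or clean.
if [ -z "$RESPONSE" ]; then
  RESULT="NO RESPONSE"
elif printf '%s' "$RESPONSE" | jq -r '.choices[0].message.content // empty' | grep -qE '!{20,}'; then
  RESULT="GARBLED"
else
  RESULT="OK"
fi
echo "$RESULT"
```

With no server listening this prints "NO RESPONSE" rather than a false "OK", so it is safe to drop into a loop over increasing --max-model-len values.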