Garbled output when model responds
#2
by slappa - opened
Hi, I'm having this issue with this model
Example output:
#### 5. **Deployment Example**!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
The chat hangs with the model producing '!!!' output. Am I missing a setting? My vllm serve args are as follows:
docker run -d --rm \
--name "${CONTAINER}" \
--group-add=video \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-v "${HF_CACHE_HOST}:/root/.cache/huggingface" \
--env "HF_TOKEN=${HF_TOKEN:-}" \
--env "GCN_ARCH_NAME=gfx1100" \
--env "HSA_ENABLE_IPC_MODE_LEGACY=0" \
--env PYTORCH_ALLOC_CONF=graph_capture_record_stream_reuse:True,expandable_segments:True \
--env FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE \
--env VLLM_ROCM_USE_AITER=1 \
-p "${HOST_PORT}:8000" \
--ipc=host \
"${IMAGE}" \
--model "cyankiwi/Qwen3.5-27B-AWQ-BF16-INT4" \
--tensor-parallel-size 2 \
--max-model-len 65536 \
--gpu-memory-utilization 0.94 \
--enable-prompt-tokens-details \
--dtype float16 \
--enable-auto-tool-choice \
--no-enable-prefix-caching \
--tool-call-parser qwen3_xml \
--reasoning-parser qwen3 \
--language-model-only \
--mamba-cache-mode align \
--quantization compressed-tensors
Running on dual 7900XTX
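In case it helps anyone reproduce: a minimal chat-completions request against the mapped port looks roughly like this. The host, port, and prompt here are placeholders I've filled in (not from the report above), so adjust them to match your `-p` mapping.

```shell
# Placeholder port: match the ${HOST_PORT} used in the docker run above.
HOST_PORT=8000

# Minimal OpenAI-compatible chat request body; prompt text is arbitrary.
PAYLOAD='{"model": "cyankiwi/Qwen3.5-27B-AWQ-BF16-INT4", "messages": [{"role": "user", "content": "Show me a deployment example."}], "max_tokens": 256}'
echo "$PAYLOAD"

# Send it to the server; --max-time keeps a hung generation from blocking forever.
curl -s --max-time 30 "http://localhost:${HOST_PORT}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD"
```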
Thanks for letting me know. Which vLLM Docker version are you using? I would recommend using the latest nightly or building from source.
NP.
Using 0.18.1rc1.dev218+gfafca38ad, which is a recent nightly from within the past 24 hours or so.
ROCm version is 7.2
Did some playing around. I think the difference maker was explicitly forcing the TRITON_ATTN attention backend. Testing with these settings, I'm not seeing any issues now:
echo "[step] starting container"
docker run -d --rm \
--name "${CONTAINER}" \
--group-add=video \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-v "${HF_CACHE_HOST}:/root/.cache/huggingface" \
--env "HF_TOKEN=${HF_TOKEN:-}" \
--env "GCN_ARCH_NAME=gfx1100" \
--env "HSA_ENABLE_IPC_MODE_LEGACY=0" \
--env PYTORCH_ALLOC_CONF=graph_capture_record_stream_reuse:True,expandable_segments:True \
-p "${HOST_PORT}:8000" \
--ipc=host \
"${IMAGE}" \
--model "cyankiwi/Qwen3.5-27B-AWQ-BF16-INT4" \
--dtype float16 \
--tensor-parallel-size 2 \
--max-model-len 8192 \
--max-num-seqs 8 \
--block-size 32 \
--max-num-batched-tokens 2048 \
--gpu-memory-utilization 0.90 \
--attention-backend TRITON_ATTN \
--enable-prefix-caching \
--enable-prompt-tokens-details \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--chat-template-content-format string \
--language-model-only \
--mamba-cache-mode align \
--quantization compressed-tensors
Will see how I progress here as I try to increase the context length.
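A cheap smoke test while bumping the context: send a short request and grep the reply for the runaway '!!!' pattern from the original post. This is just a sketch; the port is a placeholder to match your `-p` mapping, and it assumes `jq` is installed on the host.

```shell
# Placeholder port: match the ${HOST_PORT} used in the docker run above.
HOST_PORT=8000

# Ask for a short completion.
RESPONSE=$(curl -s --max-time 30 "http://localhost:${HOST_PORT}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "cyankiwi/Qwen3.5-27B-AWQ-BF16-INT4", "messages": [{"role": "user", "content": "Say hello in one sentence."}], "max_tokens": 64}')

# Classify the result: no reply, garbled ('!' runs), or clean.
if [ -z "$RESPONSE" ]; then
  RESULT="NO RESPONSE"
elif printf '%s' "$RESPONSE" | jq -r '.choices[0].message.content // empty' | grep -qE '!{20,}'; then
  RESULT="GARBLED"
else
  RESULT="OK"
fi
echo "$RESULT"
```

With no server listening this prints "NO RESPONSE" rather than a false "OK", so it is safe to drop into a loop over increasing --max-model-len values.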