Thanks for this - but I needed some patches.
#1
by retowyss - opened
Thanks for the work - I got it working on 2x Pro 6k.
But I had to have Claude patch some things, and I added a few args (tool-choice and the reasoning parser, plus accounting for HF_HUB_CACHE).
The patch script needed a few adjustments as well to get vision working; I can submit a PR.
```yaml
services:
  vllm:
    image: vllm/vllm-openai:cu130-nightly
    container_name: vllm-qwen35-reap
    shm_size: 16g
    ipc: host
    volumes:
      - ${HF_HUB_CACHE}:/root/.cache/huggingface
      - ./patch_vllm.py:/patch_vllm.py
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - HF_TOKEN=${HF_TOKEN:-}
      - HF_HUB_CACHE=/root/.cache/huggingface
      - VLLM_USE_MODELSCOPE=false
      - PYTHONUNBUFFERED=1
      - TRANSFORMERS_USE_FAST_IMAGE_PROCESSOR=0
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0", "1"]
              capabilities: [gpu]
    entrypoint: ["/bin/bash", "-c"]
    command:
      - |
        pip install conch-triton-kernels qwen-vl-utils && \
        python3 /patch_vllm.py && \
        python3 -m vllm.entrypoints.openai.api_server \
          --model atbender/Qwen3.5-REAP-262B-A17B-W4A16 \
          --dtype bfloat16 \
          --tensor-parallel-size 2 \
          --gpu-memory-utilization 0.90 \
          --max-num-batched-tokens 16384 \
          --trust-remote-code \
          --reasoning-parser qwen3 \
          --mm-encoder-tp-mode data \
          --enable-prefix-caching \
          --enable-auto-tool-choice \
          --tool-call-parser qwen3_coder \
          --port 8000 \
          --served-model-name Qwen3.5-REAP-262B-A17B-W4A16
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:8000/health || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 360
      start_period: 900s
```
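Once the healthcheck passes, the container serves the standard OpenAI-compatible API on port 8000. A minimal sketch of a chat-completion request payload for this setup (the base URL and model name come from the compose file above; the prompt is just an example, and you'd POST the body to `/chat/completions` with any HTTP client):

```python
import json

# Host port mapped in the compose file above.
BASE_URL = "http://localhost:8000/v1"

# Model name must match --served-model-name from the launch command.
payload = {
    "model": "Qwen3.5-REAP-262B-A17B-W4A16",
    "messages": [
        {"role": "user", "content": "Describe this image briefly."},
    ],
    "max_tokens": 256,
}

# Serialized request body, ready to POST to f"{BASE_URL}/chat/completions".
body = json.dumps(payload)
print(body)
```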
Hey. Thanks for reaching out about this. PRs are most welcome!
I've submitted my files, settings, and notes - sorry for the mess. I was lazy and did it through the web interface, and accidentally submitted an empty PR first.
Merged!
There may be some stray bloat in my docker compose snippet - I will test what can be discarded. I haven't tested my patch with CPU offload; did you have a chance to run it?
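(For reference, vLLM does expose a `--cpu-offload-gb` engine arg for offloading part of the model weights to CPU RAM; whether it plays well with this patched setup is untested. A hedged sketch of how the launch command above might be extended, with an illustrative value:)

```shell
# Untested with this patch: offload ~32 GiB of weights per GPU to CPU RAM.
python3 -m vllm.entrypoints.openai.api_server \
  --model atbender/Qwen3.5-REAP-262B-A17B-W4A16 \
  --tensor-parallel-size 2 \
  --cpu-offload-gb 32
```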