Thanks for this - but I needed some patches.

#1
by retowyss - opened

Thanks for the work - I got it working on 2x Pro 6k.

But I had to have Claude patch some stuff, and I added some of the args (like tool-choice and the reasoning parser), plus accounting for HF_HUB_CACHE.

The patch script needed a few adjustments as well to get vision working; I can submit a PR.

```yaml
services:
  vllm:
    image: vllm/vllm-openai:cu130-nightly
    container_name: vllm-qwen35-reap
    shm_size: 16g
    ipc: host
    volumes:
      - ${HF_HUB_CACHE}:/root/.cache/huggingface
      - ./patch_vllm.py:/patch_vllm.py
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - HF_TOKEN=${HF_TOKEN:-}
      - HF_HUB_CACHE=/root/.cache/huggingface
      - VLLM_USE_MODELSCOPE=false
      - PYTHONUNBUFFERED=1
      - TRANSFORMERS_USE_FAST_IMAGE_PROCESSOR=0
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0", "1"]
              capabilities: [gpu]
    entrypoint: ["/bin/bash", "-c"]
    command:
      - |
        pip install conch-triton-kernels qwen-vl-utils && \
        python3 /patch_vllm.py && \
        python3 -m vllm.entrypoints.openai.api_server \
          --model atbender/Qwen3.5-REAP-262B-A17B-W4A16 \
          --dtype bfloat16 \
          --tensor-parallel-size 2 \
          --gpu-memory-utilization 0.90 \
          --max-num-batched-tokens 16384 \
          --trust-remote-code \
          --reasoning-parser qwen3 \
          --mm-encoder-tp-mode data \
          --enable-prefix-caching \
          --enable-auto-tool-choice \
          --tool-call-parser qwen3_coder \
          --port 8000 \
          --served-model-name Qwen3.5-REAP-262B-A17B-W4A16
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:8000/health || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 360
      start_period: 900s
```
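For anyone trying this: once the healthcheck passes, the container serves the standard OpenAI-compatible API on port 8000. A minimal sketch of a request, assuming the compose file above (the model name must match `--served-model-name`; the actual curl call is left commented out since it needs the running container):

```shell
# Build a chat-completions request body for the OpenAI-compatible endpoint.
# "Qwen3.5-REAP-262B-A17B-W4A16" matches --served-model-name in the compose command.
BODY='{"model": "Qwen3.5-REAP-262B-A17B-W4A16", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'
echo "$BODY"

# Send it once the container is healthy:
# curl -s http://localhost:8000/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d "$BODY"
```

With `--enable-auto-tool-choice` and `--tool-call-parser qwen3_coder` set, the same endpoint also accepts a `tools` array in the request body.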

Hey. Thanks for reaching out about this. PRs are most welcome! πŸ™‚

I've submitted my files, settings, and notes. Sorry for the mess; I was lazy and did it through the web interface, and accidentally submitted an empty PR 😥

Merged!

There may be some stray bloat in my docker compose snippet; I will test what can be discarded. I haven't tested my patch with CPU offload, though. Did you have a chance to run it?
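For context on the CPU-offload question: vLLM exposes weight offload to system RAM via its `--cpu-offload-gb` engine argument. An untested sketch of how it would slot into the serve flags above (the 32 GB figure is an arbitrary placeholder, not a tested value):

```
# Appended to the api_server flags in the compose command above (untested):
--cpu-offload-gb 32
```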
