Working Setup: Qwen3.5-27B GPTQ-int4 on vLLM

#68
by elotech - opened

Posting this for anyone trying to run this locally. There are a few non-obvious gotchas.

The Problem

vLLM 0.18.1 with a stock transformers install will fail with one of these errors:

  • ValueError: Tokenizer class TokenizersBackend does not exist or is not currently imported. (the GPTQ repo ships a broken tokenizer_config.json referencing a non-standard tokenizer class)
  • ValueError: model type 'qwen3_5' not recognized (Qwen3.5 is too new for stable transformers releases)
  • vLLM misrouting the model through its vision-language pipeline (qwen3_vl.py) and crashing on image processor loading

The Fix

Step 1: Upgrade transformers from source (the stable PyPI release doesn't know qwen3_5 yet):

pip install git+https://github.com/huggingface/transformers.git --upgrade

Step 2: Serve with the base model's tokenizer (bypassing the broken tokenizer in the GPTQ repo) and use gptq_marlin quantization (vLLM explicitly warns that the standard gptq kernel is buggy for 4-bit):

CUDA_VISIBLE_DEVICES=1 vllm serve Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled \
  --tokenizer Qwen/Qwen3.5-27B \
  --quantization gptq_marlin \
  --dtype float16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --port 8000
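Once the server is up, a quick way to confirm the OpenAI-compatible endpoint responds is a minimal chat-completions request. This is a sketch assuming the default localhost:8000; the model name must match the served model ID exactly:

```python
import json

# Minimal chat-completions payload for vLLM's OpenAI-compatible endpoint.
payload = {
    "model": "Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled",
    "messages": [{"role": "user", "content": "Say hello in one word."}],
    "max_tokens": 16,
    "temperature": 0.0,
}
body = json.dumps(payload)
print(body)

# To actually send it (requires the server to be running):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8000/v1/chat/completions",
#     data=body.encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
```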

Notes

  • Tested on vLLM 0.18.1 with an RTX 4090 (24GB). Model loads at 17.5 GiB, leaving ~4.5 GiB for KV cache (7,840 tokens available).
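The available-token number can be sanity-checked with back-of-envelope arithmetic. The sketch below uses placeholder architecture values (layers, KV heads, head dim are assumptions, not confirmed Qwen3.5-27B numbers; read them from the model's config.json):

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache per token: a K and a V tensor per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

# Placeholder architecture values (NOT confirmed for Qwen3.5-27B):
layers, kv_heads, head_dim = 48, 8, 128

per_token = kv_bytes_per_token(layers, kv_heads, head_dim)  # fp16 -> 2 bytes/elem
free_gib = 4.5  # memory left after loading weights, per the note above
tokens = int(free_gib * 1024**3 / per_token)
print(per_token, tokens)
# The reported 7,840 tokens implies a larger per-token footprint than these
# placeholders give, so plug in the real config values before trusting this.
```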

Hope this saves someone a few hours of debugging.

It does not work for me with:
python -m vllm.entrypoints.openai.api_server --model Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled --max-model-len 4096 --gpu-memory-utilization 0.85 --port 8000 --quantization gptq_marlin
Error: Value error, Cannot find the config file for gptq_marlin

But it does work with:
python -m vllm.entrypoints.openai.api_server --model Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled --max-model-len 4096 --gpu-memory-utilization 0.85 --port 8000 --quantization fp8
Tested on vLLM 0.18.1 with four RTX 5090s (32GB each).

Got this working, thanks for the tips on the tokenizer and upgrading transformers! My run flags:

--model Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled --tokenizer Qwen/Qwen3.5-27B 
--host 0.0.0.0 --port 8000 --dtype bfloat16 --enable-prefix-caching 
--kv-cache-dtype fp8 --max-model-len 210000 --chat-template-content-format string 
--enable-auto-tool-choice --tool-call-parser qwen3_coder --reasoning-parser qwen3

This works on Runpod with 1 x RTX PRO 6000; I'm sure other configurations can work as well.

opencode.json:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "myprovider": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Runpod vLLM",
      "options": {
        "baseURL": "https://<runpod-instance>-8000.proxy.runpod.net/v1",
        "apiKey": "<api-key>"
      },
      "models": {
        "Qwopus": {
          "name": "Qwopus"
        }
      }
    }
  }
}
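Before pointing opencode at the endpoint, it's worth checking that the config parses and that the baseURL ends in /v1 (where vLLM's OpenAI-compatible routes live). A minimal sanity check, with the config inlined here for illustration:

```python
import json

# The opencode.json from above, inlined as a string for a self-contained check.
raw = '''{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "myprovider": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Runpod vLLM",
      "options": {
        "baseURL": "https://<runpod-instance>-8000.proxy.runpod.net/v1",
        "apiKey": "<api-key>"
      },
      "models": {
        "Qwopus": {
          "name": "Qwopus"
        }
      }
    }
  }
}'''

cfg = json.loads(raw)
provider = cfg["provider"]["myprovider"]
assert provider["options"]["baseURL"].endswith("/v1"), "vLLM serves OpenAI routes under /v1"
assert "Qwopus" in provider["models"]
print("config OK")
```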
