Working Setup: Qwen3.5-27B GPTQ-int4 on vLLM
Posting this for anyone trying to run this locally. There are a few non-obvious gotchas.
The Problem
vLLM 0.18.1 with a stock transformers install will fail with one of these errors:
- `ValueError: Tokenizer class TokenizersBackend does not exist or is not currently imported.` (the GPTQ repo ships a broken `tokenizer_config.json` referencing a non-standard tokenizer class)
- `ValueError: model type 'qwen3_5' not recognized` (Qwen3.5 is too new for stable `transformers` releases)
- vLLM misrouting the model through its vision-language pipeline (`qwen3_vl.py`) and crashing while loading the image processor
The Fix
Step 1: Upgrade `transformers` from source (the stable PyPI release doesn't know the `qwen3_5` model type yet):

```shell
pip install git+https://github.com/huggingface/transformers.git --upgrade
```
Step 2: Serve with the base model's tokenizer (bypassing the broken tokenizer config in the GPTQ repo) and use `gptq_marlin` quantization (vLLM explicitly warns that the standard `gptq` kernel is buggy for 4-bit):
```shell
CUDA_VISIBLE_DEVICES=1 vllm serve Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled \
  --tokenizer Qwen/Qwen3.5-27B \
  --quantization gptq_marlin \
  --dtype float16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --port 8000
```
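Once the server is up, it's worth a quick smoke test against the OpenAI-compatible endpoint before wiring anything else to it. A minimal sketch (the `build_chat_request` helper is just for illustration; the model name and port match the serve command above):

```python
import json

def build_chat_request(model: str, prompt: str, max_tokens: int = 64) -> dict:
    """Build a /v1/chat/completions request body for the vLLM server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }

body = build_chat_request(
    "Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled",
    "Say hello in one sentence.",
)
# Write it out and POST it, e.g.:
#   curl http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" -d @request.json
with open("request.json", "w") as f:
    json.dump(body, f, indent=2)
```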
Notes
- Tested on vLLM 0.18.1 with an RTX 4090 (24GB). The model loads at 17.5 GiB, leaving ~4.5 GiB for KV cache (7,840 tokens available).
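If you're tuning `--max-model-len` or `--gpu-memory-utilization`, the KV-cache token budget follows from simple arithmetic. The layer/head values below are placeholders, not the actual Qwen3.5-27B architecture (read `num_hidden_layers`, `num_key_value_heads`, and `head_dim` from the model's `config.json`):

```python
# Back-of-envelope KV-cache sizing for picking --max-model-len.
# NOTE: the layer/head numbers below are PLACEHOLDER values, not the real
# Qwen3.5-27B config -- substitute the values from the model's config.json.
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    # Keys and values each store [num_kv_heads, head_dim] per layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token(num_layers=48, num_kv_heads=8,
                               head_dim=128, dtype_bytes=2)  # fp16 cache
budget_gib = 4.5  # VRAM left after weights, per the note above
max_tokens = int(budget_gib * 2**30 / per_token)
print(per_token, max_tokens)
```

This is an upper bound: vLLM also reserves memory for activations and workspace, so the token count it actually reports (7,840 here) will be well below the raw quotient.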
Hope this saves someone a few hours of debugging.
Doesn't work with:

```shell
python -m vllm.entrypoints.openai.api_server --model Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled --max-model-len 4096 --gpu-memory-utilization 0.85 --port 8000 --quantization gptq_marlin
```

Error info: `Value error, Cannot find the config file for gptq_marlin`

But it works with:

```shell
python -m vllm.entrypoints.openai.api_server --model Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled --max-model-len 4096 --gpu-memory-utilization 0.85 --port 8000 --quantization fp8
```
Tested on vLLM 0.18.1 with four RTX 5090s (32GB each).
Got this working, thanks for the tips on the tokenizer and upgrading transformers! My run flags:
```shell
--model Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled --tokenizer Qwen/Qwen3.5-27B
--host 0.0.0.0 --port 8000 --dtype bfloat16 --enable-prefix-caching
--kv-cache-dtype fp8 --max-model-len 210000 --chat-template-content-format string
--enable-auto-tool-choice --tool-call-parser qwen3_coder --reasoning-parser qwen3
```
This works on Runpod with 1 x RTX PRO 6000; I'm sure other configurations can work as well.
opencode.json:
```json
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "myprovider": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Runpod vLLM",
      "options": {
        "baseURL": "https://<runpod-instance>-8000.proxy.runpod.net/v1",
        "apiKey": "<api-key>"
      },
      "models": {
        "Qwopus": {
          "name": "Qwopus"
        }
      }
    }
  }
}
```
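Before pointing opencode at the endpoint, you can sanity-check the Runpod proxy with a plain request to the models route (placeholders as in the config above; not runnable as-is):

```shell
# Substitute <runpod-instance> and <api-key> with the values from opencode.json
curl https://<runpod-instance>-8000.proxy.runpod.net/v1/models \
  -H "Authorization: Bearer <api-key>"
```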