vLLM support

#15
by deece - opened

Running the nightly vLLM docker images reports:
ValueError: GGUF model with architecture qwen35moe is not supported yet.

Running the nightly wheel gives the same error.

Running the same vLLM build with Qwen/Qwen3.5... (safetensors) does work, so it looks like some work is needed before the provided GGUFs will work with vLLM.

Maybe it's the same problem as in https://huggingface.co/unsloth/Qwen3.5-27B-GGUF/discussions/12 ?

"The current Qwen3.5-27B-Q3_K_M.gguf uses qwen35 as the architecture name, but vLLM and Transformers expect qwen3_5 (matching the native model config.json). This mismatch causes a RuntimeError during loading."
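The name the loader complains about lives in the GGUF header as the `general.architecture` metadata key. As a quick way to see which name a given file actually carries, here is a minimal stdlib sketch that parses just enough of the header to read that key (it assumes the standard GGUF v3 layout and that `general.architecture` appears before any non-string key, which is the usual case; the helper name is my own):

```python
import io
import struct

GGUF_MAGIC = b"GGUF"
GGUF_TYPE_STRING = 8  # GGUF metadata value type for strings

def _read_string(f):
    # GGUF strings are a little-endian uint64 length followed by UTF-8 bytes
    (n,) = struct.unpack("<Q", f.read(8))
    return f.read(n).decode("utf-8")

def gguf_architecture(f):
    """Return the value of 'general.architecture' from a GGUF stream."""
    if f.read(4) != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    version, n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
    for _ in range(n_kv):
        key = _read_string(f)
        (vtype,) = struct.unpack("<I", f.read(4))
        if vtype != GGUF_TYPE_STRING:
            # this sketch only decodes string values; the arch key is
            # normally the very first entry, so we rarely get this far
            raise ValueError(f"stopping at non-string key {key!r}")
        value = _read_string(f)
        if key == "general.architecture":
            return value
    raise ValueError("general.architecture not found")

# Build a tiny in-memory GGUF header for demonstration (no real file needed)
buf = io.BytesIO()
buf.write(GGUF_MAGIC)
buf.write(struct.pack("<IQQ", 3, 0, 1))  # version 3, 0 tensors, 1 KV pair
key, val = b"general.architecture", b"qwen35moe"
buf.write(struct.pack("<Q", len(key)) + key)
buf.write(struct.pack("<I", GGUF_TYPE_STRING))
buf.write(struct.pack("<Q", len(val)) + val)
buf.seek(0)

print(gguf_architecture(buf))  # -> qwen35moe
```

If the string printed for your file differs from what vLLM/Transformers expect (e.g. `qwen35` vs. `qwen3_5`), that is exactly the mismatch described in the quote above.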

Working great in vLLM nightly for me.

I'd recommend that, or use llama.cpp; it works great in that for me too. I'm on a 4060 Ti with 64 GB of RAM at half offload, running 10 tok/s.

What did you do? I'm still unable to get it working. Can you list some exact instructions?

If you can, I would really appreciate instructions as well on using the GGUF in vLLM. Thank you!

All I do is run ./llama-server -m Qwen3.5-35B-A3B-Q3_K_M.gguf --host 0.0.0.0 --mmproj mmproj-BF16.gguf --n-gpu-layers 5

Use the llama-server binary from llama.cpp's releases, or just build it yourself: https://github.com/ggml-org/llama.cpp/releases/tag/b8230
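Once llama-server is up with the command above, it exposes an OpenAI-compatible HTTP API (by default on port 8080, unless overridden with --port). A minimal stdlib sketch of the request shape, assuming the default port (the actual network call is commented out so it only runs against a live server):

```python
import json
import urllib.request

# Chat request body in the OpenAI-compatible shape llama-server accepts;
# the "model" field is informational for a single-model llama-server.
payload = {
    "model": "Qwen3.5-35B-A3B-Q3_K_M.gguf",
    "messages": [{"role": "user", "content": "Say hello in one word."}],
    "max_tokens": 16,
}

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",  # assumed default port
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

Any OpenAI-compatible client (the official `openai` Python package pointed at http://localhost:8080/v1, curl, etc.) works the same way.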

Can you please share the steps for vLLM? I still can't find GGUF support for Qwen3.5!

I have encountered the same problem, llama.cpp works fine with Qwen3.5-35B-A3B GGUF, but vllm doesn't: "ValueError: GGUF model with architecture qwen35moe is not supported yet."

vllm version: 0.17.2rc1.dev108+g4426447bb.d20260319
hardware: strix halo
