vLLM support
Running the nightly vLLM Docker images reports:
ValueError: GGUF model with architecture qwen35moe is not supported yet.
Running the nightly wheel gives the same error.
Running the same vLLM build with Qwen/Qwen3.5... (safetensors) does work, so it looks like some work is needed before the provided GGUFs will work with vLLM.
Maybe it's the same problem as in https://huggingface.co/unsloth/Qwen3.5-27B-GGUF/discussions/12 ?
"The current Qwen3.5-27B-Q3_K_M.gguf uses qwen35 as the architecture name, but vLLM and Transformers expect qwen3_5 (matching the native model config.json). This mismatch causes a RuntimeError during loading."
working great in vLLM nightly for me
I'd recommend that, or use llama.cpp; it works great there for me too. I'm on a 4060 Ti with 64 GB of RAM at half offload, getting 10 tok/s.
What did you do? I still can't get it working. Can you list some exact instructions?
If you can, I would really appreciate instructions as well for using the GGUF in vLLM. Thank you!
All I do is run ./llama-server -m Qwen3.5-35B-A3B-Q3_K_M.gguf --host 0.0.0.0 --mmproj mmproj-BF16.gguf --n-gpu-layers 5
Use llama-server from the llama.cpp releases, or just build it yourself: https://github.com/ggml-org/llama.cpp/releases/tag/b8230
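Once llama-server is up, it exposes an OpenAI-compatible API you can query directly. A small sketch of the request body, assuming the server above is listening on llama-server's default port 8080 (adjust the URL if you passed --port); the model name field is informational here:

```python
import json

# Chat request payload for llama-server's /v1/chat/completions endpoint.
payload = {
    "model": "Qwen3.5-35B-A3B-Q3_K_M",  # llama-server serves whatever -m loaded
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64,
}
body = json.dumps(payload)

# To actually send it (kept commented so the sketch runs without a server):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8080/v1/chat/completions",
#     data=body.encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
```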
Can you please share the steps for vLLM? I still can't find GGUF support for Qwen3.5!
I have encountered the same problem: llama.cpp works fine with the Qwen3.5-35B-A3B GGUF, but vLLM doesn't: "ValueError: GGUF model with architecture qwen35moe is not supported yet."
vLLM version: 0.17.2rc1.dev108+g4426447bb.d20260319
Hardware: Strix Halo