MTP doesn't seem to work

#1
by Minachist - opened

First of all, thank you for releasing this quant. I'm currently using 2x RTX 3090s, and this is exactly what I needed.

I'm using vLLM (in a Podman container, `vllm/vllm-openai:cu130-nightly`) with these args:

```
--served-model-name vLLM --tensor-parallel-size 2 --max-model-len 40000 --max-num-batched-tokens 16384 --max-num-seqs 1 --gpu-memory-utilization 0.95 --block-size 32 -O3 --trust-remote-code --model cpatonn/Qwopus3.5-27B-v3-AWQ-BF16-INT8 --language-model-only --tool-call-parser qwen3_coder --reasoning-parser qwen3 --enable-auto-tool-choice --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' --mamba-cache-mode all --enable-prefix-caching --enable-chunked-prefill
```

The problem is that the reported average draft-token acceptance rate is always 0.0%. I suspect this is not a vLLM issue, since other models with the same architecture, such as Qwen3.5-27B (and its quants like cyankiwi/Qwen3.5-27B-AWQ-BF16-INT8) and Qwopus3.5-9B-v3, have working MTP.

Thank you for using my model. My Qwopus3.5-27B-v3 quants do not have MTP layers, because the original model, Jackrong/Qwopus3.5-27B-v3, does not include an MTP implementation. With no draft layers to load, vLLM's speculative decoding has nothing to propose, which is why the acceptance rate stays at 0.0%.
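One way to confirm this yourself is to inspect the checkpoint's weight names (e.g. in `model.safetensors.index.json`) for MTP-layer keys. A minimal sketch, assuming MTP weights carry an `mtp` component in their names (the weight names below are illustrative, not taken from the actual repositories):

```python
# Sketch: detect whether a list of checkpoint weight names contains MTP layers.
# Assumption: MTP (multi-token prediction) weights have "mtp" as a name
# component, as in Qwen3-Next-style checkpoints. Names here are hypothetical.

def has_mtp_weights(weight_names):
    """Return True if any weight name looks like an MTP layer."""
    return any(".mtp." in name or name.startswith("mtp.") for name in weight_names)

# Example with made-up weight names:
with_mtp = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.mtp.fc.weight",  # hypothetical MTP projection
]
without_mtp = [
    "model.layers.0.self_attn.q_proj.weight",
    "lm_head.weight",
]

print(has_mtp_weights(with_mtp))     # True
print(has_mtp_weights(without_mtp))  # False
```

If the check comes back empty, a speculative-decoding config pointing at MTP has no draft weights to use.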

That makes perfect sense! Thank you for letting me know; I had assumed all models in the Qwopus family had MTP.
I'll drop the speculative decoding arguments from my vLLM setup, then. Other than that, the quant runs very well on my hardware. I really appreciate your work!
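For reference, the adjusted invocation would simply omit the speculative-decoding flag, with all other args unchanged from the post above:

```shell
# Same vLLM args as above, minus --speculative-config,
# since this quant has no MTP draft layers to speculate with:
--served-model-name vLLM --tensor-parallel-size 2 --max-model-len 40000 \
  --max-num-batched-tokens 16384 --max-num-seqs 1 \
  --gpu-memory-utilization 0.95 --block-size 32 -O3 --trust-remote-code \
  --model cpatonn/Qwopus3.5-27B-v3-AWQ-BF16-INT8 --language-model-only \
  --tool-call-parser qwen3_coder --reasoning-parser qwen3 \
  --enable-auto-tool-choice --mamba-cache-mode all \
  --enable-prefix-caching --enable-chunked-prefill
```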

Minachist changed discussion status to closed
