MTP doesn't seem to work

#1
by Minachist - opened

First of all, thank you for releasing this quant. I'm currently using 2x RTX 3090s, and this is exactly what I needed.

I'm using vLLM (in a Podman container, `vllm/vllm-openai:cu130-nightly`) with these args:

```
--served-model-name vLLM --tensor-parallel-size 2 --max-model-len 40000 --max-num-batched-tokens 16384 --max-num-seqs 1 --gpu-memory-utilization 0.95 --block-size 32 -O3 --trust-remote-code --model cpatonn/Qwopus3.5-27B-v3-AWQ-BF16-INT8 --language-model-only --tool-call-parser qwen3_coder --reasoning-parser qwen3 --enable-auto-tool-choice --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' --mamba-cache-mode all --enable-prefix-caching --enable-chunked-prefill
```

The problem is that the reported average draft-token acceptance rate is always 0.0%. I suspect this is not a vLLM issue, since other models with the same architecture, such as Qwen3.5-27B (and its quants like cyankiwi/Qwen3.5-27B-AWQ-BF16-INT8) and Qwopus3.5-9B-v3, have working MTP.

Thank you for using my model. My Qwopus3.5-27B-v3 quants do not have MTP layers, because the original model, Jackrong/Qwopus3.5-27B-v3, does not include an MTP implementation. With no draft layers to load, vLLM's speculative decoding has nothing to propose, which is why the acceptance rate stays at 0.0%.
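One way to confirm this yourself is to inspect the checkpoint's weight names (e.g. in `model.safetensors.index.json`) for MTP-layer keys. A minimal sketch, assuming MTP weights carry an `mtp` component in their names (the weight names below are illustrative, not taken from the actual repositories):

```python
# Sketch: detect whether a list of checkpoint weight names contains MTP layers.
# Assumption: MTP (multi-token prediction) weights have "mtp" as a name
# component, as in Qwen3-Next-style checkpoints. Names here are hypothetical.

def has_mtp_weights(weight_names):
    """Return True if any weight name looks like an MTP layer."""
    return any(".mtp." in name or name.startswith("mtp.") for name in weight_names)

# Example with made-up weight names:
with_mtp = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.mtp.fc.weight",  # hypothetical MTP projection
]
without_mtp = [
    "model.layers.0.self_attn.q_proj.weight",
    "lm_head.weight",
]

print(has_mtp_weights(with_mtp))     # True
print(has_mtp_weights(without_mtp))  # False
```

If the check comes back empty, a speculative-decoding config pointing at MTP has no draft weights to use.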

That makes perfect sense! Thank you for letting me know; I had assumed all models in the Qwopus family had MTP.
I'll drop the speculative decoding arguments from my vLLM setup, then. Other than that, the quant runs very well on my hardware. I really appreciate your work!
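For reference, the adjusted invocation would simply omit the speculative-decoding flag, with all other args unchanged from the post above:

```shell
# Same vLLM args as above, minus --speculative-config,
# since this quant has no MTP draft layers to speculate with:
--served-model-name vLLM --tensor-parallel-size 2 --max-model-len 40000 \
  --max-num-batched-tokens 16384 --max-num-seqs 1 \
  --gpu-memory-utilization 0.95 --block-size 32 -O3 --trust-remote-code \
  --model cpatonn/Qwopus3.5-27B-v3-AWQ-BF16-INT8 --language-model-only \
  --tool-call-parser qwen3_coder --reasoning-parser qwen3 \
  --enable-auto-tool-choice --mamba-cache-mode all \
  --enable-prefix-caching --enable-chunked-prefill
```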

Minachist changed discussion status to closed
