Avg Draft acceptance rate: 0.0%

#1
by repne - opened

No matter what I do, I can't get a single token accepted. I'm running on 2x RTX 5090 with vLLM compiled from main.

This is one of the many sets of arguments I have tried (note: --language-model-only was accidentally passed twice):
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True vllm serve --tensor-parallel-size 2 --reasoning-parser qwen3 Sehyo/Qwen3.5-27B-NVFP4 --gpu-memory-utilization 0.85 --max-model-len 250k --max-num-seqs 1 --language-model-only -O3 --speculative-config '{"method":"mtp","num_speculative_tokens":2}' --trust-remote-code

Owner

The method is qwen3_next_mtp, not mtp.

qwen3_next_mtp was the first method I tried, but I get a warning that it is deprecated and that mtp should be used instead. Either way, I hit the same issue with qwen3_next_mtp.
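For reference, a minimal launch sketch with the MTP drafter enabled. Model name, TP size, and context length are taken from the commands in this thread; flag spellings should be checked against your vLLM build, since speculative-decoding options have changed across versions:

```shell
# Minimal sketch: vLLM serve with the Qwen3-Next MTP speculative decoder.
# Only one --language-model-only flag; method name per the owner's reply.
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True vllm serve Sehyo/Qwen3.5-27B-NVFP4 \
  --tensor-parallel-size 2 \
  --max-model-len 262144 \
  --trust-remote-code \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
```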

Same here; 0% acceptance rate running with MTP at num_speculative_tokens of 4 or 1.

Launch args:
non-default args: {'model_tag': 'Sehyo/Qwen3.5-27B-NVFP4', 'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'host': '0.0.0.0', 'port': 5010, 'api_key': ['sk-***REDACTED***'], 'model': 'Sehyo/Qwen3.5-27B-NVFP4', 'trust_remote_code': True, 'max_model_len': 262144, 'served_model_name': ['qwen35-27b-nvfp4'], 'attention_backend': 'flashinfer', 'reasoning_parser': 'qwen3', 'gpu_memory_utilization': 0.5, 'enable_prefix_caching': True, 'max_num_batched_tokens': 4096, 'max_num_seqs': 32, 'speculative_config': {'method': 'qwen3_next_mtp', 'num_speculative_tokens': 2}}

The same environment locally (CUDA 13.0, vLLM nightly cu130) gets really good draft acceptance on your 122B model: 75-80%. The MTP-4 122B NVFP4 model runs at around 140 t/s on my RTX 6000. So it seems like something is different with this model.
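For anyone wanting to check the acceptance rate themselves rather than eyeball the log line: it is just accepted draft tokens divided by drafted tokens, and both show up as counters on vLLM's Prometheus /metrics endpoint. A small sketch that parses a scrape; the metric names below are assumptions and have changed across vLLM releases, so verify them against your server's /metrics output:

```python
import re

def draft_acceptance_rate(metrics_text: str) -> float:
    """Compute draft acceptance rate from a Prometheus-format metrics scrape.

    Assumed counter names (verify against your vLLM version):
      vllm:spec_decode_num_draft_tokens_total
      vllm:spec_decode_num_accepted_tokens_total
    """
    def read(name: str) -> float:
        # Match "name{optional,labels} value" at the start of a line.
        m = re.search(rf"^{name}(?:{{[^}}]*}})?\s+([0-9.eE+-]+)$",
                      metrics_text, re.M)
        return float(m.group(1)) if m else 0.0

    accepted = read("vllm:spec_decode_num_accepted_tokens_total")
    drafted = read("vllm:spec_decode_num_draft_tokens_total")
    return accepted / drafted if drafted else 0.0

# Example scrape fragment with made-up numbers:
sample = """\
vllm:spec_decode_num_draft_tokens_total 1000
vllm:spec_decode_num_accepted_tokens_total 780
"""
print(f"{draft_acceptance_rate(sample):.1%}")  # 78.0%
```

A healthy MTP setup should land in the 60-80% range like the 122B model above; a flat 0.0% usually means the drafter's outputs are never matching the target model at all, not just a weak drafter.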

Have you gotten it to work @Sehyo ? If so - any advice?
