Avg Draft acceptance rate is low.
This is my launch script:
export CONTEXT_LENGTH=172144
export CUDA_DEVICE_WAIT_POLICY=YIELD
export VLLM_SLEEP_WHEN_IDLE=1
CUDA_DEVICE_ORDER=PCI_BUS_ID \
CUDA_VISIBLE_DEVICES=0 \
VLLM_USE_MODELSCOPE=true vllm serve \
/path_to/Qwen3.6-27B-AWQ \
--port 8000 \
--tensor-parallel-size 1 \
--max-model-len $CONTEXT_LENGTH \
--served-model-name qwen-medium \
--enable-auto-tool-choice \
--max-num-seqs 6 \
--gpu-memory-utilization 0.98 \
--reasoning-parser qwen3 \
--enable-prefix-caching \
--speculative-config '{"method": "dflash", "model": "/path_to/Qwen3.6-27B-DFlash", "num_speculative_tokens": 15}' \
--kv-cache-dtype auto \
--tool-call-parser qwen3_coder
This is the log:
(APIServer pid=294991) INFO: 192.168.100.150:47462 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=294991) INFO 04-24 20:06:24 [loggers.py:271] Engine 000: Avg prompt throughput: 2978.9 tokens/s, Avg generation throughput: 0.9 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=294991) INFO 04-24 20:06:24 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.00, Accepted throughput: 0.09 tokens/s, Drafted throughput: 1.32 tokens/s, Accepted: 5 tokens, Drafted: 75 tokens, Per-position acceptance rate: 0.600, 0.200, 0.200, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 6.7%
(APIServer pid=294991) INFO 04-24 20:06:34 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=294991) INFO: 192.168.100.150:40192 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=294991) INFO 04-24 20:07:24 [loggers.py:271] Engine 000: Avg prompt throughput: 216.8 tokens/s, Avg generation throughput: 32.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 26.4%, Prefix cache hit rate: 46.4%
(APIServer pid=294991) INFO 04-24 20:07:24 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.62, Accepted throughput: 3.38 tokens/s, Drafted throughput: 31.25 tokens/s, Accepted: 203 tokens, Drafted: 1875 tokens, Per-position acceptance rate: 0.752, 0.480, 0.240, 0.096, 0.032, 0.024, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 10.8%
(APIServer pid=294991) INFO 04-24 20:07:34 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 51.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 26.8%, Prefix cache hit rate: 46.4%
(APIServer pid=294991) INFO 04-24 20:07:34 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 3.39, Accepted throughput: 36.60 tokens/s, Drafted throughput: 229.48 tokens/s, Accepted: 366 tokens, Drafted: 2295 tokens, Per-position acceptance rate: 0.771, 0.601, 0.444, 0.281, 0.137, 0.085, 0.033, 0.026, 0.013, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 15.9%
(APIServer pid=294991) INFO 04-24 20:07:44 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 49.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 27.2%, Prefix cache hit rate: 46.4%
(APIServer pid=294991) INFO 04-24 20:07:44 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 3.25, Accepted throughput: 34.00 tokens/s, Drafted throughput: 226.49 tokens/s, Accepted: 340 tokens, Drafted: 2265 tokens, Per-position acceptance rate: 0.788, 0.570, 0.371, 0.232, 0.159, 0.093, 0.026, 0.013, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 15.0%
(APIServer pid=294991) INFO 04-24 20:07:54 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 40.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 27.2%, Prefix cache hit rate: 46.4%
(APIServer pid=294991) INFO 04-24 20:07:54 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.71, Accepted throughput: 25.80 tokens/s, Drafted throughput: 226.48 tokens/s, Accepted: 258 tokens, Drafted: 2265 tokens, Per-position acceptance rate: 0.715, 0.483, 0.232, 0.132, 0.079, 0.033, 0.026, 0.007, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 11.4%
(APIServer pid=294991) INFO 04-24 20:08:04 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 38.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 28.4%, Prefix cache hit rate: 46.4%
(APIServer pid=294991) INFO 04-24 20:08:04 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.58, Accepted throughput: 23.50 tokens/s, Drafted throughput: 223.48 tokens/s, Accepted: 235 tokens, Drafted: 2235 tokens, Per-position acceptance rate: 0.685, 0.430, 0.255, 0.107, 0.054, 0.020, 0.013, 0.013, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 10.5%
(APIServer pid=294991) INFO 04-24 20:08:14 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 40.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 27.6%, Prefix cache hit rate: 46.4%
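For reference, the reported Avg Draft acceptance rate is just Accepted / Drafted from the same metrics line (e.g. 235 / 2235 ≈ 10.5% in the last entry above). A throwaway sketch to aggregate it over a whole run, assuming the server output was saved to a file (vllm.log is a hypothetical name) and the log format matches the lines above:
# sum Accepted/Drafted over every SpecDecoding metrics line in the log
grep 'SpecDecoding metrics' vllm.log \
  | sed -E 's/.*Accepted: ([0-9]+) tokens, Drafted: ([0-9]+) tokens.*/\1 \2/' \
  | awk '{ acc += $1; drafted += $2 } END { printf "overall acceptance: %.1f%%\n", 100 * acc / drafted }'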
It's still under training, so it's probably not as good as the final version will be.
Also, using a third-party AWQ quant might affect accuracy.
Yes, it's still under training. This model also has some sliding-window layers with causal attention, which I think vLLM currently doesn't support. I will talk to the vLLM folks soon about supporting this new feature.
Thanks for the update. Does that mean those "sliding window layers" currently use full-sequence attention, which is what causes the imperfect performance? Thanks.
@evilperson068 In the current vLLM implementation, yes, they are still using full attention :( We will update the SGLang DFlash implementation today or tomorrow to support interleaved SWA layers.
Sadly, SGLang is not supported on Strix Halo as far as I know.
And because of other things I moved to Gemma 4, and on Strix Halo its performance is better on llama.cpp than on vLLM, so I really hope for a DFlash draft for Gemma 4 26B A4B, to see if it works there. I had good results with Qwen3.6 35B, but on vLLM. I don't know whether llama.cpp supports this draft setup at all.
I'm really looking forward to the complete vLLM update!
Wow, it'd be cool to switch to sliding window; that could improve the performance (both speed and accuracy) by a lot! Thanks!
Can't wait for the sliding-window support in vLLM. That, and the updated, better-trained model from you guys. Thank you so much.
I used it on Omlx, but it showed the thinking as content.
Update: This model has just been updated 9 hours ago, please try the new version people! <3
--speculative-config '{"method": "dflash", "model": "/path_to/Qwen3.6-27B-DFlash", "num_speculative_tokens": 15}' \
Per-position acceptance rate: 0.685, 0.430, 0.255, 0.107, 0.054, 0.020, 0.013, 0.013, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 10.5%
Your num_speculative_tokens is way too high!
Look at the per-position acceptance rates: the last 7 positions are all 0.000, so you are over-saturating the drafter model. Even those 0.013 values are pretty much insignificant.
Start with 4, then tune up or down until you reach the balance of Avg Draft acceptance rate vs. accepted throughput that satisfies your needs (the actual tok/s speed you desire). It always ends up being a trade-off, which also depends greatly on the context.
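In case it helps, a rough, untested sketch of such a sweep is below; serve.sh is a hypothetical wrapper around the launch script above that takes num_speculative_tokens as its first argument, "qwen-medium" is the served model name from that script, and the tok/s estimate is crude (it assumes all 512 tokens get generated and ignores prefill time):
for k in 2 4 6 8; do
  ./serve.sh "$k" &   # relaunch vLLM with this draft length
  SERVER_PID=$!
  # wait until the OpenAI-compatible endpoint responds
  until curl -sf localhost:8000/v1/models > /dev/null; do sleep 5; done
  START=$(date +%s.%N)
  curl -s localhost:8000/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"model": "qwen-medium", "messages": [{"role": "user", "content": "Write a long story about a robot."}], "max_tokens": 512, "temperature": 0}' \
    > /dev/null
  END=$(date +%s.%N)
  echo "num_speculative_tokens=$k: ~$(echo "512 / ($END - $START)" | bc -l) tok/s"
  kill "$SERVER_PID"; wait "$SERVER_PID" 2>/dev/null
done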
vLLM 0.20.0, with the updated Qwen3.6-27B-DFlash:
Tested cases for 1 request:
"num_speculative_tokens": 2 --> Avg Draft acceptance rate: 71.9% --> tps: 56.89
"num_speculative_tokens": 5 --> Avg Draft acceptance rate: 46.3% --> tps: 62.49
"num_speculative_tokens":10 --> Avg Draft acceptance rate: 20.2% --> tps: 43.82
Those are all slower than MTP with num_speculative_tokens 2 -> tps: 70.
@fouvy @mancub Could you try the latest commit in this PR? uv pip install -U --torch-backend=auto "vllm @ git+https://github.com/vllm-project/vllm.git@refs/pull/40898/head"
The vLLM main branch doesn't support this draft model's causal SWA layers, which leads to a low acceptance rate. This PR fixes that.
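If you want to double-check that the PR build actually got picked up (rather than a cached wheel), the installed version string should reflect a dev build from the PR branch; something like:
# confirm the installed vLLM is the PR build, not a cached release wheel
uv pip show vllm | grep -i version
python -c "import vllm; print(vllm.__version__)"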
Thanks for the update. I just applied the PR; coding performance maxes out around 60 TPS, but reasoning and other text generation aren't accelerated as much. It might take more training to surpass native MTP performance.
My launch script on 2x3090 is below. After patching in your new PR, performance is almost the same.
export CONTEXT_LENGTH=166144
export CUDA_DEVICE_WAIT_POLICY=YIELD
export VLLM_SLEEP_WHEN_IDLE=1
CUDA_DEVICE_ORDER=PCI_BUS_ID \
CUDA_VISIBLE_DEVICES=0,1 \
VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0 \
VLLM_USE_MODELSCOPE=true vllm serve \
/path_to/Qwen3.6-27B-AWQ \
--port 8000 \
--tensor-parallel-size 2 \
--max-model-len $CONTEXT_LENGTH \
--served-model-name qwen-medium \
--enable-auto-tool-choice \
--max-num-seqs 10 \
--gpu-memory-utilization 0.98 \
--reasoning-parser qwen3 \
--enable-prefix-caching \
--kv-cache-dtype auto \
--speculative-config '{"method": "dflash", "model": "/path_to/Qwen3.6-27B-DFlash", "num_speculative_tokens": 5}' \
--tool-call-parser qwen3_coder
Just tested vLLM nightly + the SWA PR; here are my results on 2x RTX PRO 6000 running Qwen3.6-27B in BF16. Performance degrades at num_speculative_tokens 4 and 16.
It seems that at higher concurrency (C=4) MTP outperforms DFlash.
DFlash, num_speculative_tokens = 8
Prefill
╭──────┬────────┬────────┬───────┬───╮
│ ctx  │ tokens │ TTFT s │ tok/s │ N │
├──────┼────────┼────────┼───────┼───┤
│ 8k   │ 5,333  │ 0.76   │ 7,059 │ 1 │
│ 16k  │ 10,517 │ 1.52   │ 6,941 │ 1 │
│ 32k  │ 20,887 │ 3.05   │ 6,840 │ 1 │
│ 64k  │ 41,626 │ 6.41   │ 6,491 │ 1 │
│ 128k │ 83,076 │ 13.97  │ 5,946 │ 1 │
╰──────┴────────┴────────┴───────┴───╯
Aggregate decode tok/s
╭────────────┬───────┬───────┬───────╮
│ ctx \ conc │     1 │     2 │     4 │
├────────────┼───────┼───────┼───────┤
│ 0          │ 103.2 │ 209.4 │ 377.3 │
│ 16k        │ 108.8 │ 211.3 │ 372.0 │
│ 32k        │ 105.2 │ 193.7 │ 392.9 │
│ 64k        │ 115.0 │ 193.7 │ 360.2 │
│ 128k       │ 101.9 │ 180.3 │ 318.7 │
╰────────────┴───────┴───────┴───────╯
MTP, num_speculative_tokens = 3
Prefill
╭──────┬────────┬────────┬───────┬───╮
│ ctx  │ tokens │ TTFT s │ tok/s │ N │
├──────┼────────┼────────┼───────┼───┤
│ 8k   │ 5,334  │ 0.78   │ 6,876 │ 1 │
│ 16k  │ 10,518 │ 1.54   │ 6,816 │ 1 │
│ 32k  │ 20,888 │ 3.12   │ 6,697 │ 1 │
│ 64k  │ 41,627 │ 6.57   │ 6,340 │ 1 │
│ 128k │ 83,077 │ 14.34  │ 5,792 │ 1 │
╰──────┴────────┴────────┴───────┴───╯
Aggregate decode tok/s
╭────────────┬───────┬───────┬───────╮
│ ctx \ conc │     1 │     2 │     4 │
├────────────┼───────┼───────┼───────┤
│ 0          │  96.0 │ 189.1 │ 391.5 │
│ 16k        │ 100.6 │ 194.5 │ 388.6 │
│ 32k        │  97.2 │ 192.4 │ 384.0 │
│ 64k        │  98.6 │ 190.3 │ 370.4 │
│ 128k       │  92.3 │ 178.0 │ 345.0 │
╰────────────┴───────┴───────┴───────╯
I tried it on a DGX Spark and got virtually 0% acceptance for the 4th and subsequent tokens with the FP8 version of Qwen3.6-27B, and significantly better results (positive acceptance for up to 8-9 tokens ahead) with the BF16 version of the model, so base-model quantization seems to matter a lot. The slowdown from going to BF16 doesn't seem to justify using DFlash currently, though.