Avg Draft acceptance rate is low.
This is my launch script:
export CONTEXT_LENGTH=172144
export CUDA_DEVICE_WAIT_POLICY=YIELD
export VLLM_SLEEP_WHEN_IDLE=1
CUDA_DEVICE_ORDER=PCI_BUS_ID \
CUDA_VISIBLE_DEVICES=0 \
VLLM_USE_MODELSCOPE=true vllm serve \
/path_to/Qwen3.6-27B-AWQ \
--port 8000 \
--tensor-parallel-size 1 \
--max-model-len $CONTEXT_LENGTH \
--served-model-name qwen-medium \
--enable-auto-tool-choice \
--max-num-seqs 6 \
--gpu-memory-utilization 0.98 \
--reasoning-parser qwen3 \
--enable-prefix-caching \
--speculative-config '{"method": "dflash", "model": "/path_to/Qwen3.6-27B-DFlash", "num_speculative_tokens": 15}' \
--kv-cache-dtype auto \
--tool-call-parser qwen3_coder
This is the log:
(APIServer pid=294991) INFO: 192.168.100.150:47462 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=294991) INFO 04-24 20:06:24 [loggers.py:271] Engine 000: Avg prompt throughput: 2978.9 tokens/s, Avg generation throughput: 0.9 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=294991) INFO 04-24 20:06:24 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.00, Accepted throughput: 0.09 tokens/s, Drafted throughput: 1.32 tokens/s, Accepted: 5 tokens, Drafted: 75 tokens, Per-position acceptance rate: 0.600, 0.200, 0.200, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 6.7%
(APIServer pid=294991) INFO 04-24 20:06:34 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=294991) INFO: 192.168.100.150:40192 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=294991) INFO 04-24 20:07:24 [loggers.py:271] Engine 000: Avg prompt throughput: 216.8 tokens/s, Avg generation throughput: 32.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 26.4%, Prefix cache hit rate: 46.4%
(APIServer pid=294991) INFO 04-24 20:07:24 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.62, Accepted throughput: 3.38 tokens/s, Drafted throughput: 31.25 tokens/s, Accepted: 203 tokens, Drafted: 1875 tokens, Per-position acceptance rate: 0.752, 0.480, 0.240, 0.096, 0.032, 0.024, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 10.8%
(APIServer pid=294991) INFO 04-24 20:07:34 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 51.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 26.8%, Prefix cache hit rate: 46.4%
(APIServer pid=294991) INFO 04-24 20:07:34 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 3.39, Accepted throughput: 36.60 tokens/s, Drafted throughput: 229.48 tokens/s, Accepted: 366 tokens, Drafted: 2295 tokens, Per-position acceptance rate: 0.771, 0.601, 0.444, 0.281, 0.137, 0.085, 0.033, 0.026, 0.013, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 15.9%
(APIServer pid=294991) INFO 04-24 20:07:44 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 49.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 27.2%, Prefix cache hit rate: 46.4%
(APIServer pid=294991) INFO 04-24 20:07:44 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 3.25, Accepted throughput: 34.00 tokens/s, Drafted throughput: 226.49 tokens/s, Accepted: 340 tokens, Drafted: 2265 tokens, Per-position acceptance rate: 0.788, 0.570, 0.371, 0.232, 0.159, 0.093, 0.026, 0.013, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 15.0%
(APIServer pid=294991) INFO 04-24 20:07:54 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 40.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 27.2%, Prefix cache hit rate: 46.4%
(APIServer pid=294991) INFO 04-24 20:07:54 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.71, Accepted throughput: 25.80 tokens/s, Drafted throughput: 226.48 tokens/s, Accepted: 258 tokens, Drafted: 2265 tokens, Per-position acceptance rate: 0.715, 0.483, 0.232, 0.132, 0.079, 0.033, 0.026, 0.007, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 11.4%
(APIServer pid=294991) INFO 04-24 20:08:04 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 38.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 28.4%, Prefix cache hit rate: 46.4%
(APIServer pid=294991) INFO 04-24 20:08:04 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.58, Accepted throughput: 23.50 tokens/s, Drafted throughput: 223.48 tokens/s, Accepted: 235 tokens, Drafted: 2235 tokens, Per-position acceptance rate: 0.685, 0.430, 0.255, 0.107, 0.054, 0.020, 0.013, 0.013, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 10.5%
(APIServer pid=294991) INFO 04-24 20:08:14 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 40.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 27.6%, Prefix cache hit rate: 46.4%
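For reference, the reported Avg Draft acceptance rate is just Accepted / Drafted from the same metrics line (e.g. 235 / 2235 ≈ 10.5% in the last entry above). A throwaway sketch to aggregate it over a whole run, assuming the server output was saved to a file (vllm.log is a hypothetical name) and the log format matches the lines above:
# sum Accepted/Drafted over every SpecDecoding metrics line in the log
grep 'SpecDecoding metrics' vllm.log \
  | sed -E 's/.*Accepted: ([0-9]+) tokens, Drafted: ([0-9]+) tokens.*/\1 \2/' \
  | awk '{ acc += $1; drafted += $2 } END { printf "overall acceptance: %.1f%%\n", 100 * acc / drafted }'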
It's still under training, so it's probably not as good as the final version will be.
Also, using a third-party AWQ quant might affect accuracy.
Yes, it's still under training. This model also has some sliding-window layers with causal attention, which I think vLLM currently doesn't support. I will talk to the vLLM folks soon about supporting this new feature.
Thanks for the update. Does that mean those "sliding window layers" currently use full-sequence attention, which is what causes the imperfect performance? Thanks.
@evilperson068 In the current vLLM implementation, yes, they are still using full attention :( We will update the SGLang DFlash implementation today or tomorrow to support interleaved SWA layers.
Sadly, SGLang is not supported on Strix Halo as far as I know.
And because of other things I moved to Gemma 4, and on Strix Halo its performance is better on llama.cpp than on vLLM, so I really hope for a DFlash draft for Gemma 4 26B A4B, to see if it works there. I had good results with Qwen3.6 35B, but on vLLM. I don't know whether llama.cpp supports this draft setup at all.
I'm really looking forward to the complete vLLM update!
Wow, it'd be cool to switch to sliding window; that could improve the performance (both speed and accuracy) by a lot! Thanks!
Can't wait for the sliding-window support in vLLM. That, and the updated, better-trained model from you guys. Thank you so much.
I used it on Omlx, but it showed the thinking as content.
Update: This model has just been updated 9 hours ago, please try the new version people! <3
--speculative-config '{"method": "dflash", "model": "/path_to/Qwen3.6-27B-DFlash", "num_speculative_tokens": 15}' \
Per-position acceptance rate: 0.685, 0.430, 0.255, 0.107, 0.054, 0.020, 0.013, 0.013, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 10.5%
Your num_speculative_tokens is way too high!
Look at the per-position acceptance rates: the last 7 positions are all 0.000, so you are over-saturating the drafter model. Even those 0.013 values are pretty much insignificant.
Start with 4, then tune up or down until you reach the balance of Avg Draft acceptance rate vs. accepted throughput that satisfies your needs (the actual tok/s speed you desire). It always ends up being a trade-off, which also depends greatly on the context.
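In case it helps, a rough, untested sketch of such a sweep is below; serve.sh is a hypothetical wrapper around the launch script above that takes num_speculative_tokens as its first argument, "qwen-medium" is the served model name from that script, and the tok/s estimate is crude (it assumes all 512 tokens get generated and ignores prefill time):
for k in 2 4 6 8; do
  ./serve.sh "$k" &   # relaunch vLLM with this draft length
  SERVER_PID=$!
  # wait until the OpenAI-compatible endpoint responds
  until curl -sf localhost:8000/v1/models > /dev/null; do sleep 5; done
  START=$(date +%s.%N)
  curl -s localhost:8000/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"model": "qwen-medium", "messages": [{"role": "user", "content": "Write a long story about a robot."}], "max_tokens": 512, "temperature": 0}' \
    > /dev/null
  END=$(date +%s.%N)
  echo "num_speculative_tokens=$k: ~$(echo "512 / ($END - $START)" | bc -l) tok/s"
  kill "$SERVER_PID"; wait "$SERVER_PID" 2>/dev/null
done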
vLLM 0.20.0, with the updated Qwen3.6-27B-DFlash:
Tested cases for 1 request:
"num_speculative_tokens": 2 --> Avg Draft acceptance rate: 71.9% --> tps: 56.89
"num_speculative_tokens": 5 --> Avg Draft acceptance rate: 46.3% --> tps: 62.49
"num_speculative_tokens":10 --> Avg Draft acceptance rate: 20.2% --> tps: 43.82
Those are all slower than MTP with num_speculative_tokens 2 -> tps: 70.
@fouvy @mancub Could you try the latest commit in this PR? uv pip install -U --torch-backend=auto "vllm @ git+https://github.com/vllm-project/vllm.git@refs/pull/40898/head"
The vLLM main branch doesn't support this draft model's causal SWA layers, which leads to a low acceptance rate. This PR fixes that.
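If you want to double-check that the PR build actually got picked up (rather than a cached wheel), the installed version string should reflect a dev build from the PR branch; something like:
# confirm the installed vLLM is the PR build, not a cached release wheel
uv pip show vllm | grep -i version
python -c "import vllm; print(vllm.__version__)"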
Thanks for the update. I just applied the PR; coding performance maxes out around 60 TPS, but reasoning and other text generation aren't accelerated as much. It might take more training to surpass native MTP performance.
My launch script on 2x3090 is below. After patching in your new PR, performance is almost the same.
export CONTEXT_LENGTH=166144
export CUDA_DEVICE_WAIT_POLICY=YIELD
export VLLM_SLEEP_WHEN_IDLE=1
CUDA_DEVICE_ORDER=PCI_BUS_ID \
CUDA_VISIBLE_DEVICES=0,1 \
VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0 \
VLLM_USE_MODELSCOPE=true vllm serve \
/path_to/Qwen3.6-27B-AWQ \
--port 8000 \
--tensor-parallel-size 2 \
--max-model-len $CONTEXT_LENGTH \
--served-model-name qwen-medium \
--enable-auto-tool-choice \
--max-num-seqs 10 \
--gpu-memory-utilization 0.98 \
--reasoning-parser qwen3 \
--enable-prefix-caching \
--kv-cache-dtype auto \
--speculative-config '{"method": "dflash", "model": "/path_to/Qwen3.6-27B-DFlash", "num_speculative_tokens": 5}' \
--tool-call-parser qwen3_coder
Just tested vLLM nightly + the SWA PR; here are my results on 2x RTX PRO 6000 running Qwen3.6-27B in BF16. Performance degrades at num_speculative_tokens 4 and 16.
It seems that at higher concurrency (C=4) MTP outperforms DFlash.
DFlash, num_speculative_tokens = 8
Prefill
╭──────┬────────┬────────┬───────┬───╮
│ ctx  │ tokens │ TTFT s │ tok/s │ N │
├──────┼────────┼────────┼───────┼───┤
│ 8k   │ 5,333  │ 0.76   │ 7,059 │ 1 │
│ 16k  │ 10,517 │ 1.52   │ 6,941 │ 1 │
│ 32k  │ 20,887 │ 3.05   │ 6,840 │ 1 │
│ 64k  │ 41,626 │ 6.41   │ 6,491 │ 1 │
│ 128k │ 83,076 │ 13.97  │ 5,946 │ 1 │
╰──────┴────────┴────────┴───────┴───╯
Aggregate decode tok/s
╭────────────┬───────┬───────┬───────╮
│ ctx \ conc │     1 │     2 │     4 │
├────────────┼───────┼───────┼───────┤
│ 0          │ 103.2 │ 209.4 │ 377.3 │
│ 16k        │ 108.8 │ 211.3 │ 372.0 │
│ 32k        │ 105.2 │ 193.7 │ 392.9 │
│ 64k        │ 115.0 │ 193.7 │ 360.2 │
│ 128k       │ 101.9 │ 180.3 │ 318.7 │
╰────────────┴───────┴───────┴───────╯
MTP, num_speculative_tokens = 3
Prefill
╭──────┬────────┬────────┬───────┬───╮
│ ctx  │ tokens │ TTFT s │ tok/s │ N │
├──────┼────────┼────────┼───────┼───┤
│ 8k   │ 5,334  │ 0.78   │ 6,876 │ 1 │
│ 16k  │ 10,518 │ 1.54   │ 6,816 │ 1 │
│ 32k  │ 20,888 │ 3.12   │ 6,697 │ 1 │
│ 64k  │ 41,627 │ 6.57   │ 6,340 │ 1 │
│ 128k │ 83,077 │ 14.34  │ 5,792 │ 1 │
╰──────┴────────┴────────┴───────┴───╯
Aggregate decode tok/s
╭────────────┬───────┬───────┬───────╮
│ ctx \ conc │     1 │     2 │     4 │
├────────────┼───────┼───────┼───────┤
│ 0          │  96.0 │ 189.1 │ 391.5 │
│ 16k        │ 100.6 │ 194.5 │ 388.6 │
│ 32k        │  97.2 │ 192.4 │ 384.0 │
│ 64k        │  98.6 │ 190.3 │ 370.4 │
│ 128k       │  92.3 │ 178.0 │ 345.0 │
╰────────────┴───────┴───────┴───────╯
I tried it on a DGX Spark and got virtually 0% acceptance for the 4th and subsequent tokens with the FP8 version of Qwen3.6-27B, and significantly better results (positive acceptance for up to 8-9 tokens ahead) with the BF16 version of the model, so base-model quantization seems to matter a lot. The slowdown from going to BF16 doesn't seem to justify using DFlash currently, though.