DFlash draft model for Qwen 3.6 27B, made by z-lab.

Tested with BeeLlama.cpp v0.2.0, a llama.cpp fork with advanced DFlash support that enables using these draft models to their full potential.

  • Target model: Qwen 3.6 27B Q5_K_S
  • Setup: Windows 11, AMD Ryzen 7 5700X3D, 32 GB DDR4 RAM, RTX 3090 24 GB
  • Config: same as in quick start docs, but with reasoning and adaptive DM disabled
  • Baseline: llama.cpp b9275 (CUDA 13.1 Windows prebuilt), 36.8 tok/s median

Prompt: Doubly-linked list (output: ~4K tok)

Write a complete Python 3 module implementing a doubly-linked list with the following methods: append, prepend, insert_at, remove_at, find, reverse, to_list, length, is_empty, iter. Include comprehensive docstrings, type hints, and pytest unit tests for every method. Return only the code, no commentary.

| DFlash quant | Size    | Median      | Best        | Speedup | Acceptance    |
|--------------|---------|-------------|-------------|---------|---------------|
| IQ4_XS       | 891 MB  | 148.0 tok/s | 160.5 tok/s | 4.02x   | 47.6% / 87.7% |
| Q4_K_M       | 985 MB  | 145.6 tok/s | 152.6 tok/s | 3.96x   | 47.0% / 87.6% |
| Q5_K_M       | 1.17 GB | 144.9 tok/s | 157.2 tok/s | 3.94x   | 46.8% / 87.6% |
| Q6_K         | 1.36 GB | 139.2 tok/s | 152.5 tok/s | 3.79x   | 45.4% / 87.2% |
| Q8_0         | 1.76 GB | 142.8 tok/s | 155.5 tok/s | 3.88x   | 46.9% / 87.6% |
| bf16         | 3.31 GB | 132.5 tok/s | 145.0 tok/s | 3.60x   | 44.2% / 86.9% |
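
As a rough check, the Speedup column can be recomputed as the median tok/s divided by the 36.8 tok/s baseline. The sketch below assumes exactly that definition; the names `BASELINE_TOK_S`, `median_tok_s` and `speedup` are made up for illustration, and the inputs are the rounded figures from the table, so the last digit may not always match the published value.

```python
# Baseline median from the setup list (llama.cpp b9275, no draft model).
BASELINE_TOK_S = 36.8

# Median tok/s per DFlash quant, copied from the table.
median_tok_s = {
    "IQ4_XS": 148.0,
    "Q4_K_M": 145.6,
    "Q5_K_M": 144.9,
    "Q6_K": 139.2,
    "Q8_0": 142.8,
    "bf16": 132.5,
}


def speedup(tok_s: float, baseline: float = BASELINE_TOK_S) -> float:
    """Assumed definition: throughput with the draft model over baseline throughput."""
    return tok_s / baseline


for quant, tok_s in median_tok_s.items():
    print(f"{quant:7s} {speedup(tok_s):.2f}x")
```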

Acceptance is reported as two ratios: accepted draft tokens to proposed draft tokens / accepted draft tokens to final generated tokens.
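
To make the two acceptance figures concrete, here is a minimal sketch of how they could be derived from three counters. The function name and the counter values are hypothetical and not taken from BeeLlama.cpp or from any real run.

```python
def acceptance_rates(proposed: int, accepted: int, generated: int) -> tuple[float, float]:
    """Return (accepted / proposed, accepted / generated) as percentages.

    proposed:  draft tokens the draft model suggested
    accepted:  draft tokens the target model kept
    generated: tokens in the final output
    """
    if proposed <= 0 or generated <= 0:
        raise ValueError("proposed and generated must be positive")
    return 100.0 * accepted / proposed, 100.0 * accepted / generated


# Hypothetical counters, chosen only to illustrate the format.
draft_rate, output_share = acceptance_rates(proposed=8000, accepted=3600, generated=4100)
print(f"{draft_rate:.1f}% / {output_share:.1f}%")  # 45.0% / 87.8%
```

The first number shows how often a drafted token survives verification; the second shows how much of the final output came from the draft model rather than from the target model directly.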

Between IQ4_XS, Q4_K_M and Q5_K_M the difference is smaller than noise from variance between passes, so using any of them should be fine. IQ4_XS takes up the least VRAM, but Q5_K_M might result in slightly higher acceptance in the long run.

Higher-precision quants don't guarantee better performance: the draft model only has to predict a few tokens at a time, so the loss of precision affects it less. Meanwhile, a larger file means slower drafting, which lowers the resulting tok/s, and higher VRAM consumption.
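
The trade-off can be illustrated with a toy cost model. This is a simplification under stated assumptions, not how BeeLlama.cpp measures anything: each cycle is assumed to cost one drafting pass plus one verification pass on the target, and to yield the accepted draft tokens plus one token from the target itself. All timings and acceptance counts below are hypothetical.

```python
def tokens_per_second(accepted_per_cycle: float, draft_ms: float, verify_ms: float) -> float:
    """Toy throughput model for one draft-then-verify cycle.

    Assumption: a cycle emits the accepted draft tokens plus one token
    produced by the target model during verification.
    """
    tokens = accepted_per_cycle + 1.0
    return 1000.0 * tokens / (draft_ms + verify_ms)


# Hypothetical numbers: the larger quant accepts slightly more tokens
# per cycle, but each drafting pass takes longer.
small = tokens_per_second(accepted_per_cycle=7.0, draft_ms=8.0, verify_ms=42.0)
large = tokens_per_second(accepted_per_cycle=7.2, draft_ms=12.0, verify_ms=42.0)

print(f"small quant: {small:.1f} tok/s")
print(f"large quant: {large:.1f} tok/s")
```

With these made-up inputs the small gain in acceptance does not make up for the extra drafting time, which is the same pattern the table shows for bf16 compared with IQ4_XS.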

Keep in mind that results will likely be different for higher target model quants, which I can't test myself due to VRAM limitations.

Format: GGUF
Model size: 2B params
Architecture: dflash-draft