Qwen3.6-27B DFlash Draft: Q8_0 GGUF

Q8_0 GGUF quantization of the z-lab/Qwen3.6-27B-DFlash draft model, produced for the Lucebox dflash engine (speculative decoding for Qwen3.6-27B-Q4_K_M).
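For readers new to speculative decoding, the draft-then-verify idea can be sketched as below. This is a toy greedy version, not the Lucebox engine's implementation: `draft_propose` and `target_next` are hypothetical callables standing in for the draft and target models, and the default `block_size=16` assumes the "Block size: 16" listed in the specs is the number of tokens drafted per step. A real engine would verify the whole block in one batched target pass rather than token by token.

```python
def speculative_step(prefix, draft_propose, target_next, block_size=16):
    """One greedy draft-then-verify step; returns the tokens to append.

    draft_propose(prefix, k) -> list of k proposed token ids (hypothetical)
    target_next(tokens)      -> the target model's next token id (hypothetical)
    """
    proposal = draft_propose(prefix, block_size)
    accepted = []
    for tok in proposal:
        expected = target_next(prefix + accepted)
        if tok != expected:
            # First disagreement: keep the target's token and stop this step.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    return accepted
```

The output always matches what the target alone would have produced; the draft only changes how many tokens are confirmed per target step.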

  • Source: deepsweet/Qwen3.6-27B-DFlash-FP16 (FP16 safetensors mirror of z-lab's BF16)
  • Tool: dflash/scripts/quantize_draft_q8.py from lucebox-hub
  • Size: 1.84 GB (53 % of BF16)
  • Arch: qwen35-dflash-draft, 5 layers, hidden 5120, n_target_layers 5, vocab 248320
  • Tensors: projection weights → Q8_0, norms → F32 (precision-critical, tiny)
  • Block size: 16, RoPE θ 1e6, RMS ε 1e-6, MASK token id 248070
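The "53 % of BF16" figure follows from the Q8_0 layout: in ggml, each group of 32 weights is stored as 32 int8 codes plus one FP16 scale, i.e. 34 bytes per 32 weights, or 8.5 bits per weight against 16 for BF16. The sketch below illustrates that scheme in plain Python; it is an illustration only, not the code in quantize_draft_q8.py, and the function names are made up for this example. The 32-weight quantization group is unrelated to the "Block size: 16" entry above.

```python
import struct

QK8_0 = 32  # ggml Q8_0: 32 weights share one fp16 scale


def to_fp16(x):
    """Round a float through IEEE half precision, as the stored scale is."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize_q8_0_block(values):
    """Quantize 32 floats to (fp16 scale, list of 32 int8 codes)."""
    assert len(values) == QK8_0
    amax = max(abs(v) for v in values)
    d = amax / 127.0
    inv = 1.0 / d if d else 0.0
    codes = [round(v * inv) for v in values]  # each code lies in [-127, 127]
    return to_fp16(d), codes


def dequantize_q8_0_block(scale, codes):
    """Reconstruct approximate floats from one Q8_0 block."""
    return [scale * c for c in codes]


def bits_per_weight():
    # 2 bytes of fp16 scale + 32 bytes of int8 codes, spread over 32 weights
    return (2 + QK8_0) * 8 / QK8_0
```

`bits_per_weight() / 16` gives 0.53125, which matches the size ratio quoted above; the F32 norms are small enough not to move it noticeably.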

Files

File                         Size      Purpose
dflash-draft-3.6-q8_0.gguf   1.84 GB   The draft model. Pass to dflash via --draft

Usage with the Lucebox dflash engine

# 1. Clone + checkout (PR 129 adds Qwen3.6 SWA support)
git clone https://github.com/Luce-Org/lucebox-hub.git
cd lucebox-hub
git fetch origin pull/129/head:pr129 && git checkout pr129
git submodule update --init --recursive

# 2. Build (sm_86+ enables Block-Sparse Attention; sm_75 falls back to ggml flash_attn_ext)
cd dflash
cmake -B build -S . -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CUDA_ARCHITECTURES=86 \
  -DDFLASH27B_ENABLE_BSA=ON \
  -DDFLASH27B_TESTS=ON
cmake --build build --target test_dflash -j

# 3. Get the target (Q4_K_M GGUF) and this draft
mkdir -p models/target models/draft
hf download unsloth/Qwen3.6-27B-GGUF --include "*Q4_K_M*.gguf" --local-dir models/target
hf download Lucebox/Qwen3.6-27B-DFlash-GGUF --include "dflash-draft-3.6-q8_0.gguf" --local-dir models/draft

# 4. Run
export DFLASH_TARGET=models/target/Qwen3.6-27B-Q4_K_M.gguf
export DFLASH_DRAFT=models/draft/dflash-draft-3.6-q8_0.gguf
echo "Write a haiku about GPUs." | python3 scripts/run.py --max-ctx 2048 --n-gen 256

The binary autodetects .gguf vs .safetensors from the draft path.
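As a rough illustration of path-based detection, the sketch below picks the format from the file extension and, separately, checks that a file starts with the 4-byte `GGUF` magic that every GGUF file carries. The engine itself is a compiled binary and its actual logic is not shown in this card; the function names here are hypothetical and only mirror the described behavior. The magic check can be handy as a pre-flight test for a truncated or mislabeled download.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of any GGUF file


def detect_draft_format(path):
    """Guess the draft format from the path's extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".gguf":
        return "gguf"
    if suffix == ".safetensors":
        return "safetensors"
    raise ValueError(f"unrecognized draft extension: {suffix!r}")


def has_gguf_magic(path):
    """Return True if the file begins with the GGUF magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == GGUF_MAGIC
```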

Compatibility

  • Target: any Qwen3.6-27B-Q4_K_M.gguf (e.g. unsloth/Qwen3.6-27B-GGUF)
  • The DFlash arch (5 layers + dflash.fc.weight + dflash.hidden_norm.weight) is loaded by gguf_draft_loader.cpp. Quantizing this draft requires the matching converter, dflash/scripts/quantize_draft_q8.py. Do not re-quantize with stock llama-quantize, which won't preserve the dflash-specific tensors.
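A quick way to confirm that a re-quantized or re-downloaded draft still carries the DFlash-specific tensors is to compare its tensor names against the two named above. The helper below is a minimal sketch: the two tensor names come from this card, while the function name is invented for illustration and the way the name list is obtained is left to the caller.

```python
# Tensor names taken from the compatibility note above.
REQUIRED_DFLASH_TENSORS = ("dflash.fc.weight", "dflash.hidden_norm.weight")


def missing_dflash_tensors(tensor_names):
    """Return the required DFlash tensor names absent from tensor_names."""
    present = set(tensor_names)
    return [name for name in REQUIRED_DFLASH_TENSORS if name not in present]
```

The names could come from any GGUF reader, for example by iterating `GGUFReader(path).tensors` from the `gguf` Python package, assuming that reader accepts this custom architecture. An empty result means both tensors are present.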

License & attribution

Apache 2.0, inheriting the upstream z-lab license. Original DFlash work and weights by z-lab; FP16 mirror by deepsweet; Q8_0 quantization + repackaging by Lucebox.
