Qwen3.6-27B DFlash Draft - Q8_0 GGUF
Q8_0 GGUF quantization of the z-lab/Qwen3.6-27B-DFlash draft model, produced for the Lucebox dflash engine (speculative decoding for Qwen3.6-27B-Q4_K_M).
- Source: deepsweet/Qwen3.6-27B-DFlash-FP16 (FP16 safetensors mirror of z-lab's BF16)
- Tool: `dflash/scripts/quantize_draft_q8.py` from `lucebox-hub`
- Size: 1.84 GB (53% of BF16)
- Arch: `qwen35-dflash-draft`, 5 layers, hidden 5120, n_target_layers 5, vocab 248320
- Tensors: projection weights → Q8_0, norms → F32 (precision-critical, tiny)
- Block size: 16, RoPE θ 1e6, RMS ε 1e-6, MASK token id 248070
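The size ratio quoted above can be checked from the quantization format alone. This is a rough sketch that assumes the standard ggml Q8_0 layout (32 int8 weights plus one FP16 scale per block, i.e. 34 bytes per 32 weights) and ignores the small F32 norm tensors and GGUF metadata. The quantization block of 32 is unrelated to the draft block size of 16 in the list above.

```python
# Back-of-the-envelope check of the Q8_0 / BF16 size ratio.
# Assumption: standard ggml Q8_0 layout (32 int8 weights + one FP16 scale per block).
Q8_0_WEIGHTS_PER_BLOCK = 32
Q8_0_BYTES_PER_BLOCK = 32 * 1 + 2  # int8 payload + 2-byte FP16 scale
BF16_BITS_PER_WEIGHT = 16


def q8_0_bits_per_weight() -> float:
    """Effective storage cost of one Q8_0 weight, scale included."""
    return 8 * Q8_0_BYTES_PER_BLOCK / Q8_0_WEIGHTS_PER_BLOCK


def q8_0_to_bf16_ratio() -> float:
    """Size of a Q8_0 tensor relative to the same tensor in BF16."""
    return q8_0_bits_per_weight() / BF16_BITS_PER_WEIGHT


if __name__ == "__main__":
    print(f"{q8_0_bits_per_weight():.2f} bits/weight")  # 8.50 bits/weight
    print(f"{q8_0_to_bf16_ratio():.1%} of BF16")        # 53.1% of BF16
```

The result, about 53%, matches the figure in the list, which suggests nearly all of the file is Q8_0 projection weights.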
Files
| File | Size | Purpose |
|---|---|---|
| `dflash-draft-3.6-q8_0.gguf` | 1.84 GB | The draft model. Pass to dflash via `--draft`. |
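After downloading, a quick sanity check is to read the fixed-size GGUF header. The sketch below assumes a little-endian GGUF file of version 2 or later, where the header is the 4-byte magic `GGUF`, a uint32 version, a uint64 tensor count and a uint64 metadata key/value count. The function name `read_gguf_header` is made up for this example and is not part of lucebox-hub.

```python
import struct

GGUF_MAGIC = b"GGUF"
GGUF_HEADER_BYTES = 24  # 4 (magic) + 4 (version) + 8 (tensor count) + 8 (kv count)


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file.

    Assumes GGUF v2+ in little-endian byte order; raises ValueError otherwise.
    """
    with open(path, "rb") as f:
        header = f.read(GGUF_HEADER_BYTES)
    if len(header) < GGUF_HEADER_BYTES or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:])
    return version, tensor_count, kv_count
```

Calling `read_gguf_header("models/draft/dflash-draft-3.6-q8_0.gguf")` should return three integers; a download that saved an error page instead of the model fails the magic check.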
Usage with the Lucebox dflash engine
# 1. Clone + checkout (PR 129 adds Qwen3.6 SWA support)
git clone https://github.com/Luce-Org/lucebox-hub.git
cd lucebox-hub
git fetch origin pull/129/head:pr129 && git checkout pr129
git submodule update --init --recursive
# 2. Build (sm_86+ enables Block-Sparse Attention; sm_75 falls back to ggml flash_attn_ext)
cd dflash
cmake -B build -S . -G Ninja \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CUDA_ARCHITECTURES=86 \
-DDFLASH27B_ENABLE_BSA=ON \
-DDFLASH27B_TESTS=ON
cmake --build build --target test_dflash -j
# 3. Get the target (Q4_K_M GGUF) and this draft
mkdir -p models/target models/draft
hf download unsloth/Qwen3.6-27B-GGUF --include "*Q4_K_M*.gguf" --local-dir models/target
hf download Lucebox/Qwen3.6-27B-DFlash-GGUF --include "dflash-draft-3.6-q8_0.gguf" --local-dir models/draft
# 4. Run
export DFLASH_TARGET=models/target/Qwen3.6-27B-Q4_K_M.gguf
export DFLASH_DRAFT=models/draft/dflash-draft-3.6-q8_0.gguf
echo "Write a haiku about GPUs." | python3 scripts/run.py --max-ctx 2048 --n-gen 256
The binary autodetects .gguf vs .safetensors from the draft path.
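Step 4 can also be driven from Python. The sketch below relies only on what the commands above show: the `DFLASH_TARGET` and `DFLASH_DRAFT` environment variables and the `--max-ctx` / `--n-gen` flags of `scripts/run.py`. The helper names are hypothetical, and the suffix check merely mirrors the autodetection described above to give an early error message; it is not the engine's own logic.

```python
import os
import subprocess
from pathlib import Path


def detect_draft_format(draft_path):
    """Classify a draft file by extension (illustrative pre-flight check only)."""
    suffix = Path(draft_path).suffix.lower()
    if suffix == ".gguf":
        return "gguf"
    if suffix == ".safetensors":
        return "safetensors"
    raise ValueError(f"unsupported draft file: {draft_path}")


def build_run(target, draft, max_ctx=2048, n_gen=256):
    """Return (argv, env) equivalent to the shell invocation in step 4."""
    detect_draft_format(draft)  # fail early on an unexpected extension
    env = dict(os.environ, DFLASH_TARGET=target, DFLASH_DRAFT=draft)
    argv = ["python3", "scripts/run.py",
            "--max-ctx", str(max_ctx), "--n-gen", str(n_gen)]
    return argv, env


def run_prompt(prompt, target, draft, **kwargs):
    """Pipe a prompt to scripts/run.py on stdin and return its stdout."""
    argv, env = build_run(target, draft, **kwargs)
    result = subprocess.run(argv, input=prompt + "\n", env=env,
                            text=True, capture_output=True, check=True)
    return result.stdout
```

Like the shell version, `run_prompt` has to be called from the `dflash` directory so that the relative `scripts/run.py` path resolves.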
Compatibility
- Target: any `Qwen3.6-27B-Q4_K_M.gguf` (e.g. `unsloth/Qwen3.6-27B-GGUF`)
- The DFlash arch (5 layers + `dflash.fc.weight` + `dflash.hidden_norm.weight`) is loaded by `gguf_draft_loader.cpp`. Quantizing this draft requires the matching converter in `dflash/scripts/quantize_draft_q8.py`; do not re-quantize with stock `llama-quantize`, which won't preserve the dflash-specific tensors.
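One way to notice a draft file that has lost its dflash-specific tensors is to compare its tensor names against the two named above. The following is a minimal sketch, not the loader's actual validation: the function name is hypothetical, and how you obtain the tensor names is left open (the `GGUFReader` class of the `gguf` Python package is one option, assuming it is installed).

```python
# Tensor names taken from the compatibility note above.
REQUIRED_DFLASH_TENSORS = ("dflash.fc.weight", "dflash.hidden_norm.weight")


def missing_dflash_tensors(tensor_names):
    """Return the dflash-specific tensors absent from an iterable of names."""
    present = set(tensor_names)
    return [name for name in REQUIRED_DFLASH_TENSORS if name not in present]


# Possible way to collect the names (assumption: `pip install gguf`):
#   from gguf import GGUFReader
#   names = [t.name for t in GGUFReader("dflash-draft-3.6-q8_0.gguf").tensors]
#   print(missing_dflash_tensors(names))
```

An empty list means both tensors are present; anything else points at a file produced by a converter that dropped them.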
License & attribution
Apache 2.0, inheriting the upstream z-lab license. Original DFlash work and weights by z-lab; FP16 mirror by deepsweet; Q8_0 quantization + repackaging by Lucebox.