---
license: apache-2.0
base_model:
- Qwen/Qwen3.6-27B
library_name: gguf
pipeline_tag: text-generation
tags:
- gguf
- quantized
- dynamic-quant
- qwen3
- llama.cpp
- speculative-decoding
---

# Qwen3.6-27B-PRISM-PRO: DQ GGUF

A llama.cpp-native GGUF quantization of `Qwen3.6-27B-PRISM-PRO`, built with the PRISM project's **dynamic-quant (DQ)** recipe. The file is **~13.7 GB**, versus 55 GB in BF16.

`Qwen3.6-27B-PRISM-PRO` is the PRISM-PRO (bias/propaganda removal) variant of `Qwen/Qwen3.6-27B`.

This GGUF preserves the model's native MTP draft head and full vision tower, and pairs with the separately published [EAGLE-3 drafter](https://huggingface.co/Ex0bit/Qwen3.6-27B-PRISM-EAGLE3) for faster, lossless decoding.

## Performance

Measured with llama.cpp on a single NVIDIA Blackwell GPU, single-stream greedy decode:

| config | tok/s | speedup |
|---|--:|--:|
| no-spec baseline | 80 | 1.00× |
| **native MTP** (built-in draft head) | **121** | **1.51×** |
| EAGLE-3 chain (with our drafter) | 111 | 1.39× |

Speculative decoding is **lossless**: the output is token-identical to non-speculative greedy decoding, modulo the floating-point non-associativity of batched verification, which is intrinsic to all speculative decoding.

For a faster SGLang deployment (~183 tok/s, ~1.97× over no-spec) using the BF16 target + EAGLE-3, see the [drafter repo](https://huggingface.co/Ex0bit/Qwen3.6-27B-PRISM-EAGLE3).

## Quick start (llama.cpp)

```bash
# 1. no-spec baseline
./llama-server --model Qwen3.6-27B-PRISM-PRO-DQ.gguf

# 2. native MTP speculative decoding (the model's own draft head -- fastest in llama.cpp)
./llama-server --model Qwen3.6-27B-PRISM-PRO-DQ.gguf \
    --spec-type draft-mtp --spec-draft-n-max 1 --spec-draft-n-min 1

# 3. EAGLE-3 chain (needs the WIP PR #18039 patches + the RS-rollback fix --
#    a one-shot llama.cpp patch script is documented alongside the drafter:
#    https://huggingface.co/Ex0bit/Qwen3.6-27B-PRISM-EAGLE3)
./llama-server --model Qwen3.6-27B-PRISM-PRO-DQ.gguf \
    --spec-type draft-eagle3 --model-draft \
    --spec-draft-n-max 2
```

## Provenance

- **Base:** `Qwen/Qwen3.6-27B` (hybrid: 48 GatedDeltaNet linear-attention layers + 16 full-attention layers; hidden size 5120; vocabulary 248,320; native MTP head).
- **PRISM Dynamic Quantization:** PRISM DQ recipe (llama.cpp GGUF dynamic quant), which preserves the MTP draft head (15 tensors) and the full vision tower (333 tensors).

## License

Apache-2.0. Derived from `Qwen/Qwen3.6-27B` (Apache-2.0).
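
## Checking the speedup column

The speedup figures in the Performance table are each configuration's tok/s divided by the no-spec baseline. The sketch below recomputes them from the published numbers; the `speedup` helper and `baseline` variable are illustrative names, not part of llama.cpp or the PRISM tooling. The same helper could be pointed at your own measurements, assuming you collect tok/s under comparable single-stream greedy settings.

```shell
#!/usr/bin/env bash
# Recompute the speedup column from the tok/s figures in the Performance table.
# `baseline` and `speedup` are hypothetical helper names used only for this sketch.

baseline=80   # no-spec baseline, tok/s (from the table)

speedup() {
  # $1 = measured tok/s; prints the ratio to the baseline with two decimals
  awk -v t="$1" -v b="$baseline" 'BEGIN { printf "%.2f", t / b }'
}

echo "native MTP: $(speedup 121)x"   # 121 tok/s from the table
echo "EAGLE-3:    $(speedup 111)x"   # 111 tok/s from the table
```

With the table's values this gives 1.51 for native MTP and 1.39 for the EAGLE-3 chain, matching the published column.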