Ex0bit's picture
Update README.md
d3f5d79 verified
---
license: apache-2.0
base_model:
- Qwen/Qwen3.6-27B
library_name: gguf
pipeline_tag: text-generation
tags:
- gguf
- quantized
- dynamic-quant
- qwen3
- llama.cpp
- speculative-decoding
---
# Qwen3.6-27B-PRISM-PRO — DQ GGUF
llama.cpp-native GGUF quantization of `Qwen3.6-27B-PRISM-PRO` using the PRISM
project's **dynamic-quant (DQ)** recipe. **~13.7 GB** (vs 55 GB BF16).
PRISM-PRO of `Qwen/Qwen3.6-27B` (bias/propoganda removal)
This GGUF preserves the model's native MTP draft head + full vision
tower, and pairs with the separately-published
[EAGLE-3 drafter](https://huggingface.co/Ex0bit/Qwen3.6-27B-PRISM-EAGLE3) for
lossless faster decode.
## Performance
llama.cpp on a single NVIDIA Blackwell GPU, single-stream greedy decode:
| config | tok/s | speedup |
|---|--:|--:|
| no-spec baseline | 80 | 1.00× |
| **native MTP** (built-in draft head) | **121** | **1.51×** |
| EAGLE-3 chain (with our drafter) | 111 | 1.39× |
Speculative decoding is **lossless** (output token-identical to non-spec greedy,
modulo batched-verify floating-point non-associativity intrinsic to all spec
decoding). For a faster SGLang deployment (~183 tok/s, ~1.97× over no-spec)
using the BF16 target + EAGLE-3, see the
[drafter repo](https://huggingface.co/Ex0bit/Qwen3.6-27B-PRISM-EAGLE3).
## Quick start (llama.cpp)
```bash
# 1. no-spec baseline
./llama-server --model Qwen3.6-27B-PRISM-PRO-DQ.gguf
# 2. native MTP speculative decoding (the model's own draft head -- fastest in llama.cpp)
./llama-server --model Qwen3.6-27B-PRISM-PRO-DQ.gguf \
--spec-type draft-mtp --spec-draft-n-max 1 --spec-draft-n-min 1
# 3. EAGLE-3 chain (needs the WIP PR #18039 patches + the RS-rollback fix --
# a one-shot llama.cpp patch script is documented alongside the drafter:
# https://huggingface.co/Ex0bit/Qwen3.6-27B-PRISM-EAGLE3)
./llama-server --model Qwen3.6-27B-PRISM-PRO-DQ.gguf \
--spec-type draft-eagle3 --model-draft <eagle3-drafter.gguf> \
--spec-draft-n-max 2
```
## Provenance
- **Base:** `Qwen/Qwen3.6-27B` (hybrid: 48 GatedDeltaNet linear-attention layers
+ 16 full-attention layers; hidden 5120; vocab 248 320; native MTP head).
- **PRISM Dynamic Quantization:** PRISM DQ recipe (llama.cpp GGUF dynamic quant) — preserves
the MTP draft head (15 tensors) and the full vision tower (333 tensors).
## License
Apache-2.0. Derived from `Qwen/Qwen3.6-27B` (Apache-2.0).