---
license: apache-2.0
base_model:
  - Qwen/Qwen3.6-27B
library_name: gguf
pipeline_tag: text-generation
tags:
  - gguf
  - quantized
  - dynamic-quant
  - qwen3
  - llama.cpp
  - speculative-decoding
---

# Qwen3.6-27B-PRISM-PRO — DQ GGUF

llama.cpp-native GGUF quantization of `Qwen3.6-27B-PRISM-PRO` using the PRISM
project's **dynamic-quant (DQ)** recipe. **~13.7 GB** (vs 55 GB BF16).

PRISM-PRO of `Qwen/Qwen3.6-27B` (bias/propoganda removal)
This GGUF preserves the model's native MTP draft head + full vision
tower, and pairs with the separately-published
[EAGLE-3 drafter](https://huggingface.co/Ex0bit/Qwen3.6-27B-PRISM-EAGLE3) for
lossless faster decode.

## Performance

llama.cpp on a single NVIDIA  Blackwell GPU, single-stream greedy decode:

| config | tok/s | speedup |
|---|--:|--:|
| no-spec baseline | 80 | 1.00× |
| **native MTP** (built-in draft head) | **121** | **1.51×** |
| EAGLE-3 chain (with our drafter) | 111 | 1.39× |

Speculative decoding is **lossless** (output token-identical to non-spec greedy,
modulo batched-verify floating-point non-associativity intrinsic to all spec
decoding). For a faster SGLang deployment (~183 tok/s, ~1.97× over no-spec)
using the BF16 target + EAGLE-3, see the
[drafter repo](https://huggingface.co/Ex0bit/Qwen3.6-27B-PRISM-EAGLE3).

## Quick start (llama.cpp)

```bash
# 1. no-spec baseline
./llama-server --model Qwen3.6-27B-PRISM-PRO-DQ.gguf

# 2. native MTP speculative decoding (the model's own draft head -- fastest in llama.cpp)
./llama-server --model Qwen3.6-27B-PRISM-PRO-DQ.gguf \
    --spec-type draft-mtp --spec-draft-n-max 1 --spec-draft-n-min 1

# 3. EAGLE-3 chain (needs the WIP PR #18039 patches + the RS-rollback fix --
#    a one-shot llama.cpp patch script is documented alongside the drafter:
#    https://huggingface.co/Ex0bit/Qwen3.6-27B-PRISM-EAGLE3)
./llama-server --model Qwen3.6-27B-PRISM-PRO-DQ.gguf \
    --spec-type draft-eagle3 --model-draft <eagle3-drafter.gguf> \
    --spec-draft-n-max 2
```

## Provenance

- **Base:** `Qwen/Qwen3.6-27B` (hybrid: 48 GatedDeltaNet linear-attention layers
  + 16 full-attention layers; hidden 5120; vocab 248 320; native MTP head).
- **PRISM Dynamic Quantization:** PRISM DQ recipe (llama.cpp GGUF dynamic quant) — preserves
  the MTP draft head (15 tensors) and the full vision tower (333 tensors).

## License

Apache-2.0. Derived from `Qwen/Qwen3.6-27B` (Apache-2.0).