Ex0bit's picture
Update README.md
d3f5d79 verified
metadata
license: apache-2.0
base_model:
  - Qwen/Qwen3.6-27B
library_name: gguf
pipeline_tag: text-generation
tags:
  - gguf
  - quantized
  - dynamic-quant
  - qwen3
  - llama.cpp
  - speculative-decoding

Qwen3.6-27B-PRISM-PRO — DQ GGUF

llama.cpp-native GGUF quantization of Qwen3.6-27B-PRISM-PRO using the PRISM project's dynamic-quant (DQ) recipe. ~13.7 GB (vs 55 GB BF16).

PRISM-PRO of Qwen/Qwen3.6-27B (bias/propoganda removal) This GGUF preserves the model's native MTP draft head + full vision tower, and pairs with the separately-published EAGLE-3 drafter for lossless faster decode.

Performance

llama.cpp on a single NVIDIA Blackwell GPU, single-stream greedy decode:

config tok/s speedup
no-spec baseline 80 1.00×
native MTP (built-in draft head) 121 1.51×
EAGLE-3 chain (with our drafter) 111 1.39×

Speculative decoding is lossless (output token-identical to non-spec greedy, modulo batched-verify floating-point non-associativity intrinsic to all spec decoding). For a faster SGLang deployment (~183 tok/s, ~1.97× over no-spec) using the BF16 target + EAGLE-3, see the drafter repo.

Quick start (llama.cpp)

# 1. no-spec baseline
./llama-server --model Qwen3.6-27B-PRISM-PRO-DQ.gguf

# 2. native MTP speculative decoding (the model's own draft head -- fastest in llama.cpp)
./llama-server --model Qwen3.6-27B-PRISM-PRO-DQ.gguf \
    --spec-type draft-mtp --spec-draft-n-max 1 --spec-draft-n-min 1

# 3. EAGLE-3 chain (needs the WIP PR #18039 patches + the RS-rollback fix --
#    a one-shot llama.cpp patch script is documented alongside the drafter:
#    https://huggingface.co/Ex0bit/Qwen3.6-27B-PRISM-EAGLE3)
./llama-server --model Qwen3.6-27B-PRISM-PRO-DQ.gguf \
    --spec-type draft-eagle3 --model-draft <eagle3-drafter.gguf> \
    --spec-draft-n-max 2

Provenance

  • Base: Qwen/Qwen3.6-27B (hybrid: 48 GatedDeltaNet linear-attention layers
    • 16 full-attention layers; hidden 5120; vocab 248 320; native MTP head).
  • PRISM Dynamic Quantization: PRISM DQ recipe (llama.cpp GGUF dynamic quant) — preserves the MTP draft head (15 tensors) and the full vision tower (333 tensors).

License

Apache-2.0. Derived from Qwen/Qwen3.6-27B (Apache-2.0).