Qwen3.6-27B Heretic v2-mtp INT4 AutoRound
A W4A16 (INT4 weight, FP16 activation) quantization of huginnfork/Qwen3.6-27B-uncensored-heretic-v2-mtp, produced with Intel's AutoRound and packaged for drop-in vLLM serving with MTP speculative decoding on a single 32 GB GPU.
This release mirrors the recipe and runtime layout of Lorbus/Qwen3.6-27B-int4-AutoRound — the official-base counterpart — but starts from the heretic / abliterated language-model weights for less-restricted generation.
TL;DR
- Base: `huginnfork/Qwen3.6-27B-uncensored-heretic-v2-mtp` (dense `Qwen3_5ForConditionalGeneration`, 64 layers, multimodal, MTP head preserved)
- Quant: INT4 W4A16, `group_size=128`, symmetric, `auto_round:auto_gptq` packing
- Tool: `auto-round` 0.13.0
- Size: ~18 GB on disk (down from ~54 GB BF16) — 3× reduction
- MTP preserved: native Qwen3_5 MTP head kept, 74.5% mean draft acceptance in our tests
- Vision tower: kept BF16 (matches Lorbus); image-text-to-text still works
Lineage
```
Qwen/Qwen3.6-27B (official base, multimodal, dense, Apr 21 2026)
 │
 ├── (abliteration / uncensoring lineage by community)
 │
 └── huginnfork/Qwen3.6-27B-uncensored-heretic-v2-mtp (BF16, ~54 GB, MTP retained)
      │
      └── THIS REPO (INT4 AutoRound, ~18 GB, MTP retained, vision BF16)
```
Sibling reference for the official-base path: Lorbus/Qwen3.6-27B-int4-AutoRound.
Quick inference with vLLM (with MTP speculative decoding)
Tested with vllm/vllm-openai:latest-cu130 (vLLM 0.20.0) on a single RTX 5090 (32 GB, sm_120 / Blackwell).
```bash
docker run --rm --name qwen36-heretic --gpus all --ipc host --network host \
  -e VLLM_USE_FLASHINFER_SAMPLER=1 -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  -v /path/to/model:/model:ro \
  vllm/vllm-openai:latest-cu130 \
  --model /model \
  --served-model-name Qwen3.6-27B-heretic-int4-AutoRound \
  --host 0.0.0.0 --port 8000 \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --max-model-len 262144 \
  --max-num-seqs 1 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.96 \
  --enable-chunked-prefill --enable-prefix-caching \
  --load-format safetensors --trust-remote-code \
  --language-model-only \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
```
Notes:
- `--kv-cache-dtype fp8` is the right pick for mainline vLLM. Lorbus's README mentions `tq-t4nc` (TurboQuant 4-bit KV), which is a `eugr/spark-vllm-docker` fork feature only. Mainline vLLM 0.20.0 does expose `nvfp4`, `turboquant_4bit_nc`, `turboquant_k8v4`, `turboquant_3bit_nc`, etc. in its `CacheDType` literal, but on this model they are unavailable: `nvfp4` has no backend supporting `head_size=256` (Qwen3.6's head_dim), and all `turboquant_*` variants are rejected with `NotImplementedError: TurboQuant KV cache is not supported for hybrid (attention + Mamba) models`, because Qwen3.6's interleaved Gated DeltaNet + full-attention layout classifies as hybrid. So `fp8` is the only path on mainline.
- `--language-model-only` keeps vision modules out of the runtime graph for text-only workloads. Drop it to enable image input.
- `--speculative-config` uses the model's native MTP head as a built-in drafter. `num_speculative_tokens=3` is the sweet spot per Lorbus's tuning.
Multimodal serving
For image input, drop `--language-model-only` and reduce `--max-model-len` to leave room for the vision tower's activation budget (e.g. `--max-model-len 32768 --gpu-memory-utilization 0.94`).
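A minimal image request against that configuration is sketched below. The content-part shape is the standard OpenAI vision format, which vLLM's OpenAI-compatible server accepts for multimodal models; the image URL is a placeholder.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
r = client.chat.completions.create(
    model="Qwen3.6-27B-heretic-int4-AutoRound",
    messages=[{
        "role": "user",
        "content": [
            # Placeholder URL; base64 data URLs also work with vLLM's server.
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.png"}},
            {"type": "text", "text": "Describe this image in two sentences."},
        ],
    }],
    max_tokens=256,
)
print(r.choices[0].message.content)
```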
OpenAI-compatible request
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
r = client.chat.completions.create(
    model="Qwen3.6-27B-heretic-int4-AutoRound",
    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
    max_tokens=512,
)
print(r.choices[0].message.content)
```
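For long generations against the 262k context window, the same endpoint can stream tokens as they are produced. This is standard OpenAI-client usage, not anything specific to this model:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
stream = client.chat.completions.create(
    model="Qwen3.6-27B-heretic-int4-AutoRound",
    messages=[{"role": "user", "content": "Explain MTP speculative decoding."}],
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    # Each chunk carries a delta; content can be None on role/stop chunks.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```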
Quantization details
| Field | Value |
|---|---|
| Base | huginnfork/Qwen3.6-27B-uncensored-heretic-v2-mtp |
| Method | AutoRound (intel/auto-round 0.13.0) |
| Scheme | W4A16 (4-bit weights, FP16 activations) |
| Bits | 4 |
| Group size | 128 |
| Symmetric | yes |
| Packing format | auto_round:auto_gptq |
| Unquantized layers | every linear_attn.in_proj_a/b (48 layers × 2), mtp.fc, all LayerNorms / RMSNorms, full vision tower (BF16) |
| Calibration set | NeelNanda/pile-10k |
| Calibration samples | 128 |
| Sequence length | 2048 |
| GPU used for quant | 1× RTX 5090 (32 GB, sm_120), low_gpu_mem_usage=True, device_map=cpu |
| Quant wall time | ~2h 05min |
| Peak quant memory | RAM 23.77 GiB · VRAM 22.62 GiB |
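For orientation, here is a minimal sketch of the AutoRound call implied by the table above, assuming auto-round's Python `AutoRound` API and its `layer_config` override. The layer-key pattern for the Gated DeltaNet projections is a placeholder; take the real 96 entries from Lorbus's published `quantization_config.json` (see Reproduction below).

```python
# Minimal sketch of the quantization run described above -- illustrative,
# not the exact quantize.py shipped in this repo.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

MODEL_ID = "huginnfork/Qwen3.6-27B-uncensored-heretic-v2-mtp"
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Keep the Gated DeltaNet low-rank projections in 16-bit float: their shapes
# do not divide by group_size=128. The index pattern below is hypothetical;
# mirror the 96 entries from Lorbus's quantization_config.json instead.
layer_config = {
    f"model.layers.{i}.linear_attn.{proj}": {"bits": 16, "data_type": "fp"}
    for i in range(64)  # hypothetical: filter to the 48 linear_attention layers
    for proj in ("in_proj_a", "in_proj_b")
}

autoround = AutoRound(
    model, tokenizer,
    bits=4, group_size=128, sym=True,
    dataset="NeelNanda/pile-10k", nsamples=128, seqlen=2048,
    layer_config=layer_config,
    low_gpu_mem_usage=True,
)
autoround.quantize()
autoround.save_quantized("./qwen36-heretic-int4", format="auto_gptq")
```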
Why these layers stay BF16
- `linear_attn.in_proj_a/b` — low-rank projections in Qwen3.6's Gated DeltaNet; shapes not divisible by `group_size`. Identical exclusion to Lorbus.
- Vision tower (`model.visual.*`) — vLLM's GPTQ-Marlin kernel rejects vision MLP `fc1` (`output_size=4304` not divisible by `min_thread_n=64`). Kept in BF16 to bypass; matches Lorbus exactly.
- Norms, routers, lm_head — precision-sensitive and small.
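The divisibility arithmetic behind the first two exclusions is easy to check; values below come straight from the bullets above:

```python
# Sanity-check of the shape constraints cited above.
min_thread_n = 64   # GPTQ-Marlin column-tile width cited above
group_size = 128    # quantization group size

fc1_out = 4304      # vision MLP fc1 output size
print(fc1_out % min_thread_n)  # 16 -> not divisible, so Marlin rejects the layer
print(fc1_out % group_size)    # 80 -> would not tile into 128-wide groups either
```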
Performance
Cold benchmarks on 1× RTX 5090 (32 GB), vLLM 0.20.0, MTP num_speculative_tokens=3, max-model-len 262144, kv-cache-dtype fp8:
| Prompt | max_tokens | Throughput |
|---|---|---|
| Code (Python fib + memoization explainer) | 1024 | 120.98 tok/s |
| Prose (Chinese, 800-character essay) | 2048 | 132.03 tok/s |
| Short uncensored-check | 256 | 114.78 tok/s |
| Spec-decode metric | Value |
|---|---|
| Mean draft acceptance | 74.5% |
| Per-position acceptance | 0.867 / 0.741 / 0.628 |
| Mean accepted length | 3.24 (k=3) |
| Resource | Value |
|---|---|
| Steady-state VRAM | 28.7 GiB / 32 GiB (matches Lorbus baseline) |
| On-disk size | ~18 GB (2013 tensors, identical layout to Lorbus) |
After warmup we expect to reach Lorbus's reported 139–162 tok/s envelope; cold numbers shown above.
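To reproduce the acceptance figures on your own deployment, vLLM publishes Prometheus counters at `/metrics`. Below is a minimal scraper sketch; the two metric names are assumptions and vary across vLLM versions, so grep the endpoint output to confirm the exact names on your build.

```python
# Read speculative-decoding counters from a running vLLM server.
# Metric names below are assumptions; check `curl localhost:8000/metrics`.
import re
import urllib.request

METRICS_URL = "http://localhost:8000/metrics"

def read_counter(text: str, name: str) -> float:
    # Sum across label sets, since vLLM labels counters by model name.
    pat = re.compile(rf"^{re.escape(name)}(?:\{{[^}}]*\}})?\s+(\S+)\s*$", re.M)
    return sum(float(v) for v in pat.findall(text))

text = urllib.request.urlopen(METRICS_URL).read().decode()
drafted = read_counter(text, "vllm:spec_decode_num_draft_tokens_total")
accepted = read_counter(text, "vllm:spec_decode_num_accepted_tokens_total")
if drafted:
    print(f"mean draft acceptance: {accepted / drafted:.1%}")
```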
Reproduction (advanced)
This repo's full toolchain (Dockerfile, quantize.py, post-quant relabel_keys.py + fix_vision.py, docker-compose.yml, bench.sh) is included for transparent reproduction. Key non-obvious steps:
1. Quantize via `AutoModelForCausalLM`, not the multimodal class — the Conditional class needs a `Qwen3VLProcessor` that ships separately. Force `auto-round`'s `detect_model_type → "llm"` to bypass the MLLM template path.
2. Skip list: feed `layer_config` with every `linear_attention` layer's `linear_attn.in_proj_a/b` set to `bits=16, data_type=fp` (mirrors Lorbus's published `quantization_config.json` exactly — 96 entries for the 48 `linear_attention` layers).
3. Calibration: `NeelNanda/pile-10k`, `nsamples=128`, `seqlen=2048`. Wall time ~2h on RTX 5090.
4. Post-quant `relabel_keys.py`: `AutoModelForCausalLM` flattens `Qwen3_5ForConditionalGeneration` and saves keys as `model.layers.*`. vLLM's serving class for the same arch expects nested `model.language_model.layers.*`. Without this relabel, vLLM's loader hits a fallback path and OOMs at ~30 GiB during cudagraph capture even at 65k context. (A sketch of this step follows the list.)
5. Post-quant `fix_vision.py`: `auto-round`'s missing-tensor pass auto-quantizes `model.visual.blocks.*` via WOQ-RTN. vLLM's GPTQ-Marlin kernel rejects vision MLP `fc1` (output_size 4304 not divisible by 64). Replace with the original BF16 visual tensors from the source; remove `model.visual.blocks` from `block_name_to_quantize`.
6. Restore config skeleton: `auto-round` saves a flat `qwen3_5_text` config. Overlay the original huginnfork `config.json` (root `qwen3_5` + nested `text_config` + `vision_config`) and only inject `quantization_config` from the auto-round output.
7. Compact `model_extra_tensors.safetensors` to drop orphan packed visual tensors after the vision swap (saves ~244 MB and avoids confusing vLLM's safetensors scanner).
Without steps 4–7, the artifact looks valid on disk but either OOMs at startup or rejects vLLM's kernel selection.
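As a concrete illustration of step 4, here is a minimal relabel sketch over plain safetensors shards. File paths and the visual-prefix exception are assumptions based on this repo's layout; the shipped `relabel_keys.py` is authoritative.

```python
# Sketch of the key-relabel step (step 4) on one safetensors shard.
from safetensors.torch import load_file, save_file

PREFIX_OLD = "model."
PREFIX_NEW = "model.language_model."

def relabel_shard(in_path: str, out_path: str) -> None:
    tensors = load_file(in_path)
    renamed = {}
    for key, value in tensors.items():
        # AutoModelForCausalLM saved the text stack flat (model.layers.*);
        # vLLM's Qwen3_5ForConditionalGeneration loader wants it nested.
        if key.startswith(PREFIX_OLD) and not key.startswith(
            (PREFIX_NEW, "model.visual.")
        ):
            key = PREFIX_NEW + key[len(PREFIX_OLD):]
        renamed[key] = value
    save_file(renamed, out_path)

# Remember to rewrite the weight_map in model.safetensors.index.json with the
# same mapping, or vLLM's loader will miss the renamed keys.
```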
Files
| File | Size | Content |
|---|---|---|
| `model-0000{1..5}-of-00005.safetensors` | 17.6 GB | Quantized `model.language_model.layers.*` (64 blocks) |
| `model_extra_tensors.safetensors` | 298 MB | `mtp.*` (29 tensors, INT4 packed) |
| `model-visual-bf16.safetensors` | 921 MB | Vision tower (333 tensors, BF16) |
| `model.safetensors.index.json` | — | 2013 tensors total |
| `config.json` | — | Multimodal `Qwen3_5ForConditionalGeneration` skeleton + `quantization_config` |
Total on disk: ~18 GB.
Known limitations
- Cold-start cudagraph capture is heavy — first request after boot takes a few seconds longer; warm throughput climbs into Lorbus's published envelope.
- `tq-t4nc` 4-bit KV is unavailable here — mainline vLLM only supports up to `fp8`. If you fork `eugr/spark-vllm-docker` you can plug it in identically to Lorbus.
- Vision benchmarking is preliminary — the primary focus was text-with-MTP on a 32 GB budget.
- Heretic / abliterated content: this is an uncensored model. Guardrails removed during the heretic stage are upstream of this quant; please use responsibly.
Acknowledgements
- Alibaba Qwen team for the Qwen3.6-27B base
- @huginnfork for the heretic-v2-mtp upstream
- @Lorbus for publishing the exact quantization_config recipe and 5090 deployment notes that this artifact mirrors
- Intel AutoRound team
- vLLM project for Qwen3_5 MTP integration
License
Apache 2.0 — same as Qwen3.6-27B base. Heretic abliteration is upstream and inherits its license terms from huginnfork/Qwen3.6-27B-uncensored-heretic-v2-mtp.
Citation
```bibtex
@article{cheng2023autoround,
  title   = {Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs},
  author  = {Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal = {arXiv preprint arXiv:2309.05516},
  year    = {2023}
}
```