Qwen3.6-27B Heretic v2-mtp INT4 AutoRound

A W4A16 (INT4 weight, FP16 activation) quantization of huginnfork/Qwen3.6-27B-uncensored-heretic-v2-mtp, produced with Intel's AutoRound and packaged for drop-in vLLM serving with MTP speculative decoding on a single 32 GB GPU.

This release mirrors the recipe and runtime layout of Lorbus/Qwen3.6-27B-int4-AutoRound — the official-base counterpart — but starts from the heretic / abliterated text body for less-restricted generation.

TL;DR

  • Base: huginnfork/Qwen3.6-27B-uncensored-heretic-v2-mtp (dense Qwen3_5ForConditionalGeneration, 64 layers, multimodal, MTP head preserved)
  • Quant: INT4 W4A16, group_size=128, symmetric, auto_round:auto_gptq packing
  • Tool: auto-round 0.13.0
  • Size: ~18 GB on disk (down from ~54 GB BF16) — 3× reduction
  • MTP preserved: native Qwen3_5 MTP head kept, 74.5% mean draft acceptance in our tests
  • Vision tower: kept BF16 (matches Lorbus); image-text-to-text still works

Lineage

Qwen/Qwen3.6-27B                            (official base, multimodal, dense, Apr 21 2026)
        │
        ├── (abliteration / uncensoring lineage by community)
        │
        └── huginnfork/Qwen3.6-27B-uncensored-heretic-v2-mtp   (BF16, ~54 GB, MTP retained)
                  │
                  └── THIS REPO   (INT4 AutoRound, ~18 GB, MTP retained, vision BF16)

Sibling reference for the official-base path: Lorbus/Qwen3.6-27B-int4-AutoRound.

Quick inference with vLLM (with MTP speculative decoding)

Tested with vllm/vllm-openai:latest-cu130 (vLLM 0.20.0) on a single RTX 5090 (32 GB, sm_120 / Blackwell).

docker run --rm --name qwen36-heretic --gpus all --ipc host --network host \
  -e VLLM_USE_FLASHINFER_SAMPLER=1 -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  -v /path/to/model:/model:ro \
  vllm/vllm-openai:latest-cu130 \
  --model /model \
  --served-model-name Qwen3.6-27B-heretic-int4-AutoRound \
  --host 0.0.0.0 --port 8000 \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --max-model-len 262144 \
  --max-num-seqs 1 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.96 \
  --enable-chunked-prefill --enable-prefix-caching \
  --load-format safetensors --trust-remote-code \
  --language-model-only \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}'

Notes:

  • --kv-cache-dtype fp8 is the right pick for mainline vLLM. Lorbus's README mentions tq-t4nc (TurboQuant 4-bit KV), which is available only in the eugr/spark-vllm-docker fork. Mainline vLLM 0.20.0 does expose nvfp4, turboquant_4bit_nc, turboquant_k8v4, turboquant_3bit_nc, etc. in its CacheDType literal, but none of them work on this model: nvfp4 has no backend supporting head_size=256 (Qwen3.6's head_dim), and every turboquant_* variant is rejected with NotImplementedError: TurboQuant KV cache is not supported for hybrid (attention + Mamba) models, because Qwen3.6's interleaved Gated DeltaNet + full-attention layout is classified as hybrid. fp8 is therefore the only option on mainline.
  • --language-model-only keeps vision modules out of the runtime graph for text-only workloads. Drop it to enable image input.
  • --speculative-config uses the model's native MTP head as a built-in drafter. num_speculative_tokens=3 is the sweet spot per Lorbus's tuning.

Multimodal serving

For image input, drop --language-model-only and reduce max-model-len to leave room for the vision tower's activation budget (e.g. --max-model-len 32768 --gpu-memory-utilization 0.94).

OpenAI-compatible request

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
r = client.chat.completions.create(
    model="Qwen3.6-27B-heretic-int4-AutoRound",
    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
    max_tokens=512,
)
print(r.choices[0].message.content)
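
If the server is started without --language-model-only (see Multimodal serving above), image input goes through the same endpoint using the standard OpenAI multimodal message format. A minimal sketch; the image URL is a placeholder, not a real asset:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
# Mixed image + text content; the URL below is a placeholder.
r = client.chat.completions.create(
    model="Qwen3.6-27B-heretic-int4-AutoRound",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }],
    max_tokens=128,
)
print(r.choices[0].message.content)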

Quantization details

Field                 Value
Base                  huginnfork/Qwen3.6-27B-uncensored-heretic-v2-mtp
Method                AutoRound (intel/auto-round 0.13.0)
Scheme                W4A16 (4-bit weights, FP16 activations)
Bits                  4
Group size            128
Symmetric             yes
Packing format        auto_round:auto_gptq
Unquantized layers    every linear_attn.in_proj_a/b (48 layers × 2), mtp.fc, all LayerNorms / RMSNorms, full vision tower (BF16)
Calibration set       NeelNanda/pile-10k
Calibration samples   128
Sequence length       2048
GPU used for quant    1× RTX 5090 (32 GB, sm_120), low_gpu_mem_usage=True, device_map=cpu
Quant wall time       ~2h 05min
Peak quant memory     RAM 23.77 GiB · VRAM 22.62 GiB

Why these layers stay BF16

  • linear_attn.in_proj_a/b — low-rank projections in Qwen3.6's Gated DeltaNet, shapes not divisible by group_size; identical exclusion to Lorbus.
  • Vision tower (model.visual.*) — vLLM's GPTQ-Marlin kernel rejects vision MLP fc1 (output_size=4304 not divisible by min_thread_n=64). Kept in BF16 to bypass; matches Lorbus exactly.
  • Norms, routers, lm_head — precision-sensitive and small.
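
For intuition, the two divisibility constraints behind these exclusions can be checked with a few lines. This is illustrative only (not one of the shipped scripts); the constants are the ones quoted in this README:

# Illustrative check of the divisibility constraints described above.
GROUP_SIZE = 128      # AutoRound weight group size (input dim must divide evenly)
MIN_THREAD_N = 64     # GPTQ-Marlin output-dim granularity in vLLM

vision_fc1_out = 4304                 # vision MLP fc1 output_size quoted above
print(vision_fc1_out % MIN_THREAD_N)  # -> 16: not divisible, so Marlin rejects the layer

def int4_friendly(in_features: int, out_features: int) -> bool:
    # Rough packability test; the real skip list mirrors Lorbus's published config.
    return in_features % GROUP_SIZE == 0 and out_features % MIN_THREAD_N == 0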

Performance

Cold benchmarks on 1× RTX 5090 (32 GB), vLLM 0.20.0, MTP num_speculative_tokens=3, max-model-len 262144, kv-cache-dtype fp8:

Prompt                                       max_tokens   Throughput
Code (Python fib + memoization explainer)    1024         120.98 tok/s
Prose (Chinese, 800-character essay)         2048         132.03 tok/s
Short uncensored-check                       256          114.78 tok/s

Spec-decode metric        Value
Mean draft acceptance     74.5%
Per-position acceptance   0.867 / 0.741 / 0.628
Mean accepted length      3.24 (k=3)

Resource            Value
Steady-state VRAM   28.7 GiB / 32 GiB (matches Lorbus baseline)
On-disk size        ~18 GB (2013 tensors, identical layout to Lorbus)

After warmup we expect throughput to climb into Lorbus's reported 139–162 tok/s envelope; the numbers above are cold-start.

Reproduction (advanced)

This repo's full toolchain (Dockerfile, quantize.py, post-quant relabel_keys.py + fix_vision.py, docker-compose.yml, bench.sh) is included for transparent reproduction. Key non-obvious steps:

  1. Quantize via AutoModelForCausalLM, not the multimodal class — the Conditional class needs a Qwen3VLProcessor that ships separately. Force auto-round's detect_model_type → "llm" to bypass the MLLM template path.
  2. Skip list: feed layer_config with every linear_attention layer's linear_attn.in_proj_a/b set to bits=16, data_type=fp (mirrors Lorbus's published quantization_config.json exactly — 96 entries for the 48 linear_attention layers). A condensed sketch of steps 1–3 follows this list.
  3. Calibration: NeelNanda/pile-10k, nsamples=128, seqlen=2048. Wall ~2h on RTX 5090.
  4. Post-quant relabel_keys.py: AutoModelForCausalLM flattens Qwen3_5ForConditionalGeneration and saves keys as model.layers.*. vLLM's serving class for the same arch expects nested model.language_model.layers.*. Without this relabel, vLLM's loader hits a fallback path and OOMs at ~30 GiB during cudagraph capture even at 65k context. A minimal relabel sketch appears below.
  5. Post-quant fix_vision.py: auto-round's missing-tensor pass auto-quantizes model.visual.blocks.* via WOQ-RTN. vLLM's GPTQ-Marlin kernel rejects vision MLP fc1 (output_size 4304 not divisible by 64). Replace with original BF16 visual tensors from the source; remove model.visual.blocks from block_name_to_quantize.
  6. Restore config skeleton: auto-round saves a flat qwen3_5_text config. Overlay the original huginnfork config.json (root qwen3_5 + nested text_config + vision_config) and only inject quantization_config from auto-round output.
  7. Compact model_extra_tensors.safetensors to drop orphan packed visual tensors after vision swap (saves ~244 MB and avoids confusing vLLM's safetensors scanner).

Without steps 4–7, the artifact looks valid on disk but either OOMs at startup or is rejected during vLLM's kernel selection.
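
And a minimal sketch of the step-4 relabel. This is not the shipped relabel_keys.py; it shows only the key rename itself and skips the model.safetensors.index.json bookkeeping the real script also handles:

# Sketch of step 4: rename flat model.layers.* keys to the nested
# model.language_model.layers.* layout vLLM expects for this architecture.
from safetensors.torch import load_file, save_file

def relabel_shard(path_in: str, path_out: str) -> None:
    tensors = load_file(path_in)
    renamed = {}
    for key, tensor in tensors.items():
        if key.startswith("model.layers."):
            key = "model.language_model." + key[len("model."):]
        renamed[key] = tensor
    save_file(renamed, path_out)

relabel_shard("model-00001-of-00005.safetensors", "relabelled-00001-of-00005.safetensors")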

Files

File                                     Size      Content
model-0000{1..5}-of-00005.safetensors    17.6 GB   Quantized model.language_model.layers.* (64 blocks)
model_extra_tensors.safetensors          298 MB    mtp.* (29 tensors, INT4 packed)
model-visual-bf16.safetensors            921 MB    Vision tower (333 tensors, BF16)
model.safetensors.index.json             -         2013 tensors total
config.json                              -         Multimodal Qwen3_5ForConditionalGeneration skeleton + quantization_config

Total on disk: ~18 GB.

Known limitations

  • Cold-start cudagraph capture is heavy — first request after boot takes a few seconds longer; warm throughput climbs into Lorbus's published envelope.
  • tq-t4nc 4-bit KV is unavailable here — mainline vLLM only supports up to fp8. If you fork eugr/spark-vllm-docker you can plug it in identically to Lorbus.
  • Vision benchmarking is preliminary — primary focus was text-with-MTP on a 32 GB budget.
  • Heretic / abliterated content: this is an uncensored model. Guardrails removed during the heretic stage are upstream of this quant; please use responsibly.

Acknowledgements

Thanks to the Qwen team for the base model, huginnfork for the heretic / abliterated fine-tune, Intel's auto-round team for the quantization tooling, and Lorbus for the reference INT4 recipe this release mirrors.

License

Apache 2.0 — same as Qwen3.6-27B base. Heretic abliteration is upstream and inherits its license terms from huginnfork/Qwen3.6-27B-uncensored-heretic-v2-mtp.

Citation

@article{cheng2023autoround,
  title   = {Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs},
  author  = {Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal = {arXiv preprint arXiv:2309.05516},
  year    = {2023}
}