Qwen3.6-27B Heretic v2-mtp INT4 AutoRound
A W4A16 (INT4 weight, FP16 activation) quantization of huginnfork/Qwen3.6-27B-uncensored-heretic-v2-mtp, produced with Intel's AutoRound and packaged for drop-in vLLM serving with MTP speculative decoding on a single 32 GB GPU.
This release mirrors the recipe and runtime layout of Lorbus/Qwen3.6-27B-int4-AutoRound — the official-base counterpart — but starts from the heretic / abliterated language-model weights for less-restricted generation.
TL;DR
- Base: `huginnfork/Qwen3.6-27B-uncensored-heretic-v2-mtp` (dense `Qwen3_5ForConditionalGeneration`, 64 layers, multimodal, MTP head preserved)
- Quant: INT4 W4A16, `group_size=128`, symmetric, `auto_round:auto_gptq` packing
- Tool: `auto-round` 0.13.0
- Size: ~18 GB on disk (down from ~54 GB BF16) — 3× reduction
- MTP preserved: native Qwen3_5 MTP head kept, 74.5% mean draft acceptance in our tests
- Vision tower: kept BF16 (matches Lorbus); image-text-to-text still works
Lineage
```
Qwen/Qwen3.6-27B (official base, multimodal, dense, Apr 21 2026)
 │
 ├── (abliteration / uncensoring lineage by community)
 │
 └── huginnfork/Qwen3.6-27B-uncensored-heretic-v2-mtp (BF16, ~54 GB, MTP retained)
      │
      └── THIS REPO (INT4 AutoRound, ~18 GB, MTP retained, vision BF16)
```
Sibling reference for the official-base path: Lorbus/Qwen3.6-27B-int4-AutoRound.
Quick inference with vLLM (with MTP speculative decoding)
Tested with vllm/vllm-openai:latest-cu130 (vLLM 0.20.0) on a single RTX 5090 (32 GB, sm_120 / Blackwell).
```bash
docker run --rm --name qwen36-heretic --gpus all --ipc host --network host \
  -e VLLM_USE_FLASHINFER_SAMPLER=1 -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  -v /path/to/model:/model:ro \
  vllm/vllm-openai:latest-cu130 \
  --model /model \
  --served-model-name Qwen3.6-27B-heretic-int4-AutoRound \
  --host 0.0.0.0 --port 8000 \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --max-model-len 262144 \
  --max-num-seqs 1 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.96 \
  --enable-chunked-prefill --enable-prefix-caching \
  --load-format safetensors --trust-remote-code \
  --language-model-only \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
```
Notes:
- `--kv-cache-dtype fp8` is the right pick for mainline vLLM. Lorbus's README mentions `tq-t4nc` (TurboQuant 4-bit KV), which is a `eugr/spark-vllm-docker` fork feature only. Mainline vLLM 0.20.0 does expose `nvfp4`, `turboquant_4bit_nc`, `turboquant_k8v4`, `turboquant_3bit_nc`, etc. in its `CacheDType` literal, but on this model they are unavailable: `nvfp4` has no backend supporting `head_size=256` (Qwen3.6's head_dim), and all `turboquant_*` variants are rejected with `NotImplementedError: TurboQuant KV cache is not supported for hybrid (attention + Mamba) models`, because Qwen3.6's interleaved Gated DeltaNet + full-attention layout classifies as hybrid. So `fp8` is the only path on mainline.
- `--language-model-only` keeps vision modules out of the runtime graph for text-only workloads. Drop it to enable image input.
- `--speculative-config` uses the model's native MTP head as a built-in drafter. `num_speculative_tokens=3` is the sweet spot per Lorbus's tuning.
Multimodal serving
For image input, drop `--language-model-only` and reduce `--max-model-len` to leave room for the vision tower's activation budget (e.g. `--max-model-len 32768 --gpu-memory-utilization 0.94`).
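A minimal image request against that configuration is sketched below. The content-part shape is the standard OpenAI vision format, which vLLM's OpenAI-compatible server accepts for multimodal models; the image URL is a placeholder.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
r = client.chat.completions.create(
    model="Qwen3.6-27B-heretic-int4-AutoRound",
    messages=[{
        "role": "user",
        "content": [
            # Placeholder URL; base64 data URLs also work with vLLM's server.
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.png"}},
            {"type": "text", "text": "Describe this image in two sentences."},
        ],
    }],
    max_tokens=256,
)
print(r.choices[0].message.content)
```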
OpenAI-compatible request
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
r = client.chat.completions.create(
    model="Qwen3.6-27B-heretic-int4-AutoRound",
    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
    max_tokens=512,
)
print(r.choices[0].message.content)
```
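For long generations against the 262k context window, the same endpoint can stream tokens as they are produced. This is standard OpenAI-client usage, not anything specific to this model:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
stream = client.chat.completions.create(
    model="Qwen3.6-27B-heretic-int4-AutoRound",
    messages=[{"role": "user", "content": "Explain MTP speculative decoding."}],
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    # Each chunk carries a delta; content can be None on role/stop chunks.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```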
Quantization details
| Field | Value |
|---|---|
| Base | huginnfork/Qwen3.6-27B-uncensored-heretic-v2-mtp |
| Method | AutoRound (intel/auto-round 0.13.0) |
| Scheme | W4A16 (4-bit weights, FP16 activations) |
| Bits | 4 |
| Group size | 128 |
| Symmetric | yes |
| Packing format | auto_round:auto_gptq |
| Unquantized layers | every linear_attn.in_proj_a/b (48 layers × 2), mtp.fc, all LayerNorms / RMSNorms, full vision tower (BF16) |
| Calibration set | NeelNanda/pile-10k |
| Calibration samples | 128 |
| Sequence length | 2048 |
| GPU used for quant | 1× RTX 5090 (32 GB, sm_120), low_gpu_mem_usage=True, device_map=cpu |
| Quant wall time | ~2h 05min |
| Peak quant memory | RAM 23.77 GiB · VRAM 22.62 GiB |
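For orientation, here is a minimal sketch of the AutoRound call implied by the table above, assuming auto-round's Python `AutoRound` API and its `layer_config` override. The layer-key pattern for the Gated DeltaNet projections is a placeholder; take the real 96 entries from Lorbus's published `quantization_config.json` (see Reproduction below).

```python
# Minimal sketch of the quantization run described above -- illustrative,
# not the exact quantize.py shipped in this repo.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

MODEL_ID = "huginnfork/Qwen3.6-27B-uncensored-heretic-v2-mtp"
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Keep the Gated DeltaNet low-rank projections in 16-bit float: their shapes
# do not divide by group_size=128. The index pattern below is hypothetical;
# mirror the 96 entries from Lorbus's quantization_config.json instead.
layer_config = {
    f"model.layers.{i}.linear_attn.{proj}": {"bits": 16, "data_type": "fp"}
    for i in range(64)  # hypothetical: filter to the 48 linear_attention layers
    for proj in ("in_proj_a", "in_proj_b")
}

autoround = AutoRound(
    model, tokenizer,
    bits=4, group_size=128, sym=True,
    dataset="NeelNanda/pile-10k", nsamples=128, seqlen=2048,
    layer_config=layer_config,
    low_gpu_mem_usage=True,
)
autoround.quantize()
autoround.save_quantized("./qwen36-heretic-int4", format="auto_gptq")
```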
Why these layers stay BF16
- `linear_attn.in_proj_a/b` — low-rank projections in Qwen3.6's Gated DeltaNet; shapes not divisible by `group_size`. Identical exclusion to Lorbus.
- Vision tower (`model.visual.*`) — vLLM's GPTQ-Marlin kernel rejects vision MLP `fc1` (`output_size=4304` not divisible by `min_thread_n=64`). Kept in BF16 to bypass; matches Lorbus exactly.
- Norms, routers, lm_head — precision-sensitive and small.
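The divisibility arithmetic behind the first two exclusions is easy to check; values below come straight from the bullets above:

```python
# Sanity-check of the shape constraints cited above.
min_thread_n = 64   # GPTQ-Marlin column-tile width cited above
group_size = 128    # quantization group size

fc1_out = 4304      # vision MLP fc1 output size
print(fc1_out % min_thread_n)  # 16 -> not divisible, so Marlin rejects the layer
print(fc1_out % group_size)    # 80 -> would not tile into 128-wide groups either
```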
Performance
Cold benchmarks on 1× RTX 5090 (32 GB), vLLM 0.20.0, MTP num_speculative_tokens=3, max-model-len 262144, kv-cache-dtype fp8:
| Prompt | max_tokens | Throughput |
|---|---|---|
| Code (Python fib + memoization explainer) | 1024 | 120.98 tok/s |
| Prose (Chinese, 800-character essay) | 2048 | 132.03 tok/s |
| Short uncensored-check | 256 | 114.78 tok/s |
| Spec-decode metric | Value |
|---|---|
| Mean draft acceptance | 74.5% |
| Per-position acceptance | 0.867 / 0.741 / 0.628 |
| Mean accepted length | 3.24 (k=3) |
| Resource | Value |
|---|---|
| Steady-state VRAM | 28.7 GiB / 32 GiB (matches Lorbus baseline) |
| On-disk size | ~18 GB (2013 tensors, identical layout to Lorbus) |
After warmup we expect to reach Lorbus's reported 139–162 tok/s envelope; cold numbers shown above.
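To reproduce the acceptance figures on your own deployment, vLLM publishes Prometheus counters at `/metrics`. Below is a minimal scraper sketch; the two metric names are assumptions and vary across vLLM versions, so grep the endpoint output to confirm the exact names on your build.

```python
# Read speculative-decoding counters from a running vLLM server.
# Metric names below are assumptions; check `curl localhost:8000/metrics`.
import re
import urllib.request

METRICS_URL = "http://localhost:8000/metrics"

def read_counter(text: str, name: str) -> float:
    # Sum across label sets, since vLLM labels counters by model name.
    pat = re.compile(rf"^{re.escape(name)}(?:\{{[^}}]*\}})?\s+(\S+)\s*$", re.M)
    return sum(float(v) for v in pat.findall(text))

text = urllib.request.urlopen(METRICS_URL).read().decode()
drafted = read_counter(text, "vllm:spec_decode_num_draft_tokens_total")
accepted = read_counter(text, "vllm:spec_decode_num_accepted_tokens_total")
if drafted:
    print(f"mean draft acceptance: {accepted / drafted:.1%}")
```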
Reproduction (advanced)
This repo's full toolchain (Dockerfile, quantize.py, post-quant relabel_keys.py + fix_vision.py, docker-compose.yml, bench.sh) is included for transparent reproduction. Key non-obvious steps:
1. Quantize via `AutoModelForCausalLM`, not the multimodal class — the Conditional class needs a `Qwen3VLProcessor` that ships separately. Force `auto-round`'s `detect_model_type → "llm"` to bypass the MLLM template path.
2. Skip list: feed `layer_config` with every `linear_attention` layer's `linear_attn.in_proj_a/b` set to `bits=16, data_type=fp` (mirrors Lorbus's published `quantization_config.json` exactly — 96 entries for the 48 `linear_attention` layers).
3. Calibration: `NeelNanda/pile-10k`, `nsamples=128`, `seqlen=2048`. Wall time ~2h on RTX 5090.
4. Post-quant `relabel_keys.py`: `AutoModelForCausalLM` flattens `Qwen3_5ForConditionalGeneration` and saves keys as `model.layers.*`. vLLM's serving class for the same arch expects nested `model.language_model.layers.*`. Without this relabel, vLLM's loader hits a fallback path and OOMs at ~30 GiB during cudagraph capture even at 65k context. (A sketch of this step follows the list.)
5. Post-quant `fix_vision.py`: `auto-round`'s missing-tensor pass auto-quantizes `model.visual.blocks.*` via WOQ-RTN. vLLM's GPTQ-Marlin kernel rejects vision MLP `fc1` (output_size 4304 not divisible by 64). Replace with the original BF16 visual tensors from the source; remove `model.visual.blocks` from `block_name_to_quantize`.
6. Restore config skeleton: `auto-round` saves a flat `qwen3_5_text` config. Overlay the original huginnfork `config.json` (root `qwen3_5` + nested `text_config` + `vision_config`) and only inject `quantization_config` from the auto-round output.
7. Compact `model_extra_tensors.safetensors` to drop orphan packed visual tensors after the vision swap (saves ~244 MB and avoids confusing vLLM's safetensors scanner).
Without steps 4–7, the artifact looks valid on disk but either OOMs at startup or rejects vLLM's kernel selection.
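As a concrete illustration of step 4, here is a minimal relabel sketch over plain safetensors shards. File paths and the visual-prefix exception are assumptions based on this repo's layout; the shipped `relabel_keys.py` is authoritative.

```python
# Sketch of the key-relabel step (step 4) on one safetensors shard.
from safetensors.torch import load_file, save_file

PREFIX_OLD = "model."
PREFIX_NEW = "model.language_model."

def relabel_shard(in_path: str, out_path: str) -> None:
    tensors = load_file(in_path)
    renamed = {}
    for key, value in tensors.items():
        # AutoModelForCausalLM saved the text stack flat (model.layers.*);
        # vLLM's Qwen3_5ForConditionalGeneration loader wants it nested.
        if key.startswith(PREFIX_OLD) and not key.startswith(
            (PREFIX_NEW, "model.visual.")
        ):
            key = PREFIX_NEW + key[len(PREFIX_OLD):]
        renamed[key] = value
    save_file(renamed, out_path)

# Remember to rewrite the weight_map in model.safetensors.index.json with the
# same mapping, or vLLM's loader will miss the renamed keys.
```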
Files
| File | Size | Content |
|---|---|---|
| `model-0000{1..5}-of-00005.safetensors` | 17.6 GB | Quantized `model.language_model.layers.*` (64 blocks) |
| `model_extra_tensors.safetensors` | 298 MB | `mtp.*` (29 tensors, INT4 packed) |
| `model-visual-bf16.safetensors` | 921 MB | Vision tower (333 tensors, BF16) |
| `model.safetensors.index.json` | — | 2013 tensors total |
| `config.json` | — | Multimodal `Qwen3_5ForConditionalGeneration` skeleton + `quantization_config` |
Total on disk: ~18 GB.
Known limitations
- Cold-start cudagraph capture is heavy — first request after boot takes a few seconds longer; warm throughput climbs into Lorbus's published envelope.
- `tq-t4nc` 4-bit KV is unavailable here — mainline vLLM only supports up to `fp8`. If you fork `eugr/spark-vllm-docker` you can plug it in identically to Lorbus.
- Vision benchmarking is preliminary — the primary focus was text-with-MTP on a 32 GB budget.
- Heretic / abliterated content: this is an uncensored model. Guardrails removed during the heretic stage are upstream of this quant; please use responsibly.
Acknowledgements
- Alibaba Qwen team for the Qwen3.6-27B base
- @huginnfork for the heretic-v2-mtp upstream
- @Lorbus for publishing the exact quantization_config recipe and 5090 deployment notes that this artifact mirrors
- Intel AutoRound team
- vLLM project for Qwen3_5 MTP integration
License
Apache 2.0 — same as Qwen3.6-27B base. Heretic abliteration is upstream and inherits its license terms from huginnfork/Qwen3.6-27B-uncensored-heretic-v2-mtp.
Citation
```bibtex
@article{cheng2023autoround,
  title   = {Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs},
  author  = {Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal = {arXiv preprint arXiv:2309.05516},
  year    = {2023}
}
```