---
library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3.6-27B/blob/main/LICENSE
pipeline_tag: image-text-to-text
base_model: Qwen/Qwen3.6-27B
base_model_relation: quantized
tags:
- compressed-tensors
- qwen3_6
- int4
- int8
- mixed
- autoround
---

# Qwen3.6-27B Mixed AutoRound

This is an unofficial quantized version of Qwen3.6-27B. It was created using [AutoRound](https://github.com/intel/auto-round) with a custom mixed-precision recipe.

## Quantization details

* This model uses a mixed-precision quantization recipe to balance quality and model size.
* The `self_attn` layers are quantized to 8-bit.
* The MLP layers are generally quantized to 4-bit, but the first 3 and last 3 layers are kept at 8-bit.
* The `lm_head`, `linear_attn`, `visual`, and `mtp.fc` layers are kept unquantized in FP16.

| Field | Custom Mixed Recipe |
|------|------|
| Base | `Qwen/Qwen3.6-27B` |
| Method | AutoRound (`intel/auto-round`), **custom** recipe |
| Scheme | Mixed (W4A16 / W8A16) |
| Bits | 4 & 8 |
| Group size | 128 |
| Symmetric | yes |
| Unquantized layers | `lm_head`, `linear_attn`, `visual`, `mtp.fc` |
| Calibration dataset | `NeelNanda/pile-10k` |
| Calibration samples | 512 |
| Sequence length | 2048 |
| Iterations | 1000 |
| Batch size | 8 |
| torch.compile | enabled |

* For more information, please check `quantize.py` (an illustrative sketch of such a recipe appears near the end of this card).

### KLD Metrics

| Metric | Value | Description |
| :--- | :--- | :--- |
| **Median KLD** | 0.005592 | Median divergence |
| **P90 KLD** | 0.034514 | Divergence at the 90th percentile |
| **Mean KLD** | 0.046941 | Average divergence |
| **Mean Coverage** | 0.994750 | - |

### Evaluation Configuration

| Parameter | Value |
| :--- | :--- |
| **Evaluation Dataset** | wikitext-2-raw-v1 (test) |
| **Sequence Length** | 2048 |
| **Num Samples** | 64 |
| **Total Positions** | 131,008 |
| **Top-K Reference** | 1000 |

(A sketch of how such per-position KLD and coverage statistics can be computed appears near the end of this card.)

## How to use

* This model was tested with the latest `docker.io/vllm/vllm-openai:cu130-nightly` image.
* vLLM is recommended.
* **⚠️ Important note:** Do NOT use `FLASHINFER` as the attention backend (`--attention-backend FLASHINFER`), as it can cause compatibility issues on some setups.
* Example args (for 2x RTX 3090 users):

```bash
vllm serve ./Qwen3.6-27B-mixed-autoround \
  --tensor-parallel-size 2 \
  --attention-backend FLASH_ATTN \
  --performance-mode interactivity \
  --max-model-len auto \
  --max-num-batched-tokens 2048 \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.96 \
  --compilation-config '{"mode":"VLLM_COMPILE","cudagraph_capture_sizes":[4]}' \
  -O3 \
  --async-scheduling \
  --language-model-only \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}' \
  --default-chat-template-kwargs.preserve_thinking true \
  --mamba-cache-mode all \
  --mamba-block-size 8 \
  --enable-prefix-caching \
  --enable-chunked-prefill
```

* With these settings, the full context length fits in memory.
* Note: This information is based on current understanding and testing. Optimal configurations may vary depending on your specific hardware setup. For further details, please refer to the official vLLM documentation.
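Once the server is up, it exposes an OpenAI-compatible API. The snippet below is a minimal sketch using the `openai` Python client; the port (vLLM's default 8000), the served model name, and the prompt are assumptions, so adjust them to your deployment.

```python
# Minimal sketch: querying the vLLM OpenAI-compatible endpoint started above.
# Assumptions: server listens on localhost:8000 (vLLM default) and was launched
# with `vllm serve ./Qwen3.6-27B-mixed-autoround ...` as in the example args.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="./Qwen3.6-27B-mixed-autoround",  # must match the served model path/name
    messages=[
        {"role": "user", "content": "Summarize mixed-precision quantization in two sentences."},
    ],
    max_tokens=256,
    temperature=0.7,
)

print(response.choices[0].message.content)
```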
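For reference, the mixed-precision recipe described under *Quantization details* could be expressed with the AutoRound Python API roughly as sketched below. This is an illustration only, not the actual `quantize.py`: the Qwen-style module names, the `layer_config` keys, and the save format are assumptions, so check the AutoRound documentation and the bundled script for the exact call.

```python
# Illustrative sketch only -- not the quantize.py shipped with this repo.
# Assumes the AutoRound Python API and typical Qwen-style module names
# ("model.layers.{i}.self_attn", "...mlp"); verify both before use.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen3.6-27B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

num_layers = model.config.num_hidden_layers
layer_config = {}
for i in range(num_layers):
    # Attention projections: 8-bit in every layer.
    layer_config[f"model.layers.{i}.self_attn"] = {"bits": 8}
    # MLP: 8-bit for the first 3 and last 3 layers, 4-bit elsewhere.
    mlp_bits = 8 if i < 3 or i >= num_layers - 3 else 4
    layer_config[f"model.layers.{i}.mlp"] = {"bits": mlp_bits}

# Keep these modules unquantized (FP16), as listed in the recipe table.
for name in ("lm_head", "linear_attn", "visual", "mtp.fc"):
    layer_config[name] = {"bits": 16}

autoround = AutoRound(
    model,
    tokenizer,
    bits=4,                      # default precision for layers not listed above
    group_size=128,
    sym=True,
    dataset="NeelNanda/pile-10k",
    nsamples=512,
    seqlen=2048,
    iters=1000,
    batch_size=8,
    layer_config=layer_config,
    enable_torch_compile=True,   # torch.compile enabled, per the recipe table
)
autoround.quantize_and_save("./Qwen3.6-27B-mixed-autoround", format="auto_round")
```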
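The KLD metrics above compare the quantized model's next-token distributions against the full-precision reference over the evaluation text. The exact evaluation script is not included here, so the sketch below only illustrates one plausible way to derive such statistics from two sets of logits; the top-k "coverage" interpretation is an assumption.

```python
# Minimal sketch: per-position KL divergence between a full-precision reference
# model and its quantized counterpart. The top-k "coverage" definition here is
# an assumption about what Mean Coverage in the table above measures.
import torch
import torch.nn.functional as F

def kld_stats(ref_logits: torch.Tensor, quant_logits: torch.Tensor, top_k: int = 1000):
    """ref_logits / quant_logits: [num_positions, vocab_size]."""
    ref_logprobs = F.log_softmax(ref_logits.float(), dim=-1)
    quant_logprobs = F.log_softmax(quant_logits.float(), dim=-1)

    # KL(reference || quantized) per position, summed over the vocabulary.
    kld = (ref_logprobs.exp() * (ref_logprobs - quant_logprobs)).sum(dim=-1)

    # Coverage: reference probability mass captured by the top-k reference tokens.
    coverage = ref_logprobs.exp().topk(top_k, dim=-1).values.sum(dim=-1)

    return {
        "mean_kld": kld.mean().item(),
        "median_kld": kld.median().item(),
        "p90_kld": kld.quantile(0.9).item(),
        "mean_coverage": coverage.mean().item(),
    }
```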
## Acknowledgements

- [Lorbus](https://huggingface.co/Lorbus) for the README.md format
- [Alibaba / Qwen team](https://huggingface.co/Qwen) for the base [Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) model
- [Intel AutoRound](https://github.com/intel/auto-round) team for the quantization framework
- [vLLM project](https://github.com/vllm-project/vllm) for the inference engine and Qwen3_5 MTP support