---
library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3.6-27B/blob/main/LICENSE
pipeline_tag: image-text-to-text
base_model: Qwen/Qwen3.6-27B
base_model_relation: quantized
tags:
- compressed-tensors
- qwen3_6
- int4
- int8
- mixed
- autoround
---
# Qwen3.6-27B Mixed AutoRound
This is an unofficial quantized version of Qwen3.6-27B.
It was created using [AutoRound](https://github.com/intel/auto-round) with a custom mixed-precision recipe.
## Quantization details
* This model uses a mixed-precision quantization recipe to balance quality and model size.
* The `self_attn` layers are quantized to 8-bit.
* The MLP layers are generally quantized to 4-bit, but the first 3 and last 3 layers are kept at 8-bit.
* The `lm_head`, `linear_attn`, `visual`, and `mtp.fc` layers are kept unquantized in FP16.
| Field | Custom Mixed Recipe |
|------|------|
| Base | `Qwen/Qwen3.6-27B` |
| Method | AutoRound (`intel/auto-round`), **custom** recipe |
| Scheme | Mixed (W4A16 / W8A16) |
| Bits | 4 & 8 |
| Group size | 128 |
| Symmetric | yes |
| Unquantized layers | `lm_head`, `linear_attn`, `visual`, `mtp.fc` |
| Calibration dataset | `NeelNanda/pile-10k` |
| Calibration samples | 512 |
| Sequence length | 2048 |
| Iterations | 1000 |
| Batch size | 8 |
| torch.compile | enabled |
* For more information, please check `quantize.py`.
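The actual `quantize.py` is not reproduced here, but the sketch below shows roughly what the recipe above could look like with the AutoRound Python API. The model class, the layer-name patterns, and the export format are assumptions and may need adjusting to the real model structure and auto-round version.

```python
# Rough sketch only: NOT the actual quantize.py. Layer-name patterns, the model class,
# and the export format are assumptions and may need adjustment.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen3.6-27B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

num_layers = model.config.num_hidden_layers
layer_config = {}
for i in range(num_layers):
    # Attention projections: 8-bit everywhere.
    layer_config[f"model.layers.{i}.self_attn"] = {"bits": 8}
    # MLP: 8-bit for the first 3 and last 3 layers, 4-bit elsewhere.
    layer_config[f"model.layers.{i}.mlp"] = {"bits": 8 if i < 3 or i >= num_layers - 3 else 4}
# Layers kept unquantized in FP16; 16 bits tells AutoRound to skip them.
for name in ("lm_head", "linear_attn", "visual", "mtp.fc"):
    layer_config[name] = {"bits": 16}

autoround = AutoRound(
    model,
    tokenizer,
    bits=4,                         # default for layers not listed in layer_config
    group_size=128,
    sym=True,
    dataset="NeelNanda/pile-10k",
    nsamples=512,
    seqlen=2048,
    iters=1000,
    batch_size=8,
    layer_config=layer_config,
)
autoround.quantize()
autoround.save_quantized("./Qwen3.6-27B-mixed-autoround", format="auto_round")
```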
### KLD Metrics
| Metric | Value | Description |
| :--- | :--- | :--- |
| **Median KLD** | 0.005592 | Median divergence |
| **P90 KLD** | 0.034514 | Divergence at the 90th percentile |
| **Mean KLD** | 0.046941 | Average divergence |
| **Mean Coverage** | 0.994750 | - |
### Evaluation Configuration
| Parameter | Value |
| :--- | :--- |
| **Evaluation Dataset** | wikitext-2-raw-v1 (test) |
| **Sequence Length** | 2048 |
| **Num Samples** | 64 |
| **Total Positions** | 131,008 |
| **Top-K Reference** | 1000 |
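The metrics above compare the quantized model's next-token distributions against the full-precision reference at the evaluation positions. Below is a minimal sketch of how per-position KL divergence and top-K coverage could be computed from paired logits; the exact definitions used for this card (in particular "coverage") are assumptions, not the actual evaluation script.

```python
# Sketch of per-position KL divergence vs. a full-precision reference.
# The restriction to the reference top-K and the coverage definition are assumptions.
import torch
import torch.nn.functional as F

@torch.no_grad()
def kld_stats(ref_logits, quant_logits, top_k=1000):
    """ref_logits, quant_logits: [num_positions, vocab_size] logits at the same positions."""
    ref_logp = F.log_softmax(ref_logits.float(), dim=-1)
    quant_logp = F.log_softmax(quant_logits.float(), dim=-1)

    # Restrict to the reference model's top-K tokens at each position.
    topk_logp, topk_idx = ref_logp.topk(top_k, dim=-1)
    ref_p = topk_logp.exp()
    quant_topk_logp = quant_logp.gather(-1, topk_idx)

    # KL(ref || quant) summed over the top-K support (assumed definition).
    kld = (ref_p * (topk_logp - quant_topk_logp)).sum(dim=-1)
    # Coverage: reference probability mass captured by its own top-K (assumed definition).
    coverage = ref_p.sum(dim=-1)

    return {
        "median_kld": kld.median().item(),
        "p90_kld": kld.quantile(0.90).item(),
        "mean_kld": kld.mean().item(),
        "mean_coverage": coverage.mean().item(),
    }
```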
## How to use
* This model was tested with the latest `docker.io/vllm/vllm-openai:cu130-nightly` image.
* vLLM is recommended.
* **⚠️ Important Note:** Do NOT use `FLASHINFER` as the attention backend (`--attention-backend FLASHINFER`), as it may cause compatibility issues on some setups.
* Example args (for 2× RTX 3090 users):
```bash
vllm serve ./Qwen3.6-27B-mixed-autoround \
--tensor-parallel-size 2 \
--attention-backend FLASH_ATTN \
--performance-mode interactivity \
--max-model-len auto \
--max-num-batched-tokens 2048 \
--max-num-seqs 1 \
--gpu-memory-utilization 0.96 \
--compilation-config '{"mode":"VLLM_COMPILE","cudagraph_capture_sizes":[4]}' \
-O3 \
--async-scheduling \
--language-model-only \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--speculative-config '{"method":"mtp","num_speculative_tokens":3}' \
--default-chat-template-kwargs.preserve_thinking true \
--mamba-cache-mode all \
--mamba-block-size 8 \
--enable-prefix-caching \
--enable-chunked-prefill
```
* With these settings, the full context length is available.
* Note: This information is based on current understanding and testing. Optimal configurations may vary depending on your specific hardware setup. For further details, please refer to the official vLLM documentation.
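Once the server is running, any OpenAI-compatible client can talk to it. A minimal example with the `openai` Python package is shown below; the port, model name, and prompt are placeholders.

```python
# Minimal OpenAI-compatible client call against the local vLLM server.
# The base_url port and the model name are placeholders; the model name must
# match the path passed to `vllm serve`.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="./Qwen3.6-27B-mixed-autoround",
    messages=[{"role": "user", "content": "Explain mixed-precision quantization in one paragraph."}],
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].message.content)
```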
## Acknowledgements
- [Lorbus](https://huggingface.co/Lorbus) for the README.md format
- [Alibaba / Qwen team](https://huggingface.co/Qwen) for the base [Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) model
- [Intel AutoRound](https://github.com/intel/auto-round) team for the quantization framework
- [vLLM project](https://github.com/vllm-project/vllm) for the inference engine and Qwen3_5 MTP support