---
library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3.6-27B/blob/main/LICENSE
pipeline_tag: image-text-to-text
base_model: Qwen/Qwen3.6-27B
base_model_relation: quantized
tags:
- compressed-tensors
- qwen3_6
- int4
- int8
- mixed
- autoround
---

# Qwen3.6-27B Mixed AutoRound

This is an unofficial quantized version of Qwen3.6-27B.
It was created using [AutoRound](https://github.com/intel/auto-round) with a custom mixed-precision recipe.

## Quantization details

* This model uses a mixed-precision quantization recipe to balance output quality and model size (see the sketch after this list).
* The `self_attn` layers are quantized to 8-bit.
* The MLP layers are generally quantized to 4-bit, but the first 3 and last 3 layers are kept at 8-bit.
* The `lm_head`, `linear_attn`, `visual`, and `mtp.fc` layers are kept unquantized in FP16.
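
The per-layer assignment above can be written as an AutoRound `layer_config` map. The following is a minimal sketch, not the shipped `quantize.py`: the module-name patterns (`model.layers.{i}.self_attn`, `model.layers.{i}.mlp`) and the `bits: 16` convention for leaving a layer unquantized are assumptions about the checkpoint's naming and about recent `auto-round` behavior.

```python
# Illustrative mixed-precision layer map (NOT the exact quantize.py).
def build_layer_config(num_layers: int) -> dict:
    """Per-layer bit-width overrides for AutoRound's `layer_config`."""
    layer_config = {}
    for i in range(num_layers):
        # Attention projections: always 8-bit.
        layer_config[f"model.layers.{i}.self_attn"] = {"bits": 8}
        # MLP blocks: 8-bit in the first and last 3 layers, 4-bit elsewhere.
        mlp_bits = 8 if i < 3 or i >= num_layers - 3 else 4
        layer_config[f"model.layers.{i}.mlp"] = {"bits": mlp_bits}
    # Modules kept in FP16 (bits=16 means "leave unquantized").
    for name in ("lm_head", "linear_attn", "visual", "mtp.fc"):
        layer_config[name] = {"bits": 16}
    return layer_config
```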

| Field | Custom Mixed Recipe |
|------|------|
| Base | `Qwen/Qwen3.6-27B` |
| Method | AutoRound (`intel/auto-round`), **custom** recipe |
| Scheme | Mixed (W4A16 / W8A16) |
| Bits | 4 & 8 |
| Group size | 128 |
| Symmetric | yes |
| Unquantized layers | `lm_head`, `linear_attn`, `visual`, `mtp.fc` |
| Calibration dataset | `NeelNanda/pile-10k` |
| Calibration samples | 512 |
| Sequence length | 2048 |
| Iterations | 1000 |
| Batch size | 8 |
| torch.compile | enabled |

* For more information, see `quantize.py`.
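
As a companion to the table, here is a minimal sketch of how these hyperparameters map onto the AutoRound Python API, reusing the hypothetical `build_layer_config` helper from the sketch above. Argument names follow recent `auto-round` releases and may differ between versions; the actual `quantize.py` may differ, e.g. in how it exports the compressed-tensors checkpoint.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

MODEL_ID = "Qwen/Qwen3.6-27B"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

autoround = AutoRound(
    model,
    tokenizer,
    bits=4,              # default weight bits; overridden per layer below
    group_size=128,
    sym=True,
    dataset="NeelNanda/pile-10k",
    nsamples=512,
    seqlen=2048,
    iters=1000,
    batch_size=8,
    enable_torch_compile=True,
    layer_config=build_layer_config(model.config.num_hidden_layers),
)
autoround.quantize()
# The published checkpoint ships compressed-tensors; the exact export
# format string depends on the auto-round version in use.
autoround.save_quantized("./Qwen3.6-27B-mixed-autoround", format="auto_round")
```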

### KLD Metrics

| Metric | Value | Description |
| :--- | :--- | :--- |
| **Median KLD** | 0.005592 | Median divergence |
| **P90 KLD** | 0.034514 | Divergence at the 90th percentile |
| **Mean KLD** | 0.046941 | Average divergence |
| **Mean Coverage** | 0.994750 | - |

### Evaluation Configuration

| Parameter | Value |
| :--- | :--- |
| **Evaluation Dataset** | wikitext-2-raw-v1 (test) |
| **Sequence Length** | 2048 |
| **Num Samples** | 64 |
| **Total Positions** | 131,008 |
| **Top-K Reference** | 1000 |
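
The statistics above compare the quantized model's next-token distributions against the FP16 reference on identical inputs. The evaluation script is not published with this card, so the sketch below shows one plausible way to compute them; in particular, the coverage definition (probability mass the quantized model assigns to the reference's top-K tokens) is an assumption.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def kld_stats(ref_logits: torch.Tensor, quant_logits: torch.Tensor, top_k: int = 1000) -> dict:
    """KL(ref || quant) statistics plus a top-K coverage score.

    `ref_logits` / `quant_logits`: [positions, vocab_size] tensors collected
    from the FP16 reference and the quantized model on the same inputs
    (here: 64 samples of wikitext-2-raw-v1 test at sequence length 2048).
    """
    ref_logp = F.log_softmax(ref_logits.float(), dim=-1)
    quant_logp = F.log_softmax(quant_logits.float(), dim=-1)
    # KL(ref || quant), one value per token position.
    kld = (ref_logp.exp() * (ref_logp - quant_logp)).sum(dim=-1)

    # Assumed coverage: probability mass the quantized model assigns to the
    # reference model's top-K tokens, averaged over all positions.
    topk_idx = ref_logits.topk(top_k, dim=-1).indices
    coverage = quant_logp.exp().gather(-1, topk_idx).sum(dim=-1)

    return {
        "median_kld": kld.median().item(),
        "p90_kld": kld.quantile(0.9).item(),
        "mean_kld": kld.mean().item(),
        "mean_coverage": coverage.mean().item(),
    }
```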

## How to use

* This model was tested on the latest `docker.io/vllm/vllm-openai:cu130-nightly` image.
* vLLM is the recommended inference engine.
* **⚠️ Important note:** Do NOT use `FLASHINFER` as the attention backend (`--attention-backend FLASHINFER`), as it can cause compatibility issues on some setups.
* Example args (for 2× RTX 3090 users):
```bash
vllm serve ./Qwen3.6-27B-mixed-autoround \
  --tensor-parallel-size 2 \
  --attention-backend FLASH_ATTN \
  --performance-mode interactivity \
  --max-model-len auto \
  --max-num-batched-tokens 2048 \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.96 \
  --compilation-config '{"mode":"VLLM_COMPILE","cudagraph_capture_sizes":[4]}' \
  -O3 \
  --async-scheduling \
  --language-model-only \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}' \
  --default-chat-template-kwargs.preserve_thinking true \
  --mamba-cache-mode all \
  --mamba-block-size 8 \
  --enable-prefix-caching \
  --enable-chunked-prefill
```
* With these settings, the full context length is available.
* Note: this is based on current understanding and testing; optimal configurations vary with hardware. For further details, refer to the official vLLM documentation. A minimal client-side example follows below.
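
Once the server is up, it exposes vLLM's OpenAI-compatible API. Below is a minimal sketch using the `openai` Python client; the port is vLLM's default, and the model name matches the local path passed to `vllm serve` (override it with `--served-model-name` if you prefer).

```python
from openai import OpenAI

# vLLM serves an OpenAI-compatible API on port 8000 by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="./Qwen3.6-27B-mixed-autoround",  # defaults to the path given to `vllm serve`
    messages=[{"role": "user", "content": "Explain mixed-precision quantization in two sentences."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```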

## Acknowledgements

- [Lorbus](https://huggingface.co/Lorbus) for the README.md format
- [Alibaba / Qwen team](https://huggingface.co/Qwen) for the base [Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) model
- [Intel AutoRound](https://github.com/intel/auto-round) team for the quantization framework
- [vLLM project](https://github.com/vllm-project/vllm) for the inference engine and Qwen3_5 MTP support