---
library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3.6-27B/blob/main/LICENSE
pipeline_tag: image-text-to-text
base_model: Qwen/Qwen3.6-27B
base_model_relation: quantized
tags:
- compressed-tensors
- qwen3_6
- int4
- int8
- mixed
- autoround
---

# Qwen3.6-27B Mixed AutoRound

This is an unofficial quantized version of Qwen3.6-27B. It was created using [AutoRound](https://github.com/intel/auto-round) with a custom mixed-precision recipe.

## Quantization details

* This model uses a mixed-precision quantization recipe to balance quality and model size.
* The `self_attn` layers are quantized to 8-bit.
* The MLP layers are generally quantized to 4-bit, but the first 3 and last 3 layers are kept at 8-bit.
* The `lm_head`, `linear_attn`, `visual`, and `mtp.fc` layers are kept unquantized in FP16.

| Field | Custom Mixed Recipe |
|------|------|
| Base | `Qwen/Qwen3.6-27B` |
| Method | AutoRound (`intel/auto-round`), **custom** recipe |
| Scheme | Mixed (W4A16 / W8A16) |
| Bits | 4 & 8 |
| Group size | 128 |
| Symmetric | yes |
| Unquantized layers | `lm_head`, `linear_attn`, `visual`, `mtp.fc` |
| Calibration dataset | `NeelNanda/pile-10k` |
| Calibration samples | 512 |
| Sequence length | 2048 |
| Iterations | 1000 |
| Batch size | 8 |
| torch.compile | enabled |

* For more information, please check `quantize.py` (an illustrative sketch of such a recipe appears near the end of this card).

### KLD Metrics

| Metric | Value | Description |
| :--- | :--- | :--- |
| **Median KLD** | 0.005592 | Median divergence |
| **P90 KLD** | 0.034514 | Divergence at the 90th percentile |
| **Mean KLD** | 0.046941 | Average divergence |
| **Mean Coverage** | 0.994750 | - |

### Evaluation Configuration

| Parameter | Value |
| :--- | :--- |
| **Evaluation Dataset** | wikitext-2-raw-v1 (test) |
| **Sequence Length** | 2048 |
| **Num Samples** | 64 |
| **Total Positions** | 131,008 |
| **Top-K Reference** | 1000 |

(A sketch of how such per-position KLD and coverage statistics can be computed appears near the end of this card.)

## How to use

* This model was tested with the latest `docker.io/vllm/vllm-openai:cu130-nightly` image.
* vLLM is recommended.
* **⚠️ Important note:** Do NOT use `FLASHINFER` as the attention backend (`--attention-backend FLASHINFER`), as it can cause compatibility issues on some setups.
* Example args (for 2x RTX 3090 users):

```bash
vllm serve ./Qwen3.6-27B-mixed-autoround \
  --tensor-parallel-size 2 \
  --attention-backend FLASH_ATTN \
  --performance-mode interactivity \
  --max-model-len auto \
  --max-num-batched-tokens 2048 \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.96 \
  --compilation-config '{"mode":"VLLM_COMPILE","cudagraph_capture_sizes":[4]}' \
  -O3 \
  --async-scheduling \
  --language-model-only \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}' \
  --default-chat-template-kwargs.preserve_thinking true \
  --mamba-cache-mode all \
  --mamba-block-size 8 \
  --enable-prefix-caching \
  --enable-chunked-prefill
```

* With these settings, the full context length fits in memory.
* Note: This information is based on current understanding and testing. Optimal configurations may vary depending on your specific hardware setup. For further details, please refer to the official vLLM documentation.
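Once the server is up, it exposes an OpenAI-compatible API. The snippet below is a minimal sketch using the `openai` Python client; the port (vLLM's default 8000), the served model name, and the prompt are assumptions, so adjust them to your deployment.

```python
# Minimal sketch: querying the vLLM OpenAI-compatible endpoint started above.
# Assumptions: server listens on localhost:8000 (vLLM default) and was launched
# with `vllm serve ./Qwen3.6-27B-mixed-autoround ...` as in the example args.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="./Qwen3.6-27B-mixed-autoround",  # must match the served model path/name
    messages=[
        {"role": "user", "content": "Summarize mixed-precision quantization in two sentences."},
    ],
    max_tokens=256,
    temperature=0.7,
)

print(response.choices[0].message.content)
```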
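For reference, the mixed-precision recipe described under *Quantization details* could be expressed with the AutoRound Python API roughly as sketched below. This is an illustration only, not the actual `quantize.py`: the Qwen-style module names, the `layer_config` keys, and the save format are assumptions, so check the AutoRound documentation and the bundled script for the exact call.

```python
# Illustrative sketch only -- not the quantize.py shipped with this repo.
# Assumes the AutoRound Python API and typical Qwen-style module names
# ("model.layers.{i}.self_attn", "...mlp"); verify both before use.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen3.6-27B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

num_layers = model.config.num_hidden_layers
layer_config = {}
for i in range(num_layers):
    # Attention projections: 8-bit in every layer.
    layer_config[f"model.layers.{i}.self_attn"] = {"bits": 8}
    # MLP: 8-bit for the first 3 and last 3 layers, 4-bit elsewhere.
    mlp_bits = 8 if i < 3 or i >= num_layers - 3 else 4
    layer_config[f"model.layers.{i}.mlp"] = {"bits": mlp_bits}

# Keep these modules unquantized (FP16), as listed in the recipe table.
for name in ("lm_head", "linear_attn", "visual", "mtp.fc"):
    layer_config[name] = {"bits": 16}

autoround = AutoRound(
    model,
    tokenizer,
    bits=4,                      # default precision for layers not listed above
    group_size=128,
    sym=True,
    dataset="NeelNanda/pile-10k",
    nsamples=512,
    seqlen=2048,
    iters=1000,
    batch_size=8,
    layer_config=layer_config,
    enable_torch_compile=True,   # torch.compile enabled, per the recipe table
)
autoround.quantize_and_save("./Qwen3.6-27B-mixed-autoround", format="auto_round")
```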
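The KLD metrics above compare the quantized model's next-token distributions against the full-precision reference over the evaluation text. The exact evaluation script is not included here, so the sketch below only illustrates one plausible way to derive such statistics from two sets of logits; the top-k "coverage" interpretation is an assumption.

```python
# Minimal sketch: per-position KL divergence between a full-precision reference
# model and its quantized counterpart. The top-k "coverage" definition here is
# an assumption about what Mean Coverage in the table above measures.
import torch
import torch.nn.functional as F

def kld_stats(ref_logits: torch.Tensor, quant_logits: torch.Tensor, top_k: int = 1000):
    """ref_logits / quant_logits: [num_positions, vocab_size]."""
    ref_logprobs = F.log_softmax(ref_logits.float(), dim=-1)
    quant_logprobs = F.log_softmax(quant_logits.float(), dim=-1)

    # KL(reference || quantized) per position, summed over the vocabulary.
    kld = (ref_logprobs.exp() * (ref_logprobs - quant_logprobs)).sum(dim=-1)

    # Coverage: reference probability mass captured by the top-k reference tokens.
    coverage = ref_logprobs.exp().topk(top_k, dim=-1).values.sum(dim=-1)

    return {
        "mean_kld": kld.mean().item(),
        "median_kld": kld.median().item(),
        "p90_kld": kld.quantile(0.9).item(),
        "mean_coverage": coverage.mean().item(),
    }
```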
## Acknowledgements

- [Lorbus](https://huggingface.co/Lorbus) for the README.md format
- [Alibaba / Qwen team](https://huggingface.co/Qwen) for the base [Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) model
- [Intel AutoRound](https://github.com/intel/auto-round) team for the quantization framework
- [vLLM project](https://github.com/vllm-project/vllm) for the inference engine and Qwen3_5 MTP support