---
library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3.6-27B/blob/main/LICENSE
pipeline_tag: image-text-to-text
base_model: Qwen/Qwen3.6-27B
base_model_relation: quantized
tags:
- compressed-tensors
- qwen3_6
- int8
- autoround
---

# Qwen3.6-27B INT8 AutoRound

This is an unofficial INT8 quantized version of [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B).
It was created using [AutoRound](https://github.com/intel/auto-round).

## Available versions

* There are two versions, one per branch.
* The main branch is slightly smaller: it also quantizes `self_attn` and disables grouping (`group_size = -1`), at a small cost to output quality.
* For users with 48 GB of VRAM, the main branch is recommended. If you have more than that, the gs128 branch might be better, although the difference in practical use is minimal. (See the snippet below for selecting a branch.)

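Both versions live in this repository as git branches. To use the gs128 variant, download that revision explicitly; a minimal sketch with `huggingface_hub` is shown below (the repository id is a placeholder, since it depends on where this card is hosted):

```python
# Sketch: fetch a specific branch (revision) of this repository.
# REPO_ID is a placeholder; replace it with the actual id of this model repo.
from huggingface_hub import snapshot_download

REPO_ID = "your-namespace/Qwen3.6-27B-INT8-AutoRound"  # placeholder

local_dir = snapshot_download(
    repo_id=REPO_ID,
    revision="gs128",  # branch name; omit this argument for the main branch
)
print(local_dir)  # point `vllm serve` at this path
```
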
## Quantization details

| Field | Main branch | gs128 branch |
|------|------|------|
| Base | `Qwen/Qwen3.6-27B` | `Qwen/Qwen3.6-27B` |
| Method | AutoRound (`intel/auto-round`), **custom** recipe | AutoRound (`intel/auto-round`), default recipe |
| Scheme | W8A16 | W8A16 |
| Bits | 8 | 8 |
| Group size | **-1** | 128 |
| Symmetric | yes | yes |
| Unquantized layers | `visual`, `mtp`, `linear_attn`, `embed_tokens`, `lm_head` | `visual`, `mtp`, <code><strong>self_attn</strong></code>, `linear_attn`, `embed_tokens`, `lm_head` |
| Calibration samples | 128 | 128 |
| Iterations | **1000** | 200 |
| Batch size | 8 | 8 |
| torch.compile | enabled | enabled |
| Size | **36.8 GB** | **38.8 GB** |
| GPU used for quant | 2× RTX 3090 | 2× RTX 3090 |

* For more information, please check `quantize.py`; a rough sketch of the recipe is shown below.

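The exact recipe is in `quantize.py`; the following is only a rough, hypothetical sketch of the main-branch setup using the `auto-round` Python API. Keyword names, the export format string, and the mechanism for excluding layers are assumptions and may differ from the script and the auto-round version actually used.

```python
# Hypothetical sketch of the main-branch recipe; NOT the exact quantize.py.
# Kwarg names follow the intel/auto-round Python API and may vary by version.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

base = "Qwen/Qwen3.6-27B"
# Loading shown with AutoModelForCausalLM for brevity; the multimodal variant
# may need a different auto class or auto-round's multimodal path.
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(base)

autoround = AutoRound(
    model,
    tokenizer,
    bits=8,         # INT8 weights (W8A16 scheme)
    group_size=-1,  # per-channel scales on the main branch (128 on the gs128 branch)
    sym=True,       # symmetric quantization
    iters=1000,     # tuning iterations (200 on the gs128 branch)
    nsamples=128,   # calibration samples
    batch_size=8,
)
# visual, mtp, linear_attn, embed_tokens and lm_head stay unquantized; this is
# set through auto-round's layer-config / fp-layers option (version-dependent).
autoround.quantize_and_save("./Qwen3.6-27B-INT8-AutoRound", format="auto_round")
```
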
## Evaluation Results (KLD)

Lower values indicate less degradation caused by quantization.
The main branch was used for this evaluation.

### KLD Metrics
| Metric | Value | Description |
| :--- | :--- | :--- |
| **Median KLD** | 0.000621 | Median divergence |
| **P90 KLD** | 0.002607 | Divergence at the 90th percentile |
| **Mean KLD** | 0.009185 | Average divergence |
| **Mean Coverage** | 0.998240 | - |

### Evaluation Configuration
| Parameter | Value |
| :--- | :--- |
| **Calibration Dataset** | wikitext-2-raw-v1 (test) |
| **Sequence Length** | 2048 |
| **Num Samples** | 64 |
| **Total Positions** | 131,008 |
| **Top-K Reference** | 1000 |

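The snippet below is a simplified, hypothetical sketch of how per-token KLD between the BF16 base model and this quantized model could be measured on wikitext-2. It uses the full vocabulary rather than the top-1000 reference used above, so it will not reproduce the exact figures, and loading both 27B models at once is memory-hungry.

```python
# Simplified sketch: per-token KL divergence between base and quantized model.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "Qwen/Qwen3.6-27B"
QUANT = "./Qwen3.6-27B-INT8-AutoRound"  # local path to this repo
SEQ_LEN, NUM_SAMPLES = 2048, 64

tok = AutoTokenizer.from_pretrained(BASE)
ref = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype="auto", device_map="auto").eval()
qnt = AutoModelForCausalLM.from_pretrained(QUANT, torch_dtype="auto", device_map="auto").eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids[0]

klds = []
with torch.no_grad():
    for i in range(NUM_SAMPLES):
        chunk = ids[i * SEQ_LEN:(i + 1) * SEQ_LEN].unsqueeze(0)
        p = torch.log_softmax(ref(chunk.to(ref.device)).logits.float(), dim=-1)
        q = torch.log_softmax(qnt(chunk.to(qnt.device)).logits.float(), dim=-1).to(p.device)
        # KL(p || q) per position: sum_v p(v) * (log p(v) - log q(v))
        klds.append((p.exp() * (p - q)).sum(-1).flatten().cpu())

kld = torch.cat(klds)
print(f"mean {kld.mean():.6f}  median {kld.median():.6f}  p90 {kld.quantile(0.9):.6f}")
```
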
## Quantization log

* Please check `log.txt`.

## How to use

* This model was tested with the latest `docker.io/vllm/vllm-openai:cu130-nightly` image.
* vLLM is recommended.

* Example args (for 2× RTX 3090 users):
```
vllm serve ./Qwen3.6-27B-INT8-AutoRound \
  --tensor-parallel-size 2 \
  --attention-backend FLASHINFER \
  --performance-mode interactivity \
  --max-model-len auto \
  --max-num-batched-tokens 2048 \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.932 \
  --compilation-config '{"mode":"VLLM_COMPILE","cudagraph_capture_sizes":[3]}' \
  -O3 \
  --async-scheduling \
  --language-model-only \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
  --default-chat-template-kwargs.preserve_thinking true \
  --mamba-cache-mode all \
  --mamba-block-size 8 \
  --enable-prefix-caching \
  --enable-chunked-prefill
```
* With these settings, you get around 129k tokens of context. You can also add `--kv-cache-dtype fp8_e4m3 --calculate-kv-scales` to get about 252k tokens.
* You can add `--enforce-eager` (you might need to remove `--compilation-config`) or set the `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False` environment variable (requires `--disable-custom-all-reduce`) to allocate more VRAM to the KV cache, but tokens/s will be noticeably lower.
* Remove `--speculative-config` if you really want more context, but I highly recommend keeping it.
* Note: this information is based on my current understanding and testing. Optimal configurations may vary depending on your specific hardware setup. For further details, please refer to the official vLLM documentation.

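Once the server is running, it exposes an OpenAI-compatible API (port 8000 by default). A minimal client sketch; note that the model name defaults to the path passed to `vllm serve` unless you set `--served-model-name`:

```python
# Minimal sketch: query the vLLM OpenAI-compatible server started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="./Qwen3.6-27B-INT8-AutoRound",  # or whatever --served-model-name you set
    messages=[{"role": "user", "content": "Give me a one-paragraph summary of INT8 W8A16 quantization."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```
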
## Acknowledgements
- [Lorbus](https://huggingface.co/Lorbus) for the README.md format
- [Alibaba / Qwen team](https://huggingface.co/Qwen) for the base [Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) model
- [Intel AutoRound](https://github.com/intel/auto-round) team for the quantization framework
- [vLLM project](https://github.com/vllm-project/vllm) for the inference engine and Qwen3_5 MTP support