---
library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3.6-27B/blob/main/LICENSE
pipeline_tag: image-text-to-text
base_model: Qwen/Qwen3.6-27B
base_model_relation: quantized
tags:
- compressed-tensors
- qwen3_6
- int8
- autoround
---

# Qwen3.6-27B INT8 AutoRound

This is an unofficial INT8 quantized version of Qwen3.6-27B, created with [AutoRound](https://github.com/intel/auto-round).

## Available versions

* There are two versions.
* The main branch is slightly smaller: it also quantizes `self_attn` and disables group size (`group_size = -1`), at a small cost in model quality.
* For users with 48 GB of VRAM, the main branch is recommended. If you have more VRAM than that, the `gs128` branch may be a better choice, although the difference in practical use is minimal.

## Quantization details

| Field | Main branch | gs128 branch |
|------|------|------|
| Base | `Qwen/Qwen3.6-27B` | `Qwen/Qwen3.6-27B` |
| Method | AutoRound (`intel/auto-round`), **custom** recipe | AutoRound (`intel/auto-round`), default recipe |
| Scheme | W8A16 | W8A16 |
| Bits | 8 | 8 |
| Group size | **-1** | 128 |
| Symmetric | yes | yes |
| Unquantized layers | `visual`, `mtp`, `linear_attn`, `embed_tokens`, `lm_head` | `visual`, `mtp`, `self_attn`, `linear_attn`, `embed_tokens`, `lm_head` |
| Calibration samples | 128 | 128 |
| Iterations | **1000** | 200 |
| Batch size | 8 | 8 |
| torch.compile | enabled | enabled |
| Size | **36.8 GB** | **38.8 GB** |
| GPUs used for quantization | 2× RTX 3090 | 2× RTX 3090 |

* For more information, please check `quantize.py` in this repository.

## Evaluation Results (KLD)

Lower values indicate less degradation caused by quantization. The main branch was used for this evaluation.

### KLD Metrics

| Metric | Value | Description |
| :--- | :--- | :--- |
| **Median KLD** | 0.000621 | Median divergence |
| **P90 KLD** | 0.002607 | Divergence at the 90th percentile |
| **Mean KLD** | 0.009185 | Average divergence |
| **Mean Coverage** | 0.998240 | - |

### Evaluation Configuration

| Parameter | Value |
| :--- | :--- |
| **Calibration Dataset** | wikitext-2-raw-v1 (test) |
| **Sequence Length** | 2048 |
| **Num Samples** | 64 |
| **Total Positions** | 131,008 |
| **Top-K Reference** | 1000 |

## Quantization log

* Please check `log.txt` in this repository.

## How to use

* This model was tested on the latest `docker.io/vllm/vllm-openai:cu130-nightly` image.
* vLLM is recommended; an example server command and a minimal client call are shown below.
* Example args (for 2× RTX 3090 users):

```
vllm serve ./Qwen3.6-27B-INT8-AutoRound \
    --tensor-parallel-size 2 \
    --attention-backend FLASHINFER \
    --performance-mode interactivity \
    --max-model-len auto \
    --max-num-batched-tokens 2048 \
    --max-num-seqs 1 \
    --gpu-memory-utilization 0.932 \
    --compilation-config '{"mode":"VLLM_COMPILE","cudagraph_capture_sizes":[3]}' \
    -O3 \
    --async-scheduling \
    --language-model-only \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
    --default-chat-template-kwargs.preserve_thinking true \
    --mamba-cache-mode all \
    --mamba-block-size 8 \
    --enable-prefix-caching \
    --enable-chunked-prefill
```

* With these settings, you get around 129k tokens of context. Adding `--kv-cache-dtype fp8_e4m3 --calculate-kv-scales` raises this to about 252k tokens.
* You can add `--enforce-eager` (you might need to remove `--compilation-config`) or set the `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False` environment variable (requires `--disable-custom-all-reduce`) to allocate more VRAM to the KV cache, but the tok/s will be noticeably lower.
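* Once the server is up, any OpenAI-compatible client can talk to it. The sketch below is an illustration, not part of the official instructions: the `base_url` assumes vLLM's default port 8000 on localhost, and the `model` name assumes the path passed to `vllm serve` above.

```python
from openai import OpenAI

# Assumes the `vllm serve` command above is running locally on the default port 8000.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    # The model name must match the path/name given to `vllm serve`.
    model="./Qwen3.6-27B-INT8-AutoRound",
    messages=[{"role": "user", "content": "Summarize what INT8 weight-only quantization changes."}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```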
* Remove `--speculative-config` if you really want more context, but I highly recommend keeping it.
* Note: This information is based on my current understanding and testing. Optimal configurations may vary depending on your specific hardware setup. For further details, please refer to the official vLLM documentation.

## Acknowledgements

- [Lorbus](https://huggingface.co/Lorbus) for the README.md format
- [Alibaba / Qwen team](https://huggingface.co/Qwen) for the base [Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) model
- [Intel AutoRound](https://github.com/intel/auto-round) team for the quantization framework
- [vLLM project](https://github.com/vllm-project/vllm) for the inference engine and Qwen3_5 MTP support
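## Appendix: quantization sketch

The settings in the "Quantization details" table roughly correspond to an AutoRound call like the one below. This is a minimal sketch, not the `quantize.py` shipped in this repository; argument names follow the `intel/auto-round` Python API and may differ between versions, and the layer-skipping and export-format details are assumptions.

```python
# Rough sketch of the main-branch recipe (see the "Quantization details" table).
# NOT the actual quantize.py from this repo; refer to that file for the real recipe.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen3.6-27B"
# A multimodal model class may be required instead; AutoModelForCausalLM is an assumption.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

autoround = AutoRound(
    model,
    tokenizer,
    bits=8,           # W8A16 weight-only quantization
    group_size=-1,    # main branch: no grouping (gs128 branch uses 128)
    sym=True,
    iters=1000,       # main branch iterations (gs128 branch used 200)
    nsamples=128,     # calibration samples
    batch_size=8,
)
# The real script keeps visual / mtp / linear_attn / embed_tokens / lm_head unquantized
# and enables torch.compile; the exact flags for both are version-dependent, so they are
# omitted here. The export format below is also an assumption.
autoround.quantize()
autoround.save_quantized("./Qwen3.6-27B-INT8-AutoRound", format="auto_round")
```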