---
library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3.6-27B/blob/main/LICENSE
pipeline_tag: image-text-to-text
base_model: Qwen/Qwen3.6-27B
base_model_relation: quantized
tags:
- compressed-tensors
- qwen3_6
- int8
- autoround
---
# Qwen3.6-27B INT8 AutoRound
This is an unofficial INT8 quantized version of Qwen3.6-27B.
It was created using [AutoRound](https://github.com/intel/auto-round).
## Available versions
* There are two versions: the main branch and the gs128 branch.
* The main branch is slightly smaller because it also quantizes `self_attn` and disables grouping (`group_size = -1`), at a small cost to model quality.
* For users with 48 GB of VRAM, the main branch is recommended. If you have more VRAM than that, the gs128 branch might be better, although the difference in practical use is minimal.
## Quantization details
| Field | Main branch | gs128 branch |
|------|------|------|
| Base | `Qwen/Qwen3.6-27B` | `Qwen/Qwen3.6-27B` |
| Method | AutoRound (`intel/auto-round`), **custom** recipe | AutoRound (`intel/auto-round`), default recipe |
| Scheme | W8A16 | W8A16 |
| Bits | 8 | 8 |
| Group size | **-1** | 128 |
| Symmetric | yes | yes |
| Unquantized layers | `visual`, `mtp`, `linear_attn`, `embed_tokens`, `lm_head` | `visual`, `mtp`, <code><strong>self_attn</strong></code>, `linear_attn`, `embed_tokens`, `lm_head` |
| Calibration samples | 128 | 128 |
| Iterations | **1000** | 200 |
| Batch size | 8 | 8 |
| torch.compile | enabled | enabled |
| Size | **36.8GB** | **38.8GB** |
| GPU used for quant | 2× RTX 3090 | 2× RTX 3090 |
* For more details, see `quantize.py` in this repository; a rough sketch of the recipe is shown below.
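As an illustration only (this is not the contents of `quantize.py`), a main-branch-style AutoRound run with the settings from the table might look roughly like the following. The model-loading class is an assumption for this architecture, and the mechanism for excluding `visual`, `mtp`, `linear_attn`, `embed_tokens`, and `lm_head` is handled in `quantize.py` and varies between auto-round versions, so it is only noted in a comment here.

```python
# Minimal sketch of a main-branch-style AutoRound run (illustrative, not quantize.py itself).
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen3.6-27B"
# Assumption: the causal-LM Auto class; a multimodal checkpoint may need a different class.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

autoround = AutoRound(
    model,
    tokenizer,
    bits=8,          # INT8 weights (W8A16 scheme)
    group_size=-1,   # main branch: no grouping (per-channel scales)
    sym=True,
    iters=1000,      # main branch uses 1000 tuning iterations
    nsamples=128,    # 128 calibration samples
    batch_size=8,
)
# Excluded layers (visual, mtp, linear_attn, embed_tokens, lm_head) are configured
# in quantize.py; the per-layer override API differs between auto-round versions.
autoround.quantize_and_save("./Qwen3.6-27B-INT8-AutoRound", format="auto_round")
```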
## Evaluation Results (KLD)
Lower values indicate less degradation caused by quantization.
Main branch is used for the evaluation.
### KLD Metrics
| Metric | Value | Description |
| :--- | :--- | :--- |
| **Median KLD** | 0.000621 | Median divergence |
| **P90 KLD** | 0.002607 | Divergence at the 90th percentile |
| **Mean KLD** | 0.009185 | Average divergence |
| **Mean Coverage** | 0.998240 | - |
### Evaluation Configuration
| Parameter | Value |
| :--- | :--- |
| **Calibration Dataset** | wikitext-2-raw-v1 (test) |
| **Sequence Length** | 2048 |
| **Num Samples** | 64 |
| **Total Positions** | 131,008 |
| **Top-K Reference** | 1000 |
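As a rough sketch of what a top-K KLD evaluation like this involves (this is not the actual evaluation script, and the exact definitions of KLD and coverage used above may differ), the divergence can be computed per token position from the reference model's top-K probabilities and the quantized model's probabilities at the same token ids:

```python
# Illustrative top-K KLD between a full-precision reference and a quantized model.
# ref_logits / quant_logits are hypothetical (num_positions, vocab_size) tensors
# collected from the two models on the same inputs.
import torch
import torch.nn.functional as F

def topk_kld(ref_logits: torch.Tensor, quant_logits: torch.Tensor, k: int = 1000):
    ref_probs = F.softmax(ref_logits.float(), dim=-1)
    quant_logprobs = F.log_softmax(quant_logits.float(), dim=-1)

    # Reference top-K tokens and the quantized model's log-probs at the same ids.
    topk_p, topk_idx = ref_probs.topk(k, dim=-1)
    topk_q_logp = quant_logprobs.gather(-1, topk_idx)

    coverage = topk_p.sum(dim=-1)  # probability mass captured by the top-K tokens

    # KL(ref || quant) restricted to the top-K support, with both sides renormalized.
    p = topk_p / topk_p.sum(dim=-1, keepdim=True)
    q_logp = topk_q_logp - torch.logsumexp(topk_q_logp, dim=-1, keepdim=True)
    kld = (p * (p.clamp_min(1e-12).log() - q_logp)).sum(dim=-1)
    return kld, coverage  # per-position values; take the median / P90 / mean for the table
```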
## Quantization log
* See `log.txt` in this repository.
## How to use
* This model was tested with the latest `docker.io/vllm/vllm-openai:cu130-nightly` image.
* vLLM is recommended for inference.
* Example arguments (for 2× RTX 3090 users):
```
vllm serve ./Qwen3.6-27B-INT8-AutoRound \
--tensor-parallel-size 2 \
--attention-backend FLASHINFER \
--performance-mode interactivity \
--max-model-len auto \
--max-num-batched-tokens 2048 \
--max-num-seqs 1 \
--gpu-memory-utilization 0.932 \
--compilation-config '{"mode":"VLLM_COMPILE","cudagraph_capture_sizes":[3]}' \
-O3 \
--async-scheduling \
--language-model-only \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
--default-chat-template-kwargs.preserve_thinking true \
--mamba-cache-mode all \
--mamba-block-size 8 \
--enable-prefix-caching \
--enable-chunked-prefill
```
* With these settings, you get around 129k tokens of context. You can also add `--kv-cache-dtype fp8_e4m3 --calculate-kv-scales` to get about 252k tokens.
* You can add `--enforce-eager` (you might need to remove `--compilation-config`) or set the `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False` environment variable (requires `--disable-custom-all-reduce`) to allocate more VRAM to the KV cache, but throughput (tokens/s) will be noticeably lower.
* Remove `--speculative-config` if you really want more context, but I highly recommend keeping it.
* Note: This information is based on my current understanding and testing. Optimal configurations may vary depending on your specific hardware setup. For further details, please refer to the official vLLM documentation.
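Once the server is up, any OpenAI-compatible client can talk to it. A minimal example, assuming the default vLLM port 8000 and the model path used in the `vllm serve` command above:

```python
# Query the locally served model via vLLM's OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="./Qwen3.6-27B-INT8-AutoRound",  # must match the path/name passed to `vllm serve`
    messages=[{"role": "user", "content": "Summarize INT8 weight-only quantization in one sentence."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```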
## Acknowledgements
- [Lorbus](https://huggingface.co/Lorbus) for the README.md format
- [Alibaba / Qwen team](https://huggingface.co/Qwen) for the base [Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) model
- [Intel AutoRound](https://github.com/intel/auto-round) team for the quantization framework
- [vLLM project](https://github.com/vllm-project/vllm) for the inference engine and Qwen3_5 MTP support