---
license: apache-2.0
base_model:
- Qwen/Qwen3.6-27B
---
## Model details


This is an INT4-quantized version of the original **Qwen3.6-27B** model.


Quantization was performed with **AutoRound** using a calibration dataset derived from Pile-10k. The calibration set consists of 256 randomly selected samples from diverse categories, with a sequence length of 2048 and 300 optimization iterations.


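For intuition about what INT4 quantization stores, here is a toy, self-contained sketch of symmetric per-group round-to-nearest quantization. It is illustrative only: AutoRound learns rounding offsets and clipping ranges via signed gradient descent rather than plain rounding, and the group size and example weights below are arbitrary assumptions.

```python
# Toy symmetric per-group INT4 quantization (round-to-nearest).
# Illustrative only: AutoRound optimizes the rounding instead of
# rounding naively, but the storage idea is the same.

def quantize_group(weights, bits=4):
    """Quantize one group of float weights to signed INT4 codes."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for INT4
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize_group(q, scale):
    """Reconstruct approximate float weights from the codes."""
    return [v * scale for v in q]

group = [0.31, -0.72, 0.05, 0.64]
q, scale = quantize_group(group)
print(q)                         # integer codes in [-8, 7]
print(dequantize_group(q, scale))  # reconstruction close to the originals
```

Each group stores only the 4-bit codes plus one float scale, which is where the memory savings over BF16 weights come from.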
The launch parameters below were tested on a headless RTX 3090 with the latest vLLM nightly build. MTP (multi-token prediction) speculative decoding and prefix caching are confirmed to work.


## vLLM launch example


```bash
vllm serve acyildirimer/Qwen3.6-27B-int4-AutoRound \
  --host 0.0.0.0 \
  --port 8000 \
  --dtype auto \
  --kv-cache-dtype auto \
  --max-model-len auto \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --gpu-memory-utilization 0.96 \
  --enable-prefix-caching \
  --max-num-seqs 1 \
  --served-model-name Qwen3.6-27B \
  --language-model-only \
  --performance-mode interactivity \
  --attention-backend auto \
  --max-num-batched-tokens 4096 \
  --generation-config auto \
  --override-generation-config '{"temperature":0.6,"top_p":0.95,"top_k":20,"min_p":0.0,"repetition_penalty":1.0}' \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 2}'
```
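Once the server is up, it exposes the standard OpenAI-compatible `/v1/chat/completions` endpoint under the name set by `--served-model-name`. A minimal request sketch follows; the host, port, and prompt are assumptions matching the launch example above, and the actual send is commented out so the snippet stands on its own without a running server:

```python
import json
import urllib.request

def build_chat_request(prompt, host="localhost", port=8000):
    """Build the URL and payload for an OpenAI-compatible chat request."""
    url = f"http://{host}:{port}/v1/chat/completions"
    payload = {
        # Must match --served-model-name from the launch command.
        "model": "Qwen3.6-27B",
        "messages": [{"role": "user", "content": prompt}],
        # Sampling params are omitted here, so the server-side defaults
        # from --override-generation-config apply.
    }
    return url, payload

url, payload = build_chat_request("Write a haiku about GPUs.")
req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# resp = urllib.request.urlopen(req)  # requires the server to be running
# print(json.load(resp)["choices"][0]["message"]["content"])
```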


## Compatibility


- The launch command was tested with the latest vLLM nightly build.
- It should also work with the stable vLLM release, provided that the selected KV cache type is supported. If needed, use a supported `--kv-cache-dtype` value or set it to `auto`.
- `turboquant` does not currently work on Ampere GPUs, so it is not enabled in the launch example above.


## Troubleshooting


- If you encounter out-of-memory errors, reduce `--gpu-memory-utilization`. For example, try `0.94`, `0.92`, or lower, depending on your available VRAM and workload.
- In some runs, vLLM may miscalculate `Available KV cache memory` during startup. If this happens, terminate the running instance and relaunch it with the same parameters.


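As a back-of-the-envelope illustration of why lowering `--gpu-memory-utilization` relieves memory pressure (every number below is an assumed round figure, not a measurement): vLLM reserves roughly `total_vram * utilization`, and whatever remains after weights and runtime overhead becomes the KV cache budget.

```python
# Back-of-the-envelope KV cache budget (assumed numbers, not measurements).
def kv_cache_budget_gib(total_vram_gib, utilization, weights_gib, overhead_gib):
    """VRAM left for the KV cache after weights and runtime overhead."""
    reserved = total_vram_gib * utilization
    return reserved - weights_gib - overhead_gib

# A 24 GiB card with assumed ~14 GiB of INT4 weights and ~2 GiB overhead:
for util in (0.96, 0.94, 0.92):
    budget = kv_cache_budget_gib(24, util, 14, 2)
    print(f"utilization={util}: ~{budget:.2f} GiB for KV cache")
```

Lowering the utilization leaves more free-VRAM headroom for the allocator and other processes, at the cost of a smaller KV cache budget.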


## Citation


Please cite both the original Qwen3.6-27B model and the AutoRound quantization method when using this quantized model.


### Base model


```bibtex
@misc{qwen36-27b,
  title  = {{Qwen3.6-27B}: Flagship-Level Coding in a {27B} Dense Model},
  author = {{Qwen Team}},
  year   = {2026},
  month  = {April},
  url    = {https://qwen.ai/blog?id=qwen3.6-27b}
}
```


### Quantization method


```bibtex
@article{cheng2023optimize,
  title   = {Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs},
  author  = {Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal = {arXiv preprint arXiv:2309.05516},
  year    = {2023}
}
```