---
license: apache-2.0
base_model:
- Qwen/Qwen3.6-27B
---
## Model details
This model is an INT4-quantized version of the original **Qwen3.6-27B** model.
Quantization was performed with **AutoRound** using a calibration dataset derived from Pile-10k: 256 randomly selected samples spanning diverse categories, with a calibration sequence length of 2048 and 300 optimization iterations.
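For reference, the recipe above corresponds roughly to the following call. This is a minimal sketch using the `auto_round` Python API; the group size, symmetry, and export format shown here are assumptions and are not documented for this checkpoint.

```python
# Sketch of the quantization recipe described above, using the auto-round Python API.
# Only the sample count, sequence length, iteration count, and calibration dataset are
# stated in this card; group_size, sym, and the export format are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen3.6-27B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

autoround = AutoRound(
    model,
    tokenizer,
    bits=4,                        # INT4 weight quantization
    group_size=128,                # assumption: common AutoRound default
    sym=True,                      # assumption: symmetric quantization
    nsamples=256,                  # 256 calibration samples
    seqlen=2048,                   # calibration sequence length
    iters=300,                     # optimization iterations
    dataset="NeelNanda/pile-10k",  # Pile-10k calibration data
)
autoround.quantize()
autoround.save_quantized("./Qwen3.6-27B-int4-AutoRound", format="auto_round")
```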
The launch parameters below were tested on a headless RTX 3090 with the latest vLLM nightly build. MTP (multi-token prediction) speculative decoding and prefix caching are confirmed to work.
## vLLM launch example
```bash
vllm serve acyildirimer/Qwen3.6-27B-int4-AutoRound \
    --host 0.0.0.0 \
    --port 8000 \
    --dtype auto \
    --kv-cache-dtype auto \
    --max-model-len auto \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --gpu-memory-utilization 0.96 \
    --enable-prefix-caching \
    --max-num-seqs 1 \
    --served-model-name Qwen3.6-27B \
    --language-model-only \
    --performance-mode interactivity \
    --attention-backend auto \
    --max-num-batched-tokens 4096 \
    --generation-config auto \
    --override-generation-config '{"temperature":0.6,"top_p":0.95,"top_k":20,"min_p":0.0,"repetition_penalty":1.0}' \
    --speculative-config '{"method": "mtp", "num_speculative_tokens": 2}'
```
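Once the server is running, it exposes an OpenAI-compatible API under the served model name. Below is a minimal query sketch using the `openai` Python client; the endpoint address follows from the `--host`/`--port` flags above, and the prompt is only illustrative.

```python
# Minimal chat-completion request against the OpenAI-compatible endpoint started above.
# Sampling parameters are left to the server-side generation config overrides.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen3.6-27B",  # matches --served-model-name
    messages=[{"role": "user", "content": "Write a Python function that checks whether a string is a palindrome."}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```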
## Compatibility
- The launch command was tested with the latest vLLM nightly build.
- It should also work with the stable vLLM release, provided the selected KV cache dtype is supported. If needed, use a supported `--kv-cache-dtype` value or set it to `auto`.
- `turboquant` does not currently work on Ampere GPUs, so it is not enabled in the launch example above.
## Troubleshooting
- If you encounter out-of-memory errors, reduce `--gpu-memory-utilization`. For example, try `0.94`, `0.92`, or lower depending on your available VRAM and workload.
- In some runs, vLLM may miscalculate the `Available KV cache memory` reported during startup. If this happens, terminate the running instance and relaunch it with the same parameters.
## Citation
Please cite both the original Qwen3.6-27B model and the AutoRound quantization method when using this quantized model.
### Base model
```bibtex
@misc{qwen36-27b,
title = {{Qwen3.6-27B}: Flagship-Level Coding in a {27B} Dense Model},
author = {{Qwen Team}},
year = {2026},
month = {April},
url = {https://qwen.ai/blog?id=qwen3.6-27b}
}
```
### Quantization method
```bibtex
@article{cheng2023optimize,
title = {Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs},
author = {Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
journal = {arXiv preprint arXiv:2309.05516},
year = {2023}
}
```