This model is an INT4 quantized version of the original Qwen3.6-27B model.
Quantization was performed with AutoRound using a calibration dataset derived from Pile-10k. The calibration set consists of 256 randomly selected samples from diverse categories, with a sequence length of 2048 and 300 optimization iterations.
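For reference, a comparable run can be sketched with the AutoRound Python API. This is a minimal sketch under stated assumptions, not the exact script used for this checkpoint: the group size and output path are placeholders, and argument names may differ slightly between auto-round releases.

# Sketch of the quantization recipe described above (not the exact script used for this model).
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen3.6-27B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

autoround = AutoRound(
    model,
    tokenizer,
    bits=4,                        # INT4 weights
    group_size=128,                # assumed group size; not stated in the card
    dataset="NeelNanda/pile-10k",  # calibration data derived from Pile-10k
    nsamples=256,                  # 256 randomly selected calibration samples
    seqlen=2048,                   # calibration sequence length
    iters=300,                     # 300 optimization iterations
)
autoround.quantize()
autoround.save_quantized("./Qwen3.6-27B-int4-AutoRound", format="auto_round")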
The launch parameters below were tested on a headless RTX 3090 using a recent vLLM nightly build. MTP (multi-token prediction) speculative decoding and prefix caching are confirmed to work.
vllm serve acyildirimer/Qwen3.6-27B-int4-AutoRound \
--host 0.0.0.0 \
--port 8000 \
--dtype auto \
--kv-cache-dtype auto \
--max-model-len auto \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--gpu-memory-utilization 0.96 \
--enable-prefix-caching \
--max-num-seqs 1 \
--served-model-name Qwen3.6-27B \
--language-model-only \
--performance-mode interactivity \
--attention-backend auto \
--max-num-batched-tokens 4096 \
--generation-config auto \
--override-generation-config '{"temperature":0.6,"top_p":0.95,"top_k":20,"min_p":0.0,"repetition_penalty":1.0}' \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 2}'
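Once the server is up, it exposes an OpenAI-compatible API on port 8000. A minimal sanity check with the openai Python client might look like the following; the endpoint and model name match the launch command above, and the prompt is only an example.

# Quick sanity check against the OpenAI-compatible endpoint started above.
from openai import OpenAI

# vLLM does not require an API key unless one was configured at launch,
# but the client expects a non-empty string.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen3.6-27B",  # matches --served-model-name
    messages=[
        {"role": "user", "content": "Write a Python function that checks whether a string is a palindrome."}
    ],
    temperature=0.6,
    top_p=0.95,
)
print(response.choices[0].message.content)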
The turboquant KV cache dtype is not currently working on Ampere GPUs, so it is not enabled in the launch example above; on supported hardware you can change the --kv-cache-dtype value, or leave it set to auto.

If you run into out-of-memory errors, lower --gpu-memory-utilization. For example, try 0.94, 0.92, or lower depending on your available VRAM and workload.

On some runs, vLLM may report insufficient available KV cache memory during startup. If this happens, terminate the running instance and launch it again with the same parameters.

Please cite both the original Qwen3.6-27B model and the AutoRound quantization method when using this quantized model.
@misc{qwen36-27b,
title = {{Qwen3.6-27B}: Flagship-Level Coding in a {27B} Dense Model},
author = {{Qwen Team}},
year = {2026},
month = {April},
url = {https://qwen.ai/blog?id=qwen3.6-27b}
}
@article{cheng2023optimize,
title = {Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs},
author = {Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
journal = {arXiv preprint arXiv:2309.05516},
year = {2023}
}
Base model: Qwen/Qwen3.6-27B