---
license: apache-2.0
base_model:
- Qwen/Qwen3.6-27B
---

## Model details

This model is an INT4 quantized version of the original **Qwen3.6-27B** model. Quantization was performed with **AutoRound** using a calibration dataset derived from Pile-10k. The calibration set consists of 256 randomly selected samples from diverse categories, with a sequence length of 2048 and 300 optimization iterations.

The launch parameters below were tested on a headless RTX 3090 using the latest vLLM nightly build. Multi-Token Prediction (MTP) speculative decoding and prefix caching are confirmed to work. An example request against the running server is shown at the end of this card.

## vLLM launch example

```bash
vllm serve acyildirimer/Qwen3.6-27B-int4-AutoRound \
  --host 0.0.0.0 \
  --port 8000 \
  --dtype auto \
  --kv-cache-dtype auto \
  --max-model-len auto \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --gpu-memory-utilization 0.96 \
  --enable-prefix-caching \
  --max-num-seqs 1 \
  --served-model-name Qwen3.6-27B \
  --language-model-only \
  --performance-mode interactivity \
  --attention-backend auto \
  --max-num-batched-tokens 4096 \
  --generation-config auto \
  --override-generation-config '{"temperature":0.6,"top_p":0.95,"top_k":20,"min_p":0.0,"repetition_penalty":1.0}' \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 2}'
```

## Compatibility

- The launch command was tested with the latest vLLM nightly build.
- It should also work with the stable vLLM release, provided that the selected KV cache type is supported. If needed, use a supported `--kv-cache-dtype` value or set it to `auto`.
- `turboquant` does not currently work on Ampere GPUs, so it is not enabled in the launch example above.

## Troubleshooting

- If you encounter out-of-memory errors, reduce `--gpu-memory-utilization`. For example, try `0.94`, `0.92`, or lower depending on your available VRAM and workload.
- In some runs, vLLM may calculate `Available KV cache memory` incorrectly during startup. If this happens, terminate the running instance and launch it again with the same parameters.

## Citation

Please cite both the original Qwen3.6-27B model and the AutoRound quantization method when using this quantized model.

### Base model

```bibtex
@misc{qwen36-27b,
  title  = {{Qwen3.6-27B}: Flagship-Level Coding in a {27B} Dense Model},
  author = {{Qwen Team}},
  year   = {2026},
  month  = {April},
  url    = {https://qwen.ai/blog?id=qwen3.6-27b}
}
```

### Quantization method

```bibtex
@article{cheng2023optimize,
  title   = {Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs},
  author  = {Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal = {arXiv preprint arXiv:2309.05516},
  year    = {2023}
}
```
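
## Example request

Once the server above is running, it exposes vLLM's OpenAI-compatible HTTP API. The snippet below is a minimal sketch assuming the launch command above (port `8000`, served model name `Qwen3.6-27B`) on a local machine; adjust the host, port, prompt, and `max_tokens` to your setup.

```bash
# Confirm the served model name registered by vLLM
curl http://localhost:8000/v1/models

# Send a simple chat completion request; sampling parameters fall back to the
# defaults set via --override-generation-config in the launch command above
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen3.6-27B",
        "messages": [
          {"role": "user", "content": "Write a Python function that reverses a string."}
        ],
        "max_tokens": 512
      }'
```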