---
license: apache-2.0
base_model:
- Qwen/Qwen3.6-27B
---
## Model details


This is an INT4-quantized version of the original **Qwen3.6-27B** model.


Quantization was performed with **AutoRound** using a calibration dataset derived from Pile-10k. The calibration set consists of 256 randomly selected samples from diverse categories, with a sequence length of 2048 and 300 optimization iterations.


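For intuition about what INT4 quantization stores, here is a toy, self-contained sketch of symmetric per-group round-to-nearest quantization. It is illustrative only: AutoRound learns rounding offsets and clipping ranges via signed gradient descent rather than plain rounding, and the group size and example weights below are arbitrary assumptions.

```python
# Toy symmetric per-group INT4 quantization (round-to-nearest).
# Illustrative only: AutoRound optimizes the rounding instead of
# rounding naively, but the storage idea is the same.

def quantize_group(weights, bits=4):
    """Quantize one group of float weights to signed INT4 codes."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for INT4
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize_group(q, scale):
    """Reconstruct approximate float weights from the codes."""
    return [v * scale for v in q]

group = [0.31, -0.72, 0.05, 0.64]
q, scale = quantize_group(group)
print(q)                         # integer codes in [-8, 7]
print(dequantize_group(q, scale))  # reconstruction close to the originals
```

Each group stores only the 4-bit codes plus one float scale, which is where the memory savings over BF16 weights come from.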
The launch parameters below were tested on a headless RTX 3090 with the latest vLLM nightly build. MTP (multi-token prediction) speculative decoding and prefix caching are confirmed to work.


## vLLM launch example


```bash
vllm serve acyildirimer/Qwen3.6-27B-int4-AutoRound \
  --host 0.0.0.0 \
  --port 8000 \
  --dtype auto \
  --kv-cache-dtype auto \
  --max-model-len auto \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --gpu-memory-utilization 0.96 \
  --enable-prefix-caching \
  --max-num-seqs 1 \
  --served-model-name Qwen3.6-27B \
  --language-model-only \
  --performance-mode interactivity \
  --attention-backend auto \
  --max-num-batched-tokens 4096 \
  --generation-config auto \
  --override-generation-config '{"temperature":0.6,"top_p":0.95,"top_k":20,"min_p":0.0,"repetition_penalty":1.0}' \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 2}'
```
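Once the server is up, it exposes the standard OpenAI-compatible `/v1/chat/completions` endpoint under the name set by `--served-model-name`. A minimal request sketch follows; the host, port, and prompt are assumptions matching the launch example above, and the actual send is commented out so the snippet stands on its own without a running server:

```python
import json
import urllib.request

def build_chat_request(prompt, host="localhost", port=8000):
    """Build the URL and payload for an OpenAI-compatible chat request."""
    url = f"http://{host}:{port}/v1/chat/completions"
    payload = {
        # Must match --served-model-name from the launch command.
        "model": "Qwen3.6-27B",
        "messages": [{"role": "user", "content": prompt}],
        # Sampling params are omitted here, so the server-side defaults
        # from --override-generation-config apply.
    }
    return url, payload

url, payload = build_chat_request("Write a haiku about GPUs.")
req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# resp = urllib.request.urlopen(req)  # requires the server to be running
# print(json.load(resp)["choices"][0]["message"]["content"])
```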


## Compatibility


- The launch command was tested with the latest vLLM nightly build.
- It should also work with the stable vLLM release, provided that the selected KV cache type is supported. If needed, use a supported `--kv-cache-dtype` value or set it to `auto`.
- `turboquant` does not currently work on Ampere GPUs, so it is not enabled in the launch example above.


## Troubleshooting


- If you encounter out-of-memory errors, reduce `--gpu-memory-utilization`. For example, try `0.94`, `0.92`, or lower, depending on your available VRAM and workload.
- In some runs, vLLM may miscalculate `Available KV cache memory` during startup. If this happens, terminate the running instance and relaunch it with the same parameters.


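As a back-of-the-envelope illustration of why lowering `--gpu-memory-utilization` relieves memory pressure (every number below is an assumed round figure, not a measurement): vLLM reserves roughly `total_vram * utilization`, and whatever remains after weights and runtime overhead becomes the KV cache budget.

```python
# Back-of-the-envelope KV cache budget (assumed numbers, not measurements).
def kv_cache_budget_gib(total_vram_gib, utilization, weights_gib, overhead_gib):
    """VRAM left for the KV cache after weights and runtime overhead."""
    reserved = total_vram_gib * utilization
    return reserved - weights_gib - overhead_gib

# A 24 GiB card with assumed ~14 GiB of INT4 weights and ~2 GiB overhead:
for util in (0.96, 0.94, 0.92):
    budget = kv_cache_budget_gib(24, util, 14, 2)
    print(f"utilization={util}: ~{budget:.2f} GiB for KV cache")
```

Lowering the utilization leaves more free-VRAM headroom for the allocator and other processes, at the cost of a smaller KV cache budget.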


## Citation


Please cite both the original Qwen3.6-27B model and the AutoRound quantization method when using this quantized model.


### Base model


```bibtex
@misc{qwen36-27b,
  title  = {{Qwen3.6-27B}: Flagship-Level Coding in a {27B} Dense Model},
  author = {{Qwen Team}},
  year   = {2026},
  month  = {April},
  url    = {https://qwen.ai/blog?id=qwen3.6-27b}
}
```


### Quantization method


```bibtex
@article{cheng2023optimize,
  title   = {Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs},
  author  = {Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal = {arXiv preprint arXiv:2309.05516},
  year    = {2023}
}
```