---
license: apache-2.0
base_model:
- Qwen/Qwen3.6-27B
---
## Model details

This model is an INT4 quantized version of the original **Qwen3.6-27B**.

Quantization was performed with **AutoRound** on a calibration set of 256 randomly selected samples from the Pile-10k dataset, covering diverse categories, with a sequence length of 2048 and 300 optimization iterations.
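
For reference, the recipe above roughly corresponds to the following AutoRound call. This is a minimal sketch assuming the standard AutoRound Python API; exact argument names can vary between releases, and the `group_size` value is an assumption not stated above.

```python
# Minimal sketch of the quantization recipe described above.
# Assumes the AutoRound Python API (`pip install auto-round`);
# exact argument names can vary between AutoRound releases.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen3.6-27B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

autoround = AutoRound(
    model,
    tokenizer,
    bits=4,                        # INT4 weights
    group_size=128,                # assumption: group size is not stated above
    dataset="NeelNanda/pile-10k",  # Pile-10k calibration source
    nsamples=256,                  # 256 randomly selected samples
    seqlen=2048,                   # calibration sequence length
    iters=300,                     # optimization iterations
)
autoround.quantize_and_save(output_dir="Qwen3.6-27B-int4-AutoRound")
```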

The launch parameters below were tested on a headless RTX 3090 with the latest vLLM nightly build. Both MTP (multi-token prediction) speculative decoding and prefix caching are confirmed to work.

## vLLM launch example

```bash
vllm serve acyildirimer/Qwen3.6-27B-int4-AutoRound \
  --host 0.0.0.0 \
  --port 8000 \
  --dtype auto \
  --kv-cache-dtype auto \
  --max-model-len auto \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --gpu-memory-utilization 0.96 \
  --enable-prefix-caching \
  --max-num-seqs 1 \
  --served-model-name Qwen3.6-27B \
  --language-model-only \
  --performance-mode interactivity \
  --attention-backend auto \
  --max-num-batched-tokens 4096 \
  --generation-config auto \
  --override-generation-config '{"temperature":0.6,"top_p":0.95,"top_k":20,"min_p":0.0,"repetition_penalty":1.0}' \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 2}'
```
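
Once the server is up, a quick sanity check is to hit its OpenAI-compatible endpoint. The sketch below uses the `openai` Python client and assumes the host, port, and served model name from the launch command above.

```python
# Quick sanity check against the OpenAI-compatible server started above.
# Assumes `pip install openai` and the host/port from the launch command.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen3.6-27B",  # matches --served-model-name
    messages=[{"role": "user", "content": "Briefly explain INT4 quantization."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```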

## Compatibility

- The launch command was tested with the latest vLLM nightly build.
- It should also work with the stable vLLM release, provided the selected KV cache dtype is supported there. If needed, pass a supported `--kv-cache-dtype` value or leave it at `auto`.
- `turboquant` does not currently work on Ampere GPUs, so it is not enabled in the launch example above.

## Troubleshooting

- If you encounter out-of-memory errors, reduce `--gpu-memory-utilization`. For example, try `0.94`, `0.92`, or lower depending on your available VRAM and workload.
- On some runs, vLLM may miscalculate the `Available KV cache memory` reported during startup. If this happens, terminate the instance and relaunch it with the same parameters.

## Citation

Please cite both the original Qwen3.6-27B model and the AutoRound quantization method when using this quantized model.

### Base model

```bibtex
@misc{qwen36-27b,
  title  = {{Qwen3.6-27B}: Flagship-Level Coding in a {27B} Dense Model},
  author = {{Qwen Team}},
  year   = {2026},
  month  = {April},
  url    = {https://qwen.ai/blog?id=qwen3.6-27b}
}
```

### Quantization method

```bibtex
@article{cheng2023optimize,
  title   = {Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs},
  author  = {Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal = {arXiv preprint arXiv:2309.05516},
  year    = {2023}
}
```