This model is an INT4 quantized version of the original Qwen3.6-27B model.
Quantization was performed with AutoRound using a calibration dataset derived from Pile-10k. The calibration set consists of 256 randomly selected samples from diverse categories, with a sequence length of 2048 and 300 optimization iterations.
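For reference, a comparable run can be sketched with the AutoRound Python API. This is a minimal sketch under stated assumptions, not the exact script used for this checkpoint: the group size and output path are placeholders, and argument names may differ slightly between auto-round releases.

# Sketch of the quantization recipe described above (not the exact script used for this model).
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen3.6-27B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

autoround = AutoRound(
    model,
    tokenizer,
    bits=4,                        # INT4 weights
    group_size=128,                # assumed group size; not stated in the card
    dataset="NeelNanda/pile-10k",  # calibration data derived from Pile-10k
    nsamples=256,                  # 256 randomly selected calibration samples
    seqlen=2048,                   # calibration sequence length
    iters=300,                     # 300 optimization iterations
)
autoround.quantize()
autoround.save_quantized("./Qwen3.6-27B-int4-AutoRound", format="auto_round")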
The launch parameters below were tested on a headless RTX 3090 using a recent vLLM nightly build. MTP (multi-token prediction) speculative decoding and prefix caching are confirmed to work.
vllm serve acyildirimer/Qwen3.6-27B-int4-AutoRound \
--host 0.0.0.0 \
--port 8000 \
--dtype auto \
--kv-cache-dtype auto \
--max-model-len auto \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--gpu-memory-utilization 0.96 \
--enable-prefix-caching \
--max-num-seqs 1 \
--served-model-name Qwen3.6-27B \
--language-model-only \
--performance-mode interactivity \
--attention-backend auto \
--max-num-batched-tokens 4096 \
--generation-config auto \
--override-generation-config '{"temperature":0.6,"top_p":0.95,"top_k":20,"min_p":0.0,"repetition_penalty":1.0}' \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 2}'
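Once the server is up, it exposes an OpenAI-compatible API on port 8000. A minimal sanity check with the openai Python client might look like the following; the endpoint and model name match the launch command above, and the prompt is only an example.

# Quick sanity check against the OpenAI-compatible endpoint started above.
from openai import OpenAI

# vLLM does not require an API key unless one was configured at launch,
# but the client expects a non-empty string.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen3.6-27B",  # matches --served-model-name
    messages=[
        {"role": "user", "content": "Write a Python function that checks whether a string is a palindrome."}
    ],
    temperature=0.6,
    top_p=0.95,
)
print(response.choices[0].message.content)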
The turboquant KV cache dtype is not currently working on Ampere GPUs, so it is not enabled in the launch example above; on supported hardware you can change the --kv-cache-dtype value, or leave it set to auto.

If you run into out-of-memory errors, lower --gpu-memory-utilization. For example, try 0.94, 0.92, or lower depending on your available VRAM and workload.

On some runs, vLLM may report insufficient available KV cache memory during startup. If this happens, terminate the running instance and launch it again with the same parameters.

Please cite both the original Qwen3.6-27B model and the AutoRound quantization method when using this quantized model.
@misc{qwen36-27b,
title = {{Qwen3.6-27B}: Flagship-Level Coding in a {27B} Dense Model},
author = {{Qwen Team}},
year = {2026},
month = {April},
url = {https://qwen.ai/blog?id=qwen3.6-27b}
}
@article{cheng2023optimize,
title = {Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs},
author = {Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
journal = {arXiv preprint arXiv:2309.05516},
year = {2023}
}
Base model: Qwen/Qwen3.6-27B