---
library_name: transformers
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3.6-27B/blob/main/LICENSE
pipeline_tag: image-text-to-text
base_model: Qwen/Qwen3.6-27B
base_model_relation: quantized
tags:
- compressed-tensors
- qwen3_6
- int8
- autoround
---

# Qwen3.6-27B INT8 AutoRound

This is an unofficial INT8 quantized version of [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B).
It was created using [AutoRound](https://github.com/intel/auto-round).

## Available versions

* There are two versions, one per branch.
* The main branch is slightly smaller: it also quantizes `self_attn` and disables grouping (`group_size = -1`), at a small cost to output quality.
* For users with 48 GB of VRAM, the main branch is recommended. If you have more than that, the gs128 branch might be better, although the difference in practical use is minimal. (See the snippet below for selecting a branch.)

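Both versions live in this repository as git branches. To use the gs128 variant, download that revision explicitly; a minimal sketch with `huggingface_hub` is shown below (the repository id is a placeholder, since it depends on where this card is hosted):

```python
# Sketch: fetch a specific branch (revision) of this repository.
# REPO_ID is a placeholder; replace it with the actual id of this model repo.
from huggingface_hub import snapshot_download

REPO_ID = "your-namespace/Qwen3.6-27B-INT8-AutoRound"  # placeholder

local_dir = snapshot_download(
    repo_id=REPO_ID,
    revision="gs128",  # branch name; omit this argument for the main branch
)
print(local_dir)  # point `vllm serve` at this path
```
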
## Quantization details

| Field | Main branch | gs128 branch |
|------|------|------|
| Base | `Qwen/Qwen3.6-27B` | `Qwen/Qwen3.6-27B` |
| Method | AutoRound (`intel/auto-round`), **custom** recipe | AutoRound (`intel/auto-round`), default recipe |
| Scheme | W8A16 | W8A16 |
| Bits | 8 | 8 |
| Group size | **-1** | 128 |
| Symmetric | yes | yes |
| Unquantized layers | `visual`, `mtp`, `linear_attn`, `embed_tokens`, `lm_head` | `visual`, `mtp`, <code><strong>self_attn</strong></code>, `linear_attn`, `embed_tokens`, `lm_head` |
| Calibration samples | 128 | 128 |
| Iterations | **1000** | 200 |
| Batch size | 8 | 8 |
| torch.compile | enabled | enabled |
| Size | **36.8 GB** | **38.8 GB** |
| GPU used for quant | 2× RTX 3090 | 2× RTX 3090 |

* For more information, please check `quantize.py`; a rough sketch of the recipe is shown below.

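The exact recipe is in `quantize.py`; the following is only a rough, hypothetical sketch of the main-branch setup using the `auto-round` Python API. Keyword names, the export format string, and the mechanism for excluding layers are assumptions and may differ from the script and the auto-round version actually used.

```python
# Hypothetical sketch of the main-branch recipe; NOT the exact quantize.py.
# Kwarg names follow the intel/auto-round Python API and may vary by version.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

base = "Qwen/Qwen3.6-27B"
# Loading shown with AutoModelForCausalLM for brevity; the multimodal variant
# may need a different auto class or auto-round's multimodal path.
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(base)

autoround = AutoRound(
    model,
    tokenizer,
    bits=8,         # INT8 weights (W8A16 scheme)
    group_size=-1,  # per-channel scales on the main branch (128 on the gs128 branch)
    sym=True,       # symmetric quantization
    iters=1000,     # tuning iterations (200 on the gs128 branch)
    nsamples=128,   # calibration samples
    batch_size=8,
)
# visual, mtp, linear_attn, embed_tokens and lm_head stay unquantized; this is
# set through auto-round's layer-config / fp-layers option (version-dependent).
autoround.quantize_and_save("./Qwen3.6-27B-INT8-AutoRound", format="auto_round")
```
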
## Evaluation Results (KLD)

Lower values indicate less degradation caused by quantization.
The main branch was used for this evaluation.

### KLD Metrics
| Metric | Value | Description |
| :--- | :--- | :--- |
| **Median KLD** | 0.000621 | Median divergence |
| **P90 KLD** | 0.002607 | Divergence at the 90th percentile |
| **Mean KLD** | 0.009185 | Average divergence |
| **Mean Coverage** | 0.998240 | - |

### Evaluation Configuration
| Parameter | Value |
| :--- | :--- |
| **Calibration Dataset** | wikitext-2-raw-v1 (test) |
| **Sequence Length** | 2048 |
| **Num Samples** | 64 |
| **Total Positions** | 131,008 |
| **Top-K Reference** | 1000 |

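The snippet below is a simplified, hypothetical sketch of how per-token KLD between the BF16 base model and this quantized model could be measured on wikitext-2. It uses the full vocabulary rather than the top-1000 reference used above, so it will not reproduce the exact figures, and loading both 27B models at once is memory-hungry.

```python
# Simplified sketch: per-token KL divergence between base and quantized model.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "Qwen/Qwen3.6-27B"
QUANT = "./Qwen3.6-27B-INT8-AutoRound"  # local path to this repo
SEQ_LEN, NUM_SAMPLES = 2048, 64

tok = AutoTokenizer.from_pretrained(BASE)
ref = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype="auto", device_map="auto").eval()
qnt = AutoModelForCausalLM.from_pretrained(QUANT, torch_dtype="auto", device_map="auto").eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids[0]

klds = []
with torch.no_grad():
    for i in range(NUM_SAMPLES):
        chunk = ids[i * SEQ_LEN:(i + 1) * SEQ_LEN].unsqueeze(0)
        p = torch.log_softmax(ref(chunk.to(ref.device)).logits.float(), dim=-1)
        q = torch.log_softmax(qnt(chunk.to(qnt.device)).logits.float(), dim=-1).to(p.device)
        # KL(p || q) per position: sum_v p(v) * (log p(v) - log q(v))
        klds.append((p.exp() * (p - q)).sum(-1).flatten().cpu())

kld = torch.cat(klds)
print(f"mean {kld.mean():.6f}  median {kld.median():.6f}  p90 {kld.quantile(0.9):.6f}")
```
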
## Quantization log

* Please check `log.txt`.

## How to use

* This model was tested with the latest `docker.io/vllm/vllm-openai:cu130-nightly` image.
* vLLM is recommended.

* Example args (for 2× RTX 3090 users):
```
vllm serve ./Qwen3.6-27B-INT8-AutoRound \
  --tensor-parallel-size 2 \
  --attention-backend FLASHINFER \
  --performance-mode interactivity \
  --max-model-len auto \
  --max-num-batched-tokens 2048 \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.932 \
  --compilation-config '{"mode":"VLLM_COMPILE","cudagraph_capture_sizes":[3]}' \
  -O3 \
  --async-scheduling \
  --language-model-only \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
  --default-chat-template-kwargs.preserve_thinking true \
  --mamba-cache-mode all \
  --mamba-block-size 8 \
  --enable-prefix-caching \
  --enable-chunked-prefill
```
* With these settings, you get around 129k tokens of context. You can also add `--kv-cache-dtype fp8_e4m3 --calculate-kv-scales` to get about 252k tokens.
* You can add `--enforce-eager` (you might need to remove `--compilation-config`) or set the `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False` environment variable (requires `--disable-custom-all-reduce`) to allocate more VRAM to the KV cache, but tokens/s will be noticeably lower.
* Remove `--speculative-config` if you really want more context, but I highly recommend keeping it.
* Note: this information is based on my current understanding and testing. Optimal configurations may vary depending on your specific hardware setup. For further details, please refer to the official vLLM documentation.

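Once the server is running, it exposes an OpenAI-compatible API (port 8000 by default). A minimal client sketch; note that the model name defaults to the path passed to `vllm serve` unless you set `--served-model-name`:

```python
# Minimal sketch: query the vLLM OpenAI-compatible server started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="./Qwen3.6-27B-INT8-AutoRound",  # or whatever --served-model-name you set
    messages=[{"role": "user", "content": "Give me a one-paragraph summary of INT8 W8A16 quantization."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```
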
## Acknowledgements
- [Lorbus](https://huggingface.co/Lorbus) for the README.md format
- [Alibaba / Qwen team](https://huggingface.co/Qwen) for the base [Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) model
- [Intel AutoRound](https://github.com/intel/auto-round) team for the quantization framework
- [vLLM project](https://github.com/vllm-project/vllm) for the inference engine and Qwen3_5 MTP support