Heads-up for Ampere users: tq-t4nc recipe + MTP doesn't work with CUDA graphs

#2
by wasifb - opened

Thanks for the great quant, and especially for bundling the BF16 mtp.fc. On 2× RTX 3090 (Ampere, SM 86) we're getting 80 TPS on code single-card
and 92 TPS on code at TP=2 following your card (great numbers, beating your quoted RTX 5090 figures on our 3090s). Wanted to flag a caveat for
other Ampere users who try the --kv-cache-dtype tq-t4nc recipe:

On Ampere, TurboQuant KV combined with --speculative-config method=mtp + --enable-chunked-prefill crashes vLLM at engine warmup:

turboquant_attn.py:570: qsl = query_start_loc.tolist()
RuntimeError: Cannot copy between CPU and CUDA tensors during CUDA graph capture
unless the CPU tensor is pinned.

The .tolist() in the continuation-prefill branch forces a GPU→CPU sync that's illegal inside CUDA graph capture. vLLM PR #40092 (2026-04-23)
unblocked the fast path but left the continuation-prefill branch untouched, so any spec-decode + chunked-prefill combo still hits this on
Ampere. This is distinct from the hybrid-model NotImplementedError at arg_utils.py:1652 (covered by the
Sandermage/genesis-vllm-patches monkey-patcher).
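
To see the failure mode in isolation, here is a minimal plain-PyTorch sketch (not vLLM code): any GPU→CPU copy such as .tolist() raises a
RuntimeError once CUDA graph capture has started.

    import torch

    x = torch.arange(8, device="cuda")

    # Warm up on a side stream before capture, as the CUDA graph docs recommend.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        _ = x * 2
    torch.cuda.current_stream().wait_stream(s)

    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        _ = x * 2          # graph-safe: stays on the GPU
        vals = x.tolist()  # GPU->CPU copy + sync, fails during capture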

Filed upstream as vllm-project/vllm#40807.

Your card's suggested workaround (--compilation-config.cudagraph_mode none) boots, but it drops short-prompt TPS by ~55% and ends up slower
than llama.cpp mainline at long context, so it's net-negative on this hardware today. We wrote a small disk-edit patch that wraps both
.tolist() sites in torch.cuda.is_current_stream_capturing() guards: it falls back to the graph-safe fast path only during capture, and real
inference is unchanged. Source + reproducible recipe at
github.com/noonghunna/qwen36-27b-single-3090.
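
The shape of the guard is roughly this (a simplified sketch, not the exact diff; query_start_loc_as_list is a made-up helper name, the real
patch edits the two call sites in turboquant_attn.py in place):

    from typing import List, Optional
    import torch

    def query_start_loc_as_list(query_start_loc: torch.Tensor) -> Optional[List[int]]:
        # A GPU->CPU copy is illegal while a CUDA graph is being captured, so
        # skip the host-side list and let the caller take the graph-safe fast path.
        if torch.cuda.is_current_stream_capturing():
            return None
        # Outside capture (real inference, profiling) behavior is unchanged.
        return query_start_loc.tolist()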

What worked for us on 2× RTX 3090 (benched against your recipe):

  • --kv-cache-dtype turboquant_3bit_nc instead of tq-t4nc → smaller bytes-per-token, lets us fit 125K context on a single card (KV pool 198K
    tokens with vision enabled)
  • --speculative-config '{"method":"mtp","num_speculative_tokens":3}': n=3 beats your recommended n=1 by ~30% on code (79.7 vs ~60 TPS
    single-card, 92.2 TPS TP=2), with mean AL 3.4 and 67–92% per-position acceptance. n=4 regresses (position-4 acceptance collapses to ~8%);
    see the sketch after this list.
  • --gpu-memory-utilization 0.97, --max-num-seqs 1, --max-num-batched-tokens 4128
  • Vision enabled (no --language-model-only): 54 TPS on image+text, only 18% slower than pure-text narrative generation
  • CUDA graphs ON via our patch → keeping this is what takes us from ~30 TPS (eager workaround) to 85 TPS sustained / 106 peak
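
As a rough back-of-the-envelope for the n=3 vs n=4 choice: assuming each draft position is accepted independently and acceptance stops at the
first rejection, expected tokens per step is 1 + p1 + p1·p2 + ... The rates below are illustrative picks from our 67–92% range, not exact
per-position measurements.

    def mean_acceptance_length(per_position_accept):
        # One token from the target model each step, plus drafted tokens that survive.
        expected, survive = 1.0, 1.0
        for p in per_position_accept:
            survive *= p
            expected += survive
        return expected

    print(mean_acceptance_length([0.92, 0.78, 0.67]))        # ~3.12 with n=3
    print(mean_acceptance_length([0.92, 0.78, 0.67, 0.08]))  # ~3.16 with n=4: position 4 adds ~0.04
                                                             # tokens while every step pays for the extra draft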

Once the upstream .tolist() sync gets pinned or precomputed (pending review at #40807), your tq-t4nc recipe should unlock 262K context on
Ampere too — we confirmed the Genesis-patched hybrid path gives a GPU KV cache of 874,368 tokens (5× fp8) on this model. Standing by to
retest when upstream lands.

Minor note: your card recommends --tool-call-parser qwen3_xml but qwen3_coder worked better for us with OpenAI-compat clients (Open WebUI,
LM Studio, Cline). YMMV.

Thanks again for the great release — the BF16 mtp.fc design decision is what makes single-card MTP possible at all.

wasifb changed discussion status to closed
wasifb changed discussion status to open

@wasifb can you please share your patch_tolist_cudagraph.py file?

The patch is now available on GitHub; links in the post above have been updated.
