Performance report on RTX 4090D (48GB VRAM): 40 t/s

#11
opened by SlavikF

System:

  • Intel Xeon W5-3425 with DDR5-4800 RAM
  • Nvidia RTX 4090D modded with 48GB VRAM

Tested with a request of about 40k tokens and a response of about 2k tokens.
Using MTP (multi-token prediction) speculative decoding.
I can fit 128k context into my 48GB VRAM.

Getting these speeds:
PP (prompt processing): 4000 t/s
TG (token generation): 40 to 44 t/s
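For anyone who wants to reproduce numbers like these, here is a rough sketch of a streaming benchmark against the OpenAI-compatible endpoint. The port and served model name are taken from my compose file below; the prompt is a stand-in (not my real 40k-token request), and counting streamed chunks is only an approximation of token counts:

```python
import time
from openai import OpenAI

# Endpoint and model name match the docker compose config below.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

prompt = "Summarize the following text:\n" + ("lorem ipsum " * 2000)  # stand-in long prompt

start = time.perf_counter()
stream = client.chat.completions.create(
    model="local-vl-qwen27B",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=2048,
    stream=True,
)

first_token_at = None
generated_chunks = 0
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # prefill (PP) roughly ends here
        generated_chunks += 1
end = time.perf_counter()

if first_token_at is not None and generated_chunks > 1:
    # One streamed chunk is roughly one token, so treat these as estimates.
    print(f"time to first token: {first_token_at - start:.2f} s")
    print(f"decode (TG) speed: ~{generated_chunks / (end - first_token_at):.1f} tok/s")
```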

My docker compose file:

services:
  vllm:
    image: vllm/vllm-openai:v0.20.1-cu129-ubuntu2404
    container_name: vllm-qwen27B
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
    ports:
      - "8080:8000"
    environment:
      TORCH_CUDA_ARCH_LIST: "8.9" # 8.9 = Ada Lovelace (RTX 4090 / 4090D)
    volumes:
      - /home/slavik/.cache:/root/.cache
    ipc: host
    command:
      - "--model"
      - "Qwen/Qwen3.6-27B-FP8"
      - "--max-model-len"
      - "131072"
      - "--served-model-name"
      - "local-vl-qwen27B"
      - "--gpu-memory-utilization"
      - "0.975"
      - "--performance-mode"
      - "interactivity"
      - "--trust-remote-code"
      - "--enable-auto-tool-choice"
      - "--tool-call-parser"
      - "qwen3_coder"
      - "--reasoning-parser"
      - "qwen3"
      - "--mm-encoder-tp-mode"
      - "data"
      - "--mm-processor-cache-type"
      - "shm"
      - "--speculative-config"
      - '{"method":"mtp","num_speculative_tokens":2}'
      - "--compilation-config"
      - '{"max_cudagraph_capture_size":16,"mode":"VLLM_COMPILE"}'
      - "--async-scheduling"
      - "--attention-backend"
      - "flashinfer"
      - "--kv-cache-dtype"
      - "bfloat16"
      - "--enable-prefix-caching"
Results for different numbers of speculative (MTP) tokens:

| Config          | Steady generation speed | Speed vs no SpecDecoding |
|-----------------|-------------------------|--------------------------|
| No SpecDecoding | ~18.8 tok/s             | 1.0x                     |
| MTP=2           | ~41.4 tok/s             | ~2.2x                    |
| MTP=3           | ~45.4 tok/s             | ~2.4x                    |
| MTP=4           | ~47.3 tok/s             | ~2.5x                    |
| MTP=5           | ~48.0 tok/s             | ~2.55x                   |
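To double-check that speculative decoding is actually engaged rather than silently disabled, the server's Prometheus /metrics endpoint can be scraped. Exact metric names differ between vLLM versions, so this sketch just filters for anything speculative-decoding related:

```python
import requests

# The compose file maps the vLLM server to host port 8080; /metrics is its
# Prometheus endpoint. Metric names vary by vLLM version, so grep for any
# counters related to speculative decoding.
metrics = requests.get("http://localhost:8080/metrics", timeout=10).text
for line in metrics.splitlines():
    if "spec_decode" in line and not line.startswith("#"):
        print(line)
```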

Compare that to llama.cpp with unsloth/Qwen3.6-27B-GGUF:Q6_K_XL:

https://huggingface.co/unsloth/Qwen3.6-27B-GGUF/discussions/7

I'm getting 30 t/s with llama.cpp (no MTP).
