# Qwen3-VL-30B-A3B-Instruct FP8-DYNAMIC
This repository contains Qwen/Qwen3-VL-30B-A3B-Instruct quantized to `FP8_DYNAMIC` using `llm-compressor` (data-free quantization) and saved in the `compressed-tensors` format.
## Quantization details
- Scheme: FP8_DYNAMIC
- Targets: Linear layers
- Ignored modules: `lm_head`, `visual/visionblocks`, `mlp.gate`
- Calibration: data-free (no dataset required)
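For reference, a data-free `FP8_DYNAMIC` run with `llm-compressor` looks roughly like the recipe below. This is a sketch, not the exact script used for this checkpoint: the `ignore` patterns are assumptions matching the module list above, and running it requires a GPU large enough to hold the BF16 model.

```python
from transformers import Qwen3VLMoeForConditionalGeneration
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Load the base model in its original precision.
model = Qwen3VLMoeForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-30B-A3B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)

# FP8_DYNAMIC: static per-channel FP8 weights, dynamic per-token activation
# scales, so no calibration dataset is needed. The ignore patterns below are
# assumed regexes for the vision tower and MoE router; adjust to your model.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head", "re:visual.*", "re:.*mlp.gate$"],
)

oneshot(model=model, recipe=recipe)

# Save in compressed-tensors format.
model.save_pretrained("Qwen3-VL-30B-A3B-Instruct-FP8-DYNAMIC", save_compressed=True)
```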
## Usage (Transformers)
```python
import torch
from transformers import AutoProcessor, Qwen3VLMoeForConditionalGeneration

repo_id = "dtometzki/Qwen3-VL-30B-A3B-Instruct-FP8-DYNAMIC"

model = Qwen3VLMoeForConditionalGeneration.from_pretrained(
    repo_id,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
print("Model loaded ✅")
```
## Usage (vLLM)
```bash
python -m vllm.entrypoints.openai.api_server \
  --model ./Qwen3-VL-30B-A3B-Instruct-FP8-DYNAMIC \
  --quantization fp8 \
  --trust-remote-code \
  --port 8000
```
Tested on a RunPod A6000 with 48 GB VRAM.
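The command above exposes an OpenAI-compatible endpoint at `http://localhost:8000/v1/chat/completions`. The sketch below only builds the request body; `build_chat_payload` and the image URL are illustrative, and the `model` field must match whatever you passed to `--model`.

```python
import json

def build_chat_payload(model, prompt, image_url=None, max_tokens=256):
    """Build an OpenAI-style chat-completion body with optional image input."""
    content = [{"type": "text", "text": prompt}]
    if image_url is not None:
        # vLLM's OpenAI-compatible server accepts image_url content parts
        # for vision-language models.
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }

payload = build_chat_payload(
    "./Qwen3-VL-30B-A3B-Instruct-FP8-DYNAMIC",
    "Describe this image.",
    image_url="https://example.com/cat.jpg",  # placeholder URL
)
# POST this JSON to http://localhost:8000/v1/chat/completions
print(json.dumps(payload, indent=2))
```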
## Notes
- FP8 reduces VRAM compared to FP16/BF16 and can improve throughput depending on hardware and runtime.
- Longer contexts increase KV-cache VRAM usage. Reduce `--max-model-len` if you run out of memory.
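The KV-cache note can be made concrete with back-of-the-envelope arithmetic. The layer and head counts below are illustrative assumptions, not values read from this checkpoint's `config.json`; check the actual config before relying on them.

```python
# Assumed architecture parameters (illustrative, not from this repo's config):
num_layers = 48     # decoder layers
num_kv_heads = 4    # GQA key/value heads
head_dim = 128      # per-head dimension
dtype_bytes = 2     # FP16/BF16 cache (vLLM can also hold the KV cache in FP8)

# Keys and values are both cached, hence the factor of 2.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
print(kv_bytes_per_token)  # 98304 bytes, i.e. 96 KiB per token

max_model_len = 32768
total_gib = kv_bytes_per_token * max_model_len / 2**30
print(f"{total_gib:.1f} GiB")  # 3.0 GiB of KV cache at a 32k context
```

Under these assumptions, halving `--max-model-len` halves the maximum KV-cache footprint, which is why it is the first knob to turn when a long-context run exceeds VRAM.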
## Acknowledgements
- Base model: Qwen/Qwen3-VL-30B-A3B-Instruct
- Quantization: llm-compressor + compressed-tensors