Qwen3-VL-30B-A3B-Instruct FP8-DYNAMIC

This repository contains Qwen/Qwen3-VL-30B-A3B-Instruct quantized to FP8_DYNAMIC with llm-compressor (data-free quantization) and saved in the compressed-tensors format.

Quantization details

  • Scheme: FP8_DYNAMIC
  • Targets: Linear layers
  • Ignored modules: lm_head, the vision blocks (visual.*), and mlp.gate
  • Calibration: data-free (no dataset required)
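The scheme above quantizes weights ahead of time and scales activations on the fly, which is why no calibration dataset is needed. A minimal sketch of the dynamic per-token scaling idea (illustrative only, not the llm-compressor implementation):

```python
# Illustrative sketch of dynamic FP8 activation scaling: each activation
# row is mapped into the representable range of FP8 E4M3 at runtime,
# so no calibration dataset is required.

FP8_E4M3_MAX = 448.0  # largest finite value in the E4M3 format


def dynamic_fp8_scale(row):
    """Per-token scale that maps `row` into [-448, 448]."""
    amax = max(abs(x) for x in row) or 1.0  # guard against all-zero rows
    return amax / FP8_E4M3_MAX


def quantize_dequantize(row):
    """Round-trip a row through the FP8 range (crude rounding stand-in,
    value simulation only -- real FP8 rounding is format-aware)."""
    scale = dynamic_fp8_scale(row)
    return [round(x / scale) * scale for x in row]
```

Because the scale is recomputed per token from the live activations, the quantized model ships only weight scales; activation scales never have to be calibrated offline.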

Usage (Transformers)

import torch
from transformers import AutoProcessor, Qwen3VLMoeForConditionalGeneration

repo_id = "dtometzki/Qwen3-VL-30B-A3B-Instruct-FP8-DYNAMIC"

model = Qwen3VLMoeForConditionalGeneration.from_pretrained(
    repo_id,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)

processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)

print("Model loaded ✅")
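Once loaded, the model takes multimodal chat messages. A hypothetical next step (the image URL and prompt are placeholders), showing the Qwen-VL message convention that feeds `processor.apply_chat_template`:

```python
# Placeholder image URL and prompt -- substitute your own inputs.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/demo.jpg"},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

# With the model and processor loaded above, generation would look like:
#   inputs = processor.apply_chat_template(
#       messages, add_generation_prompt=True, tokenize=True,
#       return_dict=True, return_tensors="pt",
#   ).to(model.device)
#   output_ids = model.generate(**inputs, max_new_tokens=128)
```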

Usage (vLLM)

python -m vllm.entrypoints.openai.api_server \
  --model ./Qwen3-VL-30B-A3B-Instruct-FP8-DYNAMIC \
  --quantization fp8 \
  --trust-remote-code \
  --port 8000

Tested on a RunPod A6000 with 48 GB of VRAM.
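The server exposes an OpenAI-compatible API. A sketch of a request body for it (the localhost port and model name are assumptions taken from the server command above; the image URL is a placeholder):

```python
import json

# Build a chat-completions request in the OpenAI-compatible format that
# the vLLM server above accepts.
payload = {
    "model": "./Qwen3-VL-30B-A3B-Instruct-FP8-DYNAMIC",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/demo.jpg"}},
                {"type": "text", "text": "What is in this picture?"},
            ],
        }
    ],
    "max_tokens": 128,
}
body = json.dumps(payload)
# POST `body` to http://localhost:8000/v1/chat/completions with
# Content-Type: application/json (curl, requests, or the openai client).
```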

Notes

  • FP8 reduces VRAM compared to FP16/BF16 and can improve throughput depending on hardware and runtime.
  • Longer contexts increase KV-cache VRAM usage. Reduce --max-model-len if you run out of memory.
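The KV-cache note above can be turned into a back-of-envelope estimate. The architecture numbers below are assumed placeholders, not the real Qwen3-VL-30B-A3B config; substitute the values from the model's config.json:

```python
def kv_cache_bytes(seq_len, layers, kv_heads, head_dim, dtype_bytes=2):
    """Approximate KV-cache size for one sequence.

    The factor of 2 accounts for the separate K and V tensors per layer.
    """
    return 2 * layers * kv_heads * head_dim * dtype_bytes * seq_len


# Placeholder values: 48 layers, 4 KV heads, head_dim 128, a BF16 cache
# (2 bytes per element), and a 32k-token context.
gib = kv_cache_bytes(32_768, 48, 4, 128, 2) / 2**30
```

With these placeholder numbers the cache works out to 3 GiB per sequence, which is why lowering `--max-model-len` is the first lever when VRAM runs out.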

Acknowledgements

  • Base model: Qwen/Qwen3-VL-30B-A3B-Instruct
  • Quantization: llm-compressor + compressed-tensors

Model size

  • 31B params (Safetensors)
  • Tensor types: BF16, F8_E4M3