Qwen3-VL-30B-A3B-Instruct FP8-DYNAMIC

This repository contains Qwen/Qwen3-VL-30B-A3B-Instruct quantized to FP8_DYNAMIC with llm-compressor (data-free quantization) and saved in the compressed-tensors format.

Quantization details

  • Scheme: FP8_DYNAMIC
  • Targets: Linear layers
  • Ignored modules: lm_head, the vision blocks (visual.*), and mlp.gate
  • Calibration: data-free (no dataset required)
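The scheme above quantizes weights ahead of time and scales activations on the fly, which is why no calibration dataset is needed. A minimal sketch of the dynamic per-token scaling idea (illustrative only, not the llm-compressor implementation):

```python
# Illustrative sketch of dynamic FP8 activation scaling: each activation
# row is mapped into the representable range of FP8 E4M3 at runtime,
# so no calibration dataset is required.

FP8_E4M3_MAX = 448.0  # largest finite value in the E4M3 format


def dynamic_fp8_scale(row):
    """Per-token scale that maps `row` into [-448, 448]."""
    amax = max(abs(x) for x in row) or 1.0  # guard against all-zero rows
    return amax / FP8_E4M3_MAX


def quantize_dequantize(row):
    """Round-trip a row through the FP8 range (crude rounding stand-in,
    value simulation only -- real FP8 rounding is format-aware)."""
    scale = dynamic_fp8_scale(row)
    return [round(x / scale) * scale for x in row]
```

Because the scale is recomputed per token from the live activations, the quantized model ships only weight scales; activation scales never have to be calibrated offline.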

Usage (Transformers)

import torch
from transformers import AutoProcessor, Qwen3VLMoeForConditionalGeneration

repo_id = "dtometzki/Qwen3-VL-30B-A3B-Instruct-FP8-DYNAMIC"

model = Qwen3VLMoeForConditionalGeneration.from_pretrained(
    repo_id,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)

processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)

print("Model loaded ✅")
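Once loaded, the model takes multimodal chat messages. A hypothetical next step (the image URL and prompt are placeholders), showing the Qwen-VL message convention that feeds `processor.apply_chat_template`:

```python
# Placeholder image URL and prompt -- substitute your own inputs.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/demo.jpg"},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

# With the model and processor loaded above, generation would look like:
#   inputs = processor.apply_chat_template(
#       messages, add_generation_prompt=True, tokenize=True,
#       return_dict=True, return_tensors="pt",
#   ).to(model.device)
#   output_ids = model.generate(**inputs, max_new_tokens=128)
```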

Usage (vLLM)

python -m vllm.entrypoints.openai.api_server \
  --model ./Qwen3-VL-30B-A3B-Instruct-FP8-DYNAMIC \
  --quantization fp8 \
  --trust-remote-code \
  --port 8000

Tested on a RunPod A6000 with 48 GB of VRAM.
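The server exposes an OpenAI-compatible API. A sketch of a request body for it (the localhost port and model name are assumptions taken from the server command above; the image URL is a placeholder):

```python
import json

# Build a chat-completions request in the OpenAI-compatible format that
# the vLLM server above accepts.
payload = {
    "model": "./Qwen3-VL-30B-A3B-Instruct-FP8-DYNAMIC",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/demo.jpg"}},
                {"type": "text", "text": "What is in this picture?"},
            ],
        }
    ],
    "max_tokens": 128,
}
body = json.dumps(payload)
# POST `body` to http://localhost:8000/v1/chat/completions with
# Content-Type: application/json (curl, requests, or the openai client).
```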

Notes

  • FP8 reduces VRAM compared to FP16/BF16 and can improve throughput depending on hardware and runtime.
  • Longer contexts increase KV-cache VRAM usage. Reduce --max-model-len if you run out of memory.
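The KV-cache note above can be turned into a back-of-envelope estimate. The architecture numbers below are assumed placeholders, not the real Qwen3-VL-30B-A3B config; substitute the values from the model's config.json:

```python
def kv_cache_bytes(seq_len, layers, kv_heads, head_dim, dtype_bytes=2):
    """Approximate KV-cache size for one sequence.

    The factor of 2 accounts for the separate K and V tensors per layer.
    """
    return 2 * layers * kv_heads * head_dim * dtype_bytes * seq_len


# Placeholder values: 48 layers, 4 KV heads, head_dim 128, a BF16 cache
# (2 bytes per element), and a 32k-token context.
gib = kv_cache_bytes(32_768, 48, 4, 128, 2) / 2**30
```

With these placeholder numbers the cache works out to 3 GiB per sequence, which is why lowering `--max-model-len` is the first lever when VRAM runs out.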

Acknowledgements

  • Base model: Qwen/Qwen3-VL-30B-A3B-Instruct
  • Quantization: llm-compressor + compressed-tensors

Model size

  • 31B params (Safetensors)
  • Tensor types: BF16, F8_E4M3