---
base_model: Qwen/Qwen3-30B-A3B
library_name: transformers
tags:
- quantization
- mxfp4
- 4-bit
- compressed-tensors
- qwen
- text-generation
- llmcompressor
language:
- en
pipeline_tag: text-generation
license: other
---
# Qwen3-30B-A3B-MXFP4A16
## Model Description
This is a compressed version of **[Qwen/Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B)**.
The model was quantized with **weight-only quantization** to **4-bit floating point (FP4)** using the **MXFP4** microscaling scheme. This format is optimized for next-generation hardware (such as NVIDIA Blackwell) but also runs efficiently on current GPUs through software emulation.

Because `MXFP4A16` stores microscaling FP4 weights alongside FP16 activations, the model shrinks by roughly 70-75% while maintaining high accuracy: the FP4 value grid represents the near-zero weight distributions typical of LLMs more faithfully than standard INT4 quantization.
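The size claim can be sanity-checked with quick arithmetic. This sketch assumes roughly 30.5B total parameters and one 8-bit shared scale per 32-weight block, and ignores the layers kept in higher precision (e.g. `lm_head`), so the real checkpoint is slightly larger:

```python
# Back-of-the-envelope memory estimate for the ~70-75% size reduction.
params = 30.5e9                 # approximate total parameter count
bf16_bytes = params * 2         # 16-bit weights: 2 bytes per parameter
mxfp4_bytes = params * 0.5      # 4-bit weights: 0.5 bytes per parameter
mxfp4_bytes += params / 32      # one 8-bit (E8M0) scale per 32-weight block

reduction = 1 - mxfp4_bytes / bf16_bytes
print(f"BF16: {bf16_bytes / 1e9:.0f} GB, "
      f"MXFP4: {mxfp4_bytes / 1e9:.1f} GB, "
      f"saved {reduction:.0%}")
```

The result (~16 GB of weights, ~73% smaller) lines up with the VRAM figures in the Hardware Requirements section below.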
## Quantization Details
This model was created using the `llmcompressor` library with the following configuration:
* **Scheme:** `MXFP4A16` (4-bit Weights, 16-bit Activations)
* **Algorithm:** Weight-Only Quantization (Data-Free)
* **Target Modules:** Linear Layers
* **Ignored Modules:** `lm_head` (kept in full precision for stability)
* **Group Size:** 32 (Block-wise scaling)
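The block-wise scaling above can be illustrated with a small numeric sketch. This is not the kernel that llmcompressor or vLLM actually use; it just assumes the standard FP4 (E2M1) value grid and a shared power-of-two scale per 32-weight block, in the spirit of the OCP MX format:

```python
import numpy as np

# Representable FP4 (E2M1) magnitudes; the largest is 6.0.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quantize_block(block):
    """Quantize one 32-weight block: a shared power-of-two scale
    (E8M0-style) plus one FP4 (E2M1) value per weight."""
    amax = np.abs(block).max()
    # Smallest power-of-two scale so the largest weight fits in FP4's range.
    scale = 2.0 ** np.ceil(np.log2(amax / FP4_GRID[-1])) if amax > 0 else 1.0
    scaled = block / scale
    # Round each magnitude to the nearest FP4 grid point, keeping the sign.
    idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return q, scale

def mxfp4_dequantize(q, scale):
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=32)   # near-zero weights, as in real LLM layers
q, s = mxfp4_quantize_block(w)
err = np.abs(w - mxfp4_dequantize(q, s)).max()
print(f"shared scale: {s}, max abs error: {err:.5f}")
```

Because the scale is a plain power of two chosen per 32-weight block, small blocks of near-zero weights get a proportionally small scale, which is why MXFP4 tracks zero-centered weight distributions well.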
## Installation
You need to install `vllm` or `llmcompressor` to use this model efficiently.
```bash
pip install vllm
# or
pip install llmcompressor
```

## Quickstart

### Using vLLM (Recommended)

This model is optimized for vLLM, which supports the `compressed-tensors` format natively.
```python
from vllm import LLM, SamplingParams

model_id = "YOUR_USERNAME/Qwen3-30B-A3B-MXFP4A16"

llm = LLM(
    model=model_id,
    trust_remote_code=True,
)

sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=200)

prompts = [
    "Hello, my name is",
    "Explain quantum physics in simple terms:",
]

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
### Using Transformers & LLMCompressor

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.utils import dispatch_for_generation

model_id = "YOUR_USERNAME/Qwen3-30B-A3B-MXFP4A16"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Optimize the model layout for generation
dispatch_for_generation(model)

input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to(model.device)
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0]))
```
## Hardware Requirements

* **VRAM:** significantly reduced compared to the BF16 original.
  * Original (BF16, 30B): ~60 GB
  * Quantized (MXFP4): ~17 GB
* **Compatibility:** NVIDIA GPUs (Ampere, Ada Lovelace, Hopper, Blackwell).
*Created with llmcompressor*