---
base_model: Qwen/Qwen3-30B-A3B
library_name: transformers
tags:
- quantization
- mxfp4
- 4-bit
- compressed-tensors
- qwen
- text-generation
- llmcompressor
language:
- en
pipeline_tag: text-generation
license: other
---
# Qwen3-30B-A3B-MXFP4A16
## Model Description
This is a compressed version of **[Qwen/Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B)**.
The model was quantized with **weight-only quantization** to **4-bit floating point (FP4)** using the **MXFP4** microscaling scheme. This format is optimized for next-generation hardware (such as NVIDIA Blackwell) but also runs efficiently on current GPUs through software emulation.

Because `MXFP4A16` stores microscaling FP4 weights alongside FP16 activations, the model shrinks by roughly 70-75% while maintaining high accuracy: the FP4 value grid represents the near-zero weight distributions typical of LLMs more faithfully than standard INT4 quantization.
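The size claim can be sanity-checked with quick arithmetic. This sketch assumes roughly 30.5B total parameters and one 8-bit shared scale per 32-weight block, and ignores the layers kept in higher precision (e.g. `lm_head`), so the real checkpoint is slightly larger:

```python
# Back-of-the-envelope memory estimate for the ~70-75% size reduction.
params = 30.5e9                 # approximate total parameter count
bf16_bytes = params * 2         # 16-bit weights: 2 bytes per parameter
mxfp4_bytes = params * 0.5      # 4-bit weights: 0.5 bytes per parameter
mxfp4_bytes += params / 32      # one 8-bit (E8M0) scale per 32-weight block

reduction = 1 - mxfp4_bytes / bf16_bytes
print(f"BF16: {bf16_bytes / 1e9:.0f} GB, "
      f"MXFP4: {mxfp4_bytes / 1e9:.1f} GB, "
      f"saved {reduction:.0%}")
```

The result (~16 GB of weights, ~73% smaller) lines up with the VRAM figures in the Hardware Requirements section below.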
## Quantization Details
This model was created using the `llmcompressor` library with the following configuration:
* **Scheme:** `MXFP4A16` (4-bit Weights, 16-bit Activations)
* **Algorithm:** Weight-Only Quantization (Data-Free)
* **Target Modules:** Linear Layers
* **Ignored Modules:** `lm_head` (kept in full precision for stability)
* **Group Size:** 32 (Block-wise scaling)
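The block-wise scaling above can be illustrated with a small numeric sketch. This is not the kernel that llmcompressor or vLLM actually use; it just assumes the standard FP4 (E2M1) value grid and a shared power-of-two scale per 32-weight block, in the spirit of the OCP MX format:

```python
import numpy as np

# Representable FP4 (E2M1) magnitudes; the largest is 6.0.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quantize_block(block):
    """Quantize one 32-weight block: a shared power-of-two scale
    (E8M0-style) plus one FP4 (E2M1) value per weight."""
    amax = np.abs(block).max()
    # Smallest power-of-two scale so the largest weight fits in FP4's range.
    scale = 2.0 ** np.ceil(np.log2(amax / FP4_GRID[-1])) if amax > 0 else 1.0
    scaled = block / scale
    # Round each magnitude to the nearest FP4 grid point, keeping the sign.
    idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return q, scale

def mxfp4_dequantize(q, scale):
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=32)   # near-zero weights, as in real LLM layers
q, s = mxfp4_quantize_block(w)
err = np.abs(w - mxfp4_dequantize(q, s)).max()
print(f"shared scale: {s}, max abs error: {err:.5f}")
```

Because the scale is a plain power of two chosen per 32-weight block, small blocks of near-zero weights get a proportionally small scale, which is why MXFP4 tracks zero-centered weight distributions well.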
## Installation
You need to install `vllm` or `llmcompressor` to use this model efficiently.
```bash
pip install vllm
# or
pip install llmcompressor
```

## Quickstart

### Using vLLM (Recommended)

This model is optimized for vLLM, which supports the `compressed-tensors` format natively.
```python
from vllm import LLM, SamplingParams

model_id = "YOUR_USERNAME/Qwen3-30B-A3B-MXFP4A16"

llm = LLM(
    model=model_id,
    trust_remote_code=True,
)

sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=200)

prompts = [
    "Hello, my name is",
    "Explain quantum physics in simple terms:",
]

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
### Using Transformers & LLMCompressor

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.utils import dispatch_for_generation

model_id = "YOUR_USERNAME/Qwen3-30B-A3B-MXFP4A16"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Optimize the model layout for generation
dispatch_for_generation(model)

input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to(model.device)
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0]))
```
## Hardware Requirements

* **VRAM:** significantly reduced compared to the BF16 original.
  * Original (BF16, 30B): ~60 GB
  * Quantized (MXFP4): ~17 GB
* **Compatibility:** NVIDIA GPUs (Ampere, Ada Lovelace, Hopper, Blackwell).
*Created with llmcompressor*