# 🚀 Qwen3.5-35B-A3B-MXFP4

Qwen3.5-35B-A3B quantized to MXFP4 (Microscaling FP4) precision, specifically optimized for AMD Radeon RX 9700 (RDNA 4) GPUs using vLLM.
## 📋 Model Overview
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3.5-35B-A3B-FP8 |
| Architecture | Mixture of Experts (MoE) |
| Parameters | 35B total / ~3.5B active |
| Quantization | MXFP4 (Microscaling FP4) |
| Inference Engine | vLLM |
| Target Hardware | AMD Radeon RX 9700 / RX 9700 XT (RDNA 4) |
| License | Apache 2.0 |
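The "35B total / ~3.5B active" split comes from the Mixture-of-Experts design: a router activates only a few experts per token. The sketch below illustrates generic top-k gating in plain Python; the expert count, k, and router scores are made-up numbers for illustration, not values from this model's `config.json`.

```python
import math

def top_k_gate(logits, k):
    """Pick the k highest-scoring experts and renormalize their
    softmax weights so the selected weights sum to 1."""
    # softmax over all expert logits (numerically stabilized)
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # indices of the k largest probabilities
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}

# hypothetical router scores for 8 experts, routing each token to 2 of them
weights = top_k_gate([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.2], k=2)
print(weights)  # two experts selected, weights summing to 1
```

Only the chosen experts' FFN weights participate in the forward pass for that token, which is why compute scales with the ~3.5B active parameters rather than the full 35B.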
## 📦 Size & Compression
| Version | Precision | Size on Disk | vs FP16 |
|---|---|---|---|
| Qwen3.5-35B-A3B (FP16) | FP16 | ~70 GB | baseline |
| Qwen3.5-35B-A3B (FP8) | FP8 | 35 GB | −50% |
| Qwen3.5-35B-A3B-MXFP4 (this) | MXFP4 | 22 GB | −69% |
Compared with the FP8 source model, this release is 37% smaller, enabling deployment on a single 16 GB consumer GPU.
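The percentages in the table follow directly from the approximate on-disk sizes; a quick sanity check:

```python
fp16, fp8, mxfp4 = 70, 35, 22  # approximate sizes in GB, from the table above

vs_fp16 = 1 - mxfp4 / fp16  # fraction saved vs the FP16 baseline
vs_fp8 = 1 - mxfp4 / fp8    # fraction saved vs the FP8 source model
print(f"{vs_fp16:.0%} smaller than FP16, {vs_fp8:.0%} smaller than FP8")
```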
## ⚡ Quick Start

### Installation

```bash
pip install vllm --upgrade
```
### Basic Inference

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="djdeniro/Qwen3.5-35B-A3B-MXFP4",
    quantization="mxfp4",
    dtype="auto",
    max_model_len=8192,
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)

outputs = llm.generate(
    ["Explain the concept of mixture of experts in neural networks."],
    sampling_params,
)
print(outputs[0].outputs[0].text)
```
### Chat / Instruction Format

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

llm = LLM(
    model="djdeniro/Qwen3.5-35B-A3B-MXFP4",
    quantization="mxfp4",
    dtype="auto",
    max_model_len=8192,
)

tokenizer = AutoTokenizer.from_pretrained("djdeniro/Qwen3.5-35B-A3B-MXFP4")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is MXFP4 quantization and why is it useful?"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

sampling_params = SamplingParams(temperature=0.7, max_tokens=1024)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```
### OpenAI-Compatible Server

Start the server:

```bash
vllm serve djdeniro/Qwen3.5-35B-A3B-MXFP4 \
  --quantization mxfp4 \
  --max-model-len 8192 \
  --port 8000
```

Then query it with the OpenAI Python client:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
response = client.chat.completions.create(
    model="djdeniro/Qwen3.5-35B-A3B-MXFP4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello! What can you do?"},
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```
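Because the server speaks the OpenAI chat-completions protocol, any HTTP client works too. For example with curl (assuming the server above is running on `localhost:8000`):

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "djdeniro/Qwen3.5-35B-A3B-MXFP4",
    "messages": [
      {"role": "user", "content": "Hello! What can you do?"}
    ],
    "max_tokens": 128
  }'
```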
## 🔧 Quantization Details
| Parameter | Value |
|---|---|
| Source model | Qwen/Qwen3.5-35B-A3B-FP8 |
| Source precision | FP8 (E4M3) |
| Target precision | MXFP4 |
| Block size | 32 elements |
| Scaling granularity | Per-block shared microscaling exponent |
| Quantized layers | Linear (attention projections + FFN) |
| Non-quantized | Embeddings, LayerNorm, LM head |
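To make the "per-block shared microscaling exponent" concrete, here is a simplified emulation of MXFP4 block quantization in plain Python: each block of 32 values shares one power-of-two scale, and each scaled value snaps to the nearest point on the FP4 (E2M1) grid. This is an illustrative sketch of the numerics only, not the kernel vLLM actually runs.

```python
import math

# Non-negative magnitudes representable in FP4 E2M1 (sign stored separately)
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Quantize one block (up to 32 floats) MXFP4-style:
    one shared power-of-two scale plus a 4-bit value per element."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # shared exponent: align the block maximum with the FP4 range,
    # whose largest magnitude is 6 = 1.5 * 2**2
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2)
    quantized = []
    for x in block:
        mag = min(abs(x) / scale, 6.0)  # clamp to the FP4 maximum
        q = min(FP4_GRID, key=lambda g: abs(g - mag))
        quantized.append(math.copysign(q, x))
    return scale, quantized

def dequantize(scale, quantized):
    return [scale * q for q in quantized]

block = [0.9, -0.31, 0.05, 2.4] + [0.0] * 28  # one 32-element block
scale, q = quantize_block(block)
approx = dequantize(scale, q)
```

Storing one 8-bit scale per 32 elements plus 4 bits per element is what brings the weights down to roughly a quarter of their FP16 footprint while keeping per-block dynamic range.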
## 📊 Quality vs Compression
| Model | Size | Quality (est.) | VRAM Required |
|---|---|---|---|
| Qwen3.5-35B-A3B FP16 | ~70 GB | 100% (reference) | 2× 40 GB |
| Qwen3.5-35B-A3B FP8 | 35 GB | ~99.5% | 1× 40 GB |
| Qwen3.5-35B-A3B MXFP4 (this) | 22 GB | ~97–98% | 1× 16 GB |
## 📁 Repository Structure

```text
Qwen3.5-35B-A3B-MXFP4/
├── README.md
├── config.json
├── generation_config.json
├── chat_template.jinja
├── tokenizer.json
├── tokenizer_config.json
├── vocab.json
├── merges.txt
├── preprocessor_config.json
├── model.safetensors.index.json
└── model.safetensors-0000{1..8}-of-00008.safetensors  # 22 GB total
```
## ⚠️ Limitations
- MXFP4 inference requires vLLM with MXFP4 support and RDNA 4 or compatible hardware
- Slight quality reduction vs FP8/FP16 due to aggressive quantization
- Not recommended for tasks requiring extreme numerical precision
## 🙏 Credits
- Base model: Qwen Team @ Alibaba Cloud
- Quantization: @djdeniro
- Inference engine: vLLM Project
- MXFP4 specification: OCP Microscaling Formats (MX)
## 📄 License
This model is released under the Apache 2.0 license, consistent with the original Qwen3.5 model.
Made with ❤️ for the open-source community
Optimized for AMD RDNA 4 · Powered by vLLM