# 🚀 Qwen3.5-35B-A3B-MXFP4

Qwen3.5-35B-A3B quantized to MXFP4 (Microscaling FP4) precision, specifically optimized for AMD Radeon RX 9700 (RDNA 4) GPUs using vLLM.
## 📋 Model Overview
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3.5-35B-A3B-FP8 |
| Architecture | Mixture of Experts (MoE) |
| Parameters | 35B total / ~3.5B active |
| Quantization | MXFP4 (Microscaling FP4) |
| Inference Engine | vLLM |
| Target Hardware | AMD Radeon RX 9700 / RX 9700 XT (RDNA 4) |
| License | Apache 2.0 |
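The "35B total / ~3.5B active" split comes from the Mixture-of-Experts design: a router activates only a few experts per token. The sketch below illustrates generic top-k gating in plain Python; the expert count, k, and router scores are made-up numbers for illustration, not values from this model's `config.json`.

```python
import math

def top_k_gate(logits, k):
    """Pick the k highest-scoring experts and renormalize their
    softmax weights so the selected weights sum to 1."""
    # softmax over all expert logits (numerically stabilized)
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # indices of the k largest probabilities
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}

# hypothetical router scores for 8 experts, routing each token to 2 of them
weights = top_k_gate([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.2], k=2)
print(weights)  # two experts selected, weights summing to 1
```

Only the chosen experts' FFN weights participate in the forward pass for that token, which is why compute scales with the ~3.5B active parameters rather than the full 35B.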
## 📦 Size & Compression
| Version | Precision | Size on Disk | vs FP16 |
|---|---|---|---|
| Qwen3.5-35B-A3B (FP16) | FP16 | ~70 GB | baseline |
| Qwen3.5-35B-A3B (FP8) | FP8 | 35 GB | −50% |
| Qwen3.5-35B-A3B-MXFP4 (this) | MXFP4 | 22 GB | −69% |
Compared with the FP8 source model, this release is 37% smaller, enabling deployment on a single 16 GB consumer GPU.
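The percentages in the table follow directly from the approximate on-disk sizes; a quick sanity check:

```python
fp16, fp8, mxfp4 = 70, 35, 22  # approximate sizes in GB, from the table above

vs_fp16 = 1 - mxfp4 / fp16  # fraction saved vs the FP16 baseline
vs_fp8 = 1 - mxfp4 / fp8    # fraction saved vs the FP8 source model
print(f"{vs_fp16:.0%} smaller than FP16, {vs_fp8:.0%} smaller than FP8")
```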
## ⚡ Quick Start

### Installation

```bash
pip install vllm --upgrade
```
### Basic Inference

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="djdeniro/Qwen3.5-35B-A3B-MXFP4",
    quantization="mxfp4",
    dtype="auto",
    max_model_len=8192,
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)

outputs = llm.generate(
    ["Explain the concept of mixture of experts in neural networks."],
    sampling_params,
)
print(outputs[0].outputs[0].text)
```
### Chat / Instruction Format

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

llm = LLM(
    model="djdeniro/Qwen3.5-35B-A3B-MXFP4",
    quantization="mxfp4",
    dtype="auto",
    max_model_len=8192,
)

tokenizer = AutoTokenizer.from_pretrained("djdeniro/Qwen3.5-35B-A3B-MXFP4")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is MXFP4 quantization and why is it useful?"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

sampling_params = SamplingParams(temperature=0.7, max_tokens=1024)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```
### OpenAI-Compatible Server

Start the server:

```bash
vllm serve djdeniro/Qwen3.5-35B-A3B-MXFP4 \
  --quantization mxfp4 \
  --max-model-len 8192 \
  --port 8000
```

Then query it with the OpenAI Python client:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
response = client.chat.completions.create(
    model="djdeniro/Qwen3.5-35B-A3B-MXFP4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello! What can you do?"},
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```
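Because the server speaks the OpenAI chat-completions protocol, any HTTP client works too. For example with curl (assuming the server above is running on `localhost:8000`):

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "djdeniro/Qwen3.5-35B-A3B-MXFP4",
    "messages": [
      {"role": "user", "content": "Hello! What can you do?"}
    ],
    "max_tokens": 128
  }'
```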
## 🔧 Quantization Details
| Parameter | Value |
|---|---|
| Source model | Qwen/Qwen3.5-35B-A3B-FP8 |
| Source precision | FP8 (E4M3) |
| Target precision | MXFP4 |
| Block size | 32 elements |
| Scaling granularity | Per-block shared microscaling exponent |
| Quantized layers | Linear (attention projections + FFN) |
| Non-quantized | Embeddings, LayerNorm, LM head |
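To make the "per-block shared microscaling exponent" concrete, here is a simplified emulation of MXFP4 block quantization in plain Python: each block of 32 values shares one power-of-two scale, and each scaled value snaps to the nearest point on the FP4 (E2M1) grid. This is an illustrative sketch of the numerics only, not the kernel vLLM actually runs.

```python
import math

# Non-negative magnitudes representable in FP4 E2M1 (sign stored separately)
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Quantize one block (up to 32 floats) MXFP4-style:
    one shared power-of-two scale plus a 4-bit value per element."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # shared exponent: align the block maximum with the FP4 range,
    # whose largest magnitude is 6 = 1.5 * 2**2
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2)
    quantized = []
    for x in block:
        mag = min(abs(x) / scale, 6.0)  # clamp to the FP4 maximum
        q = min(FP4_GRID, key=lambda g: abs(g - mag))
        quantized.append(math.copysign(q, x))
    return scale, quantized

def dequantize(scale, quantized):
    return [scale * q for q in quantized]

block = [0.9, -0.31, 0.05, 2.4] + [0.0] * 28  # one 32-element block
scale, q = quantize_block(block)
approx = dequantize(scale, q)
```

Storing one 8-bit scale per 32 elements plus 4 bits per element is what brings the weights down to roughly a quarter of their FP16 footprint while keeping per-block dynamic range.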
## 📊 Quality vs Compression
| Model | Size | Quality (est.) | VRAM Required |
|---|---|---|---|
| Qwen3.5-35B-A3B FP16 | ~70 GB | 100% (reference) | 2× 40 GB |
| Qwen3.5-35B-A3B FP8 | 35 GB | ~99.5% | 1× 40 GB |
| Qwen3.5-35B-A3B MXFP4 (this) | 22 GB | ~97–98% | 1× 16 GB |
## 📁 Repository Structure

```text
Qwen3.5-35B-A3B-MXFP4/
├── README.md
├── config.json
├── generation_config.json
├── chat_template.jinja
├── tokenizer.json
├── tokenizer_config.json
├── vocab.json
├── merges.txt
├── preprocessor_config.json
├── model.safetensors.index.json
└── model.safetensors-0000{1..8}-of-00008.safetensors  # 22 GB total
```
## ⚠️ Limitations
- MXFP4 inference requires vLLM with MXFP4 support and RDNA 4 or compatible hardware
- Slight quality reduction vs FP8/FP16 due to aggressive quantization
- Not recommended for tasks requiring extreme numerical precision
## 🙏 Credits
- Base model: Qwen Team @ Alibaba Cloud
- Quantization: @djdeniro
- Inference engine: vLLM Project
- MXFP4 specification: OCP Microscaling Formats (MX)
## 📄 License
This model is released under the Apache 2.0 license, consistent with the original Qwen3.5 model.
Made with ❤️ for the open-source community
Optimized for AMD RDNA 4 · Powered by vLLM