# 🚀 Qwen3.5-35B-A3B-MXFP4

Qwen3.5-35B-A3B quantized to MXFP4 (Microscaling FP4) precision, optimized specifically for AMD Radeon RX 9700 (RDNA 4) GPUs running vLLM.


## 📋 Model Overview

| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3.5-35B-A3B-FP8 |
| Architecture | Mixture of Experts (MoE) |
| Parameters | 35B total / ~3.5B active |
| Quantization | MXFP4 (Microscaling FP4) |
| Inference Engine | vLLM |
| Target Hardware | AMD Radeon RX 9700 / RX 9700 XT (RDNA 4) |
| License | Apache 2.0 |

## 📦 Size & Compression

| Version | Precision | Size on Disk | vs FP16 |
|---|---|---|---|
| Qwen3.5-35B-A3B (FP16) | FP16 | ~70 GB | baseline |
| Qwen3.5-35B-A3B (FP8) | FP8 | 35 GB | −50% |
| Qwen3.5-35B-A3B-MXFP4 (this) | MXFP4 | 22 GB | −69% |

Compared with the FP8 source model, this build is 37% smaller, small enough to deploy on a single 16 GB consumer GPU.
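For intuition, the bit arithmetic behind these ratios can be sketched as below. The 4.25 bits-per-weight figure assumes the standard MXFP4 layout (32 four-bit values sharing one 8-bit scale); the gap between the ~73% theoretical reduction and the observed −69% on disk comes from the layers left unquantized.

```python
# Illustrative MXFP4 storage arithmetic (assumed layout: blocks of 32
# FP4 values plus one shared 8-bit scale exponent per block).
BLOCK_SIZE = 32
bits_per_weight = (BLOCK_SIZE * 4 + 8) / BLOCK_SIZE   # 4.25 bits
reduction_vs_fp16 = 1 - bits_per_weight / 16          # 0.734375

print(f"{bits_per_weight} bits/weight, {reduction_vs_fp16:.1%} below FP16")
```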


## ⚡ Quick Start

### Installation

```bash
pip install vllm --upgrade
```

### Basic Inference

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="djdeniro/Qwen3.5-35B-A3B-MXFP4",
    quantization="mxfp4",
    dtype="auto",
    max_model_len=8192,
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)

outputs = llm.generate(
    ["Explain the concept of mixture of experts in neural networks."],
    sampling_params,
)

print(outputs[0].outputs[0].text)
```

### Chat / Instruction Format

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

llm = LLM(
    model="djdeniro/Qwen3.5-35B-A3B-MXFP4",
    quantization="mxfp4",
    dtype="auto",
    max_model_len=8192,
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is MXFP4 quantization and why is it useful?"},
]

tokenizer = AutoTokenizer.from_pretrained("djdeniro/Qwen3.5-35B-A3B-MXFP4")
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

sampling_params = SamplingParams(temperature=0.7, max_tokens=1024)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```

### OpenAI-Compatible Server

Start the server:

```bash
vllm serve djdeniro/Qwen3.5-35B-A3B-MXFP4 \
    --quantization mxfp4 \
    --max-model-len 8192 \
    --port 8000
```

Then query it from any OpenAI-compatible client:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="djdeniro/Qwen3.5-35B-A3B-MXFP4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello! What can you do?"},
    ],
    max_tokens=512,
)

print(response.choices[0].message.content)
```

## 🔧 Quantization Details

| Parameter | Value |
|---|---|
| Source model | Qwen/Qwen3.5-35B-A3B-FP8 |
| Source precision | FP8 (E4M3) |
| Target precision | MXFP4 |
| Block size | 32 elements |
| Scaling granularity | Per-block shared microscaling exponent |
| Quantized layers | Linear (attention projections + FFN) |
| Non-quantized | Embeddings, LayerNorm, LM head |
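To make the table concrete, the block scheme can be modelled in a few lines of plain Python. This is an illustrative sketch of MXFP4 semantics (E2M1 magnitudes plus one shared power-of-two scale per 32-element block), not the fused kernel vLLM actually executes, and the function names are ours:

```python
import math

# Magnitudes representable by FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
FP4_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Map one block to MXFP4: a shared power-of-two exponent plus
    one signed FP4 value per element."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # Smallest power-of-two scale that fits the largest magnitude into [0, 6].
    e = math.ceil(math.log2(amax / 6.0))
    scale = 2.0 ** e
    codes = []
    for x in block:
        mag = min(abs(x) / scale, 6.0)
        nearest = min(FP4_LEVELS, key=lambda v: abs(v - mag))
        codes.append(math.copysign(nearest, x))
    return e, codes

def dequantize_block(e, codes):
    scale = 2.0 ** e
    return [c * scale for c in codes]
```

Per-block scaling keeps the quantization error proportional to each block's own magnitude, which is why this format tolerates the wide dynamic range of MoE expert weights better than a single tensor-wide FP4 scale would.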

## 📊 Quality vs Compression

| Model | Size | Quality (est.) | VRAM Required |
|---|---|---|---|
| Qwen3.5-35B-A3B FP16 | ~70 GB | 100% (reference) | 2× 40 GB |
| Qwen3.5-35B-A3B FP8 | 35 GB | ~99.5% | 1× 40 GB |
| Qwen3.5-35B-A3B MXFP4 (this) | 22 GB | ~97–98% | 1× 16 GB |

## 📁 Repository Structure

```text
Qwen3.5-35B-A3B-MXFP4/
├── README.md
├── config.json
├── generation_config.json
├── chat_template.jinja
├── tokenizer.json
├── tokenizer_config.json
├── vocab.json
├── merges.txt
├── preprocessor_config.json
├── model.safetensors.index.json
└── model.safetensors-0000{1..8}-of-00008.safetensors  # 22 GB total
```

## ⚠️ Limitations

  • MXFP4 inference requires vLLM with MXFP4 support and RDNA 4 or compatible hardware
  • Slight quality reduction vs FP8/FP16 due to aggressive quantization
  • Not recommended for tasks requiring extreme numerical precision

## 📄 License

This model is released under the Apache 2.0 license, consistent with the original Qwen3.5 model.


Made with ❤️ for the open-source community
Optimized for AMD RDNA 4 · Powered by vLLM
