# Qwen3.5-35B-A3B MXFP4

## Overview
MXFP4 quantized version of Qwen3.5-35B-A3B — a 35B total / 3B active parameter MoE model with hybrid Gated DeltaNet + Gated Attention architecture.
Key Improvement: MoE expert MLP weights quantized to MXFP4 (4-bit), reducing size from ~72GB to ~22GB while improving perplexity (6.22 → 6.14).
## Model Details
| Property | Value |
|---|---|
| Architecture | Qwen3.5 MoE — hybrid Gated DeltaNet + Gated Attention (40 layers) |
| Parameters | 35B total, 3B active |
| Experts | 256 total, 8 routed + 1 shared per token |
| Context Length | 262,144 tokens |
| Vocabulary | 248,320 tokens |
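The overview's ~72 GB to ~22 GB figure can be sanity-checked with back-of-envelope arithmetic. The split of parameters between routed-expert MLPs and everything else is an assumption here (only the 35B total is stated on this card):

```python
# Back-of-envelope size estimate. Assumption: roughly 32B of the 35B
# parameters sit in routed-expert MLPs (quantized); the rest stays BF16.
total_params = 35e9
expert_params = 32e9                 # assumed routed-expert share
other_params = total_params - expert_params

bf16_bytes = 2.0                     # bytes per BF16 weight
# MXFP4: 4 bits per weight plus one e8m0 scale byte per 32-element block.
mxfp4_bytes = 0.5 + 1.0 / 32

bf16_total_gb = total_params * bf16_bytes / 1e9
mxfp4_total_gb = (expert_params * mxfp4_bytes + other_params * bf16_bytes) / 1e9

print(f"BF16:  ~{bf16_total_gb:.0f} GB")
print(f"MXFP4: ~{mxfp4_total_gb:.0f} GB")
```

The result lands close to the card's ~72 GB / ~22 GB, with the small gap explained by non-parameter tensors and the assumed expert share.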
## What's Quantized
| Component | Precision | Notes |
|---|---|---|
| MoE expert MLPs | MXFP4 (uint8 packed + e8m0 scales) | Routed expert gate/up/down projections |
| Shared experts | BF16 | Excluded — *shared_expert* |
| MoE router gates | BF16 | Excluded — *.mlp.gate. |
| Self-attention (Q/K/V/O) | BF16 | Excluded — *self_attn* |
| Linear attention (Gated DeltaNet) | BF16 | Excluded |
| Visual encoder | BF16 | Excluded — *visual* |
| MTP layers | BF16 | Excluded — *mtp* |
| Embeddings, LM head | BF16 | Excluded — *lm_head*, *embed_tokens* |
| LayerNorm weights | BF16 | 1D, not quantizable |
## Quantization Method

- Format: MXFP4, a 4-bit float (E2M1) with a shared e8m0 block exponent per 32 elements
- Scale selection: MSE-optimal over 3 candidate exponents per block
- Output format: `compressed-tensors` with `mxfp4-pack-quantized` (compatible with vLLM)
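The method above can be sketched in NumPy. This is an illustrative reimplementation, not the qstream code: the exact candidate-exponent set is an assumption, and real MXFP4 kernels pack two 4-bit codes per uint8 rather than keeping dequantized floats.

```python
import numpy as np

# Representable E2M1 magnitudes (sign handled separately).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block_mxfp4(block: np.ndarray) -> tuple[int, np.ndarray]:
    """Quantize one 32-element block with a shared power-of-two (e8m0) scale,
    trying 3 candidate exponents and keeping the one with lowest MSE.
    The candidate set (e_base, e_base+1, e_base+2) is an assumption."""
    amax = np.abs(block).max()
    if amax == 0:
        return 0, np.zeros_like(block)
    # Base exponent maps amax onto the top of the FP4 range (max magnitude 6).
    e_base = int(np.ceil(np.log2(amax / 6.0)))
    best = None
    for e in (e_base, e_base + 1, e_base + 2):
        scale = 2.0 ** e
        scaled = block / scale
        # Round each element to the nearest representable FP4 magnitude.
        idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]).argmin(axis=1)
        deq = np.sign(scaled) * FP4_GRID[idx] * scale
        mse = float(np.mean((block - deq) ** 2))
        if best is None or mse < best[0]:
            best = (mse, e, deq)
    return best[1], best[2]
```

A block whose values already sit on the E2M1 grid round-trips exactly; for generic data the MSE search picks whichever of the three scales loses the least.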
## Quality

Perplexity measured on the wikitext-2-raw-v1 test set:
| Model | Mean NLL | Perplexity |
|---|---|---|
| Qwen3.5-35B-A3B (BF16) | 1.827 | 6.216 |
| Qwen3.5-35B-A3B MXFP4 | 1.815 | 6.143 |
MXFP4 quantization slightly improves perplexity over the BF16 baseline, likely due to regularization effects of quantization on the large number of MoE experts.
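As a quick consistency check, perplexity is just the exponential of the mean per-token negative log-likelihood; the reported NLLs round-trip to the table's perplexities within rounding error:

```python
import math

# Perplexity = exp(mean negative log-likelihood per token).
for name, mean_nll in [("BF16", 1.827), ("MXFP4", 1.815)]:
    print(f"{name}: ppl = {math.exp(mean_nll):.3f}")
```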
## Usage

### vLLM

```shell
pip install vllm

vllm serve olka-fi/Qwen3.5-35B-A3B-MXFP4 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 4096 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
```
Note: requires a vLLM build with Qwen3.5 architecture support (not yet available in stock vLLM 0.16.0).
### Python

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="Qwen3.5-35B-A3B-MXFP4",
    messages=[{"role": "user", "content": "Who are you?"}],
)
print(response.choices[0].message.content)
```
## Quantization Details

- Quantized with qstream, a custom MXFP4 quantization tool
- MSE-optimal 3-candidate scale selection per block (32 elements)
- Per-block shared exponent in e8m0 format
- Exclude patterns: `*shared_expert*`, `*.mlp.gate.`, `*lm_head*`, `*embed_tokens*`, `*visual*`, `*mtp*`, `*self_attn*`
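Glob-style exclude patterns like these can be checked against tensor names with Python's `fnmatch`; this is a sketch, assuming qstream uses glob semantics, and the tensor names below are hypothetical illustrations:

```python
from fnmatch import fnmatch

# Exclude patterns from this card; tensors matching any of them stay in BF16.
EXCLUDE = ["*shared_expert*", "*.mlp.gate.", "*lm_head*", "*embed_tokens*",
           "*visual*", "*mtp*", "*self_attn*"]

def is_quantized(tensor_name: str) -> bool:
    """True if no exclude pattern matches, i.e. the tensor gets MXFP4."""
    return not any(fnmatch(tensor_name, p) for p in EXCLUDE)

# Hypothetical tensor names for illustration:
print(is_quantized("model.layers.0.mlp.experts.7.gate_proj.weight"))   # True
print(is_quantized("model.layers.0.mlp.shared_expert.up_proj.weight")) # False
print(is_quantized("model.layers.0.self_attn.q_proj.weight"))          # False
```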
## Acknowledgments

Based on Qwen3.5-35B-A3B by Tongyi Lab (Alibaba).

License: Apache 2.0