Qwen3.5-35B-A3B MXFP4

Overview

MXFP4 quantized version of Qwen3.5-35B-A3B — a 35B total / 3B active parameter MoE model with hybrid Gated DeltaNet + Gated Attention architecture.

Key Improvement: MoE expert MLP weights quantized to MXFP4 (4-bit), reducing size from ~72GB to ~22GB while improving perplexity (6.22 → 6.14).


Model Details

| Property | Value |
|---|---|
| Architecture | Qwen3.5 MoE — hybrid Gated DeltaNet + Gated Attention (40 layers) |
| Parameters | 35B total, 3B active |
| Experts | 256 total, 8 routed + 1 shared per token |
| Context Length | 262,144 tokens |
| Vocabulary | 248,320 tokens |

What's Quantized

| Component | Precision | Notes |
|---|---|---|
| MoE expert MLPs | MXFP4 (uint8 packed + e8m0 scales) | Routed expert gate/up/down projections |
| Shared experts | BF16 | Excluded — `*shared_expert*` |
| MoE router gates | BF16 | Excluded — `*.mlp.gate.` |
| Self-attention (Q/K/V/O) | BF16 | Excluded — `*self_attn*` |
| Linear attention (Gated DeltaNet) | BF16 | Excluded |
| Visual encoder | BF16 | Excluded — `*visual*` |
| MTP layers | BF16 | Excluded — `*mtp*` |
| Embeddings, LM head | BF16 | Excluded — `*lm_head*`, `*embed_tokens*` |
| LayerNorm weights | BF16 | 1D, not quantizable |
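
The "uint8 packed" storage above means two 4-bit codes share each byte. A minimal sketch of such packing (low-nibble-first order is an assumption here, not necessarily qstream's on-disk layout):

```python
import numpy as np

def pack_fp4(codes: np.ndarray) -> np.ndarray:
    """Pack pairs of 4-bit codes into uint8, low nibble first."""
    pairs = codes.astype(np.uint8).reshape(-1, 2)
    return pairs[:, 0] | (pairs[:, 1] << np.uint8(4))

def unpack_fp4(packed: np.ndarray) -> np.ndarray:
    """Recover the original code sequence from packed bytes."""
    return np.stack([packed & 0x0F, packed >> 4], axis=1).reshape(-1)

codes = np.arange(16, dtype=np.uint8)   # all 16 possible 4-bit codes
packed = pack_fp4(codes)                # 8 bytes instead of 16
assert np.array_equal(unpack_fp4(packed), codes)
```

This halving of bytes per weight (plus one e8m0 scale byte per 32-element block) is where the ~72GB → ~22GB reduction comes from.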

Quantization Method

  • Format: MXFP4 — 4-bit float (E2M1) with shared e8m0 block exponent per 32 elements
  • Scale Selection: MSE-optimal over 3 candidate exponents per block
  • Output Format: compressed-tensors with mxfp4-pack-quantized (compatible with vLLM)
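
The steps above can be illustrated with a minimal NumPy sketch — this is not the qstream implementation, just the scheme it describes: round each element of a 32-element block to the E2M1 grid under a shared power-of-two scale, trying 3 candidate exponents and keeping the one with lowest MSE.

```python
import numpy as np

# E2M1 (4-bit float) representable magnitudes; the sign bit gives the negatives.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block_mxfp4(block):
    """Fake-quantize one 32-element block: choose an e8m0 (power-of-two)
    shared scale by MSE over 3 candidate exponents, then round each element
    to the nearest scaled E2M1 value. Returns (dequantized block, exponent)."""
    amax = np.abs(block).max()
    if amax == 0.0:
        return np.zeros_like(block), 0
    # Base exponent maps the block max near E2M1's top magnitude (6.0).
    base_e = int(np.floor(np.log2(amax / 6.0)))
    best_err, best = np.inf, None
    for e in (base_e, base_e + 1, base_e + 2):   # 3 candidate exponents
        grid = np.concatenate([-E2M1_GRID[::-1], E2M1_GRID]) * 2.0 ** e
        idx = np.abs(block[:, None] - grid[None, :]).argmin(axis=1)
        deq = grid[idx]
        err = float(((deq - block) ** 2).sum())
        if err < best_err:
            best_err, best = err, (deq, e)
    return best

w = np.random.randn(32)
deq, e = quantize_block_mxfp4(w)
print("shared exponent:", e)
```

A real pipeline would store the 4-bit code indices plus the biased exponent byte rather than the dequantized values; the MSE search is what distinguishes this from simple absmax scaling.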

Quality

Perplexity measured on wikitext-2-raw-v1 test set:

| Model | Mean NLL | Perplexity |
|---|---|---|
| Qwen3.5-35B-A3B (BF16) | 1.827 | 6.216 |
| Qwen3.5-35B-A3B MXFP4 | 1.815 | 6.143 |

MXFP4 quantization slightly improves perplexity over the BF16 baseline, likely due to regularization effects of quantization on the large number of MoE experts.
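The perplexity column is simply the exponential of the mean NLL, which can be checked directly:

```python
import math

# Perplexity = exp(mean NLL); reproduces the table's values after rounding.
for name, nll in [("BF16", 1.827), ("MXFP4", 1.815)]:
    print(f"{name}: perplexity = {math.exp(nll):.2f}")
```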


Usage

vLLM

```bash
pip install vllm
```

```bash
vllm serve olka-fi/Qwen3.5-35B-A3B-MXFP4 \
    --gpu-memory-utilization 0.85 \
    --max-model-len 4096 \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder
```

Note: Requires vLLM with Qwen3.5 architecture support (not yet in stock vLLM 0.16.0).

Python

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="Qwen3.5-35B-A3B-MXFP4",
    messages=[{"role": "user", "content": "Who are you?"}],
)
print(response.choices[0].message.content)
```

Quantization Details

  • Quantized with qstream — custom MXFP4 quantization tool
  • MSE-optimal 3-candidate scale selection per block (32 elements)
  • Per-block shared exponent in e8m0 format
  • Exclude patterns: `*shared_expert*`, `*.mlp.gate.`, `*lm_head*`, `*embed_tokens*`, `*visual*`, `*mtp*`, `*self_attn*`
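
A tensor is quantized only if its name matches none of the exclude patterns. A minimal sketch of that filter — glob-style `fnmatch` semantics and the example tensor names are assumptions for illustration, not qstream's exact matching rules:

```python
from fnmatch import fnmatch

# Exclude patterns as listed above.
EXCLUDE = ["*shared_expert*", "*.mlp.gate.", "*lm_head*", "*embed_tokens*",
           "*visual*", "*mtp*", "*self_attn*"]

def is_quantized(tensor_name: str) -> bool:
    """True if the tensor would be packed to MXFP4 (2D weights only)."""
    return not any(fnmatch(tensor_name, pat) for pat in EXCLUDE)

# Hypothetical Qwen-style tensor names:
print(is_quantized("model.layers.0.mlp.experts.7.down_proj.weight"))  # routed expert
print(is_quantized("model.layers.0.self_attn.q_proj.weight"))         # attention, kept BF16
```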

Acknowledgments

Based on Qwen3.5-35B-A3B by Tongyi Lab (Alibaba).

License: Apache 2.0
