# Qwen3.5-35B-A3B MXFP4

## Overview
MXFP4 quantized version of Qwen3.5-35B-A3B — a 35B total / 3B active parameter MoE model with hybrid Gated DeltaNet + Gated Attention architecture.
Key Improvement: MoE expert MLP weights quantized to MXFP4 (4-bit), reducing size from ~72GB to ~22GB while improving perplexity (6.22 → 6.14).
## Model Details
| Property | Value |
|---|---|
| Architecture | Qwen3.5 MoE — hybrid Gated DeltaNet + Gated Attention (40 layers) |
| Parameters | 35B total, 3B active |
| Experts | 256 total, 8 routed + 1 shared per token |
| Context Length | 262,144 tokens |
| Vocabulary | 248,320 tokens |
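The overview's ~72 GB to ~22 GB figure can be sanity-checked with back-of-envelope arithmetic. The split of parameters between routed-expert MLPs and everything else is an assumption here (only the 35B total is stated on this card):

```python
# Back-of-envelope size estimate. Assumption: roughly 32B of the 35B
# parameters sit in routed-expert MLPs (quantized); the rest stays BF16.
total_params = 35e9
expert_params = 32e9                 # assumed routed-expert share
other_params = total_params - expert_params

bf16_bytes = 2.0                     # bytes per BF16 weight
# MXFP4: 4 bits per weight plus one e8m0 scale byte per 32-element block.
mxfp4_bytes = 0.5 + 1.0 / 32

bf16_total_gb = total_params * bf16_bytes / 1e9
mxfp4_total_gb = (expert_params * mxfp4_bytes + other_params * bf16_bytes) / 1e9

print(f"BF16:  ~{bf16_total_gb:.0f} GB")
print(f"MXFP4: ~{mxfp4_total_gb:.0f} GB")
```

The result lands close to the card's ~72 GB / ~22 GB, with the small gap explained by non-parameter tensors and the assumed expert share.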
## What's Quantized
| Component | Precision | Notes |
|---|---|---|
| MoE expert MLPs | MXFP4 (uint8 packed + e8m0 scales) | Routed expert gate/up/down projections |
| Shared experts | BF16 | Excluded — *shared_expert* |
| MoE router gates | BF16 | Excluded — *.mlp.gate. |
| Self-attention (Q/K/V/O) | BF16 | Excluded — *self_attn* |
| Linear attention (Gated DeltaNet) | BF16 | Excluded |
| Visual encoder | BF16 | Excluded — *visual* |
| MTP layers | BF16 | Excluded — *mtp* |
| Embeddings, LM head | BF16 | Excluded — *lm_head*, *embed_tokens* |
| LayerNorm weights | BF16 | 1D, not quantizable |
## Quantization Method

- Format: MXFP4, a 4-bit float (E2M1) with a shared e8m0 block exponent per 32 elements
- Scale selection: MSE-optimal over 3 candidate exponents per block
- Output format: `compressed-tensors` with `mxfp4-pack-quantized` (compatible with vLLM)
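The method above can be sketched in NumPy. This is an illustrative reimplementation, not the qstream code: the exact candidate-exponent set is an assumption, and real MXFP4 kernels pack two 4-bit codes per uint8 rather than keeping dequantized floats.

```python
import numpy as np

# Representable E2M1 magnitudes (sign handled separately).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block_mxfp4(block: np.ndarray) -> tuple[int, np.ndarray]:
    """Quantize one 32-element block with a shared power-of-two (e8m0) scale,
    trying 3 candidate exponents and keeping the one with lowest MSE.
    The candidate set (e_base, e_base+1, e_base+2) is an assumption."""
    amax = np.abs(block).max()
    if amax == 0:
        return 0, np.zeros_like(block)
    # Base exponent maps amax onto the top of the FP4 range (max magnitude 6).
    e_base = int(np.ceil(np.log2(amax / 6.0)))
    best = None
    for e in (e_base, e_base + 1, e_base + 2):
        scale = 2.0 ** e
        scaled = block / scale
        # Round each element to the nearest representable FP4 magnitude.
        idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]).argmin(axis=1)
        deq = np.sign(scaled) * FP4_GRID[idx] * scale
        mse = float(np.mean((block - deq) ** 2))
        if best is None or mse < best[0]:
            best = (mse, e, deq)
    return best[1], best[2]
```

A block whose values already sit on the E2M1 grid round-trips exactly; for generic data the MSE search picks whichever of the three scales loses the least.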
## Quality

Perplexity measured on the wikitext-2-raw-v1 test set:
| Model | Mean NLL | Perplexity |
|---|---|---|
| Qwen3.5-35B-A3B (BF16) | 1.827 | 6.216 |
| Qwen3.5-35B-A3B MXFP4 | 1.815 | 6.143 |
MXFP4 quantization slightly improves perplexity over the BF16 baseline, likely due to regularization effects of quantization on the large number of MoE experts.
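As a quick consistency check, perplexity is just the exponential of the mean per-token negative log-likelihood; the reported NLLs round-trip to the table's perplexities within rounding error:

```python
import math

# Perplexity = exp(mean negative log-likelihood per token).
for name, mean_nll in [("BF16", 1.827), ("MXFP4", 1.815)]:
    print(f"{name}: ppl = {math.exp(mean_nll):.3f}")
```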
## Usage

### vLLM

```shell
pip install vllm

vllm serve olka-fi/Qwen3.5-35B-A3B-MXFP4 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 4096 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
```
Note: requires a vLLM build with Qwen3.5 architecture support (not yet available in stock vLLM 0.16.0).
### Python

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="Qwen3.5-35B-A3B-MXFP4",
    messages=[{"role": "user", "content": "Who are you?"}],
)
print(response.choices[0].message.content)
```
## Quantization Details

- Quantized with qstream, a custom MXFP4 quantization tool
- MSE-optimal 3-candidate scale selection per block (32 elements)
- Per-block shared exponent in e8m0 format
- Exclude patterns: `*shared_expert*`, `*.mlp.gate.`, `*lm_head*`, `*embed_tokens*`, `*visual*`, `*mtp*`, `*self_attn*`
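Glob-style exclude patterns like these can be checked against tensor names with Python's `fnmatch`; this is a sketch, assuming qstream uses glob semantics, and the tensor names below are hypothetical illustrations:

```python
from fnmatch import fnmatch

# Exclude patterns from this card; tensors matching any of them stay in BF16.
EXCLUDE = ["*shared_expert*", "*.mlp.gate.", "*lm_head*", "*embed_tokens*",
           "*visual*", "*mtp*", "*self_attn*"]

def is_quantized(tensor_name: str) -> bool:
    """True if no exclude pattern matches, i.e. the tensor gets MXFP4."""
    return not any(fnmatch(tensor_name, p) for p in EXCLUDE)

# Hypothetical tensor names for illustration:
print(is_quantized("model.layers.0.mlp.experts.7.gate_proj.weight"))   # True
print(is_quantized("model.layers.0.mlp.shared_expert.up_proj.weight")) # False
print(is_quantized("model.layers.0.self_attn.q_proj.weight"))          # False
```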
## Acknowledgments

Based on Qwen3.5-35B-A3B by Tongyi Lab (Alibaba).

License: Apache 2.0