# Qwen3-235B-A22B-fused
This is a fused-expert version of Qwen/Qwen3-235B-A22B, converted to enable native Expert Parallelism (EP) via Hugging Face Transformers' `DistributedConfig`.

The model weights are functionally identical to the original: only the storage layout of the MoE expert weights has changed.
## What changed and why
### The problem
Qwen3-235B-A22B is a Mixture-of-Experts model with 128 experts (8 active per token) across 94 layers. In the original checkpoint, each expert is stored as an independent `nn.Linear` module:
```
model.layers.{l}.mlp.experts.{i}.gate_proj.weight  → [1536, 4096]
model.layers.{l}.mlp.experts.{i}.up_proj.weight    → [1536, 4096]
model.layers.{l}.mlp.experts.{i}.down_proj.weight  → [4096, 1536]
```
This per-expert layout (128 × 3 = 384 small weight matrices per layer) is incompatible with Transformers' Expert Parallelism hooks (`GroupedGemmParallel`, `RouterParallel`, `GatherParallel`), which expect a single fused `[num_experts, ...]` tensor that can be sliced along the expert dimension across ranks.
### The fix
This checkpoint stacks all per-expert weights into fused `nn.Parameter` tensors:
```
model.layers.{l}.mlp.experts.gate_proj  → [128, 1536, 4096]
model.layers.{l}.mlp.experts.up_proj    → [128, 1536, 4096]
model.layers.{l}.mlp.experts.down_proj  → [128, 4096, 1536]
```
With this layout, `GroupedGemmParallel` can shard experts across EP ranks by slicing dim 0. For example, with EP=8, each rank loads 16 experts (`[16, 1536, 4096]`).

All other weights (attention, norms, embeddings, router gate) are unchanged.
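The sharding step itself is just a contiguous slice along dim 0. A toy sketch (per-expert shapes scaled down for illustration; `local_slice` is a hypothetical helper, not the library's API):

```python
import torch

# Toy fused tensor: 128 experts, with small stand-in per-expert shapes.
# The real checkpoint tensor would be [128, 1536, 4096].
num_experts, ep_size = 128, 8
gate_proj = torch.randn(num_experts, 12, 32)

def local_slice(fused: torch.Tensor, rank: int, ep_size: int) -> torch.Tensor:
    """Each EP rank keeps a contiguous block of experts along dim 0."""
    n = fused.shape[0] // ep_size
    return fused[rank * n : (rank + 1) * n]

shard = local_slice(gate_proj, rank=3, ep_size=ep_size)
print(shard.shape)  # torch.Size([16, 12, 32])
```

With the original per-expert `nn.Linear` layout there is no single tensor to slice like this, which is exactly why the fused layout is needed.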
## Modeling changes required
This checkpoint requires a modified transformers with:

- `Qwen3MoeRouter`: returns `(router_scores, router_indices)` compatible with `RouterParallel`
- `Qwen3MoeExperts`: holds fused `nn.Parameter` weights; the forward pass loops over active local experts
- `base_model_ep_plan` in `Qwen3MoeConfig`: maps modules to EP parallelism styles

The fork is available at `aminediroHF/transformers` (branch `qwen3-moe-ep`, based on v4.57.6).
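The "loop over active local experts" can be pictured with the following simplified sketch. This is not the fork's exact implementation, just a minimal illustration of how a fused-experts forward gathers the tokens routed to each expert and applies a SwiGLU MLP with that expert's slice of the fused weights:

```python
import torch
import torch.nn.functional as F

def fused_moe_forward(x, gate_proj, up_proj, down_proj,
                      router_scores, router_indices):
    """Sketch of a fused-experts MoE forward.

    x: [tokens, hidden]
    gate_proj, up_proj: [E, intermediate, hidden]; down_proj: [E, hidden, intermediate]
    router_scores, router_indices: [tokens, top_k]
    """
    out = torch.zeros_like(x)
    for e in range(gate_proj.shape[0]):  # on an EP rank, only the local experts
        token_idx, slot = torch.where(router_indices == e)
        if token_idx.numel() == 0:
            continue  # no tokens routed to this expert
        h = x[token_idx]
        h = F.silu(h @ gate_proj[e].T) * (h @ up_proj[e].T)  # SwiGLU activation
        h = h @ down_proj[e].T
        # Scale by the router score and scatter back to the output.
        out.index_add_(0, token_idx, h * router_scores[token_idx, slot].unsqueeze(-1))
    return out
```

Under EP, each rank runs this loop only over its local slice of experts; the `GatherParallel` hook then combines partial outputs across ranks.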
## Usage
### Expert Parallelism with TRL SFT
```bash
accelerate launch --config_file fsdp2.yaml trl/scripts/sft.py \
    --model_name_or_path aminediroHF/Qwen3-235B-A22B-fused \
    --enable_expert_parallel \
    --dataset_name THUDM/LongAlign-10k \
    --max_length 32768 --per_device_train_batch_size 1 \
    --gradient_checkpointing true --packing --packing_strategy wrapped \
    --max_steps 100 --logging_steps 1 \
    --output_dir ./output --report_to wandb
```
### EP forward pass (no training)
```python
import torch
from transformers import AutoModelForCausalLM
from transformers.distributed.configuration_utils import DistributedConfig

model = AutoModelForCausalLM.from_pretrained(
    "aminediroHF/Qwen3-235B-A22B-fused",
    dtype=torch.bfloat16,
    distributed_config=DistributedConfig(enable_expert_parallel=True),
).cuda()
```
## Technical details
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3-235B-A22B |
| Total parameters | 235B (22B active per token) |
| Architecture | 94 layers, hidden=4096, 64 attention heads, 4 KV heads |
| Experts | 128 total, 8 active per token, moe_intermediate=1536 |
| EP sharding | dim 0 of fused expert tensors (128 / EP_size per rank) |
| Model size (bf16) | ~470 GB |
| Checkpoint format | safetensors, sharded |
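A quick sanity check on the table's figures: the fused expert tensors account for nearly all of the parameter count and the ~470 GB bf16 footprint.

```python
# Expert parameters implied by the fused shapes above.
num_experts, inter, hidden, layers = 128, 1536, 4096, 94
per_layer = 3 * num_experts * inter * hidden   # gate + up + down projections
expert_params = per_layer * layers
print(round(expert_params / 1e9))              # 227 (of the 235B total)
print(round(expert_params * 2 / 1e9))          # 454 GB in bf16 (of the ~470 GB)
```

The remaining ~8B parameters are attention, embeddings, norms, and router gates.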
## Benchmark results
Tested with TRL `SFTTrainer` on AWS p5.48xlarge (H100 SXM 80GB), FSDP2 + EP:
| Context | Nodes | GPUs | CP | EP | Offload | MFU | TPS/GPU |
|---|---|---|---|---|---|---|---|
| 16k | 8 | 64 | 4 | 64 | yes | 0.51% | 70 |
| 32k | 8 | 64 | 8 | 64 | yes | 0.73% | 133 |
| 16k | 8 | 64 | 1-8 | 64 | no | - | OOM |
CPU offload is required on 8 nodes (64 GPUs) because model + optimizer states (~2.3 TB), together with activations and FSDP sharding overhead, do not fit in the 5.1 TB of total GPU memory. The Megatron reference config uses TP=4, PP=16, EP=8, CP=2 on 16 nodes (128 GPUs) to avoid offload.
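The ~2.3 TB figure is consistent with one plausible precision mix (an assumption here, not stated in the training config): bf16 weights plus the two Adam moment tensors kept in fp32.

```python
# Back-of-envelope memory for model + optimizer states (assumed precision mix).
params = 235e9
model_bf16 = params * 2        # 470 GB of bf16 weights
adam_fp32 = 2 * params * 4     # exp_avg + exp_avg_sq in fp32: 1.88 TB
total_tb = (model_bf16 + adam_fp32) / 1e12
print(round(total_tb, 2))      # 2.35, matching the ~2.3 TB figure
```

Activations, gradients, and FSDP all-gather buffers come on top of this, which is what pushes 64 GPUs past their 5.1 TB total.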
## Conversion
```bash
python scripts/convert_qwen3_moe_to_fused.py \
    --source_dir /path/to/Qwen3-235B-A22B \
    --output_dir /path/to/Qwen3-235B-A22B-fused
```
The conversion script is in the transformers fork (`scripts/convert_qwen3_moe_to_fused.py`).
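The core of the conversion is a `torch.stack` over the 128 per-expert matrices of each projection. A minimal sketch of that transform (a hypothetical helper, not the actual script):

```python
import torch

def fuse_layer_experts(state_dict, layer, num_experts=128):
    """Stack per-expert nn.Linear weights into one fused [E, out, in] tensor
    per projection, consuming the per-expert entries from state_dict."""
    fused = {}
    for proj in ("gate_proj", "up_proj", "down_proj"):
        mats = [
            state_dict.pop(f"model.layers.{layer}.mlp.experts.{i}.{proj}.weight")
            for i in range(num_experts)
        ]
        fused[f"model.layers.{layer}.mlp.experts.{proj}"] = torch.stack(mats, dim=0)
    return fused
```

The real script additionally has to stream the sharded safetensors files and rewrite the weight index, but the tensor transform is just this stacking.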