# Qwen3-235B-A22B-fused
This is a fused-expert version of Qwen/Qwen3-235B-A22B, converted to enable native Expert Parallelism (EP) via Hugging Face Transformers' `DistributedConfig`.

The model weights are functionally identical to the original: only the storage layout of the MoE expert weights has changed.
## What changed and why
### The problem
Qwen3-235B-A22B is a Mixture-of-Experts model with 128 experts (8 active per token) across 94 layers. In the original checkpoint, each expert is stored as an independent `nn.Linear` module:
```
model.layers.{l}.mlp.experts.{i}.gate_proj.weight  → [1536, 4096]
model.layers.{l}.mlp.experts.{i}.up_proj.weight    → [1536, 4096]
model.layers.{l}.mlp.experts.{i}.down_proj.weight  → [4096, 1536]
```
This per-expert layout (128 × 3 = 384 small weight matrices per layer) is incompatible with Transformers' Expert Parallelism hooks (`GroupedGemmParallel`, `RouterParallel`, `GatherParallel`), which expect a single fused `[num_experts, ...]` tensor that can be sliced along the expert dimension across ranks.
### The fix
This checkpoint stacks all per-expert weights into fused `nn.Parameter` tensors:
```
model.layers.{l}.mlp.experts.gate_proj  → [128, 1536, 4096]
model.layers.{l}.mlp.experts.up_proj    → [128, 1536, 4096]
model.layers.{l}.mlp.experts.down_proj  → [128, 4096, 1536]
```
With this layout, `GroupedGemmParallel` can shard experts across EP ranks by slicing dim 0. For example, with EP=8, each rank loads 16 experts (`[16, 1536, 4096]`).

All other weights (attention, norms, embeddings, router gate) are unchanged.
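The sharding step itself is just a contiguous slice along dim 0. A toy sketch (per-expert shapes scaled down for illustration; `local_slice` is a hypothetical helper, not the library's API):

```python
import torch

# Toy fused tensor: 128 experts, with small stand-in per-expert shapes.
# The real checkpoint tensor would be [128, 1536, 4096].
num_experts, ep_size = 128, 8
gate_proj = torch.randn(num_experts, 12, 32)

def local_slice(fused: torch.Tensor, rank: int, ep_size: int) -> torch.Tensor:
    """Each EP rank keeps a contiguous block of experts along dim 0."""
    n = fused.shape[0] // ep_size
    return fused[rank * n : (rank + 1) * n]

shard = local_slice(gate_proj, rank=3, ep_size=ep_size)
print(shard.shape)  # torch.Size([16, 12, 32])
```

With the original per-expert `nn.Linear` layout there is no single tensor to slice like this, which is exactly why the fused layout is needed.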
## Modeling changes required
This checkpoint requires a modified transformers with:

- `Qwen3MoeRouter`: returns `(router_scores, router_indices)` compatible with `RouterParallel`
- `Qwen3MoeExperts`: holds fused `nn.Parameter` weights; the forward pass loops over active local experts
- `base_model_ep_plan` in `Qwen3MoeConfig`: maps modules to EP parallelism styles

The fork is available at `aminediroHF/transformers` (branch `qwen3-moe-ep`, based on v4.57.6).
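The "loop over active local experts" can be pictured with the following simplified sketch. This is not the fork's exact implementation, just a minimal illustration of how a fused-experts forward gathers the tokens routed to each expert and applies a SwiGLU MLP with that expert's slice of the fused weights:

```python
import torch
import torch.nn.functional as F

def fused_moe_forward(x, gate_proj, up_proj, down_proj,
                      router_scores, router_indices):
    """Sketch of a fused-experts MoE forward.

    x: [tokens, hidden]
    gate_proj, up_proj: [E, intermediate, hidden]; down_proj: [E, hidden, intermediate]
    router_scores, router_indices: [tokens, top_k]
    """
    out = torch.zeros_like(x)
    for e in range(gate_proj.shape[0]):  # on an EP rank, only the local experts
        token_idx, slot = torch.where(router_indices == e)
        if token_idx.numel() == 0:
            continue  # no tokens routed to this expert
        h = x[token_idx]
        h = F.silu(h @ gate_proj[e].T) * (h @ up_proj[e].T)  # SwiGLU activation
        h = h @ down_proj[e].T
        # Scale by the router score and scatter back to the output.
        out.index_add_(0, token_idx, h * router_scores[token_idx, slot].unsqueeze(-1))
    return out
```

Under EP, each rank runs this loop only over its local slice of experts; the `GatherParallel` hook then combines partial outputs across ranks.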
## Usage
### Expert Parallelism with TRL SFT
```bash
accelerate launch --config_file fsdp2.yaml trl/scripts/sft.py \
    --model_name_or_path aminediroHF/Qwen3-235B-A22B-fused \
    --enable_expert_parallel \
    --dataset_name THUDM/LongAlign-10k \
    --max_length 32768 --per_device_train_batch_size 1 \
    --gradient_checkpointing true --packing --packing_strategy wrapped \
    --max_steps 100 --logging_steps 1 \
    --output_dir ./output --report_to wandb
```
### EP forward pass (no training)
```python
import torch
from transformers import AutoModelForCausalLM
from transformers.distributed.configuration_utils import DistributedConfig

model = AutoModelForCausalLM.from_pretrained(
    "aminediroHF/Qwen3-235B-A22B-fused",
    dtype=torch.bfloat16,
    distributed_config=DistributedConfig(enable_expert_parallel=True),
).cuda()
```
## Technical details
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3-235B-A22B |
| Total parameters | 235B (22B active per token) |
| Architecture | 94 layers, hidden=4096, 64 attention heads, 4 KV heads |
| Experts | 128 total, 8 active per token, moe_intermediate=1536 |
| EP sharding | dim 0 of fused expert tensors (128 / EP_size per rank) |
| Model size (bf16) | ~470 GB |
| Checkpoint format | safetensors, sharded |
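A quick sanity check on the table's figures: the fused expert tensors account for nearly all of the parameter count and the ~470 GB bf16 footprint.

```python
# Expert parameters implied by the fused shapes above.
num_experts, inter, hidden, layers = 128, 1536, 4096, 94
per_layer = 3 * num_experts * inter * hidden   # gate + up + down projections
expert_params = per_layer * layers
print(round(expert_params / 1e9))              # 227 (of the 235B total)
print(round(expert_params * 2 / 1e9))          # 454 GB in bf16 (of the ~470 GB)
```

The remaining ~8B parameters are attention, embeddings, norms, and router gates.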
## Benchmark results
Tested with TRL `SFTTrainer` on AWS p5.48xlarge (H100 SXM 80GB), FSDP2 + EP:
| Context | Nodes | GPUs | CP | EP | Offload | MFU | TPS/GPU |
|---|---|---|---|---|---|---|---|
| 16k | 8 | 64 | 4 | 64 | yes | 0.51% | 70 |
| 32k | 8 | 64 | 8 | 64 | yes | 0.73% | 133 |
| 16k | 8 | 64 | 1-8 | 64 | no | - | OOM |
CPU offload is required on 8 nodes (64 GPUs) because model + optimizer states (~2.3 TB), together with activations and FSDP sharding overhead, do not fit in the 5.1 TB of total GPU memory. The Megatron reference config uses TP=4, PP=16, EP=8, CP=2 on 16 nodes (128 GPUs) to avoid offload.
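The ~2.3 TB figure is consistent with one plausible precision mix (an assumption here, not stated in the training config): bf16 weights plus the two Adam moment tensors kept in fp32.

```python
# Back-of-envelope memory for model + optimizer states (assumed precision mix).
params = 235e9
model_bf16 = params * 2        # 470 GB of bf16 weights
adam_fp32 = 2 * params * 4     # exp_avg + exp_avg_sq in fp32: 1.88 TB
total_tb = (model_bf16 + adam_fp32) / 1e12
print(round(total_tb, 2))      # 2.35, matching the ~2.3 TB figure
```

Activations, gradients, and FSDP all-gather buffers come on top of this, which is what pushes 64 GPUs past their 5.1 TB total.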
## Conversion
```bash
python scripts/convert_qwen3_moe_to_fused.py \
    --source_dir /path/to/Qwen3-235B-A22B \
    --output_dir /path/to/Qwen3-235B-A22B-fused
```
The conversion script is in the transformers fork (`scripts/convert_qwen3_moe_to_fused.py`).
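The core of the conversion is a `torch.stack` over the 128 per-expert matrices of each projection. A minimal sketch of that transform (a hypothetical helper, not the actual script):

```python
import torch

def fuse_layer_experts(state_dict, layer, num_experts=128):
    """Stack per-expert nn.Linear weights into one fused [E, out, in] tensor
    per projection, consuming the per-expert entries from state_dict."""
    fused = {}
    for proj in ("gate_proj", "up_proj", "down_proj"):
        mats = [
            state_dict.pop(f"model.layers.{layer}.mlp.experts.{i}.{proj}.weight")
            for i in range(num_experts)
        ]
        fused[f"model.layers.{layer}.mlp.experts.{proj}"] = torch.stack(mats, dim=0)
    return fused
```

The real script additionally has to stream the sharded safetensors files and rewrite the weight index, but the tensor transform is just this stacking.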