# Qwen3-30B-A3B-fused
This is a fused-expert version of Qwen/Qwen3-30B-A3B, converted to enable native Expert Parallelism (EP) via Hugging Face Transformers' DistributedConfig.
The model weights are functionally identical to the original — only the storage layout of the MoE expert weights has changed.
## What changed and why

### The problem
Qwen3-30B-A3B is a Mixture-of-Experts model with 128 experts (8 active per token). In the original checkpoint, each expert is stored as an independent nn.Linear module:
```
model.layers.{l}.mlp.experts.{i}.gate_proj.weight  → [768, 2048]
model.layers.{l}.mlp.experts.{i}.up_proj.weight    → [768, 2048]
model.layers.{l}.mlp.experts.{i}.down_proj.weight  → [2048, 768]
```
This per-expert layout (128 × 3 = 384 small weight matrices per layer) is incompatible with Transformers' Expert Parallelism hooks (`GroupedGemmParallel`, `RouterParallel`, `GatherParallel`), which expect a single fused `[num_experts, ...]` tensor that can be sliced along the expert dimension across ranks.
### The fix

This checkpoint stacks all per-expert weights into fused `nn.Parameter` tensors:
```
model.layers.{l}.mlp.experts.gate_proj  → [128, 768, 2048]
model.layers.{l}.mlp.experts.up_proj    → [128, 768, 2048]
model.layers.{l}.mlp.experts.down_proj  → [128, 2048, 768]
```
With this layout, `GroupedGemmParallel` can shard experts across EP ranks by slicing dim 0. For example, with EP=8, each rank loads 16 experts (`[16, 768, 2048]`).
All other weights (attention, norms, embeddings, router gate) are unchanged.
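The fusion amounts to stacking each projection's per-expert weight along a new leading expert dimension, after which an EP rank's share is a contiguous dim-0 slice. A minimal sketch with toy shapes (the real model uses 128 experts, hidden size 2048, and MoE intermediate size 768):

```python
import torch
import torch.nn as nn

# Toy dimensions standing in for the real ones
# (num_experts=128, hidden=2048, inter=768).
num_experts, hidden, inter = 8, 32, 16

# Original layout: one nn.Linear per expert and projection.
experts = [
    nn.ModuleDict({
        "gate_proj": nn.Linear(hidden, inter, bias=False),
        "up_proj": nn.Linear(hidden, inter, bias=False),
        "down_proj": nn.Linear(inter, hidden, bias=False),
    })
    for _ in range(num_experts)
]

# Fused layout: stack along a new expert dimension (dim 0).
gate = torch.stack([e["gate_proj"].weight for e in experts])  # [E, inter, hidden]
up = torch.stack([e["up_proj"].weight for e in experts])      # [E, inter, hidden]
down = torch.stack([e["down_proj"].weight for e in experts])  # [E, hidden, inter]

# EP sharding is then a contiguous slice along dim 0,
# e.g. rank 1 of EP=4 holds experts 2..3 here.
ep_size, rank = 4, 1
local = gate[rank * num_experts // ep_size:(rank + 1) * num_experts // ep_size]
print(gate.shape, local.shape)
```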
## Modeling changes required

This checkpoint requires a modified `transformers` with:
- `Qwen3MoeRouter`: returns `(router_scores, router_indices)` compatible with `RouterParallel`
- `Qwen3MoeExperts`: holds fused `nn.Parameter` weights; its forward loops over the active local experts
- `base_model_ep_plan` in `Qwen3MoeConfig`: maps modules to EP parallelism styles
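As a hedged illustration of the fused-expert forward pattern (a simplified stand-in, not the fork's exact `Qwen3MoeExperts.forward`): loop over the experts that received tokens, apply the SwiGLU MLP with that expert's dim-0 slice of each fused tensor, and scatter-add the routing-weighted outputs back per token.

```python
import torch
import torch.nn.functional as F

def fused_experts_forward(x, gate, up, down, router_indices, router_scores):
    # x: [tokens, hidden]; gate/up: [E, inter, hidden]; down: [E, hidden, inter]
    # router_indices / router_scores: [tokens, top_k]
    out = torch.zeros_like(x)
    for e in router_indices.unique():
        # Tokens (and their top-k slot) routed to expert e.
        tok, slot = (router_indices == e).nonzero(as_tuple=True)
        h = x[tok]
        # SwiGLU MLP using expert e's slice of the fused tensors.
        h = (F.silu(h @ gate[e].T) * (h @ up[e].T)) @ down[e].T
        # Accumulate, weighted by the router score for that slot.
        out.index_add_(0, tok, h * router_scores[tok, slot].unsqueeze(-1))
    return out
```

Under EP, each rank would run this loop only over its local slice of experts; the simple Python loop here is just to make the dataflow explicit.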
The fork is available at `/fsx/amine_dirhoussi/transformers` (branch `qwen3-moe-ep`, based on v4.57.6).
## Usage

### Expert Parallelism with TRL SFT
```yaml
# Accelerate config (fsdp2_ep.yaml)
distributed_type: FSDP
fsdp_config:
  fsdp_version: 2
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_cpu_ram_efficient_loading: false
  fsdp_offload_params: false
num_machines: 1
num_processes: 8
parallelism_config:
  parallelism_config_ep_size: 8
```
```bash
accelerate launch --config_file fsdp2_ep.yaml trl/scripts/sft.py \
  --model_name_or_path aminediroHF/Qwen3-30B-A3B-fused \
  --enable_expert_parallel \
  --dataset_name THUDM/LongAlign-10k \
  --max_length 4096 --per_device_train_batch_size 1 \
  --gradient_checkpointing true --packing --packing_strategy wrapped \
  --bf16 true --max_steps 100 --logging_steps 1 \
  --output_dir ./output --report_to none
```
### EP forward pass (no training)
```python
import torch
from transformers import AutoModelForCausalLM
from transformers.distributed.configuration_utils import DistributedConfig

model = AutoModelForCausalLM.from_pretrained(
    "aminediroHF/Qwen3-30B-A3B-fused",
    dtype=torch.bfloat16,
    distributed_config=DistributedConfig(enable_expert_parallel=True),
).cuda()
```
## Technical details
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3-30B-A3B |
| Total parameters | 30.5B (3.3B active per token) |
| Experts | 128 total, 8 active per token |
| EP sharding | dim 0 of fused expert tensors (128 / EP_size per rank) |
| Attention | NOT sharded by EP (num_kv_heads=4 < EP_size=8); FSDP2 handles memory |
| Checkpoint format | safetensors, 13 shards |
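The EP sharding row above implies each rank owns a contiguous block of expert ids; a small helper (hypothetical, for illustration only) makes the arithmetic concrete:

```python
def local_expert_range(rank, ep_size, num_experts=128):
    """Contiguous block of expert ids held by one EP rank, assuming the
    even dim-0 split of the fused tensors (num_experts / ep_size per rank)."""
    per_rank = num_experts // ep_size
    return range(rank * per_rank, (rank + 1) * per_rank)

# With EP=8, rank 3 holds experts 48..63 (16 experts).
print(list(local_expert_range(3, 8)))
```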
## Conversion

The checkpoint was produced by:
```bash
python scripts/convert_qwen3_moe_to_fused.py \
  --source_dir /path/to/Qwen3-30B-A3B \
  --output_dir /path/to/Qwen3-30B-A3B-fused
```
This script is in the transformers fork (`scripts/convert_qwen3_moe_to_fused.py`).
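Conceptually, the conversion renames and stacks the per-expert state-dict keys into the fused layout; a hedged sketch of the core step (the actual script additionally handles safetensors shards, the weight index, and config updates; `fuse_layer_experts` is an illustrative name, not from the script):

```python
import torch

def fuse_layer_experts(state_dict, layer, num_experts=128):
    """Stack the per-expert matrices of one MoE layer into fused
    [num_experts, ...] tensors under the fused key names."""
    fused = {}
    for proj in ("gate_proj", "up_proj", "down_proj"):
        fused[f"model.layers.{layer}.mlp.experts.{proj}"] = torch.stack([
            state_dict[f"model.layers.{layer}.mlp.experts.{i}.{proj}.weight"]
            for i in range(num_experts)
        ])
    return fused
```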