# Qwen3-30B-A3B-fused
This is a fused-expert version of Qwen/Qwen3-30B-A3B, converted to enable native Expert Parallelism (EP) via Hugging Face Transformers' DistributedConfig.
The model weights are functionally identical to the original — only the storage layout of the MoE expert weights has changed.
## What changed and why

### The problem
Qwen3-30B-A3B is a Mixture-of-Experts model with 128 experts (8 active per token). In the original checkpoint, each expert is stored as an independent nn.Linear module:
```
model.layers.{l}.mlp.experts.{i}.gate_proj.weight  → [768, 2048]
model.layers.{l}.mlp.experts.{i}.up_proj.weight    → [768, 2048]
model.layers.{l}.mlp.experts.{i}.down_proj.weight  → [2048, 768]
```
This per-expert layout (128 × 3 = 384 small weight matrices per layer) is incompatible with Transformers' Expert Parallelism hooks (`GroupedGemmParallel`, `RouterParallel`, `GatherParallel`), which expect a single fused `[num_experts, ...]` tensor that can be sliced along the expert dimension across ranks.
### The fix

This checkpoint stacks all per-expert weights into fused `nn.Parameter` tensors:
```
model.layers.{l}.mlp.experts.gate_proj  → [128, 768, 2048]
model.layers.{l}.mlp.experts.up_proj    → [128, 768, 2048]
model.layers.{l}.mlp.experts.down_proj  → [128, 2048, 768]
```
With this layout, `GroupedGemmParallel` can shard experts across EP ranks by slicing dim 0. For example, with EP=8, each rank loads 16 experts (`[16, 768, 2048]`).
All other weights (attention, norms, embeddings, router gate) are unchanged.
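The fusion amounts to stacking each projection's per-expert weight along a new leading expert dimension, after which an EP rank's share is a contiguous dim-0 slice. A minimal sketch with toy shapes (the real model uses 128 experts, hidden size 2048, and MoE intermediate size 768):

```python
import torch
import torch.nn as nn

# Toy dimensions standing in for the real ones
# (num_experts=128, hidden=2048, inter=768).
num_experts, hidden, inter = 8, 32, 16

# Original layout: one nn.Linear per expert and projection.
experts = [
    nn.ModuleDict({
        "gate_proj": nn.Linear(hidden, inter, bias=False),
        "up_proj": nn.Linear(hidden, inter, bias=False),
        "down_proj": nn.Linear(inter, hidden, bias=False),
    })
    for _ in range(num_experts)
]

# Fused layout: stack along a new expert dimension (dim 0).
gate = torch.stack([e["gate_proj"].weight for e in experts])  # [E, inter, hidden]
up = torch.stack([e["up_proj"].weight for e in experts])      # [E, inter, hidden]
down = torch.stack([e["down_proj"].weight for e in experts])  # [E, hidden, inter]

# EP sharding is then a contiguous slice along dim 0,
# e.g. rank 1 of EP=4 holds experts 2..3 here.
ep_size, rank = 4, 1
local = gate[rank * num_experts // ep_size:(rank + 1) * num_experts // ep_size]
print(gate.shape, local.shape)
```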
## Modeling changes required

This checkpoint requires a modified `transformers` with:
- `Qwen3MoeRouter`: returns `(router_scores, router_indices)` compatible with `RouterParallel`
- `Qwen3MoeExperts`: holds fused `nn.Parameter` weights; its forward loops over the active local experts
- `base_model_ep_plan` in `Qwen3MoeConfig`: maps modules to EP parallelism styles
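As a hedged illustration of the fused-expert forward pattern (a simplified stand-in, not the fork's exact `Qwen3MoeExperts.forward`): loop over the experts that received tokens, apply the SwiGLU MLP with that expert's dim-0 slice of each fused tensor, and scatter-add the routing-weighted outputs back per token.

```python
import torch
import torch.nn.functional as F

def fused_experts_forward(x, gate, up, down, router_indices, router_scores):
    # x: [tokens, hidden]; gate/up: [E, inter, hidden]; down: [E, hidden, inter]
    # router_indices / router_scores: [tokens, top_k]
    out = torch.zeros_like(x)
    for e in router_indices.unique():
        # Tokens (and their top-k slot) routed to expert e.
        tok, slot = (router_indices == e).nonzero(as_tuple=True)
        h = x[tok]
        # SwiGLU MLP using expert e's slice of the fused tensors.
        h = (F.silu(h @ gate[e].T) * (h @ up[e].T)) @ down[e].T
        # Accumulate, weighted by the router score for that slot.
        out.index_add_(0, tok, h * router_scores[tok, slot].unsqueeze(-1))
    return out
```

Under EP, each rank would run this loop only over its local slice of experts; the simple Python loop here is just to make the dataflow explicit.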
The fork is available at `/fsx/amine_dirhoussi/transformers` (branch `qwen3-moe-ep`, based on v4.57.6).
## Usage

### Expert Parallelism with TRL SFT
```yaml
# Accelerate config (fsdp2_ep.yaml)
distributed_type: FSDP
fsdp_config:
  fsdp_version: 2
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_cpu_ram_efficient_loading: false
  fsdp_offload_params: false
num_machines: 1
num_processes: 8
parallelism_config:
  parallelism_config_ep_size: 8
```
```bash
accelerate launch --config_file fsdp2_ep.yaml trl/scripts/sft.py \
  --model_name_or_path aminediroHF/Qwen3-30B-A3B-fused \
  --enable_expert_parallel \
  --dataset_name THUDM/LongAlign-10k \
  --max_length 4096 --per_device_train_batch_size 1 \
  --gradient_checkpointing true --packing --packing_strategy wrapped \
  --bf16 true --max_steps 100 --logging_steps 1 \
  --output_dir ./output --report_to none
```
### EP forward pass (no training)
```python
import torch
from transformers import AutoModelForCausalLM
from transformers.distributed.configuration_utils import DistributedConfig

model = AutoModelForCausalLM.from_pretrained(
    "aminediroHF/Qwen3-30B-A3B-fused",
    dtype=torch.bfloat16,
    distributed_config=DistributedConfig(enable_expert_parallel=True),
).cuda()
```
## Technical details
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3-30B-A3B |
| Total parameters | 30.5B (3.3B active per token) |
| Experts | 128 total, 8 active per token |
| EP sharding | dim 0 of fused expert tensors (128 / EP_size per rank) |
| Attention | NOT sharded by EP (num_kv_heads=4 < EP_size=8); FSDP2 handles memory |
| Checkpoint format | safetensors, 13 shards |
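The EP sharding row above implies each rank owns a contiguous block of expert ids; a small helper (hypothetical, for illustration only) makes the arithmetic concrete:

```python
def local_expert_range(rank, ep_size, num_experts=128):
    """Contiguous block of expert ids held by one EP rank, assuming the
    even dim-0 split of the fused tensors (num_experts / ep_size per rank)."""
    per_rank = num_experts // ep_size
    return range(rank * per_rank, (rank + 1) * per_rank)

# With EP=8, rank 3 holds experts 48..63 (16 experts).
print(list(local_expert_range(3, 8)))
```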
## Conversion

The checkpoint was produced by:
```bash
python scripts/convert_qwen3_moe_to_fused.py \
  --source_dir /path/to/Qwen3-30B-A3B \
  --output_dir /path/to/Qwen3-30B-A3B-fused
```
This script is in the transformers fork (`scripts/convert_qwen3_moe_to_fused.py`).
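Conceptually, the conversion renames and stacks the per-expert state-dict keys into the fused layout; a hedged sketch of the core step (the actual script additionally handles safetensors shards, the weight index, and config updates; `fuse_layer_experts` is an illustrative name, not from the script):

```python
import torch

def fuse_layer_experts(state_dict, layer, num_experts=128):
    """Stack the per-expert matrices of one MoE layer into fused
    [num_experts, ...] tensors under the fused key names."""
    fused = {}
    for proj in ("gate_proj", "up_proj", "down_proj"):
        fused[f"model.layers.{layer}.mlp.experts.{proj}"] = torch.stack([
            state_dict[f"model.layers.{layer}.mlp.experts.{i}.{proj}.weight"]
            for i in range(num_experts)
        ])
    return fused
```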