---
license: apache-2.0
---

# [ICLR 2026] Routing Matters in MoE: Scaling Diffusion Transformers with Explicit Routing Guidance

<div align="center">

[arXiv](https://arxiv.org/abs/2510.24711)

_**[Yujie Wei<sup>1</sup>](https://weilllllls.github.io), [Shiwei Zhang<sup>2*</sup>](https://scholar.google.com.hk/citations?user=ZO3OQ-8AAAAJ), [Hangjie Yuan<sup>3</sup>](https://jacobyuan7.github.io), [Yujin Han<sup>4</sup>](https://yujinhanml.github.io/), [Zhekai Chen<sup>4,5</sup>](https://scholar.google.com/citations?user=_eZWcIMAAAAJ), [Jiayu Wang<sup>2</sup>](https://openreview.net/profile?id=~Jiayu_Wang2), [Difan Zou<sup>4</sup>](https://difanzou.github.io/), [Xihui Liu<sup>4,5</sup>](https://xh-liu.github.io/), [Yingya Zhang<sup>2</sup>](https://scholar.google.com/citations?user=16RDSEUAAAAJ), [Yu Liu<sup>2</sup>](https://scholar.google.com/citations?user=8zksQb4AAAAJ), [Hongming Shan<sup>1†</sup>](http://hmshan.io)**_
<br>
(*Project Leader, †Corresponding Author)

<sup>1</sup>Fudan University <sup>2</sup>Tongyi Lab, Alibaba Group <sup>3</sup>Zhejiang University <sup>4</sup>The University of Hong Kong <sup>5</sup>MMLab
</div>

<!-- Mixture-of-Experts (MoE) has emerged as a powerful paradigm for scaling model capacity while preserving computational efficiency. Despite its notable success in large language models (LLMs), existing attempts to apply MoE to Diffusion Transformers (DiTs) have yielded limited gains. We attribute this gap to fundamental differences between language and visual tokens. Language tokens are semantically dense with pronounced inter-token variation, while visual tokens exhibit spatial redundancy and functional heterogeneity, hindering expert specialization in vision MoE. -->

<div align="center">
<img src="https://img.alicdn.com/imgextra/i1/O1CN01rduqOi22t7gZTZwFG_!!6000000007177-2-tps-1722-1292.png" width="60%">
</div>

ProMoE is an MoE framework that employs a two-step router with explicit routing guidance to promote expert specialization for scaling Diffusion Transformers.
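
As a rough illustration of the two-step routing idea (not the released implementation — all names, shapes, and the top-1 rule here are assumptions), tokens are first partitioned by functional role, and conditional tokens are then assigned to experts by similarity to learnable prototypes:

```python
import numpy as np

def two_step_route(tokens, is_cond, prototypes):
    """Illustrative two-step routing sketch.

    Step 1: partition tokens by functional role (conditional vs. unconditional).
    Step 2: assign conditional tokens to experts by cosine similarity
            to learnable prototypes (prototypical routing).

    tokens:     (N, D) token features
    is_cond:    (N,) bool mask, True for conditional image tokens
    prototypes: (E, D) expert prototypes (random stand-ins here)
    Returns an (N,) expert index; -1 marks the unconditional set,
    which a dedicated branch would handle.
    """
    assign = np.full(tokens.shape[0], -1, dtype=int)
    cond = tokens[is_cond]
    # Cosine similarity between conditional tokens and prototypes.
    t = cond / np.linalg.norm(cond, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sim = t @ p.T                         # (N_cond, E)
    assign[is_cond] = sim.argmax(axis=1)  # top-1 prototypical routing
    return assign

rng = np.random.default_rng(0)
tokens = rng.standard_normal((6, 8))
is_cond = np.array([True, True, False, True, False, True])
prototypes = rng.standard_normal((4, 8))
print(two_step_route(tokens, is_cond, prototypes))
```

The similarity-based assignment in latent space is what makes it natural to attach explicit semantic guidance (e.g., a routing contrastive loss) on top.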

<!-- Specifically, this guidance encourages the router to partition image tokens into conditional and unconditional sets via conditional routing according to their functional roles, and refine the assignments of conditional image tokens through prototypical routing with learnable prototypes based on semantic content. Moreover, the similarity-based expert allocation in latent space enabled by prototypical routing offers a natural mechanism for incorporating explicit semantic guidance, and we validate that such guidance is crucial for vision MoE. Building on this, we propose a routing contrastive loss that explicitly enhances the prototypical routing process, promoting intra-expert coherence and inter-expert diversity. Extensive experiments on ImageNet benchmark demonstrate that ProMoE surpasses state-of-the-art methods under both Rectified Flow and DDPM training objectives. -->

## 🤗 Overview

This codebase supports:

* **Baselines:** Dense-DiT, TC-DiT, EC-DiT, DiffMoE, and their variants.
* **Proposed Models:** ProMoE variants (S, B, L, XL) with Token-Choice (TC) and Expert-Choice (EC) routing.
* **VAE Latent Preprocessing:** Pre-encode raw images into latents and cache them for faster training; supports multi-GPU parallel processing.
* **Sampling and Metric Evaluation:** Image sampling, Inception feature extraction, and calculation of FID, IS, sFID, Precision, and Recall; supports multi-GPU parallel processing.


## 🔥 Updates
- __[2026.03]__: Released the model weights, 50K generated images, and evaluation results on both [Hugging Face](https://huggingface.co/weilllllls/ProMoE) and [ModelScope](https://modelscope.cn/models/weilllllls/ProMoE)!
- __[2026.02]__: Released the training, sampling, and evaluation code of ProMoE.
- __[2026.01]__: 🎉 Our paper has been accepted by **ICLR 2026**!
- __[2025.10]__: Released the [paper](https://arxiv.org/abs/2510.24711) of ProMoE.


## 🗂️ Pretrained Models

We have released the model weights, 50K generated images, and the corresponding evaluation results on both Hugging Face and ModelScope.

| Model | Platform | Weights | 50K Images | Eval Results (CFG=1.0) | Eval Results (CFG=1.5) |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **ProMoE-B-Flow** <br> (500K) | Hugging Face <br> ModelScope | [Link](https://huggingface.co/weilllllls/ProMoE/blob/main/ProMoE_B/ckpt_step_500000.pth) <br> [Link](https://modelscope.cn/models/weilllllls/ProMoE/file/view/master/ProMoE_B%2Fckpt_step_500000.pth?status=2) | [cfg=1.0](https://huggingface.co/weilllllls/ProMoE/blob/main/ProMoE_B/images_cfg1.0.npz), [cfg=1.5](https://huggingface.co/weilllllls/ProMoE/blob/main/ProMoE_B/images_cfg1.5.npz) <br> [cfg=1.0](https://modelscope.cn/models/weilllllls/ProMoE/file/view/master/ProMoE_B%2Fimages_cfg1.0.npz?status=2), [cfg=1.5](https://modelscope.cn/models/weilllllls/ProMoE/file/view/master/ProMoE_B%2Fimages_cfg1.5.npz?status=2) | [FID 24.44, IS 60.38](https://huggingface.co/weilllllls/ProMoE/blob/main/ProMoE_B/images_eval_openai_cfg1.0.txt) <br> [FID 24.44, IS 60.38](https://modelscope.cn/models/weilllllls/ProMoE/file/view/master/ProMoE_B%2Fimages_eval_openai_cfg1.0.txt?status=1) | [FID 6.39, IS 154.21](https://huggingface.co/weilllllls/ProMoE/blob/main/ProMoE_B/images_eval_openai_cfg1.5.txt) <br> [FID 6.39, IS 154.21](https://modelscope.cn/models/weilllllls/ProMoE/file/view/master/ProMoE_B%2Fimages_eval_openai_cfg1.5.txt?status=1) |
| **ProMoE-L-Flow** <br> (500K) | Hugging Face <br> ModelScope | [Link](https://huggingface.co/weilllllls/ProMoE/blob/main/ProMoE_L/ckpt_step_500000.pth) <br> [Link](https://modelscope.cn/models/weilllllls/ProMoE/file/view/master/ProMoE_L%2Fckpt_step_500000.pth?status=2) | [cfg=1.0](https://huggingface.co/weilllllls/ProMoE/blob/main/ProMoE_L/images_cfg1.0.npz), [cfg=1.5](https://huggingface.co/weilllllls/ProMoE/blob/main/ProMoE_L/images_cfg1.5.npz) <br> [cfg=1.0](https://modelscope.cn/models/weilllllls/ProMoE/file/view/master/ProMoE_L%2Fimages_cfg1.0.npz?status=2), [cfg=1.5](https://modelscope.cn/models/weilllllls/ProMoE/file/view/master/ProMoE_L%2Fimages_cfg1.5.npz?status=2) | [FID 11.61, IS 100.82](https://huggingface.co/weilllllls/ProMoE/blob/main/ProMoE_L/images_eval_openai_cfg1.0.txt) <br> [FID 11.61, IS 100.82](https://modelscope.cn/models/weilllllls/ProMoE/file/view/master/ProMoE_L%2Fimages_eval_openai_cfg1.0.txt?status=1) | [FID 2.79, IS 244.21](https://huggingface.co/weilllllls/ProMoE/blob/main/ProMoE_L/images_eval_openai_cfg1.5.txt) <br> [FID 2.79, IS 244.21](https://modelscope.cn/models/weilllllls/ProMoE/file/view/master/ProMoE_L%2Fimages_eval_openai_cfg1.5.txt?status=1) |
| **ProMoE-XL-Flow** <br> (500K) | Hugging Face <br> ModelScope | [Link](https://huggingface.co/weilllllls/ProMoE/blob/main/ProMoE_XL/ckpt_step_500000.pth) <br> [Link](https://modelscope.cn/models/weilllllls/ProMoE/file/view/master/ProMoE_XL%2Fckpt_step_500000.pth?status=2) | [cfg=1.0](https://huggingface.co/weilllllls/ProMoE/blob/main/ProMoE_XL/images_cfg1.0.npz), [cfg=1.5](https://huggingface.co/weilllllls/ProMoE/blob/main/ProMoE_XL/images_cfg1.5.npz) <br> [cfg=1.0](https://modelscope.cn/models/weilllllls/ProMoE/file/view/master/ProMoE_XL%2Fimages_cfg1.0.npz?status=2), [cfg=1.5](https://modelscope.cn/models/weilllllls/ProMoE/file/view/master/ProMoE_XL%2Fimages_cfg1.5.npz?status=2) | [FID 9.44, IS 114.94](https://huggingface.co/weilllllls/ProMoE/blob/main/ProMoE_XL/images_eval_openai_cfg1.0.txt) <br> [FID 9.44, IS 114.94](https://modelscope.cn/models/weilllllls/ProMoE/file/view/master/ProMoE_XL%2Fimages_eval_openai_cfg1.0.txt?status=1) | [FID 2.59, IS 265.62](https://huggingface.co/weilllllls/ProMoE/blob/main/ProMoE_XL/images_eval_openai_cfg1.5.txt) <br> [FID 2.59, IS 265.62](https://modelscope.cn/models/weilllllls/ProMoE/file/view/master/ProMoE_XL%2Fimages_eval_openai_cfg1.5.txt?status=1) |


## ⚙️ Preparation

### 1. Requirements & Installation
```bash
conda create -n promoe python=3.10 -y
conda activate promoe
pip install -r requirements.txt
```

### 2. Dataset Preparation

Download the [ImageNet](http://image-net.org/download) dataset and set `cfg.data_path` in `config.py` accordingly.

### 3. VAE Latent Preprocessing (Optional)

For faster training and more efficient GPU usage, you can **precompute VAE latents** and train with `cfg.use_pre_latents=True`.

Run latent preprocessing:
```bash
ImageNet_path=/path/to/ImageNet

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python preprocess/preprocess_vae.py --latent_save_root "$ImageNet_path/sd-vae-ft-mse_Latents_256img_npz"
```
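
The idea behind latent caching can be sketched as follows. Note that the file layout, array keys, and filenames below are purely hypothetical — the actual format is defined by `preprocess/preprocess_vae.py`:

```python
import os
import tempfile
import numpy as np

# Hypothetical cache layout: one .npz per image holding the VAE latent
# ("latent") and its class label ("label"). The real script's keys may differ.
cache_dir = tempfile.mkdtemp()

def save_latent(path, latent, label):
    # fp16 storage roughly halves disk usage; cast back to fp32 at load time.
    np.savez(path, latent=latent.astype(np.float16), label=label)

def load_latent(path):
    with np.load(path) as f:
        return f["latent"].astype(np.float32), int(f["label"])

# A 256x256 RGB image becomes a 4x32x32 latent under the SD VAE (8x downsampling).
lat = np.random.default_rng(0).standard_normal((4, 32, 32))
p = os.path.join(cache_dir, "n01440764_42.npz")
save_latent(p, lat, 0)
restored, label = load_latent(p)
print(restored.shape, label)  # (4, 32, 32) 0
```

Training then reads these small latent files instead of decoding JPEGs and running the VAE encoder every epoch.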

## 🚀 Training

Training is launched via `train.py` with a YAML config:
```bash
python train.py --config configs/004_ProMoE_L.yaml
```

**Notes:**

- This repository currently supports Rectified Flow with Logit-Normal timestep sampling (following [SD3](https://arxiv.org/pdf/2403.03206)). For the DDPM implementation, please refer to this [repository](https://github.com/KlingTeam/DiffMoE/tree/main).
- By default, ProMoE uses Token-Choice routing. For DDPM-based training, we recommend the Expert-Choice variant in `models/models_ProMoE_EC.py`.
- Configuration files for all baseline models are provided in the `configs` directory.
- All results reported in the paper are obtained with `qk_norm=False`. For extended training runs (>2M steps), we suggest enabling `qk_norm=True` to ensure training stability.
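
For readers unfamiliar with the objective, a minimal NumPy sketch of one rectified-flow training example with SD3-style logit-normal timestep sampling looks like this (names and the data-to-noise convention are illustrative, not the repo's exact code):

```python
import numpy as np

def rf_training_example(x0, rng):
    """One rectified-flow training example with logit-normal timestep
    sampling (SD3-style). The model would regress v_target from x_t and t."""
    eps = rng.standard_normal(x0.shape)              # noise endpoint of the path
    t = 1.0 / (1.0 + np.exp(-rng.normal(0.0, 1.0)))  # t = sigmoid(n), n ~ N(0, 1)
    x_t = (1.0 - t) * x0 + t * eps                   # linear interpolation path
    v_target = eps - x0                              # constant velocity d(x_t)/dt
    return x_t, t, v_target

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 32, 32))                # stand-in for a clean VAE latent
x_t, t, v_target = rf_training_example(x0, rng)
```

The logit-normal density concentrates samples around t = 0.5, weighting the mid-path timesteps more heavily than uniform sampling.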


## 💫 Sampling

Image generation is performed via `sample.py`, using the same YAML configuration file as training.

```bash
# use default setting
CUDA_VISIBLE_DEVICES=0 python sample.py --config configs/004_ProMoE_L.yaml

# use custom setting
CUDA_VISIBLE_DEVICES=0 python sample.py \
  --config configs/004_ProMoE_L.yaml \
  --step_list_for_sample 200000,300000 \
  --guide_scale_list 1.0,1.5,4.0 \
  --num_fid_samples 10000
```

**Notes:**

- By default, the script loads the checkpoint at **500K steps** and generates **50,000 images** on a **single GPU**, sweeping guidance scales (CFG) of **1.0** and **1.5**.
- To use **multiple GPUs** for sampling, specify the devices via `CUDA_VISIBLE_DEVICES` or add `sample_gpu_ids` to the configuration file. Note that multi-GPU inference produces a globally different random sequence (e.g., of class labels) than single-GPU inference does.
- Generated images are saved as PNG files in a `sample/` directory under the same parent directory as the checkpoint folder; filenames include both the sample index and the class label.
- If you only need FID, you can set `cfg.save_inception_features=True` to save Inception features and reduce `cfg.save_img_num`.
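
The guidance scales above enter through the standard classifier-free guidance combination, which can be sketched as (illustrative, not the repo's sampler code):

```python
import numpy as np

def apply_cfg(pred_cond, pred_uncond, scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one. scale=1.0 recovers the
    purely conditional prediction; larger scales sharpen class adherence."""
    return pred_uncond + scale * (pred_cond - pred_uncond)

rng = np.random.default_rng(0)
pred_cond = rng.standard_normal((4, 32, 32))    # model output with class label
pred_uncond = rng.standard_normal((4, 32, 32))  # model output with null label
guided = apply_cfg(pred_cond, pred_uncond, 1.5)
```

This is also why the `cfg=1.0` results in the table above correspond to unguided (purely conditional) sampling.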


## 📝 Evaluation

We follow the standard evaluation protocol of [OpenAI's guided-diffusion](https://github.com/openai/guided-diffusion/tree/main/evaluations). All relevant code is located in the `evaluation` directory.

### 1. Environment Setup
Since the evaluation pipeline relies on TensorFlow, we strongly recommend creating a dedicated environment to avoid dependency conflicts.
```bash
conda create -n promoe_eval python=3.9 -y
conda activate promoe_eval
cd evaluation
pip install -r requirements.txt
conda install -c conda-forge cudatoolkit=11.2 cudnn=8.1.0
```

### 2. Download Reference Batch
Download the reference statistics file [VIRTUAL_imagenet256_labeled.npz](https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/imagenet/256/VIRTUAL_imagenet256_labeled.npz) (for 256×256 images) and place it in the `evaluation` directory.

### 3. Execution
To compute the metrics, run the evaluation script with the path to your folder of generated images:
```bash
CUDA_VISIBLE_DEVICES=0 python run_eval.py /path/to/generated/images
```
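
For intuition, FID is the Fréchet distance between Gaussians fitted to Inception activations of the reference and generated sets. A self-contained NumPy sketch (the evaluation scripts above are the authoritative implementation; the activations here are random stand-ins):

```python
import numpy as np

def fid_from_stats(mu1, sigma1, mu2, sigma2):
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2):
    ||mu1 - mu2||^2 + tr(s1 + s2 - 2 * sqrtm(s1 @ s2)).
    tr(sqrtm(s1 @ s2)) is computed from the eigenvalues of s1 @ s2,
    which are real and non-negative for covariance matrices."""
    diff = mu1 - mu2
    eig = np.linalg.eigvals(sigma1 @ sigma2)
    tr_sqrt = np.sqrt(np.clip(eig.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_sqrt)

def stats(acts):
    return acts.mean(axis=0), np.cov(acts, rowvar=False)

rng = np.random.default_rng(0)
ref = rng.standard_normal((1000, 16))   # stand-ins for Inception activations
gen = ref + 0.5                         # "generated" set, shifted by 0.5 per dim
mu_r, s_r = stats(ref)
mu_g, s_g = stats(gen)
print(fid_from_stats(mu_r, s_r, mu_r, s_r))  # identical stats -> ~0
print(fid_from_stats(mu_r, s_r, mu_g, s_g))
```

With a pure mean shift of 0.5 in 16 dimensions the distance is dominated by the ||mu1 - mu2||^2 term (16 × 0.25 = 4).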


## Acknowledgements

This code is built on top of [DiffMoE](https://github.com/KlingTeam/DiffMoE), [DiT](https://github.com/facebookresearch/DiT), and [guided-diffusion](https://github.com/openai/guided-diffusion/tree/main/evaluations). We thank the authors for their great work.


## 🌟 Citation

If you find this code useful for your research, please cite our paper:

```bibtex
@inproceedings{wei2026promoe,
  title={Routing Matters in MoE: Scaling Diffusion Transformers with Explicit Routing Guidance},
  author={Wei, Yujie and Zhang, Shiwei and Yuan, Hangjie and Han, Yujin and Chen, Zhekai and Wang, Jiayu and Zou, Difan and Liu, Xihui and Zhang, Yingya and Liu, Yu and others},
  booktitle={International Conference on Learning Representations},
  year={2026}
}
```