---
license: apache-2.0
---

# [ICLR 2026] Routing Matters in MoE: Scaling Diffusion Transformers with Explicit Routing Guidance

<div align="center">

[![arXiv](https://img.shields.io/badge/arXiv-2510.24711-b31b1b.svg)](https://arxiv.org/abs/2510.24711)

_**[Yujie Wei<sup>1</sup>](https://weilllllls.github.io), [Shiwei Zhang<sup>2*</sup>](https://scholar.google.com.hk/citations?user=ZO3OQ-8AAAAJ), [Hangjie Yuan<sup>3</sup>](https://jacobyuan7.github.io), [Yujin Han<sup>4</sup>](https://yujinhanml.github.io/), [Zhekai Chen<sup>4,5</sup>](https://scholar.google.com/citations?user=_eZWcIMAAAAJ), [Jiayu Wang<sup>2</sup>](https://openreview.net/profile?id=~Jiayu_Wang2), [Difan Zou<sup>4</sup>](https://difanzou.github.io/), [Xihui Liu<sup>4,5</sup>](https://xh-liu.github.io/), [Yingya Zhang<sup>2</sup>](https://scholar.google.com/citations?user=16RDSEUAAAAJ), [Yu Liu<sup>2</sup>](https://scholar.google.com/citations?user=8zksQb4AAAAJ), [Hongming Shan<sup>1†</sup>](http://hmshan.io)**_
<br>
(*Project Leader, †Corresponding Author)

<sup>1</sup>Fudan University <sup>2</sup>Tongyi Lab, Alibaba Group <sup>3</sup>Zhejiang University <sup>4</sup>The University of Hong Kong <sup>5</sup>MMLab
</div>

<!-- Mixture-of-Experts (MoE) has emerged as a powerful paradigm for scaling model capacity while preserving computational efficiency. Despite its notable success in large language models (LLMs), existing attempts to apply MoE to Diffusion Transformers (DiTs) have yielded limited gains. We attribute this gap to fundamental differences between language and visual tokens. Language tokens are semantically dense with pronounced inter-token variation, while visual tokens exhibit spatial redundancy and functional heterogeneity, hindering expert specialization in vision MoE. -->

<div align="center">
<img src="https://img.alicdn.com/imgextra/i1/O1CN01rduqOi22t7gZTZwFG_!!6000000007177-2-tps-1722-1292.png" width="60%">
</div>

ProMoE is an MoE framework that employs a two-step router with explicit routing guidance to promote expert specialization for scaling Diffusion Transformers.

<!-- Specifically, this guidance encourages the router to partition image tokens into conditional and unconditional sets via conditional routing according to their functional roles, and refine the assignments of conditional image tokens through prototypical routing with learnable prototypes based on semantic content. Moreover, the similarity-based expert allocation in latent space enabled by prototypical routing offers a natural mechanism for incorporating explicit semantic guidance, and we validate that such guidance is crucial for vision MoE. Building on this, we propose a routing contrastive loss that explicitly enhances the prototypical routing process, promoting intra-expert coherence and inter-expert diversity. Extensive experiments on ImageNet benchmark demonstrate that ProMoE surpasses state-of-the-art methods under both Rectified Flow and DDPM training objectives. -->

## 🤗 Overview

This codebase supports:

* **Baselines:** Dense-DiT, TC-DiT, EC-DiT, DiffMoE, and their variants.
* **Proposed Models:** ProMoE variants (S, B, L, XL) with Token-Choice (TC) and Expert-Choice (EC) routing.
* **VAE Latent Preprocessing:** Pre-encode raw images into latents and cache them for faster training; supports multi-GPU parallel processing.
* **Sampling and Metric Evaluation:** Image sampling, Inception feature extraction, and calculation of FID, IS, sFID, Precision, and Recall; supports multi-GPU parallel processing.

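For intuition on the two routing schemes: under Token-Choice, each token independently selects its top-k experts, while under Expert-Choice, each expert selects the tokens with the highest affinity for it. A minimal NumPy sketch of the two selection rules (an illustration only — the actual routers in `models/` implement the paper's two-step guided routing):

```python
import numpy as np

def token_choice_topk(logits, k=2):
    """Token-Choice: every token picks its top-k experts and weights
    them with a softmax over the selected logits."""
    # logits: (num_tokens, num_experts) token-expert affinities
    topk_idx = np.argsort(logits, axis=-1)[:, ::-1][:, :k]       # top-k expert ids per token
    topk_logits = np.take_along_axis(logits, topk_idx, axis=-1)
    gates = np.exp(topk_logits - topk_logits.max(-1, keepdims=True))
    gates /= gates.sum(-1, keepdims=True)                        # per-token mixture weights
    return topk_idx, gates

def expert_choice(logits, capacity=2):
    """Expert-Choice: every expert picks its `capacity` highest-affinity
    tokens, i.e., selection runs over columns instead of rows."""
    chosen = np.argsort(logits, axis=0)[::-1][:capacity]         # (capacity, num_experts)
    return chosen.T                                              # token ids chosen per expert
```

Token-Choice guarantees every token is routed but gives experts uneven loads; Expert-Choice fixes each expert's load by construction, at the cost that some tokens may be selected by no expert.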

## 🔥 Updates
- __[2026.03]__: Released the model weights, 50K generated images, and evaluation results on both [Hugging Face](https://huggingface.co/weilllllls/ProMoE) and [ModelScope](https://modelscope.cn/models/weilllllls/ProMoE)!
- __[2026.02]__: Released the training, sampling, and evaluation code of ProMoE.
- __[2026.01]__: 🎉 Our paper has been accepted to **ICLR 2026**!
- __[2025.10]__: Released the [paper](https://arxiv.org/abs/2510.24711) of ProMoE.


## 🗂️ Pretrained Models

We have released the model weights, the 50K generated images, and the corresponding evaluation results on both Hugging Face and ModelScope.

| Model | Platform | Weights | 50K Images | Eval Results (CFG=1.0) | Eval Results (CFG=1.5) |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **ProMoE-B-Flow** <br> (500K) | Hugging Face <br> ModelScope | [Link](https://huggingface.co/weilllllls/ProMoE/blob/main/ProMoE_B/ckpt_step_500000.pth) <br> [Link](https://modelscope.cn/models/weilllllls/ProMoE/file/view/master/ProMoE_B%2Fckpt_step_500000.pth?status=2) | [cfg=1.0](https://huggingface.co/weilllllls/ProMoE/blob/main/ProMoE_B/images_cfg1.0.npz), [cfg=1.5](https://huggingface.co/weilllllls/ProMoE/blob/main/ProMoE_B/images_cfg1.5.npz) <br> [cfg=1.0](https://modelscope.cn/models/weilllllls/ProMoE/file/view/master/ProMoE_B%2Fimages_cfg1.0.npz?status=2), [cfg=1.5](https://modelscope.cn/models/weilllllls/ProMoE/file/view/master/ProMoE_B%2Fimages_cfg1.5.npz?status=2) | [FID 24.44, IS 60.38](https://huggingface.co/weilllllls/ProMoE/blob/main/ProMoE_B/images_eval_openai_cfg1.0.txt) <br> [FID 24.44, IS 60.38](https://modelscope.cn/models/weilllllls/ProMoE/file/view/master/ProMoE_B%2Fimages_eval_openai_cfg1.0.txt?status=1) | [FID 6.39, IS 154.21](https://huggingface.co/weilllllls/ProMoE/blob/main/ProMoE_B/images_eval_openai_cfg1.5.txt) <br> [FID 6.39, IS 154.21](https://modelscope.cn/models/weilllllls/ProMoE/file/view/master/ProMoE_B%2Fimages_eval_openai_cfg1.5.txt?status=1) |
| **ProMoE-L-Flow** <br> (500K) | Hugging Face <br> ModelScope | [Link](https://huggingface.co/weilllllls/ProMoE/blob/main/ProMoE_L/ckpt_step_500000.pth) <br> [Link](https://modelscope.cn/models/weilllllls/ProMoE/file/view/master/ProMoE_L%2Fckpt_step_500000.pth?status=2) | [cfg=1.0](https://huggingface.co/weilllllls/ProMoE/blob/main/ProMoE_L/images_cfg1.0.npz), [cfg=1.5](https://huggingface.co/weilllllls/ProMoE/blob/main/ProMoE_L/images_cfg1.5.npz) <br> [cfg=1.0](https://modelscope.cn/models/weilllllls/ProMoE/file/view/master/ProMoE_L%2Fimages_cfg1.0.npz?status=2), [cfg=1.5](https://modelscope.cn/models/weilllllls/ProMoE/file/view/master/ProMoE_L%2Fimages_cfg1.5.npz?status=2) | [FID 11.61, IS 100.82](https://huggingface.co/weilllllls/ProMoE/blob/main/ProMoE_L/images_eval_openai_cfg1.0.txt) <br> [FID 11.61, IS 100.82](https://modelscope.cn/models/weilllllls/ProMoE/file/view/master/ProMoE_L%2Fimages_eval_openai_cfg1.0.txt?status=1) | [FID 2.79, IS 244.21](https://huggingface.co/weilllllls/ProMoE/blob/main/ProMoE_L/images_eval_openai_cfg1.5.txt) <br> [FID 2.79, IS 244.21](https://modelscope.cn/models/weilllllls/ProMoE/file/view/master/ProMoE_L%2Fimages_eval_openai_cfg1.5.txt?status=1) |
| **ProMoE-XL-Flow** <br> (500K) | Hugging Face <br> ModelScope | [Link](https://huggingface.co/weilllllls/ProMoE/blob/main/ProMoE_XL/ckpt_step_500000.pth) <br> [Link](https://modelscope.cn/models/weilllllls/ProMoE/file/view/master/ProMoE_XL%2Fckpt_step_500000.pth?status=2) | [cfg=1.0](https://huggingface.co/weilllllls/ProMoE/blob/main/ProMoE_XL/images_cfg1.0.npz), [cfg=1.5](https://huggingface.co/weilllllls/ProMoE/blob/main/ProMoE_XL/images_cfg1.5.npz) <br> [cfg=1.0](https://modelscope.cn/models/weilllllls/ProMoE/file/view/master/ProMoE_XL%2Fimages_cfg1.0.npz?status=2), [cfg=1.5](https://modelscope.cn/models/weilllllls/ProMoE/file/view/master/ProMoE_XL%2Fimages_cfg1.5.npz?status=2) | [FID 9.44, IS 114.94](https://huggingface.co/weilllllls/ProMoE/blob/main/ProMoE_XL/images_eval_openai_cfg1.0.txt) <br> [FID 9.44, IS 114.94](https://modelscope.cn/models/weilllllls/ProMoE/file/view/master/ProMoE_XL%2Fimages_eval_openai_cfg1.0.txt?status=1) | [FID 2.59, IS 265.62](https://huggingface.co/weilllllls/ProMoE/blob/main/ProMoE_XL/images_eval_openai_cfg1.5.txt) <br> [FID 2.59, IS 265.62](https://modelscope.cn/models/weilllllls/ProMoE/file/view/master/ProMoE_XL%2Fimages_eval_openai_cfg1.5.txt?status=1) |


## ⚙️ Preparation
### 1. Requirements & Installation
```bash
conda create -n promoe python=3.10 -y
conda activate promoe
pip install -r requirements.txt
```

### 2. Dataset Preparation

Download the [ImageNet](http://image-net.org/download) dataset and point `cfg.data_path` in `config.py` to its location.

### 3. VAE Latent Preprocessing (Optional)

For faster training and more efficient GPU usage, you can **precompute VAE latents** and train with `cfg.use_pre_latents=True`.

Run latent preprocessing:
```bash
ImageNet_path=/path/to/ImageNet

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python preprocess/preprocess_vae.py --latent_save_root "$ImageNet_path/sd-vae-ft-mse_Latents_256img_npz"
```

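As a sketch of what such a latent cache can look like — the `latent`/`label` keys, float16 storage, and one-file-per-image layout here are assumptions for illustration; see `preprocess/preprocess_vae.py` for the actual format:

```python
import os
import numpy as np

def save_latent(save_dir, name, latent, label):
    # cache one pre-encoded sample; float16 halves disk usage vs. float32
    np.savez(os.path.join(save_dir, f"{name}.npz"),
             latent=latent.astype(np.float16), label=np.int64(label))

def load_latent(path):
    # a dataloader with cfg.use_pre_latents=True would read latents like this
    d = np.load(path)
    return d["latent"].astype(np.float32), int(d["label"])
```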
## 🚀 Training

Training is launched via `train.py` with a YAML config:
```bash
python train.py --config configs/004_ProMoE_L.yaml
```

**Notes:**

- This repository currently supports Rectified Flow with Logit-Normal timestep sampling (following [SD3](https://arxiv.org/pdf/2403.03206)). For a DDPM implementation, please refer to this [repository](https://github.com/KlingTeam/DiffMoE/tree/main).
- By default, ProMoE uses Token-Choice routing; for DDPM-based training, we recommend the Expert-Choice variant in `models/models_ProMoE_EC.py`.
- Configuration files for all baseline models are provided in the `configs` directory.
- All results reported in the paper are obtained with `qk_norm=False`. For extended training runs (>2M steps), we suggest enabling `qk_norm=True` to ensure training stability.

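For reference, the Rectified Flow objective with Logit-Normal timestep sampling boils down to a few lines. This is a generic sketch of the SD3-style formulation, not the repository's actual training loop:

```python
import numpy as np

def rf_training_example(x0, rng):
    """One Rectified Flow training example: interpolate data -> noise on a
    straight line and regress the constant velocity at a logit-normal t."""
    eps = rng.standard_normal(x0.shape)               # Gaussian noise endpoint
    t = 1.0 / (1.0 + np.exp(-rng.standard_normal()))  # t = sigmoid(n), n ~ N(0, 1)
    x_t = (1.0 - t) * x0 + t * eps                    # point on the straight path
    v_target = eps - x0                               # velocity the model regresses
    return x_t, t, v_target
```

The logit-normal density concentrates samples around t = 0.5, where the velocity prediction task is hardest, which is the rationale given in the SD3 paper.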
## 💫 Sampling

Image generation is performed via `sample.py`, using the same YAML configuration file as training.

```bash
# use default setting
CUDA_VISIBLE_DEVICES=0 python sample.py --config configs/004_ProMoE_L.yaml

# use custom setting
CUDA_VISIBLE_DEVICES=0 python sample.py \
    --config configs/004_ProMoE_L.yaml \
    --step_list_for_sample 200000,300000 \
    --guide_scale_list 1.0,1.5,4.0 \
    --num_fid_samples 10000
```

**Notes:**

- By default, the script loads the checkpoint at **500K steps** and generates **50,000 images** on a **single GPU**, sweeping over guidance scales (CFG) of **1.0** and **1.5**.
- To use **multiple GPUs** for sampling, specify the devices via `CUDA_VISIBLE_DEVICES` or add `sample_gpu_ids` to the configuration file. Note that multi-GPU inference draws a different global random sequence (e.g., for class labels) than single-GPU inference does.
- Generated images are saved as PNG files in the `sample/` directory, located in the same parent directory as the checkpoint folder. Filenames include both the sample index and the class label.
- If you only need FID, you can set `cfg.save_inception_features=True` to save Inception features and reduce `cfg.save_img_num`.

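The guidance scales enter through the standard classifier-free guidance combination of the unconditional and conditional model outputs; as a reference sketch (the exact call inside `sample.py` may differ):

```python
import numpy as np

def cfg_combine(pred_uncond, pred_cond, guide_scale):
    """Classifier-free guidance: extrapolate from the unconditional toward
    the conditional prediction. guide_scale=1.0 is plain conditional sampling."""
    return pred_uncond + guide_scale * (pred_cond - pred_uncond)
```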
## 📝 Evaluation

We follow the standard evaluation protocol outlined in [OpenAI's guided-diffusion](https://github.com/openai/guided-diffusion/tree/main/evaluations). All relevant code is located in the `evaluation` directory.

### 1. Environment Setup
Since the evaluation pipeline relies on TensorFlow, we strongly recommend creating a dedicated environment to avoid dependency conflicts.
```bash
conda create -n promoe_eval python=3.9 -y
conda activate promoe_eval
cd evaluation
pip install -r requirements.txt
conda install -c conda-forge cudatoolkit=11.2 cudnn=8.1.0
```

### 2. Download Reference Batch
Download the reference statistics file [VIRTUAL_imagenet256_labeled.npz](https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/imagenet/256/VIRTUAL_imagenet256_labeled.npz) (for 256x256 images) and place it in the `evaluation` directory.

### 3. Execution
To compute the metrics, run the evaluation script with the path to your folder of generated images:
```bash
CUDA_VISIBLE_DEVICES=0 python run_eval.py /path/to/generated/images
```

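Among the reported metrics, FID is the Fréchet distance between two Gaussians fitted to the Inception features of the reference and generated batches. A standalone sketch of that formula (using SciPy's matrix square root; the actual script follows the guided-diffusion implementation):

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2})."""
    diff = mu1 - mu2
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # sqrtm can return tiny imaginary numerical noise
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```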


## Acknowledgements

This code is built on top of [DiffMoE](https://github.com/KlingTeam/DiffMoE), [DiT](https://github.com/facebookresearch/DiT), and [guided-diffusion](https://github.com/openai/guided-diffusion/tree/main/evaluations). We thank the authors for their great work.


## 🌟 Citation

If you find this code useful for your research, please cite our paper:

```bibtex
@inproceedings{wei2026promoe,
  title={Routing Matters in MoE: Scaling Diffusion Transformers with Explicit Routing Guidance},
  author={Wei, Yujie and Zhang, Shiwei and Yuan, Hangjie and Han, Yujin and Chen, Zhekai and Wang, Jiayu and Zou, Difan and Liu, Xihui and Zhang, Yingya and Liu, Yu and others},
  booktitle={International Conference on Learning Representations},
  year={2026}
}
```