---
license: mit
---

# Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion

## 🚩 Overview
<p align="center">
  <img src="https://raw.githubusercontent.com/yuemingPAN/SFD/main/images/teaser_v5.png" width="95%">
</p>

<div align="center" style="max-width:900px; text-align:justify; font-size:14px; line-height:1.5;">
<p>
<strong>(a) Overview of Semantic-First Diffusion (SFD).</strong>
Semantics (dashed curve) and textures (solid curve) follow asynchronous denoising trajectories.
SFD operates in three stages:
<span style="color:#d62728;">Stage I: Semantic initialization</span>, where semantic latents denoise first;
<span style="color:#4472c4;">Stage II: Asynchronous generation</span>, where semantics and textures denoise jointly but asynchronously, with semantics ahead of textures;
<span style="color:#2ca02c;">Stage III: Texture completion</span>, where only textures continue refining.
After denoising, the generated semantic latent <b>s₁</b> is discarded, and the final image is decoded solely from the texture latent <b>z₁</b>.
<strong>(b) Training convergence on ImageNet 256×256 without guidance.</strong>
SFD converges approximately <b>100×</b> faster than DiT-XL/2 and <b>33.3×</b> faster than LightningDiT-XL/1.
</p>
</div>
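
The three stages in caption (a) can be read as two clocks driven by one shared counter. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: it assumes timesteps run from 0 (pure noise) to 1 (clean latent), that the texture clock lags the semantic clock by a constant offset, and it borrows the Δt = 0.3 offset reported under Quantitative Results as its default. The names `async_timesteps`, `stage`, and `tau` are hypothetical.

```python
def async_timesteps(tau: float, delta_t: float = 0.3) -> tuple[float, float]:
    """Map a shared clock tau in [0, 1 + delta_t] to (t_semantic, t_texture).

    Assumption (not taken from the SFD code): t = 0 is pure noise, t = 1 is
    clean, and textures lag semantics by a constant offset delta_t.
    """
    if not 0.0 <= tau <= 1.0 + delta_t + 1e-9:
        raise ValueError("tau must lie in [0, 1 + delta_t]")
    t_semantic = min(tau, 1.0)                     # semantics finish first
    t_texture = min(max(tau - delta_t, 0.0), 1.0)  # textures start later
    return t_semantic, t_texture


def stage(tau: float, delta_t: float = 0.3) -> str:
    """Name the stage that the shared clock value tau falls into."""
    if tau < delta_t:
        return "I: semantic initialization"   # only semantics move
    if tau < 1.0:
        return "II: asynchronous generation"  # both move, semantics ahead
    return "III: texture completion"          # only textures move


if __name__ == "__main__":
    for tau in (0.0, 0.15, 0.5, 1.0, 1.3):
        t_sem, t_tex = async_timesteps(tau)
        print(f"tau={tau:.2f}  t_sem={t_sem:.2f}  t_tex={t_tex:.2f}  stage {stage(tau)}")
```

Under these assumptions and the default offset, the shared clock spans [0, 1.3]: Stage I covers `tau < 0.3`, Stage II covers `0.3 <= tau < 1.0`, and Stage III covers the remainder, during which the semantic latent stays at its final value.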

---

## ✨ Highlights
- We propose **Semantic-First Diffusion (SFD)**, a latent diffusion paradigm that denoises semantic and texture latents asynchronously, so that semantics denoise earlier and then guide texture generation.
- **SFD achieves a state-of-the-art FID of 1.04** on ImageNet 256×256 generation.
- SFD converges **100×** and **33.3×** faster in training than **DiT** and **LightningDiT**, respectively.

---

## 🧪 Quantitative Results
Letting semantics lead textures by a **moderate offset (Δt = 0.3)** gives the best balance between early semantic stabilization and texture collaboration, harmonizing their joint modeling.
<p align="center">
  <img src="https://raw.githubusercontent.com/yuemingPAN/SFD/main/images/fid_vs_delta_t.png" width="50%">
</p>

### With AutoGuidance

| Model | Epochs | #Params | FID (NPU) |
|:--------|:-------:|:--------:|:----------:|
| SFD-XL | 80 | 675M | 1.30 |
| SFD-XL | 800 | 675M | **1.06** |
| SFD-XXL | 80 | 1.0B | 1.19 |
| SFD-XXL | 800 | 1.0B | **1.04** |
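
To make the effect of longer training easier to read off the table, the following sketch computes the relative FID reduction from 80 to 800 epochs for each model. The FID values are copied from the table above; the "relative reduction" metric and the names `FID` and `relative_fid_reduction` are illustrative choices, not quantities reported by the authors.

```python
# FID values copied from the "With AutoGuidance" table: (model, epochs) -> FID.
FID = {
    ("SFD-XL", 80): 1.30,
    ("SFD-XL", 800): 1.06,
    ("SFD-XXL", 80): 1.19,
    ("SFD-XXL", 800): 1.04,
}


def relative_fid_reduction(model: str) -> float:
    """Fraction by which FID drops when training goes from 80 to 800 epochs."""
    short_run = FID[(model, 80)]
    long_run = FID[(model, 800)]
    return (short_run - long_run) / short_run


if __name__ == "__main__":
    for model in ("SFD-XL", "SFD-XXL"):
        print(f"{model}: {relative_fid_reduction(model):.1%} lower FID at 800 epochs")
```

By this measure, the 800-epoch runs lower FID by roughly 18% for SFD-XL and 13% for SFD-XXL relative to the 80-epoch runs.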

---

## 🎨 Visual Results

<p align="center">
  <img src="https://raw.githubusercontent.com/yuemingPAN/SFD/main/images/demo_Sample.png" width="90%">
</p>

---

## 🔗 Links
- 🌐 **Project Page:** [https://yuemingpan.github.io/SFD.github.io/](https://yuemingpan.github.io/SFD.github.io/)
- 📄 **Paper (arXiv):** [https://arxiv.org/pdf/2512.04926](https://arxiv.org/pdf/2512.04926)
- 💾 **Code:** [https://github.com/yuemingPAN/SFD](https://github.com/yuemingPAN/SFD)
- 🧰 **License:** MIT

---

## 🧩 Citation
```bibtex
@article{Pan2025SFD,
  title={Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion},
  author={Pan, Yueming and Feng, Ruoyu and Dai, Qi and Wang, Yuqi and Lin, Wenfeng and Guo, Mingyu and Luo, Chong and Zheng, Nanning},
  journal={arXiv preprint arXiv:2512.04926},
  year={2025}
}
```