Improve model card with abstract, diffusers usage, benchmarks, and showcases
This PR significantly enhances the model card by incorporating more comprehensive information derived from the paper's abstract and the project's GitHub repository.
Key improvements include:
- **Expanded "About" section:** The brief description has been replaced with the full abstract of the paper, offering a detailed overview of the TACA method, its motivations, and findings.
- **Visual Overview:** An embedded GIF/video from the GitHub README is added for a quick visual summary of the model's capabilities.
- **Detailed Usage Instructions:** New Python code snippets using the `diffusers` library are provided for both Stable Diffusion 3.5 and FLUX.1, making it easier for users to integrate and experiment with TACA.
- **Benchmark Results:** The comprehensive evaluation table from the paper/GitHub repository is included, showcasing TACA's performance on the T2I-CompBench benchmark.
- **Visual Showcases:** Example generated images are added, with their paths corrected to ensure they render properly on the Hugging Face Hub.
- **Citation Information:** A BibTeX entry is added for proper academic attribution.
These updates aim to make the model card more informative, user-friendly, and aligned with best practices for documenting AI artifacts on the Hugging Face Hub.
```diff
@@ -25,7 +25,7 @@ pipeline_tag: text-to-image
 <a href="https://frozenburning.github.io/" target="_blank">Zhaoxi Chen</a><sup>4</sup>,</span>
 </span>
 <span class="author-block">
-<a href="https://homepage.hit.edu.cn/
+<a href="https://homepage.hit.edu.cn/wangmengmengzuo" target="_blank">Wangmeng Zuo</a><sup>5</sup>,</span>
 </span>
 <span class="author-block">
 <a href="https://liuziwei7.github.io/" target="_blank">Ziwei Liu</a><sup>4†</sup>,</span>
```
```diff
@@ -52,4 +52,79 @@ pipeline_tag: text-to-image
 </p>
 
 # About
-We propose **TACA**, a parameter-efficient method that dynamically rebalances cross-modal attention in
```
# About

Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-driven visual generation. However, even state-of-the-art MM-DiT models like FLUX struggle to achieve precise alignment between text prompts and generated content. We identify two key issues in the attention mechanism of MM-DiT that hinder this alignment: 1) the suppression of cross-modal attention due to token imbalance between the visual and textual modalities, and 2) the lack of timestep-aware attention weighting. To address these issues, we propose **Temperature-Adjusted Cross-modal Attention (TACA)**, a parameter-efficient method that dynamically rebalances multimodal interactions through temperature scaling and timestep-dependent adjustment. When combined with LoRA fine-tuning, TACA significantly enhances text-image alignment on the T2I-CompBench benchmark with minimal computational overhead. We tested TACA on state-of-the-art models such as FLUX and SD3.5, demonstrating its ability to improve image-text alignment in terms of object appearance, attribute binding, and spatial relationships. Our findings highlight the importance of balancing cross-modal attention for improving semantic fidelity in text-to-image diffusion models. Our code is publicly available on GitHub.
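The mechanism named in the abstract can be sketched in a few lines. This is a minimal illustration only, not the authors' implementation: the linear timestep schedule and the choice to scale only the text-key logits of the fused attention are assumptions made here for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def taca_attention(q, k, v, num_text_tokens, temperature=1.2, t=0.0):
    """Toy temperature-adjusted cross-modal attention.

    q, k, v: (seq, dim) arrays over the fused MM-DiT sequence, with text
    tokens in the first `num_text_tokens` positions and image tokens after.
    temperature > 1 amplifies the logits of text keys, countering the token
    imbalance between the few text tokens and the many image tokens.
    t: normalized diffusion timestep in [0, 1]; here the boost simply decays
    to nothing as t -> 1 (an illustrative schedule, not the paper's).
    """
    logits = q @ k.swapaxes(-2, -1) / np.sqrt(q.shape[-1])
    temp_t = 1.0 + (temperature - 1.0) * (1.0 - t)
    logits[..., :num_text_tokens] *= temp_t  # rescale text-key columns only
    return softmax(logits) @ v

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(6, 8)) for _ in range(3))
plain = softmax(q @ k.swapaxes(-2, -1) / np.sqrt(8)) @ v
# With temperature = 1 the adjustment is inert and plain attention is recovered.
assert np.allclose(taca_attention(q, k, v, num_text_tokens=2, temperature=1.0), plain)
```

Because text tokens are vastly outnumbered by image tokens in the fused sequence, their softmax mass is naturally suppressed; raising their logits by a temperature greater than one rebalances cross-modal attention in favor of the prompt.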
https://github.com/user-attachments/assets/ae15a853-ee99-4eee-b0fd-8f5f53c308f9
# Usage

You can use `TACA` with `Stable Diffusion 3.5` or `FLUX.1` models.

## With Stable Diffusion 3.5

```python
import torch
from diffusers import StableDiffusion3Pipeline

# Load the base model, then the TACA LoRA weights
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium", torch_dtype=torch.float16
)
pipe.load_lora_weights("ldiex/TACA", weight_name="taca_sd3_r64.safetensors")
pipe.to("cuda")

# Generate an image
prompt = "A majestic lion standing proudly on a rocky cliff overlooking a vast savanna at sunset."
image = pipe(prompt).images[0]

image.save("lion_sunset.png")
```
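The `r64` / `r16` suffixes on the weight files refer to the LoRA rank. As a reminder of what loading such an adapter means, here is a generic low-rank-update sketch (illustrative names and sizes, not the actual TACA weights):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 128, 128, 16          # r is the LoRA rank (16 or 64 in this repo)

W = rng.normal(size=(d_out, d_in))     # frozen base projection weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))               # trainable up-projection, starts at zero

def lora_forward(x, scale=1.0):
    """Base layer output plus the low-rank update; only A and B are trained."""
    return x @ W.T + scale * (x @ A.T @ B.T)

x = rng.normal(size=(1, d_in))
# With B initialized to zero the adapter is inert: output equals the base layer.
assert np.allclose(lora_forward(x), x @ W.T)
# The adapter trains r * (d_in + d_out) parameters instead of d_in * d_out.
assert r * (d_in + d_out) < d_in * d_out
```

The two ranks trade adapter size against capacity, which is why both variants are benchmarked separately below.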
## With FLUX.1

```python
import torch
from diffusers import FluxPipeline

# Load the base model, then the TACA LoRA weights
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.float16
)
pipe.load_lora_weights("ldiex/TACA", weight_name="taca_flux_r64.safetensors")
pipe.to("cuda")

# Generate an image
prompt = "A majestic lion standing proudly on a rocky cliff overlooking a vast savanna at sunset."
image = pipe(prompt).images[0]

image.save("lion_sunset.png")
```
# Benchmark

Alignment evaluation on T2I-CompBench for FLUX.1-Dev-based and SD3.5-Medium-based models. Color, Shape, and Texture measure attribute binding; Spatial and Non-Spatial measure object relationships; higher is better (↑).

| Model | Color ↑ | Shape ↑ | Texture ↑ | Spatial ↑ | Non-Spatial ↑ | Complex ↑ |
|---|---|---|---|---|---|---|
| FLUX.1-Dev | 0.7678 | 0.5064 | 0.6756 | 0.2066 | 0.3035 | 0.4359 |
| FLUX.1-Dev + TACA (r = 64) | **0.7843** | **0.5362** | **0.6872** | **0.2405** | 0.3041 | **0.4494** |
| FLUX.1-Dev + TACA (r = 16) | 0.7842 | 0.5347 | 0.6814 | 0.2321 | **0.3046** | 0.4479 |
| SD3.5-Medium | 0.7890 | 0.5770 | 0.7328 | 0.2087 | 0.3104 | 0.4441 |
| SD3.5-Medium + TACA (r = 64) | **0.8074** | **0.5938** | **0.7522** | **0.2678** | 0.3106 | 0.4470 |
| SD3.5-Medium + TACA (r = 16) | 0.7984 | 0.5834 | 0.7467 | 0.2374 | **0.3111** | **0.4505** |
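For a quick reading of the table, relative improvements can be computed directly from the scores; the spatial-relationship column, for example, shows the largest movement for the r = 64 adapters:

```python
def relative_gain(base, taca):
    """Relative improvement of the TACA-tuned model over its base, in percent."""
    return 100.0 * (taca - base) / base

# Spatial scores taken from the benchmark table above.
print(f"FLUX.1-Dev:   +{relative_gain(0.2066, 0.2405):.1f}%")  # +16.4%
print(f"SD3.5-Medium: +{relative_gain(0.2087, 0.2678):.1f}%")  # +28.3%
```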
# Showcases




# Citation

```bibtex
@article{lv2025taca,
  title={TACA: Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers},
  author={Lv, Zhengyao and Pan, Tianlin and Si, Chenyang and Chen, Zhaoxi and Zuo, Wangmeng and Liu, Ziwei and Wong, Kwan-Yee K},
  journal={arXiv preprint arXiv:2506.07986},
  year={2025}
}
```