nielsr HF Staff committed on
Commit
8719203
·
verified ·
1 Parent(s): b854c46

Improve model card with abstract, diffusers usage, benchmarks, and showcases


This PR significantly enhances the model card by incorporating more comprehensive information derived from the paper's abstract and the project's GitHub repository.

Key improvements include:
- **Expanded "About" section:** The brief description has been replaced with the full abstract of the paper, offering a detailed overview of the TACA method, its motivations, and findings.
- **Visual Overview:** An embedded GIF/video from the GitHub README is added for a quick visual summary of the model's capabilities.
- **Detailed Usage Instructions:** New Python code snippets using the `diffusers` library are provided for both Stable Diffusion 3.5 and FLUX.1, making it easier for users to integrate and experiment with TACA.
- **Benchmark Results:** The comprehensive evaluation table from the paper/GitHub repository is included, showcasing TACA's performance on the T2I-CompBench benchmark.
- **Visual Showcases:** Example generated images are added, with their paths corrected to ensure they render properly on the Hugging Face Hub.
- **Citation Information:** A BibTeX entry is added for proper academic attribution.

These updates aim to make the model card more informative, user-friendly, and aligned with best practices for documenting AI artifacts on the Hugging Face Hub.

Files changed (1)
  1. README.md +77 -2
README.md CHANGED
@@ -25,7 +25,7 @@ pipeline_tag: text-to-image
  <a href="https://frozenburning.github.io/" target="_blank">Zhaoxi Chen</a><sup>4</sup>,</span>
  </span>
  <span class="author-block">
- <a href="https://homepage.hit.edu.cn/wangmengzuo" target="_blank">Wangmeng Zuo</a><sup>5</sup>,</span>
  </span>
  <span class="author-block">
  <a href="https://liuziwei7.github.io/" target="_blank">Ziwei Liu</a><sup>4†</sup>,</span>
@@ -52,4 +52,79 @@ pipeline_tag: text-to-image
  </p>

  # About
- We propose **TACA**, a parameter-efficient method that dynamically rebalances cross-modal attention in multimodal diffusion transformers to improve text-image alignment.
  <a href="https://frozenburning.github.io/" target="_blank">Zhaoxi Chen</a><sup>4</sup>,</span>
  </span>
  <span class="author-block">
+ <a href="https://homepage.hit.edu.cn/wangmengzuo" target="_blank">Wangmeng Zuo</a><sup>5</sup>,</span>
  </span>
  <span class="author-block">
  <a href="https://liuziwei7.github.io/" target="_blank">Ziwei Liu</a><sup>4†</sup>,</span>
  </p>

  # About
+ Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-driven visual generation. However, even state-of-the-art MM-DiT models such as FLUX struggle to achieve precise alignment between text prompts and generated content. We identify two key issues in the attention mechanism of MM-DiT: 1) the suppression of cross-modal attention due to token imbalance between the visual and textual modalities, and 2) the lack of timestep-aware attention weighting; both hinder alignment. To address these issues, we propose **Temperature-Adjusted Cross-modal Attention (TACA)**, a parameter-efficient method that dynamically rebalances multimodal interactions through temperature scaling and timestep-dependent adjustment. When combined with LoRA fine-tuning, TACA significantly enhances text-image alignment on the T2I-CompBench benchmark with minimal computational overhead. We tested TACA on state-of-the-art models such as FLUX and SD3.5, demonstrating its ability to improve image-text alignment in terms of object appearance, attribute binding, and spatial relationships. Our findings highlight the importance of balancing cross-modal attention for improving semantic fidelity in text-to-image diffusion models. Our code is publicly available at https://github.com/Vchitect/TACA.
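As a rough illustration of the mechanism described above, the sketch below rescales the attention logits that queries assign to text-token keys by a temperature factor. This is a toy reconstruction, not the authors' implementation: the token layout (text tokens first), the constant temperature value, and the function name are all assumptions, and in TACA the temperature is additionally timestep-dependent.

```python
import torch

def temperature_adjusted_attention(q, k, v, num_text_tokens, temperature=1.25):
    """Toy sketch of temperature-adjusted cross-modal attention (hypothetical).

    Logits pointing at text keys (assumed to occupy the first
    `num_text_tokens` positions) are scaled by `temperature`, rebalancing
    how much probability mass queries allocate to the textual modality
    versus the far more numerous image tokens. TACA additionally makes
    this factor timestep-dependent; a constant is used here for brevity.
    """
    d = q.shape[-1]
    # Standard scaled dot-product attention logits: (seq, seq)
    logits = (q @ k.transpose(-2, -1)) / d ** 0.5
    # Rescale only the columns that correspond to text keys.
    logits[..., :num_text_tokens] = logits[..., :num_text_tokens] * temperature
    weights = torch.softmax(logits, dim=-1)
    return weights @ v
```

With `temperature=1.0` this reduces to ordinary scaled dot-product attention, so the adjustment can be toggled without touching the model's weights.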
+
+ https://github.com/user-attachments/assets/ae15a853-ee99-4eee-b0fd-8f5f53c308f9
+
+ # Usage
+
+ You can use `TACA` with `Stable Diffusion 3.5` or `FLUX.1` models.
+
+ ## With Stable Diffusion 3.5
+
+ ```python
+ from diffusers import StableDiffusion3Pipeline
+ import torch
+
+ # Load the base model and the TACA LoRA weights
+ pipe = StableDiffusion3Pipeline.from_pretrained(
+     "stabilityai/stable-diffusion-3.5-medium", torch_dtype=torch.float16
+ )
+ pipe.load_lora_weights("ldiex/TACA", weight_name="taca_sd3_r64.safetensors")
+ pipe.to("cuda")
+
+ # Generate an image
+ prompt = "A majestic lion standing proudly on a rocky cliff overlooking a vast savanna at sunset."
+ image = pipe(prompt).images[0]
+
+ image.save("lion_sunset.png")
+ ```
+
+ ## With FLUX.1
+
+ ```python
+ from diffusers import FluxPipeline
+ import torch
+
+ # Load the base model and the TACA LoRA weights
+ pipe = FluxPipeline.from_pretrained(
+     "black-forest-labs/FLUX.1-dev", torch_dtype=torch.float16
+ )
+ pipe.load_lora_weights("ldiex/TACA", weight_name="taca_flux_r64.safetensors")
+ pipe.to("cuda")
+
+ # Generate an image
+ prompt = "A majestic lion standing proudly on a rocky cliff overlooking a vast savanna at sunset."
+ image = pipe(prompt).images[0]
+
+ image.save("lion_sunset.png")
+ ```
+
+ # Benchmark
+ Comparison of alignment evaluation on T2I-CompBench for FLUX.1-Dev-based and SD3.5-Medium-based models. Color, Shape, and Texture measure attribute binding; Spatial and Non-Spatial measure object relationships.
+
+ | Model | Color $\uparrow$ | Shape $\uparrow$ | Texture $\uparrow$ | Spatial $\uparrow$ | Non-Spatial $\uparrow$ | Complex $\uparrow$ |
+ |---|---|---|---|---|---|---|
+ | FLUX.1-Dev | 0.7678 | 0.5064 | 0.6756 | 0.2066 | 0.3035 | 0.4359 |
+ | FLUX.1-Dev + TACA ($r = 64$) | **0.7843** | **0.5362** | **0.6872** | **0.2405** | 0.3041 | **0.4494** |
+ | FLUX.1-Dev + TACA ($r = 16$) | 0.7842 | 0.5347 | 0.6814 | 0.2321 | **0.3046** | 0.4479 |
+ | SD3.5-Medium | 0.7890 | 0.5770 | 0.7328 | 0.2087 | 0.3104 | 0.4441 |
+ | SD3.5-Medium + TACA ($r = 64$) | **0.8074** | **0.5938** | **0.7522** | **0.2678** | 0.3106 | 0.4470 |
+ | SD3.5-Medium + TACA ($r = 16$) | 0.7984 | 0.5834 | 0.7467 | 0.2374 | **0.3111** | **0.4505** |
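The Spatial column shows the largest relative movement in the scores above; the gains of the $r = 64$ LoRA over each base model can be computed directly from the table. These percentages are derived arithmetic on the published numbers, not additional benchmark results.

```python
# Spatial-alignment scores copied from the T2I-CompBench table above:
# (base model, base model + TACA with r = 64)
spatial = {
    "FLUX.1-Dev": (0.2066, 0.2405),
    "SD3.5-Medium": (0.2087, 0.2678),
}

for model, (base, taca) in spatial.items():
    gain = (taca - base) / base * 100
    print(f"{model}: Spatial {base:.4f} -> {taca:.4f} (+{gain:.1f}%)")
# FLUX.1-Dev:   +16.4%
# SD3.5-Medium: +28.3%
```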
+
+ # Showcases
+ ![](https://github.com/Vchitect/TACA/raw/main/static/images/short_1.png)
+ ![](https://github.com/Vchitect/TACA/raw/main/static/images/short_2.png)
+ ![](https://github.com/Vchitect/TACA/raw/main/static/images/long_1.png)
+ ![](https://github.com/Vchitect/TACA/raw/main/static/images/long_2.png)
+
+ # Citation
+ ```bibtex
+ @article{lv2025taca,
+   title={TACA: Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers},
+   author={Lv, Zhengyao and Pan, Tianlin and Si, Chenyang and Chen, Zhaoxi and Zuo, Wangmeng and Liu, Ziwei and Wong, Kwan-Yee K.},
+   journal={arXiv preprint arXiv:2506.07986},
+   year={2025}
+ }
+ ```