nielsr (HF Staff) committed
Commit 3f9453c · verified · 1 Parent(s): 0b8ddc4

Add model card for MVSplit-DiT

Hi! I'm Niels from the Hugging Face community team. I've updated the model card to include:
- Relevant metadata (`pipeline_tag: text-to-image`).
- Links to the paper, GitHub repository, and project page.
- A brief description of the architecture and the "Mean Mode Screaming" problem it solves.
- Sample usage instructions for image generation as found in your documentation.
- The BibTeX citation.

This will help users discover and use your 1000-layer DiT more effectively!

Files changed (1)
  1. README.md +54 -3
README.md CHANGED
@@ -1,3 +1,54 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ pipeline_tag: text-to-image
+ ---
+
+ # MVSplit-DiT (1000 layers)
+
+ This repository contains the weights for the 1000-layer Diffusion Transformer (DiT) presented in the paper [Mean Mode Screaming: Mean--Variance Split Residuals for 1000-Layer Diffusion Transformers](https://huggingface.co/papers/2605.06169).
+
+ [Project Page](https://erwold.github.io/mv-split/) | [GitHub Repository](https://github.com/erwold/mv-split)
+
+ ## Introduction
+
+ Scaling Diffusion Transformers to extreme depths (hundreds or thousands of layers) introduces a structural vulnerability known as **Mean Mode Screaming (MMS)**. In this state, token representations homogenize, and centered variation is suppressed, leading to model collapse.
+
+ MVSplit-DiT addresses this by using **Mean-Variance Split (MV-Split) Residuals**, which combine a separately gained centered residual update with a leaky trunk-mean replacement. This architecture enables the stable training of DiTs at boundary scales, such as the 1000-layer model provided here.
+
+ ## Usage
+
+ To use this model for image generation, please refer to the official [GitHub repository](https://github.com/erwold/mv-split) for installation instructions and requirements.
+
+ ### Sampling
+
+ You can generate images using the `sample.py` script. The model requires the DiT checkpoint from this repo, a FLUX.2 VAE, and a Qwen3 text encoder.
+
+ ```bash
+ # Custom prompt example
+ python sample.py \
+   --checkpoint_path /path/to/model.pt \
+   --flux_vae_path /path/to/flux2_ae.safetensors \
+   --qwen_model_path Qwen/Qwen3-0.6B \
+   --prompt "a red panda climbing a bamboo stalk" \
+   --output_dir ./samples
+ ```
+
+ ### Key sampling flags
+
+ | Flag | Default | Meaning |
+ |---|---|---|
+ | `--image_size` | 256 | Square output side in pixels. |
+ | `--num_inference_steps` | 35 | Euler steps for the flow-matching ODE. |
+ | `--cfg_scale` | 2.0 | Classifier-free guidance. |
+ | `--time_shift_alpha` | 4.0 | Time-shift in the flow schedule (must match training). |
+
+ ## Citation
+
+ ```bibtex
+ @article{lu2026mms,
+   title = {Mean Mode Screaming: Mean--Variance Split Residuals for 1000-Layer Diffusion Transformers},
+   author = {Lu, Pengqi},
+   journal = {arXiv preprint arXiv:2605.06169},
+   year = {2026},
+ }
+ ```
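
Reviewer note on the "Key sampling flags" table in the diff above: the three numeric flags (`--num_inference_steps`, `--cfg_scale`, `--time_shift_alpha`) correspond to three generic ingredients of a flow-matching sampler. The sketch below is only an illustration of how such ingredients usually fit together, not the code in `sample.py`. The shift formula `alpha * t / (1 + (alpha - 1) * t)`, the convention that `t = 1` is noise and `t = 0` is the image, and every function name here are assumptions that do not come from the repository.

```python
from typing import Callable, List

Vector = List[float]
# Hypothetical model interface: velocity_fn(x, t, conditional) -> velocity vector.
VelocityFn = Callable[[Vector, float, bool], Vector]


def shift_time(t: float, alpha: float = 4.0) -> float:
    """Assumed time-shift: warps t in [0, 1]; alpha = 1 leaves t unchanged."""
    return alpha * t / (1.0 + (alpha - 1.0) * t)


def make_schedule(num_steps: int = 35, alpha: float = 4.0) -> List[float]:
    """Return num_steps + 1 shifted times running from 1.0 (noise) to 0.0 (image)."""
    return [shift_time(1.0 - i / num_steps, alpha) for i in range(num_steps + 1)]


def guided_velocity(v_cond: Vector, v_uncond: Vector, cfg_scale: float = 2.0) -> Vector:
    """Classifier-free guidance: extrapolate from the unconditional prediction."""
    return [u + cfg_scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def euler_sample(
    velocity_fn: VelocityFn,
    x: Vector,
    num_steps: int = 35,
    cfg_scale: float = 2.0,
    alpha: float = 4.0,
) -> Vector:
    """Integrate dx/dt = v(x, t) from t = 1 to t = 0 with explicit Euler steps."""
    ts = make_schedule(num_steps, alpha)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        v = guided_velocity(
            velocity_fn(x, t_cur, True),   # prompt-conditioned prediction
            velocity_fn(x, t_cur, False),  # unconditional prediction
            cfg_scale,
        )
        dt = t_next - t_cur  # negative, since time runs from 1 down to 0
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy check with a constant velocity field (noise - target), which Euler
# integrates exactly, so the sampler should land on the target.
toy_target = [1.0, -2.0]
toy_noise = [0.5, 0.5]


def toy_velocity(x: Vector, t: float, conditional: bool) -> Vector:
    return [n - g for n, g in zip(toy_noise, toy_target)]


print(euler_sample(toy_velocity, list(toy_noise)))
```

With `alpha > 1` this assumed shift places more of the 35 steps near the noisy end of the trajectory. Whatever formula the repository actually uses, the table's note that the value must match training still applies.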