Add model card for MVSplit-DiT (#1)

- Add model card for MVSplit-DiT (3f9453c584db59fcdd7319dd8dec979207db3438)

Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>

Files changed (1)

README.md +54 -3

README.md CHANGED

@@ -1,3 +1,54 @@
----
-license: apache-2.0
----

+---
+license: apache-2.0
+pipeline_tag: text-to-image
+---
+# MVSplit-DiT (1000 layers)
+This repository contains the weights for the 1000-layer Diffusion Transformer (DiT) presented in the paper [Mean Mode Screaming: Mean-Variance Split Residuals for 1000-Layer Diffusion Transformers](https://huggingface.co/papers/2605.06169).
+[Project Page](https://erwold.github.io/mv-split/) | [GitHub Repository](https://github.com/erwold/mv-split)
+## Introduction
+Scaling Diffusion Transformers to extreme depths (hundreds or thousands of layers) exposes a structural failure mode known as **Mean Mode Screaming (MMS)**, in which token representations homogenize and centered variation is suppressed, leading to model collapse.
+MVSplit-DiT addresses this with **Mean-Variance Split (MV-Split) Residuals**, which combine a centered residual update that has its own gain with a leaky replacement of the trunk mean. This design enables stable training of DiTs at extreme depth, such as the 1000-layer model provided here.
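The MV-Split description above can be made concrete with a small sketch. This is only one possible reading of that sentence, not the paper's exact formulation: it assumes the mean is taken across tokens, and the function name `mv_split_residual` and the parameters `gain` and `leak` are hypothetical. It is written in plain Python (lists of lists) so it runs without any dependencies.

```python
def token_mean(x):
    """Per-feature mean across tokens; x is a list of token vectors."""
    n = len(x)
    return [sum(col) / n for col in zip(*x)]


def mv_split_residual(x, branch_out, gain=1.0, leak=0.1):
    """Hypothetical MV-Split-style residual update (illustrative only).

    x          : trunk activations, shape (tokens, dim)
    branch_out : output of the residual branch, same shape
    gain       : assumed separate gain on the centered (mean-free) update
    leak       : assumed fraction of the trunk mean replaced by the branch mean
    """
    x_mean = token_mean(x)
    f_mean = token_mean(branch_out)
    out = []
    for x_tok, f_tok in zip(x, branch_out):
        row = []
        for xv, fv, xm, fm in zip(x_tok, f_tok, x_mean, f_mean):
            # Centered part: ordinary residual add, but with its own gain.
            centered = (xv - xm) + gain * (fv - fm)
            # Mean part: leaky replacement instead of unbounded accumulation.
            mean = (1.0 - leak) * xm + leak * fm
            row.append(centered + mean)
        out.append(row)
    return out


# A branch that outputs the same vector for every token carries no centered
# signal, so under this sketch it can only move the token mean.
x = [[0.0, 1.0], [2.0, 5.0], [4.0, 9.0]]
constant_branch = [[7.0, 7.0]] * 3
updated = mv_split_residual(x, constant_branch, gain=0.5, leak=0.25)
```

Under these assumptions, a branch output that is identical across tokens leaves the per-token differences in `x` untouched and shifts only the mean, by at most the `leak` fraction per layer.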
+## Usage
+To use this model for image generation, please refer to the official [GitHub repository](https://github.com/erwold/mv-split) for installation instructions and requirements.
+### Sampling
+You can generate images using the `sample.py` script. The model requires the DiT checkpoint from this repo, a FLUX.2 VAE, and a Qwen3 text encoder.
+```bash
+# Custom prompt example
+python sample.py \
+    --checkpoint_path /path/to/model.pt \
+    --flux_vae_path   /path/to/flux2_ae.safetensors \
+    --qwen_model_path Qwen/Qwen3-0.6B \
+    --prompt "a red panda climbing a bamboo stalk" \
+    --output_dir ./samples
+```
+### Key sampling flags
+| Flag | Default | Meaning |
+|---|---|---|
+| `--image_size` | 256 | Side length of the square output image, in pixels. |
+| `--num_inference_steps` | 35 | Number of Euler steps for the flow-matching ODE. |
+| `--cfg_scale` | 2.0 | Classifier-free guidance scale. |
+| `--time_shift_alpha` | 4.0 | Time-shift factor in the flow schedule (must match the value used in training). |
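To illustrate how the step count, guidance scale, and time shift in the table might interact, here is a minimal sketch of an Euler sampler for a flow-matching ODE. It is not taken from `sample.py`: the shift formula `alpha * t / (1 + (alpha - 1) * t)`, the integration direction from t = 0 to t = 1, and the names `shift_time`, `euler_sample`, `velocity_cond`, and `velocity_uncond` are all assumptions made for illustration. A scalar state stands in for the latent image.

```python
def shift_time(t, alpha=4.0):
    """Assumed time-shift warp on [0, 1]; alpha = 1 leaves t unchanged."""
    return alpha * t / (1.0 + (alpha - 1.0) * t)


def euler_sample(x, velocity_cond, velocity_uncond,
                 num_steps=35, cfg_scale=2.0, alpha=4.0):
    """Integrate dx/dt = v(x, t) with Euler steps on a time-shifted grid."""
    # Uniform grid warped by the time shift; endpoints stay at 0 and 1.
    ts = [shift_time(i / num_steps, alpha) for i in range(num_steps + 1)]
    for t0, t1 in zip(ts[:-1], ts[1:]):
        v_c = velocity_cond(x, t0)      # prediction with the text prompt
        v_u = velocity_uncond(x, t0)    # prediction with an empty prompt
        # Standard classifier-free guidance combination of the two.
        v = v_u + cfg_scale * (v_c - v_u)
        x = x + (t1 - t0) * v
    return x


# Toy velocity fields (constants) so the result is easy to check by hand:
# the guided velocity is 0 + 2.0 * (1 - 0) = 2, integrated over unit time.
result = euler_sample(0.0, lambda x, t: 1.0, lambda x, t: 0.0)
```

With `alpha` greater than 1, this warp makes the steps larger near t = 0 and smaller near t = 1. Which end corresponds to noise depends on the convention used in training, which is why the flag has to match the training value.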
+## Citation
+```bibtex
+@article{lu2026mms,
+  title   = {Mean Mode Screaming: Mean--Variance Split Residuals for 1000-Layer Diffusion Transformers},
+  author  = {Lu, Pengqi},
+  journal = {arXiv preprint arXiv:2605.06169},
+  year    = {2026},
+}
+```