---
license: apache-2.0
pipeline_tag: text-to-image
---
# MVSplit-DiT (1000 layers)

This repository contains the weights for the 1000-layer Diffusion Transformer (DiT) presented in the paper *Mean Mode Screaming: Mean-Variance Split Residuals for 1000-Layer Diffusion Transformers*.
Project Page | GitHub Repository
## Introduction
Scaling Diffusion Transformers to extreme depths (hundreds or thousands of layers) introduces a structural vulnerability known as Mean Mode Screaming (MMS). In this state, token representations homogenize, and centered variation is suppressed, leading to model collapse.
MVSplit-DiT addresses this with Mean-Variance Split (MV-Split) Residuals, which combine a centered residual update carrying its own gain with a leaky replacement of the trunk mean. This architecture enables stable training of DiTs at extreme depths, such as the 1000-layer model provided here.
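The MV-Split update can be illustrated with a minimal sketch. This is an assumption-laden illustration rather than the paper's reference implementation: the split axis (the token dimension), the gain `gamma`, and the leak factor `lam` are hypothetical names and may differ from the released code.

```python
import numpy as np

def mv_split_residual(x, f_x, gamma=1.0, lam=0.9):
    """Sketch of an MV-Split residual update (hypothetical form).

    x:     (B, T, D) token states entering the block
    f_x:   (B, T, D) block output (attention / MLP)
    gamma: gain applied to the centered residual branch
    lam:   leak factor for the trunk-mean replacement (1.0 keeps the
           trunk mean untouched)
    """
    # Split both signals into a token-mean part and a centered part.
    mu_x = x.mean(axis=1, keepdims=True)
    mu_f = f_x.mean(axis=1, keepdims=True)
    # Centered branch: ordinary residual update with its own gain,
    # so centered variation is never suppressed by the mean mode.
    centered = (x - mu_x) + gamma * (f_x - mu_f)
    # Mean branch: leaky replacement of the trunk mean by the block mean,
    # preventing the mean mode from accumulating across 1000 layers.
    mean = lam * mu_x + (1.0 - lam) * mu_f
    return centered + mean
```

Separating the two branches lets the mean mode be damped (via `lam`) without shrinking the centered signal that carries per-token information.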
## Usage
To use this model for image generation, please refer to the official GitHub repository for installation instructions and requirements.
### Sampling
You can generate images with the `sample.py` script. The model requires the DiT checkpoint from this repository, a FLUX.2 VAE, and a Qwen3 text encoder.
```bash
# Custom prompt example
python sample.py \
  --checkpoint_path /path/to/model.pt \
  --flux_vae_path /path/to/flux2_ae.safetensors \
  --qwen_model_path Qwen/Qwen3-0.6B \
  --prompt "a red panda climbing a bamboo stalk" \
  --output_dir ./samples
```
### Key sampling flags
| Flag | Default | Meaning |
|---|---|---|
| `--image_size` | 256 | Square output side, in pixels. |
| `--num_inference_steps` | 35 | Euler steps for the flow-matching ODE. |
| `--cfg_scale` | 2.0 | Classifier-free guidance scale. |
| `--time_shift_alpha` | 4.0 | Time-shift in the flow schedule (must match training). |
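The `--time_shift_alpha` flag warps the sampling timesteps before the Euler steps are taken. A common form of this shift in flow-matching samplers is sketched below; whether `sample.py` uses exactly this formula is an assumption, so treat it as illustrative and check the repository for the authoritative schedule.

```python
def shift_time(t, alpha=4.0):
    """Warp a timestep t in [0, 1] with a rational time shift.

    Computes t' = alpha * t / (1 + (alpha - 1) * t). The endpoints 0 and 1
    are preserved; alpha > 1 pushes intermediate timesteps upward, i.e.
    concentrates integration steps in one region of the schedule.
    (Assumption: this is the shift family MVSplit-DiT trains with.)
    """
    return alpha * t / (1.0 + (alpha - 1.0) * t)
```

Because the model is trained under a particular shift, sampling with a different `alpha` evaluates the network at timesteps it was not trained to match, which is why the flag must agree with the training configuration.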
## Citation
```bibtex
@article{lu2026mms,
  title   = {Mean Mode Screaming: Mean--Variance Split Residuals for 1000-Layer Diffusion Transformers},
  author  = {Lu, Pengqi},
  journal = {arXiv preprint arXiv:2605.06169},
  year    = {2026},
}
```