---
license: apache-2.0
pipeline_tag: text-to-image
---
# MVSplit-DiT (1000 layers)

This repository contains the weights for the 1000-layer Diffusion Transformer (DiT) presented in the paper *Mean Mode Screaming: Mean-Variance Split Residuals for 1000-Layer Diffusion Transformers*.
Project Page | GitHub Repository
## Introduction
Scaling Diffusion Transformers to extreme depths (hundreds or thousands of layers) introduces a structural vulnerability known as Mean Mode Screaming (MMS). In this state, token representations homogenize, and centered variation is suppressed, leading to model collapse.
MVSplit-DiT addresses this with Mean-Variance Split (MV-Split) Residuals, which combine a centered residual update carrying its own gain with a leaky replacement of the trunk mean. This architecture enables stable training of DiTs at extreme depths, such as the 1000-layer model provided here.
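The MV-Split update can be illustrated with a minimal sketch. This is an assumption-laden illustration rather than the paper's reference implementation: the split axis (the token dimension), the gain `gamma`, and the leak factor `lam` are hypothetical names and may differ from the released code.

```python
import numpy as np

def mv_split_residual(x, f_x, gamma=1.0, lam=0.9):
    """Sketch of an MV-Split residual update (hypothetical form).

    x:     (B, T, D) token states entering the block
    f_x:   (B, T, D) block output (attention / MLP)
    gamma: gain applied to the centered residual branch
    lam:   leak factor for the trunk-mean replacement (1.0 keeps the
           trunk mean untouched)
    """
    # Split both signals into a token-mean part and a centered part.
    mu_x = x.mean(axis=1, keepdims=True)
    mu_f = f_x.mean(axis=1, keepdims=True)
    # Centered branch: ordinary residual update with its own gain,
    # so centered variation is never suppressed by the mean mode.
    centered = (x - mu_x) + gamma * (f_x - mu_f)
    # Mean branch: leaky replacement of the trunk mean by the block mean,
    # preventing the mean mode from accumulating across 1000 layers.
    mean = lam * mu_x + (1.0 - lam) * mu_f
    return centered + mean
```

Separating the two branches lets the mean mode be damped (via `lam`) without shrinking the centered signal that carries per-token information.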
## Usage
To use this model for image generation, please refer to the official GitHub repository for installation instructions and requirements.
### Sampling
You can generate images with the `sample.py` script. The model requires the DiT checkpoint from this repository, a FLUX.2 VAE, and a Qwen3 text encoder.
```bash
# Custom prompt example
python sample.py \
  --checkpoint_path /path/to/model.pt \
  --flux_vae_path /path/to/flux2_ae.safetensors \
  --qwen_model_path Qwen/Qwen3-0.6B \
  --prompt "a red panda climbing a bamboo stalk" \
  --output_dir ./samples
```
### Key sampling flags
| Flag | Default | Meaning |
|---|---|---|
| `--image_size` | 256 | Square output side, in pixels. |
| `--num_inference_steps` | 35 | Euler steps for the flow-matching ODE. |
| `--cfg_scale` | 2.0 | Classifier-free guidance scale. |
| `--time_shift_alpha` | 4.0 | Time-shift in the flow schedule (must match training). |
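The `--time_shift_alpha` flag warps the sampling timesteps before the Euler steps are taken. A common form of this shift in flow-matching samplers is sketched below; whether `sample.py` uses exactly this formula is an assumption, so treat it as illustrative and check the repository for the authoritative schedule.

```python
def shift_time(t, alpha=4.0):
    """Warp a timestep t in [0, 1] with a rational time shift.

    Computes t' = alpha * t / (1 + (alpha - 1) * t). The endpoints 0 and 1
    are preserved; alpha > 1 pushes intermediate timesteps upward, i.e.
    concentrates integration steps in one region of the schedule.
    (Assumption: this is the shift family MVSplit-DiT trains with.)
    """
    return alpha * t / (1.0 + (alpha - 1.0) * t)
```

Because the model is trained under a particular shift, sampling with a different `alpha` evaluates the network at timesteps it was not trained to match, which is why the flag must agree with the training configuration.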
## Citation
```bibtex
@article{lu2026mms,
  title   = {Mean Mode Screaming: Mean--Variance Split Residuals for 1000-Layer Diffusion Transformers},
  author  = {Lu, Pengqi},
  journal = {arXiv preprint arXiv:2605.06169},
  year    = {2026},
}
```