 ---
+license: mit
+library_name: pytorch
+pipeline_tag: unconditional-image-generation
+tags:
+  - diffusion
+  - flow-matching
+  - representation-learning
+  - dinov2
+  - imagenet
+datasets:
+  - imagenet-1k
 ---
+# RiT-XL: Vanilla Diffusion Transformers Are Enough in Representation Space
+This repository hosts the released **RiT-XL** checkpoint, saved at epoch 740 of
+an 800-epoch run on ImageNet 256×256 with frozen DINOv2-Small features.
+[![GitHub](https://img.shields.io/badge/GitHub-lezhang7%2FRiT-181717.svg)](https://github.com/lezhang7/RiT)
+[![Paper](https://img.shields.io/badge/Paper-arXiv-b31b1b.svg)](https://arxiv.org/)
+## Results on ImageNet 256×256
+| Method                   | Encoder        | Params |  FID ↓ (CFG=1) | FID ↓ (CFG≈3.7) |
+|--------------------------|---------------:|-------:|---------------:|----------------:|
+| DiT-XL                   | SD-VAE         |  675M  | 9.62           | 2.27            |
+| SiT-XL                   | SD-VAE         |  675M  | 8.61           | 2.06            |
+| REPA-XL                  | SD-VAE         |  675M  | 5.78           | 1.29            |
+| DDT-XL                   | SD-VAE         |  675M  | 6.27           | 1.26            |
+| REG-XL                   | SD-VAE         |  675M  | 1.80           | 1.36            |
+| RAE-XL                   | DINOv2-S       |  676M  | 1.87           | 1.41            |
+| RAE-XL<sup>DH</sup>      | DINOv2-B       |  839M  | 1.51           | 1.16            |
+| FAE-XL                   | FAE-DINOv2-G   |  675M  | 1.48           | 1.29            |
+| **RiT-XL (ours)**        | **DINOv2-S**   | **676M** | **1.45**     | **1.14**        |
+All FIDs use 25 Heun steps with the time-shift schedule.
+Few-step generation (no distillation, no consistency training):
+| Heun steps     |  5   | 10   | 25   | 50   |
+|---------------:|-----:|-----:|-----:|-----:|
+| FID (CFG=1.0)  | 2.44 | 1.59 | 1.47 | 1.46 |
+| FID (CFG=3.7)  | 1.99 | 1.27 | 1.15 | 1.15 |
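For readers unfamiliar with the sampler named in the table, the following is a minimal, generic sketch of Heun's second-order method on a scalar ODE. It is not the repository's sampler: the names `heun_integrate` and `uniform_grid`, the uniform time grid, and the toy ODE dx/dt = -x are all illustrative assumptions. It only shows why accuracy tends to saturate once the step count is moderately large, with each step costing two evaluations of the velocity function.

```python
import math
from typing import Callable, List, Sequence


def heun_integrate(f: Callable[[float, float], float], x0: float, ts: Sequence[float]) -> float:
    """Integrate dx/dt = f(x, t) along the time grid `ts` with Heun's method."""
    x = x0
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        h = t_next - t_cur
        k1 = f(x, t_cur)              # slope at the start of the step
        x_euler = x + h * k1          # Euler predictor
        k2 = f(x_euler, t_next)       # slope at the predicted end point
        x = x + 0.5 * h * (k1 + k2)   # trapezoidal corrector
    return x


def uniform_grid(steps: int, t0: float = 0.0, t1: float = 1.0) -> List[float]:
    """Hypothetical helper: `steps` equal intervals between t0 and t1."""
    return [t0 + (t1 - t0) * i / steps for i in range(steps + 1)]


def decay(x: float, t: float) -> float:
    """Toy velocity field with the known solution x(t) = exp(-t)."""
    return -x


for steps in (5, 25):
    approx = heun_integrate(decay, 1.0, uniform_grid(steps))
    print(steps, abs(approx - math.exp(-1.0)))
```

Because `heun_integrate` accepts an arbitrary grid, a non-uniform (for example time-shifted) list of time points could be passed in place of `uniform_grid`.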
+## Quick start
+The full training/inference code lives at
+[**lezhang7/RiT**](https://github.com/lezhang7/RiT). The eval script auto-pulls
+this checkpoint plus the matching RAE decoder on first run:
+```bash
+git clone https://github.com/lezhang7/RiT.git
+cd RiT
+pip install -r requirements.txt
+bash scripts/eval.sh        # CFG=3.7, FID ~1.14 on ImageNet 256x256
+```
+To download just the weights manually:
+```python
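+import torch
+from huggingface_hub import hf_hub_download
+
+# Download the released checkpoint (or reuse the locally cached copy).
+ckpt = hf_hub_download(repo_id="le723z/RiT", filename="checkpoint-last.pth")
+
+# weights_only=False is needed here because the file also pickles non-tensor
+# objects such as the argparse namespace under `args`; only load trusted files.
+state = torch.load(ckpt, map_location="cpu", weights_only=False)
+
+# state['model'] holds the trained parameters; state['model_ema1'] and
+# state['model_ema2'] hold the two EMA copies (decay 0.9999 and 0.9996).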
+```
+## Checkpoint contents
+`checkpoint-last.pth` is a PyTorch checkpoint produced after 740 training
+epochs (the released model used for the paper's headline numbers). Top-level
+keys:
+- `model` — main parameters of the `Denoiser` (RiT-XL backbone).
+- `model_ema1` — EMA decay 0.9999 (used for sampling by default).
+- `model_ema2` — EMA decay 0.9996 (tracked but unused at inference).
+- `optimizer` — AdamW state for resuming training.
+- `epoch` — `740`.
+- `args` — argparse namespace from the original training run (legacy
+  `JiT-RAE-XL/16` model name; the architecture matches the released
+  `RiT-XL/16`).
+Loading uses only `model` / `model_ema*`, so the legacy `args` field can be
+ignored; `eval.sh` constructs the model from its CLI flags.
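As a minimal sketch of how the keys listed above might be consumed, the helper below returns the first available weight set from an already-loaded checkpoint dictionary. The function name `select_weights` and the preference order are illustrative assumptions, not part of the released code; only the key names come from this card.

```python
from typing import Any, Mapping, Sequence, Tuple


def select_weights(
    state: Mapping[str, Any],
    prefer: Sequence[str] = ("model_ema1", "model_ema2", "model"),
) -> Tuple[str, Any]:
    """Return (key, parameter dict) for the first preferred key present in `state`."""
    for key in prefer:
        if key in state:
            return key, state[key]
    raise KeyError(f"none of {list(prefer)} found; available keys: {sorted(state)}")


# Stand-in for the dictionary returned by torch.load(...); values are placeholders.
dummy_state = {"model": {"w": 0.0}, "model_ema1": {"w": 1.0}, "epoch": 740}
chosen_key, weights = select_weights(dummy_state)
print(chosen_key)
```

The returned dictionary would then be passed to `load_state_dict` on a model built by the repository's own code.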
+## Model details
+- **Architecture:** vanilla Diffusion Transformer — 28 layers, hidden 1152,
+  16 heads, SwiGLU FFN, RMSNorm, QK-norm, 2D VisionRoPE, 32 in-context class
+  tokens, joint [CLS]-patch modeling.
+- **Encoder (frozen):** `facebook/dinov2-with-registers-small` (d=384).
+- **Decoder (frozen):** ViT-MAE-style decoder from
+  [nyu-visionx/RAE-collections](https://huggingface.co/nyu-visionx/RAE-collections),
+  variant `decoders/dinov2/wReg_small/ViTXL_n08/model.pt`.
+- **Parameters (denoiser only):** 676M.
+- **Training:** 8×H200, effective batch size 1536, AdamW with lr=5e-5, 800
+  epochs (this checkpoint: epoch 740), x-prediction loss, dimension-aware time
+  shift (s ≈ 4.9), [CLS] auxiliary loss weight λ=0.2.
+- **Sampling defaults:** Heun, 25 steps, time-shift schedule, CFG=3.7 in
+  interval [0.1, 0.98], coupled-noise initialization for [CLS].
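The shift value quoted above is consistent with the dimension-dependent rule s = sqrt(m / 4096) used in RAE, if m is taken to be 256 patch tokens × 384 channels. The sketch below checks that arithmetic and illustrates an interval-limited CFG combination. The base dimension of 4096, the token count, the SD3-style form of the shift map, and the choice to fall back to the conditional prediction outside the interval are assumptions about the training and sampling code rather than statements taken from it; all function names are hypothetical.

```python
import math
from typing import Tuple


def shift_factor(num_tokens: int, channels: int, base_dim: int = 4096) -> float:
    """Assumed dimension-aware shift: sqrt(latent dimension / base dimension)."""
    return math.sqrt(num_tokens * channels / base_dim)


def shift_time(t: float, s: float) -> float:
    """Assumed SD3-style time shift; maps [0, 1] onto [0, 1], fixing both ends."""
    return s * t / (1.0 + (s - 1.0) * t)


def guided_prediction(
    cond: float,
    uncond: float,
    t: float,
    scale: float = 3.7,
    interval: Tuple[float, float] = (0.1, 0.98),
) -> float:
    """Apply classifier-free guidance only while t lies inside `interval`."""
    lo, hi = interval
    if lo <= t <= hi:
        return uncond + scale * (cond - uncond)
    return cond  # assumed: plain conditional prediction outside the interval


s = shift_factor(num_tokens=256, channels=384)
print(round(s, 2))  # 4.9
```

Scalars stand in for model outputs here; the same `guided_prediction` expression applies elementwise to tensors.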
+## Citation
+```bibtex
+@article{zhang2025rit,
+  title  = {RiT: Vanilla Diffusion Transformers Are Enough in Representation Space},
+  author = {Zhang, Le and Mang, Ning and Agrawal, Aishwarya},
+  year   = {2025}
+}
+```
+## Acknowledgments
+This release reuses the frozen DINOv2 encoder and ViT decoder pairing from
+[**RAE**](https://github.com/bytetriper/RAE) and adopts the modernized DiT
+block design and in-context class tokens from [**JiT**](https://github.com/LTH14/JiT).