Improve model card: add paper link and sample usage (#1)

- Improve model card: add paper link and sample usage (7eb882328f860d690f9c50438625886f63167539)

Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>

README.md +40 -67

@@ -1,24 +1,25 @@
 ---
-license: mit
 library_name: pytorch
 pipeline_tag: unconditional-image-generation
 tags:
-  - diffusion
-  - flow-matching
-  - representation-learning
-  - dinov2
-  - imagenet
-datasets:
-  - imagenet-1k
 ---
-# RiT-XL: Vanilla Diffusion Transformers Are Enough in Representation Space
-This repository hosts the released **RiT-XL** checkpoint trained for 800 epochs
-on ImageNet 256×256 with frozen DINOv2-Small features.
 [![GitHub](https://img.shields.io/badge/GitHub-lezhang7%2FRiT-181717.svg)](https://github.com/lezhang7/RiT)
-[![Paper](https://img.shields.io/badge/Paper-arXiv-b31b1b.svg)](https://arxiv.org/)
 ## Results on ImageNet 256×256
@@ -36,83 +37,55 @@ on ImageNet 256×256 with frozen DINOv2-Small features.
 All FIDs use 25 Heun steps with the time-shift schedule.
-Few-step generation (no distillation, no consistency training):
-| Heun steps     |  5   | 10   | 25   | 50   |
-|---------------:|-----:|-----:|-----:|-----:|
-| FID (CFG=1.0)  | 2.44 | 1.59 | 1.47 | 1.46 |
-| FID (CFG=3.7)  | 1.99 | 1.27 | 1.15 | 1.15 |
-## Quick start
-The full training/inference code lives at
-[**lezhang7/RiT**](https://github.com/lezhang7/RiT). The eval script auto-pulls
-this checkpoint plus the matching RAE decoder on first run:
-```bash
-git clone https://github.com/lezhang7/RiT.git
-cd RiT
-pip install -r requirements.txt
-bash scripts/eval.sh        # CFG=3.7, FID ~1.14 on ImageNet 256x256
-```
-To download just the weights manually:
 ```python
 from huggingface_hub import hf_hub_download
 ckpt = hf_hub_download(repo_id="le723z/RiT", filename="checkpoint-last.pth")
-import torch
 state = torch.load(ckpt, map_location="cpu", weights_only=False)
 # state['model'] / state['model_ema1'] / state['model_ema2'] are the
 # trainable + two EMA-decay parameter dictionaries.
 ```
-## Checkpoint contents
-`checkpoint-last.pth` is a PyTorch checkpoint produced after 740 training
-epochs (the released model used for the paper's headline numbers). Top-level
-keys:
-- `model` — main parameters of the `Denoiser` (RiT-XL backbone).
-- `model_ema1` — EMA decay 0.9999 (used for sampling by default).
-- `model_ema2` — EMA decay 0.9996 (tracked but unused at inference).
-- `optimizer` — AdamW state for resuming training.
-- `epoch` — `740`.
-- `args` — argparse namespace from the original training run (legacy
-  `JiT-RAE-XL/16` model name; the architecture matches the released
-  `RiT-XL/16`).
-Loading uses only `model` / `model_ema*`, so the legacy `args` field does not
-matter — `eval.sh` constructs the model from the CLI flags.
 ## Model details
-- **Architecture:** vanilla Diffusion Transformer — 28 layers, hidden 1152,
-  16 heads, SwiGLU FFN, RMSNorm, QK-norm, 2D VisionRoPE, 32 in-context class
-  tokens, joint [CLS]-patch modeling.
 - **Encoder (frozen):** `facebook/dinov2-with-registers-small` (d=384).
-- **Decoder (frozen):** ViT-MAE-style decoder from
-  [nyu-visionx/RAE-collections](https://huggingface.co/nyu-visionx/RAE-collections),
-  variant `decoders/dinov2/wReg_small/ViTXL_n08/model.pt`.
 - **Parameters (denoiser only):** 676M.
-- **Training:** 8×H200, batch 1536 effective, AdamW lr=5e-5, 800 epochs (this
-  ckpt: epoch 740), x-prediction loss, dimension-aware time shift
-  (s ≈ 4.9), CLS auxiliary loss weight λ=0.2.
-- **Sampling defaults:** Heun, 25 steps, time-shift schedule, CFG=3.7 in
-  interval [0.1, 0.98], coupled-noise initialization for [CLS].
 ## Citation
 ```bibtex
-@article{zhang2025rit,
-  title  = {RiT: Vanilla Diffusion Transformers Are Enough in Representation Space},
-  author = {Zhang, Le and Mang, Ning and Agrawal, Aishwarya},
-  year   = {2025}
 }
 ```
 ## Acknowledgments
-This release reuses the frozen DINOv2 encoder + ViT decoder pairing from
-[**RAE**](https://github.com/bytetriper/RAE) and adopts the modernized DiT
-block design + in-context class tokens from [**JiT**](https://github.com/LTH14/JiT).

 ---
+datasets:
+- imagenet-1k
 library_name: pytorch
+license: mit
 pipeline_tag: unconditional-image-generation
 tags:
+- diffusion
+- flow-matching
+- representation-learning
+- dinov2
+- imagenet
 ---
+# RiT-XL: Vanilla Diffusion Transformers Suffice in Representation Space
+This repository hosts the released **RiT-XL** checkpoint trained for 800 epochs on ImageNet 256×256 with frozen DINOv2-Small features.
+RiT (Representation Image Transformer) is a vanilla Diffusion Transformer that effectively models distributions in high-dimensional representation spaces, as presented in the paper [RiT: Vanilla Diffusion Transformers Suffice in Representation Space](https://huggingface.co/papers/2605.21981).
 [![GitHub](https://img.shields.io/badge/GitHub-lezhang7%2FRiT-181717.svg)](https://github.com/lezhang7/RiT)
+[![Paper](https://img.shields.io/badge/Paper-arXiv-b31b1b.svg)](https://huggingface.co/papers/2605.21981)
 ## Results on ImageNet 256×256
 All FIDs use 25 Heun steps with the time-shift schedule.
+## Sample Usage
+The full training/inference code is available at [lezhang7/RiT](https://github.com/lezhang7/RiT). To download the weights manually and load them in PyTorch:
 ```python
+import torch
 from huggingface_hub import hf_hub_download
+# Download the checkpoint
 ckpt = hf_hub_download(repo_id="le723z/RiT", filename="checkpoint-last.pth")
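+# Load the full checkpoint dictionary. weights_only=False is needed because the
+# file also stores non-tensor objects (e.g. the argparse namespace under 'args').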
 state = torch.load(ckpt, map_location="cpu", weights_only=False)
 # state['model'] / state['model_ema1'] / state['model_ema2'] are the
 # trainable + two EMA-decay parameter dictionaries.
+# state['model_ema1'] holds the EMA weights with decay 0.9999 (used for sampling by default).
+model_weights = state['model_ema1']
 ```
+To run the evaluation script (which auto-pulls this checkpoint plus the matching RAE decoder):
+```bash
+git clone https://github.com/lezhang7/RiT.git
+cd RiT
+pip install -r requirements.txt
+bash scripts/eval.sh        # CFG=3.7, FID ~1.14 on ImageNet 256x256
+```
 ## Model details
+- **Architecture:** vanilla Diffusion Transformer — 28 layers, hidden 1152, 16 heads, SwiGLU FFN, RMSNorm, QK-norm, 2D VisionRoPE, 32 in-context class tokens, joint [CLS]-patch modeling.
 - **Encoder (frozen):** `facebook/dinov2-with-registers-small` (d=384).
+- **Decoder (frozen):** ViT-MAE-style decoder from [nyu-visionx/RAE-collections](https://huggingface.co/nyu-visionx/RAE-collections), variant `decoders/dinov2/wReg_small/ViTXL_n08/model.pt`.
 - **Parameters (denoiser only):** 676M.
+- **Training:** 8×H200, batch 1536 effective, AdamW lr=5e-5, 800 epochs, x-prediction loss, dimension-aware time shift (s ≈ 4.9), CLS auxiliary loss weight λ=0.2.
+- **Sampling defaults:** Heun, 25 steps, time-shift schedule, CFG=3.7 in interval [0.1, 0.98], coupled-noise initialization for [CLS].
 ## Citation
 ```bibtex
+@article{zhang2026rit,
+  title   = {RiT: Vanilla Diffusion Transformers Suffice in Representation Space},
+  author  = {Zhang, Le and Mang, Ning and Agrawal, Aishwarya},
+  journal = {arXiv preprint arXiv:2605.21981},
+  year    = {2026}
 }
 ```
 ## Acknowledgments
+This release reuses the frozen DINOv2 encoder + ViT decoder pairing from [**RAE**](https://github.com/bytetriper/RAE) and adopts the modernized DiT block design + in-context class tokens from [**JiT**](https://github.com/LTH14/JiT).
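
Note on the sample usage above: the snippet indexes `state['model_ema1']` directly. A minimal sketch of a more defensive selection is shown here, assuming only the three top-level keys named in the README (`model`, `model_ema1`, `model_ema2`). The helper `pick_weights` is hypothetical and is not part of the RiT repository; the stand-in dictionary replaces a real `torch.load` result so the example runs without downloading the checkpoint.

```python
from typing import Any, Mapping, Sequence, Tuple


def pick_weights(
    state: Mapping[str, Any],
    preferred: Sequence[str] = ("model_ema1", "model_ema2", "model"),
) -> Tuple[str, Any]:
    """Return (key, weights) for the first preferred key present in `state`.

    Hypothetical helper for illustration only. The default order follows the
    README, which describes `model_ema1` as the default for sampling.
    """
    for key in preferred:
        if key in state:
            return key, state[key]
    raise KeyError(f"none of {list(preferred)} found; available keys: {list(state)}")


if __name__ == "__main__":
    # Stand-in for the dictionary returned by torch.load(ckpt, ...).
    fake_state = {"model": {"w": 0.0}, "model_ema1": {"w": 1.0}, "epoch": 740}
    key, weights = pick_weights(fake_state)
    print(f"using weights from '{key}'")
```

With a real checkpoint, the returned dictionary would then be passed to the model's `load_state_dict`; how the model itself is constructed is left to the repository's own scripts.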