MarcusAureliusSun xrenaa committed on
Commit
352c048
·
0 Parent(s):

Duplicate from nvidia/PiD

Co-authored-by: Xuanchi Ren <xrenaa@users.noreply.huggingface.co>

.gitattributes ADDED
@@ -0,0 +1,36 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+ figures/teaser.jpg filter=lfs diff=lfs merge=lfs -text
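Review note: the `.gitattributes` rules above decide which files in this commit are stored as Git LFS pointers rather than inline. The following is a minimal sketch of that decision, assuming Python's `fnmatch` is a close enough stand-in for git's pattern matching (it is only an approximation), and using a hand-picked subset of the patterns. The helper name `is_lfs_tracked` is hypothetical.

```python
from fnmatch import fnmatch

# A subset of the patterns added in .gitattributes (not the full list).
LFS_PATTERNS = [
    "*.pth",
    "*.pt",
    "*.safetensors",
    "*.tar.*",
    "*tfevents*",
    "figures/teaser.jpg",
]


def is_lfs_tracked(path: str, patterns=LFS_PATTERNS) -> bool:
    """Approximate check: does `path` match any of the LFS patterns?

    Simplification (assumption): patterns without a slash are compared
    against the file name only, patterns with a slash against the whole
    path. Real gitattributes matching has more rules than this.
    """
    name = path.rsplit("/", 1)[-1]
    for pattern in patterns:
        target = path if "/" in pattern else name
        if fnmatch(target, pattern):
            return True
    return False


tracked = is_lfs_tracked("checkpoints/ae.safetensors")
inline = is_lfs_tracked("checkpoints/scale_rae/decoder/XL_decoder_config.json")
```

Under these assumptions the `.safetensors`, `.pt`, and `.pth` files resolve to LFS, while the JSON configs and `README.md` stay inline, which matches how the files appear in the rest of this diff.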
README.md ADDED
@@ -0,0 +1,128 @@
+ ---
+ library_name: pytorch
+ tags:
+ - super-resolution
+ - diffusion
+ - pixel-diffusion-decoder
+ - vae-decoder
+ pipeline_tag: image-to-image
+ base_model:
+ - nvidia/PixelDiT-1300M-1024px
+ - Tongyi-MAI/Z-Image
+ - black-forest-labs/FLUX.1-dev
+ - black-forest-labs/FLUX.2-dev
+ - nyu-visionx/Scale-RAE-Qwen7B_DiT9.8B
+ ---
+
+ # PiD — Pixel Diffusion Decoder
+
+ <p align="center">
+ <img src="figures/teaser.jpg" alt="PiD teaser" width="100%">
+ </p>
+
+
+ **[Paper](https://arxiv.org/abs/2605.23902), [Project Page](https://research.nvidia.com/labs/sil/projects/pid/)**
+
+ [Yifan Lu](https://yifanlu0227.github.io/),
+ [Qi Wu](https://wilsoncernwq.github.io/),
+ [Jay Zhangjie Wu](https://zhangjiewu.github.io/),
+ [Zian Wang](https://www.cs.toronto.edu/~zianwang/),
+ [Huan Ling](https://www.cs.toronto.edu/~linghuan/),
+ [Sanja Fidler](https://www.cs.utoronto.ca/~fidler/),
+ [Xuanchi Ren](https://xuanchiren.com/) <br>
+
+
+ PiD reformulates the latent-to-pixel decoder as a conditional pixel-space
+ diffusion model, unifying decoding and upsampling into a single generative
+ module. It denoises directly in high-resolution pixel space and produces a
+ super-resolved image in one pass. This repository hosts the released decoder
+ checkpoints, plus the encoder/decoder ("VAE") weights they depend on.
+
+ All `PiD_*` checkpoints in this repo are **4-step distilled**. The non-`PiD_*`
+ entries (`ae.safetensors`, `flux2_ae.safetensors`, `sd3_vae/`, `rae/`,
+ `scale_rae/`) are **the corresponding encoder/decoder VAE weights** that PiD
+ plugs into — they're not PiD checkpoints themselves.
+
+ ### License/Terms of Use
+
+ This model is released under the [NVIDIA Internal Scientific Research and Development Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-internal-scientific-research-and-development-model-license/).
+
+ Important Note: The Model and any Derivative Model may not be distributed, deployed, sublicensed, publicly displayed, or publicly performed. You may not use the Model or a Derivative Model in a production environment or for the purpose of generating works for sale or distribution. If you fail to comply with any of the terms in this Agreement, your rights under the NVIDIA Internal Scientific Research and Development Model License will automatically terminate.
+
+ ### Deployment Geography:
+ Global
+
+ ## PiD checkpoints
+
+ Two variants are released for each diffusers-style backbone:
+
+ - **`2k`** — trained at 2048px, used as a 4× decoder (512 LDM → 2048 px), or as
+ an 8× decoder for the Scale-RAE backbone (256 → 2048).
+ - **`2kto4k`** — trained with multi-resolution data bucketing 2048→3840 and an
+ SD3-style dynamic shift; designed for 1024 LDM → 4K (3840 px) decoding. Only
+ released for the diffusers backbones.
+
+ | Path | Backbone (encoder side) | SR factor | Variant |
+ |---------------------------------------------------------------|--------------------------------------------|-----------|-----------|
+ | `checkpoints/PiD_res2k_sr4x_official_flux_distill_4step` | Flux1-dev (16-ch VAE) | 4× | 2k |
+ | `checkpoints/PiD_res2k_sr4x_official_flux2_distill_4step` | Flux2-dev (128-ch BN VAE) | 4× | 2k |
+ | `checkpoints/PiD_res2k_sr4x_official_sd3_distill_4step` | SD3 medium (16-ch VAE) | 4× | 2k |
+ | `checkpoints/PiD_res2k_sr4x_official_dinov2_distill_4step` | DINOv2-B + RAE ViT-XL (768-ch) | 4× | 2k |
+ | `checkpoints/PiD_res2k_sr8x_official_siglip_distill_4step` | SigLIP-2 So400M + Scale-RAE ViT-XL (1152) | 8× | 2k |
+ | `checkpoints/PiD_res2kto4k_sr4x_official_flux_distill_4step` | Flux1-dev (16-ch VAE) | 4× | 2kto4k |
+ | `checkpoints/PiD_res2kto4k_sr4x_official_flux2_distill_4step` | Flux2-dev (128-ch BN VAE) | 4× | 2kto4k |
+ | `checkpoints/PiD_res2kto4k_sr4x_official_sd3_distill_4step` | SD3 medium (16-ch VAE) | 4× | 2kto4k |
+
+ Z-Image shares Flux1's VAE, so its inference path reuses the `flux` checkpoints
+ (both `2k` and `2kto4k`) — no separate `zimage` checkpoint is shipped.
+
+ Each directory contains a single file, `model_ema_bf16.pth`, which is the EMA
+ weights cast to bfloat16 — the format the inference scripts load by default.
+
+ ## VAE / encoder weights
+
+ These are the per-backbone encoder (and, where applicable, original decoder)
+ weights that PiD pairs with. They're hosted here so a single download brings
+ everything needed end-to-end.
+
+ | Path | Description |
+ |---------------------------------|--------------------------------------------------------------------------------------|
+ | `checkpoints/ae.safetensors` | Flux1-dev / Z-Image 16-ch VAE (encoder + original Flux decoder). |
+ | `checkpoints/flux2_ae.safetensors` | Flux2-dev 128-ch BN VAE. |
+ | `checkpoints/sd3_vae/` | SD3 medium 16-ch VAE in diffusers format. |
+ | `checkpoints/rae/` | DINOv2-B image encoder + RAE ViT-XL decoder + ImageNet-512 normalization statistics. |
+ | `checkpoints/scale_rae/` | SigLIP-2 So400M encoder + Scale-RAE ViT-XL decoder + decoder config. |
+
+ ## Usage
+
+ The decoder checkpoints are loaded by the inference scripts in the [PiD
+ codebase](https://github.com/nv-tlabs/pid). The exact `(backbone, ckpt_type) → path` mapping is the single source
+ of truth in
+ [`pid/_src/inference/checkpoint_registry.py`](https://github.com/nv-tlabs/PiD/blob/main/pid/_src/inference/checkpoint_registry.py) — clone the
+ repo, point it at this snapshot, and the demos pick the right file
+ automatically:
+
+ ```bash
+ # Pull just the checkpoints/ tree into the repo root (skips this README and
+ # the teaser figure so they don't clobber the files in the source repo).
+ hf download nvidia/PiD --local-dir . --include "checkpoints/*"
+
+ # Then run any of the demos, e.g.:
+ PYTHONPATH=. python -m pid._src.inference.from_ldm_flux \
+ --prompt "A photorealistic cat" \
+ --ldm_inference_steps 28 --save_xt_steps 22 24 26 \
+ --output_dir ./results/demo \
+ --cfg_scale 1 --pid_inference_steps 4 --scale 4
+ ```
+
+ Pick the `2kto4k` variant via `--pid_ckpt_type 2kto4k` when decoding at 4K.
+
+ ## Citation
+ ```
+ @article{lu2026pid,
+ title={PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion},
+ author={Lu, Yifan and Wu, Qi and Wu, Jay Zhangjie and Wang, Zian and Ling, Huan and Fidler, Sanja and Ren, Xuanchi},
+ journal={arXiv preprint arXiv:2605.23902},
+ year={2026}
+ }
+ ```
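Review note: the README's Usage section refers to a `(backbone, ckpt_type) → path` mapping kept in `checkpoint_registry.py`, which is not part of this commit. As a rough illustration only, the sketch below rebuilds such a lookup from the checkpoint table in the README. The dictionary, the backbone keys (`flux`, `flux2`, `sd3`, `dinov2`, `siglip`, `zimage`), and the function `resolve_checkpoint` are all hypothetical names chosen here; the real registry may be organized differently.

```python
# Hypothetical lookup built from the README table. The actual mapping lives in
# pid/_src/inference/checkpoint_registry.py and may differ from this sketch.
CHECKPOINT_DIRS = {
    ("flux", "2k"): "checkpoints/PiD_res2k_sr4x_official_flux_distill_4step",
    ("flux2", "2k"): "checkpoints/PiD_res2k_sr4x_official_flux2_distill_4step",
    ("sd3", "2k"): "checkpoints/PiD_res2k_sr4x_official_sd3_distill_4step",
    ("dinov2", "2k"): "checkpoints/PiD_res2k_sr4x_official_dinov2_distill_4step",
    ("siglip", "2k"): "checkpoints/PiD_res2k_sr8x_official_siglip_distill_4step",
    ("flux", "2kto4k"): "checkpoints/PiD_res2kto4k_sr4x_official_flux_distill_4step",
    ("flux2", "2kto4k"): "checkpoints/PiD_res2kto4k_sr4x_official_flux2_distill_4step",
    ("sd3", "2kto4k"): "checkpoints/PiD_res2kto4k_sr4x_official_sd3_distill_4step",
}

# The README states that Z-Image shares Flux1's VAE and reuses the flux checkpoints.
BACKBONE_ALIASES = {"zimage": "flux"}

# Per the README, every checkpoint directory holds this single file.
WEIGHTS_FILE = "model_ema_bf16.pth"


def resolve_checkpoint(backbone: str, ckpt_type: str = "2k") -> str:
    """Return the relative path of the weights file for a backbone/variant pair."""
    backbone = BACKBONE_ALIASES.get(backbone, backbone)
    try:
        directory = CHECKPOINT_DIRS[(backbone, ckpt_type)]
    except KeyError:
        raise ValueError(
            f"no released checkpoint for backbone={backbone!r}, ckpt_type={ckpt_type!r}"
        ) from None
    return f"{directory}/{WEIGHTS_FILE}"


zimage_4k = resolve_checkpoint("zimage", "2kto4k")
```

The lookup fails loudly for combinations the table does not list, such as a `2kto4k` variant of the SigLIP or DINOv2 backbones.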
checkpoints/PiD_res2k_sr4x_official_dinov2_distill_4step/model_ema_bf16.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d4ce25b6fd2c953720468cd88a7c6a192b4ce908bf085659a1324c186949eab0
+ size 2731773393
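Review note: each checkpoint in this commit is added as a three-line Git LFS pointer (`version`, `oid`, `size`) rather than as the binary itself. A minimal sketch of reading such a pointer follows, using the pointer text from the hunk above. The function name `parse_lfs_pointer` and the returned dictionary layout are illustrative choices, not part of any library.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file made of `key value` lines."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Each line is a key, one space, then the value.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<hash-algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file, in bytes
    }


POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:d4ce25b6fd2c953720468cd88a7c6a192b4ce908bf085659a1324c186949eab0
size 2731773393
"""

info = parse_lfs_pointer(POINTER)
```

After downloading the real file, the `digest` and `size` fields can be compared against a locally computed SHA-256 and byte count to confirm the download is intact.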
checkpoints/PiD_res2k_sr4x_official_flux2_distill_4step/model_ema_bf16.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b0e3a7a19d4738e0b53c1267815a87e77295fddbee117b5aed802b9b62030cac
+ size 2725875153
checkpoints/PiD_res2k_sr4x_official_flux_distill_4step/model_ema_bf16.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:47dd165bf3ea85df08deb152e6fcded19ddf5a35b83832abcd99d403ffca6ac3
+ size 2724842961
checkpoints/PiD_res2k_sr4x_official_sd3_distill_4step/model_ema_bf16.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:835a069763903bfee317e524b83d0974639a6c1d9f79e00728d338fcc249fa27
+ size 2724842961
checkpoints/PiD_res2k_sr8x_official_siglip_distill_4step/model_ema_bf16.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f9b735ca95044c9d4b5777f1a398fb30efcf4c28021622f13964ba785c33495c
+ size 2735312130
checkpoints/PiD_res2kto4k_sr4x_official_flux2_distill_4step/model_ema_bf16.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ca75b0b2712f0872d8225fb0c2d520f817c7185eb4a5dbedfe2d164239df044a
+ size 2725875153
checkpoints/PiD_res2kto4k_sr4x_official_flux_distill_4step/model_ema_bf16.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2f3eabf4f2f83320472e6146f6545d8237e1423849e62148d5d656bfb571d00e
+ size 2724842961
checkpoints/PiD_res2kto4k_sr4x_official_sd3_distill_4step/model_ema_bf16.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:22d1224cae36041b61517c2145c954fca2a9624a30a01d57b7bb044304b9dc31
+ size 2724842961
checkpoints/ae.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:afc8e28272cd15db3919bacdb6918ce9c1ed22e96cb12c4d5ed0fba823529e38
+ size 335304388
checkpoints/flux2_ae.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:868fe7b343cc8f3a19dbcfcafbc3d5f888802be3f89bd81b65b3621a066ce8f3
+ size 336211292
checkpoints/rae/decoders/dinov2/wReg_base/ViTXL_n08_i512/model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f8b83b058cd84567ce1f671c9ea32d6ec0e48532c650bbf1ba04bb4739292630
+ size 1665128758
checkpoints/rae/stats/dinov2/wReg_base/imagenet1k_512/stat.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fe0ac4e914a28708ec85f0dbd3be436dcec48e556d5d745f78a13625dc8fe2c7
+ size 6292742
checkpoints/scale_rae/decoder/XL_decoder_config.json ADDED
@@ -0,0 +1,28 @@
+ {
+   "_name_or_path": "facebook/vit-mae-base",
+   "architectures": [
+     "ViTMAEForPreTraining"
+   ],
+   "attention_probs_dropout_prob": 0.0,
+   "decoder_hidden_size": 1152,
+   "decoder_intermediate_size": 4096,
+   "decoder_num_attention_heads": 16,
+   "decoder_num_hidden_layers": 28,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.0,
+   "hidden_size": 1152,
+   "image_size": 224,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "mask_ratio": 0.75,
+   "model_type": "vit_mae",
+   "norm_pix_loss": false,
+   "num_attention_heads": 12,
+   "num_channels": 3,
+   "num_hidden_layers": 12,
+   "patch_size": 14,
+   "qkv_bias": true,
+   "torch_dtype": "float32",
+   "transformers_version": "4.42.3"
+ }
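Review note: a few shape quantities follow directly from the decoder config above. The sketch below derives them under the usual ViT reading of these fields (square image, non-overlapping square patches, hidden size split evenly across heads). How the Scale-RAE decoder code actually consumes the config is not shown in this commit, so treat this as arithmetic on the listed values only. The variable names are illustrative.

```python
import json

# Only the fields needed for the arithmetic, copied from XL_decoder_config.json.
CONFIG_JSON = """
{
  "decoder_hidden_size": 1152,
  "decoder_num_attention_heads": 16,
  "decoder_num_hidden_layers": 28,
  "image_size": 224,
  "patch_size": 14
}
"""

cfg = json.loads(CONFIG_JSON)

# Assumption: the image is tiled by non-overlapping patch_size x patch_size patches.
patches_per_side = cfg["image_size"] // cfg["patch_size"]
num_patches = patches_per_side ** 2

# Assumption: the decoder width is split evenly across attention heads.
decoder_head_dim = cfg["decoder_hidden_size"] // cfg["decoder_num_attention_heads"]

print(patches_per_side, num_patches, decoder_head_dim)
```

With the values in this file, that gives a 16 × 16 patch grid (256 patches) and a per-head width of 72 in the decoder.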
checkpoints/scale_rae/decoder/siglip2_sop14_i224_web73M_ganw3_decXL.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ca7e6b907bb51455a12eea39b6acb1999c2133f325c123cda20ceb206d1ef3cb
+ size 1662529538
checkpoints/sd3_vae/vae/diffusion_pytorch_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f9b67a279283625caee39d61eacb5324243848477b4eb535355eaaa8423d4e09
+ size 167666654
config.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "input_types": [
+     "PiD"
+   ],
+   "model_size": "1.3B"
+ }
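Review note: the `"model_size": "1.3B"` entry can be loosely cross-checked against the LFS pointer sizes of the `model_ema_bf16.pth` files in this commit. The sketch below assumes every stored tensor is bfloat16 (2 bytes per parameter) and ignores serialization overhead, so it gives an upper-bound estimate rather than an exact parameter count. The helper name `estimate_params` is hypothetical.

```python
BYTES_PER_BF16 = 2  # bfloat16 stores each value in 16 bits

# Byte sizes taken from the LFS pointers of three PiD checkpoints in this commit.
CHECKPOINT_SIZES = {
    "flux_2k": 2724842961,
    "dinov2_2k": 2731773393,
    "siglip_2k": 2735312130,
}


def estimate_params(size_bytes: int, bytes_per_param: int = BYTES_PER_BF16) -> float:
    """Rough parameter count, assuming the file holds only bf16 weights."""
    return size_bytes / bytes_per_param


estimates = {
    name: round(estimate_params(size) / 1e9, 2)
    for name, size in CHECKPOINT_SIZES.items()
}
print(estimates)
```

Under those assumptions the checkpoints come out at roughly 1.36 to 1.37 billion parameters, in line with the "1.3B" figure.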
figures/teaser.jpg ADDED

Git LFS Details

  • SHA256: fb74f71364bd8fc0901650d6c7b5b8ef8efac751b7d248d2c9a3d7accf031d17
  • Pointer size: 132 Bytes
  • Size of remote file: 1.36 MB