xrenaa committed 1 day ago

Commit

352c048

0 Parent(s):

Duplicate from nvidia/PiD


Co-authored-by: Xuanchi Ren <xrenaa@users.noreply.huggingface.co>

Files changed (19)

.gitattributes +36 -0
README.md +128 -0
checkpoints/PiD_res2k_sr4x_official_dinov2_distill_4step/model_ema_bf16.pth +3 -0
checkpoints/PiD_res2k_sr4x_official_flux2_distill_4step/model_ema_bf16.pth +3 -0
checkpoints/PiD_res2k_sr4x_official_flux_distill_4step/model_ema_bf16.pth +3 -0
checkpoints/PiD_res2k_sr4x_official_sd3_distill_4step/model_ema_bf16.pth +3 -0
checkpoints/PiD_res2k_sr8x_official_siglip_distill_4step/model_ema_bf16.pth +3 -0
checkpoints/PiD_res2kto4k_sr4x_official_flux2_distill_4step/model_ema_bf16.pth +3 -0
checkpoints/PiD_res2kto4k_sr4x_official_flux_distill_4step/model_ema_bf16.pth +3 -0
checkpoints/PiD_res2kto4k_sr4x_official_sd3_distill_4step/model_ema_bf16.pth +3 -0
checkpoints/ae.safetensors +3 -0
checkpoints/flux2_ae.safetensors +3 -0
checkpoints/rae/decoders/dinov2/wReg_base/ViTXL_n08_i512/model.pt +3 -0
checkpoints/rae/stats/dinov2/wReg_base/imagenet1k_512/stat.pt +3 -0
checkpoints/scale_rae/decoder/XL_decoder_config.json +28 -0
checkpoints/scale_rae/decoder/siglip2_sop14_i224_web73M_ganw3_decXL.pt +3 -0
checkpoints/sd3_vae/vae/diffusion_pytorch_model.safetensors +3 -0
config.json +6 -0
figures/teaser.jpg +3 -0

.gitattributes ADDED

	@@ -0,0 +1,36 @@

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
+figures/teaser.jpg filter=lfs diff=lfs merge=lfs -text
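
The patterns above decide which of the 19 files in this commit are stored as Git LFS pointers (the `+3 -0` entries) and which are stored as plain text. The sketch below approximates that decision with Python's `fnmatch`. It is only an approximation: real gitattributes matching follows gitignore-style rules, only a subset of the patterns is reproduced, and the helper name `is_lfs_tracked` is hypothetical.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Subset of the patterns from the .gitattributes hunk above.
LFS_PATTERNS = ["*.pt", "*.pth", "*.safetensors", "figures/teaser.jpg"]


def is_lfs_tracked(path: str, patterns=LFS_PATTERNS) -> bool:
    """Approximate gitattributes matching.

    Patterns without a slash are tested against the basename,
    patterns with a slash against the full repository-relative path.
    """
    name = PurePosixPath(path).name
    for pattern in patterns:
        target = path if "/" in pattern else name
        if fnmatch(target, pattern):
            return True
    return False
```

Under these assumptions, `checkpoints/ae.safetensors` and `figures/teaser.jpg` match, while `README.md` and `config.json` do not, which agrees with the line counts in the file list.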

README.md ADDED

	@@ -0,0 +1,128 @@

+---
+library_name: pytorch
+tags:
+- super-resolution
+- diffusion
+- pixel-diffusion-decoder
+- vae-decoder
+pipeline_tag: image-to-image
+base_model:
+- nvidia/PixelDiT-1300M-1024px
+- Tongyi-MAI/Z-Image
+- black-forest-labs/FLUX.1-dev
+- black-forest-labs/FLUX.2-dev
+- nyu-visionx/Scale-RAE-Qwen7B_DiT9.8B
+---
+# PiD — Pixel Diffusion Decoder
+<p align="center">
+  <img src="figures/teaser.jpg" alt="PiD teaser" width="100%">
+</p>
+**[Paper](https://arxiv.org/abs/2605.23902), [Project Page](https://research.nvidia.com/labs/sil/projects/pid/)**
+[Yifan Lu](https://yifanlu0227.github.io/),
+[Qi Wu](https://wilsoncernwq.github.io/),
+[Jay Zhangjie Wu](https://zhangjiewu.github.io/),
+[Zian Wang](https://www.cs.toronto.edu/~zianwang/),
+[Huan Ling](https://www.cs.toronto.edu/~linghuan/),
+[Sanja Fidler](https://www.cs.utoronto.ca/~fidler/),
+[Xuanchi Ren](https://xuanchiren.com/) <br>
+PiD reformulates the latent-to-pixel decoder as a conditional pixel-space
+diffusion model, unifying decoding and upsampling into a single generative
+module. It denoises directly in high-resolution pixel space and produces a
+super-resolved image in one pass. This repository hosts the released decoder
+checkpoints, plus the encoder/decoder ("VAE") weights they depend on.
+All `PiD_*` checkpoints in this repo are **4-step distilled**. The non-`PiD_*`
+entries (`ae.safetensors`, `flux2_ae.safetensors`, `sd3_vae/`, `rae/`,
+`scale_rae/`) are **the corresponding encoder/decoder VAE weights** that PiD
+plugs into — they're not PiD checkpoints themselves.
+### License/Terms of Use
+This model is released under the [NVIDIA Internal Scientific Research and Development Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-internal-scientific-research-and-development-model-license/).
+Important Note: The Model and any Derivative Model may not be distributed, deployed, sublicensed, publicly displayed, or publicly performed. You may not use the Model or a Derivative Model in a production environment or for the purpose of generating works for sale or distribution. If you fail to comply with any of the terms in this Agreement, your rights under the NVIDIA Internal Scientific Research and Development Model License will automatically terminate.
+### Deployment Geography:
+Global
+## PiD checkpoints
+Two variants are released; the diffusers-style backbones (Flux1-dev, Flux2-dev, SD3 medium) have both, while the DINOv2 and SigLIP-2 backbones have `2k` only:
+- **`2k`** — trained at 2048px, used as a 4× decoder (512 LDM → 2048 px), or as
+  an 8× decoder for the Scale-RAE backbone (256 px → 2048 px).
+- **`2kto4k`** — trained with multi-resolution data bucketing 2048→3840 and an
+  SD3-style dynamic shift; designed for 1024 LDM → 4K (3840 px) decoding. Only
+  released for the diffusers backbones.
+| Path                                                          | Backbone (encoder side)                    | SR factor | Variant   |
+|---------------------------------------------------------------|--------------------------------------------|-----------|-----------|
+| `checkpoints/PiD_res2k_sr4x_official_flux_distill_4step`      | Flux1-dev (16-ch VAE)                      | 4×        | 2k        |
+| `checkpoints/PiD_res2k_sr4x_official_flux2_distill_4step`     | Flux2-dev (128-ch BN VAE)                  | 4×        | 2k        |
+| `checkpoints/PiD_res2k_sr4x_official_sd3_distill_4step`       | SD3 medium (16-ch VAE)                     | 4×        | 2k        |
+| `checkpoints/PiD_res2k_sr4x_official_dinov2_distill_4step`    | DINOv2-B + RAE ViT-XL (768-ch)             | 4×        | 2k        |
+| `checkpoints/PiD_res2k_sr8x_official_siglip_distill_4step`    | SigLIP-2 So400M + Scale-RAE ViT-XL (1152)  | 8×        | 2k        |
+| `checkpoints/PiD_res2kto4k_sr4x_official_flux_distill_4step`  | Flux1-dev (16-ch VAE)                      | 4×        | 2kto4k    |
+| `checkpoints/PiD_res2kto4k_sr4x_official_flux2_distill_4step` | Flux2-dev (128-ch BN VAE)                  | 4×        | 2kto4k    |
+| `checkpoints/PiD_res2kto4k_sr4x_official_sd3_distill_4step`   | SD3 medium (16-ch VAE)                     | 4×        | 2kto4k    |
+Z-Image shares Flux1's VAE, so its inference path reuses the `flux` checkpoints
+(both `2k` and `2kto4k`) — no separate `zimage` checkpoint is shipped.
+Each directory contains a single file, `model_ema_bf16.pth`, which is the EMA
+weights cast to bfloat16 — the format the inference scripts load by default.
+## VAE / encoder weights
+These are the per-backbone encoder (and, where applicable, original decoder)
+weights that PiD pairs with. They're hosted here so a single download brings
+everything needed end-to-end.
+| Path                               | Description                                                                          |
+|------------------------------------|--------------------------------------------------------------------------------------|
+| `checkpoints/ae.safetensors`       | Flux1-dev / Z-Image 16-ch VAE (encoder + original Flux decoder).                     |
+| `checkpoints/flux2_ae.safetensors` | Flux2-dev 128-ch BN VAE.                                                             |
+| `checkpoints/sd3_vae/`             | SD3 medium 16-ch VAE in diffusers format.                                            |
+| `checkpoints/rae/`                 | RAE ViT-XL decoder for DINOv2-B features + ImageNet-512 normalization statistics.    |
+| `checkpoints/scale_rae/`           | Scale-RAE ViT-XL decoder for SigLIP-2 So400M features + decoder config.              |
+## Usage
+The decoder checkpoints are loaded by the inference scripts in the [PiD
+codebase](https://github.com/nv-tlabs/pid). The `(backbone, ckpt_type) → path` mapping is defined in
+[`pid/_src/inference/checkpoint_registry.py`](https://github.com/nv-tlabs/PiD/blob/main/pid/_src/inference/checkpoint_registry.py),
+which is the single source of truth. Clone the repo, point it at this
+snapshot, and the demos pick the right file automatically:
+```bash
+# Pull just the checkpoints/ tree into the repo root (skips this README and
+# the teaser figure so they don't clobber the files in the source repo).
+hf download nvidia/PiD --local-dir . --include "checkpoints/*"
+# Then run any of the demos, e.g.:
+PYTHONPATH=. python -m pid._src.inference.from_ldm_flux \
+    --prompt "A photorealistic cat" \
+    --ldm_inference_steps 28 --save_xt_steps 22 24 26 \
+    --output_dir ./results/demo \
+    --cfg_scale 1 --pid_inference_steps 4 --scale 4
+```
+Pick the `2kto4k` variant via `--pid_ckpt_type 2kto4k` when decoding at 4K.
+## Citation
+```
+@article{lu2026pid,
+    title={PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion},
+    author={Lu, Yifan and Wu, Qi and Wu, Jay Zhangjie and Wang, Zian and Ling, Huan and Fidler, Sanja and Ren, Xuanchi},
+    journal={arXiv preprint arXiv:2605.23902},
+    year={2026}
+}
+```
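
The directory names in the README's checkpoint table follow a regular pattern, so the `(backbone, ckpt_type) → path` lookup it describes can be illustrated with a short sketch. This is not the contents of `checkpoint_registry.py`, which remains the authoritative mapping. The function name `checkpoint_path`, the dictionary layout, and the `zimage` key are all hypothetical; only the resulting paths are taken from the table.

```python
from pathlib import PurePosixPath

# SR factor per released (variant, backbone) pair, per the README table.
RELEASED = {
    "2k": {"flux": 4, "flux2": 4, "sd3": 4, "dinov2": 4, "siglip": 8},
    "2kto4k": {"flux": 4, "flux2": 4, "sd3": 4},
}
# Z-Image shares Flux1's VAE, so it reuses the flux checkpoints.
# The key name "zimage" is an assumption made for this sketch.
ALIASES = {"zimage": "flux"}


def checkpoint_path(backbone: str, ckpt_type: str = "2k") -> PurePosixPath:
    """Return the repo-relative path of the PiD checkpoint for a backbone/variant."""
    backbone = ALIASES.get(backbone, backbone)
    try:
        sr = RELEASED[ckpt_type][backbone]
    except KeyError:
        raise ValueError(f"no {ckpt_type!r} checkpoint for {backbone!r}") from None
    name = f"PiD_res{ckpt_type}_sr{sr}x_official_{backbone}_distill_4step"
    return PurePosixPath("checkpoints") / name / "model_ema_bf16.pth"
```

For example, `checkpoint_path("zimage", "2kto4k")` resolves to the `flux` `2kto4k` checkpoint, and asking for a `2kto4k` DINOv2 checkpoint raises `ValueError` because none is released.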

checkpoints/PiD_res2k_sr4x_official_dinov2_distill_4step/model_ema_bf16.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d4ce25b6fd2c953720468cd88a7c6a192b4ce908bf085659a1324c186949eab0
+size 2731773393
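
Each large file in this commit is represented by a three-line Git LFS pointer like the one above, which records the SHA-256 digest and byte size of the real file. A downloaded checkpoint can be checked against its pointer with the standard library alone. The sketch below is one way to do that; the function names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and the pointer text is copied from the hunk above.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into its hex digest and size in bytes."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, _, digest = fields["oid"].partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported hash algorithm: {algo}")
    return {"oid": digest, "size": int(fields["size"])}


def matches_pointer(path: Path, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Check a local file against the pointer's size and SHA-256 digest."""
    if path.stat().st_size != pointer["size"]:
        return False  # cheap check first; avoids hashing multi-GB files
    sha = hashlib.sha256()
    with path.open("rb") as fh:
        # Read in chunks so the whole checkpoint is never held in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            sha.update(chunk)
    return sha.hexdigest() == pointer["oid"]


POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:d4ce25b6fd2c953720468cd88a7c6a192b4ce908bf085659a1324c186949eab0
size 2731773393
"""
```

Calling `matches_pointer(Path(".../model_ema_bf16.pth"), parse_lfs_pointer(POINTER))` on the downloaded DINOv2 checkpoint should return `True` if the transfer was complete and uncorrupted.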

checkpoints/PiD_res2k_sr4x_official_flux2_distill_4step/model_ema_bf16.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b0e3a7a19d4738e0b53c1267815a87e77295fddbee117b5aed802b9b62030cac
+size 2725875153

checkpoints/PiD_res2k_sr4x_official_flux_distill_4step/model_ema_bf16.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:47dd165bf3ea85df08deb152e6fcded19ddf5a35b83832abcd99d403ffca6ac3
+size 2724842961

checkpoints/PiD_res2k_sr4x_official_sd3_distill_4step/model_ema_bf16.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:835a069763903bfee317e524b83d0974639a6c1d9f79e00728d338fcc249fa27
+size 2724842961

checkpoints/PiD_res2k_sr8x_official_siglip_distill_4step/model_ema_bf16.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f9b735ca95044c9d4b5777f1a398fb30efcf4c28021622f13964ba785c33495c
+size 2735312130

checkpoints/PiD_res2kto4k_sr4x_official_flux2_distill_4step/model_ema_bf16.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ca75b0b2712f0872d8225fb0c2d520f817c7185eb4a5dbedfe2d164239df044a
+size 2725875153

checkpoints/PiD_res2kto4k_sr4x_official_flux_distill_4step/model_ema_bf16.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:2f3eabf4f2f83320472e6146f6545d8237e1423849e62148d5d656bfb571d00e
+size 2724842961

checkpoints/PiD_res2kto4k_sr4x_official_sd3_distill_4step/model_ema_bf16.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:22d1224cae36041b61517c2145c954fca2a9624a30a01d57b7bb044304b9dc31
+size 2724842961

checkpoints/ae.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:afc8e28272cd15db3919bacdb6918ce9c1ed22e96cb12c4d5ed0fba823529e38
+size 335304388

checkpoints/flux2_ae.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:868fe7b343cc8f3a19dbcfcafbc3d5f888802be3f89bd81b65b3621a066ce8f3
+size 336211292

checkpoints/rae/decoders/dinov2/wReg_base/ViTXL_n08_i512/model.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f8b83b058cd84567ce1f671c9ea32d6ec0e48532c650bbf1ba04bb4739292630
+size 1665128758

checkpoints/rae/stats/dinov2/wReg_base/imagenet1k_512/stat.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:fe0ac4e914a28708ec85f0dbd3be436dcec48e556d5d745f78a13625dc8fe2c7
+size 6292742

checkpoints/scale_rae/decoder/XL_decoder_config.json ADDED

	@@ -0,0 +1,28 @@

+{
+  "_name_or_path": "facebook/vit-mae-base",
+  "architectures": [
+    "ViTMAEForPreTraining"
+  ],
+  "attention_probs_dropout_prob": 0.0,
+  "decoder_hidden_size": 1152,
+  "decoder_intermediate_size": 4096,
+  "decoder_num_attention_heads": 16,
+  "decoder_num_hidden_layers": 28,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.0,
+  "hidden_size": 1152,
+  "image_size": 224,
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "layer_norm_eps": 1e-12,
+  "mask_ratio": 0.75,
+  "model_type": "vit_mae",
+  "norm_pix_loss": false,
+  "num_attention_heads": 12,
+  "num_channels": 3,
+  "num_hidden_layers": 12,
+  "patch_size": 14,
+  "qkv_bias": true,
+  "torch_dtype": "float32",
+  "transformers_version": "4.42.3"
+}
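
The decoder config above implies a few derived quantities through standard ViT arithmetic: the number of patches per side is `image_size / patch_size`, and the per-head width is `decoder_hidden_size / decoder_num_attention_heads`. The sketch below computes both from a copy of the relevant fields. Whether the PiD or Scale-RAE loaders use the config in exactly this way is an assumption; the helper names `patch_grid` and `decoder_head_dim` are hypothetical.

```python
import json

# Fields copied from XL_decoder_config.json above; the other keys are omitted.
CONFIG = json.loads("""
{
  "decoder_hidden_size": 1152,
  "decoder_num_attention_heads": 16,
  "image_size": 224,
  "patch_size": 14
}
""")


def patch_grid(config: dict) -> tuple[int, int]:
    """Return (patches per side, total patches) for a square ViT input."""
    side, rem = divmod(config["image_size"], config["patch_size"])
    if rem:
        raise ValueError("image_size must be a multiple of patch_size")
    return side, side * side


def decoder_head_dim(config: dict) -> int:
    """Per-head width of the decoder's attention layers."""
    return config["decoder_hidden_size"] // config["decoder_num_attention_heads"]
```

With the values above, a 224 px input at patch size 14 gives a 16 × 16 grid (256 patches), and 1152 channels over 16 heads gives 72 channels per head.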

checkpoints/scale_rae/decoder/siglip2_sop14_i224_web73M_ganw3_decXL.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ca7e6b907bb51455a12eea39b6acb1999c2133f325c123cda20ceb206d1ef3cb
+size 1662529538

checkpoints/sd3_vae/vae/diffusion_pytorch_model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f9b67a279283625caee39d61eacb5324243848477b4eb535355eaaa8423d4e09
+size 167666654

config.json ADDED

	@@ -0,0 +1,6 @@

+{
+    "input_types": [
+        "PiD"
+    ],
+    "model_size": "1.3B"
+}

figures/teaser.jpg ADDED

Git LFS Details

SHA256: fb74f71364bd8fc0901650d6c7b5b8ef8efac751b7d248d2c9a3d7accf031d17
Pointer size: 132 Bytes
Size of remote file: 1.36 MB