Add model card with checkpoint inventory and teaser
- .gitattributes +1 -0
- README.md +101 -0
- figures/teaser.jpg +3 -0
.gitattributes
CHANGED
|
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
|
|
| 33 |
*.zip filter=lfs diff=lfs merge=lfs -text
|
| 34 |
*.zst filter=lfs diff=lfs merge=lfs -text
|
| 35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
| 36 |
+
figures/teaser.jpg filter=lfs diff=lfs merge=lfs -text
|
README.md
ADDED
|
@@ -0,0 +1,101 @@
|
| 1 |
+
---
|
| 2 |
+
license: apache-2.0
|
| 3 |
+
library_name: pytorch
|
| 4 |
+
tags:
|
| 5 |
+
- super-resolution
|
| 6 |
+
- diffusion
|
| 7 |
+
- pixel-diffusion-decoder
|
| 8 |
+
- vae-decoder
|
| 9 |
+
pipeline_tag: image-to-image
|
| 10 |
+
---
|
| 11 |
+
|
| 12 |
+
# PiD: Pixel Diffusion Decoder
|
| 13 |
+
|
| 14 |
+
<p align="center">
|
| 15 |
+
<img src="figures/teaser.jpg" alt="PiD teaser" width="100%">
|
| 16 |
+
</p>
|
| 17 |
+
|
| 18 |
+
PiD reformulates the latent-to-pixel decoder as a conditional pixel-space
|
| 19 |
+
diffusion model, unifying decoding and upsampling into a single generative
|
| 20 |
+
module. It denoises directly in high-resolution pixel space and produces a
|
| 21 |
+
super-resolved image in one pass. This repository hosts the released decoder
|
| 22 |
+
checkpoints, plus the encoder/decoder ("VAE") weights they depend on.
|
| 23 |
+
|
| 24 |
+
All `PiD_*` checkpoints in this repo are **4-step distilled**. The non-`PiD_*`
|
| 25 |
+
entries (`ae.safetensors`, `flux2_ae.safetensors`, `sd3_vae/`, `rae/`,
|
| 26 |
+
`scale_rae/`) are **the corresponding encoder/decoder VAE weights** that PiD
|
| 27 |
+
plugs into; they're not PiD checkpoints themselves.
|
| 28 |
+
|
| 29 |
+
## PiD checkpoints
|
| 30 |
+
|
| 31 |
+
Two variants are released for each diffusers-style backbone:
|
| 32 |
+
|
| 33 |
+
- **`2k`**: trained at 2048 px, used as a 4× decoder (512 LDM → 2048 px), or as
|
| 34 |
+
an 8× decoder for the Scale-RAE backbone (256 → 2048).
|
| 35 |
+
- **`2kto4k`**: trained with multi-resolution data bucketing 2048–3840 and an
|
| 36 |
+
SD3-style dynamic shift; designed for 1024 LDM → 4K (3840 px) decoding. Only
|
| 37 |
+
released for the diffusers backbones.
|
| 38 |
+
|
| 39 |
+
| Path | Backbone (encoder side) | SR factor | Variant |
|
| 40 |
+
|---------------------------------------------------------------|--------------------------------------------|-----------|-----------|
|
| 41 |
+
| `checkpoints/PiD_res2k_sr4x_official_flux_distill_4step` | Flux1-dev (16-ch VAE) | 4× | 2k |
|
| 42 |
+
| `checkpoints/PiD_res2k_sr4x_official_flux2_distill_4step` | Flux2-dev (128-ch BN VAE) | 4× | 2k |
|
| 43 |
+
| `checkpoints/PiD_res2k_sr4x_official_sd3_distill_4step` | SD3 medium (16-ch VAE) | 4× | 2k |
|
| 44 |
+
| `checkpoints/PiD_res2k_sr4x_official_dinov2_distill_4step` | DINOv2-B + RAE ViT-XL (768-ch) | 4× | 2k |
|
| 45 |
+
| `checkpoints/PiD_res2k_sr8x_official_siglip_distill_4step` | SigLIP-2 So400M + Scale-RAE ViT-XL (1152) | 8× | 2k |
|
| 46 |
+
| `checkpoints/PiD_res2kto4k_sr4x_official_flux_distill_4step` | Flux1-dev (16-ch VAE) | 4× | 2kto4k |
|
| 47 |
+
| `checkpoints/PiD_res2kto4k_sr4x_official_flux2_distill_4step` | Flux2-dev (128-ch BN VAE) | 4× | 2kto4k |
|
| 48 |
+
| `checkpoints/PiD_res2kto4k_sr4x_official_sd3_distill_4step` | SD3 medium (16-ch VAE) | 4× | 2kto4k |
|
| 49 |
+
|
| 50 |
+
Z-Image shares Flux1's VAE, so its inference path reuses the `flux` checkpoints
|
| 51 |
+
(both `2k` and `2kto4k`); no separate `zimage` checkpoint is shipped.
|
| 52 |
+
|
| 53 |
+
Each directory contains a single file, `model_ema_bf16.pth`, which is the EMA
|
| 54 |
+
weights cast to bfloat16, the format the inference scripts load by default.
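
The naming pattern in the table above is regular enough to compose by hand. The following is a minimal sketch of that composition, not the codebase's registry: the helper name `pid_checkpoint_path`, the dictionaries, and the short backbone keys are assumptions introduced for illustration, and only the directory names and the `model_ema_bf16.pth` file name come from this card.

```python
from pathlib import Path

# Hypothetical helper mirroring the checkpoint table in this card.
# The authoritative mapping lives in the PiD codebase, not here.
SR_FACTOR = {"flux": 4, "flux2": 4, "sd3": 4, "dinov2": 4, "siglip": 8}
HAS_2KTO4K = {"flux", "flux2", "sd3"}  # backbones listed with a 2kto4k row
ALIASES = {"zimage": "flux"}  # Z-Image reuses the Flux1 checkpoints


def pid_checkpoint_path(backbone: str, variant: str = "2k",
                        root: str = "checkpoints") -> Path:
    """Return the expected path of the EMA bf16 weights for one table row."""
    key = ALIASES.get(backbone, backbone)
    if key not in SR_FACTOR:
        raise ValueError(f"unknown backbone: {backbone!r}")
    if variant not in ("2k", "2kto4k"):
        raise ValueError(f"unknown variant: {variant!r}")
    if variant == "2kto4k" and key not in HAS_2KTO4K:
        raise ValueError(f"no 2kto4k checkpoint listed for {backbone!r}")
    name = f"PiD_res{variant}_sr{SR_FACTOR[key]}x_official_{key}_distill_4step"
    return Path(root) / name / "model_ema_bf16.pth"
```

Under these assumptions, `pid_checkpoint_path("zimage", "2kto4k")` resolves to the `flux` `2kto4k` directory, matching the note that Z-Image shares Flux1's VAE.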
|
| 55 |
+
|
| 56 |
+
## VAE / encoder weights
|
| 57 |
+
|
| 58 |
+
These are the per-backbone encoder (and, where applicable, original decoder)
|
| 59 |
+
weights that PiD pairs with. They're hosted here so a single download brings
|
| 60 |
+
everything needed end-to-end.
|
| 61 |
+
|
| 62 |
+
| Path | Description |
|
| 63 |
+
|---------------------------------|--------------------------------------------------------------------------------------|
|
| 64 |
+
| `checkpoints/ae.safetensors` | Flux1-dev / Z-Image 16-ch VAE (encoder + original Flux decoder). |
|
| 65 |
+
| `checkpoints/flux2_ae.safetensors` | Flux2-dev 128-ch BN VAE. |
|
| 66 |
+
| `checkpoints/sd3_vae/` | SD3 medium 16-ch VAE in diffusers format. |
|
| 67 |
+
| `checkpoints/rae/` | DINOv2-B image encoder + RAE ViT-XL decoder + ImageNet-512 normalization statistics. |
|
| 68 |
+
| `checkpoints/scale_rae/` | SigLIP-2 So400M encoder + Scale-RAE ViT-XL decoder + decoder config. |
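
Since these weights are meant to arrive with a single download, a quick completeness check on the local snapshot can catch a partial fetch. This is a minimal sketch under stated assumptions: the function name `missing_vae_paths` is hypothetical, and it only tests that each path from the table exists, not that the files inside are valid.

```python
from pathlib import Path

# Paths copied from the table in this card, relative to the snapshot root.
VAE_PATHS = [
    "checkpoints/ae.safetensors",
    "checkpoints/flux2_ae.safetensors",
    "checkpoints/sd3_vae",
    "checkpoints/rae",
    "checkpoints/scale_rae",
]


def missing_vae_paths(snapshot_root: str = ".") -> list[str]:
    """List the table entries that are absent under snapshot_root."""
    root = Path(snapshot_root)
    return [p for p in VAE_PATHS if not (root / p).exists()]
```

An empty list from `missing_vae_paths(".")` means every entry in the table is present in the current directory's snapshot.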
|
| 69 |
+
|
| 70 |
+
## Usage
|
| 71 |
+
|
| 72 |
+
The decoder checkpoints are loaded by the inference scripts in the PiD
|
| 73 |
+
codebase. The exact `(backbone, ckpt_type) → path` mapping is the single source
|
| 74 |
+
of truth in
|
| 75 |
+
[`pid/_src/inference/checkpoint_registry.py`](https://github.com/); clone the
|
| 76 |
+
repo, point it at this snapshot, and the demos pick the right file
|
| 77 |
+
automatically:
|
| 78 |
+
|
| 79 |
+
```bash
|
| 80 |
+
# Download this whole snapshot; its files land under ./checkpoints
|
| 81 |
+
hf download nvidia/PiD --local-dir .
|
| 82 |
+
|
| 83 |
+
# Then run any of the demos, e.g.:
|
| 84 |
+
PYTHONPATH=. python -m pid._src.inference.from_ldm_flux \
|
| 85 |
+
--prompt "A photorealistic cat" \
|
| 86 |
+
--ldm_inference_steps 28 --save_xt_steps 22 24 26 \
|
| 87 |
+
--output_dir ./results/demo \
|
| 88 |
+
--cfg_scale 1 --pid_inference_steps 4 --scale 4
|
| 89 |
+
```
|
| 90 |
+
|
| 91 |
+
Pick the `2kto4k` variant via `--pid_ckpt_type 2kto4k` when decoding at 4K.
|
| 92 |
+
|
| 93 |
+
## License
|
| 94 |
+
|
| 95 |
+
Released under the **Apache License 2.0**. Copyright 2026 NVIDIA Corporation
|
| 96 |
+
& Affiliates. See the `LICENSE` file in the source repository for the full
|
| 97 |
+
text.
|
| 98 |
+
|
| 99 |
+
The upstream encoder backbones (DINOv2, SigLIP-2, Flux, SD3, Z-Image) and their
|
| 100 |
+
weights remain under their own original licenses; PiD's Apache-2.0 release
|
| 101 |
+
covers only the PiD decoder weights and code.
|
figures/teaser.jpg
ADDED
|