---
license: apache-2.0
library_name: pytorch
tags:
  - super-resolution
  - diffusion
  - pixel-diffusion-decoder
  - vae-decoder
pipeline_tag: image-to-image
---

# PiD: Pixel Diffusion Decoder

<p align="center">
  <img src="figures/teaser.jpg" alt="PiD teaser" width="100%">
</p>

PiD reformulates the latent-to-pixel decoder as a conditional pixel-space
diffusion model, unifying decoding and upsampling into a single generative
module. It denoises directly in high-resolution pixel space and produces a
super-resolved image in one pass. This repository hosts the released decoder
checkpoints, plus the encoder/decoder ("VAE") weights they depend on.

All `PiD_*` checkpoints in this repo are **4-step distilled**. The non-`PiD_*`
entries (`ae.safetensors`, `flux2_ae.safetensors`, `sd3_vae/`, `rae/`,
`scale_rae/`) are **the corresponding encoder/decoder VAE weights** that PiD
plugs into; they're not PiD checkpoints themselves.

## PiD checkpoints

Two variants are released for each diffusers-style backbone:

- **`2k`**: trained at 2048 px, used as a 4× decoder (512 LDM → 2048 px), or as
  an 8× decoder for the Scale-RAE backbone (256 → 2048 px).
- **`2kto4k`**: trained with multi-resolution data bucketing (2048 → 3840 px) and
  an SD3-style dynamic shift; designed for 1024 LDM → 4K (3840 px) decoding. Only
  released for the diffusers backbones.

| Path                                                          | Backbone (encoder side)                    | SR factor | Variant   |
|---------------------------------------------------------------|--------------------------------------------|-----------|-----------|
| `checkpoints/PiD_res2k_sr4x_official_flux_distill_4step`      | Flux1-dev (16-ch VAE)                      | 4×        | 2k        |
| `checkpoints/PiD_res2k_sr4x_official_flux2_distill_4step`     | Flux2-dev (128-ch BN VAE)                  | 4×        | 2k        |
| `checkpoints/PiD_res2k_sr4x_official_sd3_distill_4step`       | SD3 medium (16-ch VAE)                     | 4×        | 2k        |
| `checkpoints/PiD_res2k_sr4x_official_dinov2_distill_4step`    | DINOv2-B + RAE ViT-XL (768-ch)             | 4×        | 2k        |
| `checkpoints/PiD_res2k_sr8x_official_siglip_distill_4step`    | SigLIP-2 So400M + Scale-RAE ViT-XL (1152)  | 8×        | 2k        |
| `checkpoints/PiD_res2kto4k_sr4x_official_flux_distill_4step`  | Flux1-dev (16-ch VAE)                      | 4×        | 2kto4k    |
| `checkpoints/PiD_res2kto4k_sr4x_official_flux2_distill_4step` | Flux2-dev (128-ch BN VAE)                  | 4×        | 2kto4k    |
| `checkpoints/PiD_res2kto4k_sr4x_official_sd3_distill_4step`   | SD3 medium (16-ch VAE)                     | 4×        | 2kto4k    |

Z-Image shares Flux1's VAE, so its inference path reuses the `flux` checkpoints
(both `2k` and `2kto4k`); no separate `zimage` checkpoint is shipped.

Each directory contains a single file, `model_ema_bf16.pth`, which is the EMA
weights cast to bfloat16, the format the inference scripts load by default.
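
The directory names in the table follow a regular pattern, so the path to a
given `model_ema_bf16.pth` can be derived from the backbone and variant. The
sketch below is only an illustration of that pattern: the function
`pid_checkpoint_path`, the short backbone keys, and the lookup tables are
hypothetical names transcribed from this README, not the contents of the
codebase's checkpoint registry.

```python
from pathlib import Path

# Hypothetical lookup tables, transcribed from the checkpoint table in this
# README. They are NOT copied from the PiD codebase.
SR_FACTOR = {"flux": 4, "flux2": 4, "sd3": 4, "dinov2": 4, "siglip": 8}
VARIANT_BACKBONES = {
    "2k": {"flux", "flux2", "sd3", "dinov2", "siglip"},
    "2kto4k": {"flux", "flux2", "sd3"},  # only the diffusers backbones
}
# Z-Image shares Flux1's VAE, so it reuses the flux checkpoints.
ALIASES = {"zimage": "flux"}
CKPT_FILE = "model_ema_bf16.pth"


def pid_checkpoint_path(backbone: str, variant: str = "2k",
                        root: str = "checkpoints") -> Path:
    """Build the expected path of a PiD checkpoint file (illustrative only)."""
    backbone = ALIASES.get(backbone, backbone)
    if variant not in VARIANT_BACKBONES:
        raise ValueError(f"unknown variant: {variant!r}")
    if backbone not in VARIANT_BACKBONES[variant]:
        raise ValueError(f"no {variant!r} checkpoint listed for {backbone!r}")
    name = (f"PiD_res{variant}_sr{SR_FACTOR[backbone]}x"
            f"_official_{backbone}_distill_4step")
    return Path(root) / name / CKPT_FILE
```

For example, `pid_checkpoint_path("zimage", "2kto4k")` resolves to the
`flux` `2kto4k` directory, while asking for a `2kto4k` DINOv2 checkpoint
raises a `ValueError` because none is listed in the table.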

## VAE / encoder weights

These are the per-backbone encoder (and, where applicable, original decoder)
weights that PiD pairs with. They're hosted here so a single download brings
everything needed end-to-end.

| Path                            | Description                                                                          |
|---------------------------------|--------------------------------------------------------------------------------------|
| `checkpoints/ae.safetensors`    | Flux1-dev / Z-Image 16-ch VAE (encoder + original Flux decoder).                     |
| `checkpoints/flux2_ae.safetensors` | Flux2-dev 128-ch BN VAE.                                                          |
| `checkpoints/sd3_vae/`          | SD3 medium 16-ch VAE in diffusers format.                                            |
| `checkpoints/rae/`              | DINOv2-B image encoder + RAE ViT-XL decoder + ImageNet-512 normalization statistics. |
| `checkpoints/scale_rae/`        | SigLIP-2 So400M encoder + Scale-RAE ViT-XL decoder + decoder config.                 |
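
After downloading, it can be useful to confirm that the VAE / encoder weights
for the backbones you plan to use are actually present. The following is a
minimal sketch of such a check using only the paths listed in the table above;
the helper `missing_vae_paths` and the short backbone keys are hypothetical and
are not part of the PiD codebase.

```python
from pathlib import Path

# Paths transcribed from the table above; the backbone keys are hypothetical
# labels chosen for this example.
VAE_PATHS = {
    "flux": "checkpoints/ae.safetensors",
    "zimage": "checkpoints/ae.safetensors",  # Z-Image shares the Flux1 VAE
    "flux2": "checkpoints/flux2_ae.safetensors",
    "sd3": "checkpoints/sd3_vae",
    "dinov2": "checkpoints/rae",
    "siglip": "checkpoints/scale_rae",
}


def missing_vae_paths(repo_root=".", backbones=None):
    """Return the sorted VAE paths that do not exist under repo_root."""
    root = Path(repo_root)
    wanted = backbones if backbones is not None else list(VAE_PATHS)
    return sorted({VAE_PATHS[b] for b in wanted
                   if not (root / VAE_PATHS[b]).exists()})
```

An empty list from `missing_vae_paths(".")` means every path in the table was
found relative to the current directory.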

## Usage

The decoder checkpoints are loaded by the inference scripts in the PiD
codebase. The single source of truth for the exact
`(backbone, ckpt_type) → path` mapping is
[`pid/_src/inference/checkpoint_registry.py`](https://github.com/). Clone the
repo, point it at this snapshot, and the demos pick the right file
automatically:

```bash
# Pull just the checkpoints/ tree into the repo root (skips this README and
# the teaser figure so they don't clobber the files in the source repo).
hf download nvidia/PiD --local-dir . --include "checkpoints/*"

# Then run any of the demos, e.g.:
PYTHONPATH=. python -m pid._src.inference.from_ldm_flux \
    --prompt "A photorealistic cat" \
    --ldm_inference_steps 28 --save_xt_steps 22 24 26 \
    --output_dir ./results/demo \
    --cfg_scale 1 --pid_inference_steps 4 --scale 4
```

Pick the `2kto4k` variant via `--pid_ckpt_type 2kto4k` when decoding at 4K.
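
The `hf download` step can also be done from Python. The sketch below assumes
the `huggingface_hub` package is installed and uses its `snapshot_download`
function with the same repo id and include pattern as the shell command; the
wrapper name `download_checkpoints` is hypothetical.

```python
from pathlib import Path

REPO_ID = "nvidia/PiD"              # same repo id as the `hf download` command
ALLOW_PATTERNS = ["checkpoints/*"]  # skip the README and the teaser figure


def download_checkpoints(local_dir: str = ".") -> Path:
    """Fetch only the checkpoints/ tree into local_dir and return its root."""
    # Imported lazily so this module can be imported without huggingface_hub.
    from huggingface_hub import snapshot_download

    folder = snapshot_download(
        repo_id=REPO_ID,
        local_dir=local_dir,
        allow_patterns=ALLOW_PATTERNS,
    )
    return Path(folder)
```

Calling `download_checkpoints(".")` from the root of the cloned source
repository should leave the weights under `./checkpoints/`, where the demos
expect them.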

## License

Released under the **Apache License 2.0**. Copyright 2026 NVIDIA Corporation
& Affiliates. See the `LICENSE` file in the source repository for the full
text.

The upstream encoder backbones (DINOv2, SigLIP-2, Flux, SD3, Z-Image) and their
weights remain under their own original licenses; PiD's Apache-2.0 release
covers only the PiD decoder weights and code.