---
library_name: pytorch
tags:
- super-resolution
- diffusion
- pixel-diffusion-decoder
- vae-decoder
pipeline_tag: image-to-image
base_model:
- nvidia/PixelDiT-1300M-1024px
- Tongyi-MAI/Z-Image
- black-forest-labs/FLUX.1-dev
- black-forest-labs/FLUX.2-dev
- nyu-visionx/Scale-RAE-Qwen7B_DiT9.8B
---

# PiD — Pixel Diffusion Decoder

<p align="center">
  <img src="figures/teaser.jpg" alt="PiD teaser" width="100%">
</p>


**[Paper](https://arxiv.org/abs/2605.23902), [Project Page](https://research.nvidia.com/labs/sil/projects/pid/)**

[Yifan Lu](https://yifanlu0227.github.io/),
[Qi Wu](https://wilsoncernwq.github.io/),
[Jay Zhangjie Wu](https://zhangjiewu.github.io/),
[Zian Wang](https://www.cs.toronto.edu/~zianwang/),
[Huan Ling](https://www.cs.toronto.edu/~linghuan/),
[Sanja Fidler](https://www.cs.utoronto.ca/~fidler/),
[Xuanchi Ren](https://xuanchiren.com/) <br>


PiD reformulates the latent-to-pixel decoder as a conditional pixel-space
diffusion model, unifying decoding and upsampling into a single generative
module. It denoises directly in high-resolution pixel space and produces a
super-resolved image in one pass. This repository hosts the released decoder
checkpoints, plus the encoder/decoder ("VAE") weights they depend on.

All `PiD_*` checkpoints in this repo are **4-step distilled**. The non-`PiD_*`
entries (`ae.safetensors`, `flux2_ae.safetensors`, `sd3_vae/`, `rae/`,
`scale_rae/`) are **the corresponding encoder/decoder VAE weights** that PiD
plugs into — they're not PiD checkpoints themselves.

### License/Terms of Use

This model is released under the NVIDIA Open Model License.

### Deployment Geography
Global

## PiD checkpoints

Two variants are released for each diffusers-style backbone:

- **`2k`**: trained at 2048 px. Used as a 4× decoder (512 px LDM resolution →
  2048 px output), or as an 8× decoder for the Scale-RAE backbone (256 px →
  2048 px).
- **`2kto4k`**: trained with multi-resolution data bucketing from 2048 px to
  3840 px and an SD3-style dynamic shift; designed for decoding a 1024 px LDM
  resolution to 4K (4096 px).

Both checkpoint variants support multiple aspect ratios.
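
The resolution arithmetic behind the two variants can be sketched in a few lines. This is only an illustration: the function name `pid_output_size` is hypothetical, and it assumes the numbers quoted above refer to the LDM's native image resolution in pixels, which is multiplied by the SR factor on each side.

```python
def pid_output_size(ldm_height: int, ldm_width: int, sr_factor: int) -> tuple[int, int]:
    """Return the (height, width) in pixels of the decoded image.

    Assumption: the output side length is the LDM's native image side
    length multiplied by the super-resolution factor.
    """
    if sr_factor not in (4, 8):
        # Only 4x and 8x checkpoints are listed in this repository.
        raise ValueError(f"unsupported SR factor: {sr_factor}")
    return ldm_height * sr_factor, ldm_width * sr_factor


print(pid_output_size(512, 512, 4))    # (2048, 2048): 2k variant, 4x
print(pid_output_size(256, 256, 8))    # (2048, 2048): Scale-RAE backbone, 8x
print(pid_output_size(1024, 1024, 4))  # (4096, 4096): 2kto4k variant, 4x
```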

| Path                                                          | Backbone (encoder side)                    | SR factor | Variant   |
|---------------------------------------------------------------|--------------------------------------------|-----------|-----------|
| `checkpoints/PiD_res2k_sr4x_official_flux_distill_4step`      | Flux1-dev (16-ch VAE)                      | 4×        | 2k        |
| `checkpoints/PiD_res2k_sr4x_official_flux2_distill_4step`     | Flux2-dev (128-ch BN VAE)                  | 4×        | 2k        |
| `checkpoints/PiD_res2k_sr4x_official_sd3_distill_4step`       | SD3 medium (16-ch VAE)                     | 4×        | 2k        |
| `checkpoints/PiD_res2k_sr4x_official_dinov2_distill_4step`    | DINOv2-B + RAE ViT-XL (768-ch)             | 4×        | 2k        |
| `checkpoints/PiD_res2k_sr8x_official_siglip_distill_4step`    | SigLIP-2 So400M + Scale-RAE ViT-XL (1152)  | 8×        | 2k        |
| `checkpoints/PiD_res2kto4k_sr4x_official_flux_distill_4step`  | Flux1-dev (16-ch VAE)                      | 4×        | 2kto4k    |
| `checkpoints/PiD_res2kto4k_sr4x_official_flux2_distill_4step` | Flux2-dev (128-ch BN VAE)                  | 4×        | 2kto4k    |
| `checkpoints/PiD_res2kto4k_sr4x_official_sd3_distill_4step`   | SD3 medium (16-ch VAE)                     | 4×        | 2kto4k    |

Z-Image shares Flux1's VAE, so its inference path reuses the `flux` checkpoints
(both `2k` and `2kto4k`) — no separate `zimage` checkpoint is shipped.

Each directory contains a single file, `model_ema_bf16.pth`, which is the EMA
weights cast to bfloat16 — the format the inference scripts load by default.
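
As a rough illustration of how the table above maps to files on disk, the sketch below resolves a backbone and variant to a checkpoint path. It is not the PiD codebase's own registry: the backbone keys (`flux`, `flux2`, `sd3`, `dinov2`, `siglip`, `zimage`) and the helper name `resolve_checkpoint` are assumptions made for this example, while the directory and file names are copied from this README.

```python
from pathlib import Path

CHECKPOINT_FILE = "model_ema_bf16.pth"

# (backbone, variant) -> directory name, copied from the table above.
# The backbone keys are hypothetical labels chosen for this sketch.
_CHECKPOINT_DIRS = {
    ("flux", "2k"): "PiD_res2k_sr4x_official_flux_distill_4step",
    ("flux2", "2k"): "PiD_res2k_sr4x_official_flux2_distill_4step",
    ("sd3", "2k"): "PiD_res2k_sr4x_official_sd3_distill_4step",
    ("dinov2", "2k"): "PiD_res2k_sr4x_official_dinov2_distill_4step",
    ("siglip", "2k"): "PiD_res2k_sr8x_official_siglip_distill_4step",
    ("flux", "2kto4k"): "PiD_res2kto4k_sr4x_official_flux_distill_4step",
    ("flux2", "2kto4k"): "PiD_res2kto4k_sr4x_official_flux2_distill_4step",
    ("sd3", "2kto4k"): "PiD_res2kto4k_sr4x_official_sd3_distill_4step",
}

# Z-Image shares Flux1's VAE, so it reuses the flux checkpoints.
_ALIASES = {"zimage": "flux"}


def resolve_checkpoint(backbone: str, variant: str = "2k", root: str = "checkpoints") -> Path:
    """Return the path to the bf16 EMA weights for a backbone/variant pair."""
    backbone = _ALIASES.get(backbone, backbone)
    key = (backbone, variant)
    if key not in _CHECKPOINT_DIRS:
        raise KeyError(f"no released checkpoint for backbone={backbone!r}, variant={variant!r}")
    return Path(root) / _CHECKPOINT_DIRS[key] / CHECKPOINT_FILE


print(resolve_checkpoint("zimage", "2kto4k").as_posix())
```

Keeping the alias table separate from the path table makes the Z-Image case explicit without duplicating entries.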

## VAE / encoder weights

These are the per-backbone encoder (and, where applicable, original decoder)
weights that PiD pairs with. They're hosted here so a single download brings
everything needed end-to-end.

| Path                            | Description                                                                          |
|---------------------------------|--------------------------------------------------------------------------------------|
| `checkpoints/ae.safetensors`    | Flux1-dev / Z-Image 16-ch VAE (encoder + original Flux decoder).                     |
| `checkpoints/flux2_ae.safetensors` | Flux2-dev 128-ch BN VAE.                                                          |
| `checkpoints/sd3_vae/`          | SD3 medium 16-ch VAE in diffusers format.                                            |
| `checkpoints/rae/`              | DINOv2-B image encoder + RAE ViT-XL decoder + ImageNet-512 normalization statistics. |
| `checkpoints/scale_rae/`        | SigLIP-2 So400M encoder + Scale-RAE ViT-XL decoder + decoder config.                 |
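
After downloading, it can be useful to confirm that the VAE and encoder assets listed above are actually present. The following is a minimal sketch using only the standard library; the helper name `missing_vae_assets` is hypothetical, and it checks only that each path exists, not that the contents are complete or valid.

```python
from pathlib import Path

# File and directory names copied from the table above.
VAE_ASSETS = (
    "ae.safetensors",        # Flux1-dev / Z-Image VAE
    "flux2_ae.safetensors",  # Flux2-dev VAE
    "sd3_vae",               # SD3 medium VAE (directory)
    "rae",                   # DINOv2-B encoder + RAE decoder (directory)
    "scale_rae",             # SigLIP-2 encoder + Scale-RAE decoder (directory)
)


def missing_vae_assets(root: str = "checkpoints") -> list[str]:
    """Return the sorted names of listed assets that do not exist under root."""
    base = Path(root)
    return sorted(name for name in VAE_ASSETS if not (base / name).exists())


missing = missing_vae_assets()
if missing:
    print("Missing under checkpoints/:", ", ".join(missing))
else:
    print("All listed VAE / encoder assets are present.")
```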

## Usage

The decoder checkpoints are loaded by the inference scripts in the [PiD
codebase](https://github.com/nv-tlabs/pid). The exact `(backbone, ckpt_type) → path` mapping is defined in
[`pid/_src/inference/checkpoint_registry.py`](https://github.com/nv-tlabs/PiD/blob/main/pid/_src/inference/checkpoint_registry.py),
which is the single source of truth. Clone the repo, download this snapshot's
`checkpoints/` tree into the repo root, and the demos pick the right file
automatically:

```bash
# Pull just the checkpoints/ tree into the repo root (skips this README and
# the teaser figure so they don't clobber the files in the source repo).
hf download nvidia/PiD --local-dir . --include "checkpoints/*"

# Then run any of the demos, e.g.:
PYTHONPATH=. python -m pid._src.inference.from_ldm_flux \
    --prompt "A photorealistic cat" \
    --ldm_inference_steps 28 --save_xt_steps 22 24 26 \
    --output_dir ./results/demo \
    --cfg_scale 1 --pid_inference_steps 4 --scale 4
```

Pick the `2kto4k` variant via `--pid_ckpt_type 2kto4k` when decoding at 4K.
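
The same download can also be scripted from Python. The sketch below assumes the `huggingface_hub` package is installed and uses its `snapshot_download` function with an `allow_patterns` filter, mirroring the `--include "checkpoints/*"` flag of the CLI command above. The wrapper name `download_checkpoints` is hypothetical.

```python
REPO_ID = "nvidia/PiD"
ALLOW_PATTERNS = ["checkpoints/*"]  # skip the README and teaser figure


def download_checkpoints(local_dir: str = ".") -> str:
    """Download only the checkpoints/ tree into local_dir and return its path."""
    # Imported lazily so this module can be imported without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        local_dir=local_dir,
        allow_patterns=ALLOW_PATTERNS,
    )
```

Calling `download_checkpoints(".")` from the root of the cloned PiD repo should leave the files where the demos expect them.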

## Citation
```bibtex
@article{lu2026pid,
    title={PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion},
    author={Lu, Yifan and Wu, Qi and Wu, Jay Zhangjie and Wang, Zian and Ling, Huan and Fidler, Sanja and Ren, Xuanchi},
    journal={arXiv preprint arXiv:2605.23902},
    year={2026}
}
```