---
library_name: pytorch
tags:
- super-resolution
- diffusion
- pixel-diffusion-decoder
- vae-decoder
pipeline_tag: image-to-image
base_model:
- nvidia/PixelDiT-1300M-1024px
- Tongyi-MAI/Z-Image
- black-forest-labs/FLUX.1-dev
- black-forest-labs/FLUX.2-dev
- nyu-visionx/Scale-RAE-Qwen7B_DiT9.8B
---
# PiD: Pixel Diffusion Decoder
<p align="center">
<img src="figures/teaser.jpg" alt="PiD teaser" width="100%">
</p>
**[Paper](https://arxiv.org/abs/2605.23902), [Project Page](https://research.nvidia.com/labs/sil/projects/pid/)**
[Yifan Lu](https://yifanlu0227.github.io/),
[Qi Wu](https://wilsoncernwq.github.io/),
[Jay Zhangjie Wu](https://zhangjiewu.github.io/),
[Zian Wang](https://www.cs.toronto.edu/~zianwang/),
[Huan Ling](https://www.cs.toronto.edu/~linghuan/),
[Sanja Fidler](https://www.cs.utoronto.ca/~fidler/),
[Xuanchi Ren](https://xuanchiren.com/) <br>
PiD reformulates the latent-to-pixel decoder as a conditional pixel-space
diffusion model, unifying decoding and upsampling into a single generative
module. It denoises directly in high-resolution pixel space and produces a
super-resolved image in one pass. This repository hosts the released decoder
checkpoints, plus the encoder/decoder ("VAE") weights they depend on.
All `PiD_*` checkpoints in this repo are **4-step distilled**. The non-`PiD_*`
entries (`ae.safetensors`, `flux2_ae.safetensors`, `sd3_vae/`, `rae/`,
`scale_rae/`) are **the corresponding encoder/decoder VAE weights** that PiD
plugs into; they're not PiD checkpoints themselves.
### License/Terms of Use
This model is released under the NVIDIA Open Model License.
### Deployment Geography:
Global
## PiD checkpoints
Two variants are released for each diffusers-style backbone:
- **`2k`**: trained at 2048 px, used as a 4× decoder (512 LDM → 2048 px), or as
  an 8× decoder for the Scale-RAE backbone (256 → 2048).
- **`2kto4k`**: trained with multi-resolution data bucketing (2048–3840) and an
  SD3-style dynamic shift; designed for 1024 LDM → 4K (4096 px) decoding.
Both checkpoint variants support multiple aspect ratios.
| Path | Backbone (encoder side) | SR factor | Variant |
|---------------------------------------------------------------|--------------------------------------------|-----------|-----------|
| `checkpoints/PiD_res2k_sr4x_official_flux_distill_4step` | Flux1-dev (16-ch VAE) | 4× | 2k |
| `checkpoints/PiD_res2k_sr4x_official_flux2_distill_4step` | Flux2-dev (128-ch BN VAE) | 4× | 2k |
| `checkpoints/PiD_res2k_sr4x_official_sd3_distill_4step` | SD3 medium (16-ch VAE) | 4× | 2k |
| `checkpoints/PiD_res2k_sr4x_official_dinov2_distill_4step` | DINOv2-B + RAE ViT-XL (768-ch) | 4× | 2k |
| `checkpoints/PiD_res2k_sr8x_official_siglip_distill_4step` | SigLIP-2 So400M + Scale-RAE ViT-XL (1152) | 8× | 2k |
| `checkpoints/PiD_res2kto4k_sr4x_official_flux_distill_4step` | Flux1-dev (16-ch VAE) | 4× | 2kto4k |
| `checkpoints/PiD_res2kto4k_sr4x_official_flux2_distill_4step` | Flux2-dev (128-ch BN VAE) | 4× | 2kto4k |
| `checkpoints/PiD_res2kto4k_sr4x_official_sd3_distill_4step` | SD3 medium (16-ch VAE) | 4× | 2kto4k |
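
The SR factor column relates the LDM's nominal pixel resolution to the decoded output size. The sketch below works through that arithmetic. `decoded_size` is a hypothetical helper, not part of the PiD codebase, and it assumes the factor applies to each side independently, which is how non-square aspect ratios would be handled under that reading.

```python
def decoded_size(ldm_height: int, ldm_width: int, sr_factor: int) -> tuple[int, int]:
    """Return (height, width) in pixels after decoding.

    Assumption: the SR factor multiplies each side of the LDM's nominal
    pixel resolution. Only the factors listed in the table are accepted.
    """
    if sr_factor not in (4, 8):
        raise ValueError(f"unsupported SR factor: {sr_factor}")
    return ldm_height * sr_factor, ldm_width * sr_factor


print(decoded_size(512, 512, 4))    # (2048, 2048): 2k variant, 4x backbones
print(decoded_size(256, 256, 8))    # (2048, 2048): 2k variant, Scale-RAE backbone
print(decoded_size(1024, 1024, 4))  # (4096, 4096): 2kto4k variant
```
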
Z-Image shares Flux1's VAE, so its inference path reuses the `flux` checkpoints
(both `2k` and `2kto4k`); no separate `zimage` checkpoint is shipped.
Each directory contains a single file, `model_ema_bf16.pth`, which holds the EMA
weights cast to bfloat16, the format the inference scripts load by default.
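
The directory names follow a regular pattern, so the path for a given backbone and variant can be built mechanically. The sketch below only illustrates that naming scheme. It is not the registry used by the PiD codebase, and the backbone keys (`flux`, `flux2`, `sd3`, `dinov2`, `siglip`, `zimage`) are assumptions taken from the directory names, so the real registry may use different identifiers.

```python
from pathlib import Path

# SR factor per backbone, copied from the checkpoint table.
SR_FACTOR = {"flux": 4, "flux2": 4, "sd3": 4, "dinov2": 4, "siglip": 8}
# Backbones for which a 2kto4k checkpoint is listed.
HAS_2KTO4K = {"flux", "flux2", "sd3"}
# Z-Image shares Flux1's VAE and reuses the flux checkpoints.
ALIASES = {"zimage": "flux"}


def pid_checkpoint_path(backbone: str, variant: str = "2k", root: str = "checkpoints") -> Path:
    """Build the expected path of model_ema_bf16.pth (hypothetical helper)."""
    backbone = ALIASES.get(backbone, backbone)
    if backbone not in SR_FACTOR:
        raise ValueError(f"unknown backbone: {backbone}")
    if variant not in ("2k", "2kto4k"):
        raise ValueError(f"unknown variant: {variant}")
    if variant == "2kto4k" and backbone not in HAS_2KTO4K:
        raise ValueError(f"no 2kto4k checkpoint is listed for {backbone}")
    name = f"PiD_res{variant}_sr{SR_FACTOR[backbone]}x_official_{backbone}_distill_4step"
    return Path(root) / name / "model_ema_bf16.pth"


print(pid_checkpoint_path("zimage", "2kto4k").as_posix())
```

For anything beyond a quick check, prefer the registry file in the source repository, since it is the authoritative mapping.
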
## VAE / encoder weights
These are the per-backbone encoder (and, where applicable, original decoder)
weights that PiD pairs with. They're hosted here so a single download brings
everything needed end-to-end.
| Path | Description |
|---------------------------------|--------------------------------------------------------------------------------------|
| `checkpoints/ae.safetensors` | Flux1-dev / Z-Image 16-ch VAE (encoder + original Flux decoder). |
| `checkpoints/flux2_ae.safetensors` | Flux2-dev 128-ch BN VAE. |
| `checkpoints/sd3_vae/` | SD3 medium 16-ch VAE in diffusers format. |
| `checkpoints/rae/` | DINOv2-B image encoder + RAE ViT-XL decoder + ImageNet-512 normalization statistics. |
| `checkpoints/scale_rae/` | SigLIP-2 So400M encoder + Scale-RAE ViT-XL decoder + decoder config. |
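
After downloading, it can be useful to confirm that everything listed in the two tables is present. The following is a minimal sketch using only the standard library. `missing_entries` is a hypothetical helper, and it checks only that the listed files and directories exist, since the contents of the VAE directories are not enumerated here.

```python
from pathlib import Path

# PiD checkpoint directories, each expected to hold model_ema_bf16.pth.
PID_DIRS = [
    "PiD_res2k_sr4x_official_flux_distill_4step",
    "PiD_res2k_sr4x_official_flux2_distill_4step",
    "PiD_res2k_sr4x_official_sd3_distill_4step",
    "PiD_res2k_sr4x_official_dinov2_distill_4step",
    "PiD_res2k_sr8x_official_siglip_distill_4step",
    "PiD_res2kto4k_sr4x_official_flux_distill_4step",
    "PiD_res2kto4k_sr4x_official_flux2_distill_4step",
    "PiD_res2kto4k_sr4x_official_sd3_distill_4step",
]
VAE_FILES = ["ae.safetensors", "flux2_ae.safetensors"]
VAE_DIRS = ["sd3_vae", "rae", "scale_rae"]


def missing_entries(root: str = "checkpoints") -> list[str]:
    """Return the listed entries (relative to root) that are not on disk."""
    base = Path(root)
    missing = []
    for name in PID_DIRS:
        rel = Path(name) / "model_ema_bf16.pth"
        if not (base / rel).is_file():
            missing.append(rel.as_posix())
    for name in VAE_FILES:
        if not (base / name).is_file():
            missing.append(name)
    for name in VAE_DIRS:
        if not (base / name).is_dir():
            missing.append(name)
    return missing


if __name__ == "__main__":
    for entry in missing_entries("checkpoints"):
        print("missing:", entry)
```
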
## Usage
The decoder checkpoints are loaded by the inference scripts in the [PiD
codebase](https://github.com/nv-tlabs/pid). The exact `(backbone, ckpt_type) → path` mapping is defined in
[`pid/_src/inference/checkpoint_registry.py`](https://github.com/nv-tlabs/PiD/blob/main/pid/_src/inference/checkpoint_registry.py),
which is the single source of truth. Clone the repo, point it at this snapshot,
and the demos pick the right file automatically:
```bash
# Pull just the checkpoints/ tree into the repo root (skips this README and
# the teaser figure so they don't clobber the files in the source repo).
hf download nvidia/PiD --local-dir . --include "checkpoints/*"
# Then run any of the demos, e.g.:
PYTHONPATH=. python -m pid._src.inference.from_ldm_flux \
--prompt "A photorealistic cat" \
--ldm_inference_steps 28 --save_xt_steps 22 24 26 \
--output_dir ./results/demo \
--cfg_scale 1 --pid_inference_steps 4 --scale 4
```
Pick the `2kto4k` variant via `--pid_ckpt_type 2kto4k` when decoding at 4K.
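
When scripting several runs, the command line can be assembled in Python. The sketch below builds the argument list for the Flux demo using only the flags that appear in this section. `demo_command` is a hypothetical helper. It assumes that leaving out `--pid_ckpt_type` selects the `2k` checkpoint, as the example command suggests. Whether 4K decoding needs further flags (for instance to run the LDM at 1024 px) is not covered here.

```python
import sys


def demo_command(prompt: str, output_dir: str, ckpt_type: str = "2k") -> list[str]:
    """Assemble argv for the Flux demo (hypothetical helper, flags taken from the README)."""
    if ckpt_type not in ("2k", "2kto4k"):
        raise ValueError(f"unknown ckpt_type: {ckpt_type}")
    cmd = [
        sys.executable, "-m", "pid._src.inference.from_ldm_flux",
        "--prompt", prompt,
        "--ldm_inference_steps", "28",
        "--save_xt_steps", "22", "24", "26",
        "--output_dir", output_dir,
        "--cfg_scale", "1",
        "--pid_inference_steps", "4",
        "--scale", "4",
    ]
    # Assumption: the flag is only needed when departing from the 2k default.
    if ckpt_type == "2kto4k":
        cmd += ["--pid_ckpt_type", "2kto4k"]
    return cmd


print(" ".join(demo_command("A photorealistic cat", "./results/demo", "2kto4k")))
```

To execute the command, pass the list to `subprocess.run(cmd, check=True)` from the root of the cloned repo with `PYTHONPATH=.` set in the environment, mirroring the shell example.
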
## Citation
```
@article{lu2026pid,
title={PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion},
author={Lu, Yifan and Wu, Qi and Wu, Jay Zhangjie and Wang, Zian and Ling, Huan and Fidler, Sanja and Ren, Xuanchi},
journal={arXiv preprint arXiv:2605.23902},
year={2026}
}
```