---
library_name: pytorch
tags:
- super-resolution
- diffusion
- pixel-diffusion-decoder
- vae-decoder
pipeline_tag: image-to-image
base_model:
- nvidia/PixelDiT-1300M-1024px
- Tongyi-MAI/Z-Image
- black-forest-labs/FLUX.1-dev
- black-forest-labs/FLUX.2-dev
- nyu-visionx/Scale-RAE-Qwen7B_DiT9.8B
---
# PiD — Pixel Diffusion Decoder
**[Paper](https://arxiv.org/abs/2605.23902), [Project Page](https://research.nvidia.com/labs/sil/projects/pid/)**
[Yifan Lu](https://yifanlu0227.github.io/),
[Qi Wu](https://wilsoncernwq.github.io/),
[Jay Zhangjie Wu](https://zhangjiewu.github.io/),
[Zian Wang](https://www.cs.toronto.edu/~zianwang/),
[Huan Ling](https://www.cs.toronto.edu/~linghuan/),
[Sanja Fidler](https://www.cs.utoronto.ca/~fidler/),
[Xuanchi Ren](https://xuanchiren.com/)
PiD reformulates the latent-to-pixel decoder as a conditional pixel-space
diffusion model, unifying decoding and upsampling into a single generative
module. It denoises directly in high-resolution pixel space and produces a
super-resolved image in one pass. This repository hosts the released decoder
checkpoints, plus the encoder/decoder ("VAE") weights they depend on.
All `PiD_*` checkpoints in this repo are **4-step distilled**. The non-`PiD_*`
entries (`ae.safetensors`, `flux2_ae.safetensors`, `sd3_vae/`, `rae/`,
`scale_rae/`) are **the corresponding encoder/decoder VAE weights** that PiD
plugs into — they're not PiD checkpoints themselves.
### License/Terms of Use
This model is released under the [NSCLv1](https://huggingface.co/nvidia/PixelDiT-1300M-1024px/blob/main/LICENSE) License. The work and any derivative works may only be used for non-commercial (research or evaluation) purposes.
### Deployment Geography
Global
## PiD checkpoints
Two variants are released for each diffusers-style backbone:
- **`2k`**: trained at 2048 px; used as a 4× decoder (512 px LDM → 2048 px), or as
  an 8× decoder for the Scale-RAE backbone (256 px → 2048 px).
- **`2kto4k`**: trained with multi-resolution data bucketing (2048 px → 3840 px) and an
  SD3-style dynamic shift; designed for 1024 px LDM → 4K (4096 px) decoding.
Both checkpoint variants support multiple aspect ratios.
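
The resolution arithmetic behind the two variants can be sketched in a few lines. This is only an illustration: `decoded_size` is a hypothetical helper, not part of the PiD codebase, and it assumes that "512 LDM" refers to the pixel resolution at which the latent diffusion model generates.

```python
def decoded_size(ldm_width: int, ldm_height: int, sr_factor: int) -> tuple[int, int]:
    """Return the (width, height) in pixels of the decoded image.

    Hypothetical helper: assumes the decoder scales both sides of the
    LDM resolution by the same super-resolution factor.
    """
    if sr_factor not in (4, 8):
        raise ValueError(f"released checkpoints use 4x or 8x, got {sr_factor}x")
    return ldm_width * sr_factor, ldm_height * sr_factor


# The three configurations named in the list above.
print(decoded_size(512, 512, 4))    # 2k variant, 4x      -> (2048, 2048)
print(decoded_size(256, 256, 8))    # 2k variant, 8x      -> (2048, 2048)
print(decoded_size(1024, 1024, 4))  # 2kto4k variant, 4x  -> (4096, 4096)
```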
| Path | Backbone (encoder side) | SR factor | Variant |
|---------------------------------------------------------------|--------------------------------------------|-----------|-----------|
| `checkpoints/PiD_res2k_sr4x_official_flux_distill_4step` | Flux1-dev (16-ch VAE) | 4× | 2k |
| `checkpoints/PiD_res2k_sr4x_official_flux2_distill_4step` | Flux2-dev (128-ch BN VAE) | 4× | 2k |
| `checkpoints/PiD_res2k_sr4x_official_sd3_distill_4step` | SD3 medium (16-ch VAE) | 4× | 2k |
| `checkpoints/PiD_res2k_sr4x_official_dinov2_distill_4step` | DINOv2-B + RAE ViT-XL (768-ch) | 4× | 2k |
| `checkpoints/PiD_res2k_sr8x_official_siglip_distill_4step` | SigLIP-2 So400M + Scale-RAE ViT-XL (1152-ch) | 8× | 2k |
| `checkpoints/PiD_res2kto4k_sr4x_official_flux_distill_4step` | Flux1-dev (16-ch VAE) | 4× | 2kto4k |
| `checkpoints/PiD_res2kto4k_sr4x_official_flux2_distill_4step` | Flux2-dev (128-ch BN VAE) | 4× | 2kto4k |
| `checkpoints/PiD_res2kto4k_sr4x_official_sd3_distill_4step` | SD3 medium (16-ch VAE) | 4× | 2kto4k |
Z-Image shares Flux1's VAE, so its inference path reuses the `flux` checkpoints
(both `2k` and `2kto4k`) — no separate `zimage` checkpoint is shipped.
Each checkpoint directory contains a single file, `model_ema_bf16.pth`, holding the EMA
weights cast to bfloat16. This is the format the inference scripts load by default.
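
The directory naming in the table is regular enough to sketch as a small lookup. The helper below is a hypothetical illustration built only from the paths listed above; the backbone keys (`"flux"`, `"zimage"`, and so on) are assumptions taken from the directory names, and the real mapping in the PiD codebase may use different keys or logic.

```python
from pathlib import Path

# (backbone, variant) -> SR factor, copied from the checkpoint table.
SR_FACTOR = {
    ("flux", "2k"): 4,
    ("flux2", "2k"): 4,
    ("sd3", "2k"): 4,
    ("dinov2", "2k"): 4,
    ("siglip", "2k"): 8,
    ("flux", "2kto4k"): 4,
    ("flux2", "2kto4k"): 4,
    ("sd3", "2kto4k"): 4,
}

# Z-Image shares Flux1's VAE, so it reuses the flux checkpoints.
ALIASES = {"zimage": "flux"}


def pid_checkpoint_path(backbone: str, variant: str = "2k",
                        root: str = "checkpoints") -> Path:
    """Build the path to model_ema_bf16.pth for a released checkpoint."""
    backbone = ALIASES.get(backbone, backbone)
    try:
        factor = SR_FACTOR[(backbone, variant)]
    except KeyError:
        raise ValueError(
            f"no released checkpoint for backbone={backbone!r}, variant={variant!r}"
        ) from None
    name = f"PiD_res{variant}_sr{factor}x_official_{backbone}_distill_4step"
    return Path(root) / name / "model_ema_bf16.pth"


print(pid_checkpoint_path("zimage", "2kto4k").as_posix())
```

Combinations that were not released, such as a `2kto4k` checkpoint for the `siglip` backbone, raise a `ValueError` instead of returning a path that does not exist.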
## VAE / encoder weights
These are the per-backbone encoder (and, where applicable, original decoder)
weights that PiD pairs with. They're hosted here so a single download brings
everything needed end-to-end.
| Path | Description |
|---------------------------------|--------------------------------------------------------------------------------------|
| `checkpoints/ae.safetensors` | Flux1-dev / Z-Image 16-ch VAE (encoder + original Flux decoder). |
| `checkpoints/flux2_ae.safetensors` | Flux2-dev 128-ch BN VAE. |
| `checkpoints/sd3_vae/` | SD3 medium 16-ch VAE in diffusers format. |
| `checkpoints/rae/` | DINOv2-B image encoder + RAE ViT-XL decoder + ImageNet-512 normalization statistics. |
| `checkpoints/scale_rae/` | SigLIP-2 So400M encoder + Scale-RAE ViT-XL decoder + decoder config. |
## Usage
The decoder checkpoints are loaded by the inference scripts in the [PiD
codebase](https://github.com/nv-tlabs/pid). The exact `(backbone, ckpt_type) → path` mapping is defined in
[`pid/_src/inference/checkpoint_registry.py`](https://github.com/nv-tlabs/PiD/blob/main/pid/_src/inference/checkpoint_registry.py),
which is the single source of truth. Clone the repo, download this snapshot's
`checkpoints/` tree into the repo root, and the demos pick the right file
automatically:
```bash
# Pull just the checkpoints/ tree into the repo root (skips this README and
# the teaser figure so they don't clobber the files in the source repo).
hf download nvidia/PiD --local-dir . --include "checkpoints/*"
# Then run any of the demos, e.g.:
PYTHONPATH=. python -m pid._src.inference.from_ldm_flux \
--prompt "A photorealistic cat" \
--ldm_inference_steps 28 --save_xt_steps 22 24 26 \
--output_dir ./results/demo \
--cfg_scale 1 --pid_inference_steps 4 --scale 4
```
Pick the `2kto4k` variant via `--pid_ckpt_type 2kto4k` when decoding at 4K.
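
The download step can also be run from Python rather than the `hf` CLI. The sketch below uses `snapshot_download` from the `huggingface_hub` package with the same include pattern as the command above; the wrapper name `download_checkpoints` is hypothetical, and the snippet assumes `huggingface_hub` is installed.

```python
from pathlib import Path

REPO_ID = "nvidia/PiD"
# Same filter as `--include "checkpoints/*"`: skips the README and teaser figure.
ALLOW_PATTERNS = ["checkpoints/*"]


def download_checkpoints(local_dir: str = ".") -> Path:
    """Fetch only the checkpoints/ tree of this repo into local_dir."""
    # Imported lazily so this module can be imported without huggingface_hub.
    from huggingface_hub import snapshot_download

    return Path(
        snapshot_download(
            repo_id=REPO_ID,
            local_dir=local_dir,
            allow_patterns=ALLOW_PATTERNS,
        )
    )


# Run from the root of the cloned PiD repo:
# download_checkpoints(".")
```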
## Citation
```bibtex
@article{lu2026pid,
title={PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion},
author={Lu, Yifan and Wu, Qi and Wu, Jay Zhangjie and Wang, Zian and Ling, Huan and Fidler, Sanja and Ren, Xuanchi},
journal={arXiv preprint arXiv:2605.23902},
year={2026}
}
```