---
license: apache-2.0
library_name: pytorch
tags:
- super-resolution
- diffusion
- pixel-diffusion-decoder
- vae-decoder
pipeline_tag: image-to-image
---

# PiD: Pixel Diffusion Decoder

PiD teaser

PiD reformulates the latent-to-pixel decoder as a conditional pixel-space diffusion model, unifying decoding and upsampling into a single generative module. It denoises directly in high-resolution pixel space and produces a super-resolved image in one pass.

This repository hosts the released decoder checkpoints, plus the encoder/decoder ("VAE") weights they depend on.

All `PiD_*` checkpoints in this repo are **4-step distilled**. The non-`PiD_*` entries (`ae.safetensors`, `flux2_ae.safetensors`, `sd3_vae/`, `rae/`, `scale_rae/`) are **the corresponding encoder/decoder VAE weights** that PiD plugs into; they are not PiD checkpoints themselves.

## PiD checkpoints

Two variants are released for each diffusers-style backbone:

- **`2k`**: trained at 2048px, used as a 4× decoder (512 LDM → 2048 px), or as an 8× decoder for the Scale-RAE backbone (256 → 2048).
- **`2kto4k`**: trained with multi-resolution data bucketing 2048→3840 and an SD3-style dynamic shift; designed for 1024 LDM → 4K (3840 px) decoding. Only released for the diffusers backbones.

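As a rough illustration of the SR-factor arithmetic for the `2k` variant, the sketch below multiplies the base model's native resolution by the factor listed for each backbone. The helper name `decoded_size`, the `SR_FACTORS` dictionary, and the backbone keys (taken from the checkpoint directory names) are hypothetical and not part of the PiD codebase; it also assumes that "512 LDM" refers to the side length, in pixels, of the base latent diffusion model's native output.

```python
# Hypothetical helper for illustration only; not part of the PiD codebase.
# SR factors per backbone, as listed for the released `2k` checkpoints.
SR_FACTORS = {
    "flux": 4,
    "flux2": 4,
    "sd3": 4,
    "dinov2": 4,
    "siglip": 8,  # Scale-RAE backbone
}


def decoded_size(ldm_size: int, backbone: str) -> int:
    """Return the output side length (px) for a given base resolution.

    Assumes `ldm_size` is the native side length of the base model's
    image and that the output is simply scaled by the backbone's SR factor.
    """
    if backbone not in SR_FACTORS:
        raise ValueError(f"unknown backbone: {backbone!r}")
    return ldm_size * SR_FACTORS[backbone]


print(decoded_size(512, "flux"))    # 512 * 4
print(decoded_size(256, "siglip"))  # 256 * 8
```

Both calls land on the 2048 px training resolution of the `2k` variant, matching the two examples given in the bullet above.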

| Path | Backbone (encoder side) | SR factor | Variant |
|---------------------------------------------------------------|--------------------------------------------|-----------|-----------|
| `checkpoints/PiD_res2k_sr4x_official_flux_distill_4step` | Flux1-dev (16-ch VAE) | 4× | 2k |
| `checkpoints/PiD_res2k_sr4x_official_flux2_distill_4step` | Flux2-dev (128-ch BN VAE) | 4× | 2k |
| `checkpoints/PiD_res2k_sr4x_official_sd3_distill_4step` | SD3 medium (16-ch VAE) | 4× | 2k |
| `checkpoints/PiD_res2k_sr4x_official_dinov2_distill_4step` | DINOv2-B + RAE ViT-XL (768-ch) | 4× | 2k |
| `checkpoints/PiD_res2k_sr8x_official_siglip_distill_4step` | SigLIP-2 So400M + Scale-RAE ViT-XL (1152) | 8× | 2k |
| `checkpoints/PiD_res2kto4k_sr4x_official_flux_distill_4step` | Flux1-dev (16-ch VAE) | 4× | 2kto4k |
| `checkpoints/PiD_res2kto4k_sr4x_official_flux2_distill_4step` | Flux2-dev (128-ch BN VAE) | 4× | 2kto4k |
| `checkpoints/PiD_res2kto4k_sr4x_official_sd3_distill_4step` | SD3 medium (16-ch VAE) | 4× | 2kto4k |

Z-Image shares Flux1's VAE, so its inference path reuses the `flux` checkpoints (both `2k` and `2kto4k`); no separate `zimage` checkpoint is shipped.

Each directory contains a single file, `model_ema_bf16.pth`, which holds the EMA weights cast to bfloat16, the format the inference scripts load by default.

## VAE / encoder weights

These are the per-backbone encoder (and, where applicable, original decoder) weights that PiD pairs with. They're hosted here so a single download brings everything needed end-to-end.

| Path | Description |
|------------------------------------|--------------------------------------------------------------------------------------|
| `checkpoints/ae.safetensors` | Flux1-dev / Z-Image 16-ch VAE (encoder + original Flux decoder). |
| `checkpoints/flux2_ae.safetensors` | Flux2-dev 128-ch BN VAE. |
| `checkpoints/sd3_vae/` | SD3 medium 16-ch VAE in diffusers format. |
| `checkpoints/rae/` | DINOv2-B image encoder + RAE ViT-XL decoder + ImageNet-512 normalization statistics. |
| `checkpoints/scale_rae/` | SigLIP-2 So400M encoder + Scale-RAE ViT-XL decoder + decoder config. |

## Usage

The decoder checkpoints are loaded by the inference scripts in the PiD codebase. The single source of truth for the exact `(backbone, ckpt_type) → path` mapping is [`pid/_src/inference/checkpoint_registry.py`](https://github.com/). Clone the repo, point it at this snapshot, and the demos pick the right file automatically:

```bash
# Download this whole snapshot into ./checkpoints
hf download nvidia/PiD --local-dir .

# Then run any of the demos, e.g.:
PYTHONPATH=. python -m pid._src.inference.from_ldm_flux \
    --prompt "A photorealistic cat" \
    --ldm_inference_steps 28 --save_xt_steps 22 24 26 \
    --output_dir ./results/demo \
    --cfg_scale 1 --pid_inference_steps 4 --scale 4
```

Pick the `2kto4k` variant via `--pid_ckpt_type 2kto4k` when decoding at 4K.

## License

Released under the **Apache License 2.0**. Copyright 2026 NVIDIA Corporation & Affiliates. See the `LICENSE` file in the source repository for the full text.

The upstream encoder backbones (DINOv2, SigLIP-2, Flux, SD3, Z-Image) and their weights remain under their own original licenses; PiD's Apache-2.0 release covers only the PiD decoder weights and code.