	---
	license: apache-2.0
	library_name: pytorch
	tags:
	- super-resolution
	- diffusion
	- pixel-diffusion-decoder
	- vae-decoder
	pipeline_tag: image-to-image
	---

	# PiD — Pixel Diffusion Decoder

	<p align="center">
	<img src="figures/teaser.jpg" alt="PiD teaser" width="100%">
	</p>

	PiD reformulates the latent-to-pixel decoder as a conditional pixel-space
	diffusion model, unifying decoding and upsampling into a single generative
	module. It denoises directly in high-resolution pixel space and produces a
	super-resolved image in one pass. This repository hosts the released decoder
	checkpoints, plus the encoder/decoder ("VAE") weights they depend on.

	All `PiD_` checkpoints in this repo are 4-step distilled. The non-`PiD_`
	entries (`ae.safetensors`, `flux2_ae.safetensors`, `sd3_vae/`, `rae/`,
	`scale_rae/`) are the corresponding encoder/decoder VAE weights that PiD
	plugs into — they're not PiD checkpoints themselves.

	## PiD checkpoints

	Two variants are released. The diffusers-style backbones (Flux1, Flux2, SD3)
	get both; the RAE backbones get `2k` only:

	- `2k`: trained at 2048 px and used as a 4× decoder (512 px LDM → 2048 px), or
	as an 8× decoder for the Scale-RAE backbone (256 px → 2048 px).
	- `2kto4k`: trained with multi-resolution data bucketing from 2048 px to
	3840 px and an SD3-style dynamic shift; designed for 1024 px LDM → 4K
	(3840 px) decoding. Only released for the diffusers backbones.

	| Path | Backbone (encoder side) | SR factor | Variant |
	|------|-------------------------|-----------|---------|
	| `checkpoints/PiD_res2k_sr4x_official_flux_distill_4step` | Flux1-dev (16-ch VAE) | 4× | 2k |
	| `checkpoints/PiD_res2k_sr4x_official_flux2_distill_4step` | Flux2-dev (128-ch BN VAE) | 4× | 2k |
	| `checkpoints/PiD_res2k_sr4x_official_sd3_distill_4step` | SD3 medium (16-ch VAE) | 4× | 2k |
	| `checkpoints/PiD_res2k_sr4x_official_dinov2_distill_4step` | DINOv2-B + RAE ViT-XL (768-ch) | 4× | 2k |
	| `checkpoints/PiD_res2k_sr8x_official_siglip_distill_4step` | SigLIP-2 So400M + Scale-RAE ViT-XL (1152) | 8× | 2k |
	| `checkpoints/PiD_res2kto4k_sr4x_official_flux_distill_4step` | Flux1-dev (16-ch VAE) | 4× | 2kto4k |
	| `checkpoints/PiD_res2kto4k_sr4x_official_flux2_distill_4step` | Flux2-dev (128-ch BN VAE) | 4× | 2kto4k |
	| `checkpoints/PiD_res2kto4k_sr4x_official_sd3_distill_4step` | SD3 medium (16-ch VAE) | 4× | 2kto4k |

	Z-Image shares Flux1's VAE, so its inference path reuses the `flux` checkpoints
	(both `2k` and `2kto4k`) — no separate `zimage` checkpoint is shipped.

	Each directory contains a single file, `model_ema_bf16.pth`, which is the EMA
	weights cast to bfloat16 — the format the inference scripts load by default.
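The directory names in the table follow a regular pattern, so the repo-relative path to a checkpoint can be derived from the backbone and variant. The following is a minimal sketch of that convention, not the project's actual registry: `pid_checkpoint_path`, the `RELEASED` table, and the `zimage` alias key are hypothetical names introduced here for illustration. The registry module in the PiD codebase remains authoritative.

```python
from pathlib import PurePosixPath

# Released (backbone, variant) pairs and their SR factors, copied from the table above.
RELEASED = {
    ("flux", "2k"): 4,
    ("flux2", "2k"): 4,
    ("sd3", "2k"): 4,
    ("dinov2", "2k"): 4,
    ("siglip", "2k"): 8,
    ("flux", "2kto4k"): 4,
    ("flux2", "2kto4k"): 4,
    ("sd3", "2kto4k"): 4,
}

# Z-Image shares Flux1's VAE, so it reuses the flux checkpoints.
# The key "zimage" is an assumed spelling, not taken from the codebase.
ALIASES = {"zimage": "flux"}


def pid_checkpoint_path(backbone: str, variant: str = "2k") -> PurePosixPath:
    """Return the repo-relative path of the EMA bf16 weights (hypothetical helper)."""
    backbone = ALIASES.get(backbone, backbone)
    try:
        sr = RELEASED[(backbone, variant)]
    except KeyError:
        raise ValueError(f"no released checkpoint for {backbone!r} / {variant!r}") from None
    name = f"PiD_res{variant}_sr{sr}x_official_{backbone}_distill_4step"
    return PurePosixPath("checkpoints") / name / "model_ema_bf16.pth"


print(pid_checkpoint_path("zimage", "2kto4k"))
```

Unreleased combinations, such as `dinov2` with `2kto4k`, raise `ValueError` rather than returning a path that does not exist in the snapshot.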

	## VAE / encoder weights

	These are the per-backbone encoder (and, where applicable, original decoder)
	weights that PiD pairs with. They're hosted here so a single download brings
	everything needed end-to-end.

	| Path | Description |
	|------|-------------|
	| `checkpoints/ae.safetensors` | Flux1-dev / Z-Image 16-ch VAE (encoder + original Flux decoder). |
	| `checkpoints/flux2_ae.safetensors` | Flux2-dev 128-ch BN VAE. |
	| `checkpoints/sd3_vae/` | SD3 medium 16-ch VAE in diffusers format. |
	| `checkpoints/rae/` | DINOv2-B image encoder + RAE ViT-XL decoder + ImageNet-512 normalization statistics. |
	| `checkpoints/scale_rae/` | SigLIP-2 So400M encoder + Scale-RAE ViT-XL decoder + decoder config. |

	## Usage

	The decoder checkpoints are loaded by the inference scripts in the PiD
	codebase. The single source of truth for the `(backbone, ckpt_type) → path`
	mapping is
	[`pid/_src/inference/checkpoint_registry.py`](https://github.com/). Clone the
	repo, point it at this snapshot, and the demos pick the right file
	automatically:

	```bash
	# Download this whole snapshot into the current directory; files land under ./checkpoints
	hf download nvidia/PiD --local-dir .

	# Then run any of the demos, e.g.:
	PYTHONPATH=. python -m pid._src.inference.from_ldm_flux \
	--prompt "A photorealistic cat" \
	--ldm_inference_steps 28 --save_xt_steps 22 24 26 \
	--output_dir ./results/demo \
	--cfg_scale 1 --pid_inference_steps 4 --scale 4
	```

	Pick the `2kto4k` variant via `--pid_ckpt_type 2kto4k` when decoding at 4K.
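If only one backbone is needed, a partial download can avoid fetching every checkpoint. The sketch below uses `snapshot_download` from `huggingface_hub` with its `allow_patterns` filter. The helpers `allow_patterns` and `download_subset` are hypothetical names introduced here, and the sketch assumes the repository layout matches the tables above.

```python
def allow_patterns(pid_dir: str, vae_entry: str) -> list[str]:
    """Build glob patterns for one PiD checkpoint directory plus its VAE entry."""
    # Entries ending in "/" are directories in the tables above; match their contents.
    if vae_entry.endswith("/"):
        vae = f"checkpoints/{vae_entry}*"
    else:
        vae = f"checkpoints/{vae_entry}"
    return [f"checkpoints/{pid_dir}/*", vae]


def download_subset(pid_dir: str, vae_entry: str, local_dir: str = ".") -> str:
    """Fetch only the selected files from nvidia/PiD (requires huggingface_hub)."""
    # Imported lazily so the pattern helper works without the dependency installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="nvidia/PiD",
        allow_patterns=allow_patterns(pid_dir, vae_entry),
        local_dir=local_dir,
    )


# Patterns for the Flux1 2k decoder and its VAE; nothing is downloaded here.
print(allow_patterns("PiD_res2k_sr4x_official_flux_distill_4step", "ae.safetensors"))
```

Calling `download_subset(...)` with the same arguments would place the files under `./checkpoints`, the same layout the full download produces.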

	## License

	Released under the Apache License 2.0. Copyright 2026 NVIDIA Corporation
	& Affiliates. See the `LICENSE` file in the source repository for the full
	text.

	The upstream encoder backbones (DINOv2, SigLIP-2, Flux, SD3, Z-Image) and their
	weights remain under their own original licenses; PiD's Apache-2.0 release
	covers only the PiD decoder weights and code.