PixelDiT: Pixel Diffusion Transformers for Image Generation

Yongsheng Yu¹,² Wei Xiong¹† Weili Nie¹ Yichen Sheng¹ Shiqiu Liu¹ Jiebo Luo²

¹NVIDIA ²University of Rochester
†Project Lead and Main Advising

Key Features

VAE-free: Diffusion runs directly in pixel space, with no VAE encoder or decoder
Dual-level architecture: Patch-level DiT + Pixel-level DiT
MM-DiT text-image fusion: Joint attention between text and image tokens
Text encoder: Gemma-2-2B-IT
Multi-aspect-ratio: Supports various aspect ratios at 1024px

Usage

Installation

pip install -r requirements.txt

Inference

# See the full inference script at: https://github.com/NVlabs/PixelDiT
cd t2i/
python inference.py \
  --config configs/PixelDiT_1024px_pixel_diffusion_stage3.yaml \
  --model_path PixelDiT-T2I-v1.pth \
  --txt_file prompts.txt \
  --custom_height 1024 --custom_width 1024 \
  --cfg_scale 2.75 --seed 2025 \
  --negative_prompt "low quality, worst quality, over-saturated, blurry, deformed, watermark" \
  --work_dir "."
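
For batch runs it can be convenient to assemble the same command from Python. The sketch below is not part of the repository: `build_inference_command` is a hypothetical helper that uses only the flags shown in the command above, with the config and checkpoint paths copied from it.

```python
import shlex


def build_inference_command(prompts_file, height=1024, width=1024,
                            cfg_scale=2.75, seed=2025, negative_prompt=""):
    """Assemble the inference.py argument list (hypothetical helper)."""
    cmd = [
        "python", "inference.py",
        "--config", "configs/PixelDiT_1024px_pixel_diffusion_stage3.yaml",
        "--model_path", "PixelDiT-T2I-v1.pth",
        "--txt_file", prompts_file,
        "--custom_height", str(height),
        "--custom_width", str(width),
        "--cfg_scale", str(cfg_scale),
        "--seed", str(seed),
        "--work_dir", ".",
    ]
    # Only pass a negative prompt when one was actually given.
    if negative_prompt:
        cmd += ["--negative_prompt", negative_prompt]
    return cmd


cmd = build_inference_command("prompts.txt", negative_prompt="low quality, blurry")
print(shlex.join(cmd))  # shell-quoted form of the command
```

To launch it, the list can be passed to `subprocess.run(cmd, cwd="t2i", check=True)`, which mirrors the `cd t2i/` step.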

Inference Parameters

Parameter	Default	Description
`--cfg_scale`	3.5	Classifier-free guidance scale
`--step`	50	Number of sampling steps (25 for faster sampling, 50 for higher quality)
`--seed`	0	Random seed
`--negative_prompt`	`""`	Negative prompt for CFG
`--interval_guidance`	[0, 1]	CFG application interval
`--sampling_algo`	flow_dpm-solver	Sampling algorithm
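
To illustrate how `--cfg_scale` and `--interval_guidance` might interact, here is a minimal sketch of the standard classifier-free guidance formula with an application interval. It is not taken from the PixelDiT code. The function name, the use of plain lists in place of tensors, and the reading of the interval as a fraction of sampling progress are all assumptions made for illustration.

```python
def guided_prediction(cond, uncond, cfg_scale, progress, interval=(0.0, 1.0)):
    """Combine conditional and unconditional predictions with CFG.

    progress: fraction of sampling completed, in [0, 1] (assumed convention).
    Outside `interval`, the conditional prediction is returned unchanged.
    """
    lo, hi = interval
    if not (lo <= progress <= hi):
        return list(cond)
    # Standard CFG: start at the unconditional prediction and move
    # cfg_scale times the (conditional - unconditional) difference.
    return [u + cfg_scale * (c - u) for c, u in zip(cond, uncond)]


cond = [1.0, 2.0]
uncond = [0.0, 1.0]
inside = guided_prediction(cond, uncond, cfg_scale=3.5, progress=0.5)
outside = guided_prediction(cond, uncond, cfg_scale=3.5, progress=0.5,
                            interval=(0.0, 0.25))
print(inside, outside)
```

With a scale of 1.0 the formula reduces to the conditional prediction, so larger values push the sample further toward the prompt. When a negative prompt is supplied, a common convention is to condition the unconditional branch on it instead of an empty prompt; whether this repository does so should be checked in its inference script.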

Model Architecture

Component	Value
Parameters	1.3B
Patch size	16
Hidden size	1536
Attention heads	24
Patch-level depth	14
Pixel-level depth	2
Pixel hidden size	16
Pixel attention hidden size	1152
Text embedding dim	2304
Text max length	300
Text encoder	Gemma-2-2B-IT
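
A few quantities follow from the table by simple arithmetic. The sketch below assumes non-overlapping square patches and a hidden size split evenly across attention heads; neither is stated on this card, and `patch_token_count` is a hypothetical helper.

```python
PATCH_SIZE = 16      # "Patch size" from the table
HIDDEN_SIZE = 1536   # "Hidden size" from the table
NUM_HEADS = 24       # "Attention heads" from the table


def patch_token_count(height, width, patch_size=PATCH_SIZE):
    """Number of patch tokens, assuming non-overlapping square patches."""
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be multiples of the patch size")
    return (height // patch_size) * (width // patch_size)


print(patch_token_count(1024, 1024))   # 64 * 64 = 4096 patch tokens
print(PATCH_SIZE * PATCH_SIZE)         # 256 pixel positions per patch
print(HIDDEN_SIZE // NUM_HEADS)        # 64 channels per head, if split evenly
```

Under these assumptions, other aspect ratios change only the token count, since it depends on the product of the two side lengths divided by the patch area.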

Citation

@inproceedings{yu2026pixeldit,
      title={PixelDiT: Pixel Diffusion Transformers for Image Generation},
      author={Yongsheng Yu and Wei Xiong and Weili Nie and Yichen Sheng and Shiqiu Liu and Jiebo Luo},
      booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
      year={2026},
}

License

This model is released under the NSCLv1 License. The work and any derivative works may only be used for non-commercial (research or evaluation) purposes.

Paper

PixelDiT: Pixel Diffusion Transformers for Image Generation (arXiv:2511.20645, published Nov 25, 2025)