PixelDiT: Pixel Diffusion Transformers for Image Generation
Paper: arXiv:2511.20645
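Yongsheng Yu<sup>1,2</sup>, Wei Xiong<sup>1†</sup>, Weili Nie<sup>1</sup>, Yichen Sheng<sup>1</sup>, Shiqiu Liu<sup>1</sup>, Jiebo Luo<sup>2</sup>
<sup>1</sup>NVIDIA, <sup>2</sup>University of Rochester
<sup>†</sup>Project Lead and Main Advising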
```bash
pip install -r requirements.txt

# See the full inference script at: https://github.com/NVlabs/PixelDiT
cd t2i/
python inference.py \
    --config configs/PixelDiT_1024px_pixel_diffusion_stage3.yaml \
    --model_path PixelDiT-T2I-v1.pth \
    --txt_file prompts.txt \
    --custom_height 1024 --custom_width 1024 \
    --cfg_scale 2.75 --seed 2025 \
    --negative_prompt "low quality, worst quality, over-saturated, blurry, deformed, watermark" \
    --work_dir "."
```
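To render the same prompt file under several seeds, the command above can be assembled programmatically. The sketch below is illustrative only: `build_command` is a hypothetical helper that is not part of the repository, and it uses only the flags and values that appear in the example command.

```python
import shlex

NEGATIVE_PROMPT = (
    "low quality, worst quality, over-saturated, blurry, deformed, watermark"
)


def build_command(seed, cfg_scale=2.75, negative_prompt=NEGATIVE_PROMPT):
    """Return the example inference command as an argument list.

    Hypothetical helper: every flag and value is copied from the example
    command; only the seed, guidance scale, and negative prompt vary.
    """
    return [
        "python", "inference.py",
        "--config", "configs/PixelDiT_1024px_pixel_diffusion_stage3.yaml",
        "--model_path", "PixelDiT-T2I-v1.pth",
        "--txt_file", "prompts.txt",
        "--custom_height", "1024",
        "--custom_width", "1024",
        "--cfg_scale", str(cfg_scale),
        "--seed", str(seed),
        "--negative_prompt", negative_prompt,
        "--work_dir", ".",
    ]


if __name__ == "__main__":
    for seed in (2025, 2026, 2027):
        # Print a shell-quoted command that can be pasted into a terminal.
        print(shlex.join(build_command(seed)))
```

To execute the commands instead of printing them, the same list can be passed to `subprocess.run(..., check=True)` with `t2i/` as the working directory.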
| Parameter | Default | Description |
|---|---|---|
| `--cfg_scale` | 3.5 | Classifier-free guidance scale |
| `--step` | 50 | Number of sampling steps (25 for fast, 50 for quality) |
| `--seed` | 0 | Random seed |
| `--negative_prompt` | `""` | Negative prompt for CFG |
| `--interval_guidance` | [0, 1] | CFG application interval |
| `--sampling_algo` | `flow_dpm-solver` | Sampling algorithm |
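For readers unfamiliar with how a guidance scale and a guidance interval typically interact, here is a minimal sketch of classifier-free guidance restricted to a time interval. It is not taken from the PixelDiT code. The function name, the convention that `t` is normalized to [0, 1], and the fallback to the plain conditional prediction outside the interval are all assumptions made for illustration.

```python
def guided_prediction(cond, uncond, cfg_scale, t, interval=(0.0, 1.0)):
    """Blend conditional and unconditional predictions (lists of floats).

    Illustrative only: assumes t is a normalized timestep in [0, 1] and
    that guidance is skipped entirely outside the interval.
    """
    lo, hi = interval
    if not (lo <= t <= hi):
        # Outside the interval: use the conditional prediction unchanged.
        return list(cond)
    # Standard CFG: move from the unconditional prediction toward the
    # conditional one, scaled by cfg_scale.
    return [u + cfg_scale * (c - u) for c, u in zip(cond, uncond)]


cond = [1.0, 2.0]    # prediction given the prompt
uncond = [0.0, 1.0]  # prediction given the negative (or empty) prompt

inside = guided_prediction(cond, uncond, cfg_scale=3.5, t=0.3)
outside = guided_prediction(cond, uncond, cfg_scale=3.5, t=0.9,
                            interval=(0.0, 0.5))
```

Under these assumptions, the default interval `[0, 1]` applies guidance at every sampling step, and a scale of 1 reduces to the conditional prediction.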
| Component | Value |
|---|---|
| Parameters | 1.3B |
| Patch size | 16 |
| Hidden size | 1536 |
| Attention heads | 24 |
| Patch-level depth | 14 |
| Pixel-level depth | 2 |
| Pixel hidden size | 16 |
| Pixel attention hidden size | 1152 |
| Text embedding dim | 2304 |
| Text max length | 300 |
| Text encoder | Gemma-2-2B-IT |
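A few derived quantities follow from the table. The back-of-the-envelope calculation below assumes non-overlapping square patches, a 1024×1024 RGB input, and the common convention that the per-head dimension equals hidden size divided by head count. None of these assumptions is stated in the table itself.

```python
PATCH_SIZE = 16     # from the table
HIDDEN_SIZE = 1536  # from the table
NUM_HEADS = 24      # from the table


def patch_tokens(height, width, patch=PATCH_SIZE):
    """Number of patch tokens, assuming non-overlapping square patches."""
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    return (height // patch) * (width // patch)


tokens = patch_tokens(1024, 1024)         # 64 * 64 patches
head_dim = HIDDEN_SIZE // NUM_HEADS       # assumed per-head width
pixels_per_patch = PATCH_SIZE ** 2        # pixels covered by one patch
values_per_patch = pixels_per_patch * 3   # assuming 3 color channels

print(tokens, head_dim, pixels_per_patch, values_per_patch)
```

Under these assumptions, a 1024×1024 image yields 4096 patch tokens, each covering 256 pixels, with a per-head dimension of 64.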
```bibtex
@inproceedings{yu2026pixeldit,
  title={PixelDiT: Pixel Diffusion Transformers for Image Generation},
  author={Yongsheng Yu and Wei Xiong and Weili Nie and Yichen Sheng and Shiqiu Liu and Jiebo Luo},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026},
}
```
This model is released under the NSCLv1 License. The work and any derivative works may only be used for non-commercial (research or evaluation) purposes.