PixelDiT: Pixel Diffusion Transformers for Image Generation

Yongsheng Yu1,2   Wei Xiong1†   Weili Nie1   Yichen Sheng1   Shiqiu Liu1   Jiebo Luo2

1NVIDIA   2University of Rochester
†Project Lead and Main Advising

   

Key Features

  • VAE-free: Diffusion runs directly in pixel space, with no latent autoencoder
  • Dual-level architecture: Patch-level DiT + Pixel-level DiT
  • MM-DiT text-image fusion: Joint attention between text and image tokens
  • Text encoder: Gemma-2-2B-IT
  • Multi-aspect-ratio: Supports various aspect ratios at 1024px
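
As a rough illustration of the multi-aspect-ratio feature, the sketch below picks a height and width for a requested aspect ratio while keeping the pixel count close to 1024 x 1024. It rests on two assumptions that this card does not state: that non-square outputs should keep roughly the same area as a square 1024px image, and that both sides should be multiples of the patch size of 16 listed under Model Architecture. The helper name `snap_resolution` is hypothetical and not part of the PixelDiT code base.

```python
import math


def snap_resolution(aspect_ratio: float,
                    target_pixels: int = 1024 * 1024,
                    multiple: int = 16) -> tuple[int, int]:
    """Return (height, width) with width / height close to aspect_ratio.

    Assumption: the area stays near target_pixels and both sides are
    rounded to a multiple of the patch size (16). The model card does
    not document how aspect-ratio sizes are actually chosen.
    """
    if aspect_ratio <= 0:
        raise ValueError("aspect_ratio must be positive")

    # Solve height * width = target_pixels with width = aspect_ratio * height.
    height = math.sqrt(target_pixels / aspect_ratio)
    width = height * aspect_ratio

    def snap(value: float) -> int:
        # Round to the nearest multiple, never going below one patch.
        return max(multiple, round(value / multiple) * multiple)

    return snap(height), snap(width)


square = snap_resolution(1.0)
widescreen = snap_resolution(16 / 9)
```

The resulting pair could then be passed to `--custom_height` and `--custom_width` in the inference command.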

Usage

Installation

pip install -r requirements.txt

Inference

# See the full inference script at: https://github.com/NVlabs/PixelDiT
cd t2i/
python inference.py \
  --config configs/PixelDiT_1024px_pixel_diffusion_stage3.yaml \
  --model_path PixelDiT-T2I-v1.pth \
  --txt_file prompts.txt \
  --custom_height 1024 --custom_width 1024 \
  --cfg_scale 2.75 --seed 2025 \
  --negative_prompt "low quality, worst quality, over-saturated, blurry, deformed, watermark" \
  --work_dir "."
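
The same invocation can be assembled from Python, which may be convenient when sweeping seeds or guidance scales. This is a minimal sketch that only reuses the flags and file names shown in the command above. It assumes that `--txt_file` expects one prompt per line, which this card does not confirm, and the helper names `write_prompts`, `build_command`, and `run_inference` are hypothetical.

```python
import subprocess
from pathlib import Path

NEGATIVE_PROMPT = (
    "low quality, worst quality, over-saturated, blurry, deformed, watermark"
)


def write_prompts(path, prompts):
    """Write prompts to a text file.

    Assumption: inference.py reads one prompt per line from --txt_file.
    """
    Path(path).write_text("\n".join(prompts) + "\n", encoding="utf-8")


def build_command(txt_file, height=1024, width=1024, cfg_scale=2.75, seed=2025):
    """Build the argument list for the inference command shown above."""
    return [
        "python", "inference.py",
        "--config", "configs/PixelDiT_1024px_pixel_diffusion_stage3.yaml",
        "--model_path", "PixelDiT-T2I-v1.pth",
        "--txt_file", str(txt_file),
        "--custom_height", str(height),
        "--custom_width", str(width),
        "--cfg_scale", str(cfg_scale),
        "--seed", str(seed),
        "--negative_prompt", NEGATIVE_PROMPT,
        "--work_dir", ".",
    ]


def run_inference(prompts, repo_root=".", **overrides):
    """Write prompts into t2i/ and run inference.py from that directory."""
    t2i_dir = Path(repo_root) / "t2i"
    write_prompts(t2i_dir / "prompts.txt", prompts)
    # check=True raises CalledProcessError if the script exits with an error.
    subprocess.run(build_command("prompts.txt", **overrides), cwd=t2i_dir, check=True)
```

Calling `run_inference(["a red fox in the snow"], seed=7)` from the repository root would write `t2i/prompts.txt` and launch the script with the seed overridden, provided the checkpoint and config are in the locations the command above expects.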

Inference Parameters

Parameter            Default          Description
--cfg_scale          3.5              Classifier-free guidance scale
--step               50               Number of sampling steps (25 for fast, 50 for quality)
--seed               0                Random seed
--negative_prompt    ""               Negative prompt for CFG
--interval_guidance  [0, 1]           CFG application interval
--sampling_algo      flow_dpm-solver  Sampling algorithm
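
To make the interaction between `--cfg_scale`, `--negative_prompt`, and `--interval_guidance` concrete, the sketch below applies the standard classifier-free guidance formula, `uncond + scale * (cond - uncond)`, only while the sampling time lies inside the guidance interval. Several points are assumptions rather than documented behavior of `inference.py`: that the interval is expressed in normalized time in [0, 1], that the model falls back to the conditional prediction outside the interval, and that the negative prompt supplies the "unconditional" branch. The function name `guided_prediction` is hypothetical.

```python
def guided_prediction(cond, uncond, cfg_scale=3.5, t=0.5, interval=(0.0, 1.0)):
    """Combine conditional and unconditional predictions with CFG.

    cond     -- model output for the text prompt (sequence of floats)
    uncond   -- model output for the negative/empty prompt (assumed)
    t        -- normalized sampling time in [0, 1] (assumed convention)
    interval -- (start, end) range of t in which guidance is applied
    """
    start, end = interval
    if not (start <= t <= end):
        # Outside the interval: no guidance, use the conditional output as is.
        return list(cond)
    # Standard CFG: move away from the unconditional prediction.
    return [u + cfg_scale * (c - u) for c, u in zip(cond, uncond)]


# With the default interval [0, 1], guidance is active at every step.
always_guided = guided_prediction([2.0], [1.0], cfg_scale=2.75, t=0.3)

# With a narrower interval, late steps skip guidance entirely.
skipped = guided_prediction([2.0], [1.0], cfg_scale=2.75, t=0.9, interval=(0.0, 0.5))
```

Under these assumptions, a scale of 1 reproduces the conditional prediction, and larger scales push the sample further from whatever the negative prompt describes.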

Model Architecture

Component                    Value
Parameters                   1.3B
Patch size                   16
Hidden size                  1536
Attention heads              24
Patch-level depth            14
Pixel-level depth            2
Pixel hidden size            16
Pixel attention hidden size  1152
Text embedding dim           2304
Text max length              300
Text encoder                 Gemma-2-2B-IT
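
A few quantities follow directly from the values above. The sketch below works them out for a 1024 x 1024 image: the size of the patch grid, the number of patch-level tokens, and the number of pixels covered by each patch. The per-head dimension assumes that the hidden size is split evenly across attention heads, as in standard multi-head attention; this card does not state it explicitly. The helper name `patch_grid` is hypothetical.

```python
PATCH_SIZE = 16
HIDDEN_SIZE = 1536
NUM_HEADS = 24


def patch_grid(height, width, patch_size=PATCH_SIZE):
    """Return (rows, cols) of the patch grid for an image of the given size."""
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be multiples of the patch size")
    return height // patch_size, width // patch_size


rows, cols = patch_grid(1024, 1024)          # 64 x 64 grid
num_patch_tokens = rows * cols               # 4096 tokens for the patch-level DiT
pixels_per_patch = PATCH_SIZE * PATCH_SIZE   # 256 pixels inside each patch

# Assumption: heads split the hidden size evenly (standard multi-head attention).
head_dim = HIDDEN_SIZE // NUM_HEADS          # 64

print(rows, cols, num_patch_tokens, pixels_per_patch, head_dim)
```

At 1024px the patch-level transformer therefore attends over 4096 image tokens, plus up to 300 text tokens under the joint attention described in Key Features.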

Citation

@inproceedings{yu2026pixeldit,
      title={PixelDiT: Pixel Diffusion Transformers for Image Generation},
      author={Yongsheng Yu and Wei Xiong and Weili Nie and Yichen Sheng and Shiqiu Liu and Jiebo Luo},
      booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
      year={2026},
}

License

This model is released under the NSCLv1 License. The work and any derivative works may only be used for non-commercial (research or evaluation) purposes.
