DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation
🌐 Project Page | 📄 arXiv | 🤗 Models
Xu Guo*, Fulong Ye*, Qichao Sun*†, Liyang Chen, Bingchuan Li†, Pengze Zhang, Jiawei Liu, Songtao Zhao§, Qian He, Xiangwang Hou§
* Equal contribution, † Project lead, § Corresponding author
Tsinghua University | Intelligent Creation Team, ByteDance
🔥 News
- [03/13/2026] 🔥 Our v1 version code for R2AV is released!
- [02/13/2026] 🔥 Our paper is released!
- [01/05/2026] 🔥 The code for our previous work, DreamID-V, has been released!
🎬 Demo
⚡️ Quickstart
Installation
```shell
conda create -n dreamid_omni python=3.11
conda activate dreamid_omni
pip install torch==2.6.0 torchvision torchaudio
pip install -r requirements.txt
pip install flash_attn --no-build-isolation
# Download the model weights once the environment is ready
python3 download_weights.py
```
Inference
Single-GPU inference
```shell
python3 inference_r2av.py --config-file dreamid_omni/configs/inference/inference_r2av.yaml
```
Multi-GPU inference
```shell
torchrun --nnodes 1 --nproc_per_node 8 inference_r2av.py --config-file dreamid_omni/configs/inference/inference_r2av.yaml
```
Before running multi-GPU inference, please open `dreamid_omni/configs/inference/inference_r2av.yaml` and set `sp_size: 8`.
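For reference, the edit to the config is a one-line change; the comment below is our reading (the value should match the GPU count passed to `--nproc_per_node`):

```yaml
# dreamid_omni/configs/inference/inference_r2av.yaml
sp_size: 8  # parallel degree; keep equal to --nproc_per_node (here, 8 GPUs)
```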
🎨 How to Create
Our prompts use the following special tags to control characters and speech:
- Subject Identity: `<sub1>`, `<sub2>` represent the character IPs provided in your input images (e.g., `<img1>` corresponds to `<sub1>`). Use these tags in your prompt to specify who is acting or speaking.
- Speech: `<S>Your speech content here<E>` wraps text that will be converted to speech using the corresponding character's reference audio.
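To illustrate how the tags combine (the scene and dialogue below are hypothetical, not from the released test cases), a two-character prompt might look like this:

```python
# Hypothetical prompt for a two-character scene: <sub1>/<sub2> refer to the
# characters from input images <img1>/<img2>, and <S>...<E> wraps spoken text.
prompt = (
    "<sub1> stands by a window and greets <sub2>. "
    "<sub1> says <S>Good morning, ready for the demo?<E> "
    "<sub2> nods and replies <S>Absolutely, let's begin.<E>"
)

# Every speech segment must open with <S> and close with <E>.
assert prompt.count("<S>") == prompt.count("<E>") == 2
```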
💡 Structure Example
We provide example prompts to help you get started with DreamID-Omni:
- Single-person generation: `test_case/oneip`
- Multi-person generation: `test_case/twoip`
🙏 Acknowledgements
Our work builds upon and is greatly inspired by several outstanding open-source projects, including Ovi, Wan2.2, MMAudio, Phantom, HuMo, and OpenHumanVid. We sincerely thank the authors and contributors of these projects for generously sharing their excellent code and ideas.
📧 Contact
If you have any comments or questions about this open-source project, please open a new issue or contact Xu Guo, Fulong Ye, or Qichao Sun.
⚠️ Ethics Statement
This project, DreamID-Omni, is intended for academic research and technical demonstration purposes only.
- Prohibited Use: Users are strictly prohibited from using this codebase to generate content that is illegal, defamatory, pornographic, harmful, or infringes upon the privacy and rights of others.
- Responsibility: Users bear full responsibility for the content they generate. The authors and contributors of this project assume no liability for any misuse or consequences arising from the use of this software.
- AI Labeling: We strongly recommend marking generated videos as "AI-Generated" to prevent misinformation. By using this software, you agree to adhere to these guidelines and applicable local laws.
⭐ Citation
If you find our work helpful, please consider citing our paper and leaving us a star:
```bibtex
@misc{guo2026dreamidomni,
  title={DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation},
  author={Xu Guo and Fulong Ye and Qichao Sun and Liyang Chen and Bingchuan Li and Pengze Zhang and Jiawei Liu and Songtao Zhao and Qian He and Xiangwang Hou},
  year={2026},
  eprint={2602.12160},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.12160},
}
```