DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation
🌐 Project Page | 📄 arXiv | 🤗 Models
Xu Guo*, Fulong Ye*, Qichao Sun*†, Liyang Chen, Bingchuan Li†, Pengze Zhang, Jiawei Liu, Songtao Zhao§, Qian He, Xiangwang Hou§
* Equal contribution, † Project lead, § Corresponding author
Tsinghua University | Intelligent Creation Team, ByteDance
🔥 News
- [03/13/2026] 🔥 Our v1 version code for R2AV is released!
- [02/13/2026] 🔥 Our paper is released!
- [01/05/2026] 🔥 The code for our previous work, DreamID-V, has been released!
🎬 Demo
⚡️ Quickstart
Installation
```shell
conda create -n dreamid_omni python=3.11
conda activate dreamid_omni
pip install torch==2.6.0 torchvision torchaudio
pip install -r requirements.txt
pip install flash_attn --no-build-isolation
# Download the model weights once the environment is ready
python3 download_weights.py
```
Inference
Single-GPU inference
```shell
python3 inference_r2av.py --config-file dreamid_omni/configs/inference/inference_r2av.yaml
```
Multi-GPU inference
```shell
torchrun --nnodes 1 --nproc_per_node 8 inference_r2av.py --config-file dreamid_omni/configs/inference/inference_r2av.yaml
```
Before running multi-GPU inference, please open `dreamid_omni/configs/inference/inference_r2av.yaml` and set `sp_size: 8`.
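For reference, the edit to the config is a one-line change; the comment below is our reading (the value should match the GPU count passed to `--nproc_per_node`):

```yaml
# dreamid_omni/configs/inference/inference_r2av.yaml
sp_size: 8  # parallel degree; keep equal to --nproc_per_node (here, 8 GPUs)
```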
🎨 How to Create
Our prompts use the following special tags to control characters and speech:
- Subject Identity: `<sub1>`, `<sub2>` represent the character IPs provided in your input images (e.g., `<img1>` corresponds to `<sub1>`). Use these tags in your prompt to specify who is acting or speaking.
- Speech: `<S>Your speech content here<E>` wraps text that will be converted to speech using the corresponding character's reference audio.
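To illustrate how the tags combine (the scene and dialogue below are hypothetical, not from the released test cases), a two-character prompt might look like this:

```python
# Hypothetical prompt for a two-character scene: <sub1>/<sub2> refer to the
# characters from input images <img1>/<img2>, and <S>...<E> wraps spoken text.
prompt = (
    "<sub1> stands by a window and greets <sub2>. "
    "<sub1> says <S>Good morning, ready for the demo?<E> "
    "<sub2> nods and replies <S>Absolutely, let's begin.<E>"
)

# Every speech segment must open with <S> and close with <E>.
assert prompt.count("<S>") == prompt.count("<E>") == 2
```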
💡 Structure Example
We provide example prompts to help you get started with DreamID-Omni:
- Single-person generation: `test_case/oneip`
- Multi-person generation: `test_case/twoip`
🙏 Acknowledgements
Our work builds upon and is greatly inspired by several outstanding open-source projects, including Ovi, Wan2.2, MMAudio, Phantom, HuMo, and OpenHumanVid. We sincerely thank the authors and contributors of these projects for generously sharing their excellent code and ideas.
📧 Contact
If you have any comments or questions about this open-source project, please open a new issue or contact Xu Guo, Fulong Ye, or Qichao Sun.
⚠️ Ethics Statement
This project, DreamID-Omni, is intended for academic research and technical demonstration purposes only.
- Prohibited Use: Users are strictly prohibited from using this codebase to generate content that is illegal, defamatory, pornographic, harmful, or infringes upon the privacy and rights of others.
- Responsibility: Users bear full responsibility for the content they generate. The authors and contributors of this project assume no liability for any misuse or consequences arising from the use of this software.
- AI Labeling: We strongly recommend marking generated videos as "AI-Generated" to prevent misinformation. By using this software, you agree to adhere to these guidelines and applicable local laws.
⭐ Citation
If you find our work helpful, please consider citing our paper and leaving us a star:
```bibtex
@misc{guo2026dreamidomni,
  title={DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation},
  author={Xu Guo and Fulong Ye and Qichao Sun and Liyang Chen and Bingchuan Li and Pengze Zhang and Jiawei Liu and Songtao Zhao and Qian He and Xiangwang Hou},
  year={2026},
  eprint={2602.12160},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.12160},
}
```