---
license: apache-2.0
library_name: transformers
tags:
- multimodal
- video-understanding
- audio-understanding
- streaming
- real-time
- omni-modal
pipeline_tag: video-text-to-text
---

# ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding

Figure: ROMA processes streaming inputs as aligned multimodal units, using a 'Speak Head' to decide when to respond.

## Model Summary

**ROMA** is a Real-time Omni-Multimodal Assistant designed for unified streaming audio-video understanding. Unlike traditional video LLMs, which answer only after a query is issued, ROMA integrates both **Reactive** (Question Answering) and **Proactive** (Event-Driven Alert, Real-Time Narration) capabilities within a single framework. ROMA introduces a "Speak Head" mechanism that decouples response timing from content generation, allowing the model to autonomously decide *when* to speak based on the continuous audio-visual stream.

- **Paper:** [ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding](https://arxiv.org/abs/2601.10323)
- **Project Page:** [Link](https://eureka-maggie.github.io/ROMA_show/)
- **Repository:** [GitHub (Coming Soon)](https://github.com/Eureka-Maggie/ROMA)

## Citation

If you find this project useful, please cite:

```bibtex
@article{tian2026roma,
  title={ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding},
  author={Tian, Xueyun and Li, Wei and Xu, Bingbing and Dong, Heng and Wang, Yuanzhuo and Shen, Huawei},
  journal={arXiv preprint arXiv:2601.10323},
  year={2026}
}
```
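The summary above describes the Speak Head as a gate that decides *when* to respond, independently of the content head that decides *what* to say. A minimal, self-contained sketch of that decoupling is shown below. This is an illustration only, not the released implementation: the scalar `fused_feature`, the logistic gate, and all function names are assumptions standing in for the model's actual per-step multimodal hidden states.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def speak_head(fused_feature: float, w: float = 2.0, b: float = -1.0,
               threshold: float = 0.5) -> bool:
    """Binary gate deciding *when* to respond.

    `fused_feature` is a hypothetical scalar standing in for an aligned
    audio-visual unit's representation; weights and threshold are illustrative.
    """
    return sigmoid(w * fused_feature + b) >= threshold

def run_stream(features):
    """Poll the gate at every timestep; generate content only when it fires."""
    responses = []
    for t, f in enumerate(features):
        if speak_head(f):
            # Placeholder for the content head (the LLM's actual generation).
            responses.append((t, f"response triggered at t={t}"))
    return responses

print(run_stream([0.0, 1.0, 0.1]))  # gate fires only at t=1
```

The point of the split is that the gate can be polled cheaply on every incoming audio-visual unit, while expensive content generation runs only on the timesteps where the gate fires.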