---
license: mit
library_name: torchgeo
language:
- en
base_model:
- chetwinlow1/Ovi
tags:
- joint-audio-video-generation
---
<h1 align="center">Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation</h1>
<div align="center">
[arXiv](https://arxiv.org/abs/2604.23632)
[Hugging Face](https://huggingface.co/fudan-generative-ai/Hallo-Live)
[GitHub](https://github.com/fudan-generative-vision/Hallo-Live)
</div>
## 📖 Introduction
We present *Hallo-Live*, a real-time text-driven joint audio-video avatar generation framework. The method adopts a causal dual-stream DiT model to generate synchronized avatar video and speech in a streaming manner. *Hallo-Live* reaches **20.38 FPS** with **0.94 s latency** on two NVIDIA H200 GPUs, while preserving strong lip-sync accuracy, visual fidelity, and speech quality.
## 🏗️ Framework
<p align="center">
<img src="assets/framework.png" width=100%>
</p>
The framework of *Hallo-Live*. **Top left**: Stage I training adapts a pretrained dual-stream DiT to the streaming setting using a cross-modal, future-expanding block-causal mask. **Bottom left**: Stage II training performs autoregressive self-rollout with the audio-video KV cache and optimizes the generated trajectory with reward-weighted dual-stream DMD. **Right**: Each causal fusion block in the dual-stream DiT performs cross-modal attention between the video and audio streams; the block-causal masks are used during Stage I ODE initialization, and a KV cache is maintained for Stage II self-rollout and streaming inference.
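
To make the idea of a cross-modal block-causal mask more concrete, here is a minimal sketch in plain Python. It is not the Hallo-Live implementation: the function name `block_causal_mask`, the block sizes, and the `lookahead` parameter are all hypothetical, and `lookahead` is only one possible reading of "future-expanding" that the text above does not confirm. The sketch assumes each stream is split into temporal blocks with a fixed number of tokens per block, and that a query token may attend to key tokens from its own block and from earlier blocks.

```python
def block_causal_mask(q_len, k_len, q_block, k_block, lookahead=0):
    """Build a q_len x k_len boolean attention mask (True = may attend).

    Query token i belongs to block i // q_block and key token j to block
    j // k_block. A query sees keys in its own block, in all earlier
    blocks, and in up to `lookahead` future blocks.

    NOTE: `lookahead` is a hypothetical knob for illustration only; it is
    an assumption, not the masking rule used by Hallo-Live.
    """
    if q_block <= 0 or k_block <= 0:
        raise ValueError("block sizes must be positive")
    if lookahead < 0:
        raise ValueError("lookahead must be non-negative")
    mask = []
    for i in range(q_len):
        query_block = i // q_block
        row = [(j // k_block) <= query_block + lookahead for j in range(k_len)]
        mask.append(row)
    return mask


def render(mask):
    """Render the mask as text: '#' = attend, '.' = masked out."""
    return "\n".join("".join("#" if v else "." for v in row) for row in mask)


# Toy cross-modal case: video queries attend to audio keys.
# Assumed sizes: 3 temporal blocks, 2 video tokens and 3 audio tokens per block.
video_to_audio = block_causal_mask(q_len=6, k_len=9, q_block=2, k_block=3)
print(render(video_to_audio))
```

Because the two streams can carry different numbers of tokens per block, the mask is rectangular and is indexed by block rather than by token position. At streaming inference time a mask like this becomes unnecessary once keys and values from past blocks are held in a KV cache, since future blocks simply do not exist yet.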