---
license: mit
language:
- en
base_model:
- chetwinlow1/Ovi
tags:
- joint-audio-video-generation
---

<h1 align="center">Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation</h1>
|
|
<div align="center">

[Paper (arXiv)](https://arxiv.org/abs/2604.23632)
[🤗 Model](https://huggingface.co/fudan-generative-ai/Hallo-Live)
[Code (GitHub)](https://github.com/fudan-generative-vision/Hallo-Live)

</div>
|
|
## 📖 Introduction
|
|
We present *Hallo-Live*, a real-time text-driven joint audio-video avatar generation framework. The method adopts a causal dual-stream DiT model to generate synchronized avatar video and speech in a streaming manner. *Hallo-Live* reaches **20.38 FPS** with **0.94 s** latency on two NVIDIA H200 GPUs, while preserving strong lip-sync accuracy, visual fidelity, and speech quality.
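
The repository's actual inference interface is not documented in this card. Purely as an illustration of how chunk-wise streaming generation could be consumed, the sketch below assumes a hypothetical `HalloLivePipeline` class with a `stream()` generator; the module name, class, and all arguments are placeholders, not the released API.

```python
# Illustrative sketch only: `hallo_live`, `HalloLivePipeline`, and the
# `stream()` generator are hypothetical placeholders, NOT the released
# API of this repository.
import torch
from hallo_live import HalloLivePipeline  # hypothetical module

pipe = HalloLivePipeline.from_pretrained(
    "fudan-generative-ai/Hallo-Live",
    torch_dtype=torch.bfloat16,
).to("cuda")

frames, waveform = [], []
# Chunk-wise generation: each step yields a few video frames together
# with the audio samples synchronized to them, so playback can start
# before the full clip is finished.
for video_chunk, audio_chunk in pipe.stream(
    prompt="A presenter greets the audience and introduces the topic.",
    chunk_frames=4,  # hypothetical: frames emitted per streaming step
):
    frames.append(video_chunk)    # e.g. (chunk_frames, H, W, 3) uint8
    waveform.append(audio_chunk)  # 1-D tensor of aligned audio samples
```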
|
|
## 🏗️ Framework
|
|
<p align="center">
  <img src="assets/framework.png" width="100%">
</p>
|
|
The framework of *Hallo-Live*. **Top left**: Stage I training adapts a pretrained dual-stream DiT to the streaming setting using a cross-modal future-expanding block-causal mask. **Bottom left**: Stage II training performs autoregressive self-rollout with the audio-video KV cache and optimizes the generated trajectory with reward-weighted dual-stream DMD. **Right**: Each causal fusion block in the dual-stream DiT consists of cross-modal attention between the video and audio streams, where block-causal masks are used during Stage I ODE initialization and a KV cache is maintained for Stage II self-rollout and streaming inference.
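
The paper defines the exact masking scheme; as a rough illustration of the idea, the sketch below builds a boolean attention mask over temporally aligned audio-video blocks in which every token attends to its own block, all past blocks, and a fixed number of future blocks (the "future-expanding" relaxation of a strictly block-causal mask). The interleaved token layout, block sizes, and single-block future window are assumptions for illustration, not the paper's configuration.

```python
import torch

def block_causal_mask(n_blocks: int, v_tokens: int, a_tokens: int,
                      future: int = 1) -> torch.Tensor:
    """Boolean attention mask over concatenated video+audio tokens.

    Tokens are grouped into `n_blocks` temporal blocks; each block is
    assumed to hold `v_tokens` video tokens followed by `a_tokens`
    audio tokens. A token may attend to every token in its own block,
    all earlier blocks, and up to `future` blocks ahead.
    """
    tokens_per_block = v_tokens + a_tokens
    total = n_blocks * tokens_per_block
    # Temporal block index of every token position.
    block_id = torch.arange(total) // tokens_per_block
    # allowed[q, k] is True when the key's block is at most `future`
    # blocks ahead of the query's block.
    allowed = block_id.unsqueeze(1) + future >= block_id.unsqueeze(0)
    return allowed  # (total, total); True = attention permitted

# Example: 3 blocks of 4 video + 2 audio tokens; can be passed as
# `attn_mask` to torch.nn.functional.scaled_dot_product_attention.
mask = block_causal_mask(n_blocks=3, v_tokens=4, a_tokens=2)
```

With `future=0` this degenerates to a strictly block-causal mask, which is what the KV cache in Stage II self-rollout and streaming inference relies on: past blocks never attend to later ones, so their keys and values can be cached and reused unchanged.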
|
|