
Improve model card: add paper link, HF author profiles and tags

#9
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +55 -68
README.md CHANGED
@@ -1,11 +1,17 @@
  ---
- license: apache-2.0
- language:
- - en
  base_model:
  - Wan-AI/Wan2.2-S2V-14B
  pipeline_tag: image-to-video
  ---
  <div align="center">

  <p align="center">
@@ -13,26 +19,24 @@ pipeline_tag: image-to-video
  </p>

  <h1>🎬 Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length</h1>
- <!-- <h3>The code will be open source in <strong><span style="color: #87CEEB;">early December</span></strong>.</h3> -->
-
  <p>
- <a href="https://github.com/Yubo-Shankui" style="color: inherit;">Yubo Huang</a><sup>1,2</sup> ·
- <a href="#" style="color: inherit;">Hailong Guo</a><sup>1,3</sup> ·
- <a href="#" style="color: inherit;">Fangtai Wu</a><sup>1,4</sup> ·
- <a href="#" style="color: inherit;">Shifeng Zhang</a><sup>1</sup> ·
- <a href="#" style="color: inherit;">Shijie Huang</a><sup>1</sup> ·
  <a href="#" style="color: inherit;">Qijun Gan</a><sup>4</sup> ·
- <a href="#" style="color: inherit;">Lin Liu</a><sup>2</sup> ·
- <a href="#" style="color: inherit;">Sirui Zhao</a><sup>2,*</sup> ·
- <a href="http://staff.ustc.edu.cn/~cheneh/" style="color: inherit;">Enhong Chen</a><sup>2,*</sup> ·
- <a href="https://openreview.net/profile?id=%7EJiaming_Liu7" style="color: inherit;">Jiaming Liu</a><sup>1,‑</sup> ·
- <a href="https://sites.google.com/view/stevenhoi/" style="color: inherit;">Steven Hoi</a><sup>1</sup>
  </p>

  <p style="font-size: 0.9em;">
- <sup>1</sup> Alibaba Group &nbsp;&nbsp;
- <sup>2</sup> University of Science and Technology of China &nbsp;&nbsp;
  <sup>3</sup> Beijing University of Posts and Telecommunications &nbsp;&nbsp;
  <sup>4</sup> Zhejiang University
  </p>
@@ -46,14 +50,16 @@ pipeline_tag: image-to-video

  </div>

- > **TL;DR:** **Live Avatar** is an algorithm–system co-designed framework that enables real-time, streaming, infinite-length interactive avatar video generation. Powered by a **14B-parameter** diffusion model, it achieves **20 FPS** on **5×H800** GPUs with **4-step** sampling and supports **Block-wise Autoregressive** processing for **10,000+** second streaming videos.

  <div align="center">

  [![Watch the video](assets/demo.png)](https://www.youtube.com/watch?v=srbsGlLNpAc)

  <strong>👀 More Demos:</strong> <br>
- :robot: Human-AI Conversation &nbsp;|&nbsp; ♾️ Infinite Video &nbsp;|&nbsp; 🎭 Diverse Characters &nbsp;|&nbsp; 🎬 Animated Tech Explanation <br>
  <a href="https://liveavatar.github.io/">
  <strong>👉 Click Here to Visit Project Page! 🌐</strong>
  </a>
@@ -64,16 +70,18 @@ pipeline_tag: image-to-video
  ---
  ## ✨ Highlights

- > - ⚡ **Real-time Streaming Interaction** - Achieve **20** FPS real-time streaming with low latency
  > - ♾️ **Infinite-length Autoregressive Generation** - Support **10,000+** second continuous video generation
  > - 🎨 **Generalization Performances** - Strong generalization across cartoon characters, singing, and diverse scenarios

  ---
  ## 📰 News
- - **[2025.12.08]** 🚀 We released the real-time inference [Code](infinite_inference_multi_gpu.sh) and the model [Weights](https://huggingface.co/Quark-Vision/Live-Avatar).
  - **[2025.12.08]** 🎉 LiveAvatar won the Hugging Face [#1 Paper of the day](https://huggingface.co/papers/date/2025-12-05)!
- - **[2025.12.04]** 🏃‍♂️ We committed to open-sourcing the code in **early December**.
  - **[2025.12.04]** 🔥 We released the [Paper](https://arxiv.org/abs/2512.04677) and the [demo page](https://liveavatar.github.io/).

  ---
@@ -92,12 +100,15 @@ pipeline_tag: image-to-video

  ### ⚙️ **Later updates**

  - ⬜ UI integration for easy streaming interaction
- - ⬜ Inference code supporting single GPU (offline generation)
- - ⬜ Multi-character support
- - ⬜ Training code
  - ⬜ TTS integration
- - ⬜ LiveAvatar v1.1

  ## 🛠️ Installation
 
@@ -118,6 +129,11 @@ conda install -c nvidia/label/cuda-12.4.1 cudatoolkit -y
  ### 3. Install PyTorch & Flash Attention
  ```bash
  pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128
  pip install flash-attn==2.8.3 --no-build-isolation
  ```
 
@@ -134,53 +150,38 @@ apt-get update && apt-get install -y ffmpeg

  ## 📥 Download Models

- Please download the pretrained checkpoints from the links below and place them in the `./ckpt/` directory.

  | Model Component | Description | Link |
  | :--- | :--- | :---: |
  | `WanS2V-14B` | base model | 🤗 [Hugging Face](https://huggingface.co/Wan-AI/Wan2.2-S2V-14B) |
  | `liveAvatar` | our LoRA model | 🤗 [Hugging Face](https://huggingface.co/Quark-Vision/Live-Avatar) |
  ```bash
- # If you are in mainland China, run this first: export HF_ENDPOINT=https://hf-mirror.com
  pip install "huggingface_hub[cli]"
  huggingface-cli download Wan-AI/Wan2.2-S2V-14B --local-dir ./ckpt/Wan2.2-S2V-14B
  huggingface-cli download Quark-Vision/Live-Avatar --local-dir ./ckpt/LiveAvatar
  ```

- After downloading, your directory structure should look like this:
-
- ```
- ckpt/
- ├── Wan2.2-S2V-14B/        # Base model
- │   ├── config.json
- │   ├── diffusion_pytorch_model-*.safetensors
- │   └── ...
- └── LiveAvatar/            # Our LoRA model
-     ├── liveavatar.safetensors
-     └── ...
- ```
-
  ## 🚀 Inference
  ### Real-time Inference with TPP
- > 💡 Currently, this command can run on GPUs with at least 80GB of VRAM.
  ```bash
  # CLI Inference
  bash infinite_inference_multi_gpu.sh
  # Gradio Web UI
  bash gradio_multi_gpu.sh
  ```
- > 💡 The model generates video from audio input combined with a reference image and an optional text prompt.
-
- > 💡 The `size` parameter sets the area of the generated video; the aspect ratio follows that of the input image.
-
- > 💡 The `--num_clip` parameter controls the number of video clips generated, which is useful for quick previews with shorter generation time.
-
- > 💡 Currently, our TPP pipeline requires **five** GPUs for inference. We are planning to develop a 3-step version that can be deployed on a 4-GPU cluster. Furthermore, we are planning to integrate the [LightX2V](https://github.com/ModelTC/LightX2V) VAE component, which will eliminate the dependency on additional single-GPU VAE parallelism and support 4-step inference within a 4-GPU setup.
-
- Please visit our [project page](https://liveavatar.github.io/) to see more examples and learn about the scenarios suitable for this model.
  ## 📝 Citation

  If you find this project useful for your research, please consider citing our paper:
@@ -196,21 +197,7 @@ If you find this project useful for your research, please consider citing our pa
  url={https://arxiv.org/abs/2512.04677},
  }
  ```
- ## ⭐ Star History
-
- [![Star History Chart](https://api.star-history.com/svg?repos=Alibaba-Quark/LiveAvatar&type=date&legend=top-left)](https://www.star-history.com/#Alibaba-Quark/LiveAvatar&type=date&legend=top-left)

  ## 📜 License Agreement
- * The majority of this project is released under the Apache 2.0 license, as found in the [LICENSE](LICENSE).
- * The Wan model (our base model) is also released under the Apache 2.0 license, as found in the [LICENSE](https://github.com/Wan-Video/Wan2.2/blob/main/LICENSE.txt).
- * This project is a research preview. Please contact us if you find any potential violations. (jmliu1217@gmail.com)
-
- ## 🙏 Acknowledgements
-
- We would like to express our gratitude to the following projects:
-
- * [CausVid](https://github.com/tianweiy/CausVid)
- * [LongLive](https://github.com/NVlabs/LongLive)
- * [WanS2V](https://humanaigc.github.io/wan-s2v-webpage/)
 
  ---
  base_model:
  - Wan-AI/Wan2.2-S2V-14B
+ language:
+ - en
+ license: apache-2.0
  pipeline_tag: image-to-video
+ tags:
+ - lora
+ - talking-head
+ - audio-driven
+ - avatar-generation
  ---
+
  <div align="center">

  <p align="center">
  </p>

  <h1>🎬 Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length</h1>

  <p>
+ <a href="https://huggingface.co/Yubo-Huang" style="color: inherit;">Yubo Huang</a><sup>1,2</sup> ·
+ <a href="#" style="color: inherit;">Hailong Guo</a><sup>2,3</sup> ·
+ <a href="#" style="color: inherit;">Fangtai Wu</a><sup>2,4</sup> ·
+ <a href="#" style="color: inherit;">Shifeng Zhang</a><sup>2</sup> ·
+ <a href="#" style="color: inherit;">Shijie Huang</a><sup>2</sup> ·
  <a href="#" style="color: inherit;">Qijun Gan</a><sup>4</sup> ·
+ <a href="#" style="color: inherit;">Lin Liu</a><sup>1</sup> ·
+ <a href="#" style="color: inherit;">Sirui Zhao</a><sup>1,*</sup> ·
+ <a href="https://huggingface.co/Hongni" style="color: inherit;">Enhong Chen</a><sup>1,*</sup> ·
+ <a href="https://huggingface.co/jamesliu1217" style="color: inherit;">Jiaming Liu</a><sup>2,‑</sup> ·
+ <a href="https://huggingface.co/stevenhoi" style="color: inherit;">Steven Hoi</a><sup>2</sup>
  </p>

  <p style="font-size: 0.9em;">
+ <sup>1</sup> University of Science and Technology of China &nbsp;&nbsp;
+ <sup>2</sup> Alibaba Group &nbsp;&nbsp;
  <sup>3</sup> Beijing University of Posts and Telecommunications &nbsp;&nbsp;
  <sup>4</sup> Zhejiang University
  </p>
 

  </div>

+ This repository contains the weights for the paper [Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length](https://huggingface.co/papers/2512.04677).
+
+ > **TL;DR:** **Live Avatar** is an algorithm–system co-designed framework that enables real-time, streaming, infinite-length interactive avatar video generation. Powered by a **14B-parameter** diffusion model, it achieves **45 FPS** on multiple **H800** GPUs with **4-step** sampling and supports **Block-wise Autoregressive** processing for **10,000+** second streaming videos.

  <div align="center">

  [![Watch the video](assets/demo.png)](https://www.youtube.com/watch?v=srbsGlLNpAc)

  <strong>👀 More Demos:</strong> <br>
+ 🤖 Human-AI Conversation &nbsp;|&nbsp; ♾️ Infinite Video &nbsp;|&nbsp; 🎭 Diverse Characters &nbsp;|&nbsp; 🎬 Animated Tech Explanation <br>
  <a href="https://liveavatar.github.io/">
  <strong>👉 Click Here to Visit Project Page! 🌐</strong>
  </a>
 
  ---
  ## ✨ Highlights

+ > - ⚡ **Real-time Streaming Interaction** - Achieve **45** FPS real-time streaming with low latency
  > - ♾️ **Infinite-length Autoregressive Generation** - Support **10,000+** second continuous video generation
  > - 🎨 **Generalization Performances** - Strong generalization across cartoon characters, singing, and diverse scenarios

  ---
  ## 📰 News
+ - **[2026.1.20]** 🚀 Major performance breakthrough (**v1.1**)! **FP8 quantization** enables inference on **48GB GPUs**, while advanced **compilation** and **cuDNN attention** boost speed to **~2.5x** peak and **3x** average FPS, achieving a stable **45+ FPS** on multiple H800s.
+ - **[2025.12.16]** 🎉 LiveAvatar has reached **1,000+** stars on GitHub!
+ - **[2025.12.12]** 🚀 We released the **single-GPU** inference [Code](https://github.com/Alibaba-Quark/LiveAvatar/blob/main/infinite_inference_single_gpu.sh) — a single 80GB-VRAM GPU is enough.
+ - **[2025.12.08]** 🚀 We released the real-time inference [Code](https://github.com/Alibaba-Quark/LiveAvatar/blob/main/infinite_inference_multi_gpu.sh) and the model [Weights](https://huggingface.co/Quark-Vision/Live-Avatar).
  - **[2025.12.08]** 🎉 LiveAvatar won the Hugging Face [#1 Paper of the day](https://huggingface.co/papers/date/2025-12-05)!
  - **[2025.12.04]** 🔥 We released the [Paper](https://arxiv.org/abs/2512.04677) and the [demo page](https://liveavatar.github.io/).

  ---
 

  ### ⚙️ **Later updates**

+ - ✅ Inference code supporting single GPU (offline generation)
+ - ✅ Multi-character support
+ - ✅ Inference Acceleration Stage 1 (RoPE optimization, compilation, LoRA merge)
+ - ✅ Streaming-VAE integration
+ - ✅ Inference Acceleration Stage 2 (further compilation, FP8, cuDNN attention)
  - ⬜ UI integration for easy streaming interaction
  - ⬜ TTS integration
+ - ⬜ Training code
+ - ⬜ LiveAvatar v1.2

  ## 🛠️ Installation
 
 
  ### 3. Install PyTorch & Flash Attention
  ```bash
  pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128
+
+ # For H800/H200 setups:
+ pip install flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch280 --extra-index-url https://download.pytorch.org/whl/cu128
+
+ # Otherwise:
  pip install flash-attn==2.8.3 --no-build-isolation
  ```
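Which of the two wheels above applies depends on the GPU architecture. A small illustrative sketch (not part of the LiveAvatar codebase) of choosing at runtime, assuming the FlashAttention-3 wheels target Hopper-class cards such as H800/H200, which report CUDA compute capability major version 9:

```python
# Sketch: map a CUDA compute capability to a Flash Attention package.
# Assumption: Hopper (H800/H200) reports major version 9 and can use the
# FlashAttention-3 wheel; other GPUs fall back to flash-attn 2.x.

def pick_flash_attn(capability: tuple) -> str:
    """Map a (major, minor) compute capability to a pip requirement."""
    major, _minor = capability
    if major >= 9:  # Hopper or newer
        return "flash_attn_3"
    return "flash-attn==2.8.3"

# On a real machine, the capability would come from:
#   torch.cuda.get_device_capability(0)
print(pick_flash_attn((9, 0)))  # -> flash_attn_3
```

The same check can of course be done by hand with `nvidia-smi`; the helper only makes the decision rule explicit.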
 
 

  ## 📥 Download Models

+ Please download the pretrained checkpoints and place them in the `./ckpt/` directory.

  | Model Component | Description | Link |
  | :--- | :--- | :---: |
  | `WanS2V-14B` | base model | 🤗 [Hugging Face](https://huggingface.co/Wan-AI/Wan2.2-S2V-14B) |
  | `liveAvatar` | our LoRA model | 🤗 [Hugging Face](https://huggingface.co/Quark-Vision/Live-Avatar) |
+
  ```bash
  pip install "huggingface_hub[cli]"
  huggingface-cli download Wan-AI/Wan2.2-S2V-14B --local-dir ./ckpt/Wan2.2-S2V-14B
  huggingface-cli download Quark-Vision/Live-Avatar --local-dir ./ckpt/LiveAvatar
  ```
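The same downloads are also available through the `huggingface_hub` Python API via `snapshot_download`. A minimal sketch mirroring the CLI commands above; the `do_download` flag is my addition so the repo-to-directory plan can be inspected without network access:

```python
# Sketch: mirror the huggingface-cli downloads from Python.
# snapshot_download(repo_id=..., local_dir=...) fetches a full repo
# snapshot; it is only imported/called when do_download=True.

CHECKPOINTS = {
    "Wan-AI/Wan2.2-S2V-14B": "./ckpt/Wan2.2-S2V-14B",   # base model
    "Quark-Vision/Live-Avatar": "./ckpt/LiveAvatar",    # LoRA weights
}

def download_plan(do_download: bool = False) -> dict:
    """Return the repo -> local_dir mapping; optionally fetch snapshots."""
    if do_download:
        from huggingface_hub import snapshot_download
        for repo_id, local_dir in CHECKPOINTS.items():
            snapshot_download(repo_id=repo_id, local_dir=local_dir)
    return CHECKPOINTS

print(download_plan())
```

Calling `download_plan(do_download=True)` performs the actual fetch and requires `huggingface_hub` plus network access (and `HF_ENDPOINT` set to a mirror where needed).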
  ## 🚀 Inference
  ### Real-time Inference with TPP
+ > 💡 Requires a multi-GPU setup with at least 80GB of VRAM per GPU.
  ```bash
  # CLI Inference
  bash infinite_inference_multi_gpu.sh
  # Gradio Web UI
  bash gradio_multi_gpu.sh
  ```

+ ### Single-GPU Inference
+ > 💡 Can run on a single GPU with at least 80GB of VRAM.
+ ```bash
+ # CLI Inference
+ bash infinite_inference_single_gpu.sh
+ # Gradio Web UI
+ bash gradio_single_gpu.sh
+ ```
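As a back-of-the-envelope check on the real-time figures quoted above, the per-frame and per-step time budgets follow directly from the target FPS and the number of sampling steps. This is a hypothetical helper for intuition only; the actual TPP pipeline overlaps steps across GPUs rather than running them serially:

```python
# Sketch: time budgets implied by a target frame rate.
# Assumption: a serial view of the 4 sampling steps, which the real
# pipelined (TPP) schedule relaxes by running steps on different GPUs.

def frame_budget_ms(fps: float) -> float:
    """Wall-clock milliseconds available per generated frame."""
    return 1000.0 / fps

def per_step_budget_ms(fps: float, num_steps: int) -> float:
    """Budget per denoising step if the steps ran back to back."""
    return frame_budget_ms(fps) / num_steps

# 45 FPS -> ~22.2 ms per frame; with 4-step sampling, ~5.6 ms per step.
print(round(frame_budget_ms(45.0), 1), round(per_step_budget_ms(45.0, 4), 1))
```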
  ## 📝 Citation

  If you find this project useful for your research, please consider citing our paper:
  url={https://arxiv.org/abs/2512.04677},
  }
  ```

  ## 📜 License Agreement
+ * The majority of this project is released under the Apache 2.0 license.
+ * The Wan model (base model) is also released under the Apache 2.0 license.