
Improve model card: add paper link, HF author profiles and tags

#9
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +55 -68
README.md CHANGED
@@ -1,11 +1,17 @@
  ---
- license: apache-2.0
- language:
- - en
  base_model:
  - Wan-AI/Wan2.2-S2V-14B
  pipeline_tag: image-to-video
  ---
  <div align="center">

  <p align="center">
@@ -13,26 +19,24 @@ pipeline_tag: image-to-video
  </p>

  <h1>🎬 Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length</h1>
- <!-- <h3>The code will be open source in <strong><span style="color: #87CEEB;">early December</span></strong>.</h3> -->
-
  <p>
- <a href="https://github.com/Yubo-Shankui" style="color: inherit;">Yubo Huang</a><sup>1,2</sup> ·
- <a href="#" style="color: inherit;">Hailong Guo</a><sup>1,3</sup> ·
- <a href="#" style="color: inherit;">Fangtai Wu</a><sup>1,4</sup> ·
- <a href="#" style="color: inherit;">Shifeng Zhang</a><sup>1</sup> ·
- <a href="#" style="color: inherit;">Shijie Huang</a><sup>1</sup> ·
  <a href="#" style="color: inherit;">Qijun Gan</a><sup>4</sup> ·
- <a href="#" style="color: inherit;">Lin Liu</a><sup>2</sup> ·
- <a href="#" style="color: inherit;">Sirui Zhao</a><sup>2,*</sup> ·
- <a href="http://staff.ustc.edu.cn/~cheneh/" style="color: inherit;">Enhong Chen</a><sup>2,*</sup> ·
- <a href="https://openreview.net/profile?id=%7EJiaming_Liu7" style="color: inherit;">Jiaming Liu</a><sup>1,‑</sup> ·
- <a href="https://sites.google.com/view/stevenhoi/" style="color: inherit;">Steven Hoi</a><sup>1</sup>
  </p>

  <p style="font-size: 0.9em;">
- <sup>1</sup> Alibaba Group &nbsp;&nbsp;
- <sup>2</sup> University of Science and Technology of China &nbsp;&nbsp;
  <sup>3</sup> Beijing University of Posts and Telecommunications &nbsp;&nbsp;
  <sup>4</sup> Zhejiang University
  </p>
@@ -46,14 +50,16 @@ pipeline_tag: image-to-video

  </div>

- > **TL;DR:** **Live Avatar** is an algorithm–system co-designed framework that enables real-time, streaming, infinite-length interactive avatar video generation. Powered by a **14B-parameter** diffusion model, it achieves **20 FPS** on **5×H800** GPUs with **4-step** sampling and supports **Block-wise Autoregressive** processing for **10,000+** second streaming videos.

  <div align="center">

  [![Watch the video](assets/demo.png)](https://www.youtube.com/watch?v=srbsGlLNpAc)

  <strong>👀 More Demos:</strong> <br>
- :robot: Human-AI Conversation &nbsp;|&nbsp; ♾️ Infinite Video &nbsp;|&nbsp; 🎭 Diverse Characters &nbsp;|&nbsp; 🎬 Animated Tech Explanation <br>
  <a href="https://liveavatar.github.io/">
  <strong>👉 Click Here to Visit Project Page! 🌐</strong>
  </a>
@@ -64,16 +70,18 @@ pipeline_tag: image-to-video
  ---
  ## ✨ Highlights

- > - ⚡ **Real-time Streaming Interaction** - Achieve **20** FPS real-time streaming with low latency
  > - ♾️ **Infinite-length Autoregressive Generation** - Support **10,000+** second continuous video generation
  > - 🎨 **Generalization Performances** - Strong generalization across cartoon characters, singing, and diverse scenarios

  ---
  ## 📰 News
- - **[2025.12.08]** 🚀 We released the real-time inference [Code](infinite_inference_multi_gpu.sh) and the model [Weights](https://huggingface.co/Quark-Vision/Live-Avatar).
  - **[2025.12.08]** 🎉 LiveAvatar won the Hugging Face [#1 Paper of the day](https://huggingface.co/papers/date/2025-12-05)!
- - **[2025.12.04]** 🏃‍♂️ We committed to open-sourcing the code in **early December**.
  - **[2025.12.04]** 🔥 We released the [Paper](https://arxiv.org/abs/2512.04677) and the [demo page](https://liveavatar.github.io/).

  ---
@@ -92,12 +100,15 @@ pipeline_tag: image-to-video

  ### ⚙️ **Later updates**

  - ⬜ UI integration for easy streaming interaction
- - ⬜ Inference code supporting single GPU (offline generation)
- - ⬜ Multi-character support
- - ⬜ Training code
  - ⬜ TTS integration
- - ⬜ LiveAvatar v1.1

  ## 🛠️ Installation
 
@@ -118,6 +129,11 @@ conda install -c nvidia/label/cuda-12.4.1 cudatoolkit -y
  ### 3. Install PyTorch & Flash Attention
  ```bash
  pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128
  pip install flash-attn==2.8.3 --no-build-isolation
  ```
 
@@ -134,53 +150,38 @@ apt-get update && apt-get install -y ffmpeg

  ## 📥 Download Models

- Please download the pretrained checkpoints from the links below and place them in the `./ckpt/` directory.

  | Model Component | Description | Link |
  | :--- | :--- | :---: |
  | `WanS2V-14B` | base model | 🤗 [Hugging Face](https://huggingface.co/Wan-AI/Wan2.2-S2V-14B) |
  | `liveAvatar` | our LoRA model | 🤗 [Hugging Face](https://huggingface.co/Quark-Vision/Live-Avatar) |
  ```bash
- # If you are in mainland China, run this first: export HF_ENDPOINT=https://hf-mirror.com
  pip install "huggingface_hub[cli]"
  huggingface-cli download Wan-AI/Wan2.2-S2V-14B --local-dir ./ckpt/Wan2.2-S2V-14B
  huggingface-cli download Quark-Vision/Live-Avatar --local-dir ./ckpt/LiveAvatar
  ```

- After downloading, your directory structure should look like this:
-
- ```
- ckpt/
- ├── Wan2.2-S2V-14B/        # Base model
- │   ├── config.json
- │   ├── diffusion_pytorch_model-*.safetensors
- │   └── ...
- └── LiveAvatar/            # Our LoRA model
-     ├── liveavatar.safetensors
-     └── ...
- ```
-
  ## 🚀 Inference
  ### Real-time Inference with TPP
- > 💡 Currently, this command can run on GPUs with at least 80GB of VRAM.
  ```bash
  # CLI Inference
  bash infinite_inference_multi_gpu.sh
  # Gradio Web UI
  bash gradio_multi_gpu.sh
  ```
- > 💡 The model generates video from audio input combined with a reference image and an optional text prompt.
-
- > 💡 The `size` parameter sets the area of the generated video; the aspect ratio follows that of the input image.
-
- > 💡 The `--num_clip` parameter controls the number of video clips generated, which is useful for quick previews with shorter generation time.
-
- > 💡 Currently, our TPP pipeline requires **five** GPUs for inference. We are planning to develop a 3-step version that can be deployed on a 4-GPU cluster. Furthermore, we are planning to integrate the [LightX2V](https://github.com/ModelTC/LightX2V) VAE component, which will eliminate the dependency on additional single-GPU VAE parallelism and support 4-step inference within a 4-GPU setup.
-
- Please visit our [project page](https://liveavatar.github.io/) to see more examples and learn about the scenarios suitable for this model.
  ## 📝 Citation

  If you find this project useful for your research, please consider citing our paper:
@@ -196,21 +197,7 @@ If you find this project useful for your research, please consider citing our pa
  url={https://arxiv.org/abs/2512.04677},
  }
  ```
- ## ⭐ Star History
-
- [![Star History Chart](https://api.star-history.com/svg?repos=Alibaba-Quark/LiveAvatar&type=date&legend=top-left)](https://www.star-history.com/#Alibaba-Quark/LiveAvatar&type=date&legend=top-left)

  ## 📜 License Agreement
- * The majority of this project is released under the Apache 2.0 license, as found in the [LICENSE](LICENSE).
- * The Wan model (our base model) is also released under the Apache 2.0 license, as found in the [LICENSE](https://github.com/Wan-Video/Wan2.2/blob/main/LICENSE.txt).
- * This project is a research preview. Please contact us if you find any potential violations. (jmliu1217@gmail.com)
-
- ## 🙏 Acknowledgements
-
- We would like to express our gratitude to the following projects:
-
- * [CausVid](https://github.com/tianweiy/CausVid)
- * [LongLive](https://github.com/NVlabs/LongLive)
- * [WanS2V](https://humanaigc.github.io/wan-s2v-webpage/)
 
  ---
  base_model:
  - Wan-AI/Wan2.2-S2V-14B
+ language:
+ - en
+ license: apache-2.0
  pipeline_tag: image-to-video
+ tags:
+ - lora
+ - talking-head
+ - audio-driven
+ - avatar-generation
  ---
+
  <div align="center">

  <p align="center">
  </p>

  <h1>🎬 Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length</h1>

  <p>
+ <a href="https://huggingface.co/Yubo-Huang" style="color: inherit;">Yubo Huang</a><sup>1,2</sup> ·
+ <a href="#" style="color: inherit;">Hailong Guo</a><sup>2,3</sup> ·
+ <a href="#" style="color: inherit;">Fangtai Wu</a><sup>2,4</sup> ·
+ <a href="#" style="color: inherit;">Shifeng Zhang</a><sup>2</sup> ·
+ <a href="#" style="color: inherit;">Shijie Huang</a><sup>2</sup> ·
  <a href="#" style="color: inherit;">Qijun Gan</a><sup>4</sup> ·
+ <a href="#" style="color: inherit;">Lin Liu</a><sup>1</sup> ·
+ <a href="#" style="color: inherit;">Sirui Zhao</a><sup>1,*</sup> ·
+ <a href="https://huggingface.co/Hongni" style="color: inherit;">Enhong Chen</a><sup>1,*</sup> ·
+ <a href="https://huggingface.co/jamesliu1217" style="color: inherit;">Jiaming Liu</a><sup>2,‑</sup> ·
+ <a href="https://huggingface.co/stevenhoi" style="color: inherit;">Steven Hoi</a><sup>2</sup>
  </p>

  <p style="font-size: 0.9em;">
+ <sup>1</sup> University of Science and Technology of China &nbsp;&nbsp;
+ <sup>2</sup> Alibaba Group &nbsp;&nbsp;
  <sup>3</sup> Beijing University of Posts and Telecommunications &nbsp;&nbsp;
  <sup>4</sup> Zhejiang University
  </p>
 

  </div>

+ This repository contains the weights for the paper [Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length](https://huggingface.co/papers/2512.04677).
+
+ > **TL;DR:** **Live Avatar** is an algorithm–system co-designed framework that enables real-time, streaming, infinite-length interactive avatar video generation. Powered by a **14B-parameter** diffusion model, it achieves **45 FPS** on multiple **H800** GPUs with **4-step** sampling and supports **Block-wise Autoregressive** processing for **10,000+** second streaming videos.

  <div align="center">

  [![Watch the video](assets/demo.png)](https://www.youtube.com/watch?v=srbsGlLNpAc)

  <strong>👀 More Demos:</strong> <br>
+ 🤖 Human-AI Conversation &nbsp;|&nbsp; ♾️ Infinite Video &nbsp;|&nbsp; 🎭 Diverse Characters &nbsp;|&nbsp; 🎬 Animated Tech Explanation <br>
  <a href="https://liveavatar.github.io/">
  <strong>👉 Click Here to Visit Project Page! 🌐</strong>
  </a>
 
  ---
  ## ✨ Highlights

+ > - ⚡ **Real-time Streaming Interaction** - Achieve **45** FPS real-time streaming with low latency
  > - ♾️ **Infinite-length Autoregressive Generation** - Support **10,000+** second continuous video generation
  > - 🎨 **Generalization Performances** - Strong generalization across cartoon characters, singing, and diverse scenarios

  ---
  ## 📰 News
+ - **[2026.1.20]** 🚀 Major performance breakthrough (**v1.1**)! **FP8 quantization** enables inference on **48GB GPUs**, while advanced **compilation** and **cuDNN attention** boost speed to **~2.5x** peak and **3x** average FPS, achieving a stable **45+ FPS** on multiple H800s.
+ - **[2025.12.16]** 🎉 LiveAvatar has reached **1,000+** stars on GitHub!
+ - **[2025.12.12]** 🚀 We released the **single-GPU** inference [Code](https://github.com/Alibaba-Quark/LiveAvatar/blob/main/infinite_inference_single_gpu.sh) — a single 80GB-VRAM GPU is enough.
+ - **[2025.12.08]** 🚀 We released the real-time inference [Code](https://github.com/Alibaba-Quark/LiveAvatar/blob/main/infinite_inference_multi_gpu.sh) and the model [Weights](https://huggingface.co/Quark-Vision/Live-Avatar).
  - **[2025.12.08]** 🎉 LiveAvatar won the Hugging Face [#1 Paper of the day](https://huggingface.co/papers/date/2025-12-05)!
  - **[2025.12.04]** 🔥 We released the [Paper](https://arxiv.org/abs/2512.04677) and the [demo page](https://liveavatar.github.io/).

  ---
 

  ### ⚙️ **Later updates**

+ - ✅ Inference code supporting single GPU (offline generation)
+ - ✅ Multi-character support
+ - ✅ Inference Acceleration Stage 1 (RoPE optimization, compilation, LoRA merge)
+ - ✅ Streaming-VAE integration
+ - ✅ Inference Acceleration Stage 2 (further compilation, FP8, cuDNN attention)
  - ⬜ UI integration for easy streaming interaction
  - ⬜ TTS integration
+ - ⬜ Training code
+ - ⬜ LiveAvatar v1.2

  ## 🛠️ Installation
 
 
  ### 3. Install PyTorch & Flash Attention
  ```bash
  pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128
+
+ # For H800/H200 setups:
+ pip install flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch280 --extra-index-url https://download.pytorch.org/whl/cu128
+
+ # Otherwise:
  pip install flash-attn==2.8.3 --no-build-isolation
  ```
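Which of the two wheels above applies depends on the GPU architecture. A small illustrative sketch (not part of the LiveAvatar codebase) of choosing at runtime, assuming the FlashAttention-3 wheels target Hopper-class cards such as H800/H200, which report CUDA compute capability major version 9:

```python
# Sketch: map a CUDA compute capability to a Flash Attention package.
# Assumption: Hopper (H800/H200) reports major version 9 and can use the
# FlashAttention-3 wheel; other GPUs fall back to flash-attn 2.x.

def pick_flash_attn(capability: tuple) -> str:
    """Map a (major, minor) compute capability to a pip requirement."""
    major, _minor = capability
    if major >= 9:  # Hopper or newer
        return "flash_attn_3"
    return "flash-attn==2.8.3"

# On a real machine, the capability would come from:
#   torch.cuda.get_device_capability(0)
print(pick_flash_attn((9, 0)))  # -> flash_attn_3
```

The same check can of course be done by hand with `nvidia-smi`; the helper only makes the decision rule explicit.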
 
 

  ## 📥 Download Models

+ Please download the pretrained checkpoints and place them in the `./ckpt/` directory.

  | Model Component | Description | Link |
  | :--- | :--- | :---: |
  | `WanS2V-14B` | base model | 🤗 [Hugging Face](https://huggingface.co/Wan-AI/Wan2.2-S2V-14B) |
  | `liveAvatar` | our LoRA model | 🤗 [Hugging Face](https://huggingface.co/Quark-Vision/Live-Avatar) |
+
  ```bash
  pip install "huggingface_hub[cli]"
  huggingface-cli download Wan-AI/Wan2.2-S2V-14B --local-dir ./ckpt/Wan2.2-S2V-14B
  huggingface-cli download Quark-Vision/Live-Avatar --local-dir ./ckpt/LiveAvatar
  ```
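The same downloads are also available through the `huggingface_hub` Python API via `snapshot_download`. A minimal sketch mirroring the CLI commands above; the `do_download` flag is my addition so the repo-to-directory plan can be inspected without network access:

```python
# Sketch: mirror the huggingface-cli downloads from Python.
# snapshot_download(repo_id=..., local_dir=...) fetches a full repo
# snapshot; it is only imported/called when do_download=True.

CHECKPOINTS = {
    "Wan-AI/Wan2.2-S2V-14B": "./ckpt/Wan2.2-S2V-14B",   # base model
    "Quark-Vision/Live-Avatar": "./ckpt/LiveAvatar",    # LoRA weights
}

def download_plan(do_download: bool = False) -> dict:
    """Return the repo -> local_dir mapping; optionally fetch snapshots."""
    if do_download:
        from huggingface_hub import snapshot_download
        for repo_id, local_dir in CHECKPOINTS.items():
            snapshot_download(repo_id=repo_id, local_dir=local_dir)
    return CHECKPOINTS

print(download_plan())
```

Calling `download_plan(do_download=True)` performs the actual fetch and requires `huggingface_hub` plus network access (and `HF_ENDPOINT` set to a mirror where needed).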
  ## 🚀 Inference
  ### Real-time Inference with TPP
+ > 💡 Requires a multi-GPU setup with at least 80GB of VRAM per GPU.
  ```bash
  # CLI Inference
  bash infinite_inference_multi_gpu.sh
  # Gradio Web UI
  bash gradio_multi_gpu.sh
  ```

+ ### Single-GPU Inference
+ > 💡 Can run on a single GPU with at least 80GB of VRAM.
+ ```bash
+ # CLI Inference
+ bash infinite_inference_single_gpu.sh
+ # Gradio Web UI
+ bash gradio_single_gpu.sh
+ ```
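As a back-of-the-envelope check on the real-time figures quoted above, the per-frame and per-step time budgets follow directly from the target FPS and the number of sampling steps. This is a hypothetical helper for intuition only; the actual TPP pipeline overlaps steps across GPUs rather than running them serially:

```python
# Sketch: time budgets implied by a target frame rate.
# Assumption: a serial view of the 4 sampling steps, which the real
# pipelined (TPP) schedule relaxes by running steps on different GPUs.

def frame_budget_ms(fps: float) -> float:
    """Wall-clock milliseconds available per generated frame."""
    return 1000.0 / fps

def per_step_budget_ms(fps: float, num_steps: int) -> float:
    """Budget per denoising step if the steps ran back to back."""
    return frame_budget_ms(fps) / num_steps

# 45 FPS -> ~22.2 ms per frame; with 4-step sampling, ~5.6 ms per step.
print(round(frame_budget_ms(45.0), 1), round(per_step_budget_ms(45.0, 4), 1))
```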
  ## 📝 Citation

  If you find this project useful for your research, please consider citing our paper:
  url={https://arxiv.org/abs/2512.04677},
  }
  ```

  ## 📜 License Agreement
+ * The majority of this project is released under the Apache 2.0 license.
+ * The Wan model (base model) is also released under the Apache 2.0 license.