---
title: daVinci-MagiHuman
emoji: 🎬
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.23.0
app_port: 7860
---
# daVinci-MagiHuman
### Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Models
SII-GAIR & Sand.ai
[Paper](https://arxiv.org/abs/2603.21986)
[Demo](https://huggingface.co/spaces/SII-GAIR/daVinci-MagiHuman)
[Model](https://huggingface.co/GAIR-NLP/daVinci-MagiHuman)
[License: Apache-2.0](https://opensource.org/licenses/Apache-2.0)
[Python](https://www.python.org/)
[PyTorch](https://pytorch.org/)
## Highlights
- **Single-Stream Transformer** — A unified 15B-parameter, 40-layer Transformer that jointly processes text, video, and audio via self-attention only. No cross-attention, no multi-stream complexity.
- **Exceptional Human-Centric Quality** — Expressive facial performance, natural speech-expression coordination, realistic body motion, and accurate audio-video synchronization.
- **Multilingual** — Supports Chinese (Mandarin & Cantonese), English, Japanese, Korean, German, and French.
- **Blazing Fast Inference** — Generates a 5-second 256p video in **2 seconds** and a 5-second 1080p video in **38 seconds** on a single H100 GPU.
- **State-of-the-Art Results** — Achieves **80.0%** win rate vs Ovi 1.1 and **60.9%** vs LTX 2.3 in pairwise human evaluation over 2,000 comparisons.
- **Fully Open Source** — We release the complete model stack: base model, distilled model, super-resolution model, and inference code.
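The single-stream design above can be illustrated with a toy sketch: tokens from all modalities are concatenated into one sequence and processed by plain self-attention, so text, video, and audio tokens interact directly without dedicated cross-attention streams. This is a minimal dependency-free illustration of the idea, not the released 15B model; all dimensions, token values, and function names below are assumptions for the example.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(tokens, dim):
    """Single-head self-attention over one joint token sequence.

    Every token attends to every other token, which is how a
    single-stream transformer lets modalities interact without
    any cross-attention modules. (Toy version: queries, keys,
    and values are the tokens themselves, no learned projections.)
    """
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in tokens:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * k[d] for w, k in zip(weights, tokens))
                    for d in range(dim)])
    return out

# Hypothetical 4-dim embeddings for each modality.
text_tokens  = [[1.0, 0.0, 0.0, 0.0]]
video_tokens = [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
audio_tokens = [[0.0, 0.0, 0.0, 1.0]]

# Single stream: concatenate all modalities, then run joint self-attention.
stream = text_tokens + video_tokens + audio_tokens
mixed = self_attention(stream, dim=4)
print(len(mixed))  # one output token per input token
```

The key property is that the attention matrix spans the whole concatenated sequence, so audio-video synchronization emerges from the same mechanism that handles within-modality context, with no extra fusion machinery.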
## Demo
https://github.com/user-attachments/assets/PLACEHOLDER_VIDEO_1
https://github.com/user-attachments/assets/PLACEHOLDER_VIDEO_2
https://github.com/user-attachments/assets/PLACEHOLDER_VIDEO_3
## Architecture