Model Card updated
README.md
CHANGED
|
@@ -12,8 +12,9 @@ tags:
|
|
| 12 |
language:
|
| 13 |
- en
|
| 14 |
base_model:
|
| 15 |
-
-
|
| 16 |
---
|
|
|
|
| 17 |
<div align="center" style="padding: 20px; border-radius: 10px;">
|
| 18 |
<div style="display: flex; align-items: center; justify-content: center; gap: 20px;">
|
| 19 |
<img src="assets/Neodragon_title.jpg" alt="neodragon logo"/>
|
|
@@ -21,28 +22,26 @@ base_model:
|
|
| 21 |
|
| 22 |
|
| 23 |
<!-- Animated banner (WebP with fallback) -->
|
| 24 |
-
<
|
| 25 |
-
<
|
| 26 |
-
|
| 27 |
-
<!-- Optional fallback (PNG/JPEG still image or a lightweight GIF) -->
|
| 28 |
-
<img
|
| 29 |
-
src="assets/showcase_video_banner_fallback.png"
|
| 30 |
-
alt="Neodragon showcase banner"
|
| 31 |
-
style="max-width: 100%; height: auto; display: block; margin: 0 auto;"
|
| 32 |
-
loading="lazy"
|
| 33 |
-
/>
|
| 34 |
-
</picture>
|
| 35 |
-
|
| 36 |
<h1> Neodragon: Mobile Video Generation Using Diffusion Transformer </h1>
|
| 37 |
-
|
|
|
|
| 38 |
<a href="https://qualcomm-ai-research.github.io/neodragon">
|
| 39 |
<img src="https://img.shields.io/badge/Project-Page-Green" alt="Project Page">
|
| 40 |
</a>
|
| 41 |
<a href="https://arxiv.org/abs/2511.06055">
|
| 42 |
<img src="https://img.shields.io/badge/arXiv-2511.06055-b31b1b.svg" alt="arXiv">
|
| 43 |
</a>
|
| 44 |
-
<a href="https://huggingface.co/
|
| 45 |
-
<img src=
|
| 46 |
</a>
|
| 47 |
|
| 48 |
**[Qualcomm AI Research](https://www.qualcomm.com/research/artificial-intelligence)**
|
|
@@ -64,71 +63,104 @@ base_model:
|
|
| 64 |
</div>
|
| 65 |
|
| 66 |
```bibtex
|
| 67 |
-
@article{karnewar2025neodragon,
|
| 68 |
-
author = {Animesh Karnewar and Denis Korzhenkov and Ioannis Lelekas and Noor Fathima and Adil Karjauv and Hanwen Xiong, Vancheeswaran Vaidyanathan and Will Zeng and Rafael Esteves and Tushar Singhal and Fatih Porikli and Mohsen Ghafoorian and Amirhossein Habibian},
|
| 69 |
-
title = {Neodragon: Mobile Video Generation using Diffusion Transformer},
|
| 70 |
-
journal = {arXiv preprint arXiv:2511.06055},
|
| 71 |
-
year = {2025},
|
| 72 |
-
}
|
| 73 |
-
```
|
| 74 |
-
|
| 75 |
-
|
| 76 |
|
| 77 |
-
|
| 78 |
-
|
| 79 |
-
|
| 80 |
-
|
| 81 |
-
|
| 82 |
-
|
| 83 |
-
|
| 84 |
-
|
| 85 |
-
(2) **Constant-Memory KV Cache for Block Linear Attention**: Implements a block-wise autoregressive approach that uses the cumulative properties of linear attention to maintain global context at a fixed memory cost, eliminating the traditional KV cache bottleneck and enabling efficient, minute-long video synthesis.
|
| 86 |
|
| 87 |
-
|
| 88 |
-
SANA-Video is deployable on RTX 5090 GPUs, accelerating the inference speed for a 5-second 720p video from 71s down to 29s (2.4× speedup), setting a new standard for low-cost, high-quality video generation.
|
| 89 |
|
| 90 |
-
|
| 91 |
|
| 92 |
-
# 🐱 How to Inference
|
| 93 |
|
| 94 |
-
|
| 95 |
-
|
| 96 |
-
refer to: https://huggingface.co/Efficient-Large-Model/SANA-Video_2B_720p_diffusers
|
| 97 |
|
| 98 |
### Model Description
|
| 99 |
|
| 100 |
-
- **Developed by:**
|
| 101 |
-
- **Model type:**
|
| 102 |
-
- **Model size:**
|
| 103 |
- **Model precision:** torch.bfloat16 (BF16)
|
| 104 |
-
- **Model resolution:** This model is developed to generate
|
| 105 |
-
- **Model Description:** This is a model that can be used to generate
|
| 106 |
-
It is a
|
| 107 |
-
- **Resources for more information:** Check out our [GitHub Repository](https://github.com/
|
| 108 |
-
|
| 109 |
-
### Model Sources
|
| 110 |
|
| 111 |
-
For research purposes, we recommend our `generative-models` Github repository (https://github.com/NVlabs/Sana), which is more suitable for both training and inference
|
| 112 |
-
- **Repository:** https://github.com/NVlabs/Sana
|
| 113 |
-
- **Guidance:** https://github.com/NVlabs/Sana/asset/docs/sana_video.md
|
| 114 |
## License/Terms of Use
|
| 115 |
-
|
| 116 |
## Uses
|
| 117 |
-
|
| 118 |
-
|
| 119 |
-
- Generation of artworks and use in design and other artistic processes.
|
| 120 |
- Applications in educational or creative tools.
|
| 121 |
- Research on generative models.
|
| 122 |
- Safe deployment of models which have the potential to generate harmful content.
|
| 123 |
- Probing and understanding the limitations and biases of generative models.
|
| 124 |
Excluded uses are described below.
|
| 125 |
-
### Out-of-Scope Use
|
| 126 |
-
The model was not trained to be factual or true representations of people or events, and therefore using the model to generate such content is out-of-scope for the abilities of this model.
|
| 127 |
## Limitations and Bias
|
| 128 |
### Limitations
|
| 129 |
- The model does not achieve perfect photorealism
|
| 130 |
- The model cannot render complex legible text
|
| 131 |
-
-
|
| 132 |
-
- The autoencoding part of the model is lossy.
|
| 133 |
### Bias
|
| 134 |
-
While the capabilities of video generation
|
|
|
|
| 12 |
language:
|
| 13 |
- en
|
| 14 |
base_model:
|
| 15 |
+
- qualcomm/Neodragon
|
| 16 |
---
|
| 17 |
+
<script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
|
| 18 |
<div align="center" style="padding: 20px; border-radius: 10px;">
|
| 19 |
<div style="display: flex; align-items: center; justify-content: center; gap: 20px;">
|
| 20 |
<img src="assets/Neodragon_title.jpg" alt="neodragon logo"/>
|
|
|
|
| 22 |
|
| 23 |
|
| 24 |
<!-- Animated banner (WebP with fallback) -->
|
| 25 |
+
<p align="center">
|
| 26 |
+
<img src="assets/showcase_video_banner.webp" alt="Neodragon showcase banner">
|
| 27 |
+
</p>
|
| 28 |
<h1> Neodragon: Mobile Video Generation Using Diffusion Transformer </h1>
|
| 29 |
+
|
| 30 |
+
<!-- Badges -->
|
| 31 |
<a href="https://qualcomm-ai-research.github.io/neodragon">
|
| 32 |
<img src="https://img.shields.io/badge/Project-Page-Green" alt="Project Page">
|
| 33 |
</a>
|
| 34 |
<a href="https://arxiv.org/abs/2511.06055">
|
| 35 |
<img src="https://img.shields.io/badge/arXiv-2511.06055-b31b1b.svg" alt="arXiv">
|
| 36 |
</a>
|
| 37 |
+
<a href="https://huggingface.co/qualcomm/Neodragon">
|
| 38 |
+
<img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue" alt="Hugging Face Model">
|
| 39 |
+
</a>
|
| 40 |
+
<a href="https://openreview.net/forum?id=XBzIhhwv8d">
|
| 41 |
+
<img src="https://img.shields.io/badge/ICLR%202026-OpenReview-8A2BE2" alt="ICLR 2026 OpenReview">
|
| 42 |
+
</a>
|
| 43 |
+
<a href="https://github.com/qualcomm-ai-research/neodragon">
|
| 44 |
+
<img src="https://img.shields.io/badge/GitHub-Code-181717?logo=github&logoColor=white" alt="GitHub Code">
|
| 45 |
</a>
|
| 46 |
|
| 47 |
**[Qualcomm AI Research](https://www.qualcomm.com/research/artificial-intelligence)**
|
|
|
|
| 63 |
</div>
|
| 64 |
|
| 65 |
```bibtex
|
| 66 |
|
| 67 |
+
@article{karnewar2025neodragon,
|
| 68 |
+
author = {Animesh Karnewar and Denis Korzhenkov and Ioannis Lelekas and Noor Fathima and Adil Karjauv and Hanwen Xiong and Vancheeswaran Vaidyanathan and Will Zeng and Rafael Esteves and Tushar Singhal and Fatih Porikli and Mohsen Ghafoorian and Amirhossein Habibian},
|
| 69 |
+
title = {Neodragon: Mobile Video Generation using Diffusion Transformer},
|
| 70 |
+
journal = {arXiv preprint arXiv:2511.06055},
|
| 71 |
+
year = {2025},
|
| 72 |
+
note = {Published in the Proceedings of ICLR 2026. OpenReview: \url{https://openreview.net/forum?id=XBzIhhwv8d}; arXiv technical-report: \url{https://arxiv.org/abs/2511.06055}}
|
| 73 |
+
}
|
|
|
|
|
|
|
| 74 |
|
| 75 |
+
```
|
|
|
|
| 76 |
|
| 77 |
+
<section class="section hero is-light">
|
| 78 |
+
<div class="container is-max-widescreen">
|
| 79 |
+
<div class="columns is-centered has-text-centered">
|
| 80 |
+
<div class="column is-11">
|
| 81 |
+
<div class="content has-text-justified">
|
| 82 |
+
<p>
|
| 83 |
+
We introduce Neodragon, a text-to-video system capable of generating 2s (49 frames @24 fps) videos
|
| 84 |
+
at a resolution of <code>[640×1024]</code> directly on a <strong>Qualcomm Hexagon NPU</strong> in a
|
| 85 |
+
record <strong>~6.7s</strong> (7 FPS). Unlike existing transformer-based offline text-to-video
|
| 86 |
+
generation models, <strong>Neodragon</strong> is the first to have been specifically optimized for mobile
|
| 87 |
+
hardware to achieve efficient, low-cost, and high-fidelity video synthesis.
|
| 88 |
+
</p>
|
| 89 |
+
<ul>
|
| 90 |
+
<li>
|
| 91 |
+
<strong>Replacing the original large 4.762B <em>T5</em><sub>XXL</sub> Text-Encoder</strong>
|
| 92 |
+
with a much smaller 0.2B <em>DT5</em> (DistilT5) with minimal quality loss, enabling the entire model
|
| 93 |
+
to run without CPU offloading. This is enabled through a novel Text-Encoder Distillation
|
| 94 |
+
procedure which uses only generative text-prompt data and <em>does not</em> require any image or video data.
|
| 95 |
+
</li>
|
| 96 |
+
<li>
|
| 97 |
+
<strong>Proposing an Asymmetric Decoder Distillation approach</strong> which allows us to replace the native
|
| 98 |
+
codec-latent-VAE decoder with a more efficient one, without disturbing the generative latent-space of the
|
| 99 |
+
video generation pipeline.
|
| 100 |
+
</li>
|
| 101 |
+
<li>
|
| 102 |
+
<strong>Pruning of MMDiT blocks</strong> within the denoiser backbone based on their relative importance,
|
| 103 |
+
with recovery of original performance through a two-stage distillation process.
|
| 104 |
+
</li>
|
| 105 |
+
<li>
|
| 106 |
+
<strong>Reducing the NFE (Number of Function Evaluations) requirement</strong> of the denoiser by performing
|
| 107 |
+
step distillation using a technique adapted from DMD for <em>pyramidal</em> flow-matching, thereby significantly
|
| 108 |
+
accelerating video generation.
|
| 109 |
+
</li>
|
| 110 |
+
</ul>
|
| 111 |
+
<p>
|
| 112 |
+
When paired with an optimized SSD1B first-frame image generator and QuickSRNet for 2×
|
| 113 |
+
super-resolution, our end-to-end <strong>Neodragon</strong> system is highly efficient in parameters
|
| 114 |
+
(<strong>4.945B</strong> full model), memory (<strong>3.5GB</strong> peak RAM usage), and
|
| 115 |
+
runtime (<strong>6.7s</strong> E2E latency), making it a mobile-friendly model that achieves a <em>VBench</em>
|
| 116 |
+
total score of <strong>81.61</strong>, yielding high-fidelity generated videos.
|
| 117 |
+
</p>
|
| 118 |
+
<p>
|
| 119 |
+
By enabling low-cost, private, and on-device text-to-video synthesis, <strong>Neodragon</strong> democratizes
|
| 120 |
+
AI-based video content creation, empowering creators to generate high-quality videos without reliance on cloud services.
|
| 121 |
+
</p>
|
| 122 |
+
<p>
|
| 123 |
+
Inference code is available at:
|
| 124 |
+
<a href="https://github.com/qualcomm-ai-research/neodragon">
|
| 125 |
+
https://github.com/qualcomm-ai-research/neodragon
|
| 126 |
+
</a>
|
| 127 |
+
</p>
|
| 128 |
+
</div>
|
| 129 |
+
</div>
|
| 130 |
+
</div>
|
| 131 |
+
</div>
|
| 132 |
+
</section>
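
The headline figures in the summary above can be cross-checked with a few lines of arithmetic. This is only a sketch that reuses the numbers quoted in the card (49 frames, 24 fps playback, 6.7s end-to-end latency, 4.762B and 0.2B text-encoder sizes); the variable names are illustrative and are not taken from the Neodragon code base.

```python
# Figures quoted in the model card summary; names are illustrative only.
NUM_FRAMES = 49          # frames per generated clip
PLAYBACK_FPS = 24        # playback frame rate
E2E_LATENCY_S = 6.7      # reported end-to-end latency on the NPU
T5_XXL_PARAMS_B = 4.762  # original text encoder size, in billions
DT5_PARAMS_B = 0.2       # distilled text encoder size, in billions

# Duration of the generated clip when played back.
clip_duration_s = NUM_FRAMES / PLAYBACK_FPS

# Frames produced per second of wall-clock generation time.
generation_fps = NUM_FRAMES / E2E_LATENCY_S

# How many seconds of compute are spent per second of video.
slowdown_vs_realtime = E2E_LATENCY_S / clip_duration_s

# Size ratio between the original and the distilled text encoder.
text_encoder_shrink = T5_XXL_PARAMS_B / DT5_PARAMS_B

print(f"clip duration:        {clip_duration_s:.2f} s")
print(f"generation speed:     {generation_fps:.2f} frames/s")
print(f"compute per video s:  {slowdown_vs_realtime:.2f} s")
print(f"text encoder shrink:  {text_encoder_shrink:.1f}x")
```

The generation speed works out to roughly 7.3 frames per second, which matches the "7 FPS" figure quoted next to the latency.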
|
| 133 |
|
|
|
|
| 134 |
|
| 135 |
+
# How to Run Inference
|
| 136 |
+
Please refer to: https://github.com/qualcomm-ai-research/neodragon
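
Before following the repository instructions, the checkpoint files may need to be fetched locally. A minimal sketch, assuming the weights are hosted under the `qualcomm/Neodragon` repo id linked in the badge above and that `huggingface_hub` is installed; the helper name `download_checkpoint` and the default directory are hypothetical, and the actual inference entry point is documented only in the GitHub repository.

```python
# Repo id taken from the Hugging Face badge link in this card (assumption:
# this is where the weights are hosted; access may require accepting the license).
REPO_ID = "qualcomm/Neodragon"


def download_checkpoint(local_dir: str = "neodragon_weights") -> str:
    """Download all files of the model repo and return the local folder path.

    Hypothetical helper, not part of the Neodragon code base.
    """
    # Imported lazily so the module can be loaded without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


# Example (requires network access):
# path = download_checkpoint()
# print("checkpoint files stored in", path)
```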
|
|
|
|
| 137 |
|
| 138 |
### Model Description
|
| 139 |
|
| 140 |
+
- **Developed by:** Qualcomm AI Research, Generative Vision group, Amsterdam, Netherlands
|
| 141 |
+
- **Model type:** Mobile Video Generation with efficient pyramidal Diffusion Transformer
|
| 142 |
+
- **Model size:** 4.945B parameters (full package)
|
| 143 |
- **Model precision:** torch.bfloat16 (BF16)
|
| 144 |
+
- **Model resolution:** This model is developed to generate 49-frame (2s @ 24 fps) videos at [320 x 512] resolution directly on a Snapdragon-powered mobile phone.
|
| 145 |
+
- **Model Description:** This is a model that can be used to generate videos based on the provided text prompts.
|
| 146 |
+
It is a Diffusion Transformer that uses our finetuned TinyAEHV Auto-Encoder with 8x8x8 spatio-temporally compressed latent features ([TinyAEHV](https://github.com/madebyollin/taehv)).
|
| 147 |
+
- **Resources for more information:** Check out our [GitHub Repository](https://github.com/qualcomm-ai-research/Neodragon), the [technical report on arXiv](https://arxiv.org/abs/2511.06055), and the [ICLR 2026 OpenReview page](https://openreview.net/forum?id=XBzIhhwv8d).
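
To make the compression factors above concrete, the sketch below estimates the latent grid for one clip. It assumes 8x compression along time, height, and width, and it assumes the common causal convention in which the first frame is encoded on its own; that convention is not confirmed by this card. The function name `latent_shape` and its parameters are hypothetical, and the channel count is omitted because the card does not state it.

```python
def latent_shape(frames, height, width,
                 temporal_factor=8, spatial_factor=8,
                 causal_first_frame=True):
    """Return (latent_frames, latent_height, latent_width) for a video clip.

    Hypothetical helper for illustration; not part of the Neodragon code base.
    """
    if height % spatial_factor or width % spatial_factor:
        raise ValueError("height and width must be divisible by spatial_factor")

    if causal_first_frame:
        # Assumed convention: first frame encoded alone, the rest in groups.
        latent_frames = 1 + (frames - 1) // temporal_factor
    else:
        # Plain ceiling division over all frames.
        latent_frames = -(-frames // temporal_factor)

    return latent_frames, height // spatial_factor, width // spatial_factor


# 49 frames at [320 x 512], as stated under "Model resolution".
print(latent_shape(49, 320, 512))  # (7, 40, 64)
```

Under either convention a 49-frame clip maps to 7 latent frames, so the denoiser operates on a 7 x 40 x 64 grid rather than on 49 x 320 x 512 pixels.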
|
|
|
|
|
|
|
| 148 |
|
| 149 |
## License/Terms of Use
|
| 150 |
+
This model is released under the terms and conditions laid out in the [Qualcomm AI Hub Proprietary License](https://qaihub-public-assets.s3.us-west-2.amazonaws.com/qai-hub-models/Qualcomm+AI+Hub+Proprietary+License.pdf).
|
| 151 |
## Uses
|
| 152 |
+
The model is intended for research purposes. Possible research areas and tasks include:
|
| 153 |
+
- Research on efficient Transformer-based or non-Transformer-based backbone architectures for video generation.
|
| 154 |
+
- Generation of image- and video-based artworks and use in design and other artistic processes.
|
| 155 |
- Applications in educational or creative tools.
|
| 156 |
- Research on generative models.
|
| 157 |
- Safe deployment of models which have the potential to generate harmful content.
|
| 158 |
- Probing and understanding the limitations and biases of generative models.
|
| 159 |
Excluded uses are described below.
|
|
|
|
|
|
|
| 160 |
## Limitations and Bias
|
| 161 |
### Limitations
|
| 162 |
- The model does not achieve perfect photorealism
|
| 163 |
- The model cannot render complex legible text
|
| 164 |
+
- The model cannot produce videos with physically accurate motion
|
|
|
|
| 165 |
### Bias
|
| 166 |
+
While the capabilities of the presented mobile video generation model are impressive, it can also reinforce or exacerbate social biases inherited from our base model, [Pyramidal-Flow](https://arxiv.org/abs/2410.05954).
|