Text-to-Video
Diffusers
Safetensors
English
efficient
mobile video generation
dit
pyramidal diffusion
Commit 1a8ff90 by karnewar · verified · 1 Parent(s): 32e1763

README.md [WIP]

Files changed (1): README.md +131 -2
README.md CHANGED
@@ -1,5 +1,134 @@
1
  ---
2
  license: other
3
  license_name: qualcomm-ai-hub-proprietary-license
4
- license_link: https://qaihub-public-assets.s3.us-west-2.amazonaws.com/qai-hub-models/Qualcomm+AI+Hub+Proprietary+License.pdf
5
- ---
 
1
  ---
2
  license: other
3
  license_name: qualcomm-ai-hub-proprietary-license
4
+ license_link: >-
5
+ https://qaihub-public-assets.s3.us-west-2.amazonaws.com/qai-hub-models/Qualcomm+AI+Hub+Proprietary+License.pdf
6
+ pipeline_tag: text-to-video
7
+ tags:
8
+ - efficient
9
+ - mobile video generation
10
+ - dit
11
+ - pyramidal diffusion
12
+ language:
13
+ - en
14
+ base_model:
15
+ - karnewar/Neodragon
16
+ ---
17
+ <div align="center" style="padding: 20px; border-radius: 10px;">
18
+ <div style="display: flex; align-items: center; justify-content: center; gap: 20px;">
19
+ <img src="assets/Neodragon_title.jpg" alt="neodragon logo"/>
20
+ </div>
21
+
22
+
23
+ <!-- Animated banner (WebP with fallback) -->
24
+ <picture>
25
+ <!-- WebP first -->
26
+ <source srcset="assets/showcase_video_banner.webp" type="image/webp">
27
+ <!-- Optional fallback (PNG/JPEG still image or a lightweight GIF) -->
28
+ <img
29
+ src="assets/showcase_video_banner_fallback.png"
30
+ alt="Neodragon showcase banner"
31
+ style="max-width: 100%; height: auto; display: block; margin: 0 auto;"
32
+ loading="lazy"
33
+ />
34
+ </picture>
35
+
36
+ <h1> Neodragon: Mobile Video Generation Using Diffusion Transformer </h1>
37
+
38
+ <a href="https://qualcomm-ai-research.github.io/neodragon">
39
+ <img src="https://img.shields.io/badge/Project-Page-Green" alt="Project Page">
40
+ </a>
41
+ <a href="https://arxiv.org/abs/2511.06055">
42
+ <img src="https://img.shields.io/badge/arXiv-2511.06055-b31b1b.svg" alt="arXiv">
43
+ </a>
44
+ <a href="https://huggingface.co/karnewar/Neodragon">
45
+ <img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue'>
46
+ </a>
47
+
48
+ **[Qualcomm AI Research](https://www.qualcomm.com/research/artificial-intelligence)**
49
+
50
+ [Animesh Karnewar](https://akanimax.github.io),
51
+ [Denis Korzhenkov](https://scholar.google.com/citations?user=ypspak0AAAAJ),
52
+ [Ioannis Lelekas](https://nl.linkedin.com/in/ioannis-lelekas-609bb5151),
53
+ [Noor Fathima](https://scholar.google.com/citations?user=M9BUCaUAAAAJ&hl=en),
54
+ [Adil Karjauv](https://scholar.google.com/citations?user=bN7UGiYAAAAJ&hl=en),
55
+ [Hanwen Xiong](#),
56
+ [Vancheeswaran Vaidyanathan](https://www.linkedin.com/in/vancheeswaran-vaidyanathan),
57
+ [Will Zeng](https://scholar.google.com/citations?user=B_fh4ioAAAAJ&hl=en),
58
+ [Rafael Esteves](https://www.linkedin.com/in/rafael-esteves-124353145),
59
+ [Tushar Singhal](https://www.linkedin.com/in/tushar-singhal),
60
+ [Fatih Porikli](https://scholar.google.com/citations?user=VpB8NZ8AAAAJ&hl=en),
61
+ [Mohsen Ghafoorian](https://mohsenghafoorian.github.io),
62
+ [Amirhossein Habibian](https://habibian.github.io/)
63
+
64
+ </div>
65
+
66
+ ```bibtex
67
+ @article{karnewar2025neodragon,
68
+ author = {Animesh Karnewar and Denis Korzhenkov and Ioannis Lelekas and Noor Fathima and Adil Karjauv and Hanwen Xiong and Vancheeswaran Vaidyanathan and Will Zeng and Rafael Esteves and Tushar Singhal and Fatih Porikli and Mohsen Ghafoorian and Amirhossein Habibian},
69
+ title = {Neodragon: Mobile Video Generation using Diffusion Transformer},
70
+ journal = {arXiv preprint arXiv:2511.06055},
71
+ year = {2025},
72
+ }
73
+ ```
74
+
75
+
76
+
77
+ <!-- # 🐱 SANA-Video Model Card
78
+
79
+ SANA-Video is a small, ultra-efficient diffusion model designed for rapid generation of high-quality, minute-long videos at resolutions up to 720×1280.
80
+
81
+ Key innovations and efficiency drivers include:
82
+
83
+ (1) **Linear DiT**: Leverages linear attention as the core operation, offering significantly more efficiency than vanilla attention when processing the massive number of tokens required for video generation.
84
+
85
+ (2) **Constant-Memory KV Cache for Block Linear Attention**: Implements a block-wise autoregressive approach that uses the cumulative properties of linear attention to maintain global context at a fixed memory cost, eliminating the traditional KV cache bottleneck and enabling efficient, minute-long video synthesis.
86
+
87
+ SANA-Video achieves exceptional efficiency and cost savings: its training cost is only **1%** of MovieGen's (**12 days on 64 H100 GPUs**). Compared to modern state-of-the-art small diffusion models (e.g., Wan 2.1 and SkyReel-V2), SANA-Video maintains competitive performance while being **16×** faster in measured latency.
88
+ SANA-Video is deployable on RTX 5090 GPUs, accelerating the inference speed for a 5-second 720p video from 71s down to 29s (2.4× speedup), setting a new standard for low-cost, high-quality video generation.
89
+
90
+ Source code is available at https://github.com/NVlabs/Sana.
91
+
92
+ # 🐱 How to Inference
93
+
94
+ Refer to: https://github.com/NVlabs/Sana/blob/main/asset/docs/sana_video.md#1-inference-with-txt-file
95
+ # diffusers pipeline
96
+ refer to: https://huggingface.co/Efficient-Large-Model/SANA-Video_2B_720p_diffusers
97
+
98
+ ### Model Description
99
+
100
+ - **Developed by:** NVIDIA, Sana
101
+ - **Model type:** Efficient Video Generation with Block Linear Diffusion Transformer
102
+ - **Model size:** 2B parameters
103
+ - **Model precision:** torch.bfloat16 (BF16)
104
+ - **Model resolution:** This model is developed to generate 720p, 81-frame (5 s) videos with multi-scale height and width.
105
+ - **Model Description:** This is a model that can be used to generate and modify videos based on text prompts.
106
+ It is a Linear Diffusion Transformer that uses LTX2-vae one 32x32x8 spatial-temporal-compressed latent feature encoder ([LTX2](https://github.com/Lightricks/LTX-2)).
107
+ - **Resources for more information:** Check out our [GitHub Repository](https://github.com/NVlabs/Sana) and the [SANA-Video report on arXiv](https://arxiv.org/pdf/2509.24695).
108
+
109
+ ### Model Sources
110
+
111
+ For research purposes, we recommend our `generative-models` GitHub repository (https://github.com/NVlabs/Sana), which is more suitable for both training and inference.
112
+ - **Repository:** https://github.com/NVlabs/Sana
113
+ - **Guidance:** https://github.com/NVlabs/Sana/asset/docs/sana_video.md
114
+ ## License/Terms of Use
115
+ GOVERNING TERMS: This trial service is governed by the [NVIDIA API Trial Terms of Service](https://assets.ngc.nvidia.com/products/api-catalog/legal/NVIDIA%20API%20Trial%20Terms%20of%20Service.pdf). Use of this model is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/).
116
+ ## Uses
117
+ ### Direct Use
118
+ The model is intended for research purposes only. Possible research areas and tasks include
119
+ - Generation of artworks and use in design and other artistic processes.
120
+ - Applications in educational or creative tools.
121
+ - Research on generative models.
122
+ - Safe deployment of models which have the potential to generate harmful content.
123
+ - Probing and understanding the limitations and biases of generative models.
124
+ Excluded uses are described below.
125
+ ### Out-of-Scope Use
126
+ The model was not trained to be factual or true representations of people or events, and therefore using the model to generate such content is out-of-scope for the abilities of this model.
127
+ ## Limitations and Bias
128
+ ### Limitations
129
+ - The model does not achieve perfect photorealism
130
+ - The model cannot render complex legible text
131
+ - Fingers, etc., may not be generated properly in general.
132
+ - The autoencoding part of the model is lossy.
133
+ ### Bias
134
+ While the capabilities of video generation models are impressive, they can also reinforce or exacerbate social biases. -->