Commit f6fdb20
0 Parent(s):
Duplicate from zai-org/SSVAE
Co-authored-by: zR <ZHANGYUXUAN-zR@users.noreply.huggingface.co>
- .gitattributes +35 -0
- README.md +53 -0
- ch48_256p_15w_512p_5w.ckpt +3 -0
.gitattributes
ADDED
@@ -0,0 +1,35 @@
+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
README.md
ADDED
@@ -0,0 +1,53 @@
+---
+license: mit
+---
+
+# Delving into Latent Spectral Biasing of Video VAEs for Superior Diffusability
+
+[Project page](https://zhazhan.github.io/ssvae.github.io)
+[arXiv](https://arxiv.org/abs/2512.05394)
+
+
+Most existing video VAEs prioritize reconstruction fidelity, often overlooking the latent structure's impact on
+downstream diffusion training. Our research identifies properties of video VAE latent spaces that facilitate diffusion
+training through statistical analysis of VAE latents. Our key finding is that biased, rather than uniform, spectra lead
+to improved diffusability. Motivated by this, we introduce **SSVAE (Spectral-Structured VAE)**, which optimizes the
+**spectral properties** of the latent space to enhance its **"Diffusability"**.
+
+<div align="center">
+<img src="https://raw.githubusercontent.com/zai-org/SSVAE/refs/heads/main/assets/figs/teaser.png" alt="Figure 1" width="400">
+</div>
+
+## 🔥 Key Highlights
+
+* **Spectral Analysis of Latents**: We identify two statistical properties essential for efficient diffusion training: a
+  **low-frequency biased spatio-temporal spectrum** and a **few-mode biased channel eigenspectrum**.
+* **Local Correlation Regularization (LCR)**: A lightweight regularizer that explicitly enhances local spatio-temporal
+  correlations to induce low-frequency bias.
+* **Latent Masked Reconstruction (LMR)**: A mechanism that simultaneously promotes few-mode bias and improves decoder
+  robustness against noise.
+* **Superior Performance**:
+    * 🚀 **3× Faster Convergence**: Accelerates text-to-video generation convergence by 3× compared to strong baselines.
+    * 📈 **Higher Quality**: Achieves a **10% gain** in video reward scores (UnifiedReward).
+    * 🏆 **Outperforms SOTA**: Surpasses open-source VAEs (e.g., Wan 2.2, CogVideoX) in generation quality with fewer
+      parameters.
+
+## Using the Model
+
+Please see our [GitHub repository](https://github.com/zai-org/SSVAE).
+
+## Citation
+
+If you find this work useful in your research, please consider citing:
+
+```bibtex
+@misc{liu2025delvinglatentspectralbiasing,
+      title={Delving into Latent Spectral Biasing of Video VAEs for Superior Diffusability},
+      author={Shizhan Liu and Xinran Deng and Zhuoyi Yang and Jiayan Teng and Xiaotao Gu and Jie Tang},
+      year={2025},
+      eprint={2512.05394},
+      archivePrefix={arXiv},
+      primaryClass={cs.CV},
+      url={https://arxiv.org/abs/2512.05394},
+}
+```
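Note on the README above: its first highlight names two statistics of VAE latents, a low-frequency biased spatio-temporal spectrum and a few-mode biased channel eigenspectrum. The sketch below shows one plausible way to measure both on a latent tensor of shape `(C, T, H, W)` with NumPy. It is not taken from the SSVAE repository: the function names, the `radius` cutoff, the box-filter smoothing, and the synthetic test tensors are all illustrative assumptions.

```python
import numpy as np


def channel_eigenspectrum(z):
    """Eigenvalues of the channel covariance of z (C, T, H, W),
    sorted in descending order and normalized to sum to 1."""
    c = z.shape[0]
    flat = z.reshape(c, -1)
    flat = flat - flat.mean(axis=1, keepdims=True)
    cov = flat @ flat.T / flat.shape[1]
    # eigvalsh returns ascending eigenvalues; reverse and clip round-off negatives.
    eig = np.clip(np.linalg.eigvalsh(cov)[::-1], 0.0, None)
    return eig / eig.sum()


def low_frequency_fraction(z, radius=0.25):
    """Fraction of spatio-temporal spectral power whose frequency magnitude
    (cycles per sample) is at most `radius`. The cutoff is an assumption."""
    # Remove the per-channel mean so a constant offset does not dominate.
    z = z - z.mean(axis=(1, 2, 3), keepdims=True)
    power = np.abs(np.fft.fftn(z, axes=(1, 2, 3))) ** 2
    grids = np.meshgrid(*[np.fft.fftfreq(n) for n in z.shape[1:]], indexing="ij")
    r = np.sqrt(sum(g ** 2 for g in grids))
    return power[:, r <= radius].sum() / power.sum()


def box_smooth(z, passes=2):
    """Hypothetical helper: circular 3-tap box filter over T, H, W,
    used only to build a locally correlated test tensor."""
    for _ in range(passes):
        for ax in (1, 2, 3):
            z = (np.roll(z, 1, ax) + z + np.roll(z, -1, ax)) / 3.0
    return z


# Synthetic stand-ins for latents (not real VAE outputs).
rng = np.random.default_rng(0)
noise = rng.standard_normal((8, 8, 16, 16))   # flat spectrum, independent channels
smooth = box_smooth(noise)                     # locally correlated
few_mode = np.einsum("c,thw->cthw", rng.standard_normal(8), smooth[0]) + 0.01 * noise

print("low-frequency fraction, noise :", low_frequency_fraction(noise))
print("low-frequency fraction, smooth:", low_frequency_fraction(smooth))
print("top eigenvalue share, noise   :", channel_eigenspectrum(noise)[0])
print("top eigenvalue share, few-mode:", channel_eigenspectrum(few_mode)[0])
```

In this toy setup the smoothed tensor puts more of its power at low frequencies than white noise does, and the nearly rank-one tensor concentrates its channel variance in a single eigenvalue, which is the kind of bias the highlight describes.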
ch48_256p_15w_512p_5w.ckpt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:49a354e836ac6124f7a1564a29def48bc7b938368aad53a52cc63ca45decba57
+size 1382929206
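Note on the pointer file above: the three committed lines are a Git LFS pointer, not the checkpoint itself. The pointer records the SHA-256 digest and byte size of the real file. A minimal sketch of checking a downloaded checkpoint against it, using only the Python standard library, follows. The function names `parse_lfs_pointer` and `matches_pointer` and the local file path in the usage note are hypothetical.

```python
import hashlib
import os

# Pointer contents copied from the diff above.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:49a354e836ac6124f7a1564a29def48bc7b938368aad53a52cc63ca45decba57
size 1382929206
"""


def parse_lfs_pointer(text):
    """Split each 'key value' line of a pointer file into a small dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file at `path` has the size and SHA-256 digest
    recorded in `pointer`. The file is read in chunks to bound memory use."""
    if pointer["algo"] != "sha256":
        raise ValueError("unsupported hash algorithm: " + pointer["algo"])
    if os.path.getsize(path) != pointer["size"]:
        return False
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["size"], "bytes expected")
```

Assuming the checkpoint was saved locally as `ch48_256p_15w_512p_5w.ckpt`, calling `matches_pointer("ch48_256p_15w_512p_5w.ckpt", pointer)` would report whether the download is complete and uncorrupted. The size check runs first because it is cheap and catches truncated downloads before any hashing.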