---
license: apache-2.0
tags:
- text-to-video
- image-to-video
- camera-control
- world-model
- diffusion
---
# SANA-WM (Bidirectional)
**SANA-WM** is an efficient open-source world model trained natively for
one-minute generation. The bidirectional checkpoint released here is a
2.6B-parameter image-to-video diffusion transformer that synthesises
720p, minute-scale videos with precise 6-DoF camera control, paired with
the LTX-2 sink-bidirectional Euler refiner for high-fidelity decoding.
Four core designs drive the architecture:
1. **Hybrid Linear Attention**: frame-wise Gated DeltaNet combined with
softmax attention every Nth block for memory-efficient long-context
modelling.
2. **Dual-Branch Camera Control**: independent main and camera branches
enable precise per-frame trajectory adherence.
3. **Two-Stage Generation Pipeline**: a long-video refiner stitched on
top of Stage-1 latents improves quality and temporal consistency.
4. **Robust Annotation Pipeline**: metric-scale 6-DoF camera poses
extracted from public video corpora yield spatiotemporally consistent
action supervision.
Paper: <https://arxiv.org/abs/2605.15178>
```bibtex
@article{zhu2026sanawm,
title = {{SANA-WM}: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer},
author = {Zhu, Haoyi and Liu, Haozhe and Zhao, Yuyang and Ye, Tian and Chen, Junsong and Yu, Jincheng and He, Tong and Han, Song and Xie, Enze},
journal = {arXiv preprint arXiv:2605.15178},
year = {2026},
}
```
## Repository layout
| Component | Path in repo | Size |
|------------------------------------|-------------------------------------------|------:|
| Sana DiT (Stage 1) | `dit/sana_wm_1600m_720p.safetensors` | 10 GB |
| LTX-2 VAE (diffusers) | `vae/` | 2 GB |
| LTX-2 refiner (Stage 2) | `refiner/refiner.safetensors` | 41 GB |
| Gemma text encoder for the refiner | `refiner/text_encoder/` | 46 GB |
| Inference config | `config.yaml` | n/a |
The Sana text encoder (`gemma-2-2b-it`) is **not** bundled here; it is
fetched on demand from the public Hugging Face mirror.
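
Because the refiner and its text encoder dominate the download, it can help to estimate the disk footprint before fetching anything. The sketch below is illustrative only: the sizes are the rounded figures from the table, `plan_download` is a hypothetical helper, and `config.yaml` is treated as negligible because no size is listed for it. Whether the inference script works with only a subset of the files is an assumption, not something stated here.

```python
# Rounded sizes in GB, copied from the repository layout table.
COMPONENT_SIZES_GB = {
    "dit/sana_wm_1600m_720p.safetensors": 10,
    "vae/": 2,
    "refiner/refiner.safetensors": 41,
    "refiner/text_encoder/": 46,
    "config.yaml": 0,  # size not listed; assumed negligible
}


def plan_download(include_refiner: bool = True):
    """Return (paths, total_gb) for the components to fetch.

    Hypothetical helper. Dropping everything under `refiner/` is an
    assumption about what a Stage-1-only setup would need.
    """
    paths = [
        path
        for path in COMPONENT_SIZES_GB
        if include_refiner or not path.startswith("refiner/")
    ]
    total_gb = sum(COMPONENT_SIZES_GB[path] for path in paths)
    return paths, total_gb


full_paths, full_gb = plan_download()
stage1_paths, stage1_gb = plan_download(include_refiner=False)
print(f"full: ~{full_gb} GB, without refiner: ~{stage1_gb} GB")
# full: ~99 GB, without refiner: ~12 GB
```

If a partial fetch is wanted, the selected paths could be adapted into `allow_patterns` for `huggingface_hub.snapshot_download`; directory entries would need a trailing wildcard such as `vae/*`.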
## Usage
```bash
python inference_video_scripts/inference_sana_wm.py \
--image asset/sana_wm/demo_0.png \
--prompt asset/sana_wm/demo_0.txt \
--action "w-80,jw-40,w-40,lw-60,w-100" \
--translation_speed 0.055 \
--rotation_speed_deg 1.2 \
--num_frames 321 \
--output_dir results/demo
```
Weights are fetched from this repository on first use. Pass `--no_refiner`
to skip the LTX-2 refiner and decode Stage-1 latents with the Sana VAE
instead. To run fully offline, override any of `--config` / `--model_path` /
`--refiner_checkpoint` / `--refiner_gemma_root` with local paths.
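
As a rough illustration of the offline overrides, the sketch below assembles the four flags for a local copy of this repository. The mapping from each flag to a path is inferred from the repository layout table and is an assumption, not taken from the script; `offline_args` and the directory `/data/sana-wm` are hypothetical.

```python
from pathlib import Path


def offline_args(root: str) -> list[str]:
    """Build override flags pointing at a local copy of this repository.

    The flag-to-path mapping is an assumption based on the layout table.
    """
    base = Path(root)
    return [
        "--config", str(base / "config.yaml"),
        "--model_path", str(base / "dit" / "sana_wm_1600m_720p.safetensors"),
        "--refiner_checkpoint", str(base / "refiner" / "refiner.safetensors"),
        "--refiner_gemma_root", str(base / "refiner" / "text_encoder"),
    ]


args = offline_args("/data/sana-wm")  # hypothetical local directory
print(" ".join(args))
```

The printed flags can be appended to the command shown above. Since the Sana text encoder is fetched separately, it would presumably also need to be present in the local Hugging Face cache for a run with no network access.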
## Inputs
| Argument | Format |
|---------------------|-----------------------------------------------------------------------------------------|
| `--image` | RGB image (any PIL-readable format), used as the first frame. |
| `--prompt` | UTF-8 text file containing the conditioning prompt. |
| `--camera` | NumPy `.npy` of shape `(F, 4, 4)`: per-frame camera-to-world matrices. |
| `--action` | WASD/IJKL DSL, e.g. `"w-80,jw-40,w-40,lw-60,w-100"`. We roll it out to a `(F+1, 4, 4)` trajectory. Mutually exclusive with `--camera`. |
| `--intrinsics` | Optional. `.npy` of shape `(3, 3)`, `(F, 3, 3)`, or `(4,)`. If omitted, we estimate intrinsics from `--image` with Pi3X and abort if the resulting FOV is outside `[25°, 120°]`. |
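
To make the `--action` string format concrete, here is a minimal sketch that splits it into segments. It handles only the surface syntax: reading the number after the hyphen as a frame count is an assumption, the meaning of the individual keys is not interpreted, and `parse_action` is a hypothetical helper rather than part of the released script.

```python
def parse_action(action: str) -> list[tuple[str, int]]:
    """Split an action string into (keys, count) segments.

    Assumes each comma-separated segment looks like `<keys>-<count>`.
    """
    segments = []
    for token in action.split(","):
        keys, _, count = token.strip().rpartition("-")
        if not keys or not count.isdigit():
            raise ValueError(f"malformed segment: {token!r}")
        segments.append((keys, int(count)))
    return segments


segments = parse_action("w-80,jw-40,w-40,lw-60,w-100")
total = sum(count for _, count in segments)
print(segments)
print(total, total + 1)
# 320 321
```

Under that reading, the example string sums to 320 steps, and 320 + 1 matches both the `(F+1, 4, 4)` trajectory shape and the `--num_frames 321` value in the usage command. This is an observation about the example, not a documented rule.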
The output frame size is fixed at `704 x 1280`; input images are
resized with their aspect ratio preserved and then center-cropped to that
resolution.
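
The resize-and-crop geometry can be sketched with integer arithmetic. This is one plausible reading, not the script's actual preprocessing: it assumes `704 x 1280` means height by width, that the image is scaled until it covers the target, and that the overflow is trimmed equally from both sides. `resize_and_crop_box` is a hypothetical helper, and the real rounding may differ.

```python
TARGET_H, TARGET_W = 704, 1280  # assumes "704 x 1280" is height x width


def resize_and_crop_box(width: int, height: int):
    """Return ((new_w, new_h), (left, top, right, bottom)).

    The image is scaled uniformly until it covers the target, then the
    excess is cropped symmetrically around the center.
    """
    if width * TARGET_H >= height * TARGET_W:
        # Input is relatively wider than the target: match heights.
        new_h = TARGET_H
        new_w = -(-width * TARGET_H // height)  # ceiling division
    else:
        # Input is relatively taller than the target: match widths.
        new_w = TARGET_W
        new_h = -(-height * TARGET_W // width)  # ceiling division
    left = (new_w - TARGET_W) // 2
    top = (new_h - TARGET_H) // 2
    return (new_w, new_h), (left, top, left + TARGET_W, top + TARGET_H)


size, box = resize_and_crop_box(1920, 1080)
print(size, box)
# (1280, 720) (0, 8, 1280, 712)
```

With Pillow, the two results would correspond to `Image.resize(size)` followed by `Image.crop(box)`. For a 1920 x 1080 input, only 8 rows are trimmed from the top and bottom, so 16:9 images lose very little content.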
## License
Released under the Apache 2.0 license. The bundled LTX-2 refiner and VAE
inherit the LTX-2 upstream license.