---
license: apache-2.0
tags:
- text-to-video
- image-to-video
- camera-control
- world-model
- diffusion
---
# SANA-WM (Bidirectional)
**SANA-WM** is an efficient open-source world model trained natively for
one-minute generation. The bidirectional checkpoint released here is a
2.6B-parameter image-to-video diffusion transformer that synthesises
720p, minute-scale videos with precise 6-DoF camera control, paired with
the LTX-2 sink-bidirectional Euler refiner for high-fidelity decoding.
Four core designs drive the architecture:
1. **Hybrid Linear Attention** — frame-wise Gated DeltaNet combined with
softmax attention every Nth block for memory-efficient long-context
modelling.
2. **Dual-Branch Camera Control** — independent main and camera branches
enable precise per-frame trajectory adherence.
3. **Two-Stage Generation Pipeline** — a long-video refiner stitched on
top of Stage-1 latents improves quality and temporal consistency.
4. **Robust Annotation Pipeline** — metric-scale 6-DoF camera poses
extracted from public video corpora yield spatiotemporally consistent
action supervision.
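As a rough illustration of the hybrid layout in item 1, the sketch below labels each transformer block as either Gated DeltaNet or softmax attention. The function name `block_schedule`, the depth of 8, the interval of 4, and the choice to place softmax on the last block of each group are all assumptions made for illustration; the card does not state the actual depth, the value of N, or the offset.

```python
def block_schedule(depth: int, every_n: int) -> list[str]:
    """Label each block: softmax attention on every `every_n`-th block,
    Gated DeltaNet elsewhere (hypothetical layout, not the released config)."""
    if depth <= 0 or every_n <= 0:
        raise ValueError("depth and every_n must be positive")
    return [
        "softmax" if (i + 1) % every_n == 0 else "gated_deltanet"
        for i in range(depth)
    ]


# Illustrative values only: 8 blocks with softmax attention on every 4th block.
schedule = block_schedule(depth=8, every_n=4)
print(schedule)
```

With such a layout most blocks use the linear-attention path, and only the occasional softmax block pays the full quadratic attention cost.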
Paper: <https://arxiv.org/abs/2605.15178>
```bibtex
@article{zhu2026sanawm,
title = {{SANA-WM}: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer},
author = {Zhu, Haoyi and Liu, Haozhe and Zhao, Yuyang and Ye, Tian and Chen, Junsong and Yu, Jincheng and He, Tong and Han, Song and Xie, Enze},
journal = {arXiv preprint arXiv:2605.15178},
year = {2026},
}
```
## Repository layout
| Component | Path in repo | Size |
|------------------------------------|-------------------------------------------|------:|
| Sana DiT (Stage 1) | `dit/sana_wm_1600m_720p.safetensors` | 10 GB |
| LTX-2 VAE (diffusers) | `vae/` | 2 GB |
| LTX-2 refiner (Stage 2) | `refiner/refiner.safetensors` | 41 GB |
| Gemma text encoder for the refiner | `refiner/text_encoder/` | 46 GB |
| Inference config | `config.yaml` | — |
The Sana text encoder (`gemma-2-2b-it`) is **not** bundled here — it is
fetched on demand from the public Hugging Face mirror.
## Usage
```bash
python inference_video_scripts/inference_sana_wm.py \
--image asset/sana_wm/demo_0.png \
--prompt asset/sana_wm/demo_0.txt \
--action "w-80,jw-40,w-40,lw-60,w-100" \
--translation_speed 0.055 \
--rotation_speed_deg 1.2 \
--num_frames 321 \
--output_dir results/demo
```
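The `--action` string in the command above appears to be a comma-separated list of `<keys>-<count>` segments. The following is a minimal parsing sketch under that assumption. `parse_action` is a hypothetical helper, not part of the released scripts, and the meaning of each key letter is not defined in this card.

```python
def parse_action(spec: str) -> list[tuple[str, int]]:
    """Split an action string into (keys, count) segments.

    Assumes each segment has the form "<letters>-<integer>".
    """
    segments = []
    for token in spec.split(","):
        keys, sep, count = token.strip().partition("-")
        if not sep or not keys.isalpha() or not count.isdigit():
            raise ValueError(f"malformed segment: {token!r}")
        segments.append((keys, int(count)))
    return segments


segments = parse_action("w-80,jw-40,w-40,lw-60,w-100")
total_steps = sum(count for _, count in segments)
print(segments)
print(total_steps)
```

For the demo string the counts sum to 320, one less than `--num_frames 321`. That is consistent with each count being a number of frame steps after the first frame, although the card does not state this explicitly.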
Weights are fetched from this repository on first use. Pass `--no_refiner`
to skip the LTX-2 refiner and decode Stage-1 latents with the Sana VAE
instead. To run fully offline, override any of `--config` / `--model_path` /
`--refiner_checkpoint` / `--refiner_gemma_root` with local paths.
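One way to prepare an offline run is to download the repository once and point the four flags at the local copy. The sketch below uses `snapshot_download` from `huggingface_hub`. The repo id is the one shown on the model page, and the mapping from each flag to a path in the repository layout table is an assumption, so check it against the script's `--help` output. The card also does not say how to supply a local copy of the unbundled Sana text encoder (`gemma-2-2b-it`), so that part is not covered here.

```python
from pathlib import Path

REPO_ID = "Efficient-Large-Model/SANA-WM_bidirectional"


def fetch_repo(local_dir: str) -> str:
    """Download the whole repository once (roughly 100 GB per the layout table)."""
    # Imported lazily so the path helper below works without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


def offline_args(root: str) -> list[str]:
    """Build CLI overrides pointing at a local copy of the repository.

    The flag-to-path mapping is assumed from the layout table.
    """
    base = Path(root)
    return [
        "--config", str(base / "config.yaml"),
        "--model_path", str(base / "dit" / "sana_wm_1600m_720p.safetensors"),
        "--refiner_checkpoint", str(base / "refiner" / "refiner.safetensors"),
        "--refiner_gemma_root", str(base / "refiner" / "text_encoder"),
    ]


# "/data/sana_wm" is a placeholder directory.
args = offline_args("/data/sana_wm")
print(" ".join(args))
```

Call `fetch_repo("/data/sana_wm")` once while online, then append the printed flags to the inference command.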
## Inputs
| Argument | Format |
|---------------------|-----------------------------------------------------------------------------------------|
| `--image` | RGB image (any PIL-readable format) — used as the first frame. |
| `--prompt` | UTF-8 text file containing the conditioning prompt. |
| `--camera` | NumPy `.npy` of shape `(F, 4, 4)` — per-frame camera-to-world matrices. |
| `--action` | WASD/IJKL DSL, e.g. `"w-80,jw-40,w-40,lw-60,w-100"`. We roll it out to a `(F+1, 4, 4)` trajectory. Mutually exclusive with `--camera`. |
| `--intrinsics` | Optional. `.npy` of shape `(3, 3)`, `(F, 3, 3)`, or `(4,)`. If omitted, we estimate intrinsics from `--image` with Pi3X and abort if the resulting FOV is outside `[25°, 120°]`. |
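To make the `--camera` and `--intrinsics` shapes concrete, here is a small NumPy sketch that writes a camera-to-world trajectory and checks a field of view against the `[25°, 120°]` range. Several things are assumptions rather than facts from this card: the axis convention (camera looks along +z, y is the vertical axis), the helper names, the example focal length, reading `704 x 1280` as height by width, and the use of the horizontal FOV for the range check. How `F` relates to `--num_frames` is also not stated, so 321 is used only as an example.

```python
import numpy as np


def forward_trajectory(num_frames: int, step: float = 0.055, yaw_deg: float = 0.0) -> np.ndarray:
    """Return camera-to-world matrices of shape (num_frames, 4, 4).

    Assumed convention: the camera moves along its local +z axis and yaws about y.
    """
    poses = np.tile(np.eye(4), (num_frames, 1, 1))
    position = np.zeros(3)
    for i in range(1, num_frames):
        angle = np.deg2rad(yaw_deg * i)
        c, s = np.cos(angle), np.sin(angle)
        rot = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])  # rotation about y
        position = position + rot @ np.array([0.0, 0.0, step])  # one step forward
        poses[i, :3, :3] = rot
        poses[i, :3, 3] = position
    return poses


def horizontal_fov_deg(K: np.ndarray, width: int) -> float:
    """Horizontal field of view implied by a 3x3 pinhole intrinsics matrix,
    assuming a centered principal point."""
    return float(np.rad2deg(2.0 * np.arctan(width / (2.0 * K[0, 0]))))


poses = forward_trajectory(num_frames=321, yaw_deg=0.1)
np.save("camera.npy", poses)  # pass as --camera camera.npy

# Example intrinsics for a 1280-wide, 704-high frame (illustrative focal length).
K = np.array([[800.0, 0.0, 640.0], [0.0, 800.0, 352.0], [0.0, 0.0, 1.0]])
np.save("intrinsics.npy", K)  # pass as --intrinsics intrinsics.npy
fov = horizontal_fov_deg(K, width=1280)
print(poses.shape, round(fov, 2))
```

Supplying `--intrinsics` explicitly avoids the automatic Pi3X estimate, which is useful when the focal length of the input image is already known.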
The output frame size is fixed at `704 x 1280`; input images are resized
with their aspect ratio preserved, then center-cropped to that resolution.
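The resize-then-crop step can be sketched as plain arithmetic. This assumes `704 x 1280` means height by width and that the image is scaled just enough to cover the target before cropping. `resize_and_crop_box` is a hypothetical helper, and the released script may round or resample differently.

```python
def resize_and_crop_box(src_w: int, src_h: int, dst_w: int = 1280, dst_h: int = 704):
    """Return the resized (width, height) and the centered crop box.

    The crop box is (left, top, right, bottom) in resized-image pixels.
    """
    # Scale so the resized image covers the target in both dimensions.
    scale = max(dst_w / src_w, dst_h / src_h)
    new_w = max(dst_w, round(src_w * scale))
    new_h = max(dst_h, round(src_h * scale))
    left = (new_w - dst_w) // 2
    top = (new_h - dst_h) // 2
    return (new_w, new_h), (left, top, left + dst_w, top + dst_h)


# A 1920x1080 input is scaled to 1280x720, then 8 rows are trimmed top and bottom.
size, box = resize_and_crop_box(1920, 1080)
print(size, box)
```

With Pillow, `ImageOps.fit(image, (1280, 704))` performs a comparable centered crop and resize in one call, which is convenient for previewing how an input will be framed.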
## License
Released under the Apache 2.0 license. The bundled LTX-2 refiner and VAE
inherit the LTX-2 upstream license.