---
license: apache-2.0
tags:
- text-to-video
- image-to-video
- camera-control
- world-model
- diffusion
---

# SANA-WM (Bidirectional)

**SANA-WM** is an efficient open-source world model trained natively for
one-minute generation. The bidirectional checkpoint released here is a
2.6B-parameter image-to-video diffusion transformer that synthesises
720p, minute-scale videos with precise 6-DoF camera control, paired with
the LTX-2 sink-bidirectional Euler refiner for high-fidelity decoding.

Four core designs drive the architecture:

1. **Hybrid Linear Attention**: frame-wise Gated DeltaNet combined with
   softmax attention every Nth block for memory-efficient long-context
   modelling.
2. **Dual-Branch Camera Control**: independent main and camera branches
   enable precise per-frame trajectory adherence.
3. **Two-Stage Generation Pipeline**: a long-video refiner stitched on
   top of Stage-1 latents improves quality and temporal consistency.
4. **Robust Annotation Pipeline**: metric-scale 6-DoF camera poses
   extracted from public video corpora yield spatiotemporally consistent
   action supervision.

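
The "softmax attention every Nth block" layout in the first design can be pictured with a small schedule helper. This is only an illustrative sketch: the function name, the block count, the value of N, and the choice to place softmax on every N-th block counting from 1 are all assumptions, not details taken from the SANA-WM code.

```python
def block_schedule(num_blocks: int, softmax_every: int) -> list[str]:
    """Label each transformer block with the attention type it would use.

    Assumption (not from the model card): blocks are counted from 1 and
    every `softmax_every`-th block uses softmax attention, while all other
    blocks use the linear Gated DeltaNet attention.
    """
    if num_blocks < 0 or softmax_every < 1:
        raise ValueError("num_blocks must be >= 0 and softmax_every >= 1")
    return [
        "softmax" if (i + 1) % softmax_every == 0 else "gated_deltanet"
        for i in range(num_blocks)
    ]


# Hypothetical sizes chosen only to make the pattern visible.
schedule = block_schedule(num_blocks=8, softmax_every=4)
print(schedule)
```

With a layout like this, most blocks use the linear attention path and only a fraction pay the quadratic softmax cost, which is the memory trade-off the list above describes.
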
Paper: <https://arxiv.org/abs/2605.15178>

```bibtex
@article{zhu2026sanawm,
  title = {{SANA-WM}: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer},
  author = {Zhu, Haoyi and Liu, Haozhe and Zhao, Yuyang and Ye, Tian and Chen, Junsong and Yu, Jincheng and He, Tong and Han, Song and Xie, Enze},
  journal = {arXiv preprint arXiv:2605.15178},
  year = {2026},
}
```

## Repository layout

| Component                          | Path in repo                         |  Size |
|------------------------------------|--------------------------------------|------:|
| Sana DiT (Stage 1)                 | `dit/sana_wm_1600m_720p.safetensors` | 10 GB |
| LTX-2 VAE (diffusers)              | `vae/`                               |  2 GB |
| LTX-2 refiner (Stage 2)            | `refiner/refiner.safetensors`        | 41 GB |
| Gemma text encoder for the refiner | `refiner/text_encoder/`              | 46 GB |
| Inference config                   | `config.yaml`                        |     - |

The Sana text encoder (`gemma-2-2b-it`) is **not** bundled here; it is
fetched on demand from the public Hugging Face mirror.

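
Because the components differ so much in size, it can be useful to pre-fetch only some of them. The sketch below uses `huggingface_hub.snapshot_download` with `allow_patterns`; the paths and sizes are copied from the table above, while the helper names and the choice of which components to select are illustrative assumptions. The card does not say which components a given run needs.

```python
REPO_ID = "Efficient-Large-Model/SANA-WM_bidirectional"

# Repo paths and approximate sizes in GB, copied from the layout table.
# The config size is not listed there; it is treated as negligible here.
COMPONENTS = {
    "dit": ("dit/sana_wm_1600m_720p.safetensors", 10),
    "vae": ("vae/*", 2),
    "refiner": ("refiner/refiner.safetensors", 41),
    "refiner_text_encoder": ("refiner/text_encoder/*", 46),
    "config": ("config.yaml", 0),
}


def select(names):
    """Return (allow_patterns, approximate total GB) for the chosen components."""
    patterns = [COMPONENTS[name][0] for name in names]
    total_gb = sum(COMPONENTS[name][1] for name in names)
    return patterns, total_gb


def download(names, local_dir):
    """Fetch only the chosen components (needs network access and huggingface_hub)."""
    from huggingface_hub import snapshot_download

    patterns, _ = select(names)
    return snapshot_download(
        repo_id=REPO_ID, allow_patterns=patterns, local_dir=local_dir
    )


# Example selection (hypothetical): everything except the two refiner pieces.
patterns, total_gb = select(["dit", "vae", "config"])
print(patterns, total_gb)
```

Calling `download(["dit", "vae", "config"], "weights/")` would then place the files under `weights/` with the same relative paths as in the repository.
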
## Usage

```bash
python inference_video_scripts/inference_sana_wm.py \
    --image asset/sana_wm/demo_0.png \
    --prompt asset/sana_wm/demo_0.txt \
    --action "w-80,jw-40,w-40,lw-60,w-100" \
    --translation_speed 0.055 \
    --rotation_speed_deg 1.2 \
    --num_frames 321 \
    --output_dir results/demo
```

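
The `--action` string in this command can be sanity-checked against `--num_frames` before launching a long run. The sketch below assumes each comma-separated segment has the form `<keys>-<count>`, where `<count>` is a number of frame steps; that reading is an assumption, though it is consistent with the example, whose counts sum to 320, one less than `--num_frames 321`. The `parse_action` helper is hypothetical and not part of the inference script.

```python
def parse_action(spec: str) -> list[tuple[str, int]]:
    """Split an action string into (keys, count) segments.

    Assumed format (not confirmed by the card): "<keys>-<count>" segments
    separated by commas, e.g. "jw-40" means keys "jw" held for 40 steps.
    """
    segments = []
    for part in spec.split(","):
        keys, sep, count = part.strip().rpartition("-")
        if not sep or not keys or not count.isdigit():
            raise ValueError(f"malformed action segment: {part!r}")
        segments.append((keys, int(count)))
    return segments


segments = parse_action("w-80,jw-40,w-40,lw-60,w-100")
total_steps = sum(count for _, count in segments)
print(segments)
print(total_steps + 1)  # 321: the initial frame plus 320 steps
```
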
Weights are fetched from this repository on first use. Pass `--no_refiner`
to skip the LTX-2 refiner and decode Stage-1 latents with the Sana VAE
instead. To run fully offline, override any of `--config` / `--model_path` /
`--refiner_checkpoint` / `--refiner_gemma_root` with local paths.

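
As a rough sketch of an offline invocation, the helper below assembles the command line with every override pointing into a local copy of this repository. The flag names come from the paragraph above; the mapping of each flag to a particular file or directory mirrors the repository layout and is an assumption, as is the `offline_command` helper itself.

```python
from pathlib import Path


def offline_command(weights_root: str, *, use_refiner: bool = True) -> list[str]:
    """Build an argv list that points the script at local files only.

    Assumption: each override accepts the corresponding path from the
    repository layout; check the script's --help for the exact expectations.
    """
    root = Path(weights_root)
    cmd = [
        "python", "inference_video_scripts/inference_sana_wm.py",
        "--image", "asset/sana_wm/demo_0.png",
        "--prompt", "asset/sana_wm/demo_0.txt",
        "--action", "w-80,jw-40,w-40,lw-60,w-100",
        "--num_frames", "321",
        "--output_dir", "results/demo",
        "--config", (root / "config.yaml").as_posix(),
        "--model_path", (root / "dit" / "sana_wm_1600m_720p.safetensors").as_posix(),
    ]
    if use_refiner:
        cmd += [
            "--refiner_checkpoint", (root / "refiner" / "refiner.safetensors").as_posix(),
            "--refiner_gemma_root", (root / "refiner" / "text_encoder").as_posix(),
        ]
    else:
        # Skip Stage 2 entirely; no refiner paths are needed.
        cmd.append("--no_refiner")
    return cmd


cmd = offline_command("weights", use_refiner=False)
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)`. Since the Sana text encoder is fetched on demand rather than bundled, a truly offline run presumably also needs it in the local Hugging Face cache beforehand.
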
## Inputs

| Argument       | Format |
|----------------|--------|
| `--image`      | RGB image (any PIL-readable format), used as the first frame. |
| `--prompt`     | UTF-8 text file containing the conditioning prompt. |
| `--camera`     | NumPy `.npy` of shape `(F, 4, 4)`: per-frame camera-to-world matrices. |
| `--action`     | WASD/IJKL DSL, e.g. `"w-80,jw-40,w-40,lw-60,w-100"`. We roll it out to a `(F+1, 4, 4)` trajectory. Mutually exclusive with `--camera`. |
| `--intrinsics` | Optional. `.npy` of shape `(3, 3)`, `(F, 3, 3)`, or `(4,)`. If omitted, we estimate intrinsics from `--image` with Pi3X and abort if the resulting FOV is outside `[25°, 120°]`. |

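
To make the `--camera` and `--intrinsics` shapes concrete, here is a minimal NumPy sketch that writes a straight-line trajectory and a pinhole intrinsics matrix. Several things are assumptions for illustration: that the camera looks along +z, that 0.055 is a per-frame translation, that a 60° horizontal field of view is reasonable, and that the frame is 1280 wide by 704 high. The card does not state the axis convention or units, so check them against the inference script.

```python
import numpy as np


def forward_trajectory(num_frames: int, step: float) -> np.ndarray:
    """Camera-to-world matrices of shape (F, 4, 4) moving along +z.

    Assumption: +z is the viewing direction and `step` is the per-frame
    translation; rotation stays at identity throughout.
    """
    poses = np.tile(np.eye(4, dtype=np.float32), (num_frames, 1, 1))
    poses[:, 2, 3] = step * np.arange(num_frames, dtype=np.float32)
    return poses


def pinhole_intrinsics(width: int, height: int, hfov_deg: float) -> np.ndarray:
    """3x3 pinhole matrix with square pixels and a centred principal point."""
    focal = 0.5 * width / np.tan(np.deg2rad(hfov_deg) / 2.0)
    return np.array(
        [[focal, 0.0, width / 2.0],
         [0.0, focal, height / 2.0],
         [0.0, 0.0, 1.0]],
        dtype=np.float32,
    )


poses = forward_trajectory(num_frames=321, step=0.055)
K = pinhole_intrinsics(width=1280, height=704, hfov_deg=60.0)
np.save("camera.npy", poses)      # pass as --camera camera.npy
np.save("intrinsics.npy", K)      # pass as --intrinsics intrinsics.npy
print(poses.shape, K.shape)
```
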
The output frame size is fixed at `704 x 1280`; input images are resized
with their aspect ratio preserved and then center-cropped to that resolution.

## License

Released under the Apache 2.0 license. The bundled LTX-2 refiner and VAE
inherit the LTX-2 upstream license.