---
license: apache-2.0
tags:
- text-to-video
- image-to-video
- camera-control
- world-model
- diffusion
---

# SANA-WM (Bidirectional)

![mosaic_2x2_5s_600](https://cdn-uploads.huggingface.co/production/uploads/5f17f0a0925b9863e28ad517/W6I8j-VQA4CulZWROldru.gif)

**SANA-WM** is an efficient open-source world model trained natively for one-minute generation. The bidirectional checkpoint released here is a 2.6B-parameter image-to-video diffusion transformer that synthesises 720p, minute-scale videos with precise 6-DoF camera control, paired with the LTX-2 sink-bidirectional Euler refiner for high-fidelity decoding.

Four core designs drive the architecture:

1. **Hybrid Linear Attention**: frame-wise Gated DeltaNet combined with softmax attention every Nth block for memory-efficient long-context modelling.
2. **Dual-Branch Camera Control**: independent main and camera branches enable precise per-frame trajectory adherence.
3. **Two-Stage Generation Pipeline**: a long-video refiner stitched on top of Stage-1 latents improves quality and temporal consistency.
4. **Robust Annotation Pipeline**: metric-scale 6-DoF camera poses extracted from public video corpora yield spatiotemporally consistent action supervision.

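
To make the first design point concrete, here is a minimal sketch of how a "softmax attention every Nth block" layout could be enumerated. `block_schedule` is a hypothetical helper, and the depth of 8 and N = 4 are placeholder values chosen for illustration, not the released configuration. Placing softmax on the last block of each group of N (rather than the first) is also an assumption.

```python
def block_schedule(depth: int, n: int) -> list[str]:
    """Label each block: softmax attention on every n-th block, Gated DeltaNet elsewhere."""
    if depth < 1 or n < 1:
        raise ValueError("depth and n must be positive integers")
    # Assumption: the n-th, 2n-th, ... blocks (1-indexed) use softmax attention.
    return ["softmax" if (i + 1) % n == 0 else "gated_deltanet" for i in range(depth)]


# Placeholder values for illustration only.
schedule = block_schedule(depth=8, n=4)
for index, kind in enumerate(schedule):
    print(index, kind)
```

With such a layout, most blocks would use the linear-attention path, and only one block in every N would use full softmax attention.
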
Paper:

```bibtex
@article{zhu2026sanawm,
  title   = {{SANA-WM}: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer},
  author  = {Zhu, Haoyi and Liu, Haozhe and Zhao, Yuyang and Ye, Tian and Chen, Junsong and Yu, Jincheng and He, Tong and Han, Song and Xie, Enze},
  journal = {arXiv preprint arXiv:2605.15178},
  year    = {2026},
}
```

## Repository layout

| Component                          | Path in repo                         |  Size |
|------------------------------------|--------------------------------------|------:|
| Sana DiT (Stage 1)                 | `dit/sana_wm_1600m_720p.safetensors` | 10 GB |
| LTX-2 VAE (diffusers)              | `vae/`                               |  2 GB |
| LTX-2 refiner (Stage 2)            | `refiner/refiner.safetensors`        | 41 GB |
| Gemma text encoder for the refiner | `refiner/text_encoder/`              | 46 GB |
| Inference config                   | `config.yaml`                        |   n/a |

The Sana text encoder (`gemma-2-2b-it`) is **not** bundled here; it is fetched on demand from the public Hugging Face mirror.

## Usage

```bash
python inference_video_scripts/inference_sana_wm.py \
    --image asset/sana_wm/demo_0.png \
    --prompt asset/sana_wm/demo_0.txt \
    --action "w-80,jw-40,w-40,lw-60,w-100" \
    --translation_speed 0.055 \
    --rotation_speed_deg 1.2 \
    --num_frames 321 \
    --output_dir results/demo
```

Weights are fetched from this repository on first use. Pass `--no_refiner` to skip the LTX-2 refiner and decode Stage-1 latents with the Sana VAE instead. To run fully offline, override any of `--config` / `--model_path` / `--refiner_checkpoint` / `--refiner_gemma_root` with local paths.

## Inputs

| Argument       | Format |
|----------------|--------|
| `--image`      | RGB image (any PIL-readable format), used as the first frame. |
| `--prompt`     | UTF-8 text file containing the conditioning prompt. |
| `--camera`     | NumPy `.npy` of shape `(F, 4, 4)`: per-frame camera-to-world matrices. |
| `--action`     | WASD/IJKL DSL, e.g. `"w-80,jw-40,w-40,lw-60,w-100"`. We roll it out to a `(F+1, 4, 4)` trajectory. Mutually exclusive with `--camera`. |
| `--intrinsics` | Optional. `.npy` of shape `(3, 3)`, `(F, 3, 3)`, or `(4,)`. If omitted, we estimate intrinsics from `--image` with Pi3X and abort if the resulting FOV is outside `[25°, 120°]`. |

The output frame size is fixed at `704 x 1280`; input images are resized with their aspect ratio preserved and then center-cropped to that resolution.

## License

Released under the Apache 2.0 license. The bundled LTX-2 refiner and VAE inherit the LTX-2 upstream license.
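
## Example: rolling out an `--action` string

The Inputs table says an `--action` string is rolled out to a `(F+1, 4, 4)` trajectory. The sketch below shows one way such a rollout could work; it is not the repository's implementation. Several things are assumptions made for illustration: the key meanings (`w`/`s`/`a`/`d` translate, `j`/`l` yaw, `i`/`k` pitch), the OpenCV-style camera axes (+x right, +y down, +z forward), the identity starting pose, and the helper names `rot_x`, `rot_y`, and `rollout`. The default speeds are taken from the usage command.

```python
import numpy as np

# Assumed key bindings (not confirmed by the model card).
MOVE = {"w": (0, 0, 1), "s": (0, 0, -1), "a": (-1, 0, 0), "d": (1, 0, 0)}
YAW = {"j": -1.0, "l": 1.0}    # assumed: turn left / turn right
PITCH = {"i": 1.0, "k": -1.0}  # assumed: look up / look down


def rot_y(deg):
    """Rotation about the camera y axis (yaw)."""
    c, s = np.cos(np.radians(deg)), np.sin(np.radians(deg))
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])


def rot_x(deg):
    """Rotation about the camera x axis (pitch)."""
    c, s = np.cos(np.radians(deg)), np.sin(np.radians(deg))
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])


def rollout(action, translation_speed=0.055, rotation_speed_deg=1.2):
    """Turn 'keys-count' segments into stacked camera-to-world matrices."""
    pose = np.eye(4)        # assumed starting pose: identity
    poses = [pose.copy()]
    for segment in action.split(","):
        keys, count = segment.strip().split("-")
        step = np.eye(4)    # per-frame motion expressed in the camera frame
        for key in keys:
            if key in MOVE:
                step[:3, 3] += translation_speed * np.array(MOVE[key])
            elif key in YAW:
                step[:3, :3] = step[:3, :3] @ rot_y(YAW[key] * rotation_speed_deg)
            elif key in PITCH:
                step[:3, :3] = step[:3, :3] @ rot_x(PITCH[key] * rotation_speed_deg)
            else:
                raise ValueError(f"unknown action key: {key!r}")
        for _ in range(int(count)):
            pose = pose @ step  # compose in the local camera frame
            poses.append(pose.copy())
    return np.stack(poses)


traj = rollout("w-80,jw-40,w-40,lw-60,w-100")
print(traj.shape)  # (321, 4, 4)
```

The segment counts in the usage example sum to 320 steps, so the rollout has 321 poses, which matches `--num_frames 321`. An array like this could be written with `np.save`, but whether the inference script uses these exact conventions should be checked against its source before passing such a file to `--camera`.
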