Efficient-Large-Model/SANA-WM_bidirectional

+---
+license: apache-2.0
+tags:
+  - text-to-video
+  - image-to-video
+  - camera-control
+  - diffusion
+library_name: NVlabs-Sana
+---
+# SANA-WM (Bidirectional)
+A 2.6B-parameter image-to-video diffusion model conditioned on a per-frame
+camera trajectory, paired with the LTX-2 sink-bidirectional Euler refiner
+for high-fidelity decoding.
+| Component                          | Path in repo                         | Size  |
+|------------------------------------|--------------------------------------|-------|
+| Sana DiT (Stage 1)                 | `dit/sana_wm_1600m_720p.safetensors` | 10 GB |
+| LTX-2 VAE (diffusers)              | `vae/`                               |  2 GB |
+| LTX-2 refiner (Stage 2)            | `refiner/refiner.safetensors`        | 41 GB |
+| Gemma text encoder for the refiner | `refiner/text_encoder/`              | 46 GB |
+| Inference config                   | `config.yaml`                        |       |
+The Sana text encoder (`gemma-2-2b-it`) is **not** bundled here; it is
+fetched on demand from `Efficient-Large-Model/gemma-2-2b-it`.
+## Usage
+Install the NVlabs-Sana inference repo and run:
+```bash
+python inference_video_scripts/inference_sana_wm.py \
+  --image examples/scene/first_frame.png \
+  --prompt examples/scene/prompt.txt \
+  --camera examples/scene/camera.npy \
+  --intrinsics examples/scene/intrinsics.npy \
+  --output_dir results/demo
+```
+Weights are fetched from this repository on first use. Pass `--use_refiner`
+to enable the Stage-2 LTX-2 refiner; without it, the Sana VAE decodes the
+Stage-1 latents directly. To run offline, override `--config`,
+`--model_path`, `--refiner_checkpoint`, and `--refiner_gemma_root` with
+local paths as needed.
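For offline runs, the overrides can be assembled from a local copy of this repository. The sketch below assumes the local copy mirrors the layout in the component table and that each flag takes the corresponding path. That flag-to-path mapping and the helper name `offline_args` are illustrative assumptions, not taken from the inference script.

```python
from pathlib import Path


def offline_args(local_root):
    """Build CLI overrides pointing at a local copy of this repository.

    Assumption: the flag-to-path mapping below is inferred from the
    component table, not confirmed against inference_sana_wm.py.
    """
    root = Path(local_root)
    overrides = {
        "--config": root / "config.yaml",
        "--model_path": root / "dit" / "sana_wm_1600m_720p.safetensors",
        "--refiner_checkpoint": root / "refiner" / "refiner.safetensors",
        "--refiner_gemma_root": root / "refiner" / "text_encoder",
    }
    # Flatten into an argv-style list: [flag, value, flag, value, ...]
    args = []
    for flag, path in overrides.items():
        args += [flag, str(path)]
    return args


if __name__ == "__main__":
    # Hypothetical local directory; replace with the real download location.
    print(" ".join(offline_args("/data/SANA-WM_bidirectional")))
```

The printed flags can be appended to the command above. The Sana text encoder is not part of this repository, so these overrides alone do not cover it.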
+## Inputs
+| Argument       | Format                                                                                             |
+|----------------|----------------------------------------------------------------------------------------------------|
+| `--image`      | RGB image in any PIL-readable format; used as the first frame.                                     |
+| `--prompt`     | UTF-8 text file containing the conditioning prompt.                                                |
+| `--camera`     | NumPy `.npy`, shape `(F, 4, 4)`: camera-to-world matrices, with `F` equal to `--num_frames`.       |
+| `--intrinsics` | NumPy `.npy`, shape `(3, 3)`, `(F, 3, 3)`, or `(4,)` as `(fx, fy, cx, cy)`, in input-image pixels. |
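As an illustration of the file formats above, the following sketch writes a straight-line trajectory and a pinhole intrinsics vector with NumPy. The frame count, step size, focal length, principal point, the choice of +z as the direction of travel, and all helper names are assumptions made for the example; check the inference repo for the camera convention it actually expects.

```python
import numpy as np


def make_forward_trajectory(num_frames, step=0.05):
    """Return (F, 4, 4) camera-to-world matrices translating along +z.

    Assumption: +z is used as the travel direction purely for illustration.
    """
    # Start every frame from the identity pose, then offset the z translation.
    c2w = np.tile(np.eye(4, dtype=np.float32), (num_frames, 1, 1))
    c2w[:, 2, 3] = step * np.arange(num_frames, dtype=np.float32)
    return c2w


def make_intrinsics(fx, fy, cx, cy):
    """Return the (4,) form (fx, fy, cx, cy), in input-image pixels."""
    return np.array([fx, fy, cx, cy], dtype=np.float32)


def to_matrix(k):
    """Expand (fx, fy, cx, cy) into the equivalent (3, 3) pinhole matrix."""
    fx, fy, cx, cy = k
    return np.array([[fx, 0, cx], [0, fy, cy], [0, 0, 1]], dtype=np.float32)


if __name__ == "__main__":
    num_frames = 81  # hypothetical; must equal the value passed as --num_frames
    np.save("camera.npy", make_forward_trajectory(num_frames))
    # Hypothetical values for a 1280x720 input image.
    np.save("intrinsics.npy", make_intrinsics(900.0, 900.0, 640.0, 360.0))
```

Either the `(4,)` vector or the matrix from `to_matrix` fits the shapes listed for `--intrinsics`; the vector is the more compact choice when the intrinsics are constant across frames.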
+## License
+Released under the Apache 2.0 license. The refiner inherits the LTX-2
+upstream license; see the parent NVlabs-Sana repository for details.