Upload README.md with huggingface_hub
README.md
CHANGED
@@ -4,15 +4,31 @@ tags:
@@ -25,46 +41,55 @@ Paper: <https://arxiv.org/abs/2605.15178>
- text-to-video
- image-to-video
- camera-control
- world-model
- diffusion
library_name: NVlabs-Sana
---

# SANA-WM (Bidirectional)

**SANA-WM** is an efficient open-source world model trained natively for
one-minute generation. The bidirectional checkpoint released here is a
2.6B-parameter image-to-video diffusion transformer that synthesises
720p, minute-scale videos with precise 6-DoF camera control, paired with
the LTX-2 sink-bidirectional Euler refiner for high-fidelity decoding.

Four core designs drive the architecture:

1. **Hybrid Linear Attention**: frame-wise Gated DeltaNet combined with
   softmax attention every Nth block for memory-efficient long-context
   modelling.
2. **Dual-Branch Camera Control**: independent main and camera branches
   enable precise per-frame trajectory adherence.
3. **Two-Stage Generation Pipeline**: a long-video refiner stitched on
   top of Stage-1 latents improves quality and temporal consistency.
4. **Robust Annotation Pipeline**: metric-scale 6-DoF camera poses
   extracted from public video corpora yield spatiotemporally consistent
   action supervision.

Paper: <https://arxiv.org/abs/2605.15178>

|
|
|
| 41 |
}
|
| 42 |
```

## Repository layout

| Component                          | Path in repo                         |  Size |
|------------------------------------|--------------------------------------|------:|
| Sana DiT (Stage 1)                 | `dit/sana_wm_1600m_720p.safetensors` | 10 GB |
| LTX-2 VAE (diffusers)              | `vae/`                               |  2 GB |
| LTX-2 refiner (Stage 2)            | `refiner/refiner.safetensors`        | 41 GB |
| Gemma text encoder for the refiner | `refiner/text_encoder/`              | 46 GB |
| Inference config                   | `config.yaml`                        |     - |

The Sana text encoder (`gemma-2-2b-it`) is **not** bundled here; it is
fetched on demand from `Efficient-Large-Model/gemma-2-2b-it`.

## Usage

Install the inference repo (see [environment_setup_sana_wm.sh](https://github.com/NVlabs/Sana/blob/main/environment_setup_sana_wm.sh))
and run:

```bash
python inference_video_scripts/inference_sana_wm.py \
    --image asset/sana_wm/demo_0.png \
    --prompt asset/sana_wm/demo_0.txt \
    --action "w-80,jw-40,w-40,lw-60,w-100" \
    --translation_speed 0.055 \
    --rotation_speed_deg 1.2 \
    --num_frames 321 \
    --output_dir results/demo
```
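
The `--action` value in the command looks like a comma-separated list of `<keys>-<count>` segments. The sketch below only tokenizes such a string and totals the counts. It assumes that keys are drawn from the letters `wasdijkl` and that each number is a frame count. The frame-count reading is consistent with the example (the counts sum to 320, one less than `--num_frames 321`), but the README does not confirm it. `parse_action` is a hypothetical helper, not part of the Sana repository, and it does not interpret what each key does.

```python
import re

# Hypothetical helper, not part of the Sana codebase.
# Assumption: every segment has the form "<letters from wasdijkl>-<integer>".
_SEGMENT = re.compile(r"^([wasdijkl]+)-(\d+)$")


def parse_action(action: str) -> list[tuple[str, int]]:
    """Split an --action string into (keys, count) pairs."""
    segments = []
    for token in action.split(","):
        match = _SEGMENT.match(token.strip())
        if match is None:
            raise ValueError(f"malformed action segment: {token!r}")
        segments.append((match.group(1), int(match.group(2))))
    return segments


segments = parse_action("w-80,jw-40,w-40,lw-60,w-100")
total_steps = sum(count for _, count in segments)
print(segments)
print(total_steps)  # 320
```

A check like this can catch a typo in the action string before a long generation run starts.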

Weights are fetched from this repository on first use. Pass `--no_refiner`
to skip the LTX-2 refiner and decode Stage-1 latents with the Sana VAE
instead. To run fully offline, override any of `--config` / `--model_path` /
`--refiner_checkpoint` / `--refiner_gemma_root` with local paths.
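
One way to prepare an offline run is to mirror this repository once and then point the override flags at the local copy. The sketch below uses `snapshot_download` from `huggingface_hub`. The pairing of each flag with a path from the "Repository layout" table is an assumption, since the README lists the flags and the paths but does not pair them. The directory `checkpoints/sana_wm` and both helper names are hypothetical. The separately hosted Sana text encoder (`gemma-2-2b-it`) is not covered here.

```python
from pathlib import Path


def offline_args(local_root: str) -> list[str]:
    """Build CLI overrides that point at a local copy of this repository.

    Assumption: each flag corresponds to the path listed in the
    "Repository layout" table; the README does not state this mapping.
    """
    root = Path(local_root)
    return [
        "--config", str(root / "config.yaml"),
        "--model_path", str(root / "dit" / "sana_wm_1600m_720p.safetensors"),
        "--refiner_checkpoint", str(root / "refiner" / "refiner.safetensors"),
        "--refiner_gemma_root", str(root / "refiner" / "text_encoder"),
    ]


def download_repo(local_root: str) -> str:
    """Fetch the whole model repository (roughly 100 GB per the layout table).

    Call this once while online; it is deliberately not invoked below.
    """
    # Imported lazily so offline_args works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="Efficient-Large-Model/SANA-WM_bidirectional",
        local_dir=local_root,
    )


# Hypothetical target directory; append these arguments to the inference command.
print(" ".join(offline_args("checkpoints/sana_wm")))
```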

## Inputs

| Argument       | Format |
|----------------|--------|
| `--image`      | RGB image (any PIL-readable format), used as the first frame. |
| `--prompt`     | UTF-8 text file containing the conditioning prompt. |
| `--camera`     | NumPy `.npy` of shape `(F, 4, 4)`: per-frame camera-to-world matrices. |
| `--action`     | WASD/IJKL DSL, e.g. `"w-80,jw-40,w-40,lw-60,w-100"`. We roll it out to a `(F+1, 4, 4)` trajectory. Mutually exclusive with `--camera`. |
| `--intrinsics` | Optional. `.npy` of shape `(3, 3)`, `(F, 3, 3)`, or `(4,)`. If omitted, we estimate intrinsics from `--image` with Pi3X and abort if the resulting FOV is outside `[25°, 120°]`. |
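
As a rough illustration of the `--camera` and `--intrinsics` file formats, the sketch below writes a `(F, 4, 4)` pose array and a `(3, 3)` pinhole matrix with NumPy. Several details are assumptions not stated in the README: motion along the +z axis with identity rotation, a step of `0.05` per frame, intrinsics expressed in pixels at a 1280-wide by 704-high frame, and a 60° horizontal FOV chosen to sit inside the `[25°, 120°]` range mentioned for estimated intrinsics. Both function names and both output file names are hypothetical.

```python
import math

import numpy as np


def forward_trajectory(num_frames: int, step: float) -> np.ndarray:
    """Camera-to-world matrices of shape (F, 4, 4) sliding along one axis.

    Assumption: identity rotation and translation along +z; the axis
    convention expected by the model is not documented in the README.
    """
    poses = np.tile(np.eye(4, dtype=np.float32), (num_frames, 1, 1))
    poses[:, 2, 3] = step * np.arange(num_frames, dtype=np.float32)
    return poses


def pinhole_intrinsics(width: int, height: int, hfov_deg: float) -> np.ndarray:
    """3x3 pinhole matrix from a horizontal field of view (square pixels)."""
    focal = 0.5 * width / math.tan(math.radians(hfov_deg) / 2.0)
    return np.array(
        [[focal, 0.0, width / 2.0],
         [0.0, focal, height / 2.0],
         [0.0, 0.0, 1.0]],
        dtype=np.float32,
    )


camera = forward_trajectory(num_frames=321, step=0.05)
intrinsics = pinhole_intrinsics(width=1280, height=704, hfov_deg=60.0)

np.save("camera.npy", camera)          # array of shape (321, 4, 4)
np.save("intrinsics.npy", intrinsics)  # array of shape (3, 3)
print(camera.shape, intrinsics.shape)
```

The resulting files would be passed as `--camera camera.npy --intrinsics intrinsics.npy`, in place of `--action`.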

The output frame size is fixed at `704 x 1280`; input images are
aspect-preserving resized + center-cropped to that resolution.
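
To preview how an input image will be framed, the resize-and-crop step can be approximated with Pillow. This is a minimal sketch, not the repository's preprocessing code. It assumes `704 x 1280` means height by width, and the Lanczos filter is an arbitrary choice. `resize_center_crop` is a hypothetical helper.

```python
from PIL import Image

TARGET_H, TARGET_W = 704, 1280  # assumption: "704 x 1280" is height x width


def resize_center_crop(
    img: Image.Image, width: int = TARGET_W, height: int = TARGET_H
) -> Image.Image:
    """Scale the image until it covers the target, then crop the centre."""
    scale = max(width / img.width, height / img.height)
    # max() guards against rounding one pixel below the target size.
    new_w = max(width, round(img.width * scale))
    new_h = max(height, round(img.height * scale))
    resized = img.convert("RGB").resize((new_w, new_h), Image.LANCZOS)
    left = (new_w - width) // 2
    top = (new_h - height) // 2
    return resized.crop((left, top, left + width, top + height))


demo = Image.new("RGB", (1024, 1024), "gray")  # stand-in for a real input image
print(resize_center_crop(demo).size)  # (1280, 704)
```

Square or portrait inputs lose a large part of their height under this scheme, so landscape images close to 16:9 keep the most content.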

## License

Released under the Apache 2.0 license. The bundled LTX-2 refiner and VAE
inherit the LTX-2 upstream license; see the parent NVlabs-Sana
repository for details.