Efficient-Large-Model/SANA-WM_bidirectional

@@ -4,15 +4,31 @@ tags:
   - text-to-video
   - image-to-video
   - camera-control
   - diffusion
 library_name: NVlabs-Sana
 ---
 # SANA-WM (Bidirectional)
-A 2.6 B parameter image-to-video diffusion model conditioned on a per-frame
-camera trajectory, paired with the LTX-2 sink-bidirectional Euler refiner
-for high-fidelity decoding.
 Paper: <https://arxiv.org/abs/2605.15178>
@@ -25,46 +41,55 @@ Paper: <https://arxiv.org/abs/2605.15178>
 }
 ```
-| Component                  | Path in repo                              | Size  |
-|----------------------------|-------------------------------------------|-------|
-| Sana DiT (Stage 1)         | `dit/sana_wm_1600m_720p.safetensors`      | 10 GB |
-| LTX-2 VAE (diffusers)      | `vae/`                                    |  2 GB |
-| LTX-2 refiner (Stage 2)    | `refiner/refiner.safetensors`             | 41 GB |
-| Gemma text encoder for the refiner | `refiner/text_encoder/`           | 46 GB |
-| Inference config           | `config.yaml`                             |       |
 The Sana text encoder (`gemma-2-2b-it`) is **not** bundled here — it is
 fetched on demand from `Efficient-Large-Model/gemma-2-2b-it`.
 ## Usage
-Install the inference repo and run:
 ```bash
 python inference_video_scripts/inference_sana_wm.py \
-  --image examples/scene/first_frame.png \
-  --prompt examples/scene/prompt.txt \
-  --camera examples/scene/camera.npy \
-  --intrinsics examples/scene/intrinsics.npy \
   --output_dir results/demo
 ```
-Weights are fetched from this repository on first use. Pass `--use_refiner`
-to enable the Stage-2 LTX-2 refiner; without it, the Sana VAE decodes the
-Stage-1 latents directly. To run entirely offline, override any of
-`--config` / `--model_path` / `--refiner_checkpoint` / `--refiner_gemma_root`
-with local paths.
 ## Inputs
-| Argument          | Format                                                                                  |
-|-------------------|-----------------------------------------------------------------------------------------|
-| `--image`         | RGB image (any PIL-readable format) — used as the first frame.                          |
-| `--prompt`        | UTF-8 text file containing the conditioning prompt.                                     |
-| `--camera`        | NumPy `.npy`, shape `(F, 4, 4)`, camera-to-world matrices for `F = --num_frames`.       |
-| `--intrinsics`    | NumPy `.npy`, shape `(3, 3)`, `(F, 3, 3)`, or `(4,) = (fx, fy, cx, cy)` in input pixels.|
 ## License
-Released under the Apache 2.0 license. The refiner inherits the LTX-2
-upstream license; see the parent NVlabs-Sana repository for details.

   - text-to-video
   - image-to-video
   - camera-control
+  - world-model
   - diffusion
 library_name: NVlabs-Sana
 ---
 # SANA-WM (Bidirectional)
+**SANA-WM** is an efficient open-source world model trained natively for
+one-minute generation. The bidirectional checkpoint released here is a
+2.6B-parameter image-to-video diffusion transformer that synthesises
+720p, minute-scale videos with precise 6-DoF camera control, paired with
+the LTX-2 sink-bidirectional Euler refiner for high-fidelity decoding.
+Four core designs underpin the model:
+1. **Hybrid Linear Attention** — frame-wise Gated DeltaNet combined with
+   softmax attention every Nth block for memory-efficient long-context
+   modelling.
+2. **Dual-Branch Camera Control** — independent main and camera branches
+   enable precise per-frame trajectory adherence.
+3. **Two-Stage Generation Pipeline** — a long-video refiner stitched on
+   top of Stage-1 latents improves quality and temporal consistency.
+4. **Robust Annotation Pipeline** — metric-scale 6-DoF camera poses
+   extracted from public video corpora yield spatiotemporally consistent
+   action supervision.
 Paper: <https://arxiv.org/abs/2605.15178>
 }
 ```
+## Repository layout
+| Component                          | Path in repo                              | Size  |
+|------------------------------------|-------------------------------------------|------:|
+| Sana DiT (Stage 1)                 | `dit/sana_wm_1600m_720p.safetensors`      | 10 GB |
+| LTX-2 VAE (diffusers)              | `vae/`                                    |  2 GB |
+| LTX-2 refiner (Stage 2)            | `refiner/refiner.safetensors`             | 41 GB |
+| Gemma text encoder for the refiner | `refiner/text_encoder/`                   | 46 GB |
+| Inference config                   | `config.yaml`                             |     — |
 The Sana text encoder (`gemma-2-2b-it`) is **not** bundled here — it is
 fetched on demand from `Efficient-Large-Model/gemma-2-2b-it`.
 ## Usage
+Install the inference repo (see [environment_setup_sana_wm.sh](https://github.com/NVlabs/Sana/blob/main/environment_setup_sana_wm.sh))
+and run:
 ```bash
 python inference_video_scripts/inference_sana_wm.py \
+  --image      asset/sana_wm/demo_0.png \
+  --prompt     asset/sana_wm/demo_0.txt \
+  --action     "w-80,jw-40,w-40,lw-60,w-100" \
+  --translation_speed 0.055 \
+  --rotation_speed_deg 1.2 \
+  --num_frames 321 \
   --output_dir results/demo
 ```
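The README does not define the `--action` grammar, so the following is only a sanity-check sketch under a stated assumption: each comma-separated segment has the form `<keys>-<count>`, where `<count>` is a number of steps. Under that reading the demo string totals 320 steps, one fewer than `--num_frames 321`. The helper names `parse_action_string` and `total_steps` are hypothetical and are not part of the inference script.

```python
def parse_action_string(action: str) -> list[tuple[str, int]]:
    """Split e.g. "w-80,jw-40" into [("w", 80), ("jw", 40)].

    ASSUMPTION: the text before the last hyphen is a key combination and the
    number after it is a step count. The README does not confirm this.
    """
    segments = []
    for part in action.split(","):
        keys, _, count = part.strip().rpartition("-")
        if not keys or not count.isdigit():
            raise ValueError(f"malformed segment: {part!r}")
        segments.append((keys, int(count)))
    return segments


def total_steps(action: str) -> int:
    """Sum the per-segment counts of an action string."""
    return sum(count for _, count in parse_action_string(action))


demo = "w-80,jw-40,w-40,lw-60,w-100"
print(parse_action_string(demo))
print(total_steps(demo))  # 320
```

A check like this can catch a mismatch between the action string and `--num_frames` before a long generation run starts, if the assumption about the grammar holds.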
+Weights are fetched from this repository on first use. Pass `--no_refiner`
+to skip the LTX-2 refiner and decode Stage-1 latents with the Sana VAE
+instead. To run fully offline, override any of `--config` / `--model_path` /
+`--refiner_checkpoint` / `--refiner_gemma_root` with local paths.
 ## Inputs
+| Argument            | Format                                                                                  |
+|---------------------|-----------------------------------------------------------------------------------------|
+| `--image`           | RGB image (any PIL-readable format) — used as the first frame.                          |
+| `--prompt`          | UTF-8 text file containing the conditioning prompt.                                     |
+| `--camera`          | NumPy `.npy` of shape `(F, 4, 4)` — per-frame camera-to-world matrices.                  |
+| `--action`          | WASD/IJKL DSL, e.g. `"w-80,jw-40,w-40,lw-60,w-100"`. We roll it out to a `(F+1, 4, 4)` trajectory. Mutually exclusive with `--camera`. |
+| `--intrinsics`      | Optional. `.npy` of shape `(3, 3)`, `(F, 3, 3)`, or `(4,)`. If omitted, we estimate intrinsics from `--image` with Pi3X and abort if the resulting FOV is outside `[25°, 120°]`. |
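As an illustration of the array shapes in the table, here is a minimal sketch that builds a `(F, 4, 4)` camera-to-world trajectory and a `(3, 3)` intrinsics matrix with NumPy. Several things here are assumptions rather than facts from the README: the camera moves along its +z axis with identity rotation (the axis convention is not stated), the dtype is `float32`, and the focal length and principal point are made-up values for a 1280 x 704 image. The function names are hypothetical.

```python
import numpy as np


def forward_trajectory(num_frames: int, step: float = 0.05) -> np.ndarray:
    """Return (F, 4, 4) camera-to-world matrices for a straight-line move.

    ASSUMPTION: identity rotation and translation along +z; the README does
    not say which axis convention the model expects.
    """
    poses = np.tile(np.eye(4, dtype=np.float32), (num_frames, 1, 1))
    poses[:, 2, 3] = step * np.arange(num_frames, dtype=np.float32)
    return poses


def intrinsics_matrix(fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Pack pinhole parameters (in pixels) into a (3, 3) matrix."""
    return np.array(
        [[fx, 0.0, cx],
         [0.0, fy, cy],
         [0.0, 0.0, 1.0]],
        dtype=np.float32,
    )


poses = forward_trajectory(num_frames=321)
K = intrinsics_matrix(fx=900.0, fy=900.0, cx=640.0, cy=352.0)  # illustrative values
print(poses.shape, K.shape)  # (321, 4, 4) (3, 3)
```

Each array can then be written with `np.save("camera.npy", poses)` and `np.save("intrinsics.npy", K)` and passed to `--camera` and `--intrinsics`.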
+The output frame size is fixed at `704 x 1280`; input images are resized
+with their aspect ratio preserved, then center-cropped to that resolution.
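The resize and crop geometry can be sketched in a few lines of plain Python. This assumes `704 x 1280` means height x width, and that the image is scaled just enough to cover the target before the center crop; the rounding and interpolation used by the actual script may differ. The function name is hypothetical.

```python
def resize_then_center_crop(width: int, height: int,
                            target_w: int = 1280, target_h: int = 704):
    """Return the resized size and the crop box (left, top, right, bottom).

    ASSUMPTION: the image is scaled uniformly so it fully covers the target,
    then the excess is trimmed equally from both sides.
    """
    scale = max(target_w / width, target_h / height)
    # Guard against float rounding leaving the image one pixel too small.
    new_w = max(target_w, round(width * scale))
    new_h = max(target_h, round(height * scale))
    left = (new_w - target_w) // 2
    top = (new_h - target_h) // 2
    return (new_w, new_h), (left, top, left + target_w, top + target_h)


print(resize_then_center_crop(1920, 1080))  # ((1280, 720), (0, 8, 1280, 712))
```

For a 1920 x 1080 input this keeps the full width and trims 8 rows from the top and bottom, so wide inputs lose little content while square or portrait inputs are cropped heavily.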
 ## License
+Released under the Apache 2.0 license. The bundled LTX-2 refiner and VAE
+inherit the LTX-2 upstream license; see the parent NVlabs-Sana
+repository for details.