HaoyiZhu committed
Commit f6ea8dc · verified · 1 Parent(s): f39d702

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +53 -28
README.md CHANGED
@@ -4,15 +4,31 @@ tags:
4
  - text-to-video
5
  - image-to-video
6
  - camera-control
 
7
  - diffusion
8
  library_name: NVlabs-Sana
9
  ---
10
 
11
  # SANA-WM (Bidirectional)
12
 
13
- A 2.6 B parameter image-to-video diffusion model conditioned on a per-frame
14
- camera trajectory, paired with the LTX-2 sink-bidirectional Euler refiner
15
- for high-fidelity decoding.
16
 
17
  Paper: <https://arxiv.org/abs/2605.15178>
18
 
@@ -25,46 +41,55 @@ Paper: <https://arxiv.org/abs/2605.15178>
25
  }
26
  ```
27
 
28
- | Component | Path in repo | Size |
29
- |----------------------------|-------------------------------------------|-------|
30
- | Sana DiT (Stage 1) | `dit/sana_wm_1600m_720p.safetensors` | 10 GB |
31
- | LTX-2 VAE (diffusers) | `vae/` | 2 GB |
32
- | LTX-2 refiner (Stage 2) | `refiner/refiner.safetensors` | 41 GB |
33
- | Gemma text encoder for the refiner | `refiner/text_encoder/` | 46 GB |
34
- | Inference config | `config.yaml` | |
 
 
35
 
36
  The Sana text encoder (`gemma-2-2b-it`) is **not** bundled here — it is
37
  fetched on demand from `Efficient-Large-Model/gemma-2-2b-it`.
38
 
39
  ## Usage
40
 
41
- Install the inference repo and run:
 
42
 
43
  ```bash
44
  python inference_video_scripts/inference_sana_wm.py \
45
- --image examples/scene/first_frame.png \
46
- --prompt examples/scene/prompt.txt \
47
- --camera examples/scene/camera.npy \
48
- --intrinsics examples/scene/intrinsics.npy \
 
 
49
  --output_dir results/demo
50
  ```
51
 
52
- Weights are fetched from this repository on first use. Pass `--use_refiner`
53
- to enable the Stage-2 LTX-2 refiner; without it, the Sana VAE decodes the
54
- Stage-1 latents directly. To run entirely offline, override any of
55
- `--config` / `--model_path` / `--refiner_checkpoint` / `--refiner_gemma_root`
56
- with local paths.
57
 
58
  ## Inputs
59
 
60
- | Argument | Format |
61
- |-------------------|-----------------------------------------------------------------------------------------|
62
- | `--image` | RGB image (any PIL-readable format) — used as the first frame. |
63
- | `--prompt` | UTF-8 text file containing the conditioning prompt. |
64
- | `--camera` | NumPy `.npy`, shape `(F, 4, 4)`, camera-to-world matrices for `F = --num_frames`. |
65
- | `--intrinsics` | NumPy `.npy`, shape `(3, 3)`, `(F, 3, 3)`, or `(4,) = (fx, fy, cx, cy)` in input pixels.|
 
 
 
 
66
 
67
  ## License
68
 
69
- Released under the Apache 2.0 license. The refiner inherits the LTX-2
70
- upstream license; see the parent NVlabs-Sana repository for details.
 
 
4
  - text-to-video
5
  - image-to-video
6
  - camera-control
7
+ - world-model
8
  - diffusion
9
  library_name: NVlabs-Sana
10
  ---
11
 
12
  # SANA-WM (Bidirectional)
13
 
14
+ **SANA-WM** is an efficient open-source world model trained natively for
15
+ one-minute generation. The bidirectional checkpoint released here is a
16
+ 2.6B-parameter image-to-video diffusion transformer that synthesises
17
+ 720p, minute-scale videos with precise 6-DoF camera control, paired with
18
+ the LTX-2 sink-bidirectional Euler refiner for high-fidelity decoding.
19
+
20
+ Four core designs drive the architecture:
21
+
22
+ 1. **Hybrid Linear Attention** — frame-wise Gated DeltaNet combined with
23
+ softmax attention every Nth block for memory-efficient long-context
24
+ modelling.
25
+ 2. **Dual-Branch Camera Control** — independent main and camera branches
26
+ enable precise per-frame trajectory adherence.
27
+ 3. **Two-Stage Generation Pipeline** — a long-video refiner stitched on
28
+ top of Stage-1 latents improves quality and temporal consistency.
29
+ 4. **Robust Annotation Pipeline** — metric-scale 6-DoF camera poses
30
+ extracted from public video corpora yield spatiotemporally consistent
31
+ action supervision.
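
To make the first design point concrete, here is a minimal sketch of a block schedule in which every Nth block uses softmax attention and the remaining blocks use Gated DeltaNet. The function name, the label strings, the example values of `num_blocks` and `n`, and the choice to put softmax on the last block of each group are all assumptions for illustration; the README does not state them.

```python
def hybrid_schedule(num_blocks: int, n: int) -> list[str]:
    """Return one attention-type label per block (hypothetical layout).

    Assumption: the softmax block is the last one in every group of n blocks.
    """
    if num_blocks < 0 or n < 1:
        raise ValueError("num_blocks must be >= 0 and n must be >= 1")
    return [
        "softmax" if (i + 1) % n == 0 else "gated_deltanet"
        for i in range(num_blocks)
    ]


# Illustrative values only; the real model's depth and N are not given here.
print(hybrid_schedule(8, 4))
```

With a layout like this, only `num_blocks // n` blocks use full softmax attention, which is where the memory saving on long sequences would come from.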
32
 
33
  Paper: <https://arxiv.org/abs/2605.15178>
34
 
 
41
  }
42
  ```
43
 
44
+ ## Repository layout
45
+
46
+ | Component | Path in repo | Size |
47
+ |------------------------------------|-------------------------------------------|------:|
48
+ | Sana DiT (Stage 1) | `dit/sana_wm_1600m_720p.safetensors` | 10 GB |
49
+ | LTX-2 VAE (diffusers) | `vae/` | 2 GB |
50
+ | LTX-2 refiner (Stage 2) | `refiner/refiner.safetensors` | 41 GB |
51
+ | Gemma text encoder for the refiner | `refiner/text_encoder/` | 46 GB |
52
+ | Inference config | `config.yaml` | — |
53
 
54
  The Sana text encoder (`gemma-2-2b-it`) is **not** bundled here — it is
55
  fetched on demand from `Efficient-Large-Model/gemma-2-2b-it`.
56
 
57
  ## Usage
58
 
59
+ Install the inference repo (see [environment_setup_sana_wm.sh](https://github.com/NVlabs/Sana/blob/main/environment_setup_sana_wm.sh))
60
+ and run:
61
 
62
  ```bash
63
  python inference_video_scripts/inference_sana_wm.py \
64
+ --image asset/sana_wm/demo_0.png \
65
+ --prompt asset/sana_wm/demo_0.txt \
66
+ --action "w-80,jw-40,w-40,lw-60,w-100" \
67
+ --translation_speed 0.055 \
68
+ --rotation_speed_deg 1.2 \
69
+ --num_frames 321 \
70
  --output_dir results/demo
71
  ```
72
 
73
+ Weights are fetched from this repository on first use. Pass `--no_refiner`
74
+ to skip the LTX-2 refiner and decode Stage-1 latents with the Sana VAE
75
+ instead. To run fully offline, override any of `--config` / `--model_path` /
76
+ `--refiner_checkpoint` / `--refiner_gemma_root` with local paths.
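
The `--action` string in the command above can be split into segments with a few lines of Python. This is a sketch only: the `<keys>-<count>` grammar is inferred from the single example string, `parse_action` and `total_steps` are hypothetical helpers rather than functions from the inference repo, and the meaning of each key is deliberately left uninterpreted.

```python
import re

# Assumed grammar: lowercase key letters, a hyphen, then an integer count.
_SEGMENT = re.compile(r"^([a-z]+)-(\d+)$")


def parse_action(action: str) -> list[tuple[str, int]]:
    """Split an action string into (keys, count) pairs."""
    segments = []
    for token in action.split(","):
        match = _SEGMENT.match(token.strip())
        if match is None:
            raise ValueError(f"malformed action segment: {token!r}")
        segments.append((match.group(1), int(match.group(2))))
    return segments


def total_steps(action: str) -> int:
    """Sum the counts of all segments."""
    return sum(count for _, count in parse_action(action))


demo = "w-80,jw-40,w-40,lw-60,w-100"
print(parse_action(demo))
print(total_steps(demo))
```

For the demo string the counts add up to 320, one less than `--num_frames 321`; whether that relationship is required by the script is not stated here.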
 
77
 
78
  ## Inputs
79
 
80
+ | Argument | Format |
81
+ |---------------------|-----------------------------------------------------------------------------------------|
82
+ | `--image` | RGB image (any PIL-readable format) — used as the first frame. |
83
+ | `--prompt` | UTF-8 text file containing the conditioning prompt. |
84
+ | `--camera` | NumPy `.npy` of shape `(F, 4, 4)` — per-frame camera-to-world matrices. |
85
+ | `--action` | WASD/IJKL DSL, e.g. `"w-80,jw-40,w-40,lw-60,w-100"`. We roll it out to a `(F+1, 4, 4)` trajectory. Mutually exclusive with `--camera`. |
86
+ | `--intrinsics` | Optional. `.npy` of shape `(3, 3)`, `(F, 3, 3)`, or `(4,)`. If omitted, we estimate intrinsics from `--image` with Pi3X and abort if the resulting FOV is outside `[25°, 120°]`. |
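
As a sketch of the `--intrinsics` formats and the FOV bound in the table above, the snippet below expands a `(4,)` vector into a `(3, 3)` pinhole matrix and checks a field of view against `[25°, 120°]`. Assumptions: the `(fx, fy, cx, cy)` ordering is taken from the previous README revision, the check is applied to the horizontal FOV, and all function names are hypothetical; the actual script may compute FOV differently.

```python
import math


def intrinsics_from_vector(fx: float, fy: float, cx: float, cy: float) -> list[list[float]]:
    """Build a 3x3 pinhole intrinsics matrix from (fx, fy, cx, cy) in pixels."""
    return [
        [fx, 0.0, cx],
        [0.0, fy, cy],
        [0.0, 0.0, 1.0],
    ]


def horizontal_fov_deg(fx: float, image_width: float) -> float:
    """Horizontal field of view of a pinhole camera, in degrees."""
    return math.degrees(2.0 * math.atan(image_width / (2.0 * fx)))


def fov_in_range(fov_deg: float, lo: float = 25.0, hi: float = 120.0) -> bool:
    """True if the FOV lies inside the accepted interval (bounds from the table)."""
    return lo <= fov_deg <= hi


K = intrinsics_from_vector(640.0, 640.0, 640.0, 352.0)
fov = horizontal_fov_deg(K[0][0], 1280)
print(round(fov, 3), fov_in_range(fov))
```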
87
+
88
+ The output frame size is fixed at `704 x 1280`; input images are resized
89
+ with their aspect ratio preserved, then center-cropped to that resolution.
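
The resize-then-crop step can be sketched as plain geometry. This assumes `704 x 1280` means height 704 and width 1280, and that the image is scaled until it covers the target before cropping; `cover_resize_and_crop` is a hypothetical helper, not a function from the inference repo.

```python
def cover_resize_and_crop(src_w: int, src_h: int, dst_w: int = 1280, dst_h: int = 704):
    """Return the resized (width, height) and the centered crop box.

    The scale is the larger of the two axis ratios, so the resized image
    covers the target on both axes and the aspect ratio is unchanged.
    """
    scale = max(dst_w / src_w, dst_h / src_h)
    # Guard against rounding leaving the image one pixel short of the target.
    new_w = max(dst_w, round(src_w * scale))
    new_h = max(dst_h, round(src_h * scale))
    left = (new_w - dst_w) // 2
    top = (new_h - dst_h) // 2
    return (new_w, new_h), (left, top, left + dst_w, top + dst_h)


print(cover_resize_and_crop(1920, 1080))
print(cover_resize_and_crop(1000, 1000))
```

The two return values are in the form expected by Pillow's `Image.resize(size)` and `Image.crop(box)`, should you want to apply them to an actual image.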
90
 
91
  ## License
92
 
93
+ Released under the Apache 2.0 license. The bundled LTX-2 refiner and VAE
94
+ inherit the LTX-2 upstream license; see the parent NVlabs-Sana
95
+ repository for details.