HaoyiZhu committed · Commit a1c1c1f · verified · 1 Parent(s): ac30705

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +59 -0
README.md ADDED
@@ -0,0 +1,59 @@
+ ---
+ license: apache-2.0
+ tags:
+ - text-to-video
+ - image-to-video
+ - camera-control
+ - diffusion
+ library_name: NVlabs-Sana
+ ---
+
+ # SANA-WM (Bidirectional)
+
+ A 2.6B-parameter image-to-video diffusion model conditioned on a per-frame
+ camera trajectory, paired with the LTX-2 sink-bidirectional Euler refiner
+ for high-fidelity decoding.
+
+ | Component                          | Path in repo                         | Size  |
+ |------------------------------------|--------------------------------------|-------|
+ | Sana DiT (Stage 1)                 | `dit/sana_wm_1600m_720p.safetensors` | 10 GB |
+ | LTX-2 VAE (diffusers)              | `vae/`                               | 2 GB  |
+ | LTX-2 refiner (Stage 2)            | `refiner/refiner.safetensors`        | 41 GB |
+ | Gemma text encoder for the refiner | `refiner/text_encoder/`              | 46 GB |
+ | Inference config                   | `config.yaml`                        |       |
+
+ The Sana text encoder (`gemma-2-2b-it`) is **not** bundled here; it is
+ fetched on demand from `Efficient-Large-Model/gemma-2-2b-it`.
+
+ ## Usage
+
+ Install the inference repo and run:
+
+ ```bash
+ python inference_video_scripts/inference_sana_wm.py \
+   --image examples/scene/first_frame.png \
+   --prompt examples/scene/prompt.txt \
+   --camera examples/scene/camera.npy \
+   --intrinsics examples/scene/intrinsics.npy \
+   --output_dir results/demo
+ ```
+
+ Weights are fetched from this repository on first use. Pass `--use_refiner`
+ to enable the Stage-2 LTX-2 refiner; without it, the Sana VAE decodes the
+ Stage-1 latents directly. To run entirely offline, override any of
+ `--config` / `--model_path` / `--refiner_checkpoint` / `--refiner_gemma_root`
+ with local paths.
+
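As a rough illustration of the offline overrides described in the README text, the sketch below builds the extra command-line arguments from a local copy of this repository. The helper name `offline_args` and the local root `/data/sana-wm` are hypothetical. The mapping of each flag to a path from the component table is an assumption (in particular, that `--model_path` takes the DiT `.safetensors` file and `--refiner_gemma_root` takes the `refiner/text_encoder` directory). The Sana text encoder is fetched separately, so a fully offline run would presumably also need it cached locally.

```python
import shlex
from pathlib import PurePosixPath


def offline_args(root, use_refiner=False):
    """Build CLI overrides that point at a local copy of the repository.

    The flag-to-path mapping is an assumption based on the component table.
    """
    root = PurePosixPath(root)
    args = [
        "--config", str(root / "config.yaml"),
        "--model_path", str(root / "dit" / "sana_wm_1600m_720p.safetensors"),
    ]
    if use_refiner:
        # Stage-2 refiner weights and its Gemma text encoder directory.
        args += [
            "--use_refiner",
            "--refiner_checkpoint", str(root / "refiner" / "refiner.safetensors"),
            "--refiner_gemma_root", str(root / "refiner" / "text_encoder"),
        ]
    return args


# Print the overrides as a shell-quoted string to append to the command.
print(shlex.join(offline_args("/data/sana-wm", use_refiner=True)))
```

The printed string can be appended to the `inference_sana_wm.py` command shown in the Usage section.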
+ ## Inputs
+
+ | Argument       | Format                                                                                          |
+ |----------------|-------------------------------------------------------------------------------------------------|
+ | `--image`      | RGB image (any PIL-readable format), used as the first frame.                                   |
+ | `--prompt`     | UTF-8 text file containing the conditioning prompt.                                             |
+ | `--camera`     | NumPy `.npy`, shape `(F, 4, 4)`: camera-to-world matrices, one per frame (`F = --num_frames`).  |
+ | `--intrinsics` | NumPy `.npy`, shape `(3, 3)`, `(F, 3, 3)`, or `(4,)` as `(fx, fy, cx, cy)`, in input pixels.    |
+
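The array layouts in the Inputs table can be sketched as follows. This is a minimal example, not the repository's own tooling: the helper names are hypothetical, the trajectory is a simple straight-line move along the camera's +z axis (the axis convention the model expects is not stated in the README and is assumed here), and `float32` is an assumed dtype. The 3x3 matrix is the standard pinhole form of `(fx, fy, cx, cy)`.

```python
def intrinsics_matrix(fx, fy, cx, cy):
    """Expand (fx, fy, cx, cy) into the standard 3x3 pinhole matrix."""
    return [[fx, 0.0, cx],
            [0.0, fy, cy],
            [0.0, 0.0, 1.0]]


def dolly_trajectory(num_frames, step):
    """Return num_frames camera-to-world 4x4 matrices.

    Rotation is identity; the camera moves `step` units per frame along +z
    (assumed axis convention).
    """
    poses = []
    for i in range(num_frames):
        poses.append([[1.0, 0.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0, 0.0],
                      [0.0, 0.0, 1.0, i * step],
                      [0.0, 0.0, 0.0, 1.0]])
    return poses


def save_inputs(out_dir, poses, K):
    """Write camera.npy (F, 4, 4) and intrinsics.npy (3, 3). Requires NumPy."""
    from pathlib import Path
    import numpy as np

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    np.save(out / "camera.npy", np.asarray(poses, dtype=np.float32))
    np.save(out / "intrinsics.npy", np.asarray(K, dtype=np.float32))
```

Calling `save_inputs(out_dir, dolly_trajectory(F, step), intrinsics_matrix(fx, fy, cx, cy))` with `F` equal to the value passed as `--num_frames` would produce files in the shapes listed in the table.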
+ ## License
+
+ Released under the Apache 2.0 license. The refiner inherits the LTX-2
+ upstream license; see the parent NVlabs-Sana repository for details.