---
license: apache-2.0
tags:
  - text-to-video
  - image-to-video
  - camera-control
  - world-model
  - diffusion
---

# SANA-WM (Bidirectional)

**SANA-WM** is an efficient open-source world model trained natively for
one-minute generation. The bidirectional checkpoint released here is a
2.6B-parameter image-to-video diffusion transformer that synthesises
720p, minute-scale videos with precise 6-DoF camera control, paired with
the LTX-2 sink-bidirectional Euler refiner for high-fidelity decoding.

Four core designs drive the architecture:

1. **Hybrid Linear Attention** — frame-wise Gated DeltaNet combined with
   softmax attention every Nth block for memory-efficient long-context
   modelling.
2. **Dual-Branch Camera Control** — independent main and camera branches
   enable precise per-frame trajectory adherence.
3. **Two-Stage Generation Pipeline** — a long-video refiner stitched on
   top of Stage-1 latents improves quality and temporal consistency.
4. **Robust Annotation Pipeline** — metric-scale 6-DoF camera poses
   extracted from public video corpora yield spatiotemporally consistent
   action supervision.
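
The interleaving described in item 1 can be pictured as a per-block schedule. The snippet below is only an illustrative sketch: the function name `block_schedule`, the block count, the value of N, and the choice to place the softmax block last in each group of N are all assumptions, not details taken from the released model.

```python
def block_schedule(num_blocks: int, softmax_every: int) -> list[str]:
    """Label each transformer block by its attention type.

    Hypothetical layout: every `softmax_every`-th block uses softmax
    attention, all other blocks use frame-wise Gated DeltaNet.
    """
    if softmax_every < 1:
        raise ValueError("softmax_every must be >= 1")
    return [
        "softmax" if (i + 1) % softmax_every == 0 else "gated_deltanet"
        for i in range(num_blocks)
    ]


# Placeholder sizes chosen for readability, not the real model depth.
schedule = block_schedule(num_blocks=8, softmax_every=4)
for index, kind in enumerate(schedule):
    print(index, kind)
```

With these placeholder values, six of the eight blocks use linear attention, which is where the memory saving for long contexts would come from.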

Paper: <https://arxiv.org/abs/2605.15178>

```bibtex
@article{zhu2026sanawm,
  title   = {{SANA-WM}: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer},
  author  = {Zhu, Haoyi and Liu, Haozhe and Zhao, Yuyang and Ye, Tian and Chen, Junsong and Yu, Jincheng and He, Tong and Han, Song and Xie, Enze},
  journal = {arXiv preprint arXiv:2605.15178},
  year    = {2026},
}
```

## Repository layout

| Component                          | Path in repo                              | Size  |
|------------------------------------|-------------------------------------------|------:|
| Sana DiT (Stage 1)                 | `dit/sana_wm_1600m_720p.safetensors`      | 10 GB |
| LTX-2 VAE (diffusers)              | `vae/`                                    |  2 GB |
| LTX-2 refiner (Stage 2)            | `refiner/refiner.safetensors`             | 41 GB |
| Gemma text encoder for the refiner | `refiner/text_encoder/`                   | 46 GB |
| Inference config                   | `config.yaml`                             |     — |

The Sana text encoder (`gemma-2-2b-it`) is **not** bundled here — it is
fetched on demand from the public Hugging Face mirror.

## Usage

```bash
python inference_video_scripts/inference_sana_wm.py \
  --image      asset/sana_wm/demo_0.png \
  --prompt     asset/sana_wm/demo_0.txt \
  --action     "w-80,jw-40,w-40,lw-60,w-100" \
  --translation_speed 0.055 \
  --rotation_speed_deg 1.2 \
  --num_frames 321 \
  --output_dir results/demo
```
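
To make the `--action` string in the command above easier to read, here is a minimal parsing sketch. It assumes each comma-separated segment has the form `<keys>-<count>` and that the count is a number of frames; the helper `parse_action` is hypothetical and is not the parser used by the inference script. The meaning of the individual keys is not interpreted here.

```python
def parse_action(spec: str) -> list[tuple[str, int]]:
    """Split an action string into (keys, count) segments.

    Assumed grammar: segments separated by commas, each written as
    "<keys>-<count>" with a non-negative integer count.
    """
    segments = []
    for token in spec.split(","):
        keys, sep, count = token.strip().rpartition("-")
        if not sep or not keys or not count.isdigit():
            raise ValueError(f"malformed segment: {token!r}")
        segments.append((keys, int(count)))
    return segments


segments = parse_action("w-80,jw-40,w-40,lw-60,w-100")
total_steps = sum(count for _, count in segments)
print(segments)
print(total_steps)
```

Under this reading the example string covers 320 steps, one fewer than the `--num_frames 321` passed in the command, which would fit a trajectory that includes the starting pose.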

Weights are fetched from this repository on first use. Pass `--no_refiner`
to skip the LTX-2 refiner and decode Stage-1 latents with the Sana VAE
instead. To run fully offline, override any of `--config` / `--model_path` /
`--refiner_checkpoint` / `--refiner_gemma_root` with local paths.
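
As a sketch of the offline setup, the helper below builds the four override flags from a local copy of this repository. The mapping of each flag to a path from the repository layout table is an assumption, and both `offline_args` and the `/models/sana-wm` directory are hypothetical.

```python
from pathlib import Path


def offline_args(root: str) -> list[str]:
    """Build override flags that point at a local copy of the repository.

    The flag-to-path mapping is assumed from the repository layout table.
    """
    base = Path(root)
    return [
        "--config", str(base / "config.yaml"),
        "--model_path", str(base / "dit" / "sana_wm_1600m_720p.safetensors"),
        "--refiner_checkpoint", str(base / "refiner" / "refiner.safetensors"),
        "--refiner_gemma_root", str(base / "refiner" / "text_encoder"),
    ]


# Hypothetical local directory holding a downloaded copy of the weights.
args = offline_args("/models/sana-wm")
print(" ".join(args))
```

The printed flags can be appended to the command shown under Usage. Because the Sana text encoder is not bundled in this repository, it would presumably also need to be available in the local cache for a run without network access.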

## Inputs

| Argument            | Format                                                                                  |
|---------------------|-----------------------------------------------------------------------------------------|
| `--image`           | RGB image (any PIL-readable format) — used as the first frame.                          |
| `--prompt`          | UTF-8 text file containing the conditioning prompt.                                     |
| `--camera`          | NumPy `.npy` of shape `(F, 4, 4)` — per-frame camera-to-world matrices.                  |
| `--action`          | WASD/IJKL DSL, e.g. `"w-80,jw-40,w-40,lw-60,w-100"`. We roll it out to a `(F+1, 4, 4)` trajectory. Mutually exclusive with `--camera`. |
| `--intrinsics`      | Optional. `.npy` of shape `(3, 3)`, `(F, 3, 3)`, or `(4,)`. If omitted, we estimate intrinsics from `--image` with Pi3X and abort if the resulting FOV is outside `[25°, 120°]`. |
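
The array shapes in the table can be produced with a few lines of NumPy. This is a sketch under stated assumptions: the forward direction is taken to be +z in the camera-to-world matrix, the `(4,)` intrinsics form is not used because its element order is not documented here, the field of view is computed horizontally from `fx`, and every numeric value is a placeholder. The card does not say how `F` relates to `--num_frames`, so the pose count is left as a parameter.

```python
import numpy as np


def forward_trajectory(num_poses: int, step: float = 0.05) -> np.ndarray:
    """Return (num_poses, 4, 4) camera-to-world matrices moving along +z.

    Rotation stays at identity; only the translation column changes.
    """
    poses = np.tile(np.eye(4, dtype=np.float32), (num_poses, 1, 1))
    poses[:, 2, 3] = step * np.arange(num_poses, dtype=np.float32)
    return poses


def intrinsics_matrix(fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Return a (3, 3) pinhole intrinsics matrix."""
    return np.array([[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]], dtype=np.float32)


def horizontal_fov_deg(fx: float, width: int) -> float:
    """Horizontal field of view in degrees for a pinhole camera."""
    return float(np.degrees(2.0 * np.arctan(width / (2.0 * fx))))


poses = forward_trajectory(num_poses=321)                        # placeholder count
K = intrinsics_matrix(fx=900.0, fy=900.0, cx=640.0, cy=352.0)    # placeholder values
fov = horizontal_fov_deg(fx=900.0, width=1280)
print(poses.shape, K.shape, round(fov, 1))
```

Saving the arrays with `np.save("camera.npy", poses)` and `np.save("intrinsics.npy", K)` gives files in the `.npy` format that `--camera` and `--intrinsics` expect. Checking the field of view against the `[25°, 120°]` range beforehand is a cheap sanity test for hand-written intrinsics.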

The output frame size is fixed at `704 x 1280`; input images are resized
with their aspect ratio preserved and then center-cropped to that resolution.
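
The resize and crop geometry can be worked out without any image library. The sketch below assumes `704 x 1280` means height by width and that the image is scaled until it covers the target in both dimensions before cropping; the function `resize_then_center_crop` is hypothetical and may differ from the script's preprocessing in rounding details.

```python
def resize_then_center_crop(src_w: int, src_h: int, dst_w: int = 1280, dst_h: int = 704):
    """Return the resized (width, height) and the crop box (left, top, right, bottom).

    The scale factor is the larger of the two axis ratios, so the resized
    image covers the target and the aspect ratio is preserved.
    """
    scale = max(dst_w / src_w, dst_h / src_h)
    new_w = max(dst_w, round(src_w * scale))
    new_h = max(dst_h, round(src_h * scale))
    left = (new_w - dst_w) // 2
    top = (new_h - dst_h) // 2
    return (new_w, new_h), (left, top, left + dst_w, top + dst_h)


size, box = resize_then_center_crop(1920, 1080)
print(size, box)
```

For a 1920 x 1080 input this scales to 1280 x 720 and trims 8 rows from the top and bottom. Inputs with a very different aspect ratio, such as square images, lose far more content to the crop, which is worth keeping in mind when choosing the first frame.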

## License

Released under the Apache 2.0 license. The bundled LTX-2 refiner and VAE
inherit the LTX-2 upstream license.