	---
	license: apache-2.0
	tags:
	- text-to-video
	- image-to-video
	- camera-control
	- world-model
	- diffusion
	---

	# SANA-WM (Bidirectional)

	SANA-WM is an efficient open-source world model trained natively for
	one-minute generation. The bidirectional checkpoint released here is a
	2.6B-parameter image-to-video diffusion transformer that synthesises
	720p, minute-scale videos with precise 6-DoF camera control, paired with
	the LTX-2 sink-bidirectional Euler refiner for high-fidelity decoding.

	Four core designs drive the architecture:

	1. Hybrid Linear Attention — frame-wise Gated DeltaNet combined with
	softmax attention every Nth block for memory-efficient long-context
	modelling.
	2. Dual-Branch Camera Control — independent main and camera branches
	enable precise per-frame trajectory adherence.
	3. Two-Stage Generation Pipeline — a long-video refiner stitched on
	top of Stage-1 latents improves quality and temporal consistency.
	4. Robust Annotation Pipeline — metric-scale 6-DoF camera poses
	extracted from public video corpora yield spatiotemporally consistent
	action supervision.
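
To make the first design point concrete, here is a minimal sketch of how a hybrid stack could be laid out. It is not the released architecture: the function name, the block labels, the depth of 8, the period of 4, and the choice to place softmax attention on every Nth block counting from 1 are all assumptions made for illustration.

```python
def hybrid_block_layout(depth: int, softmax_every: int) -> list[str]:
    """Return one attention type per transformer block.

    Hypothetical layout: every `softmax_every`-th block (counting from 1)
    uses softmax attention; all other blocks use the linear mixer.
    """
    if depth <= 0 or softmax_every <= 0:
        raise ValueError("depth and softmax_every must be positive")
    return [
        "softmax" if (i + 1) % softmax_every == 0 else "gated_deltanet"
        for i in range(depth)
    ]


# Illustrative values only; the real depth and N are not stated here.
layout = hybrid_block_layout(depth=8, softmax_every=4)
print(layout)
```

With these illustrative values, six of the eight blocks are linear-attention blocks, which is the source of the memory savings on long sequences.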

	Paper: <https://arxiv.org/abs/2605.15178>

	```bibtex
	@article{zhu2026sanawm,
	title = {{SANA-WM}: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer},
	author = {Zhu, Haoyi and Liu, Haozhe and Zhao, Yuyang and Ye, Tian and Chen, Junsong and Yu, Jincheng and He, Tong and Han, Song and Xie, Enze},
	journal = {arXiv preprint arXiv:2605.15178},
	year = {2026},
	}
	```

	## Repository layout

	| Component                          | Path in repo                          | Size  |
	|------------------------------------|---------------------------------------|------:|
	| Sana DiT (Stage 1)                 | `dit/sana_wm_1600m_720p.safetensors`  | 10 GB |
	| LTX-2 VAE (diffusers)              | `vae/`                                | 2 GB  |
	| LTX-2 refiner (Stage 2)            | `refiner/refiner.safetensors`         | 41 GB |
	| Gemma text encoder for the refiner | `refiner/text_encoder/`               | 46 GB |
	| Inference config                   | `config.yaml`                         | n/a   |

	The Sana text encoder (`gemma-2-2b-it`) is not bundled here — it is
	fetched on demand from the public Hugging Face mirror.

	## Usage

	```bash
	python inference_video_scripts/inference_sana_wm.py \
	--image asset/sana_wm/demo_0.png \
	--prompt asset/sana_wm/demo_0.txt \
	--action "w-80,jw-40,w-40,lw-60,w-100" \
	--translation_speed 0.055 \
	--rotation_speed_deg 1.2 \
	--num_frames 321 \
	--output_dir results/demo
	```

	Weights are fetched from this repository on first use. Pass `--no_refiner`
	to skip the LTX-2 refiner and decode Stage-1 latents with the Sana VAE
	instead. To run fully offline, point each of `--config`, `--model_path`,
	`--refiner_checkpoint`, and `--refiner_gemma_root` at a local path.
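
The `--action` string in the command above can be sanity-checked before launching a long run. The sketch below is not the script's own parser. It assumes each comma-separated segment is `<keys>-<count>`, that the keys are lowercase letters from the WASD/IJKL set, and that the count is a number of steps; `parse_action` and `VALID_KEYS` are hypothetical names.

```python
VALID_KEYS = set("wasdijkl")  # assumed key set, taken from the "WASD/IJKL" label


def parse_action(action: str) -> list[tuple[str, int]]:
    """Split an action string such as 'w-80,jw-40' into (keys, count) pairs."""
    segments = []
    for chunk in action.split(","):
        keys, sep, count = chunk.strip().partition("-")
        if not sep or not keys or not count.isdigit():
            raise ValueError(f"malformed segment: {chunk!r}")
        if not set(keys) <= VALID_KEYS:
            raise ValueError(f"unknown key in segment: {chunk!r}")
        segments.append((keys, int(count)))
    return segments


segments = parse_action("w-80,jw-40,w-40,lw-60,w-100")
total_steps = sum(count for _, count in segments)
print(segments)
print(total_steps, total_steps + 1)  # 320 321
```

Under these assumptions the demo string sums to 320 steps, and 320 + 1 poses would be consistent with `--num_frames 321`.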

	## Inputs

	| Argument       | Format |
	|----------------|--------|
	| `--image`      | RGB image (any PIL-readable format), used as the first frame. |
	| `--prompt`     | UTF-8 text file containing the conditioning prompt. |
	| `--camera`     | NumPy `.npy` of shape `(F, 4, 4)`: per-frame camera-to-world matrices. |
	| `--action`     | WASD/IJKL DSL, e.g. `"w-80,jw-40,w-40,lw-60,w-100"`. We roll it out to a `(F+1, 4, 4)` trajectory. Mutually exclusive with `--camera`. |
	| `--intrinsics` | Optional. `.npy` of shape `(3, 3)`, `(F, 3, 3)`, or `(4,)`. If omitted, we estimate intrinsics from `--image` with Pi3X and abort if the resulting FOV is outside `[25°, 120°]`. |
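
If you supply your own intrinsics, it can help to check the implied field of view first. The sketch below assumes a pinhole camera, a focal length `fx` in pixels, and that the `[25°, 120°]` range refers to the horizontal FOV; the script's actual check may use a different definition. Both function names are hypothetical.

```python
import math


def horizontal_fov_deg(fx: float, image_width: int) -> float:
    """Horizontal FOV of a pinhole camera: 2 * atan(W / (2 * fx))."""
    if fx <= 0:
        raise ValueError("fx must be positive")
    return math.degrees(2.0 * math.atan(image_width / (2.0 * fx)))


def fov_in_range(fov_deg: float, lo: float = 25.0, hi: float = 120.0) -> bool:
    """True if the FOV lies inside the accepted range quoted in the table."""
    return lo <= fov_deg <= hi


# Example: fx equal to half the image width gives a 90 degree horizontal FOV.
fov = horizontal_fov_deg(fx=640.0, image_width=1280)
print(round(fov, 2), fov_in_range(fov))
```

A very long focal length, for example `fx=10000.0` at the same width, yields a FOV of only a few degrees and would fall outside the range under this definition.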

	The output frame size is fixed at `704 x 1280`. Input images are resized
	with their aspect ratio preserved, then center-cropped to that resolution.
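
The resize-then-crop geometry can be sketched in a few lines. This is an illustration rather than the script's own preprocessing: it assumes `704 x 1280` means height x width, that the image is scaled just enough to cover the target, and that rounding follows Python's `round`; `resize_and_crop_box` is a hypothetical helper.

```python
def resize_and_crop_box(src_w: int, src_h: int, dst_w: int = 1280, dst_h: int = 704):
    """Return (resized_size, crop_box) for a cover-style resize and center crop.

    resized_size is (width, height); crop_box is (left, top, right, bottom).
    """
    # Scale so that both sides are at least as large as the target.
    scale = max(dst_w / src_w, dst_h / src_h)
    new_w = max(dst_w, round(src_w * scale))
    new_h = max(dst_h, round(src_h * scale))
    # Center the target window inside the resized image.
    left = (new_w - dst_w) // 2
    top = (new_h - dst_h) // 2
    return (new_w, new_h), (left, top, left + dst_w, top + dst_h)


size, box = resize_and_crop_box(1920, 1080)
print(size, box)  # (1280, 720) (0, 8, 1280, 712)
```

With Pillow, the result could be applied as `img.resize(size).crop(box)`. A 1920x1080 input loses only 8 rows at the top and bottom, while a square input loses a much larger band, so framing the subject near the center is advisable.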

	## License

	Released under the Apache 2.0 license. The bundled LTX-2 refiner and VAE
	inherit the LTX-2 upstream license.