---
title: SANA-WM Camera-Controlled World Model
emoji: 🌍
colorFrom: indigo
colorTo: green
sdk: gradio
sdk_version: 6.14.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: Image-to-video with 6-DoF camera control.
models:
- Efficient-Large-Model/SANA-WM_bidirectional
- yyfz233/Pi3X
- google/gemma-2-2b-it
suggested_hardware: zero-a10g
header: default
---

# SANA-WM — Camera-Controlled World Model (ZeroGPU)

Demo of [`Efficient-Large-Model/SANA-WM_bidirectional`](https://huggingface.co/Efficient-Large-Model/SANA-WM_bidirectional)
from the [NVlabs/Sana](https://github.com/NVlabs/Sana) project ([feat/sana-wm PR branch](https://github.com/HaoyiZhu/NVlabs-Sana/tree/feat/sana-wm)).

* Upload a first frame and write a prompt.
* Build a camera trajectory with the **W A S D / I J K L** action queue
  (each tap appends a `<keys>-<frames>` segment to the trajectory DSL).
* The Sana DiT samples a `(704, 1280)` latent video conditioned on the
  6-DoF camera trajectory rolled out from that queue, then the Sana VAE
  decodes it.
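
To make the `<keys>-<frames>` segment format concrete, here is a minimal parsing sketch. It is not the Space's actual parser. The comma separator between segments, the function names `parse_action_queue` and `total_frames`, and the validation rules are assumptions for illustration; only the `<keys>-<frames>` shape and the W A S D / I J K L key set come from the description above.

```python
from typing import List, Tuple

# Keys offered by the action queue (from the UI description above).
VALID_KEYS = set("wasdijkl")


def parse_action_queue(dsl: str, sep: str = ",") -> List[Tuple[str, int]]:
    """Parse a string of `<keys>-<frames>` segments into (keys, frames) pairs.

    Assumption: segments are joined by `sep`; the real DSL may use a
    different separator.
    """
    segments: List[Tuple[str, int]] = []
    for raw in dsl.split(sep):
        raw = raw.strip()
        if not raw:
            continue  # tolerate empty pieces such as a trailing separator
        keys, dash, frames = raw.rpartition("-")
        if not dash or not keys:
            raise ValueError(f"expected <keys>-<frames>, got {raw!r}")
        keys = keys.lower()
        unknown = set(keys) - VALID_KEYS
        if unknown:
            raise ValueError(f"unknown keys {sorted(unknown)} in {raw!r}")
        if not (frames.isascii() and frames.isdigit()) or int(frames) <= 0:
            raise ValueError(f"frame count must be a positive integer in {raw!r}")
        segments.append((keys, int(frames)))
    return segments


def total_frames(segments: List[Tuple[str, int]]) -> int:
    """Sum the frame counts across all segments."""
    return sum(count for _, count in segments)
```

For example, `parse_action_queue("w-8, wa-4, l-12")` yields three segments covering 24 frames in total. How each key maps to a camera motion is left to the Space's own rollout code.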

The full pipeline ships an LTX-2 sink-bidirectional Euler refiner that
adds ~87 GB of weights. This Space runs **Stage-1 only** (`--no_refiner`)
to fit ZeroGPU; for refined output, run the CLI offline.

## Build notes

* The Sana repo is vendored under `./Sana/` and prepended to `sys.path`.
* `flash_attn` is stubbed at startup. SANA-WM only uses the Triton GDN
  path, but a few Sana modules do a top-level `from flash_attn import …`,
  and the stub lets those imports succeed without the package.
* Camera intrinsics are estimated with Pi3X from the input image; when
  running the CLI, pass `--intrinsics` to supply accurate values instead.
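
The first two build notes (the `sys.path` prepend and the `flash_attn` stub) can be sketched as below. This is an illustrative approach, not necessarily what `app.py` does: the helper names `prepend_vendored_repo` and `install_import_stub` are hypothetical, and the stub relies on module-level `__getattr__` (PEP 562) so that any imported name resolves to a placeholder that raises only if it is actually called.

```python
import importlib.machinery
import sys
import types
from pathlib import Path


def prepend_vendored_repo(repo_dir: str = "./Sana") -> str:
    """Put the vendored repo first on sys.path so its packages win imports."""
    path = str(Path(repo_dir).resolve())
    if path not in sys.path:
        sys.path.insert(0, path)
    return path


def install_import_stub(name: str = "flash_attn") -> types.ModuleType:
    """Register a placeholder module so `from <name> import X` succeeds."""
    if name in sys.modules:
        return sys.modules[name]  # already imported or already stubbed

    stub = types.ModuleType(name)
    stub.__spec__ = importlib.machinery.ModuleSpec(name, loader=None)

    def _module_getattr(attr: str):
        # Leave dunder lookups (e.g. __path__) failing normally so the
        # import machinery treats the stub as a plain module.
        if attr.startswith("__") and attr.endswith("__"):
            raise AttributeError(attr)

        def _unavailable(*args, **kwargs):
            raise RuntimeError(f"{name}.{attr} is a stub and cannot be called")

        return _unavailable

    stub.__getattr__ = _module_getattr  # PEP 562 module-level hook
    sys.modules[name] = stub
    return stub
```

Both helpers would need to run before the first Sana import. Deferring the failure to call time means a code path that unexpectedly reaches `flash_attn` fails with a clear error instead of at import.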