---
title: SANA-WM Camera-Controlled World Model
emoji: 🌍
colorFrom: indigo
colorTo: green
sdk: gradio
sdk_version: 6.14.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: Image-to-video with 6-DoF camera control.
models:
  - Efficient-Large-Model/SANA-WM_bidirectional
  - yyfz233/Pi3X
  - google/gemma-2-2b-it
suggested_hardware: zero-a10g
header: default
---

# SANA-WM: Camera-Controlled World Model (ZeroGPU)

Demo of [`Efficient-Large-Model/SANA-WM_bidirectional`](https://huggingface.co/Efficient-Large-Model/SANA-WM_bidirectional) from the [NVlabs/Sana](https://github.com/NVlabs/Sana) project ([feat/sana-wm PR branch](https://github.com/HaoyiZhu/NVlabs-Sana/tree/feat/sana-wm)).

* Upload a first frame + write a prompt.
* Build a camera trajectory with the **W A S D / I J K L** action queue (each tap appends a `-` segment to the DSL).
* The Sana DiT samples a `(704, 1280)` latent video conditioned on your rolled-out 6-DoF camera trajectory, then the Sana VAE decodes it.

The full pipeline ships an LTX-2 sink-bidirectional Euler refiner that adds ~87 GB of weights. This Space runs **Stage-1 only** (`--no_refiner`) to fit ZeroGPU; for refined output, run the CLI offline.

## Build notes

* The Sana repo is vendored under `./Sana/` and prepended to `sys.path`.
* `flash_attn` is stubbed at startup: SANA-WM only uses the Triton GDN path, but a few Sana modules do a top-level `from flash_attn import …`.
* Camera intrinsics are estimated with Pi3X from the input image; pass `--intrinsics` in the CLI for accurate values.
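
The first two build notes can be sketched roughly as below. This is a minimal illustration, not the code in `app.py`: the helper names `prepend_vendored_repo` and `stub_module` are hypothetical, and because the notes do not say which names Sana imports from `flash_attn`, the stub answers any attribute lookup with a placeholder that raises if it is ever called.

```python
import importlib.machinery
import sys
import types
from pathlib import Path


def prepend_vendored_repo(repo_dir: str = "Sana") -> str:
    """Put the vendored checkout first on sys.path (hypothetical helper)."""
    path = str(Path(repo_dir).resolve())
    if path not in sys.path:
        sys.path.insert(0, path)
    return path


def stub_module(name: str) -> types.ModuleType:
    """Register a placeholder so top-level `from <name> import x` succeeds.

    Hypothetical helper: it does not emulate the real package. Any imported
    name resolves to a callable that raises when it is actually used.
    """
    if name in sys.modules:
        return sys.modules[name]  # already imported or stubbed; leave it alone

    module = types.ModuleType(name)
    # A real spec keeps importlib.util.find_spec(name) from raising ValueError.
    module.__spec__ = importlib.machinery.ModuleSpec(name, loader=None)

    def _unavailable(*args, **kwargs):
        raise RuntimeError(f"{name} is stubbed; this code path should be unused")

    def _module_getattr(attr: str):
        # Module-level __getattr__ (PEP 562): only called for missing names.
        if attr.startswith("__"):
            raise AttributeError(attr)  # keep dunder probes behaving normally
        return _unavailable

    module.__getattr__ = _module_getattr
    sys.modules[name] = module
    return module


if __name__ == "__main__":
    prepend_vendored_repo("Sana")  # must run before any Sana import
    stub_module("flash_attn")      # must run before modules that import it
```

Both calls have to run before the first Sana module is imported. The stub covers only top-level imports; a `flash_attn` submodule import would need its own entry in `sys.modules`.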