SANA-WM Bidirectional on Apple Silicon
Mac-local staged inference patches and runtime scripts for running SANA-WM Bidirectional on Apple Silicon with PyTorch MPS/Metal.
Developed and documented by the osmAPI Research Team.
Summary
SANA-WM is a 2.6B-parameter controllable world model for image-to-video generation with 6-DoF camera control. The public release targets a CUDA-first ecosystem. This repository provides a practical Apple Silicon runtime path: replace CUDA/Triton-only code paths with MPS-safe PyTorch fallbacks, run the pipeline in strict subprocess stages, save latents to disk between stages, and decode video with a streaming/tiled VAE path.
This project was tested on an M3 Max MacBook Pro with 128 GB unified memory under a conservative 96 GB per-stage memory policy.
What This Repo Is
- A Mac/Apple Silicon runtime and patch set for SANA-WM PR #379.
- A PyTorch MPS/Metal compatibility path.
- A staged load/unload pipeline that prevents Stage 1, refiner, and VAE from co-residing.
- A collection of helper scripts, manifest templates, metrics notes, and sample contact sheets.
What This Repo Is Not
- Not an official NVIDIA or NVlabs release.
- Not a full MLX conversion.
- Not a Core ML conversion.
- Not a real-time playable world runtime.
- Not a model-weight mirror.
- Not a replacement for the upstream SANA-WM weights.
The output of the released SANA-WM inference path is a controllable video rollout: initial image + prompt + action/camera trajectory -> generated MP4. It is not yet a low-latency streaming simulator.
Base Model
Download the original weights from the Efficient-Large-Model/SANA-WM_bidirectional repository on Hugging Face.
The Stage 1 text encoder is separate: google/gemma-2-2b-it.
This repository does not include those weights.
Upstream Code
The runnable SANA-WM implementation used by this runtime came from pull request #379 on the NVlabs/Sana GitHub repository.
At the time of this work, the main branch and model-card instructions did not provide a complete Apple-ready path. PR #379 supplied the SANA-WM inference runner, config, model registration, GDN/CamCtrl blocks, refiner wrapper, and demo assets.
Apple Silicon Strategy
The runtime follows four rules:
- Use PyTorch MPS/Metal for model execution.
- Avoid CUDA-only packages such as flash-attn, CUDA xformers, CUDA bitsandbytes, and Triton-only paths.
- Run each heavy model stage in a separate subprocess.
- Save intermediate latents to disk before loading the next major model group.
The key memory invariant:
Stage 1, refiner text encoder, refiner transformer, and VAE must never all be resident at the same time.
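The rules and the invariant above can be illustrated with the standard library alone. The sketch below is not the repository's sana_wm_runtime code: the stage body is a stand-in that only appends a tag to a JSON file, and the file names (request.json, stage1_latent.json, refined_latent.json, decoded.json) are hypothetical. It shows the pattern only: each stage runs in a fresh Python process, writes its result to disk, and exits before the next stage starts, so the operating system reclaims that stage's memory.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path

# Stand-in stage body (hypothetical). A real stage would load a model here.
STAGE_CODE = """
import json, sys
src, dst, tag = sys.argv[1], sys.argv[2], sys.argv[3]
with open(src) as f:
    data = json.load(f)
data["history"].append(tag)
with open(dst, "w") as f:
    json.dump(data, f)
"""


def run_stage(tag, src, dst):
    """Run one stage in a fresh interpreter; raise if it exits non-zero."""
    subprocess.run(
        [sys.executable, "-c", STAGE_CODE, str(src), str(dst), tag],
        check=True,
    )
    return dst


def run_pipeline(workdir):
    """Run the three stages strictly one after another, handing off via files."""
    workdir = Path(workdir)
    request = workdir / "request.json"
    request.write_text(json.dumps({"history": []}))
    stage1 = run_stage("stage1", request, workdir / "stage1_latent.json")
    refined = run_stage("refiner", stage1, workdir / "refined_latent.json")
    decoded = run_stage("vae_decode", refined, workdir / "decoded.json")
    return json.loads(decoded.read_text())["history"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(run_pipeline(tmp))
```

Because `subprocess.run` blocks until the child exits, no two stages are alive at once, which is the property the invariant asks for.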
Pipeline
flowchart TD
A["Input image + prompt + action/camera + intrinsics"] --> B["Stage 1 subprocess"]
B --> C["Save Stage 1 latent"]
C --> D["Release Stage 1 memory"]
D --> E["Refiner subprocess"]
E --> F["Save refined latent"]
F --> G["Release refiner memory"]
G --> H["VAE-only subprocess"]
H --> I["Streaming/tiled decode"]
I --> J["Write MP4"]
Patch Highlights
The patch set adds or changes:
- MPS device selection.
- Backend-aware cache clearing.
- FLA ShortConvolution fallback.
- Triton import guards.
- GDN/CamCtrl fallback routing for non-CUDA devices.
- MPS-safe RoPE dtype handling.
- MPS-safe attention fallback in the refiner.
- torch.inference_mode() around direct Stage 1 latent sampling.
- Stage 1 latent writer.
- Refiner-from-latent runner.
- VAE-only streaming/tiled decoder.
- Memory-governed subprocess runtime.
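Three of the items above (device selection, backend-aware cache clearing, and Triton import guards) follow a common pattern that can be sketched briefly. This is an illustration, not the contents of the patch: the helper names `has_module`, `pick_device`, `clear_cache`, and `pick_kernel` are hypothetical, and the code assumes PyTorch 2.x, where `torch.mps.empty_cache()` is available. It falls back to CPU when PyTorch is not installed.

```python
import gc
import importlib.util


def has_module(name):
    """Return True if an optional package is importable, without importing it."""
    return importlib.util.find_spec(name) is not None


HAS_TORCH = has_module("torch")
HAS_TRITON = has_module("triton")  # usually False on macOS


def pick_device():
    """Prefer CUDA, then MPS, then CPU."""
    if not HAS_TORCH:
        return "cpu"
    import torch

    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"


def clear_cache(device):
    """Collect garbage, then release cached allocator blocks for the backend."""
    gc.collect()
    if not HAS_TORCH:
        return
    import torch

    if device == "cuda":
        torch.cuda.empty_cache()
    elif device == "mps":
        torch.mps.empty_cache()


def pick_kernel(device):
    """Use a Triton kernel only on CUDA with Triton installed (hypothetical labels)."""
    return "triton" if (device == "cuda" and HAS_TRITON) else "pytorch_fallback"


if __name__ == "__main__":
    device = pick_device()
    print(device, pick_kernel(device))
    clear_cache(device)
```

Checking for Triton with `find_spec` rather than a bare import keeps module import from failing on machines where the package does not exist.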
Repository Layout
.
├── README.md
├── LICENSE
├── NOTICE
├── requirements-mps.txt
├── patches/
│   └── sana-wm-pr379-mps.patch
├── scripts/
│   ├── sana-wm/
│   │   ├── bootstrap_mac_env.sh
│   │   ├── create_split_manifest.py
│   │   ├── run_stage1_latent.py
│   │   ├── run_refiner_from_latent.py
│   │   ├── run_vae_decode_from_latent.py
│   │   ├── sana_wm_1600m_720p_mps_local.yaml
│   │   └── templates/
│   └── sana_wm_runtime/
│       ├── __init__.py
│       ├── cli.py
│       ├── memory.py
│       ├── pipeline.py
│       └── runner.py
├── docs/
│   ├── Running-SANA-WM-on-M3-Max-MacBook-Pro.md
│   ├── 96GB-memory-contract.md
│   └── troubleshooting.md
├── reports/
│   └── metrics_summary.json
└── sample_outputs/
    └── contact_sheets/
Requirements
Recommended:
- Apple Silicon Mac.
- macOS 14 or later.
- Python 3.11.
- PyTorch with MPS support.
- 128 GB unified memory for full 321-frame author-default examples.
- Large local disk for upstream model weights.
The runtime was designed around a 96 GB per-stage memory cap. Separately, the full SANA-WM assets require substantial disk space.
Quick Start
Clone this repo and the upstream PR checkout:
git clone https://huggingface.co/osmAPI/SANA-WM-Bidirectional-on-Apple-Silicon
git clone https://github.com/NVlabs/Sana.git Sana-WM-PR379
cd Sana-WM-PR379
git fetch origin pull/379/head:sana-wm-pr379
git checkout sana-wm-pr379
Apply the MPS patch:
git apply ../SANA-WM-Bidirectional-on-Apple-Silicon/patches/sana-wm-pr379-mps.patch
Create the Python environment:
cd ../SANA-WM-Bidirectional-on-Apple-Silicon
python3.11 -m venv .venv-sana-wm
source .venv-sana-wm/bin/activate
pip install -r requirements-mps.txt
Download upstream weights separately:
huggingface-cli download Efficient-Large-Model/SANA-WM_bidirectional \
--local-dir <MODEL_ROOT>/SANA-WM_bidirectional
huggingface-cli download google/gemma-2-2b-it \
--local-dir <MODEL_ROOT>/gemma-2-2b-it
Set environment variables:
export SANA_WM_PR_ROOT=<PATH_TO_SANA_WM_PR379>
export SANA_WM_MODEL_ROOT=<MODEL_ROOT>/SANA-WM_bidirectional
export SANA_GEMMA_2_2B_IT_ROOT=<MODEL_ROOT>/gemma-2-2b-it
export PYTORCH_ENABLE_MPS_FALLBACK=1
export SANA_USE_LIGER=0
export GDN_DISABLE_COMPILE=1
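A short preflight check can catch a missing or mistyped variable before a multi-hour run starts. The sketch below is not part of the repository; `check_environment` is a hypothetical helper. The variable names and expected flag values are taken from the export lines above, and the check assumes each of the three root variables should point to an existing directory.

```python
import os
from pathlib import Path

# Variables that should point to existing directories.
PATH_VARS = ["SANA_WM_PR_ROOT", "SANA_WM_MODEL_ROOT", "SANA_GEMMA_2_2B_IT_ROOT"]

# Flags and the values used in the export lines above.
FLAG_VARS = {
    "PYTORCH_ENABLE_MPS_FALLBACK": "1",
    "SANA_USE_LIGER": "0",
    "GDN_DISABLE_COMPILE": "1",
}


def check_environment(env=None):
    """Return a list of human-readable problems; an empty list means all good."""
    env = os.environ if env is None else env
    problems = []
    for name in PATH_VARS:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not Path(value).is_dir():
            problems.append(f"{name} does not point to a directory: {value}")
    for name, expected in FLAG_VARS.items():
        if env.get(name) != expected:
            problems.append(f"{name} should be {expected}, found {env.get(name)!r}")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("preflight:", problem)
```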
Validate a manifest:
PYTHONPATH=./scripts python -m sana_wm_runtime.cli validate-manifest \
scripts/sana-wm/templates/demo0_camera_author_defaults.template.json
Run a staged pipeline:
PYTHONPATH=./scripts python -m sana_wm_runtime.cli run-pipeline \
scripts/sana-wm/templates/demo0_camera_author_defaults.template.json
Before running, edit the manifest template so that it points to your local model, PR checkout, input, artifact, and output directories.
Tested Outputs
The Apple Silicon runtime produced multiple 20-second MP4s locally:
| Run | Setting | Result |
|---|---|---|
| demo0_camera_321f_4step | low-step camera example | complete |
| demo0_action_321f_4step | low-step action example | complete |
| demo1_camera_321f_4step | low-step demo triplet | complete |
| demo2_camera_321f_4step | low-step demo triplet | complete |
| demo0_camera_321f_author_defaults | 60-step camera example | complete |
| demo0_action_321f_author_defaults | 60-step action example | complete |
| compare_c17_w_321f_author_defaults | project-page anchored reproduction | complete |
Sample contact sheets are included under sample_outputs/contact_sheets/.
Performance Snapshot
Representative author-default runs on M3 Max 128 GB:
| Run | Stage 1 | Refiner | VAE decode | Total |
|---|---|---|---|---|
| demo0_camera_321f_author_defaults | ~2h44m | ~20m | ~8m | ~3h12m |
| demo0_action_321f_author_defaults | ~2h50m | ~18m | ~9m | ~3h17m |
| compare_c17_w_321f_author_defaults | ~2h46m | ~19m | ~8m | ~3h13m |
These are offline generation times. This runtime is not real-time.
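The totals can be cross-checked from the per-stage figures with a few lines of arithmetic. The sketch below copies the approximate durations from the table; `to_minutes` and `fmt` are hypothetical helpers written for this check, and the result is only as precise as the rounded inputs.

```python
import re


def to_minutes(text):
    """Convert strings such as '~2h44m' or '~20m' to whole minutes."""
    match = re.fullmatch(r"~?(?:(\d+)h)?(?:(\d+)m)?", text.strip())
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {text!r}")
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes


def fmt(minutes):
    """Format whole minutes as e.g. '3h12m'."""
    return f"{minutes // 60}h{minutes % 60:02d}m"


# (Stage 1, Refiner, VAE decode) as reported in the table above.
RUNS = {
    "demo0_camera_321f_author_defaults": ("~2h44m", "~20m", "~8m"),
    "demo0_action_321f_author_defaults": ("~2h50m", "~18m", "~9m"),
    "compare_c17_w_321f_author_defaults": ("~2h46m", "~19m", "~8m"),
}

if __name__ == "__main__":
    for name, stages in RUNS.items():
        parts = [to_minutes(s) for s in stages]
        total = sum(parts)
        share = 100 * parts[0] / total
        print(f"{name}: total {fmt(total)}, Stage 1 share {share:.0f}%")
```

The sums reproduce the Total column (3h12m, 3h17m, 3h13m), and Stage 1 accounts for roughly 85 to 86 percent of each run.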
Memory Snapshot
Observed stage peaks were stable under the policy:
| Stage | Typical process RSS peak | Notes |
|---|---|---|
| Stage 1 | ~12.9 GB | MPS current after Stage 1 around 14 GB, then returns to 0 after unload |
| Refiner | ~28.2-28.8 GB | MPS current after refine around 37.75 GB, then returns to 0 after unload |
| VAE-only decode | ~3.4 GB | Streaming/tiled decode avoids MPS accumulation |
RSS is only part of the picture. The critical property is that active MPS allocations return to zero between heavy stages.
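One way to collect a process-level peak like those in the table is to read the child-process resource usage after a stage exits. The sketch below is an assumption about how such a check could look, not the repository's memory.py: `child_peak_rss_bytes` and `run_and_check` are hypothetical helpers, the 96 GB policy is treated as 96 GiB, and the measurement is taken after the fact rather than enforced during the run. It relies on the Unix-only `resource` module. MPS allocations are a separate figure; inside a stage process, PyTorch 2.x exposes them through `torch.mps.current_allocated_memory()`.

```python
import resource
import subprocess
import sys

POLICY_BYTES = 96 * 1024**3  # assumption: the 96 GB policy read as 96 GiB


def child_peak_rss_bytes():
    """Peak RSS of the largest child process waited on so far."""
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    return peak if sys.platform == "darwin" else peak * 1024


def run_and_check(cmd, limit=POLICY_BYTES):
    """Run a stage command to completion, then compare its peak RSS to the limit."""
    subprocess.run(cmd, check=True)
    peak = child_peak_rss_bytes()
    return peak, peak <= limit


if __name__ == "__main__":
    # Stand-in stage that briefly allocates about 10 MB.
    peak, ok = run_and_check([sys.executable, "-c", "x = bytearray(10_000_000)"])
    print(f"peak RSS: {peak / 1024**2:.1f} MiB, within policy: {ok}")
```

Note that `RUSAGE_CHILDREN` reports the maximum across all children waited on so far, so a per-stage figure needs either a fresh parent process per stage or a comparison against the previous reading.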
Limitations
- This is a compatibility runtime, not a native MLX/Core ML conversion.
- It is much slower than the NVIDIA paths described in the SANA-WM paper.
- It relies on pure PyTorch fallbacks where the original implementation expected CUDA/Triton kernels.
- Project-page reproductions are anchored approximations if the original clean first frame, exact trajectory, or intrinsics are not public.
- Full author-default examples take hours on M3 Max.
- Real-time keyboard-streaming world interaction is not implemented.
Roadmap
- Publish sanitized manifests and metrics summaries.
- Add a resume helper for completed Stage 1/refiner checkpoints.
- Improve path templating so manifests are easier to relocate.
- Test Pi3X or an Apple-safe intrinsics replacement.
- Explore MLX ports of isolated components.
- Investigate quantization only with visual quality gates.
- Prototype offline branching: choose a frame, choose an action branch, generate a short preview, refine later.
- Prototype chunked streaming once short-horizon generation is fast enough.
Attribution
This Apple Silicon runtime, research documentation, and staged memory-governed execution plan were developed by the osmAPI Research Team.
Upstream work belongs to the SANA-WM authors and contributors. This repository is a compatibility/runtime package built around the public SANA-WM release and PR #379.
Citation
If this runtime is useful, cite the original SANA-WM paper and the upstream model:
@misc{sana-wm,
title = {SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer},
year = {2026},
url = {https://arxiv.org/abs/2605.15178}
}
License
This runtime package is released under Apache 2.0 where applicable. Upstream model weights, model code, and third-party dependencies remain governed by their respective licenses.