SANA-WM Bidirectional on Apple Silicon

Mac-local staged inference patches and runtime scripts for running SANA-WM Bidirectional on Apple Silicon with PyTorch MPS/Metal.

Developed and documented by the osmAPI Research Team.

Summary

SANA-WM is a 2.6B-parameter controllable world model for image-to-video generation with 6-DoF camera control. The public release targets a CUDA-first ecosystem. This repository provides a practical Apple Silicon runtime path: replace CUDA/Triton-only code paths with MPS-safe PyTorch fallbacks, run the pipeline in strict subprocess stages, save latents to disk between stages, and decode video with a streaming/tiled VAE path.

This project was tested on an M3 Max MacBook Pro with 128 GB unified memory under a conservative 96 GB per-stage memory policy.

What This Repo Is

A Mac/Apple Silicon runtime and patch set for SANA-WM PR #379.
A PyTorch MPS/Metal compatibility path.
A staged load/unload pipeline that prevents Stage 1, refiner, and VAE from co-residing.
A collection of helper scripts, manifest templates, metrics notes, and sample contact sheets.

What This Repo Is Not

Not an official NVIDIA or NVlabs release.
Not a full MLX conversion.
Not a Core ML conversion.
Not a real-time playable world runtime.
Not a model-weight mirror.
Not a replacement for the upstream SANA-WM weights.

The output of the released SANA-WM inference path is a controllable video rollout: initial image + prompt + action/camera trajectory -> generated MP4. It is not yet a low-latency streaming simulator.

Base Model

Download the original weights from:

Efficient-Large-Model/SANA-WM_bidirectional

The Stage 1 text encoder is separate:

google/gemma-2-2b-it

This repository does not include those weights.

Upstream Code

The runnable SANA-WM implementation used by this runtime came from:

NVlabs/Sana PR #379

At the time of this work, the main branch and model-card instructions did not provide a complete Apple-ready path. PR #379 supplied the SANA-WM inference runner, config, model registration, GDN/CamCtrl blocks, refiner wrapper, and demo assets.

Apple Silicon Strategy

The runtime follows four rules:

Use PyTorch MPS/Metal for model execution.
Avoid CUDA-only packages such as flash-attn, CUDA xformers, CUDA bitsandbytes, and Triton-only paths.
Run each heavy model stage in a separate subprocess.
Save intermediate latents to disk before loading the next major model group.

The key memory invariant:

Stage 1, refiner text encoder, refiner transformer, and VAE must never all be resident at the same time.
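
The subprocess rule can be sketched with the standard library alone. This is a minimal illustration, not the code in `sana_wm_runtime`: the function name `run_stages` and the placeholder commands are assumptions, and a real run would pass the stage scripts with their own arguments.

```python
import subprocess
import sys


def run_stages(stages):
    """Run each (name, argv) stage in its own process, strictly in order."""
    finished = []
    for name, argv in stages:
        # check=True aborts the pipeline when a stage fails, so a later stage
        # never starts from a missing or partial latent file.
        subprocess.run(argv, check=True)
        # run() returns only after the child has exited, at which point the
        # OS has reclaimed all memory that stage was holding.
        finished.append(name)
    return finished


if __name__ == "__main__":
    # Placeholder commands (hypothetical); they only print what a stage would do.
    demo = [
        ("stage1", [sys.executable, "-c", "print('stage 1: write latent')"]),
        ("refiner", [sys.executable, "-c", "print('refiner: write refined latent')"]),
        ("vae", [sys.executable, "-c", "print('vae: decode to mp4')"]),
    ]
    print(run_stages(demo))
```

Because each stage is a separate process, memory is released by process exit rather than by in-process cleanup, which is what makes the invariant easy to enforce.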

Pipeline

flowchart TD
  A["Input image + prompt + action/camera + intrinsics"] --> B["Stage 1 subprocess"]
  B --> C["Save Stage 1 latent"]
  C --> D["Release Stage 1 memory"]
  D --> E["Refiner subprocess"]
  E --> F["Save refined latent"]
  F --> G["Release refiner memory"]
  G --> H["VAE-only subprocess"]
  H --> I["Streaming/tiled decode"]
  I --> J["Write MP4"]

Patch Highlights

The patch set adds or changes:

MPS device selection.
Backend-aware cache clearing.
FLA ShortConvolution fallback.
Triton import guards.
GDN/CamCtrl fallback routing for non-CUDA devices.
MPS-safe RoPE dtype handling.
MPS-safe attention fallback in the refiner.
torch.inference_mode() around direct Stage 1 latent sampling.
Stage 1 latent writer.
Refiner-from-latent runner.
VAE-only streaming/tiled decoder.
Memory-governed subprocess runtime.
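
As a rough illustration of the first two items, device selection and backend-aware cache clearing might look like the sketch below. The function names `select_device` and `clear_backend_cache` and the CUDA, then MPS, then CPU preference order are assumptions for illustration, not the contents of the patch. The `torch` import is guarded only so the sketch loads on machines without PyTorch.

```python
try:
    import torch
except ImportError:  # keeps the sketch importable without PyTorch installed
    torch = None


def select_device():
    """Return 'cuda', 'mps', or 'cpu', preferring accelerators when present."""
    if torch is None:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"


def clear_backend_cache(device):
    """Release cached allocator blocks for the given backend, if it has any."""
    if torch is None:
        return
    if device == "cuda":
        torch.cuda.empty_cache()
    elif device == "mps":
        torch.mps.empty_cache()
    # The CPU backend has no caching allocator to clear.


if __name__ == "__main__":
    device = select_device()
    clear_backend_cache(device)
    print(device)
```

Routing every cache-clear call through one helper avoids scattering unconditional `torch.cuda.empty_cache()` calls through code that may run on MPS.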

Repository Layout

.
├── README.md
├── LICENSE
├── NOTICE
├── requirements-mps.txt
├── patches/
│   └── sana-wm-pr379-mps.patch
├── scripts/
│   ├── sana-wm/
│   │   ├── bootstrap_mac_env.sh
│   │   ├── create_split_manifest.py
│   │   ├── run_stage1_latent.py
│   │   ├── run_refiner_from_latent.py
│   │   ├── run_vae_decode_from_latent.py
│   │   ├── sana_wm_1600m_720p_mps_local.yaml
│   │   └── templates/
│   └── sana_wm_runtime/
│       ├── __init__.py
│       ├── cli.py
│       ├── memory.py
│       ├── pipeline.py
│       └── runner.py
├── docs/
│   ├── Running-SANA-WM-on-M3-Max-MacBook-Pro.md
│   ├── 96GB-memory-contract.md
│   └── troubleshooting.md
├── reports/
│   └── metrics_summary.json
└── sample_outputs/
    └── contact_sheets/

Requirements

Recommended:

Apple Silicon Mac.
macOS 14 or later.
Python 3.11.
PyTorch with MPS support.
128 GB unified memory for full 321-frame author-default examples.
Large local disk for upstream model weights.

The runtime was designed around a 96 GB per-stage memory cap. Separately, the full SANA-WM assets require substantial disk space.

Quick Start

Clone this repo and the upstream PR checkout:

git clone https://huggingface.co/osmAPI/SANA-WM-Bidirectional-on-Apple-Silicon
git clone https://github.com/NVlabs/Sana.git Sana-WM-PR379
cd Sana-WM-PR379
git fetch origin pull/379/head:sana-wm-pr379
git checkout sana-wm-pr379

Apply the MPS patch:

git apply ../SANA-WM-Bidirectional-on-Apple-Silicon/patches/sana-wm-pr379-mps.patch

Create the Python environment:

cd ../SANA-WM-Bidirectional-on-Apple-Silicon
python3.11 -m venv .venv-sana-wm
source .venv-sana-wm/bin/activate
pip install -r requirements-mps.txt

Download upstream weights separately:

huggingface-cli download Efficient-Large-Model/SANA-WM_bidirectional \
  --local-dir <MODEL_ROOT>/SANA-WM_bidirectional

huggingface-cli download google/gemma-2-2b-it \
  --local-dir <MODEL_ROOT>/gemma-2-2b-it

Set environment variables:

export SANA_WM_PR_ROOT=<PATH_TO_SANA_WM_PR379>
export SANA_WM_MODEL_ROOT=<MODEL_ROOT>/SANA-WM_bidirectional
export SANA_GEMMA_2_2B_IT_ROOT=<MODEL_ROOT>/gemma-2-2b-it
export PYTORCH_ENABLE_MPS_FALLBACK=1
export SANA_USE_LIGER=0
export GDN_DISABLE_COMPILE=1
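
A small preflight check can catch a missing or mistyped variable before a multi-hour run starts. The sketch below uses the variable names and values from the exports above; the helper name `check_environment` and the decision to treat the three root variables as directories are assumptions.

```python
import os

# Variables that should point at existing directories.
REQUIRED_DIRS = ("SANA_WM_PR_ROOT", "SANA_WM_MODEL_ROOT", "SANA_GEMMA_2_2B_IT_ROOT")

# Flags and the values used in the Quick Start exports.
EXPECTED_FLAGS = {
    "PYTORCH_ENABLE_MPS_FALLBACK": "1",
    "SANA_USE_LIGER": "0",
    "GDN_DISABLE_COMPILE": "1",
}


def check_environment(env=None):
    """Return a list of human-readable problems; an empty list means OK."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_DIRS:
        path = env.get(name)
        if not path:
            problems.append(f"{name} is not set")
        elif not os.path.isdir(path):
            problems.append(f"{name} is not a directory: {path}")
    for name, expected in EXPECTED_FLAGS.items():
        if env.get(name) != expected:
            problems.append(f"{name} should be {expected!r}, got {env.get(name)!r}")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```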

Validate a manifest:

PYTHONPATH=./scripts python -m sana_wm_runtime.cli validate-manifest \
  scripts/sana-wm/templates/demo0_camera_author_defaults.template.json

Run a staged pipeline:

PYTHONPATH=./scripts python -m sana_wm_runtime.cli run-pipeline \
  scripts/sana-wm/templates/demo0_camera_author_defaults.template.json

Before running either command, edit the manifest template so that its paths point to your local model, PR checkout, input, artifact, and output directories.
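
One way to confirm that a template has been fully edited is to scan it for leftover placeholders. The sketch below assumes that unedited values use angle-bracket placeholders such as `<MODEL_ROOT>`, as the Quick Start commands do. The manifest schema is not described in this README, so the key names in the example are hypothetical.

```python
import json
import re

# Assumption: unedited paths look like <MODEL_ROOT>/... in the template.
PLACEHOLDER = re.compile(r"<[A-Z0-9_]+>")


def find_placeholders(node, path="$"):
    """Return (json_path, value) pairs for strings that still contain <NAME>."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            hits.extend(find_placeholders(value, f"{path}.{key}"))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            hits.extend(find_placeholders(value, f"{path}[{index}]"))
    elif isinstance(node, str) and PLACEHOLDER.search(node):
        hits.append((path, node))
    return hits


if __name__ == "__main__":
    # Hypothetical manifest fragment, for illustration only.
    text = '{"model_root": "<MODEL_ROOT>/SANA-WM_bidirectional", "num_frames": 321}'
    for json_path, value in find_placeholders(json.loads(text)):
        print(f"{json_path}: {value}")
```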

Tested Outputs

The Apple Silicon runtime produced multiple 20-second MP4s locally:

| Run | Setting | Result |
| --- | --- | --- |
| `demo0_camera_321f_4step` | low-step camera example | complete |
| `demo0_action_321f_4step` | low-step action example | complete |
| `demo1_camera_321f_4step` | low-step demo triplet | complete |
| `demo2_camera_321f_4step` | low-step demo triplet | complete |
| `demo0_camera_321f_author_defaults` | 60-step camera example | complete |
| `demo0_action_321f_author_defaults` | 60-step action example | complete |
| `compare_c17_w_321f_author_defaults` | project-page anchored reproduction | complete |

Sample contact sheets are included under sample_outputs/contact_sheets/.

Performance Snapshot

Representative author-default runs on M3 Max 128 GB:

| Run | Stage 1 | Refiner | VAE decode | Total |
| --- | --- | --- | --- | --- |
| `demo0_camera_321f_author_defaults` | ~2h44m | ~20m | ~8m | ~3h12m |
| `demo0_action_321f_author_defaults` | ~2h50m | ~18m | ~9m | ~3h17m |
| `compare_c17_w_321f_author_defaults` | ~2h46m | ~19m | ~8m | ~3h13m |

These are offline generation times. This runtime is not real-time.
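
The arithmetic behind the table can be checked with a short script. It uses only the approximate durations reported above; the helper name `to_minutes` is illustrative. It confirms that the three stage times add up to each listed total and shows that Stage 1 accounts for roughly 85 to 86 percent of the wall time.

```python
import re


def to_minutes(text):
    """Convert strings such as '~2h44m' or '~20m' to whole minutes."""
    match = re.fullmatch(r"~?(?:(\d+)h)?(?:(\d+)m)?", text)
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {text!r}")
    hours, minutes = (int(group) if group else 0 for group in match.groups())
    return hours * 60 + minutes


# (Stage 1, Refiner, VAE decode, Total) as reported in the table above.
RUNS = {
    "demo0_camera_321f_author_defaults": ("~2h44m", "~20m", "~8m", "~3h12m"),
    "demo0_action_321f_author_defaults": ("~2h50m", "~18m", "~9m", "~3h17m"),
    "compare_c17_w_321f_author_defaults": ("~2h46m", "~19m", "~8m", "~3h13m"),
}

if __name__ == "__main__":
    for name, row in RUNS.items():
        stage1, refiner, vae, total = (to_minutes(cell) for cell in row)
        consistent = stage1 + refiner + vae == total
        print(f"{name}: stage 1 share {stage1 / total:.1%}, total consistent: {consistent}")
```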

Memory Snapshot

Observed stage peaks were stable and stayed well under the 96 GB per-stage memory policy:

| Stage | Typical process RSS peak | Notes |
| --- | --- | --- |
| Stage 1 | ~12.9 GB | Current MPS allocation after Stage 1 is around 14 GB, then returns to 0 after unload |
| Refiner | ~28.2-28.8 GB | Current MPS allocation after refinement is around 37.75 GB, then returns to 0 after unload |
| VAE-only decode | ~3.4 GB | Streaming/tiled decode avoids MPS accumulation |

The important result is not only RSS. The critical property is that active MPS allocations return to zero between heavy stages.
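
A per-stage RSS peak can be measured from the parent process with the standard library. The sketch below is one possible approach on Unix-like systems (Python 3.9 or later), not necessarily how `memory.py` does it. The name `run_and_measure` is an assumption, as is treating the 96 GB cap as 96 GiB. It measures process RSS only. MPS allocations would need a separate in-process probe such as `torch.mps.current_allocated_memory()`.

```python
import os
import subprocess
import sys

GIB = 1024 ** 3
CAP_BYTES = 96 * GIB  # assumption: the 96 GB policy interpreted as 96 GiB


def run_and_measure(cmd):
    """Run cmd as a child process and return (exit_code, peak_rss_bytes)."""
    proc = subprocess.Popen(cmd)
    # os.wait4 reaps this specific child and returns its resource usage.
    _, status, usage = os.wait4(proc.pid, 0)
    # Record the exit code on the Popen object so it does not try to wait again.
    proc.returncode = os.waitstatus_to_exitcode(status)
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    scale = 1 if sys.platform == "darwin" else 1024
    return proc.returncode, usage.ru_maxrss * scale


if __name__ == "__main__":
    # Placeholder child that allocates about 20 MB (hypothetical stage stand-in).
    code, peak = run_and_measure([sys.executable, "-c", "x = bytearray(20_000_000)"])
    print(f"exit={code} peak_rss={peak / GIB:.3f} GiB under_cap={peak < CAP_BYTES}")
```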

Limitations

This is a compatibility runtime, not a native MLX/Core ML conversion.
It is much slower than the NVIDIA paths described in the SANA-WM paper.
It relies on pure PyTorch fallbacks where the original implementation expected CUDA/Triton kernels.
Project-page reproductions are anchored approximations if the original clean first frame, exact trajectory, or intrinsics are not public.
Full author-default examples take hours on M3 Max.
Real-time keyboard-streaming world interaction is not implemented.

Roadmap

Publish sanitized manifests and metrics summaries.
Add a resume helper for completed Stage 1/refiner checkpoints.
Improve path templating so manifests are easier to relocate.
Test Pi3X or an Apple-safe intrinsics replacement.
Explore MLX ports of isolated components.
Investigate quantization only with visual quality gates.
Prototype offline branching: choose a frame, choose an action branch, generate a short preview, refine later.
Prototype chunked streaming once short-horizon generation is fast enough.

Attribution

This Apple Silicon runtime, research documentation, and staged memory-governed execution plan were developed by the osmAPI Research Team.

Upstream work belongs to the SANA-WM authors and contributors. This repository is a compatibility/runtime package built around the public SANA-WM release and PR #379.

Citation

If this runtime is useful, cite the original SANA-WM paper and the upstream model:

@misc{sana-wm,
  title = {SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer},
  year = {2026},
  url = {https://arxiv.org/abs/2605.15178}
}

License

This runtime package is released under Apache 2.0 where applicable. Upstream model weights, model code, and third-party dependencies remain governed by their respective licenses.
