SANA-WM Bidirectional on Apple Silicon

Mac-local staged inference patches and runtime scripts for running SANA-WM Bidirectional on Apple Silicon with PyTorch MPS/Metal.

Developed and documented by the osmAPI Research Team.

Summary

SANA-WM is a 2.6B-parameter controllable world model for image-to-video generation with 6-DoF camera control. The public release targets a CUDA-first ecosystem. This repository provides a practical Apple Silicon runtime path: replace CUDA/Triton assumptions with MPS-safe PyTorch fallbacks, run the pipeline in strict subprocess stages, save latents between stages, and decode video with a streaming/tiled VAE path.

This project was tested on an M3 Max MacBook Pro with 128 GB unified memory under a conservative 96 GB per-stage memory policy.

What This Repo Is

  • A Mac/Apple Silicon runtime and patch set for SANA-WM PR #379.
  • A PyTorch MPS/Metal compatibility path.
  • A staged load/unload pipeline that prevents Stage 1, refiner, and VAE from co-residing.
  • A collection of helper scripts, manifest templates, metrics notes, and sample contact sheets.

What This Repo Is Not

  • Not an official NVIDIA or NVlabs release.
  • Not a full MLX conversion.
  • Not a Core ML conversion.
  • Not a real-time playable world runtime.
  • Not a model-weight mirror.
  • Not a replacement for the upstream SANA-WM weights.

The output of the released SANA-WM inference path is a controllable video rollout: initial image + prompt + action/camera trajectory -> generated MP4. It is not yet a low-latency streaming simulator.

Base Model

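Download the original weights from the Hugging Face repository Efficient-Large-Model/SANA-WM_bidirectional.

The Stage 1 text encoder is separate and is downloaded from google/gemma-2-2b-it.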

This repository does not include those weights.

Upstream Code

The runnable SANA-WM implementation used by this runtime came from pull request #379 of the NVlabs/Sana repository (https://github.com/NVlabs/Sana).

At the time of this work, the main branch and model-card instructions did not provide a complete Apple-ready path. PR #379 supplied the SANA-WM inference runner, config, model registration, GDN/CamCtrl blocks, refiner wrapper, and demo assets.

Apple Silicon Strategy

The runtime follows four rules:

  1. Use PyTorch MPS/Metal for model execution.
  2. Avoid CUDA-only packages such as flash-attn, CUDA xformers, CUDA bitsandbytes, and Triton-only paths.
  3. Run each heavy model stage in a separate subprocess.
  4. Save intermediate latents to disk before loading the next major model group.

The key memory invariant:

Stage 1, refiner text encoder, refiner transformer, and VAE must never all be resident at the same time.
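
The rules and invariant above can be sketched as a small driver that launches each stage as its own Python process and only moves on once the previous process has exited and its handoff file exists. This is a minimal illustration under stated assumptions, not the repository's runner.py: the run_stages helper, the stage names, and the toy stage commands are hypothetical, and the real stage scripts take arguments that are not shown here.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_stages(stages: list[tuple[str, list[str], Path]]) -> None:
    """Run each (name, command, expected_artifact) stage in its own process.

    A stage's memory is handed back to the OS when its process exits, so two
    heavy model groups never share one address space.
    """
    for name, command, artifact in stages:
        # check=True aborts the pipeline if a stage exits with a non-zero code.
        subprocess.run(command, check=True)
        # The next stage may only start once the handoff file is on disk.
        if not artifact.is_file():
            raise FileNotFoundError(f"{name} did not write {artifact}")


if __name__ == "__main__":
    # Toy stand-ins for the real stage scripts: each child writes one file.
    workdir = Path(tempfile.mkdtemp())
    write = "import sys, pathlib; pathlib.Path(sys.argv[1]).write_bytes(b'latent')"
    names = ["stage1", "refiner", "vae_decode"]  # hypothetical stage names
    run_stages(
        [
            (n, [sys.executable, "-c", write, str(workdir / f"{n}.bin")], workdir / f"{n}.bin")
            for n in names
        ]
    )
    print(sorted(p.name for p in workdir.iterdir()))
```

Because subprocess.run blocks until the child exits, at most one stage process is alive at a time, which is the property the invariant relies on.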

Pipeline

flowchart TD
  A["Input image + prompt + action/camera + intrinsics"] --> B["Stage 1 subprocess"]
  B --> C["Save Stage 1 latent"]
  C --> D["Release Stage 1 memory"]
  D --> E["Refiner subprocess"]
  E --> F["Save refined latent"]
  F --> G["Release refiner memory"]
  G --> H["VAE-only subprocess"]
  H --> I["Streaming/tiled decode"]
  I --> J["Write MP4"]

Patch Highlights

The patch set adds or changes:

  • MPS device selection.
  • Backend-aware cache clearing.
  • FLA ShortConvolution fallback.
  • Triton import guards.
  • GDN/CamCtrl fallback routing for non-CUDA devices.
  • MPS-safe RoPE dtype handling.
  • MPS-safe attention fallback in the refiner.
  • torch.inference_mode() around direct Stage 1 latent sampling.
  • Stage 1 latent writer.
  • Refiner-from-latent runner.
  • VAE-only streaming/tiled decoder.
  • Memory-governed subprocess runtime.
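
Three of the items above (device selection, backend-aware cache clearing, and import guards) follow a common pattern that can be sketched as below. This is an illustrative sketch, not the contents of the patch: the helper names has_module, pick_device, and clear_backend_cache, and the USE_TRITON_KERNELS flag, are hypothetical. The only PyTorch calls used are torch.cuda.is_available, torch.backends.mps.is_available, torch.cuda.empty_cache, and torch.mps.empty_cache.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module can be imported, without importing it."""
    return importlib.util.find_spec(name) is not None


def pick_device() -> str:
    """Prefer CUDA, then MPS, then CPU; fall back to CPU if torch is absent."""
    if not has_module("torch"):
        return "cpu"
    import torch

    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"


def clear_backend_cache(device: str) -> None:
    """Release cached allocator blocks on whichever backend is in use."""
    if not has_module("torch"):
        return
    import torch

    if device == "cuda":
        torch.cuda.empty_cache()
    elif device == "mps":
        torch.mps.empty_cache()
    # CPU has no caching allocator to clear, so nothing happens here.


# Import guard: only take a Triton code path when Triton is installed AND the
# device is CUDA; every other combination routes to a pure PyTorch fallback.
USE_TRITON_KERNELS = has_module("triton") and pick_device() == "cuda"

if __name__ == "__main__":
    device = pick_device()
    clear_backend_cache(device)
    print(device, USE_TRITON_KERNELS)
```

On an Apple Silicon machine with an MPS-enabled PyTorch build, pick_device would be expected to return "mps" and USE_TRITON_KERNELS to be False.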

Repository Layout

.
├── README.md
├── LICENSE
├── NOTICE
├── requirements-mps.txt
├── patches/
│   └── sana-wm-pr379-mps.patch
├── scripts/
│   ├── sana-wm/
│   │   ├── bootstrap_mac_env.sh
│   │   ├── create_split_manifest.py
│   │   ├── run_stage1_latent.py
│   │   ├── run_refiner_from_latent.py
│   │   ├── run_vae_decode_from_latent.py
│   │   ├── sana_wm_1600m_720p_mps_local.yaml
│   │   └── templates/
│   └── sana_wm_runtime/
│       ├── __init__.py
│       ├── cli.py
│       ├── memory.py
│       ├── pipeline.py
│       └── runner.py
├── docs/
│   ├── Running-SANA-WM-on-M3-Max-MacBook-Pro.md
│   ├── 96GB-memory-contract.md
│   └── troubleshooting.md
├── reports/
│   └── metrics_summary.json
└── sample_outputs/
    └── contact_sheets/

Requirements

Recommended:

  • Apple Silicon Mac.
  • macOS 14 or later.
  • Python 3.11.
  • PyTorch with MPS support.
  • 128 GB unified memory for full 321-frame author-default examples.
  • Large local disk for upstream model weights.

The runtime was designed around a 96 GB per-stage memory cap. Separately, the full SANA-WM assets require substantial disk space.

Quick Start

Clone this repo and the upstream PR checkout:

git clone https://huggingface.co/osmAPI/SANA-WM-Bidirectional-on-Apple-Silicon
git clone https://github.com/NVlabs/Sana.git Sana-WM-PR379
cd Sana-WM-PR379
git fetch origin pull/379/head:sana-wm-pr379
git checkout sana-wm-pr379

Apply the MPS patch:

git apply ../SANA-WM-Bidirectional-on-Apple-Silicon/patches/sana-wm-pr379-mps.patch

Create the Python environment:

cd ../SANA-WM-Bidirectional-on-Apple-Silicon
python3.11 -m venv .venv-sana-wm
source .venv-sana-wm/bin/activate
pip install -r requirements-mps.txt

Download upstream weights separately:

huggingface-cli download Efficient-Large-Model/SANA-WM_bidirectional \
  --local-dir <MODEL_ROOT>/SANA-WM_bidirectional

huggingface-cli download google/gemma-2-2b-it \
  --local-dir <MODEL_ROOT>/gemma-2-2b-it

Set environment variables:

export SANA_WM_PR_ROOT=<PATH_TO_SANA_WM_PR379>
export SANA_WM_MODEL_ROOT=<MODEL_ROOT>/SANA-WM_bidirectional
export SANA_GEMMA_2_2B_IT_ROOT=<MODEL_ROOT>/gemma-2-2b-it
export PYTORCH_ENABLE_MPS_FALLBACK=1
export SANA_USE_LIGER=0
export GDN_DISABLE_COMPILE=1
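
Before launching a multi-hour run, it can help to confirm that these variables are set as intended. The following is a minimal pre-flight sketch: the variable names and flag values are taken from the exports above, while the check_environment helper itself is hypothetical and not part of the repository.

```python
import os
from pathlib import Path

# Variables that should point at existing directories.
REQUIRED_DIR_VARS = (
    "SANA_WM_PR_ROOT",
    "SANA_WM_MODEL_ROOT",
    "SANA_GEMMA_2_2B_IT_ROOT",
)

# Flags and the values used in the Quick Start exports.
EXPECTED_FLAGS = {
    "PYTORCH_ENABLE_MPS_FALLBACK": "1",
    "SANA_USE_LIGER": "0",
    "GDN_DISABLE_COMPILE": "1",
}


def check_environment(env=None) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_DIR_VARS:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not Path(value).is_dir():
            problems.append(f"{name} does not point to a directory: {value}")
    for name, expected in EXPECTED_FLAGS.items():
        if env.get(name) != expected:
            problems.append(f"{name} should be {expected!r}, got {env.get(name)!r}")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```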

Validate a manifest:

PYTHONPATH=./scripts python -m sana_wm_runtime.cli validate-manifest \
  scripts/sana-wm/templates/demo0_camera_author_defaults.template.json

Run a staged pipeline:

PYTHONPATH=./scripts python -m sana_wm_runtime.cli run-pipeline \
  scripts/sana-wm/templates/demo0_camera_author_defaults.template.json

Edit the manifest template before running so that it points to your local model, PR checkout, input, artifact, and output directories.
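
One way to catch a template that still contains unedited paths is to scan the JSON for leftover placeholders. The sketch below assumes placeholders use the same angle-bracket style as the shell snippets above (for example <MODEL_ROOT>); the actual template schema is not documented here, so the key names in the example manifest are purely illustrative, and find_placeholders is a hypothetical helper rather than the validate-manifest command.

```python
import json
import re
from pathlib import Path

# Assumption: unresolved placeholders look like <MODEL_ROOT> or <OUTPUT_DIR>.
PLACEHOLDER = re.compile(r"<[A-Z0-9_]+>")


def find_placeholders(node, prefix="$"):
    """Walk parsed JSON and return (location, value) for unresolved strings."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            hits += find_placeholders(value, f"{prefix}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            hits += find_placeholders(value, f"{prefix}[{index}]")
    elif isinstance(node, str) and PLACEHOLDER.search(node):
        hits.append((prefix, node))
    return hits


def check_manifest(path):
    """Load a manifest file and report any placeholders still present."""
    return find_placeholders(json.loads(Path(path).read_text()))


if __name__ == "__main__":
    # Illustrative manifest; these keys are NOT the repository's real schema.
    example = {"model_root": "<MODEL_ROOT>/SANA-WM_bidirectional", "steps": 4}
    for location, value in find_placeholders(example):
        print(f"unresolved placeholder at {location}: {value}")
```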

Tested Outputs

The Apple Silicon runtime produced multiple 20-second MP4s locally:

| Run | Setting | Result |
| --- | --- | --- |
| demo0_camera_321f_4step | low-step camera example | complete |
| demo0_action_321f_4step | low-step action example | complete |
| demo1_camera_321f_4step | low-step demo triplet | complete |
| demo2_camera_321f_4step | low-step demo triplet | complete |
| demo0_camera_321f_author_defaults | 60-step camera example | complete |
| demo0_action_321f_author_defaults | 60-step action example | complete |
| compare_c17_w_321f_author_defaults | project-page anchored reproduction | complete |

Sample contact sheets are included under sample_outputs/contact_sheets/.

Performance Snapshot

Representative author-default runs on M3 Max 128 GB:

| Run | Stage 1 | Refiner | VAE decode | Total |
| --- | --- | --- | --- | --- |
| demo0_camera_321f_author_defaults | ~2h44m | ~20m | ~8m | ~3h12m |
| demo0_action_321f_author_defaults | ~2h50m | ~18m | ~9m | ~3h17m |
| compare_c17_w_321f_author_defaults | ~2h46m | ~19m | ~8m | ~3h13m |

These are offline generation times. This runtime is not real-time.
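
The totals are the sums of the three per-stage times. The short sketch below reproduces that arithmetic from the approximate figures reported above; the to_minutes and fmt helpers are illustrative only and are not part of the repository.

```python
import re


def to_minutes(text: str) -> int:
    """Parse strings such as '~2h44m' or '~20m' into whole minutes."""
    match = re.fullmatch(r"~?(?:(\d+)h)?(?:(\d+)m)?", text)
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognised duration: {text!r}")
    hours, minutes = (int(group) if group else 0 for group in match.groups())
    return 60 * hours + minutes


def fmt(minutes: int) -> str:
    """Format whole minutes back into the '~XhYYm' style used above."""
    hours, rest = divmod(minutes, 60)
    return f"~{hours}h{rest:02d}m"


# (Stage 1, Refiner, VAE decode) as reported for each run.
RUNS = {
    "demo0_camera_321f_author_defaults": ("~2h44m", "~20m", "~8m"),
    "demo0_action_321f_author_defaults": ("~2h50m", "~18m", "~9m"),
    "compare_c17_w_321f_author_defaults": ("~2h46m", "~19m", "~8m"),
}

if __name__ == "__main__":
    for name, stages in RUNS.items():
        minutes = [to_minutes(stage) for stage in stages]
        total = sum(minutes)
        share = 100 * minutes[0] / total
        print(f"{name}: {fmt(total)} (Stage 1 is {share:.0f}% of the total)")
```

On these figures, Stage 1 accounts for roughly 85 to 86 percent of each run, so any speed-up effort would likely matter most there.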

Memory Snapshot

Observed stage peaks were stable under the policy:

| Stage | Typical process RSS peak | Notes |
| --- | --- | --- |
| Stage 1 | ~12.9 GB | MPS current after Stage 1 around 14 GB, then returns to 0 after unload |
| Refiner | ~28.2-28.8 GB | MPS current after refine around 37.75 GB, then returns to 0 after unload |
| VAE-only decode | ~3.4 GB | Streaming/tiled decode avoids MPS accumulation |

RSS is not the only result that matters. The critical property is that active MPS allocations return to zero between heavy stages.
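
A process-level RSS peak like the ones in the table can be read back after a child process exits. The sketch below is a post-hoc check, assuming a Unix-like system and interpreting the 96 GB policy as 96 GiB. It is not the repository's memory.py, and run_with_rss_cap is a hypothetical helper. It also does not see MPS allocations; inside a stage process, torch.mps.current_allocated_memory() would be the analogous MPS reading.

```python
import resource
import subprocess
import sys

GIB = 1024 ** 3
CAP_BYTES = 96 * GIB  # assumption: the 96 GB policy expressed in GiB


def peak_child_rss_bytes() -> int:
    """Largest peak RSS among child processes that have been waited for."""
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    # macOS reports ru_maxrss in bytes; Linux reports it in kilobytes.
    return peak if sys.platform == "darwin" else peak * 1024


def run_with_rss_cap(command: list[str], cap_bytes: int = CAP_BYTES) -> int:
    """Run one stage to completion, then fail if its peak RSS exceeded the cap.

    This reports after the fact; it does not stop a stage mid-run. The value
    is a high-water mark across all children so far, so with sequential
    stages it reflects the largest stage seen up to this point.
    """
    subprocess.run(command, check=True)
    peak = peak_child_rss_bytes()
    if peak > cap_bytes:
        raise MemoryError(f"peak RSS {peak / GIB:.1f} GiB exceeds cap")
    return peak


if __name__ == "__main__":
    # Toy child process standing in for a real stage script.
    used = run_with_rss_cap([sys.executable, "-c", "x = bytearray(50_000_000)"])
    print(f"peak child RSS: {used / GIB:.3f} GiB")
```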

Limitations

  • This is a compatibility runtime, not a native MLX/Core ML conversion.
  • It is much slower than the NVIDIA paths described in the SANA-WM paper.
  • It relies on pure PyTorch fallbacks where the original implementation expected CUDA/Triton kernels.
  • Project-page reproductions are anchored approximations if the original clean first frame, exact trajectory, or intrinsics are not public.
  • Full author-default examples take hours on M3 Max.
  • Real-time keyboard-streaming world interaction is not implemented.

Roadmap

  • Publish sanitized manifests and metrics summaries.
  • Add a resume helper for completed Stage 1/refiner checkpoints.
  • Improve path templating so manifests are easier to relocate.
  • Test Pi3X or an Apple-safe intrinsics replacement.
  • Explore MLX ports of isolated components.
  • Investigate quantization only with visual quality gates.
  • Prototype offline branching: choose a frame, choose an action branch, generate a short preview, refine later.
  • Prototype chunked streaming once short-horizon generation is fast enough.

Attribution

This Apple Silicon runtime, research documentation, and staged memory-governed execution plan were developed by the osmAPI Research Team.

Upstream work belongs to the SANA-WM authors and contributors. This repository is a compatibility/runtime package built around the public SANA-WM release and PR #379.

Citation

If this runtime is useful, cite the original SANA-WM paper and the upstream model:

@misc{sana-wm,
  title = {SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer},
  year = {2026},
  url = {https://arxiv.org/abs/2605.15178}
}

License

This runtime package is released under Apache 2.0 where applicable. Upstream model weights, model code, and third-party dependencies remain governed by their respective licenses.
