Transformers documentation
Apple Silicon
Apple Silicon (M-series) chips have a unified memory architecture in which the CPU and GPU share the same memory pool. Shared memory eliminates the CPU-to-GPU data transfer overhead of discrete GPUs, making it practical to train large models locally. Transformers uses the Metal Performance Shaders (MPS) backend to accelerate training on this hardware.
This requires macOS 12.3 or later and PyTorch built with MPS support.
MPS doesn't support all PyTorch operations yet (see this GitHub issue for more details about missing ops). Set `PYTORCH_ENABLE_MPS_FALLBACK=1` to fall back to CPU kernels for unsupported operations. Open an issue in the PyTorch repository for any other unexpected behavior.
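As a minimal sketch, the fallback can be enabled from Python before `torch` is imported (setting it after import has no effect):

```python
import os

# Opt in to CPU fallback for ops the MPS backend doesn't implement yet.
# This must run before `import torch`, or the variable is ignored.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
```

Equivalently, export the variable in your shell (`export PYTORCH_ENABLE_MPS_FALLBACK=1`) before launching the training script.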
Model loading and device selection
MPS requires the entire model to fit in unified memory, so unlike CUDA, `device_map="auto"` can't offload layers to the CPU. If a model doesn't fit, try a smaller one.
Trainer detects MPS automatically with torch.backends.mps.is_available and sets the device to mps without any configuration changes.
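The same detection logic can be reproduced manually when writing a custom training loop; this sketch mirrors the check Trainer performs:

```python
import torch

# Prefer MPS when the backend is available, otherwise fall back to CPU.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

# Move model and inputs to the selected device as usual, e.g.:
# model.to(device)
```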
Mixed precision
MPS supports both bf16 and fp16 mixed precision (bf16 requires macOS 14.0 or later).
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./outputs",
    bf16=True,  # requires macOS 14.0+
)
```

Next steps
- Read the Introducing Accelerated PyTorch Training on Mac blog post for background on the MPS backend.