+---
+language:
+- en
+library_name: stable-audio-3
+license: other
+license_name: stable-audio-community
+license_link: LICENSE
+tags:
+- music
+- audio
+- autoencoder
+---
+# SAME: A Semantically-Aligned Music Autoencoder
+Please note: For commercial use, please refer to [https://stability.ai/license](https://stability.ai/license)
+## Model Description
+Latent representations are at the heart of the majority of modern generative models.
+In the audio domain they are typically produced by a neural-audio-codec autoencoder.
+In this work we introduce SAME (Semantically Aligned Music autoEncoder),
+a transformer-based autoencoder for stereo music and general audio that reaches a 4096x temporal compression ratio (roughly twice the current standard)
+while maintaining excellent reconstruction quality and strong downstream generative performance.
+We achieve this by combining a set of semantic regularisation approaches with phase-aware reconstruction losses.
+The architecture also delivers substantial computational cost benefits, through both its high compression ratio and its reliance on well-optimised transformer primitives.
+Two variants (a large SAME-L and a CPU-deployable SAME-S) are released in open-weights form.
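To make the 4096x figure concrete, the sketch below computes how many latent frames a clip would occupy. It assumes a 44.1 kHz sample rate and that the input is padded up to a whole number of frames; neither detail is stated in the description, and `latent_frames` is a hypothetical helper that is not part of either library.

```python
import math

COMPRESSION_RATIO = 4096  # temporal downsampling factor quoted in the description


def latent_frames(num_samples: int, ratio: int = COMPRESSION_RATIO) -> int:
    """Number of latent frames needed to cover num_samples audio samples.

    Assumes the input is padded up to a whole number of frames (hypothetical).
    """
    return math.ceil(num_samples / ratio)


sample_rate = 44_100  # assumed; check model_config["sample_rate"] for the real value
ten_seconds = 10 * sample_rate

print(latent_frames(ten_seconds))       # latent frames for a 10 s clip
print(sample_rate / COMPRESSION_RATIO)  # latent frames per second of audio
```

At the 2048x ratio implied by "roughly twice the current standard", the same clip would need about twice as many frames, which is where the computational saving for downstream generative models comes from.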
+## Usage
+This model can be used with:
+1. the [`stable-audio-3`](https://github.com/Stability-AI/stable-audio-3) inference and fine-tuning library
+2. the [`stable-audio-tools`](https://github.com/Stability-AI/stable-audio-tools) research library
+### Using with `stable-audio-3`
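+```python
+import torchaudio
+from stable_audio_3 import AutoencoderModel
+# Load the pretrained autoencoder
+ae = AutoencoderModel.from_pretrained("same-s")
+# Encode a waveform to latents, then decode it back to audio
+waveform, sr = torchaudio.load("audio.wav")
+latents = ae.encode(waveform, sr)
+audio_out = ae.decode(latents)
+```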
+### Using with `stable-audio-tools`
+```python
+import torch
+import torchaudio
+from stable_audio_tools import get_pretrained_model
+device = "cuda" if torch.cuda.is_available() else "cpu"
+# Use half precision only on GPU; defined on every path so the CPU case does not raise NameError
+model_half = device == "cuda"
+# Download model
+model, model_config = get_pretrained_model("stabilityai/SAME-S")
+sample_rate = model_config["sample_rate"]
+sample_size = model_config["sample_size"]
+model = model.to(device)
+if model_half:
+  model = model.to(torch.float16)
+audio, sr = torchaudio.load("/path/to/audiofile")  # [channels, samples]
+# Resample if the file's rate differs from the rate the model expects
+if sr != sample_rate:
+    audio = torchaudio.functional.resample(audio, sr, sample_rate)
+if audio.shape[0] == 1:
+    audio = audio.repeat(2, 1)
+audio = audio.unsqueeze(0).to(device)
+if model_half:
+  audio = audio.half()
+with torch.no_grad():
+    latents = model.encode_audio(audio)
+    reconstructed = model.decode_audio(latents)
+reconstructed = reconstructed.squeeze(0)
+# Convert to 16-bit PCM and move to the CPU
+reconstructed = reconstructed.to(torch.float32).clamp(-1, 1).mul(32767).to(torch.int16).cpu()
+```
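The final conversion line in the example above clamps the decoder output to [-1, 1] and scales it to 16-bit PCM. A scalar, pure-Python sketch of that step is shown here for illustration. `float_to_pcm16` is a hypothetical helper, and the sketch assumes the tensor cast truncates toward zero in the same way Python's `int()` does.

```python
def float_to_pcm16(x: float) -> int:
    """Map one floating-point sample to a signed 16-bit PCM value.

    Mirrors clamp(-1, 1).mul(32767) followed by an integer cast,
    assuming the cast truncates toward zero.
    """
    clamped = max(-1.0, min(1.0, x))  # out-of-range samples are clipped
    return int(clamped * 32767)       # int() truncates toward zero


for sample in (0.0, 0.5, 1.5, -2.0):
    print(sample, float_to_pcm16(sample))
```

Scaling by 32767 rather than 32768 keeps the output symmetric, so a full-scale sample of either sign stays inside the int16 range.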
+## Model Details
+* **Model type**: `SAME` is a continuous autoencoder model based on a transformer architecture.
+* **Language(s)**: English
+* **License**: [Stability AI Community License](https://huggingface.co/stabilityai/SAME-S/blob/main/LICENSE.md).
+* **Commercial License**: to use this model commercially, please refer to [https://stability.ai/license](https://stability.ai/license)
+* **Research Paper**: [https://arxiv.org/pdf/2605.18613](https://arxiv.org/pdf/2605.18613)
+## Training dataset
+### Datasets Used
+Our dataset consists of ~19,500 hours of licensed production audio from [Audiosparx](https://www.audiosparx.com/), split roughly 66% music, 25% sound effects, and 9% instrument stems.
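As a rough worked example, the quoted mix can be turned into approximate hours per category. This assumes the percentages are measured by duration, which the card does not state, and treats the ~19,500 hour total as exact; `hours_per_category` is a hypothetical helper.

```python
TOTAL_HOURS = 19_500  # approximate total quoted above
MIX_PERCENT = {"music": 66, "sound effects": 25, "instrument stems": 9}


def hours_per_category(total_hours: float, mix_percent: dict) -> dict:
    """Split a total duration by percentage share (assumes shares are by duration)."""
    return {name: total_hours * pct / 100 for name, pct in mix_percent.items()}


hours = hours_per_category(TOTAL_HOURS, MIX_PERCENT)
for name, value in hours.items():
    print(f"{name}: ~{value:,.0f} hours")
```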