stabilityai/SAME-S

@@ -17,14 +17,14 @@ tags:
 Please note: For commercial use, please refer to [https://stability.ai/license](https://stability.ai/license)
 ## Model Description
-`Latent representations are at the heart of the majority of modern generative models.
 In the audio domain they are typically produced by a neural-audio-codec autoencoder.
 In this work we introduce SAME (Semantically Aligned Music autoEncoder),
 a transformer-based autoencoder for stereo music and general audio that reaches a 4096x temporal compression ratio (roughly twice the current standard)
 while maintaining excellent reconstruction quality and strong downstream generative performance.
 We achieve this by combining a set of semantic regularisation approaches with phase-aware reconstruction losses.
 The architecture also delivers substantial computational cost benefits, through both its high compression ratio and its reliance on well-optimised transformer primitives.
-Two variants (a large SAME-L and a CPU-deployable SAME-S) are released in open-weights form.`
 ## Usage
@@ -34,6 +34,7 @@ This model can be used with:
 ### Using with `stable-audio-3`
 import torchaudio
 from stable_audio_3 import AutoencoderModel
@@ -41,7 +42,7 @@ ae = AutoencoderModel.from_pretrained("same-s")
 waveform, sr = torchaudio.load("audio.wav")
 latents = ae.encode(waveform, sr)
 audio_out = ae.decode(latents)
 ### Using with `stable-audio-tools`
@@ -92,7 +93,7 @@ reconstructed = reconstructed.to(torch.float32).clamp(-1, 1).mul(32767).to(torch
 ## Training dataset
 ### Datasets Used
-Our dataset consists of ~19,500 hours of licensed production audio from [Audiosparx](https://www.audiosparx.com/) which includes a 66/25/9% mix of music, sound effects, and instrument stems.

 Please note: For commercial use, please refer to [https://stability.ai/license](https://stability.ai/license)
 ## Model Description
+Latent representations are at the heart of the majority of modern generative models.
 In the audio domain they are typically produced by a neural-audio-codec autoencoder.
 In this work we introduce SAME (Semantically Aligned Music autoEncoder),
 a transformer-based autoencoder for stereo music and general audio that reaches a 4096x temporal compression ratio (roughly twice the current standard)
 while maintaining excellent reconstruction quality and strong downstream generative performance.
 We achieve this by combining a set of semantic regularisation approaches with phase-aware reconstruction losses.
 The architecture also delivers substantial computational cost benefits, through both its high compression ratio and its reliance on well-optimised transformer primitives.
+Two variants (a large SAME-L and a CPU-deployable SAME-S) are released in open-weights form.
 ## Usage
 ### Using with `stable-audio-3`
+```python
 import torchaudio
 from stable_audio_3 import AutoencoderModel
 waveform, sr = torchaudio.load("audio.wav")
 latents = ae.encode(waveform, sr)
 audio_out = ae.decode(latents)
+```
 ### Using with `stable-audio-tools`
 ## Training dataset
 ### Datasets Used
+Our dataset consists of ~19,500 hours of licensed production audio from [AudioSparx](https://www.audiosparx.com/) which includes a 66/25/9% mix of music, sound effects, and instrument stems.
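
The two figures quoted in the card, the 4096x temporal compression ratio and the 66/25/9% split of ~19,500 hours, can be turned into concrete numbers with a little arithmetic. The sketch below is illustrative only: the 44.1 kHz sample rate is an assumption (the card does not state one), rounding the frame count up assumes the input is padded to a multiple of the ratio, and the helper names are hypothetical rather than part of `stable-audio-3` or `stable-audio-tools`.

```python
import math

# Temporal compression ratio stated in the model description.
COMPRESSION_RATIO = 4096

# ASSUMPTION: the card does not state a sample rate; 44.1 kHz is used
# here purely for illustration.
ASSUMED_SAMPLE_RATE = 44_100


def latent_rate_hz(sample_rate: int, ratio: int = COMPRESSION_RATIO) -> float:
    """Latent frames per second of audio at the given sample rate."""
    return sample_rate / ratio


def latent_frames(num_samples: int, ratio: int = COMPRESSION_RATIO) -> int:
    """Latent frames for a clip of num_samples per channel.

    ASSUMPTION: the input is padded up to a multiple of the ratio,
    so the frame count is rounded up.
    """
    return math.ceil(num_samples / ratio)


def split_hours(total_hours: float, percentages: dict[str, float]) -> dict[str, float]:
    """Split a total duration according to percentage shares."""
    return {name: total_hours * pct / 100 for name, pct in percentages.items()}


if __name__ == "__main__":
    # A hypothetical 10-second clip at the assumed sample rate.
    print(latent_rate_hz(ASSUMED_SAMPLE_RATE))
    print(latent_frames(10 * ASSUMED_SAMPLE_RATE))

    # Approximate hours per category, from the ~19,500 h total in the card.
    mix = {"music": 66, "sound effects": 25, "instrument stems": 9}
    print(split_hours(19_500, mix))
```

Under these assumptions one second of audio maps to roughly eleven latent frames, and the dataset split works out to about 12,870 hours of music, 4,875 hours of sound effects, and 1,755 hours of instrument stems. All of these are approximate, since the total itself is given only as "~19,500 hours".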