Update model card
README.md
CHANGED
@@ -17,14 +17,14 @@ tags:
 Please note: For commercial use, please refer to [https://stability.ai/license](https://stability.ai/license)
 
 ## Model Description
-
+Latent representations are at the heart of the majority of modern generative models.
 In the audio domain they are typically produced by a neural-audio-codec autoencoder.
 In this work we introduce SAME (Semantically Aligned Music autoEncoder),
 a transformer-based autoencoder for stereo music and general audio that reaches a 4096x temporal compression ratio (roughly twice the current standard)
 while maintaining excellent reconstruction quality and strong downstream generative performance.
 We achieve this by combining a set of semantic regularisation approaches with phase-aware reconstruction losses.
 The architecture also delivers substantial computational cost benefits, through both its high compression ratio and its reliance on well-optimised transformer primitives.
-Two variants (a large SAME-L and a CPU-deployable SAME-S) are released in open-weights form.
+Two variants (a large SAME-L and a CPU-deployable SAME-S) are released in open-weights form.
 
 ## Usage
 
@@ -34,6 +34,7 @@ This model can be used with:
 
 
 ### Using with `stable-audio-3`
+```python
 import torchaudio
 from stable_audio_3 import AutoencoderModel
 
@@ -41,7 +42,7 @@ ae = AutoencoderModel.from_pretrained("same-s")
 waveform, sr = torchaudio.load("audio.wav")
 latents = ae.encode(waveform, sr)
 audio_out = ae.decode(latents)
-
+```
 
 ### Using with `stable-audio-tools`
 
@@ -92,7 +93,7 @@ reconstructed = reconstructed.to(torch.float32).clamp(-1, 1).mul(32767).to(torch
 ## Training dataset
 
 ### Datasets Used
-Our dataset consists of ~19,500 hours of licensed production audio from [
+Our dataset consists of ~19,500 hours of licensed production audio from [AudioSparx](https://www.audiosparx.com/) which includes a 66/25/9% mix of music, sound effects, and instrument stems.
 
 
 
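The two figures quoted in the updated model card (the 4096x temporal compression ratio and the 66/25/9% dataset mix) can be made concrete with a little arithmetic. The sketch below is illustrative only and does not use the SAME models or any library API. It rests on three assumptions that the card does not state: the 44,100 Hz sample rate, the choice to round a partial final frame up, and the reading that the percentages are listed in the same order as the categories. The helper names `latent_frames` and `dataset_hours` are hypothetical.

```python
# Illustrative arithmetic only; nothing here calls the SAME models.
COMPRESSION_RATIO = 4096         # temporal compression ratio quoted in the model card
ASSUMED_SAMPLE_RATE_HZ = 44_100  # ASSUMPTION: example sample rate, not stated in the card


def latent_frames(num_samples, ratio=COMPRESSION_RATIO):
    """Latent frames needed to cover num_samples audio samples.

    ASSUMPTION: a partial final frame is rounded up (ceiling division).
    """
    return -(-num_samples // ratio)


def dataset_hours(total_hours, shares_percent):
    """Split a total duration into per-category hours from percentage shares."""
    return {name: total_hours * pct / 100 for name, pct in shares_percent.items()}


# Latent frame rate implied by the compression ratio at the assumed sample rate.
frames_per_second = ASSUMED_SAMPLE_RATE_HZ / COMPRESSION_RATIO

# Latent frames for a 10-second clip at the assumed sample rate.
ten_second_frames = latent_frames(10 * ASSUMED_SAMPLE_RATE_HZ)

# The card gives roughly 19,500 hours in total, so these are approximate hours.
split = dataset_hours(
    19_500,
    {"music": 66, "sound effects": 25, "instrument stems": 9},
)

print(f"{frames_per_second:.2f} latent frames per second")
print(f"{ten_second_frames} latent frames for 10 s of audio")
for name, hours in split.items():
    print(f"{name}: ~{hours:,.0f} h")
```

Under these assumptions a latent sequence runs at roughly 10.8 frames per second, which is the property behind the computational savings the description mentions: a downstream generative model has far fewer positions to process per second of audio.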