Update model card
README.md
CHANGED
@@ -17,14 +17,14 @@ tags:
 Please note: For commercial use, please refer to [https://stability.ai/license](https://stability.ai/license)
 
 ## Model Description
-
+Latent representations are at the heart of the majority of modern generative models.
 In the audio domain they are typically produced by a neural-audio-codec autoencoder.
 In this work we introduce SAME (Semantically Aligned Music autoEncoder),
 a transformer-based autoencoder for stereo music and general audio that reaches a 4096x temporal compression ratio (roughly twice the current standard)
 while maintaining excellent reconstruction quality and strong downstream generative performance.
 We achieve this by combining a set of semantic regularisation approaches with phase-aware reconstruction losses.
 The architecture also delivers substantial computational cost benefits, through both its high compression ratio and its reliance on well-optimised transformer primitives.
-Two variants (a large SAME-L and a CPU-deployable SAME-S) are released in open-weights form.
+Two variants (a large SAME-L and a CPU-deployable SAME-S) are released in open-weights form.
 
 ## Usage
 
@@ -92,7 +92,7 @@ reconstructed = reconstructed.to(torch.float32).clamp(-1, 1).mul(32767).to(torch
 ## Training dataset
 
 ### Datasets Used
-Our dataset consists of ~19,500 hours of licensed production audio from [
+Our dataset consists of ~19,500 hours of licensed production audio from [AudioSparx](https://www.audiosparx.com/) which includes a 66/25/9% mix of music, sound effects, and instrument stems.
 
 
 
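The two headline numbers in the changed text (the 4096x temporal compression ratio and the 66/25/9% dataset mix) can be turned into concrete figures with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, the 44.1 kHz sample rate is an assumption that the card text shown here does not state, and the percentages are read as applying to music, sound effects, and instrument stems in the order listed.

```python
# Back-of-the-envelope figures for numbers quoted in the model card.
# Nothing here comes from the SAME codebase; all names are hypothetical.

TOTAL_HOURS = 19_500  # "~19,500 hours" in the card, so treat results as approximate
MIX_PERCENT = {"music": 66, "sound effects": 25, "instrument stems": 9}

# ASSUMPTION: the card excerpt does not give a sample rate; 44.1 kHz is a guess.
ASSUMED_SAMPLE_RATE_HZ = 44_100


def hours_per_category(total_hours, mix_percent):
    """Split a total duration according to a percentage mix."""
    return {name: total_hours * pct / 100 for name, pct in mix_percent.items()}


def latent_frame_rate(sample_rate_hz, compression_ratio=4096):
    """Latent frames per second after temporal compression by the given ratio."""
    return sample_rate_hz / compression_ratio


if __name__ == "__main__":
    for name, hours in hours_per_category(TOTAL_HOURS, MIX_PERCENT).items():
        print(f"{name}: ~{hours:,.0f} hours")
    rate = latent_frame_rate(ASSUMED_SAMPLE_RATE_HZ)
    print(f"latent rate at {ASSUMED_SAMPLE_RATE_HZ} Hz: ~{rate:.2f} frames/s")
```

Under these assumptions the mix works out to roughly 12,870 hours of music, 4,875 hours of sound effects, and 1,755 hours of instrument stems, and a 4096x ratio at 44.1 kHz corresponds to about 10.8 latent frames per second. A different sample rate changes the second figure proportionally.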