stabilityai/SAME-L

mattricesound committed 2 days ago

Commit 156e879 (verified) · 1 Parent(s): a0f2a49

Update model card

Files changed (1)

README.md +3 -3

@@ -17,14 +17,14 @@ tags:
 Please note: For commercial use, please refer to [https://stability.ai/license](https://stability.ai/license)
 ## Model Description
-`Latent representations are at the heart of the majority of modern generative models.
 In the audio domain they are typically produced by a neural-audio-codec autoencoder.
 In this work we introduce SAME (Semantically Aligned Music autoEncoder),
 a transformer-based autoencoder for stereo music and general audio that reaches a 4096x temporal compression ratio (roughly twice the current standard)
 while maintaining excellent reconstruction quality and strong downstream generative performance.
 We achieve this by combining a set of semantic regularisation approaches with phase-aware reconstruction losses.
 The architecture also delivers substantial computational cost benefits, through both its high compression ratio and its reliance on well-optimised transformer primitives.
-Two variants (a large SAME-L and a CPU-deployable SAME-S) are released in open-weights form.`
 ## Usage
@@ -92,7 +92,7 @@ reconstructed = reconstructed.to(torch.float32).clamp(-1, 1).mul(32767).to(torch
 ## Training dataset
 ### Datasets Used
-Our dataset consists of ~19,500 hours of licensed production audio from [Audiosparx](https://www.audiosparx.com/) which includes a 66/25/9% mix of music, sound effects, and instrument stems.

 Please note: For commercial use, please refer to [https://stability.ai/license](https://stability.ai/license)
 ## Model Description
+Latent representations are at the heart of the majority of modern generative models.
 In the audio domain they are typically produced by a neural-audio-codec autoencoder.
 In this work we introduce SAME (Semantically Aligned Music autoEncoder),
 a transformer-based autoencoder for stereo music and general audio that reaches a 4096x temporal compression ratio (roughly twice the current standard)
 while maintaining excellent reconstruction quality and strong downstream generative performance.
 We achieve this by combining a set of semantic regularisation approaches with phase-aware reconstruction losses.
 The architecture also delivers substantial computational cost benefits, through both its high compression ratio and its reliance on well-optimised transformer primitives.
+Two variants (a large SAME-L and a CPU-deployable SAME-S) are released in open-weights form.
 ## Usage
 ## Training dataset
 ### Datasets Used
+Our dataset consists of ~19,500 hours of licensed production audio from [AudioSparx](https://www.audiosparx.com/) which includes a 66/25/9% mix of music, sound effects, and instrument stems.
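
To make the two headline figures in the card concrete, here is a minimal arithmetic sketch. It only restates numbers quoted above (the 4096x temporal compression ratio, ~19,500 hours, and the 66/25/9% mix). The 44.1 kHz sample rate, the round-up of partial frames, and all function names are assumptions made for illustration; they are not taken from the model card or from the SAME code.

```python
# Illustrative arithmetic only; function names are hypothetical, not part of any SAME API.

COMPRESSION_RATIO = 4096  # temporal compression ratio quoted in the model card


def latent_frame_rate(sample_rate_hz: int) -> float:
    """Latent frames per second of audio at the given sample rate."""
    return sample_rate_hz / COMPRESSION_RATIO


def num_latent_frames(num_samples: int) -> int:
    """Latent frames for a clip, assuming a partial final frame is padded (rounds up)."""
    return -(-num_samples // COMPRESSION_RATIO)


def split_hours(total_hours: int, percentages: dict[str, int]) -> dict[str, float]:
    """Approximate hours per category from integer percentage shares."""
    return {name: total_hours * pct / 100 for name, pct in percentages.items()}


if __name__ == "__main__":
    sample_rate = 44_100  # assumed sample rate, not stated in the card
    print(f"{latent_frame_rate(sample_rate):.2f} latent frames per second")
    print(f"{num_latent_frames(10 * sample_rate)} latent frames for a 10 s clip")

    mix = {"music": 66, "sound effects": 25, "instrument stems": 9}
    for name, hours in split_hours(19_500, mix).items():
        print(f"{name}: ~{hours:,.0f} hours")
```

Under these assumptions, one second of audio maps to roughly eleven latent frames, which is where the computational savings described in the Model Description come from: a downstream generative model operates on a sequence that is 4096 times shorter than the waveform.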