stabilityai/stable-audio-3-optimized

@@ -18,8 +18,6 @@ tags:
 > **Note:** This repository contains experimental checkpoints optimised for acceleration on specific hardware. For standard checkpoints, please use [Stable Audio 3 Medium](https://huggingface.co/stabilityai/stable-audio-3-medium) instead.
-![Stable Audio 3 logo](./Stable_Audio_3.0_Thumbnail_1x1.png)
 Please note: For commercial use, please refer to [https://stability.ai/license](https://stability.ai/license)
 ## Model Description
@@ -41,17 +39,17 @@ This model can be used with:
 ### Using with `stable-audio-3`
 from stable_audio_3 import StableAudioModel
 model = StableAudioModel.from_pretrained("medium")
 audio = model.generate(
     prompt="House music that encapsulates the feeling of being at a festival in the sunny weather with all your friends 124 BPM",
-    duration=180,
 )
 ### Using with `stable-audio-tools`
 ```python
 import torch
 import torchaudio
@@ -98,7 +96,7 @@ torchaudio.save("output.wav", output, sample_rate)
 ## Model Details
-* **Model type**: `Stable Audio Open 3` is a latent diffusion model based on a transformer architecture.
 * **Language(s)**: English
 * **License**: [Stability AI Community License](https://huggingface.co/stabilityai/stable-audio-3/blob/main/LICENSE.md).
 * **Commercial License**: to use this model commercially, please refer to [https://stability.ai/license](https://stability.ai/license)
@@ -109,13 +107,8 @@ We use a publicly available pre-trained T5Gemma model ([t5gemma-b-b-ul2](https:/
 ## Training dataset
 ### Datasets Used
-Our dataset consists of 1,278,902 audio recordings, where 806,284 recordings are licensed from [Audiosparx](https://www.audiosparx.com/) and a further 472,618 are from [Freesound](https://freesound.org/).
 The Freesound portion consists of recordings licensed under CC-0, CC-BY, or CC-Sampling+. To ensure no copyrighted content was present in the Freesound data, music recordings were identified
 using the PANNs [89] tagger. We flagged audio that activated music-related tags for at least 30 seconds (threshold of 0.15)
 and sent it to a trusted content detection company to verify the absence of copyrighted material. All identified copyrighted content was removed. After filtering, the Freesound part includes 266,324 CC-0, 194,840 CC-BY, and 11,454
-CC-Sampling+ recordings. The same subset of Freesound audio we used to train Stable Audio Open: https://info.stability.ai/attributions. All stable-audio-3 small models are initially pre-trained on a mixture of AudioSparx and Freesound.
-But for the final stage of pre-training, distillation warmup, and post-training, we use AudioSparx for small-music
-and a higher-quality subset of Freesound for small-sfx. As a result, note that medium and large models are able to
-handle both music and sound effect generation within a single unified model. However, we find that for small models
-the inclusion of sound effects data degrades musical coherence. By isolating the sound effects subset into small-sfx,
-we mitigate this interference and obtain improved perceptual quality in both domains.

 > **Note:** This repository contains experimental checkpoints optimised for acceleration on specific hardware. For standard checkpoints, please use [Stable Audio 3 Medium](https://huggingface.co/stabilityai/stable-audio-3-medium) instead.
 Please note: For commercial use, please refer to [https://stability.ai/license](https://stability.ai/license)
 ## Model Description
 ### Using with `stable-audio-3`
+```python
 from stable_audio_3 import StableAudioModel
 model = StableAudioModel.from_pretrained("medium")
 audio = model.generate(
     prompt="House music that encapsulates the feeling of being at a festival in the sunny weather with all your friends 124 BPM",
+    duration=180
 )
+```
 ### Using with `stable-audio-tools`
 ```python
 import torch
 import torchaudio
 ## Model Details
+* **Model type**: `Stable Audio 3` is a latent diffusion model based on a transformer architecture.
 * **Language(s)**: English
 * **License**: [Stability AI Community License](https://huggingface.co/stabilityai/stable-audio-3/blob/main/LICENSE.md).
 * **Commercial License**: to use this model commercially, please refer to [https://stability.ai/license](https://stability.ai/license)
 ## Training dataset
 ### Datasets Used
+Our dataset consists of 1,278,902 audio recordings, where 806,284 recordings are licensed from [AudioSparx](https://www.audiosparx.com/) and a further 472,618 are from [Freesound](https://freesound.org/).
 The Freesound portion consists of recordings licensed under CC-0, CC-BY, or CC-Sampling+. To ensure no copyrighted content was present in the Freesound data, music recordings were identified
 using the PANNs [89] tagger. We flagged audio that activated music-related tags for at least 30 seconds (threshold of 0.15)
 and sent it to a trusted content detection company to verify the absence of copyrighted material. All identified copyrighted content was removed. After filtering, the Freesound part includes 266,324 CC-0, 194,840 CC-BY, and 11,454
+CC-Sampling+ recordings. This is the same subset of Freesound audio that we used to train Stable Audio Open: https://info.stability.ai/attributions.
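
The music-flagging rule described under "Datasets Used" can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `flag_for_review`, the per-frame probability input, and the choice to accumulate active time over the whole clip (rather than require one contiguous 30-second run) are all assumptions. Only the 0.15 threshold and the 30-second duration come from the text.

```python
def flag_for_review(music_probs, frame_seconds, threshold=0.15, min_seconds=30.0):
    """Decide whether a clip should be sent for copyright review.

    music_probs:   per-frame probabilities for a music-related tag
                   (hypothetical input; a real tagger's output format may differ).
    frame_seconds: duration in seconds covered by each frame.
    threshold:     activation threshold (0.15 in the model card).
    min_seconds:   minimum total active time (30 s in the model card).
    """
    # Count frames whose music-tag probability reaches the threshold.
    # Assumption: ">=" rather than ">"; the source does not say which.
    active_frames = sum(1 for p in music_probs if p >= threshold)
    # Assumption: active time is summed over the clip, not required to be contiguous.
    return active_frames * frame_seconds >= min_seconds


# 40 s of strong music activation followed by 20 s of near-silence, 1 s frames.
long_music_clip = [0.6] * 40 + [0.05] * 20
print(flag_for_review(long_music_clip, frame_seconds=1.0))   # True

# A 10 s jingle is too short to reach the 30 s requirement.
short_jingle = [0.9] * 10
print(flag_for_review(short_jingle, frame_seconds=1.0))      # False
```

If the original rule required a contiguous 30-second run of activations, the sum would need to be replaced with a longest-run computation; the text does not specify which reading is intended.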