| --- |
| language: |
| - en |
| library_name: stable-audio-3 |
| license: other |
| license_name: stable-audio-community |
| license_link: LICENSE |
| pipeline_tag: text-to-audio |
| base_model: stabilityai/stable-audio-3-small-sfx-base |
| tags: |
| - audio-generation |
| - sound-effects |
| - diffusion |
| extra_gated_prompt: >- |
| By clicking "Agree", you agree to the [License Agreement](https://huggingface.co/stabilityai/stable-audio-open-1.0/blob/main/LICENSE.md) |
| and acknowledge Stability AI's [Privacy Policy](https://stability.ai/privacy-policy). |
| This model also includes components redistributed under the [Gemma Terms of Use](https://ai.google.dev/gemma/terms). |
| By proceeding, you agree to those terms as well, including the use restrictions in Section 3.2. |
| extra_gated_fields: |
| Name: text |
| Email: text |
| Country: country |
| Organization or Affiliation: text |
| Receive email updates and promotions on Stability AI products, services, and research?: |
| type: select |
| options: |
| - 'Yes' |
| - 'No' |
| What do you intend to use the model for?: |
| type: select |
| options: |
| - Research |
| - Personal use |
| - Creative Professional |
| - Startup |
| - Enterprise |
| --- |
| |
| # Stable Audio 3 Small SFX |
|
|
| Please note: For commercial use, please refer to [https://stability.ai/license](https://stability.ai/license) |
|
|
| ## Model Description |
| `Stable Audio 3` is a family of fast latent diffusion models (small, medium, large) for variable-length audio generation and editing. Because our models can generate several minutes of audio, |
| variable-length generation is key to avoiding the cost of producing full-length outputs for short |
| sounds. We also support inpainting, which enables targeted audio editing and the continuation of short |
| recordings. Our latent diffusion models operate on top of a novel semantic-acoustic autoencoder that |
| projects audio into a compact latent space, enabling efficient diffusion-based generation while preserving audio fidelity and encouraging semantic structure in the latent. Finally, we run adversarial |
| post-training to both accelerate inference and improve generation quality, reducing the number of inference steps while improving fidelity and prompt adherence. Stable Audio 3 models are trained on |
| licensed and Creative Commons data and generate music and sounds in less than 2 s on an H200 GPU |
| and within a few seconds on a MacBook Pro M4. We release the weights of the small and medium models, |
| which can run on consumer-grade hardware, together with their training and inference pipeline. |
|
|
| ## Usage |
|
|
| This model can be used with: |
| 1. the [`stable-audio-3`](https://github.com/Stability-AI/stable-audio-3) inference and fine-tuning library |
| 2. the [`stable-audio-tools`](https://github.com/Stability-AI/stable-audio-tools) research library |
|
|
|
|
| ### Using with `stable-audio-3` |
| ```python |
| from stable_audio_3 import StableAudioModel |
| |
| model = StableAudioModel.from_pretrained("small-sfx") |
| audio = model.generate( |
|     prompt="chugging train coming into station with horn", |
|     duration=7, |
| ) |
| ``` |
|
|
| ### Using with `stable-audio-tools` |
|
|
| ```python |
| import torch |
| import torchaudio |
| from einops import rearrange |
| from stable_audio_tools import get_pretrained_model |
| from stable_audio_tools.inference.generation import generate_diffusion_cond_inpaint |
| |
| device = "cuda" if torch.cuda.is_available() else "cpu" |
| # Use half precision only when running on a GPU |
| model_half = device == "cuda" |
| |
| # Download model |
| model, model_config = get_pretrained_model("stabilityai/stable-audio-3-small-sfx") |
| sample_rate = model_config["sample_rate"] |
| sample_size = model_config["sample_size"] |
| |
| model = model.to(device) |
| if model_half: |
|     model = model.to(torch.float16) |
| |
| # Set up text and timing conditioning |
| conditioning = [{ |
|     "prompt": "chugging train coming into station with horn", |
|     "seconds_total": 7 |
| }] |
| |
| # Generate stereo audio |
| output = generate_diffusion_cond_inpaint( |
|     model, |
|     steps=8, |
|     cfg_scale=1.0, |
|     conditioning=conditioning, |
|     sample_size=sample_size, |
|     sampler_type="pingpong", |
|     device=device |
| ) |
| |
| # Rearrange audio batch to a single sequence |
| output = rearrange(output, "b d n -> d (b n)") |
| |
| # Peak normalize, clip, convert to int16, and save to file |
| output = output.to(torch.float32).div(torch.max(torch.abs(output))).clamp(-1, 1).mul(32767).to(torch.int16).cpu() |
| torchaudio.save("output.wav", output, sample_rate) |
| ``` |
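The final step of the snippet above (peak normalize, clip, convert to int16) can be illustrated without any dependencies. The following is a minimal pure-Python sketch of that arithmetic, not part of either library: the helper name `to_int16_pcm` and the `eps` guard for silent clips are assumptions added for illustration. The guard matters because dividing by a peak of zero would otherwise produce invalid samples.

```python
def to_int16_pcm(samples, eps=1e-9):
    """Peak-normalize float samples, clip to [-1, 1], and scale to int16 range.

    `samples` is a flat list of floats for one channel. `eps` is a
    hypothetical guard (not in the snippet above) that avoids dividing
    by zero when the clip is entirely silent.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak < eps:
        # Silent (or empty) input: nothing to normalize.
        return [0 for _ in samples]

    pcm = []
    for s in samples:
        normalized = max(-1.0, min(1.0, s / peak))  # clip after normalizing
        pcm.append(int(normalized * 32767))  # truncates toward zero, like a cast
    return pcm


print(to_int16_pcm([0.5, -0.25, 0.0]))  # [32767, -16383, 0]
```

With tensors, the same guard could be expressed by clamping the peak to a small minimum before dividing.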
|
|
|
|
| ## Model Details |
| * **Model type**: `Stable Audio 3` is a latent diffusion model based on a transformer architecture. |
| * **Language(s)**: English |
| * **License**: [Stability AI Community License](https://stability.ai/license). |
| * **Research Paper**: [https://arxiv.org/abs/2605.17991](https://arxiv.org/abs/2605.17991) |
|
|
| We use a publicly available pre-trained T5Gemma model ([t5gemma-b-b-ul2](https://huggingface.co/google/t5gemma-b-b-ul2)) for text conditioning. T5Gemma is redistributed under the [Gemma Terms of Use](LICENSE_GEMMA.md). |
|
|
| ## Training dataset |
|
|
| ### Datasets Used |
| Our dataset consists of 1,278,902 audio recordings: 806,284 are licensed from [AudioSparx](https://www.audiosparx.com/) and a further 472,618 are from [Freesound](https://freesound.org/). |
| The Freesound portion consists of recordings licensed under CC-0, CC-BY, or CC-Sampling+. To ensure that no copyrighted content was present in the Freesound data, music recordings were identified |
| using the PANNs [89] tagger. Audio that activated music-related tags for at least 30 s (at a threshold of 0.15) was flagged |
| and sent to a trusted content detection company to verify the absence of copyrighted material. All identified copyrighted content was removed. After filtering, the Freesound portion includes 266,324 CC-0, 194,840 CC-BY, and 11,454 |
| CC-Sampling+ recordings. This is the same subset of Freesound audio that we used to train Stable Audio Open; see https://info.stability.ai/attributions. |
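The music-flagging rule described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline that was actually used: it assumes the tagger yields one music-tag probability per fixed-length frame, that activity is counted cumulatively rather than as one contiguous run, and that a frame counts as active when its probability is at or above the threshold. The function names and the `hop_seconds` parameter are hypothetical.

```python
def music_active_seconds(frame_probs, hop_seconds, threshold=0.15):
    """Total duration (in seconds) of frames whose music-tag probability
    meets the threshold. Assumes one probability per fixed-length frame."""
    active_frames = sum(1 for p in frame_probs if p >= threshold)
    return active_frames * hop_seconds


def should_flag(frame_probs, hop_seconds, threshold=0.15, min_seconds=30.0):
    """Flag a recording for external review when music-related tags are
    active for at least `min_seconds` in total (cumulative, an assumption)."""
    return music_active_seconds(frame_probs, hop_seconds, threshold) >= min_seconds


# Example: 40 one-second frames, of which 30 exceed the threshold.
probs = [0.2] * 30 + [0.05] * 10
print(should_flag(probs, hop_seconds=1.0))
```

Flagged recordings would then go to the external verification step described in the paragraph above; recordings below the duration cutoff would pass through unchanged.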
|
|