---
language:
- en
library_name: stable-audio-3
license: other
license_name: stable-audio-community
license_link: LICENSE
pipeline_tag: text-to-audio
base_model: stabilityai/stable-audio-3-small-sfx-base
tags:
- audio-generation
- sound-effects
- diffusion
extra_gated_prompt: >-
  By clicking "Agree", you agree to the [License Agreement](https://huggingface.co/stabilityai/stable-audio-open-1.0/blob/main/LICENSE.md)
  and acknowledge Stability AI's [Privacy Policy](https://stability.ai/privacy-policy).
  This model also includes components redistributed under the [Gemma Terms of Use](https://ai.google.dev/gemma/terms).
  By proceeding, you agree to those terms as well, including the use restrictions in Section 3.2.
extra_gated_fields:
  Name: text
  Email: text
  Country: country
  Organization or Affiliation: text
  Receive email updates and promotions on Stability AI products, services, and research?:
    type: select
    options:
      - 'Yes'
      - 'No'
  What do you intend to use the model for?:
    type: select
    options:
      - Research
      - Personal use
      - Creative Professional
      - Startup
      - Enterprise
---

# Stable Audio 3 Small SFX

Please note: For commercial use, please refer to [https://stability.ai/license](https://stability.ai/license)

## Model Description
`Stable Audio 3` is a family of fast latent diffusion models (small, medium, large) for variable-length audio generation and editing. Because our models can generate several minutes of audio, variable-length generation is key to avoiding the cost of producing a full-length output for short sounds. We also support inpainting, which enables targeted audio editing and the continuation of short recordings. Our latent diffusion models operate on top of a novel semantic-acoustic autoencoder that projects audio into a compact latent space, enabling efficient diffusion-based generation while preserving audio fidelity and encouraging semantic structure in the latent. Finally, we run adversarial post-training to both accelerate inference and improve generation quality: it reduces the number of inference steps while improving fidelity and prompt adherence. Stable Audio 3 models are trained on licensed and Creative Commons data and generate music and sounds in less than 2 s on an H200 GPU and within a few seconds on a MacBook Pro M4. We release the weights of the small and medium models, which can run on consumer-grade hardware, together with their training and inference pipeline.

## Usage

This model can be used with:
1. the [`stable-audio-3`](https://github.com/Stability-AI/stable-audio-3) inference and fine-tuning library
2. the [`stable-audio-tools`](https://github.com/Stability-AI/stable-audio-tools) research library


### Using with `stable-audio-3`
```python
from stable_audio_3 import StableAudioModel

model = StableAudioModel.from_pretrained("small-sfx")
audio = model.generate(
    prompt="chugging train coming into station with horn",
    duration=7,
)
```

### Using with `stable-audio-tools`

```python
import torch
import torchaudio
from einops import rearrange
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond_inpaint

device = "cuda" if torch.cuda.is_available() else "cpu"
# Use half precision only on GPU; this also keeps `model_half` defined on CPU
model_half = device == "cuda"

# Download model
model, model_config = get_pretrained_model("stabilityai/stable-audio-3-small-sfx")
sample_rate = model_config["sample_rate"]
sample_size = model_config["sample_size"]

model = model.to(device)
if model_half:
    model = model.to(torch.float16)

# Set up text and timing conditioning
conditioning = [{
    "prompt": "chugging train coming into station with horn",
    "seconds_total": 7
}]

# Generate stereo audio
output = generate_diffusion_cond_inpaint(
    model,
    steps=8,
    cfg_scale=1.0,
    conditioning=conditioning,
    sample_size=sample_size,
    sampler_type="pingpong",
    device=device
)

# Rearrange audio batch to a single sequence
output = rearrange(output, "b d n -> d (b n)")

# Peak normalize, clip, convert to int16, and save to file
output = output.to(torch.float32).div(torch.max(torch.abs(output))).clamp(-1, 1).mul(32767).to(torch.int16).cpu()
torchaudio.save("output.wav", output, sample_rate)
```
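
The final step in the example above (peak normalize, clip, convert to int16) divides by the largest absolute sample, which is undefined for an all-zero signal. The sketch below shows the same step as a standalone NumPy helper with a guard for silent output. The function name `peak_normalize_to_int16` and the `eps` parameter are illustrative assumptions, not part of `stable-audio-tools`.

```python
import numpy as np


def peak_normalize_to_int16(audio: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Scale float audio so its largest absolute sample maps to full-scale int16.

    `audio` is a float array of shape (channels, samples). `eps` is an
    illustrative guard (an assumption, not from the snippet above) that keeps
    an all-zero signal from triggering a division by zero.
    """
    audio = np.asarray(audio, dtype=np.float32)
    peak = float(np.max(np.abs(audio))) if audio.size else 0.0
    if peak < eps:
        # Silent (or empty) input: return zeros instead of NaNs.
        return np.zeros(audio.shape, dtype=np.int16)
    scaled = np.clip(audio / peak, -1.0, 1.0) * 32767.0
    return scaled.astype(np.int16)


# Tiny stereo example: the loudest sample (|0.5|) becomes +/-32767.
stereo = np.array([[0.0, 0.25, -0.5], [0.1, 0.0, 0.5]], dtype=np.float32)
pcm = peak_normalize_to_int16(stereo)
```

With a PyTorch tensor, the equivalent guard would be to check the peak before dividing and skip normalization when it is zero.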


## Model Details
* **Model type**: `Stable Audio 3` is a latent diffusion model based on a transformer architecture.
* **Language(s)**: English
* **License**: [Stability AI Community License](https://stability.ai/license).
* **Research Paper**: [https://arxiv.org/abs/2605.17991](https://arxiv.org/abs/2605.17991)

We use a publicly available pre-trained T5Gemma model ([t5gemma-b-b-ul2](https://huggingface.co/google/t5gemma-b-b-ul2)) for text conditioning. T5Gemma is redistributed under the [Gemma Terms of Use](LICENSE_GEMMA.md).

## Training dataset

### Datasets Used
Our dataset consists of 1,278,902 audio recordings: 806,284 are licensed from [AudioSparx](https://www.audiosparx.com/) and a further 472,618 are from [Freesound](https://freesound.org/). The Freesound portion consists of recordings licensed under CC-0, CC-BY, or CC-Sampling+. To ensure that no copyrighted content was present in the Freesound data, we identified music recordings using the PANNs [89] tagger: audio that activated music-related tags for at least 30 s (threshold of 0.15) was flagged and sent to a trusted content detection company to verify the absence of copyrighted material. All identified copyrighted content was removed. After filtering, the Freesound portion includes 266,324 CC-0, 194,840 CC-BY, and 11,454 CC-Sampling+ recordings. This is the same subset of Freesound audio that we used to train Stable Audio Open: https://info.stability.ai/attributions.
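
The music-flagging rule described above (music-related tags active for at least 30 s at a threshold of 0.15) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the pipeline actually used. It assumes the tagger output has already been reduced to one music score per fixed-length frame, and that "at least 30 s" means cumulative rather than consecutive time; the text does not specify either. All function and variable names are hypothetical.

```python
from typing import Sequence


def music_active_seconds(
    music_probs: Sequence[float],
    hop_seconds: float,
    threshold: float = 0.15,
) -> float:
    """Total time during which the per-frame music score reaches `threshold`.

    Assumption: frames are counted cumulatively, not as one consecutive run.
    """
    active_frames = sum(1 for p in music_probs if p >= threshold)
    return active_frames * hop_seconds


def should_flag_for_review(
    music_probs: Sequence[float],
    hop_seconds: float,
    threshold: float = 0.15,
    min_seconds: float = 30.0,
) -> bool:
    """Flag a recording when music is active for at least `min_seconds`."""
    return music_active_seconds(music_probs, hop_seconds, threshold) >= min_seconds


# Hypothetical scores: one value per second for two 60 s clips.
mostly_music = [0.4] * 35 + [0.05] * 25
mostly_ambience = [0.4] * 10 + [0.05] * 50

flag_music = should_flag_for_review(mostly_music, hop_seconds=1.0)
flag_ambience = should_flag_for_review(mostly_ambience, hop_seconds=1.0)
```

Under these assumptions, the first clip (35 s above threshold) would be sent for review and the second (10 s above threshold) would not.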