language:
  - en
library_name: stable-audio-3
license: other
license_name: stable-audio-community
license_link: LICENSE
pipeline_tag: text-to-audio
base_model: stabilityai/stable-audio-3-small-sfx-base
tags:
  - audio-generation
  - sound-effects
  - diffusion
extra_gated_prompt: >-
  By clicking "Agree", you agree to the [License
  Agreement](https://huggingface.co/stabilityai/stable-audio-open-1.0/blob/main/LICENSE.md)
  and acknowledge Stability AI's [Privacy
  Policy](https://stability.ai/privacy-policy). This model also includes
  components redistributed under the [Gemma Terms of
  Use](https://ai.google.dev/gemma/terms). By proceeding, you agree to those
  terms as well, including the use restrictions in Section 3.2.
extra_gated_fields:
  Name: text
  Email: text
  Country: country
  Organization or Affiliation: text
  Receive email updates and promotions on Stability AI products, services, and research?:
    type: select
    options:
      - 'Yes'
      - 'No'
  What do you intend to use the model for?:
    type: select
    options:
      - Research
      - Personal use
      - Creative Professional
      - Startup
      - Enterprise

Stable Audio 3 Small SFX

Please note: For commercial use, please refer to https://stability.ai/license

Model Description

Stable Audio 3 is a family of fast latent diffusion models (small, medium, large) for variable-length audio generation and editing. Since our models can generate several minutes of audio, variable-length generation is key to avoiding the cost of producing full-length output for short sounds. We also support inpainting, enabling targeted audio editing and the continuation of short recordings. Our latent diffusion models operate on top of a novel semantic-acoustic autoencoder that projects audio into a compact latent space, enabling efficient diffusion-based generation while preserving audio fidelity and encouraging semantic structure in the latent. Finally, we run adversarial post-training to both accelerate inference and improve generation quality, reducing the number of inference steps while improving fidelity and prompt adherence. Stable Audio 3 models are trained on licensed and Creative Commons data to generate music and sounds in less than 2s on an H200 GPU and in a few seconds on a MacBook Pro M4. We release the weights of the small and medium models, which can run on consumer-grade hardware, together with their training and inference pipeline.

Usage

This model can be used with:

  1. the stable-audio-3 inference and fine-tuning library
  2. the stable-audio-tools research library

Using with stable-audio-3

from stable_audio_3 import StableAudioModel

model = StableAudioModel.from_pretrained("small-sfx")
audio = model.generate(
    prompt="chugging train coming into station with horn",
    duration=7,
)

Using with stable-audio-tools

import torch
import torchaudio
from einops import rearrange
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond_inpaint

device = "cuda" if torch.cuda.is_available() else "cpu"
# Use half precision only on GPU; always defined so the check below works on CPU too
model_half = device == "cuda"

# Download model
model, model_config = get_pretrained_model("stabilityai/stable-audio-3-small-sfx")
sample_rate = model_config["sample_rate"]
sample_size = model_config["sample_size"]

model = model.to(device)
if model_half:
  model = model.to(torch.float16)

# Set up text and timing conditioning
conditioning = [{
    "prompt": "chugging train coming into station with horn",
    "seconds_total": 7
}]

# Generate stereo audio
output = generate_diffusion_cond_inpaint(
    model,
    steps=8,
    cfg_scale=1.0,
    conditioning=conditioning,
    sample_size=sample_size,
    sampler_type="pingpong",
    device=device
)

# Rearrange audio batch to a single sequence
output = rearrange(output, "b d n -> d (b n)")

# Peak normalize, clip, convert to int16, and save to file
output = output.to(torch.float32).div(torch.max(torch.abs(output))).clamp(-1, 1).mul(32767).to(torch.int16).cpu()
torchaudio.save("output.wav", output, sample_rate)
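
The peak-normalization step above divides by the largest absolute sample, which is zero for a completely silent output. The following is a minimal plain-Python sketch of the same arithmetic (normalize, clip, scale to int16) with a guard against that case. The function name `peak_normalize_to_int16` and the `eps` floor are hypothetical and not part of stable-audio-tools; the sketch works on a flat list of floats rather than a tensor.

```python
def peak_normalize_to_int16(samples, eps=1e-8):
    """Peak-normalize float samples and convert them to the int16 range.

    `eps` is an assumed floor on the peak so that an all-zero (silent)
    signal does not cause a division by zero.
    """
    # Largest absolute sample; 0.0 for an empty input
    peak = max((abs(s) for s in samples), default=0.0)
    divisor = max(peak, eps)

    result = []
    for s in samples:
        normalized = s / divisor                      # scale so the peak is 1.0
        clipped = max(-1.0, min(1.0, normalized))     # clip to [-1, 1]
        result.append(int(clipped * 32767))           # truncate toward zero
    return result


if __name__ == "__main__":
    print(peak_normalize_to_int16([0.5, -0.25, 0.0]))
    print(peak_normalize_to_int16([0.0, 0.0]))
```

With tensors, the same guard could be applied by clamping the peak to a small minimum before dividing.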

Model Details

We use a publicly available pre-trained T5Gemma model (t5gemma-b-b-ul2) for text conditioning. T5Gemma is redistributed under the Gemma Terms of Use.

Training dataset

Datasets Used

Our dataset consists of 1,278,902 audio recordings: 806,284 are licensed from AudioSparx and a further 472,618 are from Freesound. The Freesound portion consists of recordings licensed under CC-0, CC-BY, or CC-Sampling+. To ensure no copyrighted content was present in the Freesound data, music recordings were identified using the PANNs [89] tagger. We flagged audio that activated music-related tags for at least 30s (threshold of 0.15) and sent it to a trusted content detection company to verify the absence of copyrighted material. All identified copyrighted content was removed. After filtering, the Freesound part includes 266,324 CC-0, 194,840 CC-BY, and 11,454 CC-Sampling+ recordings. This is the same subset of Freesound audio we used to train Stable Audio Open: https://info.stability.ai/attributions.
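
The music-flagging rule can be illustrated with a short sketch. It is one possible reading of the rule, not the pipeline actually used: it assumes the tagger yields one music-related score per fixed-length frame, that "activated" means a score at or above the threshold, and that the 30s is counted cumulatively rather than as one contiguous stretch. The function name `should_flag_for_review` and the `hop_seconds` parameter are hypothetical.

```python
def should_flag_for_review(music_scores, hop_seconds,
                           threshold=0.15, min_music_seconds=30.0):
    """Return True if music-tagged frames add up to at least `min_music_seconds`.

    music_scores: per-frame music-tag scores in [0, 1] (assumed tagger output).
    hop_seconds:  duration covered by each frame (assumed, depends on the tagger).
    """
    # Count frames whose music score reaches the activation threshold
    active_frames = sum(1 for score in music_scores if score >= threshold)
    # Convert the frame count to seconds and compare against the minimum
    return active_frames * hop_seconds >= min_music_seconds


if __name__ == "__main__":
    # 30 one-second frames above threshold: flagged
    print(should_flag_for_review([0.2] * 30, hop_seconds=1.0))
    # Only 29 active seconds: not flagged
    print(should_flag_for_review([0.2] * 29 + [0.1], hop_seconds=1.0))
```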