---
language:
- en
library_name: stable-audio-3
license: other
license_name: stable-audio-community
license_link: LICENSE
pipeline_tag: text-to-audio
tags:
- audio-generation
- music
- sound-effects
- diffusion
---

# Stable Audio 3 Optimized

> **Note:** This repository contains experimental checkpoints optimised for acceleration on specific hardware. For standard checkpoints, please use [Stable Audio 3 Medium](https://huggingface.co/stabilityai/stable-audio-3-medium) instead.

Please note: For commercial use, please refer to [https://stability.ai/license](https://stability.ai/license)

## Model Description

`Stable Audio 3` is a family of fast latent diffusion models (small, medium, large) for variable-length audio generation and editing. Because our models can generate several minutes of audio, variable-length generation is key to avoiding the cost of producing full-length outputs for short sounds. We also support inpainting, enabling targeted audio editing and the continuation of short recordings. Our latent diffusion models operate on top of a novel semantic-acoustic autoencoder that projects audio into a compact latent space, enabling efficient diffusion-based generation while preserving audio fidelity and encouraging semantic structure in the latent space. Finally, we run adversarial post-training to both accelerate inference and improve generation quality, reducing the number of inference steps while improving fidelity and prompt adherence. Stable Audio 3 models are trained on licensed and Creative Commons data to generate music and sounds in under 2 s on an H200 GPU and within a few seconds on a MacBook Pro M4. We release the weights of the small and medium models, which can run on consumer-grade hardware, together with their training and inference pipelines.

## Usage

This model can be used with:

1. the [`stable-audio-3`](https://github.com/Stability-AI/stable-audio-3) inference and fine-tuning library
2. the [`stable-audio-tools`](https://github.com/Stability-AI/stable-audio-tools) research library

### Using with `stable-audio-3`

```python
from stable_audio_3 import StableAudioModel

model = StableAudioModel.from_pretrained("medium")
audio = model.generate(
    prompt=(
        "House music that encapsulates the feeling of being at a festival "
        "in the sunny weather with all your friends 124 BPM"
    ),
    duration=180
)
```

### Using with `stable-audio-tools`

```python
import torch
import torchaudio
from einops import rearrange
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond_inpaint

device = "cuda" if torch.cuda.is_available() else "cpu"
# Use half precision only on GPU; always defined, so the CPU path does not fail
model_half = device == "cuda"

# Download model
model, model_config = get_pretrained_model("stabilityai/stable-audio-3-medium")
sample_rate = model_config["sample_rate"]
sample_size = model_config["sample_size"]

model = model.to(device)
if model_half:
    model = model.to(torch.float16)

# Set up text and timing conditioning
conditioning = [{
    "prompt": (
        "A dream-like Synthpop instrumental that would accompany "
        "a dream-sequence in a surrealist movie 120 BPM"
    ),
    "seconds_total": 380
}]

# Generate stereo audio
output = generate_diffusion_cond_inpaint(
    model,
    steps=8,
    cfg_scale=1.0,
    conditioning=conditioning,
    sample_size=sample_size,
    sampler_type="pingpong",
    device=device
)

# Rearrange audio batch to a single sequence
output = rearrange(output, "b d n -> d (b n)")

# Peak normalize, clip, convert to int16, and save to file
output = output.to(torch.float32).div(torch.max(torch.abs(output))).clamp(-1, 1).mul(32767).to(torch.int16).cpu()
torchaudio.save("output.wav", output, sample_rate)
```

## Model Details

* **Model type**: `Stable Audio 3` is a latent diffusion model based on a transformer architecture.
* **Language(s)**: English
* **License**: [Stability AI Community License](https://stability.ai/license).
* **Research Paper**: [https://arxiv.org/abs/2605.17991](https://arxiv.org/abs/2605.17991)

We use a publicly available pre-trained T5Gemma model ([t5gemma-b-b-ul2](https://huggingface.co/google/t5gemma-b-b-ul2)) for text conditioning. T5Gemma is redistributed under the [Gemma Terms of Use](LICENSE_GEMMA.md).

## Training dataset

### Datasets Used

Our dataset consists of 1,278,902 audio recordings, of which 806,284 are licensed from [AudioSparx](https://www.audiosparx.com/) and a further 472,618 are from [Freesound](https://freesound.org/). The Freesound portion consists of recordings licensed under CC-0, CC-BY, or CC-Sampling+.

To ensure no copyrighted content was present in the Freesound data, music recordings were identified using the PANNs [89] tagger. We flagged audio that activated music-related tags for at least 30 s (threshold of 0.15) and sent it to a trusted content detection company to verify the absence of copyrighted material. All identified copyrighted content was removed. After filtering, the Freesound part includes 266,324 CC-0, 194,840 CC-BY, and 11,454 CC-Sampling+ recordings. This is the same subset of Freesound audio that we used to train Stable Audio Open: https://info.stability.ai/attributions.
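
The music-flagging rule described above (music-related tags active for at least 30 s at a threshold of 0.15) can be illustrated with a minimal sketch. This is not the pipeline we ran; it only shows the thresholding logic under stated assumptions. It assumes a tagger has already produced per-frame music-tag probabilities at a constant hop, and it reads "at least 30 s" as cumulative rather than contiguous active time. The function name `flag_for_review` and its input format are hypothetical.

```python
THRESHOLD = 0.15      # activation threshold quoted in the text
MIN_ACTIVE_S = 30.0   # minimum music-active duration quoted in the text


def flag_for_review(music_probs, hop_s, threshold=THRESHOLD, min_active_s=MIN_ACTIVE_S):
    """Return True if music tags are active for at least `min_active_s` seconds.

    music_probs: per-frame probabilities for music-related tags
                 (hypothetical input format, e.g. the max over music tags).
    hop_s:       seconds covered by each frame (assumed constant).
    """
    # Count frames whose music probability reaches the threshold
    active_frames = sum(1 for p in music_probs if p >= threshold)
    # Assumption: cumulative (not necessarily contiguous) active time
    return active_frames * hop_s >= min_active_s


# Synthetic example: 40 s of music-like frames followed by 20 s of non-music
probs = [0.5] * 40 + [0.01] * 20
print(flag_for_review(probs, hop_s=1.0))  # True: 40 s of activation >= 30 s
```

A contiguous-run interpretation would flag fewer recordings; which reading applies depends on the tagger setup, which is not specified here.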