---
language:
- en
library_name: stable-audio-3
license: other
license_name: stable-audio-community
license_link: LICENSE
pipeline_tag: text-to-audio
tags:
- audio-generation
- music
- sound-effects
- diffusion
---
# Stable Audio 3 Optimized
> **Note:** This repository contains experimental checkpoints optimised for acceleration on specific hardware. For standard checkpoints, please use [Stable Audio 3 Medium](https://huggingface.co/stabilityai/stable-audio-3-medium) instead.

For commercial use, please refer to [https://stability.ai/license](https://stability.ai/license).
## Model Description
`Stable Audio 3` is a family of fast latent diffusion models (small, medium, large) for variable-length audio generation and editing. Because our models can generate several minutes of audio, variable-length generation is key to avoiding the cost of producing full-length outputs for short sounds. We also support inpainting, enabling targeted audio editing and the continuation of short recordings. Our latent diffusion models operate on top of a novel semantic-acoustic autoencoder that projects audio into a compact latent space, enabling efficient diffusion-based generation while preserving audio fidelity and encouraging semantic structure in the latent. Finally, we run adversarial post-training to both accelerate inference and improve generation quality, reducing the number of inference steps while improving fidelity and prompt adherence. Stable Audio 3 models are trained on licensed and Creative Commons data and generate music and sounds in less than 2 s on an H200 GPU and within a few seconds on a MacBook Pro M4. We release the weights of the small and medium models, which can run on consumer-grade hardware, together with their training and inference pipeline.
## Usage
This model can be used with:
1. the [`stable-audio-3`](https://github.com/Stability-AI/stable-audio-3) inference and fine-tuning library
2. the [`stable-audio-tools`](https://github.com/Stability-AI/stable-audio-tools) research library
### Using with `stable-audio-3`
```python
from stable_audio_3 import StableAudioModel

model = StableAudioModel.from_pretrained("medium")

audio = model.generate(
    prompt=(
        "House music that encapsulates the feeling of being at a festival "
        "in the sunny weather with all your friends 124 BPM"
    ),
    duration=180
)
```
### Using with `stable-audio-tools`
```python
import torch
import torchaudio
from einops import rearrange
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond_inpaint

device = "cuda" if torch.cuda.is_available() else "cpu"
# Use half precision only when running on a GPU
model_half = device == "cuda"

# Download model
model, model_config = get_pretrained_model("stabilityai/stable-audio-3-medium")
sample_rate = model_config["sample_rate"]
sample_size = model_config["sample_size"]

model = model.to(device)
if model_half:
    model = model.to(torch.float16)

# Set up text and timing conditioning
conditioning = [{
    "prompt": (
        "A dream-like Synthpop instrumental that would accompany "
        "a dream-sequence in a surrealist movie 120 BPM"
    ),
    "seconds_total": 380
}]

# Generate stereo audio
output = generate_diffusion_cond_inpaint(
    model,
    steps=8,
    cfg_scale=1.0,
    conditioning=conditioning,
    sample_size=sample_size,
    sampler_type="pingpong",
    device=device
)

# Rearrange audio batch to a single sequence
output = rearrange(output, "b d n -> d (b n)")

# Peak normalize, clip, convert to int16, and save to file
output = output.to(torch.float32).div(torch.max(torch.abs(output))).clamp(-1, 1).mul(32767).to(torch.int16).cpu()
torchaudio.save("output.wav", output, sample_rate)
```
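The final post-processing line in the example above chains several steps: peak normalization, clamping to [-1, 1], scaling to the 16-bit range, and an integer cast. As a minimal sketch of that arithmetic, the helper below reproduces it on a plain list of floats using only the Python standard library. The function name `peak_normalize_to_int16` is hypothetical and is not part of `stable-audio-tools`. The guard for silent input is an addition for illustration, since dividing by a zero peak would otherwise fail.

```python
def peak_normalize_to_int16(samples):
    """Scale samples so the loudest one reaches full scale, then quantize.

    `samples` is a flat list of floats. This hypothetical helper only mirrors
    the div / clamp(-1, 1) / mul(32767) / integer-cast chain shown above.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        # Silent or empty input: avoid dividing by zero and return silence.
        return [0 for _ in samples]
    quantized = []
    for s in samples:
        scaled = max(-1.0, min(1.0, s / peak))  # normalize by the peak, then clamp
        quantized.append(int(scaled * 32767))   # int() truncates toward zero
    return quantized


print(peak_normalize_to_int16([0.0, 0.25, -0.5]))  # [0, 16383, -32767]
```

Because the peak sample is mapped to exactly ±1, the clamp is only a safety net against rounding; the loudest sample always lands on ±32767.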
## Model Details
* **Model type**: `Stable Audio 3` is a latent diffusion model based on a transformer architecture.
* **Language(s)**: English
* **License**: [Stability AI Community License](https://stability.ai/license).
* **Research Paper**: [https://arxiv.org/abs/2605.17991](https://arxiv.org/abs/2605.17991)

We use a publicly available pre-trained T5Gemma model ([t5gemma-b-b-ul2](https://huggingface.co/google/t5gemma-b-b-ul2)) for text conditioning. T5Gemma is redistributed under the [Gemma Terms of Use](LICENSE_GEMMA.md).
## Training dataset
### Datasets Used
Our dataset consists of 1,278,902 audio recordings: 806,284 are licensed from [AudioSparx](https://www.audiosparx.com/) and a further 472,618 are from [Freesound](https://freesound.org/).

The Freesound portion consists of recordings licensed under CC-0, CC-BY, or CC-Sampling+. To ensure no copyrighted content was present in the Freesound data, music recordings were identified using the PANNs [89] tagger. Audio that activated music-related tags for at least 30 s (threshold of 0.15) was flagged and sent to a trusted content detection company to verify the absence of copyrighted material. All identified copyrighted content was removed. After filtering, the Freesound part includes 266,324 CC-0, 194,840 CC-BY, and 11,454 CC-Sampling+ recordings. This is the same subset of Freesound audio we used to train Stable Audio Open: [https://info.stability.ai/attributions](https://info.stability.ai/attributions).
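As a quick sanity check on the figures above, the sketch below confirms that the per-license Freesound counts and the AudioSparx count add up to the stated totals. It also illustrates one possible reading of the flagging rule. The function `flag_for_review` is hypothetical: the per-frame probabilities, the frame hop, and the choice to count active time cumulatively rather than contiguously are all assumptions, since the text does not specify how the 30 s criterion was measured.

```python
# Recording counts as stated in the dataset description.
AUDIOSPARX = 806_284
FREESOUND_BY_LICENSE = {"CC-0": 266_324, "CC-BY": 194_840, "CC-Sampling+": 11_454}

freesound_total = sum(FREESOUND_BY_LICENSE.values())
dataset_total = AUDIOSPARX + freesound_total


def flag_for_review(music_probs, hop_seconds, threshold=0.15, min_seconds=30.0):
    """Return True if music-related tags are active for at least `min_seconds`.

    Assumptions (not from the source): `music_probs` holds one music-tag
    probability per analysis frame, frames are `hop_seconds` apart, and
    active time is accumulated over the whole clip, not required to be
    contiguous.
    """
    active_frames = sum(1 for p in music_probs if p >= threshold)
    return active_frames * hop_seconds >= min_seconds


print(freesound_total)  # 472618
print(dataset_total)    # 1278902
print(flag_for_review([0.2] * 40, hop_seconds=1.0))  # True
```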