mattricesound committed on
Commit 2226f57 · verified · 1 Parent(s): d709234

Update model card

Files changed (1)
  1. README.md +115 -0
README.md ADDED
@@ -0,0 +1,115 @@
+ ---
+ language:
+ - en
+ library_name: stable-audio-3
+ license: other
+ license_name: stable-audio-community
+ license_link: LICENSE
+ pipeline_tag: text-to-audio
+ ---
+
+ # Stable Audio 3 Medium (Base)
+
+ > **Note:** This is the base (pre-trained) model intended for fine-tuning. If you are looking to generate audio directly, please use [Stable Audio 3 Medium](https://huggingface.co/stabilityai/stable-audio-3-medium) instead.
+
+ ![Stable Audio 3 logo](./Stable_Audio_3.0_Thumbnail_1x1.png)
+
+ Please note: For commercial use, please refer to [https://stability.ai/license](https://stability.ai/license)
+
+ ## Model Description
+ `Stable Audio 3` is a family of fast latent diffusion models (small, medium, large) for variable-length audio generation and editing. Since our models can generate several minutes of audio,
+ variable-length generations are key to avoiding the cost of producing full-length generations for short
+ sounds. We also support inpainting, enabling targeted audio editing and the continuation of short
+ recordings. Our latent diffusion models operate on top of a novel semantic-acoustic autoencoder that
+ projects audio into a compact latent space, enabling efficient diffusion-based generation while preserving audio fidelity and encouraging semantic structure in the latent. Finally, we run adversarial
+ post-training to both accelerate inference and improve generation quality, reducing the number of inference steps while improving fidelity and prompt adherence. Stable Audio 3 models are trained on
+ licensed and Creative Commons data to generate music and sounds in less than 2 s on an H200 GPU
+ and within a few seconds on a MacBook Pro M4. We release the weights of small and medium,
+ which can run on consumer-grade hardware, together with their training and inference pipeline.
+
+ ## Usage
+
+ This model can be used with:
+ 1. the [`stable-audio-3`](https://github.com/Stability-AI/stable-audio-3) inference and fine-tuning library
+ 2. the [`stable-audio-tools`](https://github.com/Stability-AI/stable-audio-tools) research library
+
+
+ ### Using with `stable-audio-3`
+
+ ```python
+ from stable_audio_3 import StableAudioModel
+
+ model = StableAudioModel.from_pretrained("medium-base")
+ audio = model.generate(
+     prompt="House music that encapsulates the feeling of being at a festival in the sunny weather with all your friends 124 BPM",
+     duration=180,
+ )
+ ```
+
+
+ ### Using with `stable-audio-tools`
+
+ ```python
+ import torch
+ import torchaudio
+ from einops import rearrange
+ from stable_audio_tools import get_pretrained_model
+ from stable_audio_tools.inference.generation import generate_diffusion_cond_inpaint
+
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ # Use half precision only on GPU; defined on every path so the check below cannot raise NameError
+ model_half = device == "cuda"
+
+ # Download model
+ model, model_config = get_pretrained_model("stabilityai/stable-audio-3-medium-base")
+ sample_rate = model_config["sample_rate"]
+ sample_size = model_config["sample_size"]
+
+ model = model.to(device)
+ if model_half:
+     model = model.to(torch.float16)
+
+ # Set up text and timing conditioning
+ conditioning = [{
+     "prompt": "A dream-like Synthpop instrumental that would accompany a dream-sequence in a surrealist movie 120 BPM",
+     "seconds_total": 380
+ }]
+
+ # Generate stereo audio
+ output = generate_diffusion_cond_inpaint(
+     model,
+     steps=8,
+     cfg_scale=1.0,
+     conditioning=conditioning,
+     sample_size=sample_size,
+     sampler_type="pingpong",
+     device=device
+ )
+
+ # Rearrange audio batch to a single sequence
+ output = rearrange(output, "b d n -> d (b n)")
+
+ # Peak normalize, clip, convert to int16, and save to file
+ output = output.to(torch.float32).div(torch.max(torch.abs(output))).clamp(-1, 1).mul(32767).to(torch.int16).cpu()
+ torchaudio.save("output.wav", output, sample_rate)
+ ```
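
The last step of the script above divides by the peak amplitude, which is zero when the generated output is completely silent. A minimal sketch of the same peak-normalize-and-quantize step in plain Python, with a guard for that case, might look like the following. `peak_normalize_int16` is a hypothetical helper name used for illustration only; it is not part of `stable-audio-tools`.

```python
def peak_normalize_int16(samples):
    """Scale samples so the largest magnitude maps to 32767, then convert to ints.

    Hypothetical helper for illustration; not part of stable-audio-tools.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        # Silent (or empty) input: return zeros instead of dividing by zero.
        return [0 for _ in samples]
    # Scale by the peak, clamp to [-1, 1], multiply by 32767;
    # int() truncates toward zero.
    return [int(max(-1.0, min(1.0, s / peak)) * 32767) for s in samples]


quantized = peak_normalize_int16([0.0, 0.5, -0.25])
silent = peak_normalize_int16([0.0, 0.0])
```

In the tensor version, an equivalent guard would be to check whether the peak is zero before calling `div`.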
+
+
+ ## Model Details
+ * **Model type**: `Stable Audio 3` is a latent diffusion model based on a transformer architecture.
+ * **Language(s)**: English
+ * **License**: [Stability AI Community License](https://huggingface.co/stabilityai/stable-audio-3/blob/main/LICENSE.md).
+ * **Commercial License**: To use this model commercially, please refer to [https://stability.ai/license](https://stability.ai/license)
+ * **Research Paper**: [https://arxiv.org/abs/2605.17991](https://arxiv.org/abs/2605.17991)
+
+ We use a publicly available pre-trained T5Gemma model ([t5gemma-b-b-ul2](https://huggingface.co/google/t5gemma-b-b-ul2)) for text conditioning. T5Gemma is redistributed under the [Gemma Terms of Use](LICENSE_GEMMA.md).
+
+ ## Training dataset
+
+ ### Datasets Used
+ Our dataset consists of 1,278,902 audio recordings, of which 806,284 are licensed from [AudioSparx](https://www.audiosparx.com/) and a further 472,618 are from [Freesound](https://freesound.org/).
+ The Freesound portion consists of recordings licensed under CC-0, CC-BY, or CC-Sampling+. To ensure no copyrighted content was present in the Freesound data, music recordings were identified
+ using the PANNs [89] tagger. Audio that activated music-related tags for at least 30 s (threshold of 0.15) was flagged
+ and sent to a trusted content detection company to verify the absence of copyrighted material. All identified copyrighted content was removed. After filtering, the Freesound part includes 266,324 CC-0, 194,840 CC-BY, and 11,454
+ CC-Sampling+ recordings. This is the same subset of Freesound audio that we used to train Stable Audio Open (see https://info.stability.ai/attributions). All stable-audio-3 small models are initially pre-trained on a mixture of AudioSparx and Freesound.
+ For the final stage of pre-training, distillation warmup, and post-training, however, we use AudioSparx for small-music
+ and a higher-quality subset of Freesound for small-sfx. Note that medium and large models are able to
+ handle both music and sound effect generation within a single unified model. For small models, in contrast, we find that
+ the inclusion of sound effects data degrades musical coherence. By isolating the sound effects subset into small-sfx,
+ we mitigate this interference and obtain improved perceptual quality in both domains.
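
The music-flagging rule described above (music-related tags active for at least 30 s at a threshold of 0.15) can be read as a simple duration test over tagger scores. The sketch below is one possible interpretation, not the authors' pipeline. It assumes the tagger yields one music score per fixed-length frame, that a frame counts as active when its score is at or above the threshold, and that the 30 s requirement is cumulative rather than contiguous; the text specifies none of these. The function name, the `frame_seconds` parameter, and the input layout are hypothetical.

```python
def should_flag_as_music(music_scores, frame_seconds, threshold=0.15, min_seconds=30.0):
    """Return True if frames scoring at or above `threshold` total at least `min_seconds`.

    Assumptions (not from the source): one score per fixed-length frame,
    and active time is summed across the whole clip.
    """
    active_frames = sum(1 for score in music_scores if score >= threshold)
    return active_frames * frame_seconds >= min_seconds


# 40 one-second frames above the threshold: 40 s of music-like activity.
long_music = should_flag_as_music([0.6] * 40 + [0.01] * 20, frame_seconds=1.0)
# Only 10 s above the threshold: not enough to flag.
short_music = should_flag_as_music([0.6] * 10 + [0.01] * 50, frame_seconds=1.0)
```

Under this reading, a clip is only forwarded for copyright verification when it is flagged; a contiguous-run variant would need to track the longest streak of active frames instead of the total.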