Commit 465731f · owenisas · verified · 1 parent: a3557ba

Add Stable Audio 3 testing Space

Files changed (4)
  1. .gitignore +9 -0
  2. README.md +32 -6
  3. app.py +579 -0
  4. requirements.txt +13 -0
.gitignore ADDED
@@ -0,0 +1,9 @@
+ __pycache__/
+ *.pyc
+ .gradio/
+ .cache/
+ outputs/
+ *.wav
+ *.flac
+ *.mp3
+ *.m4a
README.md CHANGED
@@ -1,13 +1,39 @@
  ---
  title: Stable Audio 3 Lab
- emoji: 🌍
- colorFrom: yellow
- colorTo: red
+ colorFrom: blue
+ colorTo: indigo
  sdk: gradio
- sdk_version: 6.14.0
- python_version: '3.13'
+ sdk_version: 6.3.0
  app_file: app.py
+ python_version: "3.10"
+ suggested_hardware: a10g-small
  pinned: false
+ license: mit
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Stable Audio 3 Lab
+
+ Gradio Space for testing Stability AI's Stable Audio 3 collections:
+
+ - Standard collection: `stabilityai/stable-audio-3-small-music`, `stabilityai/stable-audio-3-small-sfx`, `stabilityai/stable-audio-3-medium`
+ - Extra collection generation checkpoints: `small-music-base`, `small-sfx-base`, `medium-base`
+ - Extra collection autoencoders: `SAME-S`, `SAME-L`
+
+ The optimized repo (`stabilityai/stable-audio-3-optimized`) currently ships MLX and TensorRT assets rather than a generic `model_config.json` + `model.safetensors` checkpoint. This Space lists it in Coverage, but does not run it through the PyTorch `stable_audio_3` path.
+
+ ## Access
+
+ The post-trained Stable Audio 3 checkpoints are gated on Hugging Face. Before using them here:
+
+ 1. Accept the terms on each gated model page while logged in.
+ 2. Add a read-only `HF_TOKEN` secret to this Space.
+
+ Base checkpoints are not gated, but they are intended mainly for fine-tuning and may not sound as polished.
+
+ ## Hardware
+
+ - Small models can run on CPU, but GPU is still preferred.
+ - Medium and Medium Base expect CUDA plus `flash-attn`.
+ - `SAME-L` is treated as GPU-first; `SAME-S` can be used for CPU autoencoder round trips.
+
+ The Space is configured with `suggested_hardware: a10g-small`. Upgrade hardware if medium generations fail due to memory or Flash Attention support.
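The Access and Hardware notes in the README amount to a small set of preconditions per checkpoint. A minimal sketch of them as a pre-flight check follows. The `Requirement` class, the `REQUIREMENTS` table, and `preflight` are hypothetical names introduced for illustration and are not part of this commit; the sketch also ignores the CPU override that `app.py` offers for debugging.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    gated: bool       # needs accepted terms plus an HF_TOKEN secret
    needs_cuda: bool  # needs CUDA plus flash-attn


# Hypothetical table restating the README's Access and Hardware sections.
REQUIREMENTS = {
    "small-music": Requirement(gated=True, needs_cuda=False),
    "small-sfx": Requirement(gated=True, needs_cuda=False),
    "medium": Requirement(gated=True, needs_cuda=True),
    "small-music-base": Requirement(gated=False, needs_cuda=False),
    "small-sfx-base": Requirement(gated=False, needs_cuda=False),
    "medium-base": Requirement(gated=False, needs_cuda=True),
}


def preflight(name: str, *, has_token: bool, has_cuda: bool, has_flash_attn: bool) -> list[str]:
    """Return human-readable blockers; an empty list means the model should load."""
    req = REQUIREMENTS[name]
    problems = []
    if req.gated and not has_token:
        problems.append("accept the model terms and add an HF_TOKEN secret")
    if req.needs_cuda and not has_cuda:
        problems.append("switch to GPU hardware (CUDA is required)")
    elif req.needs_cuda and not has_flash_attn:
        problems.append("install a flash-attn wheel that matches the runtime")
    return problems


# A gated medium model on a CPU-only Space without a token has two blockers.
print(preflight("medium", has_token=False, has_cuda=False, has_flash_attn=False))
# An ungated small base model has none.
print(preflight("small-sfx-base", has_token=False, has_cuda=False, has_flash_attn=False))
```

Returning a list of blockers rather than raising on the first one lets a UI show everything that needs fixing in one message.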
app.py ADDED
@@ -0,0 +1,579 @@
+ from __future__ import annotations
+
+ import gc
+ import importlib
+ import importlib.util
+ import json
+ import os
+ import tempfile
+ import time
+ from dataclasses import dataclass
+ from typing import Any
+
+ import gradio as gr
+ import numpy as np
+
+ os.environ.setdefault("HF_HUB_ENABLE_HF_TRANSFER", "1")
+
+
+ @dataclass(frozen=True)
+ class GenerationModel:
+     label: str
+     key: str
+     repo_id: str
+     family: str
+     default_prompt: str
+     default_duration: int
+     max_duration: int
+     default_steps: int
+     default_cfg: float
+     default_sampler: str
+     requires_cuda: bool = False
+     gated: bool = False
+     note: str = ""
+
+
+ GENERATION_MODELS: dict[str, GenerationModel] = {
+     "small-music": GenerationModel(
+         label="Stable Audio 3 Small Music",
+         key="small-music",
+         repo_id="stabilityai/stable-audio-3-small-music",
+         family="post-trained",
+         default_prompt=(
+             "Warm lo-fi house groove, soft sidechained pads, clean drums, "
+             "late-night atmosphere, 118 BPM"
+         ),
+         default_duration=20,
+         max_duration=120,
+         default_steps=8,
+         default_cfg=1.0,
+         default_sampler="pingpong",
+         gated=True,
+         note="Lightweight music checkpoint.",
+     ),
+     "small-sfx": GenerationModel(
+         label="Stable Audio 3 Small SFX",
+         key="small-sfx",
+         repo_id="stabilityai/stable-audio-3-small-sfx",
+         family="post-trained",
+         default_prompt="Close binaural rain on a window, soft cloth movement, detailed texture",
+         default_duration=8,
+         max_duration=120,
+         default_steps=8,
+         default_cfg=1.0,
+         default_sampler="pingpong",
+         gated=True,
+         note="Lightweight sound-effects checkpoint.",
+     ),
+     "medium": GenerationModel(
+         label="Stable Audio 3 Medium",
+         key="medium",
+         repo_id="stabilityai/stable-audio-3-medium",
+         family="post-trained",
+         default_prompt=(
+             "Cinematic ambient electronic cue, deep sub pulse, shimmering stereo texture, "
+             "slow evolving melody"
+         ),
+         default_duration=20,
+         max_duration=380,
+         default_steps=8,
+         default_cfg=1.0,
+         default_sampler="pingpong",
+         requires_cuda=True,
+         gated=True,
+         note="High-quality checkpoint; CUDA and flash-attn are expected.",
+     ),
+     "small-music-base": GenerationModel(
+         label="Stable Audio 3 Small Music Base",
+         key="small-music-base",
+         repo_id="stabilityai/stable-audio-3-small-music-base",
+         family="base",
+         default_prompt="Dreamlike synthpop instrumental, surreal film sequence, 120 BPM",
+         default_duration=20,
+         max_duration=120,
+         default_steps=50,
+         default_cfg=7.0,
+         default_sampler="euler",
+         note="Base checkpoint intended mainly for fine-tuning.",
+     ),
+     "small-sfx-base": GenerationModel(
+         label="Stable Audio 3 Small SFX Base",
+         key="small-sfx-base",
+         repo_id="stabilityai/stable-audio-3-small-sfx-base",
+         family="base",
+         default_prompt="Chugging train coming into station with horn",
+         default_duration=7,
+         max_duration=120,
+         default_steps=50,
+         default_cfg=7.0,
+         default_sampler="euler",
+         note="Base checkpoint intended mainly for fine-tuning.",
+     ),
+     "medium-base": GenerationModel(
+         label="Stable Audio 3 Medium Base",
+         key="medium-base",
+         repo_id="stabilityai/stable-audio-3-medium-base",
+         family="base",
+         default_prompt="Dreamlike synthpop instrumental, surreal film sequence, 120 BPM",
+         default_duration=20,
+         max_duration=380,
+         default_steps=50,
+         default_cfg=7.0,
+         default_sampler="euler",
+         requires_cuda=True,
+         note="Base checkpoint intended mainly for fine-tuning; CUDA and flash-attn are expected.",
+     ),
+ }
+
+ AUTOENCODER_MODELS = {
+     "same-s": {
+         "label": "SAME-S",
+         "repo_id": "stabilityai/SAME-S",
+         "requires_cuda": False,
+     },
+     "same-l": {
+         "label": "SAME-L",
+         "repo_id": "stabilityai/SAME-L",
+         "requires_cuda": True,
+     },
+ }
+
+ COLLECTION_ROWS = [
+     ["stable-audio-3-small-music", "Text-to-audio", "Generate tab", "Gated post-trained small music"],
+     ["stable-audio-3-small-sfx", "Text-to-audio", "Generate tab", "Gated post-trained small SFX"],
+     ["stable-audio-3-medium", "Text-to-audio", "Generate tab", "Gated medium; needs CUDA + flash-attn"],
+     ["stable-audio-3-small-music-base", "Text-to-audio", "Generate tab", "Base checkpoint"],
+     ["stable-audio-3-small-sfx-base", "Text-to-audio", "Generate tab", "Base checkpoint"],
+     ["stable-audio-3-medium-base", "Text-to-audio", "Generate tab", "Base checkpoint; needs CUDA + flash-attn"],
+     ["stable-audio-3-optimized", "Optimized assets", "Listed only", "MLX/TensorRT artifacts, not generic PyTorch generation"],
+     ["SAME-S", "Autoencoder", "Autoencoder tab", "CPU-capable round trip"],
+     ["SAME-L", "Autoencoder", "Autoencoder tab", "Large autoencoder; CUDA recommended"],
+ ]
+
+ MODEL_CACHE: dict[str, Any] = {"key": None, "model": None}
+ AE_CACHE: dict[str, Any] = {"key": None, "model": None}
+
+
+ def gpu_task(duration: int):
+     try:
+         import spaces
+
+         return spaces.GPU(duration=duration)
+     except Exception:
+         return lambda fn: fn
+
+
+ def import_torch():
+     return importlib.import_module("torch")
+
+
+ def current_device(torch_module: Any) -> str:
+     if torch_module.cuda.is_available():
+         return "cuda"
+     if hasattr(torch_module.backends, "mps") and torch_module.backends.mps.is_available():
+         return "mps"
+     return "cpu"
+
+
+ def flash_attn_available() -> bool:
+     return importlib.util.find_spec("flash_attn") is not None
+
+
+ def stable_audio_token_hint(model: GenerationModel) -> str:
+     if not model.gated:
+         return ""
+     if os.getenv("HF_TOKEN") or os.getenv("HUGGING_FACE_HUB_TOKEN"):
+         return ""
+     return (
+         "This is a gated Stability model. Accept the model terms on Hugging Face "
+         "and add a read-only HF_TOKEN Space secret if download fails."
+     )
+
+
+ def assert_generation_runtime(model: GenerationModel, allow_cpu_medium: bool) -> str:
+     torch = import_torch()
+     device = current_device(torch)
+     if model.requires_cuda and device != "cuda" and not allow_cpu_medium:
+         raise gr.Error(
+             f"{model.label} is blocked on this runtime because CUDA is not available. "
+             "Use a GPU Space or enable the CPU override for a slow/debug-only attempt."
+         )
+     if model.requires_cuda and device == "cuda" and not flash_attn_available():
+         raise gr.Error(
+             f"{model.label} expects flash-attn on CUDA. Rebuild the Space with the "
+             "flash-attn wheel in requirements.txt or use a small model."
+         )
+     return device
+
+
+ def normalize_audio_array(data: np.ndarray) -> np.ndarray:
+     array = np.asarray(data)
+     if np.issubdtype(array.dtype, np.integer):
+         limit = max(abs(np.iinfo(array.dtype).min), np.iinfo(array.dtype).max)
+         array = array.astype(np.float32) / float(limit)
+     else:
+         array = array.astype(np.float32)
+     if array.ndim == 1:
+         array = array[None, :]
+     elif array.ndim == 2:
+         array = array.T
+     else:
+         raise gr.Error("Audio must be mono or stereo.")
+     return np.nan_to_num(array, nan=0.0, posinf=0.0, neginf=0.0)
+
+
+ def clear_torch_memory() -> None:
+     try:
+         torch = import_torch()
+         if torch.cuda.is_available():
+             torch.cuda.empty_cache()
+     except Exception:
+         pass
+     gc.collect()
+
+
+ def load_generation_model(model_key: str, allow_cpu_medium: bool):
+     model_def = GENERATION_MODELS[model_key]
+     device = assert_generation_runtime(model_def, allow_cpu_medium)
+
+     if MODEL_CACHE["key"] == model_key and MODEL_CACHE["model"] is not None:
+         return MODEL_CACHE["model"], device
+
+     MODEL_CACHE["model"] = None
+     MODEL_CACHE["key"] = None
+     clear_torch_memory()
+
+     from stable_audio_3 import StableAudioModel
+
+     model_half = device == "cuda"
+     model = StableAudioModel.from_pretrained(model_key, model_half=model_half)
+     MODEL_CACHE["key"] = model_key
+     MODEL_CACHE["model"] = model
+     return model, device
+
+
+ def load_autoencoder(model_key: str, allow_cpu_same_l: bool):
+     model_def = AUTOENCODER_MODELS[model_key]
+     torch = import_torch()
+     device = current_device(torch)
+     if model_def["requires_cuda"] and device != "cuda" and not allow_cpu_same_l:
+         raise gr.Error(
+             f"{model_def['label']} is blocked on this runtime because CUDA is not available. "
+             "Use SAME-S or enable the CPU override for a slow/debug-only attempt."
+         )
+
+     if AE_CACHE["key"] == model_key and AE_CACHE["model"] is not None:
+         return AE_CACHE["model"], device
+
+     AE_CACHE["model"] = None
+     AE_CACHE["key"] = None
+     clear_torch_memory()
+
+     from stable_audio_3 import AutoencoderModel
+
+     model = AutoencoderModel.from_pretrained(model_key)
+     AE_CACHE["key"] = model_key
+     AE_CACHE["model"] = model
+     return model, device
+
+
+ def model_changed(model_key: str):
+     model = GENERATION_MODELS[model_key]
+     return (
+         gr.update(value=model.default_prompt),
+         gr.update(value=model.default_duration, maximum=model.max_duration),
+         gr.update(value=model.default_steps),
+         gr.update(value=model.default_cfg),
+         gr.update(value=model.default_sampler),
+         {
+             "repo_id": model.repo_id,
+             "family": model.family,
+             "max_duration_s": model.max_duration,
+             "default_sampler": model.default_sampler,
+             "note": model.note,
+             "token_hint": stable_audio_token_hint(model),
+         },
+     )
+
+
+ @gpu_task(duration=int(os.getenv("SPACES_GENERATE_GPU_SECONDS", "900")))
+ def generate_audio(
+     model_key: str,
+     prompt: str,
+     negative_prompt: str,
+     duration: float,
+     steps: int,
+     cfg_scale: float,
+     sampler_type: str,
+     seed: int,
+     chunked_decode: bool,
+     allow_cpu_medium: bool,
+     progress=gr.Progress(track_tqdm=True),
+ ):
+     if not prompt or not prompt.strip():
+         raise gr.Error("Prompt is required.")
+     model_def = GENERATION_MODELS[model_key]
+     progress(0.05, desc="Loading model")
+     started = time.time()
+     seed = int(seed)
+     if seed < 0:
+         seed = int.from_bytes(os.urandom(4), "little") % 100000
+
+     model, device = load_generation_model(model_key, allow_cpu_medium)
+     progress(0.25, desc="Generating")
+     audio = model.generate(
+         prompt=prompt.strip(),
+         negative_prompt=negative_prompt.strip() or None,
+         duration=float(duration),
+         steps=int(steps),
+         cfg_scale=float(cfg_scale),
+         seed=seed,
+         sampler_type=sampler_type,
+         chunked_decode=bool(chunked_decode),
+     )
+
+     progress(0.9, desc="Writing WAV")
+     import torchaudio
+
+     sample_rate = int(model.model_config["sample_rate"])
+     waveform = audio[0].detach().to("cpu").float().clamp(-1, 1)
+     out_file = tempfile.NamedTemporaryFile(prefix=f"{model_key}-", suffix=".wav", delete=False)
+     out_file.close()
+     torchaudio.save(out_file.name, waveform, sample_rate)
+
+     elapsed = round(time.time() - started, 3)
+     metadata = {
+         "model": model_def.key,
+         "repo_id": model_def.repo_id,
+         "family": model_def.family,
+         "device": device,
+         "duration_s": float(duration),
+         "steps": int(steps),
+         "cfg_scale": float(cfg_scale),
+         "sampler_type": sampler_type,
+         "seed": seed,
+         "sample_rate": sample_rate,
+         "elapsed_s": elapsed,
+         "output_file": out_file.name,
+         "note": model_def.note,
+     }
+     return out_file.name, metadata
+
+
+ @gpu_task(duration=int(os.getenv("SPACES_AUTOENCODER_GPU_SECONDS", "600")))
+ def roundtrip_autoencoder(
+     model_key: str,
+     audio_input: tuple[int, np.ndarray] | None,
+     chunked: bool,
+     allow_cpu_same_l: bool,
+     progress=gr.Progress(track_tqdm=True),
+ ):
+     if audio_input is None:
+         raise gr.Error("Upload or record audio first.")
+
+     progress(0.05, desc="Loading autoencoder")
+     started = time.time()
+     model, device = load_autoencoder(model_key, allow_cpu_same_l)
+
+     progress(0.25, desc="Encoding")
+     sr, data = audio_input
+     waveform_np = normalize_audio_array(data)
+
+     torch = import_torch()
+     waveform = torch.from_numpy(waveform_np)
+     latents = model.encode(waveform, int(sr), chunked=bool(chunked))
+
+     progress(0.65, desc="Decoding")
+     decoded = model.decode(latents, chunked=bool(chunked))
+     decoded = decoded[0].detach().to("cpu").float().clamp(-1, 1)
+
+     import torchaudio
+
+     out_file = tempfile.NamedTemporaryFile(prefix=f"{model_key}-roundtrip-", suffix=".wav", delete=False)
+     out_file.close()
+     torchaudio.save(out_file.name, decoded, int(model.sample_rate))
+
+     metadata = {
+         "autoencoder": model_key,
+         "repo_id": AUTOENCODER_MODELS[model_key]["repo_id"],
+         "device": device,
+         "input_sample_rate": int(sr),
+         "output_sample_rate": int(model.sample_rate),
+         "input_shape": list(waveform.shape),
+         "latent_shape": list(latents.shape),
+         "elapsed_s": round(time.time() - started, 3),
+         "output_file": out_file.name,
+     }
+     return out_file.name, metadata
+
+
+ def unload_models():
+     MODEL_CACHE["key"] = None
+     MODEL_CACHE["model"] = None
+     AE_CACHE["key"] = None
+     AE_CACHE["model"] = None
+     clear_torch_memory()
+     return {"status": "unloaded"}
+
+
+ def runtime_status():
+     try:
+         torch = import_torch()
+         device = current_device(torch)
+         cuda_name = torch.cuda.get_device_name(0) if torch.cuda.is_available() else None
+     except Exception as exc:
+         device = "unavailable"
+         cuda_name = None
+         return {"torch": repr(exc), "device": device}
+
+     return {
+         "device": device,
+         "cuda_name": cuda_name,
+         "flash_attn": flash_attn_available(),
+         "hf_token_present": bool(os.getenv("HF_TOKEN") or os.getenv("HUGGING_FACE_HUB_TOKEN")),
+         "loaded_generation_model": MODEL_CACHE["key"],
+         "loaded_autoencoder": AE_CACHE["key"],
+     }
+
+
+ MODEL_CHOICES = [(model.label, key) for key, model in GENERATION_MODELS.items()]
+ AE_CHOICES = [(value["label"], key) for key, value in AUTOENCODER_MODELS.items()]
+ SAMPLER_CHOICES = ["pingpong", "euler", "rk4", "dpmpp", "dpmpp-3m-sde"]
+
+ css = """
+ .gradio-container { max-width: 1160px !important; }
+ #run-buttons button { min-height: 42px; }
+ """
+
+ with gr.Blocks(title="Stable Audio 3 Lab", css=css) as demo:
+     gr.Markdown("# Stable Audio 3 Lab")
+
+     with gr.Tab("Generate"):
+         with gr.Row(equal_height=False):
+             with gr.Column(scale=2):
+                 model_dropdown = gr.Dropdown(
+                     label="Model",
+                     choices=MODEL_CHOICES,
+                     value="small-sfx",
+                     interactive=True,
+                 )
+                 prompt_box = gr.Textbox(
+                     label="Prompt",
+                     value=GENERATION_MODELS["small-sfx"].default_prompt,
+                     lines=4,
+                 )
+                 negative_prompt_box = gr.Textbox(label="Negative prompt", lines=2)
+                 with gr.Row():
+                     duration_slider = gr.Slider(
+                         label="Duration",
+                         minimum=1,
+                         maximum=GENERATION_MODELS["small-sfx"].max_duration,
+                         value=GENERATION_MODELS["small-sfx"].default_duration,
+                         step=1,
+                     )
+                     steps_slider = gr.Slider(
+                         label="Steps",
+                         minimum=1,
+                         maximum=100,
+                         value=GENERATION_MODELS["small-sfx"].default_steps,
+                         step=1,
+                     )
+                     cfg_slider = gr.Slider(
+                         label="CFG",
+                         minimum=0,
+                         maximum=12,
+                         value=GENERATION_MODELS["small-sfx"].default_cfg,
+                         step=0.1,
+                     )
+                 with gr.Row():
+                     sampler_dropdown = gr.Dropdown(
+                         label="Sampler",
+                         choices=SAMPLER_CHOICES,
+                         value=GENERATION_MODELS["small-sfx"].default_sampler,
+                     )
+                     seed_number = gr.Number(label="Seed", value=-1, precision=0)
+                 with gr.Row():
+                     chunked_decode_box = gr.Checkbox(label="Chunked decode", value=True)
+                     allow_cpu_medium_box = gr.Checkbox(label="CPU override", value=False)
+                 with gr.Row(elem_id="run-buttons"):
+                     generate_button = gr.Button("Generate", variant="primary")
+                     unload_button = gr.Button("Unload")
+                     status_button = gr.Button("Runtime")
+             with gr.Column(scale=1):
+                 model_info = gr.JSON(
+                     label="Model info",
+                     value={
+                         "repo_id": GENERATION_MODELS["small-sfx"].repo_id,
+                         "family": GENERATION_MODELS["small-sfx"].family,
+                         "note": GENERATION_MODELS["small-sfx"].note,
+                         "token_hint": stable_audio_token_hint(GENERATION_MODELS["small-sfx"]),
+                     },
+                 )
+                 audio_output = gr.Audio(label="Output", type="filepath")
+                 metadata_output = gr.JSON(label="Run metadata")
+
+         model_dropdown.change(
+             model_changed,
+             inputs=model_dropdown,
+             outputs=[
+                 prompt_box,
+                 duration_slider,
+                 steps_slider,
+                 cfg_slider,
+                 sampler_dropdown,
+                 model_info,
+             ],
+         )
+         generate_button.click(
+             generate_audio,
+             inputs=[
+                 model_dropdown,
+                 prompt_box,
+                 negative_prompt_box,
+                 duration_slider,
+                 steps_slider,
+                 cfg_slider,
+                 sampler_dropdown,
+                 seed_number,
+                 chunked_decode_box,
+                 allow_cpu_medium_box,
+             ],
+             outputs=[audio_output, metadata_output],
+             concurrency_limit=1,
+         )
+         unload_button.click(unload_models, outputs=metadata_output)
+         status_button.click(runtime_status, outputs=metadata_output)
+
+     with gr.Tab("Autoencoder"):
+         with gr.Row(equal_height=False):
+             with gr.Column(scale=2):
+                 ae_dropdown = gr.Dropdown(label="Autoencoder", choices=AE_CHOICES, value="same-s")
+                 ae_audio_input = gr.Audio(label="Input", sources=["upload", "microphone"], type="numpy")
+                 with gr.Row():
+                     ae_chunked_box = gr.Checkbox(label="Chunked", value=True)
+                     ae_allow_cpu_box = gr.Checkbox(label="CPU override", value=False)
+                 ae_button = gr.Button("Round Trip", variant="primary")
+             with gr.Column(scale=1):
+                 ae_output = gr.Audio(label="Decoded", type="filepath")
+                 ae_metadata = gr.JSON(label="Round-trip metadata")
+
+         ae_button.click(
+             roundtrip_autoencoder,
+             inputs=[ae_dropdown, ae_audio_input, ae_chunked_box, ae_allow_cpu_box],
+             outputs=[ae_output, ae_metadata],
+             concurrency_limit=1,
+         )
+
+     with gr.Tab("Coverage"):
+         gr.Dataframe(
+             value=COLLECTION_ROWS,
+             headers=["Collection entry", "Type", "Space path", "Status"],
+             datatype=["str", "str", "str", "str"],
+             interactive=False,
+             wrap=True,
+         )
+         gr.JSON(label="Runtime", value=runtime_status())
+
+
+ if __name__ == "__main__":
+     demo.queue(default_concurrency_limit=1).launch()
requirements.txt ADDED
@@ -0,0 +1,13 @@
+ --extra-index-url https://download.pytorch.org/whl/cu126
+
+ torch==2.7.1
+ torchaudio==2.7.1
+ gradio==6.3.0
+ spaces
+ hf_transfer
+ soundfile
+ git+https://github.com/Stability-AI/stable-audio-3.git@main
+
+ # Required for Stable Audio 3 Medium on CUDA. This is the wheel recommended by
+ # Stability AI's README for torch 2.7 / CUDA 12.6 / Python 3.10.
+ https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.7.16/flash_attn-2.6.3+cu126torch2.7-cp310-cp310-linux_x86_64.whl
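The pinned flash-attn wheel is tagged `cp310`, so it installs only on CPython 3.10. A rough way to check a wheel against an interpreter is to read the Python tag out of the filename, which follows the `name-version[-build]-python-abi-platform.whl` layout of the wheel format. The helper names below are hypothetical, and the check covers only the Python tag, not the ABI, platform, CUDA, or torch parts.

```python
import sys


def wheel_python_tag(filename: str) -> str:
    """Return the Python tag from a wheel filename."""
    stem = filename.removesuffix(".whl")
    parts = stem.split("-")
    # The last three fields are always python tag, ABI tag, platform tag,
    # whether or not an optional build tag is present.
    return parts[-3]


def matches_interpreter(filename: str, version_info=sys.version_info) -> bool:
    """True when the wheel's Python tag equals cp<major><minor> for the interpreter."""
    major, minor = version_info[0], version_info[1]
    return wheel_python_tag(filename) == f"cp{major}{minor}"


WHEEL = "flash_attn-2.6.3+cu126torch2.7-cp310-cp310-linux_x86_64.whl"
print(wheel_python_tag(WHEEL))
print(matches_interpreter(WHEEL, (3, 10)), matches_interpreter(WHEEL, (3, 13)))
```

With the default second argument, `matches_interpreter(WHEEL)` checks the running interpreter, which makes it usable as a quick sanity check at Space start-up.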