	---
	license: other
	license_name: stability-ai-community-license
	license_link: https://stability.ai/license
	library_name: stable-audio-3
	tags:
	- audio
	- audio-generation
	- text-to-audio
	- audio-to-audio
	- inpainting
	- stable-audio-3
	- stability-ai
	- safetensors
	pipeline_tag: text-to-audio
	---

	# Stable Audio 3 — bundled mirror

	Self-contained inference bundle for the [MAESTRO](https://github.com/AEmotionStudio/MAESTRO) desktop app.
	One-to-one mirror of Stability AI's [Stable Audio 3 collection](https://huggingface.co/collections/stabilityai/stable-audio-3) and the [extras collection](https://huggingface.co/collections/stabilityai/stable-audio-3-extra) (base checkpoints + standalone autoencoders), bundled into a single browseable HF repo so the MAESTRO panel can pick the variant a user wants without juggling eight separate downloads.

	## License — Stability AI Community License

	All weights in this repository are released by Stability AI under the [Stability AI Community License](https://stability.ai/license):

	> Free for organizations with under $1M annual revenue. Commercial use of the models and outputs is permitted within that threshold; redistribution, fine-tuning, and derivative works are explicitly allowed. Outputs are yours. Above the revenue threshold, contact Stability AI for an Enterprise License.

	The upstream [`stable-audio-3` source code](https://github.com/Stability-AI/stable-audio-3) is released separately under MIT.

	### Gated subdirs

	Three subdirs mirror upstream repos that are gated on huggingface.co — you must accept Stability AI's terms (and the Gemma terms-of-use, since the text encoder is T5-Gemma) before this mirror's gating allows access:

	- `small-music/` (mirror of [`stabilityai/stable-audio-3-small-music`](https://huggingface.co/stabilityai/stable-audio-3-small-music))
	- `small-sfx/` (mirror of [`stabilityai/stable-audio-3-small-sfx`](https://huggingface.co/stabilityai/stable-audio-3-small-sfx))
	- `medium/` (mirror of [`stabilityai/stable-audio-3-medium`](https://huggingface.co/stabilityai/stable-audio-3-medium))

	The base checkpoints and SAME autoencoders are open.
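For scripted downloads, the gated/open split above can be encoded once and checked before any network call. This is a minimal sketch, not MAESTRO's code: `GATED_SUBDIRS` and `download_kwargs` are hypothetical names, and how the mirror enforces gating per subdir is not stated here, so the sketch simply treats the three listed subdirs as needing a Hugging Face access token. The resulting dict uses the `repo_id`, `allow_patterns`, and `token` arguments of `huggingface_hub.snapshot_download`.

```python
# Subdirs listed above as mirrors of gated upstream repos.
GATED_SUBDIRS = {"small-music", "small-sfx", "medium"}

# Repo id as used in the Standalone example of this README.
MIRROR_REPO = "AEmotionStudio/stable-audio-3-mirrors"


def download_kwargs(subdir, token=None):
    """Build keyword arguments for huggingface_hub.snapshot_download.

    Hypothetical helper: refuses to build a request for a gated subdir
    when no access token is supplied, so the failure happens early.
    """
    if subdir in GATED_SUBDIRS and not token:
        raise RuntimeError(
            f"{subdir}/ is gated: accept the upstream terms and pass a token"
        )
    kwargs = {"repo_id": MIRROR_REPO, "allow_patterns": [f"{subdir}/**"]}
    if token:
        kwargs["token"] = token
    return kwargs


# Usage (requires network access and huggingface_hub):
#   import os
#   from huggingface_hub import snapshot_download
#   local = snapshot_download(**download_kwargs("medium", os.environ.get("HF_TOKEN")))
```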

	## Contents

	| Subdir | Role | Params | Max duration | Upstream |
	|---|---|---|---|---|
	| `small-music/` | Post-trained text → audio (music) | 433 M | 120 s | `stabilityai/stable-audio-3-small-music` (gated) |
	| `small-sfx/` | Post-trained text → audio (SFX) | 433 M | 120 s | `stabilityai/stable-audio-3-small-sfx` (gated) |
	| `medium/` | Post-trained text → audio (music + SFX) | 1.4 B | 380 s | `stabilityai/stable-audio-3-medium` (gated) |
	| `small-music-base/` | Base ckpt for LoRA fine-tuning | 433 M | 120 s | `stabilityai/stable-audio-3-small-music-base` |
	| `small-sfx-base/` | Base ckpt for LoRA fine-tuning | 433 M | 120 s | `stabilityai/stable-audio-3-small-sfx-base` |
	| `medium-base/` | Base ckpt for LoRA fine-tuning | 1.4 B | 380 s | `stabilityai/stable-audio-3-medium-base` |
	| `same-s/` | SAME-Small standalone autoencoder | ~50 M | n/a | `stabilityai/SAME-S` |
	| `same-l/` | SAME-Large standalone autoencoder | ~200 M | n/a | `stabilityai/SAME-L` |

	Every subdir contains `model.safetensors` and `model_config.json`. The post-trained and base variants also include the bundled T5-Gemma text encoder and SAME pretransform; the SAME repos are autoencoder-only.
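The maximum durations in the table above lend themselves to a simple lookup when choosing a variant for a requested clip length. The sketch below is illustrative only: `MAX_DURATION_S` and `variants_supporting` are hypothetical names, and the values are transcribed from the table rather than read from any `model_config.json`.

```python
# Max duration in seconds per generative subdir, transcribed from the
# Contents table. The SAME autoencoders have no duration limit listed.
MAX_DURATION_S = {
    "small-music": 120,
    "small-sfx": 120,
    "medium": 380,
    "small-music-base": 120,
    "small-sfx-base": 120,
    "medium-base": 380,
}


def variants_supporting(duration_s, include_base=False):
    """Return the subdirs whose listed max duration covers duration_s."""
    return sorted(
        name
        for name, limit in MAX_DURATION_S.items()
        if duration_s <= limit
        and (include_base or not name.endswith("-base"))
    )
```

For example, a 200 s request leaves only `medium` among the post-trained variants, since the small variants are listed at 120 s.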

	## Capabilities

	All six generative variants share a single inference surface in MAESTRO with four modes:

	- Text → Audio — prompt-only generation, stereo 44.1 kHz
	- Audio → Audio — style transfer / restyling with an adjustable `init_noise_level`
	- Inpaint — multi-region regeneration of a source clip; non-region time is preserved verbatim
	- Continue — extend an existing clip past its end
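To make the Inpaint mode's "multi-region" idea concrete, here is one way region times could be turned into sample ranges at the 44.1 kHz output rate. This is a sketch under assumptions, not MAESTRO's implementation: `regions_to_sample_spans` is a hypothetical helper, and the choice to merge overlapping regions and clip them to the clip length is made for illustration. Everything outside the returned spans is the "non-region time" that would be left untouched.

```python
SAMPLE_RATE = 44_100  # output sample rate stated in the mode list above


def regions_to_sample_spans(regions_s, clip_duration_s, sample_rate=SAMPLE_RATE):
    """Convert (start_s, end_s) regions to merged [start, end) sample spans.

    Regions are sorted, clipped to the clip length, and merged when they
    overlap or touch, so each sample belongs to at most one span.
    """
    clip_end = int(round(clip_duration_s * sample_rate))
    spans = []
    for start_s, end_s in sorted(regions_s):
        start = max(0, int(round(start_s * sample_rate)))
        end = min(clip_end, int(round(end_s * sample_rate)))
        if end <= start:
            continue  # empty or fully out-of-range region
        if spans and start <= spans[-1][1]:
            # Overlaps or touches the previous span: extend it.
            spans[-1] = (spans[-1][0], max(spans[-1][1], end))
        else:
            spans.append((start, end))
    return spans
```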

	Generation knobs exposed: prompt, negative prompt, duration, steps, CFG scale, APG scale, seed, batch size, sampler type (`dpmpp-3m-sde` / `dpmpp-2m` / `euler` / `heun`), distribution shift (`logSNR` / `flux` / `identity`), precision (fp16 / fp32), chunked decode, and user-loadable, stackable LoRAs.
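The knobs listed above can be pictured as a single settings object with validation on the enumerated choices. The following is a hedged sketch: `GenerationSettings` and its field names are hypothetical, and the defaults are placeholders (only `duration=10`, `steps=8`, and `cfg_scale=1.0` echo the Standalone example in this README). The sampler, distribution-shift, and precision choices are the ones listed above.

```python
from dataclasses import dataclass

SAMPLER_TYPES = ("dpmpp-3m-sde", "dpmpp-2m", "euler", "heun")
DISTRIBUTION_SHIFTS = ("logSNR", "flux", "identity")
PRECISIONS = ("fp16", "fp32")


@dataclass
class GenerationSettings:
    """Hypothetical container for the exposed knobs (LoRAs omitted)."""

    prompt: str
    negative_prompt: str = ""
    duration: float = 10.0
    steps: int = 8
    cfg_scale: float = 1.0
    apg_scale: float = 1.0  # placeholder default, not a recommended value
    seed: int = 0
    batch_size: int = 1
    sampler_type: str = "dpmpp-3m-sde"
    distribution_shift: str = "logSNR"
    precision: str = "fp16"
    chunked_decode: bool = False

    def __post_init__(self):
        # Reject values outside the enumerated choices early.
        if self.sampler_type not in SAMPLER_TYPES:
            raise ValueError(f"unknown sampler type: {self.sampler_type!r}")
        if self.distribution_shift not in DISTRIBUTION_SHIFTS:
            raise ValueError(f"unknown distribution shift: {self.distribution_shift!r}")
        if self.precision not in PRECISIONS:
            raise ValueError(f"unknown precision: {self.precision!r}")
        if self.steps < 1 or self.batch_size < 1 or self.duration <= 0:
            raise ValueError("steps, batch_size and duration must be positive")
```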

	> Medium variants require [Flash Attention 2](https://github.com/Dao-AILab/flash-attention) for the SAME-Large decoder path. Without `flash-attn` installed, Medium generation degrades to static-glitch output. Small variants do not require it.
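Because the failure mode described above is silent (glitchy audio rather than an error), a caller may want to check for `flash-attn` before loading a Medium variant. A minimal sketch, assuming the package is importable as `flash_attn` and that both `medium/` and `medium-base/` count as Medium variants; `check_variant_runtime` is a hypothetical helper, not part of MAESTRO or the upstream package.

```python
import importlib.util


def flash_attn_available():
    """Return True if the flash_attn module can be found, without importing it."""
    return importlib.util.find_spec("flash_attn") is not None


def check_variant_runtime(subdir, available=None):
    """Raise early if a Medium variant is requested without flash-attn.

    `available` can be passed explicitly (useful in tests); when omitted,
    the current environment is probed.
    """
    if available is None:
        available = flash_attn_available()
    if subdir.startswith("medium") and not available:
        raise RuntimeError(
            f"{subdir}/ needs flash-attn installed; "
            "without it, generation degrades to static-glitch output"
        )
```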

	## Format

	- All weights are `safetensors`. No `.pt` / `.ckpt` / `.bin` in this mirror.
	- Mirror is bf16, re-saved via `safetensors.torch.save_model`, which preserves the shared RotaryEmbedding buffers that a bare `save_file` would corrupt. This halves the on-disk size relative to the fp32 upstream. The MAESTRO runner upcasts to fp32 transiently during `load_state_dict`, then casts to fp16 (`model_half=True`) for inference, so runtime VRAM is unchanged from the fp32 mirror, while disk usage, I/O, and the initial safetensors-read CPU spike are all halved.
	- Approximate disk sizes per subdir: small variants ~1.14 GB each, medium variants ~4.61 GB each, SAME-S ~0.22 GB, SAME-L ~1.70 GB. Total mirror footprint ≈ 15.7 GB.
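The total above follows directly from the per-subdir figures, as this short calculation shows. The fp32 number is only an estimate obtained by doubling the bf16 sizes (bf16 stores 2 bytes per parameter, fp32 stores 4); it is not a measured size.

```python
# Approximate bf16 sizes in GB, taken from the list above.
SIZES_GB = {
    "small-music": 1.14,
    "small-sfx": 1.14,
    "small-music-base": 1.14,
    "small-sfx-base": 1.14,
    "medium": 4.61,
    "medium-base": 4.61,
    "same-s": 0.22,
    "same-l": 1.70,
}

total_bf16_gb = sum(SIZES_GB.values())
# Estimate only: doubling assumes every tensor was fp32 upstream.
estimated_fp32_gb = 2 * total_bf16_gb

print(f"bf16 mirror: ~{total_bf16_gb:.1f} GB")
print(f"fp32 estimate: ~{estimated_fp32_gb:.1f} GB")
```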

	## Usage

	### Inside MAESTRO

	The MAESTRO desktop app's `AI > Create > Stable Audio 3` panel handles the download + variant selection. The bundled runner at `backend/ai/models/stable_audio_3.py` reads the per-variant subdir name from the manifest and feeds it into the vendored `stable_audio_3` package at `backend/ai/stable_audio_3_vendor/`.

	### Standalone

	The repo can also be consumed directly by Stability AI's upstream [`stable-audio-3` package](https://github.com/Stability-AI/stable-audio-3):

	```python
	import json

	from huggingface_hub import snapshot_download
	from stable_audio_3.loading_utils import load_diffusion_cond
	from stable_audio_3.model import StableAudioModel

	# Pull one variant (e.g. small-sfx)
	local = snapshot_download(
	    repo_id="AEmotionStudio/stable-audio-3-mirrors",
	    allow_patterns=["small-sfx/**"],
	)

	# Read the variant's model config
	with open(f"{local}/small-sfx/model_config.json") as f:
	    cfg = json.load(f)

	# Load the weights in half precision, with LoRA disabled
	inner = load_diffusion_cond(cfg, f"{local}/small-sfx/model.safetensors",
	                            device="cuda", model_half=True)
	inner.use_lora = False
	inner.lora_names = []
	model = StableAudioModel(inner, cfg, "cuda", model_half=True)

	# Generate a 10-second clip from a text prompt
	audio = model.generate(
	    prompt="heavy rain on a tin roof with distant thunder",
	    duration=10,
	    steps=8,
	    cfg_scale=1.0,
	)
	```
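When switching between variants, the path handling in the example above can be factored into a small helper that also verifies the two files every subdir is expected to contain. This is a sketch only: `variant_files` is a hypothetical name and not part of the upstream `stable_audio_3` package.

```python
from pathlib import Path


def variant_files(snapshot_root, subdir):
    """Return (config_path, weights_path) for one variant subdir.

    Raises FileNotFoundError if either expected file is missing, for
    example when allow_patterns did not cover the requested subdir.
    """
    base = Path(snapshot_root) / subdir
    config_path = base / "model_config.json"
    weights_path = base / "model.safetensors"
    missing = [p.name for p in (config_path, weights_path) if not p.is_file()]
    if missing:
        raise FileNotFoundError(f"{base} is missing: {', '.join(missing)}")
    return config_path, weights_path
```

With this in place, the `open(...)` and `load_diffusion_cond(...)` calls can take `config_path` and `str(weights_path)` instead of hand-built f-strings.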

	## Attribution

	- Models: Stability AI — Stable Audio 3 ([blog](https://stability.ai/news/stable-audio-3-open), upstream code: [`Stability-AI/stable-audio-3`](https://github.com/Stability-AI/stable-audio-3)).
	- Text encoder: Google T5-Gemma (bundled in each generative subdir).
	- Autoencoder: Stability AI SAME — Semantic-Acoustic Music Encoder.

	This mirror exists to bundle the family + extras into a single browseable HF repo for the MAESTRO desktop app. Apart from the bf16 re-save described under Format, it does not modify the weights; report quality or licensing issues to the upstream repos.