stable-audio-3-small-music: 4-bit ONNX bundle for the browser
Quantized ONNX export of stabilityai/stable-audio-3-small-music (Stability AI Community License) intended to run end-to-end in a web browser via onnxruntime-web.
All weight-bearing MatMul/Linear nodes are quantized to int4 MatMulNBits with block_size=16, embedding tables are quantized as GatherBlockQuantized, and the remaining initializers (LayerNorm/RMSNorm scales, biases, Conv1d kernels) stay in fp32. External-data sidecars are split into multiple files of ≤ 100 MB each.
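To make the idea of block-wise int4 quantization concrete, here is a minimal sketch of quantizing a flat weight array in blocks of 16 with one scale per block. It is an illustration only. The symmetric scale choice (`maxAbs / 7`), the fixed zero point of 8, the unpacked one-code-per-byte storage, and the function names are all assumptions made for this example. None of it is the exact packing or scale selection that ONNX Runtime's MatMulNBits uses.

```javascript
// Simplified block-wise 4-bit quantization (illustrative; NOT the exact
// MatMulNBits layout). Each block of `blockSize` weights shares one scale.
const BLOCK_SIZE = 16; // matches the block_size=16 quoted above

function quantizeBlocks(weights, blockSize = BLOCK_SIZE) {
  const numBlocks = Math.ceil(weights.length / blockSize);
  const scales = new Float32Array(numBlocks);
  const codes = new Uint8Array(weights.length); // one 4-bit code per byte (unpacked)
  for (let b = 0; b < numBlocks; b++) {
    const start = b * blockSize;
    const end = Math.min(start + blockSize, weights.length);
    let maxAbs = 0;
    for (let i = start; i < end; i++) maxAbs = Math.max(maxAbs, Math.abs(weights[i]));
    scales[b] = maxAbs > 0 ? maxAbs / 7 : 1; // assumed symmetric scale
    const scale = scales[b];
    for (let i = start; i < end; i++) {
      const q = Math.round(weights[i] / scale) + 8; // assumed zero point of 8
      codes[i] = Math.min(15, Math.max(0, q)); // clamp to the 4-bit range
    }
  }
  return { codes, scales, blockSize };
}

function dequantizeBlocks({ codes, scales, blockSize }) {
  const out = new Float32Array(codes.length);
  for (let i = 0; i < codes.length; i++) {
    out[i] = (codes[i] - 8) * scales[Math.floor(i / blockSize)];
  }
  return out;
}
```

With this scheme the round-trip error of each weight is bounded by about half of its block's scale, which is why small blocks such as 16 help when a few large weights would otherwise dominate the range.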
The bundle is a drop-in for the demo at https://github.com/lsb/stable-audio-3-small-music-onnx (or whatever public copy lives alongside it).
Files
onnx/
  text_encoder_q4.onnx + text_encoder_q4_chunk_{0..N}.data (T5Gemma encoder, ~213 MB)
  dit_q4.onnx + dit_q4_chunk_{0..N}.data (diffusion transformer, ~380 MB)
  decoder_q4.onnx + decoder_q4_chunk_{0..N}.data (SAME-S decoder, ~45 MB)
  number_conditioner.onnx (duration embedder, ~0.8 MB)
  *_chunks.json (browser weight manifest per graph)
tokenizer/ (T5Gemma tokenizer files)
config.json (runtime config)
LICENSE.md, LICENSE_GEMMA.md, NOTICE
Total bundle size: about 640 MB of int4 weights spread across 9 chunks.
Inference shape
- Latent: (1, 256, T_lat) where T_lat = ceil((seconds + 6) * 44100 / 8192) * 2
- Cross-attention conditioning: (1, 257, 768) (256 T5Gemma tokens + 1 duration embedding)
- Global conditioning (adaLN): (1, 768) (duration embedding)
- Local-add conditioning (inpaint): (1, 257, T_lat) (zeros for plain text-to-audio)
- Padding mask: (1, T_lat) boolean
- Output (decoder): (1, 2, T_lat * 4096) stereo audio at 44.1 kHz, clamped to [-1, 1]
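The shapes above can be computed from the requested duration. The helper below is a small sketch that transcribes the T_lat formula and the listed dimensions directly. The function and property names are hypothetical and chosen for this example only.

```javascript
// Tensor shapes for a requested clip length, transcribed from the list above.
const SAMPLE_RATE = 44100;

function latentLength(seconds) {
  // T_lat = ceil((seconds + 6) * 44100 / 8192) * 2
  return Math.ceil(((seconds + 6) * SAMPLE_RATE) / 8192) * 2;
}

function inferenceShapes(seconds) {
  const tLat = latentLength(seconds);
  return {
    latent: [1, 256, tLat],          // DiT latent
    crossAttn: [1, 257, 768],        // 256 text tokens + 1 duration embedding
    globalCond: [1, 768],            // duration embedding (adaLN)
    localAdd: [1, 257, tLat],        // zeros for plain text-to-audio
    paddingMask: [1, tLat],          // boolean
    audio: [1, 2, tLat * 4096],      // decoder output, stereo
  };
}
```

For a 10 s request this gives T_lat = 174 and 712,704 output samples per channel, which is about 16.2 s at 44.1 kHz, so a caller would presumably trim the decoded audio to the requested duration.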
Sampler
rf_denoiser objective with the pingpong sampler (5 lines of arithmetic, ported to JS in the demo):
denoised = x - t_curr * dit(x, t_curr, …)
x = (1 - t_next) * denoised + t_next * randn_like(x)
The schedule comes from LogSNRShift(rate=0, anchor_logsnr=-6.2, logsnr_end=2.0), which is sequence-length-invariant, so the same closed-form formula works for any duration. The default is 8 steps. CFG is disabled at inference time (cfg_scale=1.0 in the original).
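The two update lines can be sketched in plain JavaScript as follows. This is not the demo's code. It assumes the timesteps are supplied as a decreasing array `ts` with one more entry than the number of steps, ending at 0 so the last update returns the denoised estimate. The LogSNRShift formula is not reproduced here. `dit` is a hypothetical async callback that wraps the ONNX session and returns an array the same length as `x`.

```javascript
// Standard-normal sample via the Box-Muller transform.
function randn() {
  const u = 1 - Math.random(); // in (0, 1], avoids log(0)
  const v = Math.random();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

// One pingpong update, given the DiT output for the current x and tCurr:
//   denoised = x - tCurr * ditOut
//   x        = (1 - tNext) * denoised + tNext * noise
function pingpongStep(x, ditOut, tCurr, tNext, noise = randn) {
  const next = new Float32Array(x.length);
  for (let i = 0; i < x.length; i++) {
    const denoised = x[i] - tCurr * ditOut[i];
    next[i] = (1 - tNext) * denoised + tNext * noise();
  }
  return next;
}

// Full loop over an assumed timestep array `ts` (e.g. 9 values for 8 steps).
// `dit(x, t)` is a hypothetical wrapper around the DiT session.
async function pingpongSample(x, ts, dit, noise = randn) {
  for (let k = 0; k + 1 < ts.length; k++) {
    const ditOut = await dit(x, ts[k]);
    x = pingpongStep(x, ditOut, ts[k], ts[k + 1], noise);
  }
  return x;
}
```

The noise source is passed as a parameter so that a seeded generator can be substituted when comparing runs against a reference.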
Browser usage (sketch)
// npm i onnxruntime-web @huggingface/transformers (bundle with esbuild)
import * as ort from "onnxruntime-web/wasm"; // ort.wasm.bundle.min.mjs: includes the
                                             // SIMD-threaded WASM with MatMulNBits
                                             // + GatherBlockQuantized
import { AutoTokenizer } from "@huggingface/transformers";
ort.env.wasm.numThreads = 1;
ort.env.wasm.simd = true;
const base = "https://huggingface.co/lsb/stable-audio-3-small-music-onnx/resolve/main";
const manifest = await fetch(`${base}/onnx/dit_q4_chunks.json`).then(r => r.json());
const ditBuf = await fetch(`${base}/onnx/dit_q4.onnx`).then(r => r.arrayBuffer());
const externalData = await Promise.all(manifest.chunks.map(async c => ({
  path: c.name,
  data: new Uint8Array(await (await fetch(`${base}/onnx/${c.name}`)).arrayBuffer()),
})));
const sess = await ort.InferenceSession.create(new Uint8Array(ditBuf), {
  executionProviders: ["wasm"],
  externalData,
});
See the demo source for the full pipeline (tokenizer → text encoder → number_conditioner → pingpong loop → decoder → WAV).
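The final WAV step of that pipeline can be sketched as below. This is a generic 16-bit PCM encoder, not the demo's implementation. It assumes the decoder output has already been split into two per-channel arrays. For a flat row-major (1, 2, N) tensor, that would be the first N and the last N values. The function name `encodeWav` is hypothetical.

```javascript
// Encode two planar channels of float samples in [-1, 1] as a 16-bit PCM WAV file.
function encodeWav(left, right, sampleRate = 44100) {
  const numFrames = Math.min(left.length, right.length);
  const numChannels = 2;
  const blockAlign = numChannels * 2; // 2 bytes per sample
  const dataSize = numFrames * blockAlign;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);
  const writeAscii = (offset, text) => {
    for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i));
  };
  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true);            // RIFF chunk size
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true);                      // fmt chunk size
  view.setUint16(20, 1, true);                       // format 1 = PCM
  view.setUint16(22, numChannels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * blockAlign, true); // byte rate
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, 16, true);                      // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);
  let offset = 44;
  for (let i = 0; i < numFrames; i++) {
    for (const channel of [left, right]) {           // interleave L, R
      const s = Math.max(-1, Math.min(1, channel[i]));
      view.setInt16(offset, Math.round(s * 32767), true);
      offset += 2;
    }
  }
  return new Uint8Array(buffer);
}
```

In a browser, the returned bytes could be wrapped as `new Blob([bytes], { type: "audio/wav" })` and played through an `<audio>` element or offered as a download.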
Quality / performance
- Per-graph q4 vs fp32 SNR (single forward pass): DiT ~10 dB, decoder ~15 dB, text encoder ~13 dB.
- End-to-end vs the PyTorch fp32 reference: envelope correlation ≈ 0.88 on the same prompt/seed; same musical structure, slightly more high-frequency artifacts.
- Single-threaded WASM on an M-series Mac: roughly 60–120 s wall-clock for a 10 s clip at 8 steps. WebGPU would be much faster but is intentionally not used here so the bundle works from any static host.
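For readers who want to reproduce a per-graph comparison, a common way to compute SNR between a reference output and a quantized output is shown below. The exact definition behind the figures above is not stated, so this is one plausible reading (signal energy over error energy, in dB), and `snrDb` is a hypothetical name.

```javascript
// Signal-to-noise ratio in dB between a reference output (e.g. fp32)
// and a test output (e.g. q4) of the same length.
function snrDb(reference, test) {
  let signal = 0;
  let noise = 0;
  for (let i = 0; i < reference.length; i++) {
    const diff = reference[i] - test[i];
    signal += reference[i] * reference[i];
    noise += diff * diff;
  }
  return 10 * Math.log10(signal / noise); // Infinity when the outputs are identical
}
```

Under this definition, 10 dB means the error energy is one tenth of the signal energy, and 15 dB means roughly one thirty-second.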
License
This bundle inherits the Stability AI Community License from the upstream weights. The T5Gemma encoder weights additionally fall under Google's Gemma Terms of Use. Both license files are included verbatim; see NOTICE for the combined attribution.
Model tree for lsb/stable-audio-3-small-music-onnx
Base model
stabilityai/stable-audio-3-small-music-base