stable-audio-3-small-music — 4-bit ONNX bundle for the browser

Quantized ONNX export of stabilityai/stable-audio-3-small-music (Stability AI Community License) intended to run end-to-end in a web browser via onnxruntime-web.

All weight-bearing MatMul/Linear nodes are quantized to int4 MatMulNBits with block_size=16, embedding tables are quantized as GatherBlockQuantized, and remaining initializers (LayerNorm/RMSNorm scales, biases, Conv1d kernels) stay in fp32. External-data sidecars are split into multiple files ≤ 100 MB each.
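To make the block-wise scheme concrete, here is a minimal sketch of 4-bit quantization with one scale per block of 16 values. It illustrates the general idea only: the symmetric range, the fixed zero point of 8, the unpacked one-code-per-byte storage, and the function names are all assumptions for illustration, not the exporter or the MatMulNBits packing used for this bundle.

```javascript
// Illustrative only: block-wise 4-bit quantization with one fp32 scale per block.
// Assumptions (not taken from the exporter): symmetric range, fixed zero point 8,
// codes stored unpacked (one per byte) rather than two per byte.
const BLOCK_SIZE = 16;
const ZERO_POINT = 8;

function quantizeBlocks(weights, blockSize = BLOCK_SIZE) {
  const nBlocks = Math.ceil(weights.length / blockSize);
  const q = new Uint8Array(weights.length);
  const scales = new Float32Array(nBlocks);
  for (let b = 0; b < nBlocks; b++) {
    const start = b * blockSize;
    const end = Math.min(start + blockSize, weights.length);
    let maxAbs = 0;
    for (let i = start; i < end; i++) maxAbs = Math.max(maxAbs, Math.abs(weights[i]));
    scales[b] = maxAbs > 0 ? maxAbs / 7 : 1;  // maps [-maxAbs, maxAbs] onto codes 1..15
    const scale = scales[b];                  // use the stored fp32 value consistently
    for (let i = start; i < end; i++) {
      const code = Math.round(weights[i] / scale) + ZERO_POINT;
      q[i] = Math.min(15, Math.max(0, code));
    }
  }
  return { q, scales };
}

function dequantizeBlocks({ q, scales }, blockSize = BLOCK_SIZE) {
  const out = new Float32Array(q.length);
  for (let i = 0; i < q.length; i++) {
    out[i] = (q[i] - ZERO_POINT) * scales[Math.floor(i / blockSize)];
  }
  return out;
}
```

With this scheme the round-trip error of each weight is bounded by half of its block's scale, which is why a smaller block size gives a better fit at the cost of storing more scales.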

The bundle is intended as a drop-in for the demo at https://github.com/lsb/stable-audio-3-small-music-onnx (or a public copy of that demo hosted alongside it).

Files

onnx/
  text_encoder_q4.onnx      + text_encoder_q4_chunk_{0..N}.data   (T5Gemma encoder, ~213 MB)
  dit_q4.onnx               + dit_q4_chunk_{0..N}.data            (diffusion transformer, ~380 MB)
  decoder_q4.onnx           + decoder_q4_chunk_{0..N}.data        (SAME-S decoder, ~45 MB)
  number_conditioner.onnx                                            (duration embedder, ~0.8 MB)
  *_chunks.json                                                     browser weight manifest per graph
tokenizer/                                                          T5Gemma tokenizer files
config.json                                                         runtime config
LICENSE.md LICENSE_GEMMA.md NOTICE

Total bundle size: about 640 MB of int4 weights spread across 9 chunks.

Inference shape

Latent: (1, 256, T_lat) where T_lat = ceil((seconds + 6) * 44100 / 8192) * 2
Cross-attention conditioning: (1, 257, 768) (256 T5Gemma tokens + 1 duration embedding)
Global conditioning (adaLN): (1, 768) (duration embedding)
Local-add conditioning (inpaint): (1, 257, T_lat) (zeros for plain text-to-audio)
Padding mask: (1, T_lat) boolean
Output (decoder): (1, 2, T_lat * 4096) stereo audio at 44.1 kHz, clamped to [-1, 1]
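The shape rules above can be turned into a small helper. This is a sketch that only restates the formulas in the list; the function names and the keys of the returned object are hypothetical and are not the ONNX graphs' input names.

```javascript
// Shape helper that restates the formulas listed above.
// Key names are illustrative, not the ONNX input names.
const SAMPLE_RATE = 44100;

function latentLength(seconds) {
  // T_lat = ceil((seconds + 6) * 44100 / 8192) * 2
  return Math.ceil(((seconds + 6) * SAMPLE_RATE) / 8192) * 2;
}

function inferenceShapes(seconds) {
  const tLat = latentLength(seconds);
  return {
    latent: [1, 256, tLat],
    crossAttn: [1, 257, 768],    // 256 text tokens + 1 duration embedding
    global: [1, 768],
    localAdd: [1, 257, tLat],    // all zeros for plain text-to-audio
    paddingMask: [1, tLat],
    audio: [1, 2, tLat * 4096],  // decoder output, stereo at 44.1 kHz
  };
}
```

For a 10 s request this gives T_lat = 174 and 712704 output samples per channel, about 16.2 s of audio, because the formula pads the requested duration by 6 s and rounds up.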

Sampler

The model is sampled with the rf_denoiser objective and the pingpong sampler (5 lines of arithmetic, ported to JS in the demo). Each step computes:

denoised = x - t_curr * dit(x, t_curr, …)
x = (1 - t_next) * denoised + t_next * randn_like(x)

The schedule comes from LogSNRShift(rate=0, anchor_logsnr=-6.2, logsnr_end=2.0). It is sequence-length-invariant, so the same closed-form formula works for any duration. The default is 8 steps. CFG is disabled at inference time (cfg_scale=1.0 in the original).
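The two update lines can be wrapped in a loop as in the sketch below. It is not the demo's code: `model` is a hypothetical async wrapper around the DiT session, and `schedule` is assumed to be a precomputed, decreasing list of timesteps (9 entries for 8 steps). The LogSNRShift formula itself is not reproduced here.

```javascript
// Standard normal sample via the Box-Muller transform.
function randn(rng = Math.random) {
  const u = 1 - rng();  // in (0, 1], avoids log(0)
  const v = rng();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

// x:        Float32Array holding the flattened noisy latent
// schedule: decreasing timesteps, e.g. [t_0, ..., t_8] for 8 steps (assumed precomputed)
// model:    hypothetical async (x, t) => Float32Array, same length as x
async function pingpongSample(x, schedule, model, rng = Math.random) {
  x = Float32Array.from(x);  // work on a copy, leave the caller's array untouched
  for (let i = 0; i < schedule.length - 1; i++) {
    const tCurr = schedule[i];
    const tNext = schedule[i + 1];
    const v = await model(x, tCurr);
    for (let j = 0; j < x.length; j++) {
      const denoised = x[j] - tCurr * v[j];                 // predict the clean latent
      x[j] = (1 - tNext) * denoised + tNext * randn(rng);   // re-noise to level t_next
    }
  }
  return x;
}
```

If the schedule ends at 0, the last iteration adds no noise and the loop returns the final denoised estimate.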

Browser usage (sketch)

// npm i onnxruntime-web @huggingface/transformers   (bundle with esbuild)
import * as ort from "onnxruntime-web/wasm";        // ort.wasm.bundle.min.mjs — includes the
                                                    //   SIMD-threaded WASM with MatMulNBits
                                                    //   + GatherBlockQuantized
import { AutoTokenizer } from "@huggingface/transformers";

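// Single-threaded WASM with SIMD enabled.
ort.env.wasm.numThreads = 1;
ort.env.wasm.simd = true;

const base = "https://huggingface.co/lsb/stable-audio-3-small-music-onnx/resolve/main";

// Fetch the chunk manifest and the graph file for the DiT.
const manifest = await fetch(`${base}/onnx/dit_q4_chunks.json`).then(r => r.json());
const ditBuf = await fetch(`${base}/onnx/dit_q4.onnx`).then(r => r.arrayBuffer());

// Fetch every external-data chunk listed in the manifest. Each `path` has to
// match the external-data location recorded inside the graph.
const externalData = await Promise.all(manifest.chunks.map(async c => ({
  path: c.name,
  data: new Uint8Array(await (await fetch(`${base}/onnx/${c.name}`)).arrayBuffer()),
})));

// Create the session on the WASM execution provider with the weights attached.
const sess = await ort.InferenceSession.create(new Uint8Array(ditBuf), {
  executionProviders: ["wasm"],
  externalData,
});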

See the demo source for the full pipeline (tokenizer → text encoder → number_conditioner → pingpong loop → decoder → WAV).
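For the last stage of that pipeline (decoder output to WAV), a minimal sketch of a 16-bit PCM encoder is shown below. It assumes the decoder tensor's data is a flat Float32Array in planar order (all left samples, then all right samples), which is what a row-major (1, 2, N) tensor gives; `encodeWavStereo` is a hypothetical helper, not a function from the demo.

```javascript
// Encode planar stereo float audio ([left..., right...]) as a 16-bit PCM WAV file.
// Returns an ArrayBuffer containing the 44-byte header plus interleaved samples.
function encodeWavStereo(planar, sampleRate = 44100) {
  const channels = 2;
  const bytesPerSample = 2;
  const n = planar.length / channels;  // samples per channel
  const dataBytes = n * channels * bytesPerSample;
  const buf = new ArrayBuffer(44 + dataBytes);
  const view = new DataView(buf);
  const writeTag = (offset, tag) => {
    for (let i = 0; i < tag.length; i++) view.setUint8(offset + i, tag.charCodeAt(i));
  };
  writeTag(0, "RIFF");
  view.setUint32(4, 36 + dataBytes, true);
  writeTag(8, "WAVE");
  writeTag(12, "fmt ");
  view.setUint32(16, 16, true);                                      // fmt chunk size
  view.setUint16(20, 1, true);                                       // format 1 = PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * channels * bytesPerSample, true);  // byte rate
  view.setUint16(32, channels * bytesPerSample, true);               // block align
  view.setUint16(34, 8 * bytesPerSample, true);                      // bits per sample
  writeTag(36, "data");
  view.setUint32(40, dataBytes, true);
  let offset = 44;
  for (let i = 0; i < n; i++) {
    for (let c = 0; c < channels; c++) {
      // Defensive clamp; the decoder output is already limited to [-1, 1].
      const s = Math.max(-1, Math.min(1, planar[c * n + i]));
      view.setInt16(offset, Math.round(s * 32767), true);
      offset += bytesPerSample;
    }
  }
  return buf;
}
```

In a browser the result can be wrapped as `new Blob([buf], { type: "audio/wav" })` and played through an object URL.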

Quality / performance

Per-graph q4 vs fp32 SNR (single forward pass): DiT ~10 dB, decoder ~15 dB, text encoder ~13 dB.
End-to-end vs the PyTorch fp32 reference: envelope correlation ≈ 0.88 on the same prompt/seed — same musical structure, slightly more high-frequency artifacts.
Single-threaded WASM on an M-series Mac: roughly 60–120 s wall-clock for a 10 s clip at 8 steps. WebGPU would be much faster but is intentionally not used here so the bundle works from any static host.
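For readers who want to run a comparable check, the sketch below computes SNR in dB with the conventional definition (reference power over error power). The card does not state exactly how its figures were measured, so treat this as an assumption about the metric rather than the script that produced them.

```javascript
// Signal-to-noise ratio in dB between a reference output and a test output:
// 10 * log10( sum(ref^2) / sum((ref - test)^2) ). Conventional definition, assumed here.
function snrDb(reference, test) {
  let signal = 0;
  let noise = 0;
  for (let i = 0; i < reference.length; i++) {
    const d = reference[i] - test[i];
    signal += reference[i] * reference[i];
    noise += d * d;
  }
  return 10 * Math.log10(signal / noise);
}
```

Under this definition, 10 dB means the error power is one tenth of the signal power, and identical inputs give Infinity.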

License

This bundle inherits the Stability AI Community License from the upstream weights. The T5Gemma encoder weights additionally fall under Google's Gemma Terms of Use. Both license files are included verbatim; see NOTICE for the combined attribution.
