Commit 3334a18 (verified) by mattricesound · 1 Parent(s): 07369e3

Update model card

  1. README.md +101 -0
README.md ADDED
@@ -0,0 +1,101 @@
+ ---
+ language:
+ - en
+ library_name: stable-audio-3
+ license: other
+ license_name: stable-audio-community
+ license_link: LICENSE
+ tags:
+ - music
+ - audio
+ - autoencoder
+ ---
+
+ # SAME: A Semantically-Aligned Music Autoencoder
+
+ Please note: For commercial use, please refer to [https://stability.ai/license](https://stability.ai/license)
+
+ ## Model Description
+ Latent representations are at the heart of the majority of modern generative models.
+ In the audio domain they are typically produced by a neural-audio-codec autoencoder.
+ In this work we introduce SAME (Semantically Aligned Music autoEncoder),
+ a transformer-based autoencoder for stereo music and general audio that reaches a 4096x temporal compression ratio (roughly twice the current standard)
+ while maintaining excellent reconstruction quality and strong downstream generative performance.
+ We achieve this by combining a set of semantic regularisation approaches with phase-aware reconstruction losses.
+ The architecture also delivers substantial computational cost benefits, through both its high compression ratio and its reliance on well-optimised transformer primitives.
+ Two variants (a large SAME-L and a CPU-deployable SAME-S) are released in open-weights form.
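To make the 4096x temporal compression figure concrete, here is a small arithmetic sketch of how many latent frames a clip would map to. The 44.1 kHz sample rate is an illustrative assumption (read the real value from the model config), and the helper names `latent_frames` and `latent_rate_hz` are hypothetical. Rounding up assumes the input is padded to a multiple of the compression ratio, which the model card does not state.

```python
import math


def latent_frames(num_samples: int, compression: int = 4096) -> int:
    """Latent frames for a clip, assuming padding up to a multiple of `compression`."""
    return math.ceil(num_samples / compression)


def latent_rate_hz(sample_rate: int, compression: int = 4096) -> float:
    """Latent frames produced per second of audio."""
    return sample_rate / compression


# Assumed sample rate, for illustration only.
sample_rate = 44_100
print(latent_rate_hz(sample_rate))      # roughly 10.8 latent frames per second
print(latent_frames(30 * sample_rate))  # frames for a 30-second clip
```

At this assumed rate, each latent frame summarises a little under a tenth of a second of audio per channel.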
+
+ ## Usage
+
+ This model can be used with:
+ 1. the [`stable-audio-3`](https://github.com/Stability-AI/stable-audio-3) inference and fine-tuning library
+ 2. the [`stable-audio-tools`](https://github.com/Stability-AI/stable-audio-tools) research library
+
+ ### Using with `stable-audio-3`
+
+ ```python
+ import torchaudio
+ from stable_audio_3 import AutoencoderModel
+
+ ae = AutoencoderModel.from_pretrained("same-s")
+ waveform, sr = torchaudio.load("audio.wav")
+ latents = ae.encode(waveform, sr)
+ audio_out = ae.decode(latents)
+ ```
+
+ ### Using with `stable-audio-tools`
+
+ ```python
+ import torch
+ import torchaudio
+ from stable_audio_tools import get_pretrained_model
+
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ # Use half precision only on GPU; defined on both paths so CPU runs do not fail
+ model_half = device == "cuda"
+
+ # Download model
+ model, model_config = get_pretrained_model("stabilityai/SAME-S")
+ sample_rate = model_config["sample_rate"]
+
+ model = model.to(device)
+ if model_half:
+     model = model.to(torch.float16)
+
+ audio, sr = torchaudio.load("/path/to/audiofile")  # [channels, samples]
+ if sr != sample_rate:
+     # Resample to the rate given in the model config
+     audio = torchaudio.functional.resample(audio, sr, sample_rate)
+ if audio.shape[0] == 1:
+     # Duplicate a mono signal to two channels
+     audio = audio.repeat(2, 1)
+
+ audio = audio.unsqueeze(0).to(device)  # [batch, channels, samples]
+ if model_half:
+     audio = audio.half()
+ with torch.no_grad():
+     latents = model.encode_audio(audio)
+     reconstructed = model.decode_audio(latents)
+ reconstructed = reconstructed.squeeze(0)
+ # Convert to 16-bit PCM on the CPU
+ reconstructed = reconstructed.to(torch.float32).clamp(-1, 1).mul(32767).to(torch.int16).cpu()
+ ```
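The snippet above ends with an int16 tensor but does not write it to disk. As a minimal sketch, the hypothetical helper `write_pcm16_wav` below saves such data as a 16-bit WAV file using only the Python standard library. It assumes the samples are passed as one list of integers per channel, for example via `reconstructed.tolist()`. It is not part of either library.

```python
import struct
import wave


def write_pcm16_wav(path, channels, sample_rate):
    """Write per-channel lists of int16 samples as an interleaved 16-bit WAV file."""
    num_frames = len(channels[0])
    if any(len(ch) != num_frames for ch in channels):
        raise ValueError("all channels must have the same length")
    # Interleave samples frame by frame: L0, R0, L1, R1, ...
    interleaved = [ch[i] for i in range(num_frames) for ch in channels]
    with wave.open(path, "wb") as wav:
        wav.setnchannels(len(channels))
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        # WAV stores samples as little-endian signed 16-bit integers
        wav.writeframes(struct.pack(f"<{len(interleaved)}h", *interleaved))
```

A call such as `write_pcm16_wav("reconstructed.wav", reconstructed.tolist(), sample_rate)` would then store the output. When torchaudio is already installed, `torchaudio.save` is the more direct route.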
+
+ ## Model Details
+ * **Model type**: `SAME` is a continuous autoencoder model based on a transformer architecture.
+ * **Language(s)**: English
+ * **License**: [Stability AI Community License](https://huggingface.co/stabilityai/SAME-S/blob/main/LICENSE.md)
+ * **Commercial License**: to use this model commercially, please refer to [https://stability.ai/license](https://stability.ai/license)
+ * **Research Paper**: [https://arxiv.org/pdf/2605.18613](https://arxiv.org/pdf/2605.18613)
+
+ ## Training Dataset
+
+ ### Datasets Used
+ Our dataset consists of ~19,500 hours of licensed production audio from [Audiosparx](https://www.audiosparx.com/), comprising a 66/25/9% mix of music, sound effects, and instrument stems, respectively.
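For a rough sense of scale, the stated mix can be converted into approximate hours per category. This is plain arithmetic on the figures quoted above. The total is itself approximate, so the results are estimates, and the names used here are illustrative.

```python
TOTAL_HOURS = 19_500  # approximate total quoted in the model card
MIX_PERCENT = {"music": 66, "sound effects": 25, "instrument stems": 9}


def hours_per_category(total_hours, mix_percent):
    """Split a total duration according to integer percentage shares."""
    if sum(mix_percent.values()) != 100:
        raise ValueError("percentages must sum to 100")
    return {name: total_hours * pct / 100 for name, pct in mix_percent.items()}


breakdown = hours_per_category(TOTAL_HOURS, MIX_PERCENT)
for name, hours in breakdown.items():
    print(f"{name}: ~{hours:,.0f} h")
```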