---
language:
- en
library_name: stable-audio-3
license: other
license_name: stable-audio-community
license_link: LICENSE
tags:
- music
- sound-effects
- audio
- autoencoder
---

# SAME: A Semantically-Aligned Music Autoencoder

Please note: For commercial use, please refer to [https://stability.ai/license](https://stability.ai/license)

## Model Description
Latent representations are at the heart of the majority of modern generative models. 
In the audio domain they are typically produced by a neural-audio-codec autoencoder. 
In this work we introduce SAME (Semantically Aligned Music autoEncoder), 
a transformer-based autoencoder for stereo music and general audio that reaches a 4096x temporal compression ratio (roughly twice the current standard)
while maintaining excellent reconstruction quality and strong downstream generative performance. 
We achieve this by combining a set of semantic regularisation approaches with phase-aware reconstruction losses. 
The architecture also delivers substantial computational cost benefits, through both its high compression ratio and its reliance on well-optimised transformer primitives. 
Two variants (a large SAME-L and a CPU-deployable SAME-S) are released in open-weights form.
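To make the 4096x temporal compression ratio concrete, the sketch below computes how many latent frames a clip would occupy. It is illustrative arithmetic only: the 44.1 kHz sample rate is an assumption (read the real value from the model config), the helper names are hypothetical, and rounding up assumes the input is padded to a multiple of the ratio.

```python
import math

# Temporal compression ratio stated in the model description.
DOWNSAMPLING_RATIO = 4096


def latent_frames(num_samples: int, ratio: int = DOWNSAMPLING_RATIO) -> int:
    """Latent frames for a clip, assuming padding up to a multiple of `ratio`."""
    return math.ceil(num_samples / ratio)


def latent_rate_hz(sample_rate: int, ratio: int = DOWNSAMPLING_RATIO) -> float:
    """Latent frames per second of audio at the given sample rate."""
    return sample_rate / ratio


# Assumed sample rate, for illustration only.
assumed_sample_rate = 44_100
print(latent_frames(10 * assumed_sample_rate))      # 10 s clip -> 108 frames
print(round(latent_rate_hz(assumed_sample_rate), 2))  # -> 10.77 frames/s
```

Under these assumptions, one second of audio maps to roughly 11 latent frames, which is where the savings for downstream generative models come from.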

## Usage

This model can be used with:
1. the [`stable-audio-3`](https://github.com/Stability-AI/stable-audio-3) inference and fine-tuning library
2. the [`stable-audio-tools`](https://github.com/Stability-AI/stable-audio-tools) research library


### Using with `stable-audio-3`
```python
import torchaudio
from stable_audio_3 import AutoencoderModel

ae = AutoencoderModel.from_pretrained("same-s")
waveform, sr = torchaudio.load("audio.wav")
latents = ae.encode(waveform, sr)
audio_out = ae.decode(latents)
```

### Using with `stable-audio-tools`

```python
import torch
import torchaudio
from stable_audio_tools import get_pretrained_model

device = "cuda" if torch.cuda.is_available() else "cpu"
# Use half precision only when running on a GPU
model_half = device == "cuda"

# Download model
model, model_config = get_pretrained_model("stabilityai/SAME-S")
sample_rate = model_config["sample_rate"]
sample_size = model_config["sample_size"]

model = model.to(device)
if model_half:
    model = model.to(torch.float16)

audio, sr = torchaudio.load("/path/to/audiofile")  # [channels, samples]
# Resample to the model's sample rate if the file uses a different one
if sr != sample_rate:
    audio = torchaudio.functional.resample(audio, sr, sample_rate)
# Duplicate a mono channel so the model receives stereo input
if audio.shape[0] == 1:
    audio = audio.repeat(2, 1)

audio = audio.unsqueeze(0).to(device)  # [batch, channels, samples]
if model_half:
    audio = audio.half()
with torch.no_grad():
    latents = model.encode_audio(audio)
    reconstructed = model.decode_audio(latents)
reconstructed = reconstructed.squeeze(0).cpu()
# Convert to 16-bit PCM
reconstructed = reconstructed.to(torch.float32).clamp(-1, 1).mul(32767).to(torch.int16)

```


## Model Details
* **Model type**: `SAME` is a continuous autoencoder model based on a transformer architecture.
* **Language(s)**: English
* **License**: [Stability AI Community License](https://stability.ai/license).
* **Research Paper**: [https://arxiv.org/abs/2605.18613](https://arxiv.org/abs/2605.18613)


## Training dataset

### Datasets Used
Our dataset consists of ~19,500 hours of licensed production audio from [AudioSparx](https://www.audiosparx.com/), split 66/25/9% between music, sound effects, and instrument stems.
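As a rough guide, the percentage mix can be converted to approximate hours per category. The sketch below only restates the figures above; because the 19,500-hour total is itself approximate, the results are estimates rather than exact dataset statistics.

```python
# Approximate total and percentage mix, as stated in the dataset description.
TOTAL_HOURS = 19_500
MIX_PERCENT = {"music": 66, "sound effects": 25, "instrument stems": 9}

# Integer arithmetic avoids floating-point rounding noise.
hours_by_category = {
    name: TOTAL_HOURS * pct // 100 for name, pct in MIX_PERCENT.items()
}

for name, hours in hours_by_category.items():
    print(f"{name}: ~{hours:,} hours")
```

This gives roughly 12,870 hours of music, 4,875 hours of sound effects, and 1,755 hours of instrument stems.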