bharatverse11 gab-gdp committed on
Commit
1c50cd7
·
0 Parent(s):

Duplicate from gab-gdp/StableBeaT

Co-authored-by: Gabriel Guiet-Dupré <gab-gdp@users.noreply.huggingface.co>

.gitattributes ADDED
@@ -0,0 +1,70 @@
1
+ *.7z filter=lfs diff=lfs merge=lfs -text
2
+ *.arrow filter=lfs diff=lfs merge=lfs -text
3
+ *.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
5
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
6
+ *.ftz filter=lfs diff=lfs merge=lfs -text
7
+ *.gz filter=lfs diff=lfs merge=lfs -text
8
+ *.h5 filter=lfs diff=lfs merge=lfs -text
9
+ *.joblib filter=lfs diff=lfs merge=lfs -text
10
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
+ *.model filter=lfs diff=lfs merge=lfs -text
13
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
14
+ *.npy filter=lfs diff=lfs merge=lfs -text
15
+ *.npz filter=lfs diff=lfs merge=lfs -text
16
+ *.onnx filter=lfs diff=lfs merge=lfs -text
17
+ *.ot filter=lfs diff=lfs merge=lfs -text
18
+ *.parquet filter=lfs diff=lfs merge=lfs -text
19
+ *.pb filter=lfs diff=lfs merge=lfs -text
20
+ *.pickle filter=lfs diff=lfs merge=lfs -text
21
+ *.pkl filter=lfs diff=lfs merge=lfs -text
22
+ *.pt filter=lfs diff=lfs merge=lfs -text
23
+ *.pth filter=lfs diff=lfs merge=lfs -text
24
+ *.rar filter=lfs diff=lfs merge=lfs -text
25
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
26
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
28
+ *.tar filter=lfs diff=lfs merge=lfs -text
29
+ *.tflite filter=lfs diff=lfs merge=lfs -text
30
+ *.tgz filter=lfs diff=lfs merge=lfs -text
31
+ *.wasm filter=lfs diff=lfs merge=lfs -text
32
+ *.xz filter=lfs diff=lfs merge=lfs -text
33
+ *.zip filter=lfs diff=lfs merge=lfs -text
34
+ *.zst filter=lfs diff=lfs merge=lfs -text
35
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ results/dreamt_14/1830357556.wav filter=lfs diff=lfs merge=lfs -text
37
+ results/dreamt_14/2306776750.wav filter=lfs diff=lfs merge=lfs -text
38
+ results/dreamt_14/2505643137.wav filter=lfs diff=lfs merge=lfs -text
39
+ results/dreamt_17/1830357556.wav filter=lfs diff=lfs merge=lfs -text
40
+ results/dreamt_17/2306776750.wav filter=lfs diff=lfs merge=lfs -text
41
+ results/dreamt_17/2505643137.wav filter=lfs diff=lfs merge=lfs -text
42
+ results/dreamt_8/1830357556.wav filter=lfs diff=lfs merge=lfs -text
43
+ results/dreamt_8/2306776750.wav filter=lfs diff=lfs merge=lfs -text
44
+ results/dreamt_8/2505643137.wav filter=lfs diff=lfs merge=lfs -text
45
+ results/stable-audio-1/1830357556.wav filter=lfs diff=lfs merge=lfs -text
46
+ results/stable-audio-1/2306776750.wav filter=lfs diff=lfs merge=lfs -text
47
+ results/stable-audio-1/2505643137.wav filter=lfs diff=lfs merge=lfs -text
48
+ results/dreamt_14/1580039167.wav filter=lfs diff=lfs merge=lfs -text
49
+ results/dreamt_14/1984661836.wav filter=lfs diff=lfs merge=lfs -text
50
+ results/dreamt_14/2756405298.wav filter=lfs diff=lfs merge=lfs -text
51
+ results/dreamt_14/3278661061.wav filter=lfs diff=lfs merge=lfs -text
52
+ assets/FreqInstruments.png filter=lfs diff=lfs merge=lfs -text
53
+ assets/FreqMoods.png filter=lfs diff=lfs merge=lfs -text
54
+ assets/preview.gif filter=lfs diff=lfs merge=lfs -text
55
+ results/stable-audio-1/1580039167.wav filter=lfs diff=lfs merge=lfs -text
56
+ results/stable-audio-1/1984661836.wav filter=lfs diff=lfs merge=lfs -text
57
+ results/stable-audio-1/2756405298.wav filter=lfs diff=lfs merge=lfs -text
58
+ results/stable-audio-1/3278661061.wav filter=lfs diff=lfs merge=lfs -text
59
+ assets/cluster.png filter=lfs diff=lfs merge=lfs -text
60
+ results/dreamt_14/2321349264.wav filter=lfs diff=lfs merge=lfs -text
61
+ results/dreamt_14/3674155910.wav filter=lfs diff=lfs merge=lfs -text
62
+ results/stable-audio-1/2321349264.wav filter=lfs diff=lfs merge=lfs -text
63
+ results/stable-audio-1/3674155910.wav filter=lfs diff=lfs merge=lfs -text
64
+ results/dreamt_14/3576830411.wav filter=lfs diff=lfs merge=lfs -text
65
+ dataset/tags.json filter=lfs diff=lfs merge=lfs -text
66
+ results/stable-audio-1/3576830411.wav filter=lfs diff=lfs merge=lfs -text
67
+ results/dreamt_14/1121349264.wav filter=lfs diff=lfs merge=lfs -text
68
+ results/dreamt_14/1784661836.wav filter=lfs diff=lfs merge=lfs -text
69
+ results/stable-audio-1/1121349264.wav filter=lfs diff=lfs merge=lfs -text
70
+ results/stable-audio-1/1784661836.wav filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,224 @@
1
+ ---
2
+ base_model:
3
+ - stabilityai/stable-audio-open-1.0
4
+ tags:
5
+ - music-generation
6
+ - trap
7
+ - rap
8
+ - hip-hop
9
+ - beat-generation
10
+ - fine-tuning
11
+ - music-tagging
12
+ ---
13
+
14
+ <h1 align="center"> SAO fine-tuning for modern beat generation</h1>
15
+ <p align="center">
16
+ As a music and AI lover, I wanted to dive into music generation technologies.
17
+ </p>
18
+
19
+ <p align="center">
20
+ <img src="./assets/preview.gif" alt="preview" width="400"/>
21
+ </p>
22
+
23
+ <p align="center">
24
+ First, I explored existing music generation models such as Suno and Stable Audio 2.0, but I couldn't find one that generates modern trap/rap/R&B beats well. That gave me the idea of fine-tuning an open-source model on a large set of trap beats. I chose Stable Audio Open 1.0, as I found it to be the most suitable open-source foundation for this kind of task.
25
+ </p>
26
+
27
+ # Results
28
+
29
+ [**Here**](https://github.com/Gab404/Stable-BeaT) is the GitHub repository for model inference.
30
+ <br>
31
+ All of the following results were generated with 200 steps, a CFG scale of 7, a start time of 0 s, and a duration of 47 s.
32
+
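The generation settings above line up with the conditioning fields that appear in `model_config.json` (`prompt`, `seconds_start`, `seconds_total`). A minimal sketch of how they could be packaged, assuming a hypothetical helper `build_conditioning`; the `steps` and `cfg_scale` key names are illustrative and not a confirmed API:

```python
def build_conditioning(prompt, seconds_start=0, seconds_total=47):
    """Return one conditioning entry shaped like `demo_cond` in model_config.json."""
    # The config bounds both timing values to the 0-512 s range.
    for value in (seconds_start, seconds_total):
        if not 0 <= value <= 512:
            raise ValueError("timing values must stay within 0-512 seconds")
    return {
        "prompt": prompt,
        "seconds_start": seconds_start,
        "seconds_total": seconds_total,
    }

# Sampler settings reported for the results (hypothetical key names).
SAMPLER_SETTINGS = {"steps": 200, "cfg_scale": 7}

conditioning = [
    build_conditioning(
        "A dark and melancholic cloud trap beat, with nostalgic piano, "
        "plucked bass and synth bells, at 110 BPM."
    )
]
```

The list with a single dictionary mirrors the shape of the `demo_cond` entries in the config, one entry per generated sample.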
33
+ ---
34
+
35
+ ### Prompt 1
36
+ *A dark and melancholic cloud trap beat, with nostalgic piano, plucked bass and synth bells, at 110 BPM.*
37
+
38
+ | Stable Audio Open 1.0 | StableBeaT |
39
+ |:--|:--|
40
+ | <audio controls src="https://huggingface.co/gab-gdp/sao-finetuned-trap-rap-beat/resolve/main/results/stable-audio-1/2306776750.wav"></audio> | <audio controls src="https://huggingface.co/gab-gdp/sao-finetuned-trap-rap-beat/resolve/main/results/dreamt_14/2306776750.wav"></audio> |
41
+
42
+ | BPM | Spectral Centroid | Spectral Flatness | Harmonic/Percussive Ratio | Transient Sharpness | CLAP Prompt Score |
43
+ |:--|:--|:--|:--|:--|:--|
44
+ | **106.13** | **1159.43** | **0.000091** | **0.460** | **0.000073** | **0.489** |
45
+
46
+ ---
47
+
48
+ ### Prompt 2
49
+ *A laid back lo-fi jazz rap at 85 BPM, featuring deep sub, plucked bass, and vocal chop, with chill and jazzy relaxed moods.*
50
+
51
+ | Stable Audio Open 1.0 | StableBeaT |
52
+ |:--|:--|
53
+ | <audio controls src="https://huggingface.co/gab-gdp/sao-finetuned-trap-rap-beat/resolve/main/results/stable-audio-1/2505643137.wav"></audio> | <audio controls src="https://huggingface.co/gab-gdp/sao-finetuned-trap-rap-beat/resolve/main/results/dreamt_14/2505643137.wav"></audio> |
54
+
55
+ | BPM | Spectral Centroid | Spectral Flatness | Harmonic/Percussive Ratio | Transient Sharpness | CLAP Prompt Score |
56
+ |:--|:--|:--|:--|:--|:--|
57
+ | **82.72** | **784.82** | **0.000030** | **0.457** | **0.000015** | **0.429** |
58
+
59
+ ---
60
+
61
+ ### Prompt 3
62
+ *Melancholic trap beat at 105 BPM with shimmering synth bells and deep sub bass, minor chord progressions on piano, and airy vocal pads, evoking a cinematic and emotional atmosphere.*
63
+
64
+ | Stable Audio Open 1.0 | StableBeaT |
65
+ |:--|:--|
66
+ | <audio controls src="https://huggingface.co/gab-gdp/sao-finetuned-trap-rap-beat/resolve/main/results/stable-audio-1/1580039167.wav"></audio> | <audio controls src="https://huggingface.co/gab-gdp/sao-finetuned-trap-rap-beat/resolve/main/results/dreamt_14/1580039167.wav"></audio> |
67
+
68
+ | BPM | Spectral Centroid | Spectral Flatness | Harmonic/Percussive Ratio | Transient Sharpness | CLAP Prompt Score |
69
+ |:--|:--|:--|:--|:--|:--|
70
+ | **100.45** | **2540.28** | **0.000284** | **1.412** | **0.0000585** | **0.523** |
71
+
72
+ ---
73
+
74
+ ### Prompt 4
75
+ *A jazzy chillhop beat at 101 BPM featuring synth bells, vocal pad, and movie sample, evoking trap nostalgic and chill moods.*
76
+
77
+ | Stable Audio Open 1.0 | StableBeaT |
78
+ |:--|:--|
79
+ | <audio controls src="https://huggingface.co/gab-gdp/sao-finetuned-trap-rap-beat/resolve/main/results/stable-audio-1/1784661836.wav"></audio> | <audio controls src="https://huggingface.co/gab-gdp/sao-finetuned-trap-rap-beat/resolve/main/results/dreamt_14/1784661836.wav"></audio> |
80
+
81
+ | BPM | Spectral Centroid | Spectral Flatness | Harmonic/Percussive Ratio | Transient Sharpness | CLAP Prompt Score |
82
+ |:--|:--|:--|:--|:--|:--|
83
+ | **148.02** | **4287.26** | **0.00179** | **2.963** | **0.000195** | **0.552** |
84
+
85
+ ---
86
+
87
+ ### Prompt 5
88
+ *Smooth and seductive at 115 BPM trap beat with electric guitar riffs, plucked bass, vocal adlibs, and warm synth pads. Relaxed, romantic, and sexy mood.*
89
+
90
+ | Stable Audio Open 1.0 | StableBeaT |
91
+ |:--|:--|
92
+ | <audio controls src="https://huggingface.co/gab-gdp/sao-finetuned-trap-rap-beat/resolve/main/results/stable-audio-1/3278661061.wav"></audio> | <audio controls src="https://huggingface.co/gab-gdp/sao-finetuned-trap-rap-beat/resolve/main/results/dreamt_14/3278661061.wav"></audio> |
93
+
94
+ | BPM | Spectral Centroid | Spectral Flatness | Harmonic/Percussive Ratio | Transient Sharpness | CLAP Prompt Score |
95
+ |:--|:--|:--|:--|:--|:--|
96
+ | **82.72** | **1056.42** | **0.000046** | **0.645** | **0.000089** | **0.478** |
97
+
98
+ ---
99
+
100
+ ### Prompt 6
101
+ *A moody cloud trap beat, boomy bass, synth bells and melodic piano, evoking etherate mood at 100 BPM.*
102
+
103
+ | Stable Audio Open 1.0 | StableBeaT |
104
+ |:--|:--|
105
+ | <audio controls src="https://huggingface.co/gab-gdp/sao-finetuned-trap-rap-beat/resolve/main/results/stable-audio-1/3576830411.wav"></audio> | <audio controls src="https://huggingface.co/gab-gdp/sao-finetuned-trap-rap-beat/resolve/main/results/dreamt_14/3576830411.wav"></audio> |
106
+
107
+ | BPM | Spectral Centroid | Spectral Flatness | Harmonic/Percussive Ratio | Transient Sharpness | CLAP Prompt Score |
108
+ |:--|:--|:--|:--|:--|:--|
109
+ | **144.2** | **2458.5** | **0.000356** | **0.738** | **0.00206** | **0.363** |
110
+
111
+ ---
112
+
113
+ ### Prompt 7
114
+ *A smooth neo-soul R&B instrumental at 90 BPM in D major, featuring live bass, soft Rhodes keys, and warm analog drum grooves.*
115
+
116
+ | Stable Audio Open 1.0 | StableBeaT |
117
+ |:--|:--|
118
+ | <audio controls src="https://huggingface.co/gab-gdp/sao-finetuned-trap-rap-beat/resolve/main/results/stable-audio-1/1121349264.wav"></audio> | <audio controls src="https://huggingface.co/gab-gdp/sao-finetuned-trap-rap-beat/resolve/main/results/dreamt_14/1121349264.wav"></audio> |
119
+
120
+ | BPM | Spectral Centroid | Spectral Flatness | Harmonic/Percussive Ratio | Transient Sharpness | CLAP Prompt Score |
121
+ |:--|:--|:--|:--|:--|:--|
122
+ | **130.81** | **1000.87** | **0.000166** | **0.679** | **0.000007288** | **0.250** |
123
+
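This page does not state how the table metrics were computed. As a reference point only, the textbook definitions of spectral centroid and spectral flatness can be sketched as follows. The function names are hypothetical, and the reported values may come from a library that frames and averages the signal differently:

```python
import math

def spectral_centroid(freqs_hz, magnitudes):
    """Magnitude-weighted mean frequency of one spectrum frame."""
    total = sum(magnitudes)
    if total == 0:
        return 0.0
    return sum(f * m for f, m in zip(freqs_hz, magnitudes)) / total

def spectral_flatness(power):
    """Geometric mean over arithmetic mean of a power spectrum.

    Values near 1 indicate a noise-like frame, values near 0 a tonal one.
    """
    eps = 1e-12  # avoids log(0) on empty bins
    log_mean = sum(math.log(p + eps) for p in power) / len(power)
    arithmetic_mean = sum(power) / len(power)
    return math.exp(log_mean) / (arithmetic_mean + eps)

freqs = [100.0, 200.0, 300.0, 400.0]
flat_frame = [1.0, 1.0, 1.0, 1.0]   # equal energy in every bin
tonal_frame = [0.0, 0.0, 1.0, 0.0]  # all energy in one bin

centroid_flat = spectral_centroid(freqs, flat_frame)
centroid_tonal = spectral_centroid(freqs, tonal_frame)
flatness_flat = spectral_flatness(flat_frame)
flatness_tonal = spectral_flatness(tonal_frame)
```

Under these definitions, the very small flatness values in the tables would point to strongly tonal material rather than noise-like content.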
124
+
125
+ ---
126
+
127
+
128
+ # Dataset
129
+
130
+ I used 20,000 trap/rap beats spanning various subgenres such as cloud, trap, R&B, EDM, industrial hip-hop, and jazzy chillhop. For each instrumental, I extracted two segments of 20 to 35 seconds while keeping track of their starting timestamps, which gave a dataset of 40k audio segments totaling about 277 hours. This allowed the model not only to learn the content of the beats but also to capture the temporal structure inherent to the musical phrases.
131
+
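As a quick sanity check on these figures, the segment count and the implied average segment length can be worked out directly. This is only arithmetic on the numbers quoted in the text, with 277 hours taken as the approximate total:

```python
beats = 20_000
segments_per_beat = 2
total_hours = 277  # approximate figure from the text

segments = beats * segments_per_beat
mean_segment_seconds = total_hours * 3600 / segments

print(f"{segments} segments, about {mean_segment_seconds:.1f} s each on average")
```

The average of roughly 25 seconds falls inside the stated 20 to 35 second range, so the totals are mutually consistent.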
132
+ A key goal of this project was to enable the model to learn new instruments (synth bells, deep sub, plucked bass, snare, ...), tempos, and rhythmic patterns that are strongly associated with trap and its subgenres. To achieve this, I tagged each segment by computing its similarity with curated lists of instruments, moods, and genres using a LAION CLAP model.
133
+
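The similarity-based tagging described above can be sketched with plain cosine similarity. This is a minimal illustration, not the author's pipeline: the toy 3-dimensional vectors stand in for real CLAP audio and text embeddings, and `tag_segment`, `top_k`, and `min_sim` are hypothetical names and thresholds:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def tag_segment(audio_emb, tag_embs, top_k=3, min_sim=0.2):
    """Keep the top_k tags whose embedding is most similar to the audio embedding."""
    scored = sorted(
        ((cosine(audio_emb, emb), tag) for tag, emb in tag_embs.items()),
        reverse=True,
    )
    return [tag for sim, tag in scored[:top_k] if sim >= min_sim]

# Toy vectors standing in for CLAP text embeddings of a curated instrument list.
instrument_embs = {
    "synth bells": [0.9, 0.1, 0.0],
    "plucked bass": [0.1, 0.9, 0.1],
    "snare": [0.0, 0.1, 0.9],
}
audio_emb = [0.8, 0.5, 0.05]  # toy audio embedding for one segment

tags = tag_segment(audio_emb, instrument_embs, top_k=2)
```

Running the same function once per curated list (instruments, moods, genres) would give the three tag groups seen in the metadata.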
134
+ Additionally, I used the Essentia library to extract the BPM (deeptemp-k16-3) and key/scale of each audio segment, considering only predictions with confidence above 70%.
135
+
136
+ ```json
137
+ {
138
+   "39118.wav": {
139
+     "instruments_tags": [
140
+       "plucked guitar",
141
+       "synth bells",
142
+       "movie sample"
143
+     ],
144
+     "genres_tags": [
145
+       "rap with soul"
146
+     ],
147
+     "moods_tags": [
148
+       "trap melancholic",
149
+       "love"
150
+     ],
151
+     "key": "G",
152
+     "scale": "minor",
153
+     "tempo": 109.0,
154
+     "start": 63,
155
+     "duration": 26
156
+   }
157
+ }
158
+ ```
159
+
160
+ I also generated synonyms for the tags to widen the vocabulary the model sees. This combination of features (instrumentation, tempo, key, mood, and genre) provides a rich set of musical metadata.
161
+
162
+ <p align="center">
163
+ <img src="./assets/cluster.png" alt="Tag clusters" width="500"/>
164
+ </p>
165
+ We can observe how T5-Base encodes all of my tags, resulting in five groups, including:
166
+
167
+ - Emotion (e.g., cheerful, joyful, dreamy)
168
+
169
+ - Groove (e.g., swing groove, nylon guitar, movie sample)
170
+
171
+ - Genre (e.g., g-funk, chill rap beat, jazzy chillhop)
172
+
173
+ - Sonority (e.g., trap vocal, trap guitar)
174
+
175
+ The clusters are very close to each other (Silhouette Score: 0.095), which is expected given that the model is fine-tuned on a specific musical subgenre. This proximity reflects the semantic density of the dataset: many tags are naturally related and share subtle differences.
176
+
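For readers unfamiliar with the silhouette score quoted above, here is a minimal pure-Python sketch of the standard definition with Euclidean distance. It is not the author's code, and the toy 2-D points are illustrative stand-ins for T5 tag embeddings:

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of the same cluster
        same = [math.dist(p, q) for j, (q, l) in enumerate(zip(points, labels))
                if l == label and j != i]
        if not same:  # singleton cluster: silhouette is defined as 0
            scores.append(0.0)
            continue
        a = sum(same) / len(same)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in set(labels) if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)

separated = silhouette_score([(0, 0), (0, 1), (10, 0), (10, 1)], [0, 0, 1, 1])
overlapping = silhouette_score([(0, 0), (1, 0), (0.5, 0), (1.5, 0)], [0, 0, 1, 1])
```

Well-separated clusters score close to 1, while interleaved clusters score near or below 0, which is the regime a value of 0.095 sits close to.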
177
+ Using this metadata, I was able to generate more human-readable prompts for the model via Llama 3.1 3B running locally, allowing the fine-tuned model to produce beats that better reflect the stylistic and structural characteristics of trap music.
178
+
179
+ ```json
180
+ {"filepath": "39118.wav", "start": 63, "duration": 26, "prompt": "A melancholic and love-inspired rap with soul beat at 109 BPM in G minor, using plucked guitar, synth bells, and movie sample."}
181
+ ```
182
+
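To make the mapping from a metadata entry to a prompt record concrete, here is a minimal sketch that swaps the LLM for a fixed string template. This is a deliberately simpler stand-in, not the author's method, and `join_words` and `build_prompt_record` are hypothetical helper names:

```python
import json

def join_words(words):
    """Join words as 'a', 'a and b', or 'a, b, and c'."""
    if len(words) <= 2:
        return " and ".join(words)
    return ", ".join(words[:-1]) + ", and " + words[-1]

def build_prompt_record(filename, meta):
    """Turn one metadata entry into a record with filepath, start, duration, prompt."""
    prompt = (
        f"A {join_words(meta['moods_tags'])} {join_words(meta['genres_tags'])} beat "
        f"at {meta['tempo']:.0f} BPM in {meta['key']} {meta['scale']}, "
        f"using {join_words(meta['instruments_tags'])}."
    )
    return {
        "filepath": filename,
        "start": meta["start"],
        "duration": meta["duration"],
        "prompt": prompt,
    }

meta = {
    "instruments_tags": ["plucked guitar", "synth bells", "movie sample"],
    "genres_tags": ["rap with soul"],
    "moods_tags": ["trap melancholic", "love"],
    "key": "G", "scale": "minor", "tempo": 109.0, "start": 63, "duration": 26,
}
record = build_prompt_record("39118.wav", meta)
line = json.dumps(record)  # one line of a JSONL file
```

A template like this yields stiffer wording than a language model would, which is presumably why an LLM was used to rephrase the tags into more natural prompts.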
183
+ # Training
184
+
185
+ The model was trained on an NVIDIA A100 GPU on Google Colab for about 42 hours, on a total of 40k audio segments (~277 h) over 14 epochs. I set a batch size of 16, resulting in approximately 2.5k steps per epoch, or 35k steps in total.
186
+ <br>
187
+ Inference takes ~0.37 s per step on an NVIDIA RTX 4050 Laptop GPU, so about 1 min 15 s for a good generation (around 200 steps).
188
+
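The step counts and timings in this section follow from simple arithmetic. The sketch below reproduces them, assuming the 200 sampling steps used for the Results section is what a full generation means here:

```python
segments = 40_000
batch_size = 16
epochs = 14
training_hours = 42

steps_per_epoch = segments // batch_size
total_steps = steps_per_epoch * epochs
seconds_per_training_step = training_hours * 3600 / total_steps

sampling_steps = 200             # assumption: same setting as the Results section
seconds_per_sampling_step = 0.37
generation_seconds = sampling_steps * seconds_per_sampling_step

print(steps_per_epoch, total_steps)
print(f"{seconds_per_training_step:.2f} s per training step")
print(f"{generation_seconds:.0f} s per generation")
```

The result of about 74 seconds per generation is consistent with the quoted 1 min 15 s.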
189
+
190
+
191
+ # Results Analysis
192
+
193
+ The model performs particularly well on melodic beats with a smooth and floating atmosphere.
194
+ It captures harmonic structures effectively and keeps a strong sense of coherence between instruments, mood, and tempo, which makes the generated beats sound natural, balanced, and musically pleasing.
195
+ The model generates interesting beats that reflect the given prompt fairly well.
196
+
197
+ However, the model tends to underperform on styles that were underrepresented in the training dataset, such as boom bap or high-energy beats with dense percussive layers.
198
+
199
+ <p align="center">
200
+ <img src="./assets/FreqMoods.png" alt="Mood frequencies" width="600"/>
201
+ </p>
202
+
203
+ This limitation mainly stems from the uneven tag distribution within the dataset: certain instruments and genres are simply less represented.
204
+ In addition, the tagging tool (CLAP), trained on general-purpose music datasets like LAION-Audio-630K, is not specialized for specific genres such as trap or hip-hop, leading to imprecise tagging of elements like snares, hi-hats, or 808 bass.
205
+ As a result, these styles are harder for the model to reproduce accurately.
206
+ I also noticed that the generated melodic elements, like piano or synths, often sound much quieter than the drums, since their frequencies are more subtle.
207
+
208
+ # Perspectives
209
+
210
+ I'd like to fine-tune for another 2-3 epochs on a smaller dataset that better represents the underrepresented styles.
211
+ It'd be interesting to start over with a CLAP model specialized in trap/rap genres.
212
+ I'm also interested in noise input conditioning such as [**SpecGrad**](https://arxiv.org/pdf/2203.16749).
213
+
214
+ I’m open to any feedback or suggestions on my work.
215
+
216
+ ## Sources
217
+ - [**Stable Audio Open 1.0**](https://huggingface.co/stabilityai/stable-audio-open-1.0) - Model used.
218
+ - [**LoRAW**](https://github.com/NeuralNotW0rk/LoRAW) - Pipeline implementation for Stable Audio Open LoRA fine-tuning.
219
+ - [**Stable Audio Tools**](https://github.com/Stability-AI/stable-audio-tools) - Official Stability AI framework for using Stable Audio Open.
220
+ - [**Essentia**](https://essentia.upf.edu/models.html) - Library for music feature extraction.
221
+
222
+ ## Contact - Gabriel Guiet-Dupré
223
+ - [**LinkedIn**](https://www.linkedin.com/in/gabriel-guiet-dupre/)
224
+ - [**GitHub**](https://github.com/Gab404)
assets/FreqInstruments.png ADDED

Git LFS Details

  • SHA256: afefa2a03ef9ca3b8689f433c521ff1efcd57833c0c2399227d2ad92b9458e6f
  • Pointer size: 131 Bytes
  • Size of remote file: 154 kB
assets/FreqMoods.png ADDED

Git LFS Details

  • SHA256: 0b466bd1097d418d46a261a93a0593d570f7240cfed10a467a479a9f9f6aa900
  • Pointer size: 131 Bytes
  • Size of remote file: 132 kB
assets/FreqTempo.png ADDED
assets/cluster.png ADDED

Git LFS Details

  • SHA256: 65f14da64b6d1ef6072c4766dbf8d9018cd0a4af10c4cf5d5b950a4a5246aafc
  • Pointer size: 131 Bytes
  • Size of remote file: 217 kB
assets/preview.gif ADDED

Git LFS Details

  • SHA256: f09f6374703cd3f6b23663e6bfad4e532cb338e7cb31eb91a23361146be0ebd6
  • Pointer size: 131 Bytes
  • Size of remote file: 786 kB
dataset/prompt.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
dataset/tags.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ecbfc12f1a6919f8d7d9b3129c1ab4b5f4880f9e1bdea7c6d60edd96a6a9f857
3
+ size 17381017
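The three lines above form a Git LFS pointer: the repository stores this small text file, while the real content lives in LFS storage and is identified by its SHA-256 digest and byte size. A minimal parsing sketch, where `parse_lfs_pointer` is a hypothetical helper rather than part of any Git LFS tooling:

```python
def parse_lfs_pointer(text):
    """Parse the key/value lines of a Git LFS pointer file into a dictionary."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    hash_algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }

pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:ecbfc12f1a6919f8d7d9b3129c1ab4b5f4880f9e1bdea7c6d60edd96a6a9f857
size 17381017
"""

info = parse_lfs_pointer(pointer)
size_mb = info["size"] / 1e6
print(f"{info['hash_algo']} digest, {size_mb:.1f} MB")
```

For `dataset/tags.json` this works out to roughly 17.4 MB of tag metadata behind the pointer.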
model.ckpt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:af43348b46199771e4dd7057dae6b743e27f924ab06c268b1de896ca3e912325
3
+ size 4854110291
model_config.json ADDED
@@ -0,0 +1,125 @@
1
+ {
2
+     "model_type": "diffusion_cond",
3
+     "sample_size": 2097152,
4
+     "sample_rate": 44100,
5
+     "audio_channels": 2,
6
+     "model": {
7
+         "pretransform": {
8
+             "type": "autoencoder",
9
+             "iterate_batch": true,
10
+             "config": {
11
+                 "encoder": {
12
+                     "type": "oobleck",
13
+                     "requires_grad": false,
14
+                     "config": {
15
+                         "in_channels": 2,
16
+                         "channels": 128,
17
+                         "c_mults": [1, 2, 4, 8, 16],
18
+                         "strides": [2, 4, 4, 8, 8],
19
+                         "latent_dim": 128,
20
+                         "use_snake": true
21
+                     }
22
+                 },
23
+                 "decoder": {
24
+                     "type": "oobleck",
25
+                     "config": {
26
+                         "out_channels": 2,
27
+                         "channels": 128,
28
+                         "c_mults": [1, 2, 4, 8, 16],
29
+                         "strides": [2, 4, 4, 8, 8],
30
+                         "latent_dim": 64,
31
+                         "use_snake": true,
32
+                         "final_tanh": false
33
+                     }
34
+                 },
35
+                 "bottleneck": {
36
+                     "type": "vae"
37
+                 },
38
+                 "latent_dim": 64,
39
+                 "downsampling_ratio": 2048,
40
+                 "io_channels": 2
41
+             }
42
+         },
43
+         "conditioning": {
44
+             "configs": [
45
+                 {
46
+                     "id": "prompt",
47
+                     "type": "t5",
48
+                     "config": {
49
+                         "t5_model_name": "t5-base",
50
+                         "max_length": 128
51
+                     }
52
+                 },
53
+                 {
54
+                     "id": "seconds_start",
55
+                     "type": "number",
56
+                     "config": {
57
+                         "min_val": 0,
58
+                         "max_val": 512
59
+                     }
60
+                 },
61
+                 {
62
+                     "id": "seconds_total",
63
+                     "type": "number",
64
+                     "config": {
65
+                         "min_val": 0,
66
+                         "max_val": 512
67
+                     }
68
+                 }
69
+             ],
70
+             "cond_dim": 768
71
+         },
72
+         "diffusion": {
73
+             "cross_attention_cond_ids": ["prompt", "seconds_start", "seconds_total"],
74
+             "global_cond_ids": ["seconds_start", "seconds_total"],
75
+             "type": "dit",
76
+             "config": {
77
+                 "io_channels": 64,
78
+                 "embed_dim": 1536,
79
+                 "depth": 24,
80
+                 "num_heads": 24,
81
+                 "cond_token_dim": 768,
82
+                 "global_cond_dim": 1536,
83
+                 "project_cond_tokens": false,
84
+                 "transformer_type": "continuous_transformer"
85
+             }
86
+         },
87
+         "io_channels": 64
88
+     },
89
+     "training": {
90
+         "use_ema": true,
91
+         "log_loss_info": false,
92
+         "optimizer_configs": {
93
+             "diffusion": {
94
+                 "optimizer": {
95
+                     "type": "AdamW",
96
+                     "config": {
97
+                         "lr": 5e-5,
98
+                         "betas": [0.9, 0.999],
99
+                         "weight_decay": 1e-3
100
+                     }
101
+                 },
102
+                 "scheduler": {
103
+                     "type": "InverseLR",
104
+                     "config": {
105
+                         "inv_gamma": 1000000,
106
+                         "power": 0.5,
107
+                         "warmup": 0.99
108
+                     }
109
+                 }
110
+             }
111
+         },
112
+         "demo": {
113
+             "demo_every": 2000,
114
+             "demo_steps": 250,
115
+             "num_demos": 4,
116
+             "demo_cond": [
117
+                 {"prompt": "Amen break 174 BPM", "seconds_start": 0, "seconds_total": 12},
118
+                 {"prompt": "A beautiful orchestral symphony, classical music", "seconds_start": 0, "seconds_total": 160},
119
+                 {"prompt": "Chill hip-hop beat, chillhop", "seconds_start": 0, "seconds_total": 190},
120
+                 {"prompt": "A pop song about love and loss", "seconds_start": 0, "seconds_total": 180}
121
+             ],
122
+             "demo_cfg_scales": [3, 6, 9]
123
+         }
124
+     }
125
+ }
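Several numbers in this config are tied together, and a few lines of arithmetic make the relationships visible. The sketch below only recomputes values from the config; the reading of the 128-versus-64 latent dimensions as a VAE mean/scale split is an assumption based on how VAE bottlenecks commonly work, not something the config states:

```python
import math

sample_size = 2_097_152        # audio frames per training window
sample_rate = 44_100
strides = [2, 4, 4, 8, 8]      # encoder/decoder strides
encoder_latent_dim = 128
bottleneck_latent_dim = 64
embed_dim, num_heads = 1536, 24

# The product of the strides gives the time-axis downsampling ratio.
downsampling_ratio = math.prod(strides)

# Number of latent frames the diffusion transformer sees per window.
latent_frames = sample_size // downsampling_ratio

# Length of the audio window in seconds.
window_seconds = sample_size / sample_rate

# Assumption: the encoder emits mean and scale channels, halved by the VAE bottleneck.
channels_per_statistic = encoder_latent_dim // 2

# Per-head width of the transformer attention.
head_dim = embed_dim // num_heads

print(downsampling_ratio, latent_frames, round(window_seconds, 2), head_dim)
```

The window of about 47.55 seconds matches the 47-second generations in the README, and the computed ratio agrees with the `downsampling_ratio` field.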
results/dreamt_14/1121349264.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b0d4f3af61a60d4c68b44d37b4e5256ada941592934136410ab9082a090f7d87
3
+ size 8290844
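The pointer size of 8290844 bytes is consistent with the 47-second duration quoted in the README. The check below assumes 16-bit stereo PCM at 44.1 kHz with a canonical 44-byte WAV header; these encoding details are assumptions and are not stated anywhere in the commit:

```python
sample_rate = 44_100
channels = 2
bytes_per_sample = 2   # assumption: 16-bit PCM
header_bytes = 44      # assumption: canonical PCM WAV header
duration_seconds = 47

expected_size = (
    duration_seconds * sample_rate * channels * bytes_per_sample + header_bytes
)
print(expected_size)
```

Files with other sizes in this commit would then differ in duration or sample format.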
results/dreamt_14/1580039167.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f48aa832af522b87cc6a890463cef95682bc919c8efe0871613d57c373875a94
3
+ size 8290844
results/dreamt_14/1784661836.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:eb6e61ccb8728ec0ffbfded5dfe34e170a3ede80548c1b2f2835363b6813f1e1
3
+ size 5515162
results/dreamt_14/2306776750.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ed0c4354ff5926ee8d1cdfbca59c8462bdba79adc74651934dfcc458fdc559b5
3
+ size 8290844
results/dreamt_14/2505643137.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2ee910420c35b4635c0f958660a654459d4fef199f32e3e10be39f103418fb1c
3
+ size 8290844
results/dreamt_14/3278661061.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:290661b14156ab14702d92bb09b0a4ec203a46d0c3d5ee70e9ea63778665eca1
3
+ size 8290844
results/dreamt_14/3576830411.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:17f248e39fcef2502920fcd026ddd9c52017f914419ccc2c07d6bb9eef35d977
3
+ size 16777304
results/stable-audio-1/1121349264.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ea7848df25a036dee12672fba4ffa03360346ebd5b3edd2b46f578b9fcca5490
3
+ size 8290844
results/stable-audio-1/1580039167.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:22a885cf12dacc57923157c4a01bc0ed41a611d1bad3fe97aa3b0758820c974d
3
+ size 8290844
results/stable-audio-1/1784661836.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7e04d52c91ae9607de49ad83ebd00824be021cc3145677e8a0ddb5616f54c832
3
+ size 8290844
results/stable-audio-1/2306776750.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:882684b71cd66ec1bb027dd09ab4affa5e11b6c6fdb56f89c42758f23b219d8d
3
+ size 8290844
results/stable-audio-1/2505643137.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b630f8e15734f0d7703ba35cb8217c96ee7529b492a93448c52f9cd1bf21c886
3
+ size 8290844
results/stable-audio-1/3278661061.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9833083d3e4ede77eb615a224d4b1c078172281bea5fbc89df4d8d588c81f6e1
3
+ size 8290844
results/stable-audio-1/3576830411.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6452b921834c62190475bac5dd2e6bf901435f276465b8eb5fffccbf3787729a
3
+ size 8290844