techfreakworm committed · Commit 9071450 · unverified · 1 Parent(s): c86ad91

docs: track spec + mockups + model research


The docs/ and research/ directories existed on disk from before
git init (they hold the brainstorming + research output) but were
never explicitly added by the early A-series commits. Tracking them
now so the spec, UI mockups, and base-model research are part of
the repo history alongside the implementation plan.

Contents:

- docs/superpowers/specs/2026-05-18-ace-music-studio-design.md
- docs/superpowers/specs/mockups/01_generate_mobile_errors.html
- docs/superpowers/specs/mockups/02_cover_extend.html
- docs/superpowers/specs/mockups/03_edit_lyrics.html
- docs/superpowers/specs/mockups/README.md
- research/00_executive_summary.md
- research/01_yue.md
- research/02_diffrhythm.md
- research/03_acestep.md
- research/04_newcomers_and_survey.md
- research/05_apple_silicon_mps_audit.md
- research/06_comparison_matrix.md
- research/07_platform_architecture.md

docs/superpowers/specs/2026-05-18-ace-music-studio-design.md ADDED
@@ -0,0 +1,550 @@
1
+ # ACE Music Studio: Design Spec
2
+
3
+ **Date:** 2026-05-18
4
+ **Status:** Approved, ready for implementation plan
5
+ **Repo:** `~/Projects/llm/music-generator/` → GitHub `techfreakworm/ace-music-studio` (to be created)
6
+ **HF Space:** `huggingface.co/spaces/techfreakworm/ace-music-studio` (to be created)
7
+ **Companion docs:** `research/00_executive_summary.md` (model selection rationale)
8
+
9
+ ---
10
+
11
+ ## 1. Goal
12
+
13
+ A single-process Gradio app that wraps **ACE-Step 1.5 XL SFT** for full-song generation with vocals, deployable both to a free non-profit **Hugging Face ZeroGPU Space** and locally on **Apple M5 Max (MPS / MLX)** or **NVIDIA (CUDA)** workstations. It supports the full ACE-Step feature surface: text-to-song, audio-reference cover, song extension, segment-level edit/repaint, plus an in-app lyrics writer powered by a bundled small LM. Users can stack any number of LoRAs from a curated preset library or upload custom `.safetensors` files at runtime.
14
+
15
+ Non-goals (v1): commercial-tier SaaS, multi-user accounts, persistent storage across sessions, social features, payment integration.
16
+
17
+ ---
18
+
19
+ ## 2. Locked product decisions
20
+
21
+ | Decision | Value | Source |
22
+ |---|---|---|
23
+ | Product name | **ACE Music Studio** (slug `ace-music-studio`) | brainstorming Q1 |
24
+ | Base model | ACE-Step 1.5 XL SFT (4 B DiT + 4 B Qwen3 planner) | research bundle `03_acestep.md` |
25
+ | Backend pattern | Direct ACE-Step Python API, single Gradio process | brainstorming Q architecture |
26
+ | UI layout | Sidebar nav + form + output (3 columns on desktop) | brainstorming Q layout = B |
27
+ | Theme | Brutalist Mono (pure black/white, no accent) | brainstorming Q palette = E |
28
+ | Tab set | Generate · Cover · Extend · Edit · Lyrics | brainstorming Q scope = all |
29
+ | LoRA capability | Multi-stack via PEFT + bundled presets + custom upload | brainstorming Q scope |
30
+ | Lyrics LM | Qwen 2.5 7B Instruct (Apache-2.0, ~14 GB bf16) | brainstorming Q lyrics LLM |
31
+ | Hosting | Free ZeroGPU (community grant if needed) | brainstorming Q hosting |
32
+ | License | MIT, public GitHub | brainstorming Q license |
33
+ | Mobile | Horizontal scroll tabs at top, ≤ 640 px | brainstorming Q responsive = A |
34
+ | Authorship rule | Mayank Gupta sole author on every commit | user prior memory `feedback_git_authorship.md` |
35
+
36
+ ---
37
+
38
+ ## 3. Architecture
39
+
40
+ ### 3.1 Top-level shape
41
+
42
+ ```
+               ┌──────────────────────────────────────────┐
+    browser ─▶ │  app.py: Gradio Blocks                   │
+               │  header · sidebar · 5 tabs · CTA footer  │
+               └─────────────────┬────────────────────────┘
+                                 │
+                                 ▼
+               ┌──────────────────────────────────────────┐
+               │  backend.py: ACEStepStudioBackend        │
+               │  @spaces.GPU(duration=callable)          │
+               │  lazy singletons; one mode-dispatch fn   │
+               └─────────────────┬────────────────────────┘
+                                 │
+    ┌──────────────┬─────────────┴────────┬─────────────────┐
+    ▼              ▼                      ▼                 ▼
+ ace_pipeline.py  lora_stack.py      lyrics_lm.py    post_process.py
+ ACEStepPipeline  preset registry    Qwen 2.5 7B     Demucs stems
+ device/cache     PEFT adapters      MLX or PyTorch  pyloudnorm
+                  sniff + validate   lazy load
61
+ ```
62
+
63
+ ### 3.2 Backend singleton: `ACEStepStudioBackend`
64
+
65
+ One per-process instance, constructed lazily on first request. Owns three independently-lazy sub-singletons:
66
+
67
+ | Sub-singleton | Loads when | Holds |
68
+ |---|---|---|
69
+ | `ACEStepPipeline` instance | first generation request | DiT, Qwen3 planner, audio codec, VAE |
70
+ | `LyricsLM` instance | first lyrics-tab request | Qwen 2.5 7B weights, tokenizer |
71
+ | `Demucs` instance | first stem-separation request | `htdemucs_ft` weights |
72
+
73
+ Boot cost: only `_bootstrap()` (cache mirror + symlinks), ~1–5 s. First gen request: +30–60 s warm-up. First lyrics request: +20–40 s. First stem request: +10 s. All amortised across the session.
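
The three independently lazy sub-singletons could be sketched roughly as below. This is an illustration under stated assumptions, not the project's code: `LazySlot` and `StudioBackendSketch` are hypothetical names, and the string-returning loaders stand in for the real (expensive) constructors of the ACE-Step pipeline, the lyrics LM, and Demucs.

```python
import threading
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazySlot(Generic[T]):
    """Holds one expensive object and builds it on first access (hypothetical helper)."""

    def __init__(self, loader: Callable[[], T]) -> None:
        self._loader = loader
        self._value: Optional[T] = None
        self._loaded = False
        self._lock = threading.Lock()

    def get(self) -> T:
        # Double-checked locking so concurrent first requests load only once.
        if not self._loaded:
            with self._lock:
                if not self._loaded:
                    self._value = self._loader()
                    self._loaded = True
        return self._value  # type: ignore[return-value]

    @property
    def loaded(self) -> bool:
        return self._loaded


class StudioBackendSketch:
    """Stand-in for ACEStepStudioBackend; every loader here is a placeholder."""

    def __init__(self) -> None:
        self.pipeline = LazySlot(lambda: "pipeline-weights")
        self.lyrics_lm = LazySlot(lambda: "lyrics-weights")
        self.demucs = LazySlot(lambda: "demucs-weights")


backend = StudioBackendSketch()
backend.pipeline.get()  # a generation request touches only the pipeline slot
```

With this shape, a user who never opens the Lyrics tab never pays the lyrics-model load cost, which matches the "loads when" column in the table above.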
74
+
75
+ ### 3.3 Device autodetect (`ace_pipeline.py`)
76
+
77
+ Priority: **CUDA → MPS → CPU**.
78
+
79
+ Apple Silicon path:
80
+
81
+ - Set `PYTORCH_ENABLE_MPS_FALLBACK=1` before any torch import (in `app.py` module preamble, before backend imports torch).
82
+ - Use the **Apple-Silicon fork of ACE-Step** (`clockworksquirrel/ace-step-apple-silicon`) on Mac, pinned via the `requirements-mac.txt` extra. Hybrid MLX (LM planner) + PyTorch MPS (DiT decoder).
83
+ - Skip the CUDA-only `torch.cuda.mem_get_info` gate: `vram_limit_for("mps")` returns `None` so ACE-Step's free-VRAM check short-circuits.
84
+ - bf16 throughout; `--bf16 false` only if a specific kernel falls back.
85
+
86
+ CUDA path:
87
+
88
+ - Vanilla `ace-step` from git (or PyPI when published).
89
+ - bf16; allow flash-attn if installed.
90
+ - `vram_limit_for("cuda")` returns the safe cap from `torch.cuda.mem_get_info`.
91
+
92
+ CPU path (warning only, not blocked):
93
+
94
+ - Single warning banner on app load if no GPU detected: "CPU inference: expect ~10× slower."
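
The CUDA → MPS → CPU priority might be expressed as in the sketch below. The function names `pick_device`, `detect_device`, and `cpu_warning` are hypothetical; only `torch.cuda.is_available()` and `torch.backends.mps.is_available()` are real PyTorch calls, and torch is imported lazily so the sketch also runs where it is not installed.

```python
from typing import Optional


def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return the preferred device name following CUDA -> MPS -> CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Probe torch if it is installed; fall back to CPU otherwise."""
    try:
        import torch  # imported lazily so this sketch runs without torch
    except ImportError:
        return "cpu"
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps_backend is not None and mps_backend.is_available())
    return pick_device(torch.cuda.is_available(), mps_ok)


def cpu_warning(device: str) -> Optional[str]:
    """Banner text for the CPU path, or None when a GPU backend was found."""
    if device == "cpu":
        return "CPU inference: expect ~10× slower."
    return None
```

Keeping the priority logic in a pure function (`pick_device`) makes it testable without any GPU, which fits the L1 test layer described later in the spec.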
95
+
96
+ ### 3.4 HF Spaces bootstrap (`app.py:_bootstrap()`)
97
+
98
+ Direct port of z-image-studio's pattern, with model paths swapped:
99
+
100
+ 1. If `on_spaces()`, mirror the read-only `HF_HOME` (build cache) to `~/hf-cache-rw/`.
101
+ 2. Repoint `HF_HOME` and `HF_HUB_CACHE` env vars at the writable copy.
102
+ 3. Set `ACESTEP_MODEL_BASE_PATH` (or whatever the fork's env var is) to a project-local `./models/`.
103
+ 4. Symlink each cached HF snapshot into `./models/<repo>/` so the pipeline's loader finds them locally.
104
+
105
+ This avoids re-downloads on every cold container start and works around HF's read-only build cache layer.
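
A rough sketch of steps 1, 2, and 4, using only the standard library. The helper names are hypothetical, and the sketch assumes the usual Hugging Face hub cache layout (`hub/models--<org>--<name>/snapshots/<revision>/`); the directory naming under `./models/` is likewise an assumption and may need adjusting to whatever the pipeline loader expects.

```python
import os
import shutil
from pathlib import Path


def mirror_cache(read_only_home: Path, writable_home: Path) -> Path:
    """Copy the build-time cache into a writable directory (step 1)."""
    shutil.copytree(read_only_home, writable_home, symlinks=True, dirs_exist_ok=True)
    return writable_home


def repoint_env(writable_home: Path, env: dict) -> None:
    """Point the Hugging Face cache variables at the writable copy (step 2)."""
    env["HF_HOME"] = str(writable_home)
    env["HF_HUB_CACHE"] = str(writable_home / "hub")


def link_snapshots(hub_cache: Path, models_dir: Path) -> list:
    """Symlink the newest snapshot of each cached repo into models_dir (step 4)."""
    models_dir.mkdir(parents=True, exist_ok=True)
    linked = []
    for repo_dir in sorted(hub_cache.glob("models--*")):
        snapshots = sorted(
            (repo_dir / "snapshots").glob("*"), key=lambda p: p.stat().st_mtime
        )
        if not snapshots:
            continue
        # "models--org--name" -> "name"; this naming is an assumption.
        link = models_dir / repo_dir.name.split("--")[-1]
        if not os.path.lexists(link):
            link.symlink_to(snapshots[-1].resolve(), target_is_directory=True)
        linked.append(link)
    return linked
```

In the app, `repoint_env` would be called with `os.environ` before any Hugging Face library is imported, since those libraries read the cache variables at import time.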
106
+
107
+ ### 3.5 ZeroGPU integration
108
+
109
+ - `@spaces.GPU(duration=…)` decorates `backend.generate(mode, params)` at module load time. The decorator is a no-op identity off Spaces.
+ - `duration` is a callable that estimates per-call timeout from `(mode, params)`, clamped to `[60, 180] s`:
+   - Generate / Cover at default settings → 60 s
+   - Long Generate (>120 s output) or Edit → 90–120 s
+   - Extend with large repaint window → 120–180 s
+   - Lyrics (separate decoration) → 30 s
+ - On `"GPU task aborted"` exception, auto-retry once at 2× duration. After second failure, return `gr.Warning` with timing diagnostics.
116
+ - `requirements.txt` **must not pin `spaces`** (HF injects its own version).
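
One way the `duration` callable could look is sketched below. The spec only gives ranges per mode, so the interpolation inside each range (for example the 0.5 s per requested second for Extend) is an assumption chosen for illustration, and `estimate_duration` is a hypothetical name.

```python
def estimate_duration(mode: str, params: dict) -> int:
    """Estimate a per-call GPU timeout in seconds, clamped to [60, 180]."""
    output_s = float(params.get("duration_s", 0))
    if mode == "extend":
        # Larger repaint windows get more time; the slope is a guess.
        estimate = 120 + 0.5 * float(params.get("extra_duration_s", 0))
    elif mode == "edit":
        estimate = 90
    elif mode == "generate" and output_s > 120:
        # Long Generate: scale from 90 s toward 120 s as output approaches 240 s.
        estimate = 90 + 30 * (output_s - 120) / 120
    else:
        estimate = 60  # Generate / Cover at default settings
    return int(min(180, max(60, estimate)))
```

The Lyrics timeout (30 s) is deliberately not handled here because the spec gives it a separate decoration outside the `[60, 180]` clamp.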
117
+
118
+ ---
119
+
120
+ ## 4. The five modes
121
+
122
+ All mode handlers live in `modes.py` as pure functions over `(backend, params) → (audio_path, meta_dict)`. They share the **LoRA stack** and **advanced opts** code paths via shared helpers.
123
+
124
+ ### 4.1 Generate (text → song)
125
+
126
+ **Inputs**: `prompt` (style), `lyrics`, `duration_s` (5–240), `instrumental` (bool), `lora_stack`, `advanced`.
127
+
128
+ **ACE-Step params**: `audio_cover_strength=0`, `repaint_mode=None`, `flow_edit_morph=False`, `cot_*` controlled by advanced "LM thinking" toggle.
129
+
130
+ **Output**: WAV (44.1 kHz stereo) + metadata JSON.
131
+
132
+ ### 4.2 Cover (audio reference → song in that style)
133
+
134
+ **Inputs**: `prompt` (new style hint, optional), `ref_audio` file (any of mp3/wav/flac, ≤ 60 s), `lyrics` (new lyrics), `duration_s`, `lora_stack`, `advanced`.
135
+
136
+ **ACE-Step params**: `audio_cover_strength≈0.93` (configurable in advanced), `cover_noise_strength=0`, `infer_method="ode"`.
137
+
138
+ **Output**: WAV.
139
+
140
+ ### 4.3 Extend (continue an existing song)
141
+
142
+ **Inputs**: `seed_audio` (≤ 240 s), `extra_prompt`, `extra_duration_s` (5–120), `lora_stack`, `advanced`.
143
+
144
+ **ACE-Step params**: `repaint_mode="balanced"`, `repaint_strength` configurable, `repainting_start` set to the seed-audio end timestamp, `repainting_end` set to seed-end + `extra_duration_s`. Exact param names + sentinels for "append-after-end" must be verified against the current ACE-Step Python API during M3 implementation (see §14, open question 7).
145
+
146
+ **Output**: WAV (seed + extension concatenated).
147
+
148
+ ### 4.4 Edit (repaint / flow morph a segment)
149
+
150
+ **Inputs**: `source_audio`, `source_lyrics`, `target_lyrics`, `segment_start_s`, `segment_end_s`, `mode` ∈ {`repaint`, `flow_edit`}, `lora_stack`, `advanced`.
151
+
152
+ **ACE-Step params**:
153
+
154
+ - repaint sub-mode: `repaint_mode="balanced"`, `repainting_start=segment_start_s`, `repainting_end=segment_end_s`, `repaint_strength=0.5`.
155
+ - flow_edit sub-mode: `flow_edit_morph=True`, `flow_edit_source_caption`, `flow_edit_source_lyrics`, `flow_edit_n_min=0.0`, `flow_edit_n_max=1.0`, `flow_edit_n_avg=1`.
156
+
157
+ **Output**: WAV.
158
+
159
+ ### 4.5 Lyrics (Qwen 2.5 → structured lyrics)
160
+
161
+ **Inputs**: `brief` (free-text prompt), `target_structure` (e.g., "intro, verse, chorus, verse, chorus, bridge, chorus, outro"), `language`, `tone` (optional).
162
+
163
+ **System prompt** (locked):
164
+
165
+ ```
166
+ You are a songwriter. Output ONLY structured lyrics for an AI music generator. Use these section tags exactly:
167
+ [intro] [verse 1] [verse 2] [chorus] [bridge] [outro] (etc.)
168
+
169
+ Each section is on its own line, followed by the lyrics for that section. Keep verses 4-8 lines, choruses 4 lines, bridges 2-4 lines. Match the requested tone and language. Do not include commentary, headers, or markdown.
170
+ ```
171
+
172
+ **Output**: plain text with structural tags. A "Use these in Generate" button populates the Generate tab's `lyrics` field.
173
+
174
+ ### 4.6 Retake button
175
+
176
+ Every mode's output panel has a "↻ retake" button. It re-runs the same mode handler with a new random seed, all other params unchanged.
177
+
178
+ ---
179
+
180
+ ## 5. LoRA stack (`lora_stack.py`)
181
+
182
+ ### 5.1 Preset registry
183
+
184
+ `presets/manifest.json`:
185
+
186
+ ```json
187
+ [
188
+ {"name":"RapMachine","hf_id":"ACE-Step/ACE-Step-v1-RapMachine-LoRA","kind":"genre"},
189
+ {"name":"Chinese Rap","hf_id":"ACE-Step/ACE-Step-v1-Chinese-Rap-LoRA","kind":"genre"},
190
+ {"name":"Lyric2Vocal","hf_id":"ACE-Step/ACE-Step-v1-Lyric2Vocal-LoRA","kind":"voice"},
191
+ {"name":"Text2Samples","hf_id":"ACE-Step/ACE-Step-v1-Text2Samples-LoRA","kind":"instrumental"}
192
+ ]
193
+ ```
194
+
195
+ Presets are downloaded from HF on first preset-click, cached, and registered as PEFT adapters with the preset name. The four preset chips appear in every song-mode tab.
196
+
197
+ ### 5.2 Custom upload
198
+
199
+ User drops a `.safetensors` file into the upload zone:
200
+
201
+ 1. `sniff(path)` reads the safetensors header (no full load, just metadata).
202
+ 2. Verifies key naming matches ACE-Step 1.5 XL DiT (`*.to_q.lora_A.weight`, etc.) and rank ≤ 256, alpha set, file ≤ 500 MB.
203
+ 3. On success, registers as a new PEFT adapter under `Path(path).stem` as adapter name; appears in the active stack.
204
+ 4. On failure, raises `LoRAValidationError` → `gr.Error` toast: "This LoRA isn't compatible with ACE-Step 1.5 XL SFT. Expected DiT modules: to_q, to_k, to_v, to_out.0, ff.net.0.proj, ff.net.2."
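
The header-only sniff in steps 1 and 2 can be done with the standard library, because a safetensors file starts with an 8-byte little-endian length followed by that many bytes of JSON describing each tensor. The sketch below is a minimal illustration: the function names are hypothetical, it checks only file size, the presence of `lora_A.weight` keys, and rank, and it assumes the PEFT convention that `lora_A` has shape `[rank, in_features]`. The alpha check is omitted because where alpha is stored is not stated in this spec.

```python
import json
import struct
from pathlib import Path

MAX_RANK = 256
MAX_BYTES = 500 * 1024 * 1024


class LoRAValidationError(ValueError):
    """Raised when an uploaded file does not look like a compatible LoRA."""


def read_safetensors_header(path: Path) -> dict:
    """Read only the JSON header: 8-byte little-endian length, then the JSON."""
    with open(path, "rb") as fh:
        (header_len,) = struct.unpack("<Q", fh.read(8))
        return json.loads(fh.read(header_len))


def sniff(path: Path) -> dict:
    """Validate size, key naming, and rank without loading any tensor data."""
    if path.stat().st_size > MAX_BYTES:
        raise LoRAValidationError("file exceeds 500 MB")
    header = read_safetensors_header(path)
    lora_a = {k: v for k, v in header.items() if k.endswith("lora_A.weight")}
    if not lora_a:
        raise LoRAValidationError("no lora_A.weight tensors found")
    # Assumed PEFT layout: lora_A has shape [rank, in_features].
    ranks = {v["shape"][0] for v in lora_a.values()}
    if max(ranks) > MAX_RANK:
        raise LoRAValidationError(f"rank {max(ranks)} exceeds {MAX_RANK}")
    return {"adapter_name": path.stem, "ranks": sorted(ranks)}
```

A real validator would also compare the module names in each key against the expected DiT modules listed in the error message above.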
205
+
206
+ ### 5.3 Active stack management
207
+
208
+ UI shows a list of active LoRAs with per-row strength slider (0.0–1.5) and × button. State held in `gr.State` per tab. On generate:
209
+
210
+ ```python
211
+ backend.apply_lora_stack(active_adapters) # pipe.set_adapters(names, weights=scales)
212
+ audio, meta = backend.generate(mode, params)
213
+ meta["loras"] = [{"name":n, "scale":s, "sha256":h} for n,s,h in active_adapters]
214
+ ```
215
+
216
+ After generation the adapters stay loaded (cheap memory cost) but are deactivated via `pipe.disable_adapters()` if the user clears the stack.
217
+
218
+ ### 5.4 Sole-LoRA edge cases
219
+
220
+ - All chips off + no upload → `pipe.disable_adapters()` (vanilla SFT XL output).
+ - One LoRA with scale 0.0 → effectively disabled but still listed (UX: don't surprise the user by silently dropping it).
+ - Same LoRA loaded twice (user dragged the same file twice) → dedupe by file sha256; UI flash: "already in stack."
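
The sha256-based dedupe could be sketched as follows. `file_sha256` and `add_to_stack` are hypothetical helpers; the `(name, scale, sha256)` tuple shape mirrors the `active_adapters` entries used in the snippet in §5.3.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large LoRA files are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add_to_stack(stack: list, path: Path, scale: float = 1.0) -> bool:
    """Append (name, scale, sha256) unless identical content is already present."""
    digest = file_sha256(path)
    if any(entry[2] == digest for entry in stack):
        return False  # caller can flash "already in stack."
    stack.append((path.stem, scale, digest))
    return True
```

Hashing content rather than comparing filenames means a renamed copy of the same LoRA is still caught.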
223
+
224
+ ---
225
+
226
+ ## 6. Lyrics LM (`lyrics_lm.py`)
227
+
228
+ ### 6.1 Backend selection
229
+
230
+ | Device | Backend | Weights size |
231
+ |---|---|---|
232
+ | `mps` (Mac) | `mlx-lm` with quantised Qwen 2.5 7B 4-bit | ~4 GB |
233
+ | `cuda` | `transformers` with bf16 | ~14 GB |
234
+ | ZeroGPU | `transformers` bf16, sliced into the `@spaces.GPU` lifetime | ~14 GB |
235
+
236
+ Quantisation on Mac is the practical choice: 4-bit MLX-quant Qwen 2.5 7B runs ~3× faster than full-precision PyTorch MPS and barely affects lyric quality.
237
+
238
+ ### 6.2 Generation
239
+
240
+ - `max_new_tokens=600`, `temperature=0.85`, `top_p=0.9`, `repetition_penalty=1.1`.
241
+ - Stop sequences: `\n\n[end]`, `</lyrics>`.
242
+ - Post-process: strip leading/trailing whitespace, normalize section tags to lowercase (e.g., `[Verse 1]` → `[verse 1]`).
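
The post-processing step might look like the sketch below. `postprocess_lyrics` is a hypothetical name. Cutting the text at a stop sequence is an extra safety net assumed here, in case the generation backend returns the stop string instead of halting before it. Note that this lowercases any bracketed text, not only recognised section names.

```python
import re

_TAG = re.compile(r"\[([^\[\]\n]+)\]")
_STOP_SEQUENCES = ("\n\n[end]", "</lyrics>")


def postprocess_lyrics(raw: str) -> str:
    """Cut at the first stop sequence, trim whitespace, lowercase section tags."""
    cut = len(raw)
    for stop in _STOP_SEQUENCES:
        idx = raw.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    text = raw[:cut].strip()
    # Lowercase the tag body and collapse repeated spaces inside it.
    return _TAG.sub(
        lambda m: "[" + " ".join(m.group(1).lower().split()) + "]", text
    )
```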
243
+
244
+ ### 6.3 Lazy loading
245
+
246
+ ```python
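+ class LyricsLM:
+     _instance = None  # process-wide singleton, created on first use
+
+     @classmethod
+     def get(cls):
+         # Load the weights only on the first call; later calls reuse the instance.
+         if cls._instance is None:
+             cls._instance = cls._load()  # _load() is not shown here
+         return cls._instance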
254
+ ```
255
+
256
+ First call cost: ~20–40 s on Mac, ~10 s on CUDA. Surfaced to the user via `gr.Progress` on the Lyrics tab.
257
+
258
+ ---
259
+
260
+ ## 7. Post-processing (`post_process.py`)
261
+
262
+ ### 7.1 Stem separation
263
+
264
+ - `demucs.api.Separator(model="htdemucs_ft")` lazy singleton.
265
+ - Output: 4 WAV files (vocals, drums, bass, other).
266
+ - Runs synchronously after generation if the user expands the Stems section, or on-demand via a "Separate stems" button in the output panel.
267
+ - On ZeroGPU, counted in the same `@spaces.GPU` lifetime as the generation that produced the audio.
268
+
269
+ ### 7.2 Loudness normalization
270
+
271
+ - `pyloudnorm` normalises to **-14 LUFS** (streaming spec).
272
+ - Toggled by an `Advanced` checkbox per mode (default ON).
273
+ - Applied to the final WAV before MP3 encoding.
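
The arithmetic behind normalising to a target loudness is a single gain: the difference in LUFS is a difference in dB, which converts to a linear factor of 10^(dB/20). The sketch below shows only that arithmetic with hypothetical helper names; measuring the integrated loudness of the audio is left to `pyloudnorm` and is not reproduced here.

```python
TARGET_LUFS = -14.0


def gain_to_target(measured_lufs: float, target_lufs: float = TARGET_LUFS) -> float:
    """Linear gain factor that moves the measured loudness to the target."""
    gain_db = target_lufs - measured_lufs
    return 10.0 ** (gain_db / 20.0)


def apply_gain(samples: list, gain: float) -> list:
    """Scale every sample by the same factor (no clipping protection)."""
    return [s * gain for s in samples]
```

For example, a track measured at -20 LUFS needs +6 dB, a linear factor of roughly 2. Boosting a quiet track this way can push peaks past full scale, so a real implementation would likely check for clipping after the gain is applied.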
274
+
275
+ ### 7.3 MP3 export
276
+
277
+ - `ffmpeg` via `subprocess`: 320 kbps CBR, 44.1 kHz, stereo.
278
+ - Embeds metadata as ID3 tags (prompt, lora hashes, seed).
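
A sketch of how the `ffmpeg` argument list could be assembled. `build_mp3_command` is a hypothetical helper, and the mapping from prompt, LoRA hashes, and seed onto specific tag names is an assumption, since the spec does not fix it. The flags themselves (`-codec:a libmp3lame`, `-b:a`, `-ar`, `-ac`, `-id3v2_version`, `-metadata`) are standard ffmpeg options.

```python
def build_mp3_command(wav_path: str, mp3_path: str, tags: dict) -> list:
    """Build an ffmpeg argv for 320 kbps CBR, 44.1 kHz, stereo MP3 with ID3 tags."""
    cmd = [
        "ffmpeg", "-y",           # overwrite the output file if it exists
        "-i", wav_path,
        "-codec:a", "libmp3lame",
        "-b:a", "320k",           # constant bitrate
        "-ar", "44100",
        "-ac", "2",
        "-id3v2_version", "3",
    ]
    for key, value in tags.items():
        cmd += ["-metadata", f"{key}={value}"]
    cmd.append(mp3_path)
    return cmd
```

Passing the result to `subprocess.run(cmd, check=True)` as a list (rather than a shell string) avoids quoting problems when the prompt contains spaces or quotes.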
279
+
280
+ ---
281
+
282
+ ## 8. Frontend (`app.py` + `ui.py` + `theme.py`)
283
+
284
+ > **Reference mockups (visual source of truth):**
285
+ >
286
+ > | File | Covers |
287
+ > |---|---|
288
+ > | [`mockups/01_generate_mobile_errors.html`](./mockups/01_generate_mobile_errors.html) | Generate tab (fully expanded), mobile phone screens, error / edge-case states |
289
+ > | [`mockups/02_cover_extend.html`](./mockups/02_cover_extend.html) | Cover tab + Extend tab (both fully expanded) |
290
+ > | [`mockups/03_edit_lyrics.html`](./mockups/03_edit_lyrics.html) | Edit tab (Repaint + Flow Morph sub-modes) + Lyrics tab (Qwen LM params) |
291
+ > | [`mockups/README.md`](./mockups/README.md) | What's shared across tabs + what each tab adds |
292
+ >
293
+ > The mockups define the **layout, spacing, control surface, and disclosure hierarchy.** The prose below defines the **semantics**: what each control does, what the defaults are, what the responsive breakpoints are. If a discrepancy ever shows up, the mockups are the source for layout, and §3–§7 of this spec are the source for behaviour.
294
+
295
+ ### 8.1 Page chrome
296
+
297
+ ```html
298
+ HEADER (sticky):
299
+ [brand: "ACE Music Studio." in 15px white, "." in #FFF as period]
300
+ [status: "ready · MPS · M5 Max" in 10px muted]
301
+
302
+ CTA (below header, separator below):
303
+ Built with ♥. Drop a like · Follow @techfreakworm for what's next.
304
+
305
+ (Tab content)
306
+ ```
307
+
308
+ ### 8.2 Sidebar (desktop ≥ 1024 px)
309
+
310
+ 5 mode items + History section below. Active item: white left border + brighter text. Width: 170 px.
311
+
312
+ ### 8.3 Tablet (640–1024 px)
313
+
314
+ Sidebar collapses to 30 px wide **icon rail**. Hover shows tooltip with full label. Same active treatment.
315
+
316
+ ### 8.4 Mobile (< 640 px)
317
+
318
+ Native `gr.Tabs` (horizontal scroll) replaces the sidebar entirely. The swap is done with a CSS media query: `display: none` on `.ms-sidebar`, `display: flex` on `.ms-mobile-tabs`. No JS.
319
+
320
+ ### 8.5 Tab body
321
+
322
+ Two-column on desktop (form 60% / output 40%), stacks vertically on tablet and mobile.
323
+
324
+ Form layer order (top to bottom, always-visible by default):
325
+
326
+ 1. Style prompt (textarea, ~3 rows)
327
+ 2. Lyrics (textarea, ~6 rows), except on the Lyrics tab, which replaces it with brief + structure inputs
328
+ 3. Mode-specific: ref audio (Cover), seed audio (Extend), source + segment (Edit)
329
+ 4. Duration slider + vocals/instrumental toggle (Generate only)
330
+ 5. LoRA section (collapsed by default; chip row visible if any preset is "on")
331
+ 6. Advanced accordion (collapsed by default)
332
+ 7. LM-planner accordion (collapsed by default)
333
+ 8. Generate button (primary; white-on-black; full-width on mobile)
334
+
335
+ ### 8.6 Output panel
336
+
337
+ - Audio player with built-in waveform (Gradio 5 native)
338
+ - Retake button (↻)
+ - Stems grid (Demucs), only visible after Demucs runs
+ - Action row: ↓ mp3 · ↓ wav · `{ }` meta · ↗ share (copies a permalink with prompt+seed in URL params)
341
+ - Metadata JSON viewer (collapsible, default closed)
342
+
343
+ ### 8.7 Theme tokens (`theme.py`)
344
+
345
+ ```python
346
+ BG = "#0A0A0A"
347
+ SURFACE = "#141414"
348
+ SURFACE_STRONG = "#000000"
349
+ BORDER = "#1F1F1F"
350
+ BORDER_STRONG = "#2A2A2A"
351
+ INK = "#E5E5E5"
352
+ INK_MUTED = "#6B6B6B"
353
+ PRIMARY = "#FFFFFF"
354
+ ERROR = "#E5E5E5" # high-contrast white in Brutalist Mono; gradio error background still red-ish but our text is white
355
+ RADIUS = "6px"
356
+ FONT_STACK = '"Inter", -apple-system, BlinkMacSystemFont, "Segoe UI", system-ui, sans-serif'
357
+ ```
358
+
359
+ CSS injected via `gr.Blocks(css=…)` covers sidebar layout, responsive media queries, LoRA chip pill, waveform tightening, accordion arrow customization, hide-Gradio-footer.
360
+
361
+ ---
362
+
363
+ ## 9. Data flow per generation
364
+
365
+ ```
+ 1. User clicks "Generate" button on the Generate tab.
+ 2. app.py:on_generate(...) handler reads all gr inputs, coerces types.
+ 3. Handler validates active LoRAs (cheap header sniff); raises gr.Error on failure.
+ 4. Handler calls backend.generate_with_retry(mode="generate", params={...}).
+ 5. backend.generate_with_retry is the @spaces.GPU-decorated entrypoint.
+ 6. Inside the GPU lifetime:
+    a. _ensure_pipeline(): lazy load on first call
+    b. _apply_lora_stack(params.loras): pipe.set_adapters(names, weights)
+    c. _dispatch_mode("generate", params): calls pipe(...) with mode-specific kwargs
+    d. _post_process(audio, params): loudness norm, optionally stems
+    e. _emit_meta(params, audio): build metadata JSON, sha256s
+ 7. Returns (audio_path, meta_dict).
+ 8. Handler updates UI: audio player, metadata JSON viewer.
+ 9. History entry appended (in-memory, last 10).
380
+ ```
381
+
382
+ ZeroGPU abort handling wraps step 5 in a one-shot retry at 2× duration. Beyond that: `gr.Warning` with the suggestion to reduce duration or steps.
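
The one-shot retry and the bounded in-memory history could be sketched as below. `GPUAbort` is a hypothetical stand-in for whatever exception carries the "GPU task aborted" message, and `run` stands in for the GPU-decorated call; neither is taken from the `spaces` library.

```python
from collections import deque


class GPUAbort(RuntimeError):
    """Stand-in for the 'GPU task aborted' exception (hypothetical type)."""


def generate_with_retry(run, duration_s: int):
    """Call run(duration_s); on abort, retry once with twice the duration."""
    try:
        return run(duration_s)
    except GPUAbort:
        # A second abort propagates so the handler can show gr.Warning.
        return run(duration_s * 2)


# In-memory history: appending an 11th entry silently drops the oldest one.
history = deque(maxlen=10)
```

Using `deque(maxlen=10)` gives the "last 10" behaviour from step 9 without any manual trimming.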
383
+
384
+ ---
385
+
386
+ ## 10. Error handling matrix
387
+
+ | Trigger | User-facing | Logs |
+ |---|---|---|
+ | LoRA file invalid (rank, modules, size) | `gr.Error("This LoRA isn't compatible with ACE-Step 1.5 XL SFT. …")` | full traceback to stderr |
+ | Audio input wrong format | `gr.Error("Audio must be wav/mp3/flac, ≤ 240 s.")` | format diagnostics |
+ | Cover/Extend/Edit missing required input | `gr.Error("Reference audio is required for Cover mode.")` | param dump |
+ | ZeroGPU abort | auto-retry once at 2× duration; if still aborts: `gr.Warning("Generation timed out. Try a shorter duration or fewer steps.")` | timing info |
+ | Lyrics LM cold-load fails (OOM) | `gr.Error("Couldn't load lyrics model. Free some memory and retry.")` | full traceback |
+ | MPS op not implemented | falls back to CPU via env var; if still crashes: `gr.Error("This ACE-Step op isn't yet supported on Apple Silicon. Generation aborted.")` | op name + diagnostics |
+ | Demucs separator fails on weird audio | `gr.Warning("Stem separation failed; audio still saved.")` | traceback |
+ | Preset LoRA download fails | `gr.Error("Couldn't download preset 'X'. Check network.")` | network log |
+ | Out-of-disk on cache mirror | `gr.Error("Disk full. Free space and reload.")` | mount stats |
399
+
400
+ ---
401
+
402
+ ## 11. Testing
403
+
404
+ ### 11.1 Layers
405
+
406
+ - **L1 (no GPU, no models)**: module structure, type signatures, theme CSS asserts, LoRA-header sniff unit tests, metadata JSON shape, preset manifest schema. ~30 tests, runs in < 5 s.
+ - **L2 (mocked pipeline)**: each mode handler calls the backend with the right kwargs; `set_adapters` invoked with correct order/weights; lyrics LM prompt template asserted. ~25 tests, runs in < 30 s.
+ - **GPU smoke (`@pytest.mark.gpu`, skipped by default)**: one Generate + one Cover + one Extend + one Lyrics at minimum settings, asserts output exists and is non-zero size. ~4 tests, runs in 5–10 min on M5 Max.
409
+
410
+ ### 11.2 CI
411
+
412
+ - GitHub Actions: Python 3.11, run L1 + L2 with `pytest -m "not gpu"`.
413
+ - ruff format + ruff check both pass.
414
+ - No GPU testing in CI (cost). The user runs `pytest -m gpu` locally on the M5 Max before each release tag.
415
+
416
+ ### 11.3 Manual verification before merge
417
+
418
+ - Each new mode handler: at least one end-to-end on M5 Max with a real prompt + the psytrance LoRA loaded.
419
+ - LoRA upload: at least one bad-file rejection (rank mismatch) + one good-file success.
420
+ - Responsive: open on phone (Safari iOS), verify horizontal tab strip, verify generate end-to-end.
421
+
422
+ ---
423
+
424
+ ## 12. Deployment
425
+
426
+ ### 12.1 HF Spaces
427
+
428
+ `README.md` frontmatter:
429
+
430
+ ```yaml
431
+ ---
432
+ title: ACE Music Studio
433
+ emoji: 🎵
434
+ colorFrom: gray
435
+ colorTo: gray
436
+ sdk: gradio
437
+ sdk_version: "5.50.0"
438
+ app_file: app.py
439
+ python_version: "3.11"
440
+ suggested_hardware: zero-a10g
441
+ hf_oauth: false
442
+ preload_from_hub:
443
+ - ACE-Step/ACE-Step-v1.5-XL-SFT *.safetensors,config.json,scheduler/*,vae/*,tokenizer/*
444
+ - Qwen/Qwen2.5-7B-Instruct *.safetensors,config.json,tokenizer*
445
+ - facebook/htdemucs_ft *.th
446
+ - ACE-Step/ACE-Step-v1-RapMachine-LoRA *.safetensors
447
+ - ACE-Step/ACE-Step-v1-Chinese-Rap-LoRA *.safetensors
448
+ - ACE-Step/ACE-Step-v1-Lyric2Vocal-LoRA *.safetensors
449
+ - ACE-Step/ACE-Step-v1-Text2Samples-LoRA *.safetensors
450
+ ---
451
+ ```
452
+
453
+ Preload size estimate: ACE-Step XL SFT ~16 GB + Qwen 2.5 ~14 GB + htdemucs ~250 MB + 4 LoRAs ~400 MB = **~31 GB**, well under HF's 150 GB cap.
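
The estimate above can be checked with a few lines of arithmetic. The sizes are the approximate figures quoted in this spec, not measured values.

```python
# Approximate preload sizes in GB, as quoted in the estimate above.
PRELOAD_GB = {
    "ACE-Step XL SFT": 16.0,
    "Qwen 2.5 7B Instruct": 14.0,
    "htdemucs_ft": 0.25,
    "4 LoRA presets": 0.4,
}
HF_PRELOAD_CAP_GB = 150.0

total_gb = sum(PRELOAD_GB.values())
headroom_gb = HF_PRELOAD_CAP_GB - total_gb
```

The total comes to about 30.65 GB, which rounds to the ~31 GB stated above and leaves well over 100 GB of headroom under the cap.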
454
+
455
+ ### 12.2 GitHub
456
+
457
+ - Repo: `techfreakworm/ace-music-studio` (public).
458
+ - License: MIT.
459
+ - HF Space mirror via dedicated git remote (`git push space main`).
460
+ - README badges: HF Space, GitHub stars, MIT license, Python 3.11, backend ACE-Step.
461
+
462
+ ### 12.3 Local install
463
+
464
+ ```bash
465
+ git clone https://github.com/techfreakworm/ace-music-studio
466
+ cd ace-music-studio
467
+ bash setup.sh # creates .venv (Python 3.11), installs requirements
468
+ source .venv/bin/activate
469
+ python app.py # http://127.0.0.1:7860
470
+ ```
471
+
472
+ `setup.sh` detects Mac vs CUDA and installs the right ACE-Step branch + Qwen backend (mlx-lm on Mac).
473
+
474
+ ---
475
+
476
+ ## 13. Out of scope for v1
477
+
478
+ These are deferred to v2+. Do **not** implement without explicit user OK:
479
+
480
+ - Multi-prompt batch queue (generate 5 variants in a row)
481
+ - Persistent generation history across sessions (DB-backed)
482
+ - User accounts / auth
483
+ - Telemetry dashboard
484
+ - Voice cloning ("Persona" feature, RVC integration)
485
+ - LoRA training inside the app
486
+ - ControlNet-style conditioning (rhythm tracks, MIDI input)
487
+ - Spectrogram visualization (waveform is enough for v1)
488
+ - Multi-language UI strings (English only; song content can be any language)
489
+ - Watermarking output audio
490
+ - Browser-side audio editing (cut, paste, fade)
491
+ - Multi-tenant rate limiting
492
+ - Export to DAW format (stem zip is enough for v1)
493
+ - Visual regression tests for the Gradio UI
494
+
495
+ ---
496
+
497
+ ## 14. Open implementation questions (defer to writing-plans)
498
+
499
+ 1. **ACE-Step package: git or PyPI?** As of 2026-05-18, the official `ace-step` PyPI package exists for v1.5 but the Apple-Silicon fork is git-only. Decision: `pip install ace-step` on CUDA, `pip install git+https://github.com/clockworksquirrel/ace-step-apple-silicon` on Mac (detected by `setup.sh`).
+ 2. **Demucs model: `htdemucs` or `htdemucs_ft`?** `htdemucs_ft` is the fine-tuned variant with slightly better separation. Larger weight (~250 MB) but trivial in our budget. Default: `htdemucs_ft`.
+ 3. **LoRA preset HF IDs.** The placeholder paths above (`ACE-Step/ACE-Step-v1-*-LoRA`) may not match the exact HF org/repo naming when this is implemented; the plan should verify each preset's actual canonical HF path before the preload directive is finalised.
+ 4. **Qwen 2.5 7B vs 3B for ZeroGPU comfort.** 7B is correct per the brainstorming answer. If ZeroGPU's 60 s budget is too tight for cold-load + generate, fall back to **Qwen 2.5 3B Instruct** (~6 GB) without UI changes.
+ 5. **Edit-mode UX for segment selection.** Start with two numeric inputs (start_s, end_s). v1.5 can add a waveform-clickable selector if user feedback demands it.
+ 6. **History persistence.** v1 is in-memory only. The sidebar history list is `gr.State`-backed and wipes on page reload. Persistent history is v2.
+ 7. **ACE-Step Extend / Repaint exact API surface.** The psytrance LoRA generation config shows the relevant kwargs (`repainting_start`, `repainting_end`, `repaint_mode`, `repaint_strength`, `chunk_mask_mode`, `repaint_latent_crossfade_frames`, `repaint_wav_crossfade_sec`). Verify the conventions for "append after end of seed audio" (e.g., does `repainting_end > audio_length` extend, or do we need a different sentinel?) before M3 ships.
+ 8. **MLX-quant Qwen 2.5 7B availability.** Confirm `mlx-community/Qwen2.5-7B-Instruct-4bit` exists and produces acceptable lyric quality. If not, use `mlx-community/Qwen2.5-3B-Instruct-4bit` as the Mac path (the §6.1 table then becomes 3B on Mac, 7B on CUDA).
507
+
508
+ ---
509
+
510
+ ## 15. Sole-author rule
511
+
512
+ Per the user's permanent feedback (memory `feedback_git_authorship.md`):
513
+
514
+ - Mayank Gupta is sole author on every commit.
515
+ - **NO** `Co-Authored-By: Claude…` trailer.
516
+ - **NO** `Generated with Claude Code` footer.
517
+ - **NO** `--author=…` flag.
518
+ - This applies to commits made by any AI assistant working on this repo.
519
+
520
+ Encoded in `CLAUDE.md`, `AGENTS.md`, and `SKILLS.md` at the top of the repo so every assistant sees it on first read.
521
+
522
+ ---
523
+
524
+ ## 16. Implementation milestones (rough)
525
+
526
+ (Detailed sequencing belongs in the implementation plan; see `docs/superpowers/plans/`.)
527
+
+ | Milestone | Deliverable | Validates |
+ |---|---|---|
+ | M0: Bootstrap | `app.py:_bootstrap()` + device autodetect + Gradio Blocks skeleton + theme | App boots on M5 Max and on a Space-equivalent CPU env |
+ | M1: Generate mode (no LoRA) | `modes.generate` + `ace_pipeline.py` + audio player output | End-to-end "psytrance, 30 s" generation on M5 Max |
+ | M2: LoRA stack | `lora_stack.py` + preset chips + custom upload + active stack UI | Psytrance v2 + RapMachine stacked at 0.95 / 0.85 produce visibly different output |
+ | M3: Cover, Extend, Edit | Three more handlers + their tab UIs | Each mode produces a non-trivial output |
+ | M4: Lyrics LM | `lyrics_lm.py` + Lyrics tab + "use these" flow | Qwen 2.5 emits valid structural-tag lyrics; round-trip into Generate works |
+ | M5: Post-processing | Demucs + pyloudnorm + mp3 export | Stems download, normalised output, ID3-tagged MP3 |
+ | M6: Responsive + polish | Mobile media queries + tooltips + error UX + history sidebar | Phone Safari renders + generates end-to-end |
+ | M7: Deploy | Preload directive + ZeroGPU decorator + retry logic + Space mirror | Public Space serves requests at parity with local |
+
+ ---
+
+ ## 17. References
+
+ - ACE-Step 1.5 paper: [arXiv 2506.00045](https://arxiv.org/abs/2506.00045)
+ - ACE-Step 1.5 repo: [github.com/ace-step/ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)
+ - Apple Silicon fork: [github.com/clockworksquirrel/ace-step-apple-silicon](https://github.com/clockworksquirrel/ace-step-apple-silicon)
+ - ACE-Step LoRA family: [ace-step.github.io](https://ace-step.github.io/)
+ - Qwen 2.5: [huggingface.co/Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)
+ - Demucs: [github.com/facebookresearch/demucs](https://github.com/facebookresearch/demucs)
+ - z-image-studio (architectural precedent): `~/Projects/llm/z-image-studio/`
+ - Research bundle: `research/00_executive_summary.md` and siblings
docs/superpowers/specs/mockups/01_generate_mobile_errors.html ADDED
@@ -0,0 +1,604 @@
1
+ <h2>Generate fully expanded · mobile · error states</h2>
2
+ <p class="subtitle">Last batch. Generate tab with every control surfaced. Mobile phone screens for Generate + Cover + Lyrics. Six error/edge-case states.</p>
3
+
4
+ <style>
5
+ .gm { background:#0A0A0A; color:#E5E5E5; border:1px solid #1F1F1F; border-radius:10px; padding:18px; font-size:12px; line-height:1.5; margin-top:14px; }
6
+ .gm-header { display:flex; justify-content:space-between; align-items:center; padding-bottom:10px; border-bottom:1px solid #1F1F1F; margin-bottom:14px; }
7
+ .gm-brand { font-size:15px; font-weight:600; }
8
+ .gm-cta { font-size:11px; color:#6B6B6B; }
9
+ .gm-cta strong { color:#E5E5E5; }
10
+ .gm-status { font-size:10px; color:#6B6B6B; letter-spacing:0.08em; text-transform:uppercase; }
11
+ .gm-row { display:flex; gap:16px; align-items:flex-start; }
12
+ .gm-sidebar { background:#000; padding:14px 10px; border-radius:6px; min-width:170px; }
13
+ .gm-side { display:block; padding:8px 10px; border-radius:4px; margin-bottom:3px; font-size:12px; color:#6B6B6B; }
14
+ .gm-side.active { background:#1A1A1A; color:#FFF; border-left:2px solid #FFF; padding-left:8px; }
15
+ .gm-side .em { margin-right:6px; }
16
+ .gm-main { flex:1; display:flex; gap:14px; align-items:flex-start; }
17
+ .gm-form { flex:1.3; background:#141414; padding:16px; border-radius:6px; }
18
+ .gm-output { flex:1; background:#141414; padding:16px; border-radius:6px; min-width:260px; }
19
+ .gm-label { font-size:10px; text-transform:uppercase; letter-spacing:0.08em; color:#6B6B6B; margin-bottom:6px; display:flex; justify-content:space-between; align-items:center; }
20
+ .gm-label .hint { color:#5A5048; font-size:9px; text-transform:none; letter-spacing:normal; font-weight:400; }
21
+ .gm-input { background:#000; border:1px solid #2A2A2A; padding:8px 10px; border-radius:4px; color:#E5E5E5; margin-bottom:12px; font-size:11px; }
22
+ .gm-textarea { min-height:46px; }
23
+ .gm-grid2 { display:grid; grid-template-columns:1fr 1fr; gap:12px; margin-bottom:12px; }
24
+ .gm-grid3 { display:grid; grid-template-columns:1fr 1fr 1fr; gap:10px; margin-bottom:12px; }
25
+ .gm-grid4 { display:grid; grid-template-columns:1fr 1fr 1fr 1fr; gap:8px; margin-bottom:12px; }
26
+ .gm-slider-row { display:flex; align-items:center; gap:10px; padding:6px 8px; background:#000; border:1px solid #2A2A2A; border-radius:4px; margin-bottom:8px; font-size:11px; }
27
+ .gm-slider-row .name { color:#6B6B6B; font-size:10px; min-width:130px; }
28
+ .gm-slider { flex:1; height:3px; background:#2A2A2A; border-radius:2px; position:relative; }
29
+ .gm-slider::after { content:""; position:absolute; top:-4px; width:10px; height:10px; background:#FFF; border-radius:50%; }
30
+ .gm-slider.p5::after { left:5%; }
31
+ .gm-slider.p10::after { left:10%; }
32
+ .gm-slider.p15::after { left:15%; }
33
+ .gm-slider.p20::after { left:20%; }
34
+ .gm-slider.p25::after { left:25%; }
35
+ .gm-slider.p33::after { left:33%; }
36
+ .gm-slider.p40::after { left:40%; }
37
+ .gm-slider.p50::after { left:50%; }
38
+ .gm-slider.p60::after { left:60%; }
39
+ .gm-slider.p65::after { left:65%; }
40
+ .gm-slider.p70::after { left:70%; }
41
+ .gm-slider.p85::after { left:85%; }
42
+ .gm-slider.p90::after { left:90%; }
43
+ .gm-slider.p95::after { left:95%; }
44
+ .gm-slider-row .val { color:#FFF; font-family:monospace; font-size:11px; min-width:42px; text-align:right; }
45
+ .gm-toggle { display:flex; align-items:center; gap:8px; padding:6px 10px; background:#000; border:1px solid #2A2A2A; border-radius:4px; margin-bottom:8px; font-size:11px; cursor:pointer; }
46
+ .gm-toggle .box { width:14px; height:14px; border:1px solid #2A2A2A; border-radius:3px; display:inline-flex; align-items:center; justify-content:center; font-size:9px; }
47
+ .gm-toggle.on { color:#FFF; border-color:#FFF; }
48
+ .gm-toggle.on .box { background:#FFF; color:#0A0A0A; border-color:#FFF; }
49
+ .gm-pills { display:flex; gap:0; background:#000; border:1px solid #2A2A2A; border-radius:4px; padding:2px; margin-bottom:12px; }
50
+ .gm-pill { flex:1; text-align:center; padding:6px 10px; font-size:11px; color:#6B6B6B; border-radius:3px; cursor:pointer; }
51
+ .gm-pill.on { background:#FFF; color:#0A0A0A; }
52
+ .gm-select { background:#000; border:1px solid #2A2A2A; padding:8px 10px; border-radius:4px; color:#E5E5E5; font-size:11px; display:flex; justify-content:space-between; align-items:center; margin-bottom:8px; }
53
+ .gm-select .arrow { color:#6B6B6B; }
54
+ .gm-section { border:1px solid #2A2A2A; border-radius:4px; padding:14px; margin-top:14px; background:#0F0F0F; }
55
+ .gm-section-h { display:flex; justify-content:space-between; align-items:center; margin-bottom:12px; font-size:11px; font-weight:600; }
56
+ .gm-section-h .arrow { color:#FFF; }
57
+ .gm-section-h .meta { color:#6B6B6B; font-weight:400; font-size:10px; }
58
+ .gm-chip { display:inline-block; padding:5px 10px; border-radius:14px; font-size:10px; margin-right:5px; margin-bottom:5px; background:#000; border:1px solid #2A2A2A; color:#6B6B6B; cursor:pointer; }
59
+ .gm-chip.on { border-color:#FFF; color:#FFF; }
60
+ .gm-chip.upload { border-style:dashed; color:#FFF; }
61
+ .gm-lora-row { display:flex; align-items:center; gap:10px; padding:8px 10px; background:#000; border:1px solid #2A2A2A; border-radius:4px; margin-bottom:6px; font-size:11px; }
62
+ .gm-lora-name { flex:1; }
63
+ .gm-lora-name small { color:#6B6B6B; font-weight:400; margin-left:4px; }
64
+ .gm-x { color:#6B6B6B; cursor:pointer; padding:0 4px; }
65
+ .gm-btn { background:#FFF; color:#0A0A0A; padding:12px 18px; border-radius:4px; font-weight:600; display:block; font-size:13px; text-align:center; cursor:pointer; margin-top:16px; }
66
+ .gm-waveform { height:60px; background:#000; border:1px solid #2A2A2A; border-radius:4px; display:flex; align-items:center; justify-content:center; gap:2px; padding:8px; margin-bottom:10px; }
67
+ .gm-bar { width:2px; background:#E5E5E5; }
68
+ .gm-player-controls { display:flex; align-items:center; gap:10px; color:#6B6B6B; font-size:10px; margin-bottom:14px; }
69
+ .gm-play { width:28px; height:28px; border-radius:50%; background:#FFF; color:#0A0A0A; display:flex; align-items:center; justify-content:center; font-size:11px; }
70
+ .gm-meta-block { background:#000; border:1px solid #2A2A2A; border-radius:4px; padding:8px 10px; font-size:9px; color:#6B6B6B; font-family:monospace; line-height:1.6; max-height:160px; overflow:hidden; margin-top:8px; }
71
+ .gm-actions { display:flex; flex-wrap:wrap; gap:6px; margin-bottom:10px; }
72
+ .gm-secondary { border:1px solid #2A2A2A; color:#E5E5E5; padding:6px 12px; border-radius:4px; font-size:10px; cursor:pointer; }
73
+ .gm-stems { display:grid; grid-template-columns:1fr 1fr; gap:6px; margin-bottom:10px; }
74
+ .gm-stem { background:#000; border:1px solid #2A2A2A; border-radius:4px; padding:6px 10px; font-size:10px; display:flex; justify-content:space-between; align-items:center; }
75
+ .gm-stem .dl { color:#FFF; cursor:pointer; }
76
+ </style>
77
+
78
+
79
+ <h3 style="margin-top:14px">🎵 Generate - fully expanded · psytrance preset stacked with custom LoRA</h3>
80
+
81
+ <div class="gm">
82
+ <div class="gm-header">
83
+ <div>
84
+ <div class="gm-brand">ACE Music Studio<span style="color:#FFF">.</span></div>
85
+ <div class="gm-cta" style="margin-top:2px">Built with <span style="color:#FFF">♥</span>. <strong>Drop a like</strong> · Follow <strong>@techfreakworm</strong> for what's next.</div>
86
+ </div>
87
+ <div class="gm-status">ready · MPS · M5 Max</div>
88
+ </div>
89
+
90
+ <div class="gm-row">
91
+ <div class="gm-sidebar">
92
+ <div class="gm-side active"><span class="em">🎵</span>Generate</div>
93
+ <div class="gm-side"><span class="em">🎤</span>Cover</div>
94
+ <div class="gm-side"><span class="em">⏩</span>Extend</div>
95
+ <div class="gm-side"><span class="em">✏️</span>Edit</div>
96
+ <div class="gm-side"><span class="em">✍️</span>Lyrics</div>
97
+ <div style="border-top:1px solid #1F1F1F; margin:14px 0 10px; padding-top:10px; font-size:9px; color:#6B6B6B; text-transform:uppercase; letter-spacing:0.1em;">History · session</div>
98
+ <div style="font-size:10px; color:#6B6B6B; padding:4px 8px;">▶ psytrance · just now</div>
+ <div style="font-size:10px; color:#6B6B6B; padding:4px 8px;">▶ ambient_v4 · 2m</div>
+ <div style="font-size:10px; color:#6B6B6B; padding:4px 8px;">▶ chinese_rap · 7m</div>
+ <div style="font-size:10px; color:#6B6B6B; padding:4px 8px;">▶ lofi_vocal · 14m</div>
102
+ </div>
103
+
104
+ <div class="gm-main">
105
+ <div class="gm-form">
106
+
107
+ <div class="gm-label">1 · Style prompt <span class="hint">describe the song · genre, instruments, mood</span></div>
108
+ <div class="gm-input">psytrance, rolling triplet bassline, acid squelch, metallic leads, atmospheric pads, high quality</div>
109
+
110
+ <div class="gm-label">2 · Lyrics <span class="hint">use [verse] [chorus] [bridge] tags · ↗ open Lyrics tab to draft with Qwen 2.5</span></div>
111
+ <div class="gm-input gm-textarea" style="min-height:64px">[intro - atmospheric pads &amp; ambient synth]<br><br>[verse 1] six in the morning, the sun's still pretending<br>kick drum carries what the night was sending<br>shoes off, eyes closed, the city's still bending<br><br>[chorus] we let go, we let go, we let go</div>
112
+
113
+ <div class="gm-grid2">
114
+ <div>
115
+ <div class="gm-label">Duration <span class="hint">5 – 240 s</span></div>
116
+ <div class="gm-slider-row"><span class="name">seconds</span><span class="gm-slider p15"></span><span class="val">30</span></div>
117
+ </div>
118
+ <div>
119
+ <div class="gm-label">Vocal mode</div>
120
+ <div class="gm-pills">
121
+ <div class="gm-pill on">With vocals</div>
122
+ <div class="gm-pill">Instrumental</div>
123
+ </div>
124
+ </div>
125
+ </div>
126
+
127
+ <!-- LoRA section, expanded -->
128
+ <div class="gm-section">
129
+ <div class="gm-section-h">
130
+ <span>LoRA stack <span class="meta">· 2 active · order matters</span></span>
+ <span class="arrow">▾</span>
132
+ </div>
133
+
134
+ <div class="gm-label">Bundled presets <span class="hint">click to toggle</span></div>
135
+ <div style="margin-bottom:12px;">
136
+ <span class="gm-chip">RapMachine</span>
137
+ <span class="gm-chip">Chinese Rap</span>
138
+ <span class="gm-chip on">Lyric2Vocal</span>
139
+ <span class="gm-chip">Text2Samples</span>
140
+ </div>
141
+
142
+ <div class="gm-label">Active stack <span class="hint">↑↓ to reorder · × to remove</span></div>
143
+ <div class="gm-lora-row">
144
+ <span class="gm-lora-name">Lyric2Vocal <small>· preset · 28 MB</small></span>
145
+ <span class="gm-slider p65" style="width:100px"></span>
146
+ <span class="val" style="color:#FFF; font-family:monospace; font-size:11px;">0.65</span>
147
+ <span class="gm-x">×</span>
148
+ </div>
149
+ <div class="gm-lora-row">
150
+ <span class="gm-lora-name">psytrance_v2 <small>· custom · 47 MB · rank 64 · sha 0c94…</small></span>
151
+ <span class="gm-slider p95" style="width:100px"></span>
152
+ <span class="val" style="color:#FFF; font-family:monospace; font-size:11px;">0.95</span>
153
+ <span class="gm-x">×</span>
154
+ </div>
155
+
156
+ <div style="margin-top:10px;">
157
+ <span class="gm-chip upload">↑ drop .safetensors here or click</span>
158
+ </div>
159
+ </div>
160
+
161
+ <!-- Advanced section, expanded -->
162
+ <div class="gm-section">
163
+ <div class="gm-section-h">
164
+ <span>Advanced <span class="meta">· generation parameters</span></span>
+ <span class="arrow">▾</span>
166
+ </div>
167
+
168
+ <div class="gm-grid3">
169
+ <div><div class="gm-label">BPM</div><div class="gm-input" style="margin-bottom:0">135</div></div>
170
+ <div><div class="gm-label">Key / scale</div><div class="gm-input" style="margin-bottom:0">auto</div></div>
171
+ <div><div class="gm-label">Time signature</div><div class="gm-input" style="margin-bottom:0">4 / 4</div></div>
172
+ </div>
173
+
174
+ <div class="gm-grid2">
175
+ <div><div class="gm-label">Sampler</div><div class="gm-select">heun <span class="arrow">▾</span></div></div>
+ <div><div class="gm-label">Vocal language</div><div class="gm-select">auto <span class="arrow">▾</span></div></div>
177
+ </div>
178
+
179
+ <div class="gm-slider-row"><span class="name">Inference steps</span><span class="gm-slider p25"></span><span class="val">50</span></div>
180
+ <div class="gm-slider-row"><span class="name">CFG scale</span><span class="gm-slider p40"></span><span class="val">5.0</span></div>
181
+ <div class="gm-slider-row"><span class="name">Shift</span><span class="gm-slider p33"></span><span class="val">3</span></div>
182
+ <div class="gm-slider-row"><span class="name">CFG interval start</span><span class="gm-slider p5"></span><span class="val">0.0</span></div>
183
+ <div class="gm-slider-row"><span class="name">CFG interval end</span><span class="gm-slider p95"></span><span class="val">1.0</span></div>
184
+
185
+ <div class="gm-label" style="margin-top:8px">Negative prompt <span class="hint">things to avoid</span></div>
186
+ <div class="gm-input gm-textarea" style="font-size:10px">bitcrushed, aliasing, quantizing noise, digital clipping, glitchy, mp3 artifacts, jazz, funk, pop, acoustic, lo-fi, orchestral, dubstep, vocal hooks, electric guitar, slow tempo, jazz chords, blues scale</div>
187
+
188
+ <div class="gm-grid2">
189
+ <div><div class="gm-label">Audio format</div><div class="gm-pills"><div class="gm-pill on">mp3 320</div><div class="gm-pill">wav 44.1</div></div></div>
190
+ <div><div class="gm-label">Loudness</div><div class="gm-toggle on"><span class="box">✓</span> -14 LUFS</div></div>
191
+ </div>
192
+
193
+ <div class="gm-grid2">
194
+ <div><div class="gm-label">Fade in</div><div class="gm-slider-row"><span class="name">seconds</span><span class="gm-slider p5"></span><span class="val">0.0</span></div></div>
195
+ <div><div class="gm-label">Fade out</div><div class="gm-slider-row"><span class="name">seconds</span><span class="gm-slider p5"></span><span class="val">0.0</span></div></div>
196
+ </div>
197
+
198
+ <div class="gm-grid2">
199
+ <div><div class="gm-label">Latent shift</div><div class="gm-input" style="margin-bottom:0">0</div></div>
200
+ <div><div class="gm-label">Latent rescale</div><div class="gm-input" style="margin-bottom:0">1</div></div>
201
+ </div>
202
+
203
+ <div class="gm-grid2">
204
+ <div><div class="gm-label">Seed</div><div class="gm-input" style="margin-bottom:0; font-family:monospace">1297183202</div></div>
205
+ <div><div class="gm-label">&nbsp;</div><div class="gm-toggle"><span class="box"></span> Lock seed</div></div>
206
+ </div>
207
+ </div>
208
+
209
+ <!-- LM planner section, expanded -->
210
+ <div class="gm-section">
211
+ <div class="gm-section-h">
212
+ <span>LM planner · Qwen3 thinking <span class="meta">· chain-of-thought structure</span></span>
+ <span class="arrow">▾</span>
214
+ </div>
215
+
216
+ <div class="gm-toggle on"><span class="box">✓</span> Thinking enabled <span style="color:#6B6B6B; font-size:9px; margin-left:auto">+ slower but better structure</span></div>
+ <div class="gm-toggle on"><span class="box">✓</span> Constrained decoding</div>
218
+
219
+ <div class="gm-grid4" style="margin-top:8px">
220
+ <div><div class="gm-label">Temperature</div><div class="gm-input" style="margin-bottom:0">0.85</div></div>
221
+ <div><div class="gm-label">Top-k</div><div class="gm-input" style="margin-bottom:0">0</div></div>
222
+ <div><div class="gm-label">Top-p</div><div class="gm-input" style="margin-bottom:0">0.90</div></div>
223
+ <div><div class="gm-label">LM CFG</div><div class="gm-input" style="margin-bottom:0">2</div></div>
224
+ </div>
225
+
226
+ <div class="gm-label">CoT pipeline toggles <span class="hint">which fields the LM rewrites pre-generation</span></div>
227
+ <div class="gm-grid4">
228
+ <div class="gm-toggle"><span class="box"></span> metas</div>
229
+ <div class="gm-toggle"><span class="box"></span> caption</div>
230
+ <div class="gm-toggle"><span class="box"></span> lyrics</div>
231
+ <div class="gm-toggle"><span class="box"></span> language</div>
232
+ </div>
233
+
234
+ <div class="gm-label">LM negative prompt</div>
235
+ <div class="gm-input" style="font-size:10px">happy chords, major scale, uplifting melody</div>
236
+
237
+ <div class="gm-label">CoT override fields <span class="hint">if a CoT toggle is on, the LM rewrites these</span></div>
238
+ <div class="gm-grid2">
239
+ <div><div class="gm-label">cot_bpm</div><div class="gm-input" style="margin-bottom:0; opacity:0.5">(blank → use main BPM)</div></div>
+ <div><div class="gm-label">cot_keyscale</div><div class="gm-input" style="margin-bottom:0; opacity:0.5">(blank → use main key)</div></div>
241
+ </div>
242
+ </div>
243
+
244
+ <!-- DCW section, expanded -->
245
+ <div class="gm-section">
246
+ <div class="gm-section-h">
247
+ <span>DCW · dynamic CFG warping <span class="meta">· wavelet-based</span></span>
+ <span class="arrow">▾</span>
249
+ </div>
250
+
251
+ <div class="gm-toggle on"><span class="box">✓</span> DCW enabled</div>
252
+
253
+ <div class="gm-grid3">
254
+ <div><div class="gm-label">Mode</div><div class="gm-select">double <span class="arrow">▾</span></div></div>
+ <div><div class="gm-label">Wavelet</div><div class="gm-select">haar <span class="arrow">▾</span></div></div>
256
+ <div><div class="gm-label">&nbsp;</div><div style="font-size:9px; color:#6B6B6B; padding-top:8px;">leave defaults if unsure</div></div>
257
+ </div>
258
+
259
+ <div class="gm-slider-row"><span class="name">DCW scaler</span><span class="gm-slider p5"></span><span class="val">0.02</span></div>
260
+ <div class="gm-slider-row"><span class="name">High scaler</span><span class="gm-slider p10"></span><span class="val">0.06</span></div>
261
+ </div>
262
+
263
+ <div class="gm-btn">▶ Generate · est. ~30 s on M5 Max</div>
264
+ </div>
265
+
266
+ <!-- Output panel -->
267
+ <div class="gm-output">
268
+ <div class="gm-label" style="margin-bottom:10px">Output · psytrance · 30 s · seed 1297183202</div>
269
+
270
+ <div class="gm-waveform">
271
+ <div class="gm-bar" style="height:18%"></div><div class="gm-bar" style="height:32%"></div><div class="gm-bar" style="height:54%"></div><div class="gm-bar" style="height:72%"></div><div class="gm-bar" style="height:88%"></div><div class="gm-bar" style="height:62%"></div><div class="gm-bar" style="height:42%"></div><div class="gm-bar" style="height:78%"></div><div class="gm-bar" style="height:92%"></div><div class="gm-bar" style="height:66%"></div><div class="gm-bar" style="height:48%"></div><div class="gm-bar" style="height:30%"></div><div class="gm-bar" style="height:58%"></div><div class="gm-bar" style="height:80%"></div><div class="gm-bar" style="height:70%"></div><div class="gm-bar" style="height:44%"></div><div class="gm-bar" style="height:24%"></div><div class="gm-bar" style="height:50%"></div>
272
+ </div>
273
+
274
+ <div class="gm-player-controls">
275
+ <span class="gm-play">▶</span>
276
+ <span>0:00 / 0:30</span>
277
+ <span style="margin-left:auto; cursor:pointer; color:#FFF">↻ retake · new seed</span>
278
+ </div>
279
+
280
+ <div class="gm-label">Stems · Demucs htdemucs_ft</div>
281
+ <div class="gm-stems">
282
+ <div class="gm-stem"><span>vocals · 1.8 MB</span><span class="dl">↓</span></div>
+ <div class="gm-stem"><span>drums · 1.6 MB</span><span class="dl">↓</span></div>
+ <div class="gm-stem"><span>bass · 1.4 MB</span><span class="dl">↓</span></div>
+ <div class="gm-stem"><span>other · 1.7 MB</span><span class="dl">↓</span></div>
286
+ </div>
287
+
288
+ <div class="gm-label">Export</div>
289
+ <div class="gm-actions">
290
+ <span class="gm-secondary">↓ mp3 · 1.2 MB</span>
+ <span class="gm-secondary">↓ wav · 5.3 MB</span>
292
+ <span class="gm-secondary">↓ stems zip</span>
293
+ <span class="gm-secondary">{ } meta</span>
294
+ <span class="gm-secondary">↗ share</span>
295
+ </div>
296
+
297
+ <div class="gm-label" style="margin-top:14px">Metadata</div>
298
+ <div class="gm-meta-block">
299
+ {<br>
300
+ &nbsp;&nbsp;"mode": "generate",<br>
301
+ &nbsp;&nbsp;"prompt": "psytrance, rolling triplet bassline...",<br>
302
+ &nbsp;&nbsp;"lyrics_first_line": "[intro - atmospheric pads...",<br>
303
+ &nbsp;&nbsp;"duration_s": 30, "instrumental": false,<br>
304
+ &nbsp;&nbsp;"bpm": 135, "key": "auto", "time_sig": "4/4",<br>
305
+ &nbsp;&nbsp;"sampler": "heun", "steps": 50, "cfg": 5.0, "shift": 3,<br>
306
+ &nbsp;&nbsp;"cfg_interval": [0.0, 1.0],<br>
307
+ &nbsp;&nbsp;"lm": {"thinking": true, "temp": 0.85, "top_p": 0.9, "cfg": 2,<br>
308
+ &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"cot": {"metas":false,"caption":false,"lyrics":false,"language":false}},<br>
309
+ &nbsp;&nbsp;"dcw": {"enabled":true,"mode":"double","scaler":0.02,"high_scaler":0.06,"wavelet":"haar"},<br>
310
+ &nbsp;&nbsp;"loras": [<br>
311
+ &nbsp;&nbsp;&nbsp;&nbsp;{"name":"Lyric2Vocal","scale":0.65,"sha256":"7e1f..."},<br>
312
+ &nbsp;&nbsp;&nbsp;&nbsp;{"name":"psytrance_v2","scale":0.95,"sha256":"0c94..."}<br>
313
+ &nbsp;&nbsp;],<br>
314
+ &nbsp;&nbsp;"seed": 1297183202,<br>
315
+ &nbsp;&nbsp;"output_sha256": "f33a..."<br>
316
+ }
317
+ </div>
318
+ </div>
319
+ </div>
320
+ </div>
321
+ </div>
322
+
323
+
324
+ <h3 style="margin-top:30px">📱 Mobile - phone screens</h3>
325
+ <p class="subtitle">Horizontal scroll tab strip at the top replaces the sidebar. Output stacks below form. Same Brutalist Mono.</p>
326
+
327
+ <style>
328
+ .mob-frame { display:flex; gap:24px; flex-wrap:wrap; justify-content:center; align-items:flex-start; }
329
+ .mob-phone { background:#222; border-radius:18px; padding:8px; }
330
+ .mob-screen { width:200px; background:#0A0A0A; color:#E5E5E5; border-radius:12px; padding:10px; }
331
+ .mob-header { display:flex; justify-content:space-between; align-items:center; padding-bottom:6px; border-bottom:1px solid #1F1F1F; margin-bottom:8px; }
332
+ .mob-brand { font-size:11px; font-weight:600; }
333
+ .mob-cta { font-size:8px; color:#6B6B6B; }
334
+ .mob-tabs { display:flex; gap:6px; overflow-x:auto; padding:4px 0; margin-bottom:8px; border-bottom:1px solid #1F1F1F; }
335
+ .mob-tab { font-size:9px; color:#6B6B6B; white-space:nowrap; padding:4px 6px; }
336
+ .mob-tab.active { color:#FFF; border-bottom:1px solid #FFF; }
337
+ .mob-form { background:#141414; padding:10px; border-radius:5px; }
338
+ .mob-label { font-size:8px; text-transform:uppercase; letter-spacing:0.06em; color:#6B6B6B; margin-bottom:4px; }
339
+ .mob-input { background:#000; border:1px solid #2A2A2A; padding:5px 8px; border-radius:3px; font-size:9px; margin-bottom:8px; }
340
+ .mob-textarea { min-height:30px; }
341
+ .mob-chips { margin-bottom:8px; }
342
+ .mob-chip { display:inline-block; padding:2px 7px; border-radius:9px; font-size:8px; margin-right:3px; margin-bottom:3px; background:#000; border:1px solid #2A2A2A; color:#6B6B6B; }
343
+ .mob-chip.on { border-color:#FFF; color:#FFF; }
344
+ .mob-accordion { background:#000; border:1px solid #2A2A2A; border-radius:3px; padding:5px 8px; margin-bottom:6px; font-size:9px; color:#6B6B6B; display:flex; justify-content:space-between; }
345
+ .mob-btn { background:#FFF; color:#0A0A0A; padding:6px 10px; border-radius:3px; font-weight:600; font-size:9px; text-align:center; }
346
+ .mob-output { background:#141414; padding:10px; border-radius:5px; margin-top:8px; }
347
+ .mob-wave { height:30px; background:#000; border:1px solid #2A2A2A; border-radius:3px; display:flex; align-items:center; gap:1px; padding:4px; margin-bottom:6px; }
348
+ .mob-wave-bar { width:1px; background:#FFF; }
349
+ .mob-controls { display:flex; align-items:center; gap:6px; font-size:8px; color:#6B6B6B; margin-bottom:8px; }
350
+ .mob-play { width:20px; height:20px; border-radius:50%; background:#FFF; color:#0A0A0A; display:flex; align-items:center; justify-content:center; font-size:9px; }
351
+ .mob-export { display:flex; flex-wrap:wrap; gap:3px; }
352
+ .mob-secondary { border:1px solid #2A2A2A; padding:3px 8px; border-radius:3px; font-size:8px; color:#E5E5E5; }
353
+ .mob-dropzone { background:#000; border:2px solid #FFF; border-radius:4px; padding:6px 8px; margin-bottom:8px; font-size:8px; }
354
+ .mob-caption { text-align:center; color:#6B6B6B; font-size:10px; margin-top:8px; }
355
+ /* Mobile slider: bounded inside its container, no overflow */
356
+ .mob-slider { height:3px; background:#2A2A2A; border-radius:2px; position:relative; margin:6px 0 10px; box-sizing:border-box; }
357
+ .mob-slider::after { content:""; position:absolute; top:-3px; width:9px; height:9px; background:#FFF; border-radius:50%; transform:translateX(-50%); }
358
+ .mob-slider.p15::after { left:15%; }
359
+ .mob-slider.p93::after { left:93%; }
360
+ </style>
361
+
362
+ <div class="mob-frame">
363
+
364
+ <!-- Phone 1: Generate -->
365
+ <div>
366
+ <div class="mob-phone">
367
+ <div class="mob-screen">
368
+ <div class="mob-header">
369
+ <div class="mob-brand">ACE Music.</div>
370
+ <div class="mob-cta">♥ @tfw</div>
371
+ </div>
372
+ <div class="mob-tabs">
373
+ <div class="mob-tab active">🎵 Generate</div>
+ <div class="mob-tab">🎤 Cover</div>
375
+ <div class="mob-tab">⏩</div>
376
+ <div class="mob-tab">✏️</div>
377
+ <div class="mob-tab">✍️</div>
378
+ </div>
379
+ <div class="mob-form">
380
+ <div class="mob-label">Style</div>
381
+ <div class="mob-input">psytrance, acid leads</div>
382
+ <div class="mob-label">Lyrics</div>
383
+ <div class="mob-input mob-textarea">[verse] six in the morning...</div>
384
+ <div class="mob-label">Duration · 30 s</div>
385
+ <div class="mob-slider p15"></div>
386
+ <div class="mob-chips">
387
+ <span class="mob-chip on">psytrance_v2</span>
388
+ <span class="mob-chip">+ upload</span>
389
+ </div>
390
+ <div class="mob-accordion">▸ Advanced · BPM 135, sampler heun</div>
+ <div class="mob-accordion">▸ LM planner</div>
+ <div class="mob-accordion">▸ DCW</div>
+ <div class="mob-btn" style="margin-top:6px">▶ Generate</div>
394
+ </div>
395
+ <div class="mob-output">
396
+ <div class="mob-wave">
397
+ <div class="mob-wave-bar" style="height:30%"></div><div class="mob-wave-bar" style="height:60%"></div><div class="mob-wave-bar" style="height:80%"></div><div class="mob-wave-bar" style="height:50%"></div><div class="mob-wave-bar" style="height:70%"></div><div class="mob-wave-bar" style="height:90%"></div><div class="mob-wave-bar" style="height:40%"></div><div class="mob-wave-bar" style="height:65%"></div><div class="mob-wave-bar" style="height:80%"></div><div class="mob-wave-bar" style="height:55%"></div><div class="mob-wave-bar" style="height:75%"></div><div class="mob-wave-bar" style="height:45%"></div><div class="mob-wave-bar" style="height:35%"></div><div class="mob-wave-bar" style="height:60%"></div><div class="mob-wave-bar" style="height:25%"></div>
398
+ </div>
399
+ <div class="mob-controls">
400
+ <span class="mob-play">▶</span>
401
+ <span>0:00 / 0:30</span>
402
+ <span style="margin-left:auto; color:#FFF">↻</span>
403
+ </div>
404
+ <div class="mob-export">
405
+ <span class="mob-secondary">↓ mp3</span>
406
+ <span class="mob-secondary">↓ wav</span>
407
+ <span class="mob-secondary">stems</span>
408
+ </div>
409
+ </div>
410
+ </div>
411
+ </div>
412
+ <div class="mob-caption">Generate · 360 × 720 mobile</div>
413
+ </div>
414
+
415
+ <!-- Phone 2: Cover with file picked -->
416
+ <div>
417
+ <div class="mob-phone">
418
+ <div class="mob-screen">
419
+ <div class="mob-header">
420
+ <div class="mob-brand">ACE Music.</div>
421
+ <div class="mob-cta">♥ @tfw</div>
422
+ </div>
423
+ <div class="mob-tabs">
424
+ <div class="mob-tab">🎵</div>
+ <div class="mob-tab active">🎤 Cover</div>
426
+ <div class="mob-tab">⏩</div>
427
+ <div class="mob-tab">✏️</div>
428
+ <div class="mob-tab">✍️</div>
429
+ </div>
430
+ <div class="mob-form">
431
+ <div class="mob-label">1 · Reference</div>
432
+ <div class="mob-dropzone">
433
+ <strong style="color:#FFF">↑ ref_psy.wav</strong><br>
434
+ <span style="color:#6B6B6B">44.1k · 28 s · 2.1 MB</span>
435
+ </div>
436
+ <div class="mob-label">2 · New prompt</div>
437
+ <div class="mob-input">faster, more aggressive</div>
438
+ <div class="mob-label">3 · New lyrics</div>
439
+ <div class="mob-input mob-textarea">[verse] new lyrics over ref...</div>
440
+ <div class="mob-label">Cover strength · 0.93</div>
441
+ <div class="mob-slider p93"></div>
442
+ <div class="mob-chips">
443
+ <span class="mob-chip on">RapMachine</span>
444
+ </div>
445
+ <div class="mob-accordion">▸ Advanced</div>
+ <div class="mob-accordion">▸ LM planner</div>
+ <div class="mob-btn" style="margin-top:6px">▶ Cover</div>
448
+ </div>
449
+ </div>
450
+ </div>
451
+ <div class="mob-caption">Cover · with ref audio loaded</div>
452
+ </div>
453
+
454
+ <!-- Phone 3: Lyrics output -->
455
+ <div>
456
+ <div class="mob-phone">
457
+ <div class="mob-screen">
458
+ <div class="mob-header">
459
+ <div class="mob-brand">ACE Music.</div>
460
+ <div class="mob-cta">♥ @tfw</div>
461
+ </div>
462
+ <div class="mob-tabs">
463
+ <div class="mob-tab">🎵</div>
+ <div class="mob-tab">🎤</div>
465
+ <div class="mob-tab">⏩</div>
466
+ <div class="mob-tab">✏️</div>
467
+ <div class="mob-tab active">✍️ Lyrics</div>
468
+ </div>
469
+ <div class="mob-form">
470
+ <div class="mob-label">Brief</div>
471
+ <div class="mob-input mob-textarea">psytrance anthem about sunrise...</div>
472
+ <div class="mob-label">Structure</div>
473
+ <div class="mob-input">intro, verse, chorus...</div>
474
+ <div class="mob-label">Language · en · 0.85 temp</div>
+ <div class="mob-accordion">▸ LM parameters</div>
+ <div class="mob-btn" style="margin-top:6px">▶ Draft</div>
477
+ </div>
478
+ <div class="mob-output">
479
+ <div style="font-size:9px; line-height:1.5;">
480
+ <strong style="color:#FFF">[intro]</strong><br>
481
+ <span style="color:#B8B0A4">the lights start low...</span><br>
482
+ <strong style="color:#FFF">[verse 1]</strong><br>
483
+ <span style="color:#B8B0A4">six in the morning,<br>the sun's still pretending...</span>
484
+ </div>
485
+ <div class="mob-export" style="margin-top:8px">
486
+ <span class="mob-secondary" style="border-color:#FFF; color:#FFF">↑ Use in Generate</span>
487
+ <span class="mob-secondary">↻</span>
488
+ </div>
489
+ </div>
490
+ </div>
491
+ </div>
492
+ <div class="mob-caption">Lyrics · draft visible</div>
493
+ </div>
494
+
495
+ </div>
496
+
497
+
498
+ <h3 style="margin-top:30px">⚠️ Error and edge-case states</h3>
499
+
500
+ <style>
501
+ .err { background:#0A0A0A; border:1px solid #1F1F1F; border-radius:8px; padding:14px; margin-bottom:10px; }
502
+ .err-row { display:flex; align-items:flex-start; gap:14px; }
503
+ .err-icon { width:28px; height:28px; flex-shrink:0; border-radius:50%; background:#FFF; color:#0A0A0A; display:flex; align-items:center; justify-content:center; font-size:14px; font-weight:600; }
504
+ .err-icon.warn { background:#0A0A0A; color:#FFF; border:1px solid #FFF; }
505
+ .err-icon.info { background:transparent; color:#6B6B6B; border:1px solid #6B6B6B; }
506
+ .err-body { flex:1; }
507
+ .err-title { font-size:12px; font-weight:600; color:#FFF; margin-bottom:4px; }
508
+ .err-msg { font-size:11px; color:#B8B0A4; line-height:1.5; margin-bottom:6px; }
509
+ .err-action { display:inline-block; border:1px solid #FFF; color:#FFF; padding:4px 10px; border-radius:3px; font-size:10px; cursor:pointer; margin-right:4px; }
510
+ .err-action.secondary { border-color:#2A2A2A; color:#B8B0A4; }
511
+ .err-tag { display:inline-block; background:#1A1A1A; color:#6B6B6B; padding:2px 6px; border-radius:3px; font-size:9px; font-family:monospace; margin-left:6px; }
512
+
513
+ .progress { background:#0A0A0A; border:1px solid #1F1F1F; border-radius:8px; padding:18px; margin-bottom:10px; }
514
+ .progress-bar { height:4px; background:#1A1A1A; border-radius:2px; overflow:hidden; margin:10px 0 6px; }
515
+ .progress-bar .fill { height:100%; background:#FFF; width:42%; }
516
+ .progress-meta { display:flex; justify-content:space-between; font-size:10px; color:#6B6B6B; font-family:monospace; }
517
+ .progress-title { font-size:12px; font-weight:600; color:#FFF; margin-bottom:4px; }
518
+ .progress-sub { font-size:10px; color:#6B6B6B; }
519
+ </style>
520
+
521
+ <div class="err">
522
+ <div class="err-row">
523
+ <div class="err-icon">!</div>
524
+ <div class="err-body">
525
+ <div class="err-title">LoRA not compatible <span class="err-tag">LoRAValidationError</span></div>
526
+ <div class="err-msg">This LoRA was trained against <code>SDXL</code>, not ACE-Step 1.5 XL SFT. Expected DiT modules: <code>to_q, to_k, to_v, to_out.0, ff.net.0.proj, ff.net.2</code>. Got: <code>unet.down_blocks…</code>.</div>
527
+ <div class="err-action">Remove from stack</div>
528
+ <span class="err-action secondary">View header diagnostics</span>
529
+ </div>
530
+ </div>
531
+ </div>
532
+
533
+ <div class="err">
534
+ <div class="err-row">
535
+ <div class="err-icon warn">⚠</div>
536
+ <div class="err-body">
537
+ <div class="err-title">ZeroGPU timed out · auto-retried at 2× duration</div>
538
+ <div class="err-msg">First attempt aborted at the 60 s shared-A10G cap. Second attempt at 120 s also aborted. Try a shorter duration, fewer steps, or fewer active LoRAs. <span style="color:#6B6B6B">last seen: 70 s wall, step 41/50</span></div>
539
+ <div class="err-action">Lower steps to 30</div>
540
+ <span class="err-action secondary">Reduce duration to 20 s</span>
541
+ </div>
542
+ </div>
543
+ </div>
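The retry behaviour this card describes (one attempt at the 60 s cap, one more at 120 s, then the error) could be sketched as below. This is a minimal illustration, not the app's implementation: `run_with_budget_retry` and the `generate` callable are hypothetical names, and signalling a timeout with `TimeoutError` is an assumption.

```python
def run_with_budget_retry(generate, base_budget_s=60, max_attempts=2):
    """Call generate(budget_s); on TimeoutError, retry with a doubled budget.

    `generate` is a hypothetical callable standing in for the GPU job.
    """
    budget_s = base_budget_s
    for attempt in range(1, max_attempts + 1):
        try:
            return generate(budget_s)
        except TimeoutError:
            if attempt == max_attempts:
                raise  # both attempts failed: the caller shows the error card
            budget_s *= 2  # 60 s -> 120 s, matching the mockup text
```

With the defaults, a job that needs more than 60 s but less than 120 s succeeds on the second attempt; a job that exceeds both budgets re-raises the timeout so the UI can offer the "Lower steps" and "Reduce duration" actions.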
544
+
545
+ <div class="err">
546
+ <div class="err-row">
547
+ <div class="err-icon warn">⚠</div>
548
+ <div class="err-body">
549
+ <div class="err-title">MPS op not implemented · falling back to CPU <span class="err-tag">aten::_fft_r2c</span></div>
550
+ <div class="err-msg">An ACE-Step kernel hit a PyTorch MPS gap. CPU fallback engaged via <code>PYTORCH_ENABLE_MPS_FALLBACK=1</code>. Generation will continue, but the affected segments will run ~2–3× slower.</div>
551
+ <div class="err-action secondary">Continue anyway</div>
552
+ <span class="err-action secondary">Open issue on GitHub</span>
553
+ </div>
554
+ </div>
555
+ </div>
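For context on the fallback named in this card, a minimal sketch of opting in from Python follows. The helper name `enable_mps_fallback` is hypothetical; the environment variable is the one quoted in the message, and setting it before `torch` is imported is the usual recommendation rather than something this mockup specifies.

```python
import os


def enable_mps_fallback(env=os.environ):
    """Opt in to CPU fallback for ops missing on MPS; keeps any value already set."""
    env.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")
    return env["PYTORCH_ENABLE_MPS_FALLBACK"]


enable_mps_fallback()
# import torch  # assumed to happen only after the variable is set
```

Using `setdefault` leaves an explicit user choice (for example `0`, to surface missing ops as hard errors) untouched.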
556
+
557
+ <div class="err">
558
+ <div class="err-row">
559
+ <div class="err-icon">!</div>
560
+ <div class="err-body">
561
+ <div class="err-title">Reference audio rejected <span class="err-tag">unsupported format</span></div>
562
+ <div class="err-msg">Cover mode needs <code>wav</code>, <code>mp3</code>, or <code>flac</code>, ≤ 60 s, ≤ 50 MB. Got <code>m4a</code>, 4:12 long, 87 MB.</div>
563
+ <div class="err-action">Pick a different file</div>
564
+ <span class="err-action secondary">Auto-convert + trim to first 60 s</span>
565
+ </div>
566
+ </div>
567
+ </div>
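The limits quoted in this card (wav / mp3 / flac, at most 60 s, at most 50 MB) lend themselves to a small pre-upload check. The sketch below is illustrative only: `check_reference_audio` is a hypothetical helper, the format is inferred from the file extension rather than the file header, and treating 1 MB as 1024 × 1024 bytes is an assumption.

```python
ALLOWED_EXTENSIONS = {"wav", "mp3", "flac"}
MAX_DURATION_S = 60
MAX_SIZE_BYTES = 50 * 1024 * 1024  # assumes binary megabytes


def check_reference_audio(filename, duration_s, size_bytes):
    """Return a list of human-readable problems; an empty list means accepted."""
    problems = []
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or 'unknown'}")
    if duration_s > MAX_DURATION_S:
        problems.append(f"too long: {duration_s:.0f} s > {MAX_DURATION_S} s")
    if size_bytes > MAX_SIZE_BYTES:
        size_mb = size_bytes / (1024 * 1024)
        problems.append(f"too large: {size_mb:.0f} MB > 50 MB")
    return problems
```

Collecting every problem instead of stopping at the first lets the card report format, length, and size together, as the mockup message does for the `m4a` file.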
568
+
569
+ <div class="err">
570
+ <div class="err-row">
571
+ <div class="err-icon info">i</div>
572
+ <div class="err-body">
573
+ <div class="err-title">First request: warming up the pipeline (~45 s)</div>
574
+ <div class="err-msg">Loading <code>ACE-Step v1.5 XL SFT</code> weights into MPS memory. Subsequent generations in this session start instantly.</div>
575
+ </div>
576
+ </div>
577
+ </div>
578
+
579
+ <div class="progress">
580
+ <div class="progress-title">Generating… <span style="color:#6B6B6B; font-weight:400; font-size:10px;">step 21 / 50 · ETA 14 s</span></div>
581
+ <div class="progress-sub">heun sampler · CFG 5.0 · 2 LoRAs active · seed 1297183202</div>
582
+ <div class="progress-bar"><div class="fill"></div></div>
583
+ <div class="progress-meta">
584
+ <span>0:08 elapsed</span>
585
+ <span>↻ cancel</span>
586
+ </div>
587
+ </div>
588
+
589
+ <div class="options" style="margin-top:24px">
590
+ <div class="option" data-choice="approve" onclick="toggleSelect(this)">
591
+ <div class="letter">✓</div>
592
+ <div class="content">
593
+ <h3>All mockups approved: bake them into the spec</h3>
594
+ <p>Move every approved mockup into <code>docs/superpowers/specs/mockups/</code> and reference them from §8 of the spec. Then hand off to writing-plans.</p>
595
+ </div>
596
+ </div>
597
+ <div class="option" data-choice="revise" onclick="toggleSelect(this)">
598
+ <div class="letter">✎</div>
599
+ <div class="content">
600
+ <h3>Revise something specific</h3>
601
+ <p>Tell me which mockup / control / error needs work.</p>
602
+ </div>
603
+ </div>
604
+ </div>
docs/superpowers/specs/mockups/02_cover_extend.html ADDED
@@ -0,0 +1,572 @@
1
+ <h2>Cover and Extend · everything expanded</h2>
2
+ <p class="subtitle">Every accordion is open and every control is visible, showing the full depth of options. In production, "Advanced", "LM planner", and "DCW" stay collapsed by default; this is the full surface so you can verify nothing is missing.</p>
3
+
4
+ <style>
5
+ /* base */
6
+ .gm { background:#0A0A0A; color:#E5E5E5; border:1px solid #1F1F1F; border-radius:10px; padding:18px; font-size:12px; line-height:1.5; margin-top:14px; }
7
+ .gm-header { display:flex; justify-content:space-between; align-items:center; padding-bottom:10px; border-bottom:1px solid #1F1F1F; margin-bottom:14px; }
8
+ .gm-brand { font-size:15px; font-weight:600; }
9
+ .gm-cta { font-size:11px; color:#6B6B6B; }
10
+ .gm-cta strong { color:#E5E5E5; }
11
+ .gm-status { font-size:10px; color:#6B6B6B; letter-spacing:0.08em; text-transform:uppercase; }
12
+ .gm-row { display:flex; gap:16px; align-items:flex-start; }
13
+ .gm-sidebar { background:#000; padding:14px 10px; border-radius:6px; min-width:170px; position:sticky; top:0; }
14
+ .gm-side { display:block; padding:8px 10px; border-radius:4px; margin-bottom:3px; font-size:12px; color:#6B6B6B; }
15
+ .gm-side.active { background:#1A1A1A; color:#FFF; border-left:2px solid #FFF; padding-left:8px; }
16
+ .gm-side .em { margin-right:6px; }
17
+ .gm-main { flex:1; display:flex; gap:14px; align-items:flex-start; }
18
+ .gm-form { flex:1.3; background:#141414; padding:16px; border-radius:6px; }
19
+ .gm-output { flex:1; background:#141414; padding:16px; border-radius:6px; min-width:260px; position:sticky; top:0; }
20
+
21
+ /* form controls */
22
+ .gm-label { font-size:10px; text-transform:uppercase; letter-spacing:0.08em; color:#6B6B6B; margin-bottom:6px; display:flex; justify-content:space-between; align-items:center; }
23
+ .gm-label .hint { color:#5A5048; font-size:9px; text-transform:none; letter-spacing:normal; font-weight:400; }
24
+ .gm-input { background:#000; border:1px solid #2A2A2A; padding:8px 10px; border-radius:4px; color:#E5E5E5; margin-bottom:12px; font-size:11px; }
25
+ .gm-textarea { min-height:46px; }
26
+ .gm-grid2 { display:grid; grid-template-columns:1fr 1fr; gap:12px; margin-bottom:12px; }
27
+ .gm-grid3 { display:grid; grid-template-columns:1fr 1fr 1fr; gap:10px; margin-bottom:12px; }
28
+ .gm-grid4 { display:grid; grid-template-columns:1fr 1fr 1fr 1fr; gap:8px; margin-bottom:12px; }
29
+
30
+ /* slider */
31
+ .gm-slider-row { display:flex; align-items:center; gap:10px; padding:6px 8px; background:#000; border:1px solid #2A2A2A; border-radius:4px; margin-bottom:8px; font-size:11px; }
32
+ .gm-slider-row .name { color:#6B6B6B; font-size:10px; min-width:90px; }
33
+ .gm-slider { flex:1; height:3px; background:#2A2A2A; border-radius:2px; position:relative; }
34
+ .gm-slider::after { content:""; position:absolute; top:-4px; width:10px; height:10px; background:#FFF; border-radius:50%; }
35
+ .gm-slider.p10::after { left:10%; }
36
+ .gm-slider.p20::after { left:20%; }
37
+ .gm-slider.p30::after { left:30%; }
38
+ .gm-slider.p40::after { left:40%; }
39
+ .gm-slider.p50::after { left:50%; }
40
+ .gm-slider.p60::after { left:60%; }
41
+ .gm-slider.p70::after { left:70%; }
42
+ .gm-slider.p85::after { left:85%; }
43
+ .gm-slider.p93::after { left:93%; }
44
+ .gm-slider.p95::after { left:95%; }
45
+ .gm-slider-row .val { color:#FFF; font-family:monospace; font-size:11px; min-width:42px; text-align:right; }
46
+
47
+ /* toggle */
48
+ .gm-toggle { display:flex; align-items:center; gap:8px; padding:6px 10px; background:#000; border:1px solid #2A2A2A; border-radius:4px; margin-bottom:8px; font-size:11px; cursor:pointer; }
49
+ .gm-toggle .box { width:14px; height:14px; border:1px solid #2A2A2A; border-radius:3px; display:inline-flex; align-items:center; justify-content:center; font-size:9px; }
50
+ .gm-toggle.on { color:#FFF; border-color:#FFF; }
51
+ .gm-toggle.on .box { background:#FFF; color:#0A0A0A; border-color:#FFF; }
52
+
53
+ /* radio pill */
54
+ .gm-pills { display:flex; gap:0; background:#000; border:1px solid #2A2A2A; border-radius:4px; padding:2px; margin-bottom:12px; }
55
+ .gm-pill { flex:1; text-align:center; padding:6px 10px; font-size:11px; color:#6B6B6B; border-radius:3px; cursor:pointer; }
56
+ .gm-pill.on { background:#FFF; color:#0A0A0A; }
57
+
58
+ /* select */
59
+ .gm-select { background:#000; border:1px solid #2A2A2A; padding:8px 10px; border-radius:4px; color:#E5E5E5; font-size:11px; display:flex; justify-content:space-between; align-items:center; margin-bottom:8px; }
60
+ .gm-select .arrow { color:#6B6B6B; }
61
+
62
+ /* section divider */
63
+ .gm-section { border:1px solid #2A2A2A; border-radius:4px; padding:14px; margin-top:14px; background:#0F0F0F; }
64
+ .gm-section-h { display:flex; justify-content:space-between; align-items:center; margin-bottom:12px; font-size:11px; font-weight:600; }
65
+ .gm-section-h .arrow { color:#FFF; }
66
+ .gm-section-h .meta { color:#6B6B6B; font-weight:400; font-size:10px; }
67
+
68
+ .gm-chip { display:inline-block; padding:5px 10px; border-radius:14px; font-size:10px; margin-right:5px; margin-bottom:5px; background:#000; border:1px solid #2A2A2A; color:#6B6B6B; cursor:pointer; }
69
+ .gm-chip.on { border-color:#FFF; color:#FFF; }
70
+ .gm-chip.upload { border-style:dashed; color:#FFF; }
71
+
72
+ .gm-lora-row { display:flex; align-items:center; gap:10px; padding:8px 10px; background:#000; border:1px solid #2A2A2A; border-radius:4px; margin-bottom:6px; font-size:11px; }
73
+ .gm-lora-name { flex:1; }
74
+ .gm-lora-name small { color:#6B6B6B; font-weight:400; margin-left:4px; }
75
+ .gm-x { color:#6B6B6B; cursor:pointer; padding:0 4px; }
76
+
77
+ .gm-btn { background:#FFF; color:#0A0A0A; padding:12px 18px; border-radius:4px; font-weight:600; display:block; font-size:13px; text-align:center; cursor:pointer; margin-top:16px; }
78
+
79
+ /* drop zone */
80
+ .gm-dropzone { background:#000; border:2px dashed #2A2A2A; border-radius:6px; padding:14px; margin-bottom:12px; text-align:center; font-size:11px; color:#6B6B6B; }
81
+ .gm-dropzone.has-file { border-style:solid; border-color:#FFF; color:#FFF; text-align:left; padding:10px 12px; }
82
+ .gm-dropzone .filename { font-weight:600; }
83
+ .gm-dropzone .meta { color:#6B6B6B; font-size:9px; margin-top:2px; font-weight:400; }
84
+ .gm-dropzone .miniwave { height:18px; background:repeating-linear-gradient(90deg, currentColor 0 1px, transparent 1px 3px); margin-top:6px; opacity:0.5; }
85
+
86
+ /* output */
87
+ .gm-waveform { height:60px; background:#000; border:1px solid #2A2A2A; border-radius:4px; display:flex; align-items:center; justify-content:center; gap:2px; padding:8px; margin-bottom:10px; }
88
+ .gm-bar { width:2px; background:#E5E5E5; }
89
+ .gm-player-controls { display:flex; align-items:center; gap:10px; color:#6B6B6B; font-size:10px; margin-bottom:14px; }
90
+ .gm-play { width:28px; height:28px; border-radius:50%; background:#FFF; color:#0A0A0A; display:flex; align-items:center; justify-content:center; font-size:11px; }
91
+ .gm-stems { display:grid; grid-template-columns:1fr 1fr; gap:6px; margin-bottom:10px; }
92
+ .gm-stem { background:#000; border:1px solid #2A2A2A; border-radius:4px; padding:6px 10px; font-size:10px; display:flex; justify-content:space-between; align-items:center; }
93
+ .gm-stem .dl { color:#FFF; cursor:pointer; }
94
+ .gm-meta-block { background:#000; border:1px solid #2A2A2A; border-radius:4px; padding:8px 10px; font-size:9px; color:#6B6B6B; font-family:monospace; line-height:1.6; max-height:140px; overflow:hidden; margin-top:8px; }
95
+ .gm-actions { display:flex; flex-wrap:wrap; gap:6px; margin-bottom:10px; }
96
+ .gm-secondary { border:1px solid #2A2A2A; color:#E5E5E5; padding:6px 12px; border-radius:4px; font-size:10px; cursor:pointer; }
97
+ </style>
98
+
99
+
100
+ <h3 style="margin-top:14px">🎤 Cover · fully expanded</h3>
101
+
102
+ <div class="gm">
103
+ <div class="gm-header">
104
+ <div>
105
+ <div class="gm-brand">ACE Music Studio<span style="color:#FFF">.</span></div>
106
+ <div class="gm-cta" style="margin-top:2px">Built with <span style="color:#FFF">♥</span>. <strong>Drop a like</strong> · Follow <strong>@techfreakworm</strong> for what's next.</div>
107
+ </div>
108
+ <div class="gm-status">ready · MPS · M5 Max</div>
109
+ </div>
110
+
111
+ <div class="gm-row">
112
+ <div class="gm-sidebar">
113
+ <div class="gm-side"><span class="em">🎵</span>Generate</div>
114
+ <div class="gm-side active"><span class="em">🎤</span>Cover</div>
115
+ <div class="gm-side"><span class="em">⏩</span>Extend</div>
116
+ <div class="gm-side"><span class="em">✏️</span>Edit</div>
117
+ <div class="gm-side"><span class="em">✍️</span>Lyrics</div>
118
+ <div style="border-top:1px solid #1F1F1F; margin:14px 0 10px; padding-top:10px; font-size:9px; color:#6B6B6B; text-transform:uppercase; letter-spacing:0.1em;">History · session</div>
119
+ <div style="font-size:10px; color:#6B6B6B; padding:4px 8px;">▶ psy_cover · just now</div>
120
+ <div style="font-size:10px; color:#6B6B6B; padding:4px 8px;">▶ lofi_remix · 3m</div>
121
+ <div style="font-size:10px; color:#6B6B6B; padding:4px 8px;">▶ ambient_v2 · 12m</div>
122
+ </div>
123
+
124
+ <div class="gm-main">
125
+ <div class="gm-form">
126
+
127
+ <div class="gm-label">1 · Reference audio <span class="hint">wav / mp3 / flac · ≤ 60 s · matters most for first 12 s</span></div>
128
+ <div class="gm-dropzone has-file">
129
+ <div class="filename">↑ reference_psy_track.wav</div>
130
+ <div class="meta">44.1 kHz · stereo · 28.4 s · 2.1 MB</div>
131
+ <div class="miniwave"></div>
132
+ </div>
133
+
134
+ <div class="gm-label">2 · New style prompt <span class="hint">leave blank to fully inherit reference style</span></div>
135
+ <div class="gm-input">faster, more aggressive leads, club-ready</div>
136
+
137
+ <div class="gm-label">3 · New lyrics <span class="hint">use [verse] [chorus] [bridge] tags · open Lyrics tab to draft with AI</span></div>
138
+ <div class="gm-input gm-textarea">[intro] driving acid bassline<br>[verse] new lyrics over the reference style<br>[chorus] one more time, one more time<br>[outro] ...</div>
139
+
140
+ <div class="gm-grid2">
141
+ <div>
142
+ <div class="gm-label">Duration <span class="hint">seconds</span></div>
143
+ <div class="gm-slider-row"><span class="name">5 – 240 s</span><span class="gm-slider p10"></span><span class="val">30</span></div>
144
+ </div>
145
+ <div>
146
+ <div class="gm-label">Vocal mode</div>
147
+ <div class="gm-pills">
148
+ <div class="gm-pill on">With vocals</div>
149
+ <div class="gm-pill">Instrumental</div>
150
+ </div>
151
+ </div>
152
+ </div>
153
+
154
+ <div class="gm-label">Cover-specific <span class="hint">how the reference influences the output</span></div>
155
+ <div class="gm-slider-row"><span class="name">Cover strength</span><span class="gm-slider p93"></span><span class="val">0.93</span></div>
156
+ <div class="gm-slider-row"><span class="name">Cover noise</span><span class="gm-slider p10" style="--p:0.05;"></span><span class="val">0.00</span></div>
157
+
158
+ <!-- LoRA section, expanded -->
159
+ <div class="gm-section">
160
+ <div class="gm-section-h">
161
+ <span>LoRA stack <span class="meta">· 2 active</span></span>
162
+ <span class="arrow">▾</span>
163
+ </div>
164
+
165
+ <div class="gm-label">Bundled presets <span class="hint">click to toggle</span></div>
166
+ <div style="margin-bottom:12px;">
167
+ <span class="gm-chip on">RapMachine</span>
168
+ <span class="gm-chip">Chinese Rap</span>
169
+ <span class="gm-chip">Lyric2Vocal</span>
170
+ <span class="gm-chip">Text2Samples</span>
171
+ </div>
172
+
173
+ <div class="gm-label">Active stack <span class="hint">applied in order, top first</span></div>
174
+ <div class="gm-lora-row">
175
+ <span class="gm-lora-name">RapMachine <small>· preset</small></span>
176
+ <span class="gm-slider p85" style="width:100px"></span>
177
+ <span class="val" style="color:#FFF; font-family:monospace; font-size:11px;">0.85</span>
178
+ <span class="gm-x">×</span>
179
+ </div>
180
+ <div class="gm-lora-row">
181
+ <span class="gm-lora-name">psytrance_v2 <small>· custom · 47 MB · rank 64</small></span>
182
+ <span class="gm-slider p95" style="width:100px"></span>
183
+ <span class="val" style="color:#FFF; font-family:monospace; font-size:11px;">0.95</span>
184
+ <span class="gm-x">×</span>
185
+ </div>
186
+
187
+ <div style="margin-top:10px;">
188
+ <span class="gm-chip upload">↑ drop .safetensors here or click</span>
189
+ </div>
190
+ </div>
191
+
192
+ <!-- Advanced section, expanded -->
193
+ <div class="gm-section">
194
+ <div class="gm-section-h">
195
+ <span>Advanced</span>
196
+ <span class="arrow">▾</span>
197
+ </div>
198
+
199
+ <div class="gm-grid3">
200
+ <div><div class="gm-label">BPM</div><div class="gm-input" style="margin-bottom:0">135</div></div>
201
+ <div><div class="gm-label">Key / scale</div><div class="gm-input" style="margin-bottom:0">auto</div></div>
202
+ <div><div class="gm-label">Time sig</div><div class="gm-input" style="margin-bottom:0">4 / 4</div></div>
203
+ </div>
204
+
205
+ <div class="gm-grid2">
206
+ <div><div class="gm-label">Sampler</div><div class="gm-select">heun <span class="arrow">▾</span></div></div>
207
+ <div><div class="gm-label">Vocal language</div><div class="gm-select">auto <span class="arrow">▾</span></div></div>
208
+ </div>
209
+
210
+ <div class="gm-slider-row"><span class="name">Inference steps</span><span class="gm-slider p20"></span><span class="val">50</span></div>
211
+ <div class="gm-slider-row"><span class="name">CFG scale</span><span class="gm-slider p40"></span><span class="val">5.0</span></div>
212
+ <div class="gm-slider-row"><span class="name">Shift</span><span class="gm-slider p30"></span><span class="val">3</span></div>
213
+
214
+ <div class="gm-label" style="margin-top:8px">Negative prompt <span class="hint">things to avoid in the output</span></div>
215
+ <div class="gm-input gm-textarea" style="font-size:10px">bitcrushed, aliasing, jazz, pop, vocal hooks, slow tempo</div>
216
+
217
+ <div class="gm-grid2">
218
+ <div><div class="gm-label">Audio format</div><div class="gm-pills"><div class="gm-pill on">mp3 320</div><div class="gm-pill">wav 44.1</div></div></div>
219
+ <div><div class="gm-label">Loudness</div><div class="gm-toggle on"><span class="box">✓</span> Normalize to -14 LUFS</div></div>
220
+ </div>
221
+
222
+ <div class="gm-grid2">
223
+ <div><div class="gm-label">Fade in</div><div class="gm-input" style="margin-bottom:0">0.0 s</div></div>
224
+ <div><div class="gm-label">Fade out</div><div class="gm-input" style="margin-bottom:0">0.0 s</div></div>
225
+ </div>
226
+
227
+ <div class="gm-grid2">
228
+ <div><div class="gm-label">Seed</div><div class="gm-input" style="margin-bottom:0; font-family:monospace">1297183202</div></div>
229
+ <div><div class="gm-label">&nbsp;</div><div class="gm-toggle"><span class="box"></span> Lock seed · re-use across retakes</div></div>
230
+ </div>
231
+ </div>
232
+
233
+ <!-- LM planner section, expanded -->
234
+ <div class="gm-section">
235
+ <div class="gm-section-h">
236
+ <span>LM planner · Qwen3 thinking</span>
237
+ <span class="arrow">▾</span>
238
+ </div>
239
+
240
+ <div class="gm-toggle on"><span class="box">✓</span> Thinking enabled <span style="color:#6B6B6B; font-size:9px; margin-left:auto">+ slower but better structure</span></div>
241
+ <div class="gm-toggle on"><span class="box">✓</span> Constrained decoding</div>
242
+
243
+ <div class="gm-grid4" style="margin-top:8px">
244
+ <div><div class="gm-label">Temp</div><div class="gm-input" style="margin-bottom:0">0.85</div></div>
245
+ <div><div class="gm-label">Top-k</div><div class="gm-input" style="margin-bottom:0">0</div></div>
246
+ <div><div class="gm-label">Top-p</div><div class="gm-input" style="margin-bottom:0">0.90</div></div>
247
+ <div><div class="gm-label">LM CFG</div><div class="gm-input" style="margin-bottom:0">2</div></div>
248
+ </div>
249
+
250
+ <div class="gm-label">CoT pipeline toggles <span class="hint">which fields the LM rewrites</span></div>
251
+ <div class="gm-grid4">
252
+ <div class="gm-toggle"><span class="box"></span> metas</div>
253
+ <div class="gm-toggle"><span class="box"></span> caption</div>
254
+ <div class="gm-toggle"><span class="box"></span> lyrics</div>
255
+ <div class="gm-toggle"><span class="box"></span> language</div>
256
+ </div>
257
+
258
+ <div class="gm-label">LM negative prompt</div>
259
+ <div class="gm-input" style="font-size:10px">happy chords, major scale</div>
260
+ </div>
261
+
262
+ <!-- DCW section, expanded -->
263
+ <div class="gm-section">
264
+ <div class="gm-section-h">
265
+ <span>DCW · dynamic CFG warping</span>
266
+ <span class="arrow">▾</span>
267
+ </div>
268
+
269
+ <div class="gm-toggle on"><span class="box">✓</span> DCW enabled</div>
270
+
271
+ <div class="gm-grid3">
272
+ <div><div class="gm-label">Mode</div><div class="gm-select">double <span class="arrow">▾</span></div></div>
273
+ <div><div class="gm-label">Wavelet</div><div class="gm-select">haar <span class="arrow">▾</span></div></div>
274
+ <div><div class="gm-label">&nbsp;</div><div style="font-size:9px; color:#6B6B6B; padding-top:8px;">leave defaults if unsure</div></div>
275
+ </div>
276
+
277
+ <div class="gm-slider-row"><span class="name">DCW scaler</span><span class="gm-slider p10"></span><span class="val">0.02</span></div>
278
+ <div class="gm-slider-row"><span class="name">High scaler</span><span class="gm-slider p10"></span><span class="val">0.06</span></div>
279
+ </div>
280
+
281
+ <div class="gm-btn">▶ Generate cover · est. ~35 s on M5 Max</div>
282
+ </div>
283
+
284
+ <!-- Output panel -->
285
+ <div class="gm-output">
286
+ <div class="gm-label" style="margin-bottom:10px">Output · cover · 30 s · seed 1297183202</div>
287
+
288
+ <div class="gm-toggle"><span class="box"></span> Compare side-by-side with reference</div>
289
+
290
+ <div class="gm-waveform">
291
+ <div class="gm-bar" style="height:22%"></div><div class="gm-bar" style="height:54%"></div><div class="gm-bar" style="height:78%"></div><div class="gm-bar" style="height:42%"></div><div class="gm-bar" style="height:62%"></div><div class="gm-bar" style="height:88%"></div><div class="gm-bar" style="height:32%"></div><div class="gm-bar" style="height:70%"></div><div class="gm-bar" style="height:50%"></div><div class="gm-bar" style="height:84%"></div><div class="gm-bar" style="height:64%"></div><div class="gm-bar" style="height:38%"></div><div class="gm-bar" style="height:74%"></div><div class="gm-bar" style="height:46%"></div><div class="gm-bar" style="height:58%"></div><div class="gm-bar" style="height:80%"></div><div class="gm-bar" style="height:36%"></div><div class="gm-bar" style="height:68%"></div>
292
+ </div>
293
+
294
+ <div class="gm-player-controls">
295
+ <span class="gm-play">▶</span>
296
+ <span>0:00 / 0:30</span>
297
+ <span style="margin-left:auto; cursor:pointer; color:#FFF">↻ retake · new seed</span>
298
+ </div>
299
+
300
+ <div class="gm-label">Stems · Demucs htdemucs_ft</div>
301
+ <div class="gm-stems">
302
+ <div class="gm-stem"><span>vocals · 1.8 MB</span><span class="dl">↓</span></div>
303
+ <div class="gm-stem"><span>drums · 1.6 MB</span><span class="dl">↓</span></div>
304
+ <div class="gm-stem"><span>bass · 1.4 MB</span><span class="dl">↓</span></div>
305
+ <div class="gm-stem"><span>other · 1.7 MB</span><span class="dl">↓</span></div>
306
+ </div>
307
+
308
+ <div class="gm-label">Export</div>
309
+ <div class="gm-actions">
310
+ <span class="gm-secondary">↓ mp3 · 320k · 1.2 MB</span>
310
+ <span class="gm-secondary">↓ wav · 44.1k · 5.3 MB</span>
311
+ <span class="gm-secondary">↓ stems zip</span>
312
+ <span class="gm-secondary">{ } meta json</span>
313
+ <span class="gm-secondary">↗ copy share link</span>
315
+ </div>
316
+
317
+ <div class="gm-label" style="margin-top:14px">Metadata · for reproducibility</div>
318
+ <div class="gm-meta-block">
319
+ {<br>
320
+ &nbsp;&nbsp;"mode": "cover",<br>
321
+ &nbsp;&nbsp;"prompt": "faster, more aggressive leads, club-ready",<br>
322
+ &nbsp;&nbsp;"lyrics_first_line": "[intro] driving acid bassline...",<br>
323
+ &nbsp;&nbsp;"ref_audio_sha256": "a4f1...d29c",<br>
324
+ &nbsp;&nbsp;"duration_s": 30, "bpm": 135, "key": "auto",<br>
325
+ &nbsp;&nbsp;"sampler": "heun", "steps": 50, "cfg": 5.0, "shift": 3,<br>
326
+ &nbsp;&nbsp;"audio_cover_strength": 0.93, "cover_noise_strength": 0.0,<br>
327
+ &nbsp;&nbsp;"lm": {"thinking": true, "temp": 0.85, "top_p": 0.9, "cfg": 2,<br>
328
+ &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"cot": {"metas": false, "caption": false, "lyrics": false, "language": false}},<br>
329
+ &nbsp;&nbsp;"dcw": {"enabled": true, "mode": "double", "scaler": 0.02, "high_scaler": 0.06, "wavelet": "haar"},<br>
330
+ &nbsp;&nbsp;"loras": [<br>
331
+ &nbsp;&nbsp;&nbsp;&nbsp;{"name": "RapMachine", "scale": 0.85, "sha256": "b7e2..."},<br>
332
+ &nbsp;&nbsp;&nbsp;&nbsp;{"name": "psytrance_v2", "scale": 0.95, "sha256": "0c94..."}<br>
333
+ &nbsp;&nbsp;],<br>
334
+ &nbsp;&nbsp;"seed": 1297183202,<br>
335
+ &nbsp;&nbsp;"output_sha256": "f33a...19b8"<br>
336
+ }
337
+ </div>
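The metadata block above pairs the generation parameters with SHA-256 digests of the reference and output audio. A minimal sketch of producing such a record is shown here; `build_metadata` is a hypothetical helper, and hashing the raw file bytes (rather than decoded samples) is an assumption about what the digests cover.

```python
import hashlib
import json


def build_metadata(params, ref_audio_bytes, output_bytes):
    """Return a JSON string with the parameters plus content hashes of both files."""
    meta = dict(params)  # copy so the caller's dict is not mutated
    meta["ref_audio_sha256"] = hashlib.sha256(ref_audio_bytes).hexdigest()
    meta["output_sha256"] = hashlib.sha256(output_bytes).hexdigest()
    # sort_keys keeps the file byte-stable across runs with identical inputs
    return json.dumps(meta, indent=2, sort_keys=True)
```

For example, `build_metadata({"mode": "cover", "seed": 1297183202}, ref_bytes, out_bytes)` yields a record that can be diffed between retakes to confirm which parameter changed.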
338
+ </div>
339
+ </div>
340
+ </div>
341
+ </div>
342
+
343
+ <h3 style="margin-top:30px">⏩ Extend · fully expanded</h3>
344
+
345
+ <div class="gm">
346
+ <div class="gm-header">
347
+ <div>
348
+ <div class="gm-brand">ACE Music Studio<span style="color:#FFF">.</span></div>
349
+ <div class="gm-cta" style="margin-top:2px">Built with <span style="color:#FFF">♥</span>. <strong>Drop a like</strong> · Follow <strong>@techfreakworm</strong> for what's next.</div>
350
+ </div>
351
+ <div class="gm-status">ready · MPS · M5 Max</div>
352
+ </div>
353
+
354
+ <div class="gm-row">
355
+ <div class="gm-sidebar">
356
+ <div class="gm-side"><span class="em">🎵</span>Generate</div>
357
+ <div class="gm-side"><span class="em">🎤</span>Cover</div>
358
+ <div class="gm-side active"><span class="em">⏩</span>Extend</div>
359
+ <div class="gm-side"><span class="em">✏️</span>Edit</div>
360
+ <div class="gm-side"><span class="em">✍️</span>Lyrics</div>
361
+ </div>
362
+
363
+ <div class="gm-main">
364
+ <div class="gm-form">
365
+
366
+ <div class="gm-label">1 · Seed audio <span class="hint">what to continue · wav / mp3 / flac · ≤ 240 s</span></div>
367
+ <div class="gm-dropzone has-file">
368
+ <div class="filename">↑ unfinished_track_v3.wav</div>
369
+ <div class="meta">44.1 kHz · stereo · 1:42 · 18.0 MB · BPM detected 135 · key C minor</div>
370
+ <div class="miniwave"></div>
371
+ </div>
372
+
373
+ <div class="gm-label">2 · Extension prompt <span class="hint">style hint for what comes next</span></div>
374
+ <div class="gm-input">build to climax, layered acid leads, then breakdown</div>
375
+
376
+ <div class="gm-label">3 · Extension lyrics <span class="hint">optional · use [verse] [chorus] tags · blank = instrumental continuation</span></div>
377
+ <div class="gm-input gm-textarea">[bridge] the drop is coming...<br>[chorus] one more time, one more time</div>
378
+
379
+ <div class="gm-grid2">
380
+ <div>
381
+ <div class="gm-label">Extra duration <span class="hint">seconds</span></div>
382
+ <div class="gm-slider-row"><span class="name">5 – 120 s</span><span class="gm-slider p50"></span><span class="val">60</span></div>
383
+ </div>
384
+ <div>
385
+ <div class="gm-label">Vocal mode</div>
386
+ <div class="gm-pills"><div class="gm-pill on">With vocals</div><div class="gm-pill">Instrumental</div></div>
387
+ </div>
388
+ </div>
389
+
390
+ <div class="gm-label">Extend-specific <span class="hint">how the seam is handled</span></div>
391
+ <div class="gm-grid2">
392
+ <div><div class="gm-label">Repaint mode</div><div class="gm-select">balanced <span class="arrow">▾</span></div></div>
393
+ <div><div class="gm-label">Chunk mask</div><div class="gm-select">auto <span class="arrow">▾</span></div></div>
394
+ </div>
395
+ <div class="gm-slider-row"><span class="name">Repaint strength</span><span class="gm-slider p50"></span><span class="val">0.50</span></div>
396
+ <div class="gm-slider-row"><span class="name">Latent crossfade frames</span><span class="gm-slider p20"></span><span class="val">10</span></div>
397
+ <div class="gm-slider-row"><span class="name">WAV crossfade seconds</span><span class="gm-slider p10"></span><span class="val">2.0</span></div>
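The "WAV crossfade seconds" control implies blending the end of the seed audio into the start of the extension. How the app does this is not stated in the mockup; the sketch below assumes a plain linear crossfade over mono sample lists, with `crossfade` as a hypothetical helper and the overlap given in samples (seconds × sample rate, for example 2.0 × 44100).

```python
def crossfade(head, tail, overlap):
    """Blend the last `overlap` samples of head into the first `overlap` of tail."""
    overlap = max(0, min(overlap, len(head), len(tail)))
    keep = len(head) - overlap
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of the incoming audio, rising linearly
        blended.append(head[keep + i] * (1 - w) + tail[i] * w)
    return head[:keep] + blended + tail[overlap:]
```

The result is `len(head) + len(tail) - overlap` samples long, so a 2.0 s crossfade shortens the joined track by 2.0 s relative to a hard cut. An equal-power curve is a common alternative when the two segments are uncorrelated.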
398
+
399
+ <!-- LoRA section, expanded -->
400
+ <div class="gm-section">
401
+ <div class="gm-section-h"><span>LoRA stack <span class="meta">· 1 active</span></span><span class="arrow">▾</span></div>
402
+
403
+ <div class="gm-label">Bundled presets</div>
404
+ <div style="margin-bottom:12px;">
405
+ <span class="gm-chip">RapMachine</span>
406
+ <span class="gm-chip">Chinese Rap</span>
407
+ <span class="gm-chip">Lyric2Vocal</span>
408
+ <span class="gm-chip">Text2Samples</span>
409
+ </div>
410
+
411
+ <div class="gm-label">Active stack</div>
412
+ <div class="gm-lora-row">
413
+ <span class="gm-lora-name">psytrance_v2 <small>· custom · 47 MB</small></span>
414
+ <span class="gm-slider p95" style="width:100px"></span>
415
+ <span class="val" style="color:#FFF; font-family:monospace; font-size:11px;">0.95</span>
416
+ <span class="gm-x">×</span>
417
+ </div>
418
+
419
+ <div style="margin-top:10px;">
420
+ <span class="gm-chip upload">↑ drop .safetensors here</span>
421
+ </div>
422
+ </div>
423
+
424
+ <!-- Advanced section, expanded -->
425
+ <div class="gm-section">
426
+ <div class="gm-section-h"><span>Advanced</span><span class="arrow">▾</span></div>
427
+
428
+ <div class="gm-grid3">
429
+ <div><div class="gm-label">BPM <span class="hint">inherits from seed if blank</span></div><div class="gm-input" style="margin-bottom:0">135</div></div>
430
+ <div><div class="gm-label">Key / scale</div><div class="gm-input" style="margin-bottom:0">C minor</div></div>
431
+ <div><div class="gm-label">Time sig</div><div class="gm-input" style="margin-bottom:0">4 / 4</div></div>
432
+ </div>
433
+
434
+ <div class="gm-grid2">
435
+ <div><div class="gm-label">Sampler</div><div class="gm-select">heun <span class="arrow">▾</span></div></div>
436
+ <div><div class="gm-label">Vocal language</div><div class="gm-select">en <span class="arrow">▾</span></div></div>
437
+ </div>
438
+
439
+ <div class="gm-slider-row"><span class="name">Inference steps</span><span class="gm-slider p20"></span><span class="val">50</span></div>
440
+ <div class="gm-slider-row"><span class="name">CFG scale</span><span class="gm-slider p40"></span><span class="val">5.0</span></div>
441
+ <div class="gm-slider-row"><span class="name">Shift</span><span class="gm-slider p30"></span><span class="val">3</span></div>
442
+
443
+ <div class="gm-label" style="margin-top:8px">Negative prompt</div>
444
+ <div class="gm-input gm-textarea" style="font-size:10px">bitcrushed, aliasing, lo-fi hiss</div>
445
+
446
+ <div class="gm-grid2">
447
+ <div><div class="gm-label">Audio format</div><div class="gm-pills"><div class="gm-pill on">mp3 320</div><div class="gm-pill">wav 44.1</div></div></div>
448
+ <div><div class="gm-label">Loudness</div><div class="gm-toggle on"><span class="box">✓</span> -14 LUFS</div></div>
449
+ </div>
450
+
451
+ <div class="gm-grid2">
452
+ <div><div class="gm-label">Seed</div><div class="gm-input" style="margin-bottom:0; font-family:monospace">9911</div></div>
453
+ <div><div class="gm-label">&nbsp;</div><div class="gm-toggle"><span class="box"></span> Lock seed</div></div>
454
+ </div>
455
+ </div>
456
+
457
+ <!-- LM planner section, expanded -->
458
+ <div class="gm-section">
459
+ <div class="gm-section-h"><span>LM planner · Qwen3 thinking</span><span class="arrow">▾</span></div>
460
+ <div class="gm-toggle on"><span class="box">✓</span> Thinking enabled</div>
461
+ <div class="gm-toggle on"><span class="box">✓</span> Constrained decoding</div>
462
+ <div class="gm-grid4" style="margin-top:8px">
463
+ <div><div class="gm-label">Temp</div><div class="gm-input" style="margin-bottom:0">0.85</div></div>
464
+ <div><div class="gm-label">Top-k</div><div class="gm-input" style="margin-bottom:0">0</div></div>
465
+ <div><div class="gm-label">Top-p</div><div class="gm-input" style="margin-bottom:0">0.90</div></div>
466
+ <div><div class="gm-label">LM CFG</div><div class="gm-input" style="margin-bottom:0">2</div></div>
467
+ </div>
468
+ <div class="gm-label">CoT pipeline toggles</div>
469
+ <div class="gm-grid4">
470
+ <div class="gm-toggle"><span class="box"></span> metas</div>
471
+ <div class="gm-toggle"><span class="box"></span> caption</div>
472
+ <div class="gm-toggle"><span class="box"></span> lyrics</div>
473
+ <div class="gm-toggle"><span class="box"></span> language</div>
474
+ </div>
475
+ </div>
476
+
477
+ <!-- DCW section, expanded -->
478
+ <div class="gm-section">
479
+ <div class="gm-section-h"><span>DCW · dynamic CFG warping</span><span class="arrow">▾</span></div>
480
+ <div class="gm-toggle on"><span class="box">✓</span> DCW enabled</div>
481
+ <div class="gm-grid2">
482
+ <div><div class="gm-label">Mode</div><div class="gm-select">double <span class="arrow">▾</span></div></div>
483
+ <div><div class="gm-label">Wavelet</div><div class="gm-select">haar <span class="arrow">▾</span></div></div>
484
+ </div>
485
+ <div class="gm-slider-row"><span class="name">DCW scaler</span><span class="gm-slider p10"></span><span class="val">0.02</span></div>
486
+ <div class="gm-slider-row"><span class="name">High scaler</span><span class="gm-slider p10"></span><span class="val">0.06</span></div>
487
+ </div>
488
+
489
+ <div class="gm-btn">▶ Extend · est. ~50 s · output 2:42 total</div>
490
+ </div>
491
+
492
+ <!-- Output panel -->
493
+ <div class="gm-output">
494
+ <div class="gm-label" style="margin-bottom:10px">Output · extended · 2:42 · seed 9911</div>
495
+
496
+ <div class="gm-toggle on"><span class="box">✓</span> Show seed boundary marker</div>
497
+
498
+ <div class="gm-waveform" style="position:relative">
499
+ <div class="gm-bar" style="height:32%"></div><div class="gm-bar" style="height:48%"></div><div class="gm-bar" style="height:64%"></div><div class="gm-bar" style="height:42%"></div><div class="gm-bar" style="height:58%"></div><div class="gm-bar" style="height:38%"></div><div class="gm-bar" style="height:52%"></div><div class="gm-bar" style="height:46%"></div><div class="gm-bar" style="height:34%; opacity:0.5"></div>
500
+ <div style="border-left:1px dashed #FFF; height:48px;"></div>
501
+ <div class="gm-bar" style="height:62%"></div><div class="gm-bar" style="height:78%"></div><div class="gm-bar" style="height:92%"></div><div class="gm-bar" style="height:84%"></div><div class="gm-bar" style="height:70%"></div><div class="gm-bar" style="height:58%"></div><div class="gm-bar" style="height:40%"></div>
502
+ <div style="position:absolute; bottom:-2px; left:50%; transform:translateX(-50%); font-size:8px; color:#FFF; background:#0A0A0A; padding:0 4px;">↑ seed ends · 1:42</div>
503
+ </div>
504
+
505
+ <div class="gm-player-controls">
506
+ <span class="gm-play">▶</span>
507
+ <span>0:00 / 2:42</span>
508
+ <span style="margin-left:auto; cursor:pointer; color:#FFF">↻ retake</span>
509
+ </div>
510
+
511
+ <div class="gm-label">Stems · Demucs</div>
512
+ <div class="gm-stems">
513
+ <div class="gm-stem"><span>vocals</span><span class="dl">↓</span></div>
514
+ <div class="gm-stem"><span>drums</span><span class="dl">↓</span></div>
515
+ <div class="gm-stem"><span>bass</span><span class="dl">↓</span></div>
516
+ <div class="gm-stem"><span>other</span><span class="dl">↓</span></div>
517
+ </div>
518
+
519
+ <div class="gm-label">Export</div>
520
+ <div class="gm-actions">
521
+ <span class="gm-secondary">↓ full mp3 · 6.3 MB</span>
522
+ <span class="gm-secondary">↓ extension-only mp3 · 2.4 MB</span>
523
+ <span class="gm-secondary">↓ full wav</span>
524
+ <span class="gm-secondary">↓ stems zip</span>
525
+ <span class="gm-secondary">{ } meta json</span>
526
+ <span class="gm-secondary">↗ share link</span>
527
+ </div>
528
+
529
+ <div class="gm-label" style="margin-top:14px">Metadata</div>
530
+ <div class="gm-meta-block">
531
+ {<br>
532
+ &nbsp;&nbsp;"mode": "extend",<br>
533
+ &nbsp;&nbsp;"seed_audio_sha256": "e5c0...21ed",<br>
534
+ &nbsp;&nbsp;"seed_duration_s": 102,<br>
535
+ &nbsp;&nbsp;"extension_prompt": "build to climax, layered acid leads...",<br>
536
+ &nbsp;&nbsp;"extension_lyrics_first_line": "[bridge] the drop is coming...",<br>
537
+ &nbsp;&nbsp;"extra_duration_s": 60,<br>
538
+ &nbsp;&nbsp;"repaint_mode": "balanced",<br>
539
+ &nbsp;&nbsp;"repaint_strength": 0.5,<br>
540
+ &nbsp;&nbsp;"latent_crossfade_frames": 10,<br>
541
+ &nbsp;&nbsp;"wav_crossfade_s": 2.0,<br>
542
+ &nbsp;&nbsp;"chunk_mask_mode": "auto",<br>
543
+ &nbsp;&nbsp;"bpm": 135, "key": "C minor",<br>
544
+ &nbsp;&nbsp;"sampler": "heun", "steps": 50, "cfg": 5.0, "shift": 3,<br>
545
+ &nbsp;&nbsp;"lm": {"thinking": true, "temp": 0.85, "top_p": 0.9},<br>
546
+ &nbsp;&nbsp;"dcw": {"enabled": true, "mode": "double", "scaler": 0.02},<br>
547
+ &nbsp;&nbsp;"loras": [{"name": "psytrance_v2", "scale": 0.95, "sha256": "0c94..."}],<br>
548
+ &nbsp;&nbsp;"seed": 9911,<br>
549
+ &nbsp;&nbsp;"output_sha256": "9fbc...4071"<br>
550
+ }
551
+ </div>
552
+ </div>
553
+ </div>
554
+ </div>
555
+ </div>
556
+
557
+ <div class="options" style="margin-top:24px">
558
+ <div class="option" data-choice="approve" onclick="toggleSelect(this)">
559
+ <div class="letter">✓</div>
560
+ <div class="content">
561
+ <h3>Both look right: show Edit + Lyrics + Generate (refreshed) next</h3>
562
+ <p>Cover and Extend hierarchies + control depth are correct. Continue.</p>
563
+ </div>
564
+ </div>
565
+ <div class="option" data-choice="revise" onclick="toggleSelect(this)">
566
+ <div class="letter">✎</div>
567
+ <div class="content">
568
+ <h3>Revise: tell me which control / section</h3>
569
+ <p>Reply in terminal with specifics. I'll redo a single section without re-pushing the whole thing.</p>
570
+ </div>
571
+ </div>
572
+ </div>
docs/superpowers/specs/mockups/03_edit_lyrics.html ADDED
@@ -0,0 +1,517 @@
1
+ <h2>Edit and Lyrics · everything expanded</h2>
2
+ <p class="subtitle">Edit has two sub-modes: Repaint (segment regeneration) and Flow Morph (caption-to-caption transformation). The Lyrics tab uses Qwen 2.5 7B Instruct to draft structurally tagged lyrics for the song modes.</p>
3
+
4
+ <style>
5
+ .gm { background:#0A0A0A; color:#E5E5E5; border:1px solid #1F1F1F; border-radius:10px; padding:18px; font-size:12px; line-height:1.5; margin-top:14px; }
6
+ .gm-header { display:flex; justify-content:space-between; align-items:center; padding-bottom:10px; border-bottom:1px solid #1F1F1F; margin-bottom:14px; }
7
+ .gm-brand { font-size:15px; font-weight:600; }
8
+ .gm-cta { font-size:11px; color:#6B6B6B; }
9
+ .gm-cta strong { color:#E5E5E5; }
10
+ .gm-status { font-size:10px; color:#6B6B6B; letter-spacing:0.08em; text-transform:uppercase; }
11
+ .gm-row { display:flex; gap:16px; align-items:flex-start; }
12
+ .gm-sidebar { background:#000; padding:14px 10px; border-radius:6px; min-width:170px; }
13
+ .gm-side { display:block; padding:8px 10px; border-radius:4px; margin-bottom:3px; font-size:12px; color:#6B6B6B; }
14
+ .gm-side.active { background:#1A1A1A; color:#FFF; border-left:2px solid #FFF; padding-left:8px; }
15
+ .gm-side .em { margin-right:6px; }
16
+ .gm-main { flex:1; display:flex; gap:14px; align-items:flex-start; }
17
+ .gm-form { flex:1.3; background:#141414; padding:16px; border-radius:6px; }
18
+ .gm-output { flex:1; background:#141414; padding:16px; border-radius:6px; min-width:260px; }
19
+ .gm-label { font-size:10px; text-transform:uppercase; letter-spacing:0.08em; color:#6B6B6B; margin-bottom:6px; display:flex; justify-content:space-between; align-items:center; }
20
+ .gm-label .hint { color:#5A5048; font-size:9px; text-transform:none; letter-spacing:normal; font-weight:400; }
21
+ .gm-input { background:#000; border:1px solid #2A2A2A; padding:8px 10px; border-radius:4px; color:#E5E5E5; margin-bottom:12px; font-size:11px; }
22
+ .gm-textarea { min-height:46px; }
23
+ .gm-grid2 { display:grid; grid-template-columns:1fr 1fr; gap:12px; margin-bottom:12px; }
24
+ .gm-grid3 { display:grid; grid-template-columns:1fr 1fr 1fr; gap:10px; margin-bottom:12px; }
25
+ .gm-grid4 { display:grid; grid-template-columns:1fr 1fr 1fr 1fr; gap:8px; margin-bottom:12px; }
26
+ .gm-slider-row { display:flex; align-items:center; gap:10px; padding:6px 8px; background:#000; border:1px solid #2A2A2A; border-radius:4px; margin-bottom:8px; font-size:11px; }
27
+ .gm-slider-row .name { color:#6B6B6B; font-size:10px; min-width:130px; }
28
+ .gm-slider { flex:1; height:3px; background:#2A2A2A; border-radius:2px; position:relative; }
29
+ .gm-slider::after { content:""; position:absolute; top:-4px; width:10px; height:10px; background:#FFF; border-radius:50%; }
30
+ .gm-slider.p10::after { left:10%; }
31
+ .gm-slider.p20::after { left:20%; }
32
+ .gm-slider.p25::after { left:25%; }
33
+ .gm-slider.p33::after { left:33%; }
34
+ .gm-slider.p40::after { left:40%; }
35
+ .gm-slider.p50::after { left:50%; }
36
+ .gm-slider.p60::after { left:60%; }
37
+ .gm-slider.p70::after { left:70%; }
38
+ .gm-slider.p85::after { left:85%; }
39
+ .gm-slider.p95::after { left:95%; }
40
+ .gm-slider-row .val { color:#FFF; font-family:monospace; font-size:11px; min-width:42px; text-align:right; }
41
+ .gm-toggle { display:flex; align-items:center; gap:8px; padding:6px 10px; background:#000; border:1px solid #2A2A2A; border-radius:4px; margin-bottom:8px; font-size:11px; cursor:pointer; }
42
+ .gm-toggle .box { width:14px; height:14px; border:1px solid #2A2A2A; border-radius:3px; display:inline-flex; align-items:center; justify-content:center; font-size:9px; }
43
+ .gm-toggle.on { color:#FFF; border-color:#FFF; }
44
+ .gm-toggle.on .box { background:#FFF; color:#0A0A0A; border-color:#FFF; }
45
+ .gm-pills { display:flex; gap:0; background:#000; border:1px solid #2A2A2A; border-radius:4px; padding:2px; margin-bottom:12px; }
46
+ .gm-pill { flex:1; text-align:center; padding:6px 10px; font-size:11px; color:#6B6B6B; border-radius:3px; cursor:pointer; }
47
+ .gm-pill.on { background:#FFF; color:#0A0A0A; }
48
+ .gm-select { background:#000; border:1px solid #2A2A2A; padding:8px 10px; border-radius:4px; color:#E5E5E5; font-size:11px; display:flex; justify-content:space-between; align-items:center; margin-bottom:8px; }
49
+ .gm-select .arrow { color:#6B6B6B; }
50
+ .gm-section { border:1px solid #2A2A2A; border-radius:4px; padding:14px; margin-top:14px; background:#0F0F0F; }
51
+ .gm-section.dim { opacity:0.4; }
52
+ .gm-section-h { display:flex; justify-content:space-between; align-items:center; margin-bottom:12px; font-size:11px; font-weight:600; }
53
+ .gm-section-h .arrow { color:#FFF; }
54
+ .gm-section-h .meta { color:#6B6B6B; font-weight:400; font-size:10px; }
55
+ .gm-chip { display:inline-block; padding:5px 10px; border-radius:14px; font-size:10px; margin-right:5px; margin-bottom:5px; background:#000; border:1px solid #2A2A2A; color:#6B6B6B; cursor:pointer; }
56
+ .gm-chip.on { border-color:#FFF; color:#FFF; }
57
+ .gm-chip.upload { border-style:dashed; color:#FFF; }
58
+ .gm-lora-row { display:flex; align-items:center; gap:10px; padding:8px 10px; background:#000; border:1px solid #2A2A2A; border-radius:4px; margin-bottom:6px; font-size:11px; }
59
+ .gm-lora-name { flex:1; }
60
+ .gm-lora-name small { color:#6B6B6B; font-weight:400; margin-left:4px; }
61
+ .gm-x { color:#6B6B6B; cursor:pointer; padding:0 4px; }
62
+ .gm-btn { background:#FFF; color:#0A0A0A; padding:12px 18px; border-radius:4px; font-weight:600; display:block; font-size:13px; text-align:center; cursor:pointer; margin-top:16px; }
63
+ .gm-dropzone { background:#000; border:2px dashed #2A2A2A; border-radius:6px; padding:14px; margin-bottom:12px; text-align:center; font-size:11px; color:#6B6B6B; }
64
+ .gm-dropzone.has-file { border-style:solid; border-color:#FFF; color:#FFF; text-align:left; padding:10px 12px; }
65
+ .gm-dropzone .filename { font-weight:600; }
66
+ .gm-dropzone .meta { color:#6B6B6B; font-size:9px; margin-top:2px; font-weight:400; }
67
+ .gm-dropzone .miniwave { height:18px; background:repeating-linear-gradient(90deg, currentColor 0 1px, transparent 1px 3px); margin-top:6px; opacity:0.5; }
68
+ .gm-waveform { height:60px; background:#000; border:1px solid #2A2A2A; border-radius:4px; display:flex; align-items:center; justify-content:center; gap:2px; padding:8px; margin-bottom:10px; position:relative; }
69
+ .gm-bar { width:2px; background:#E5E5E5; }
70
+ .gm-bar.muted { opacity:0.35; }
71
+ .gm-bar.highlight { background:#FFF; }
72
+ .gm-player-controls { display:flex; align-items:center; gap:10px; color:#6B6B6B; font-size:10px; margin-bottom:14px; }
73
+ .gm-play { width:28px; height:28px; border-radius:50%; background:#FFF; color:#0A0A0A; display:flex; align-items:center; justify-content:center; font-size:11px; }
74
+ .gm-meta-block { background:#000; border:1px solid #2A2A2A; border-radius:4px; padding:8px 10px; font-size:9px; color:#6B6B6B; font-family:monospace; line-height:1.6; max-height:160px; overflow:hidden; margin-top:8px; }
75
+ .gm-actions { display:flex; flex-wrap:wrap; gap:6px; margin-bottom:10px; }
76
+ .gm-secondary { border:1px solid #2A2A2A; color:#E5E5E5; padding:6px 12px; border-radius:4px; font-size:10px; cursor:pointer; }
77
+ .gm-segment-bar { position:relative; height:18px; background:#0F0F0F; border:1px solid #2A2A2A; border-radius:3px; margin:8px 0 12px; }
78
+ .gm-segment-bar .sel { position:absolute; top:0; bottom:0; background:#FFF; opacity:0.85; }
79
+ .gm-segment-bar .ticks { position:absolute; top:0; left:0; right:0; bottom:0; display:flex; justify-content:space-between; padding:0 2px; align-items:center; font-size:8px; color:#6B6B6B; font-family:monospace; pointer-events:none; }
80
+ .gm-segment-bar .label { position:absolute; top:-14px; font-size:8px; color:#FFF; font-family:monospace; }
81
+
82
+ /* Lyrics-specific */
83
+ .gm-lyrics-output { background:#000; border:1px solid #2A2A2A; border-radius:4px; padding:14px; margin-bottom:10px; font-family:Inter, system-ui, sans-serif; font-size:11px; line-height:1.7; color:#E5E5E5; min-height:240px; }
84
+ .gm-lyrics-output .section-tag { color:#FFF; font-weight:600; display:block; margin-top:10px; }
85
+ .gm-lyrics-output .section-tag:first-child { margin-top:0; }
86
+ .gm-lyrics-output .body { color:#B8B0A4; margin-left:0; }
87
+ </style>
88
+
89
+
90
+ <h3 style="margin-top:14px">✏️ Edit: fully expanded · Repaint sub-mode active</h3>
91
+
92
+ <div class="gm">
93
+ <div class="gm-header">
94
+ <div>
95
+ <div class="gm-brand">ACE Music Studio<span style="color:#FFF">.</span></div>
96
+ <div class="gm-cta" style="margin-top:2px">Built with <span style="color:#FFF">♥</span>. <strong>Drop a like</strong> · Follow <strong>@techfreakworm</strong></div>
97
+ </div>
98
+ <div class="gm-status">ready · MPS · M5 Max</div>
99
+ </div>
100
+
101
+ <div class="gm-row">
102
+ <div class="gm-sidebar">
103
+ <div class="gm-side"><span class="em">🎵</span>Generate</div>
104
+ <div class="gm-side"><span class="em">🎤</span>Cover</div>
105
+ <div class="gm-side"><span class="em">⏩</span>Extend</div>
106
+ <div class="gm-side active"><span class="em">✏️</span>Edit</div>
107
+ <div class="gm-side"><span class="em">✍️</span>Lyrics</div>
108
+ </div>
109
+
110
+ <div class="gm-main">
111
+ <div class="gm-form">
112
+
113
+ <div class="gm-label">1 · Source audio <span class="hint">the song you want to modify · ≤ 240 s</span></div>
114
+ <div class="gm-dropzone has-file">
115
+ <div class="filename">↑ my_song_draft.wav</div>
116
+ <div class="meta">44.1 kHz · stereo · 2:30 · 26.4 MB · BPM 138 · key A minor</div>
117
+ <div class="miniwave"></div>
118
+ </div>
119
+
120
+ <div class="gm-label">2 · Edit sub-mode</div>
121
+ <div class="gm-pills">
122
+ <div class="gm-pill on">Repaint segment</div>
123
+ <div class="gm-pill">Flow morph</div>
124
+ </div>
125
+
126
+ <div class="gm-label">3 · Source lyrics <span class="hint">paste the existing lyrics for context</span></div>
127
+ <div class="gm-input gm-textarea">[verse 1] original lyric line one<br>[chorus] original chorus<br>[verse 2] original lyric line two<br>[bridge] ...</div>
128
+
129
+ <div class="gm-label">4 · Target lyrics <span class="hint">replace only the segment selected below</span></div>
130
+ <div class="gm-input gm-textarea">[chorus] new chorus replaces the old<br>more punchy, more melodic</div>
131
+
132
+ <div class="gm-label">5 · Segment selection <span class="hint">drag handles on the waveform · or set timestamps</span></div>
133
+ <div class="gm-segment-bar">
134
+ <div class="sel" style="left:33%; width:27%;"></div>
135
+ <div class="ticks"><span>0:00</span><span>0:30</span><span>1:00</span><span>1:30</span><span>2:00</span><span>2:30</span></div>
136
+ <div class="label" style="left:33%">0:50</div>
137
+ <div class="label" style="left:60%">1:30</div>
138
+ </div>
139
+ <div class="gm-grid2">
140
+ <div><div class="gm-label">Segment start</div><div class="gm-input" style="margin-bottom:0; font-family:monospace">50.0 s</div></div>
141
+ <div><div class="gm-label">Segment end</div><div class="gm-input" style="margin-bottom:0; font-family:monospace">90.0 s</div></div>
142
+ </div>
143
+
144
+ <!-- Repaint sub-options -->
145
+ <div class="gm-section">
146
+ <div class="gm-section-h">
147
+ <span>Repaint options <span class="meta">· segment regeneration</span></span>
148
+ <span class="arrow">▾</span>
149
+ </div>
150
+
151
+ <div class="gm-grid2">
152
+ <div><div class="gm-label">Repaint mode</div><div class="gm-select">balanced <span class="arrow">▾</span></div></div>
153
+ <div><div class="gm-label">Chunk mask</div><div class="gm-select">auto <span class="arrow">▾</span></div></div>
154
+ </div>
155
+ <div class="gm-slider-row"><span class="name">Repaint strength</span><span class="gm-slider p50"></span><span class="val">0.50</span></div>
156
+ <div class="gm-slider-row"><span class="name">Latent crossfade frames</span><span class="gm-slider p20"></span><span class="val">10</span></div>
157
+ <div class="gm-slider-row"><span class="name">WAV crossfade seconds</span><span class="gm-slider p10"></span><span class="val">0.0</span></div>
158
+ <div class="gm-toggle on"><span class="box">✓</span> Preserve segment boundary phase</div>
159
+ </div>
160
+
161
+ <!-- Flow Morph sub-options, dimmed since Repaint is active -->
162
+ <div class="gm-section dim">
163
+ <div class="gm-section-h">
164
+ <span>Flow morph options <span class="meta">· caption-to-caption transformation · select "Flow morph" above to use</span></span>
165
+ <span class="arrow">▾</span>
166
+ </div>
167
+
168
+ <div class="gm-label">Source caption <span class="hint">describes what the segment currently is</span></div>
169
+ <div class="gm-input">acoustic ballad, gentle piano</div>
170
+
171
+ <div class="gm-label">Target caption <span class="hint">what to morph it into · prompt above is reused</span></div>
172
+ <div class="gm-input" style="opacity:0.5">(uses style prompt from step 2)</div>
173
+
174
+ <div class="gm-grid3">
175
+ <div><div class="gm-label">n_min</div><div class="gm-input" style="margin-bottom:0">0.0</div></div>
176
+ <div><div class="gm-label">n_max</div><div class="gm-input" style="margin-bottom:0">1.0</div></div>
177
+ <div><div class="gm-label">n_avg</div><div class="gm-input" style="margin-bottom:0">1</div></div>
178
+ </div>
179
+ <div class="gm-toggle"><span class="box"></span> Enable flow_edit_morph</div>
180
+ </div>
181
+
182
+ <!-- LoRA section, expanded -->
183
+ <div class="gm-section">
184
+ <div class="gm-section-h"><span>LoRA stack <span class="meta">· 1 active</span></span><span class="arrow">▾</span></div>
185
+ <div class="gm-label">Bundled presets</div>
186
+ <div style="margin-bottom:12px;">
187
+ <span class="gm-chip">RapMachine</span>
188
+ <span class="gm-chip">Chinese Rap</span>
189
+ <span class="gm-chip on">Lyric2Vocal</span>
190
+ <span class="gm-chip">Text2Samples</span>
191
+ </div>
192
+ <div class="gm-label">Active stack</div>
193
+ <div class="gm-lora-row">
194
+ <span class="gm-lora-name">Lyric2Vocal <small>· preset</small></span>
195
+ <span class="gm-slider p70" style="width:100px"></span>
196
+ <span class="val" style="color:#FFF; font-family:monospace; font-size:11px;">0.70</span>
197
+ <span class="gm-x">×</span>
198
+ </div>
199
+ <div style="margin-top:10px;">
200
+ <span class="gm-chip upload">↑ drop .safetensors here</span>
201
+ </div>
202
+ </div>
203
+
204
+ <!-- Advanced section, expanded -->
205
+ <div class="gm-section">
206
+ <div class="gm-section-h"><span>Advanced</span><span class="arrow">▾</span></div>
207
+ <div class="gm-grid3">
208
+ <div><div class="gm-label">BPM <span class="hint">inherits from source</span></div><div class="gm-input" style="margin-bottom:0">138</div></div>
209
+ <div><div class="gm-label">Key / scale</div><div class="gm-input" style="margin-bottom:0">A minor</div></div>
210
+ <div><div class="gm-label">Time sig</div><div class="gm-input" style="margin-bottom:0">4 / 4</div></div>
211
+ </div>
212
+ <div class="gm-grid2">
213
+ <div><div class="gm-label">Sampler</div><div class="gm-select">heun <span class="arrow">▾</span></div></div>
214
+ <div><div class="gm-label">Vocal language</div><div class="gm-select">en <span class="arrow">▾</span></div></div>
215
+ </div>
216
+ <div class="gm-slider-row"><span class="name">Inference steps</span><span class="gm-slider p20"></span><span class="val">50</span></div>
217
+ <div class="gm-slider-row"><span class="name">CFG scale</span><span class="gm-slider p40"></span><span class="val">5.0</span></div>
218
+ <div class="gm-slider-row"><span class="name">Shift</span><span class="gm-slider p33"></span><span class="val">3</span></div>
219
+ <div class="gm-label" style="margin-top:8px">Negative prompt</div>
220
+ <div class="gm-input" style="font-size:10px; margin-bottom:8px">bitcrushed, aliasing, off-key</div>
221
+ <div class="gm-grid2">
222
+ <div><div class="gm-label">Audio format</div><div class="gm-pills"><div class="gm-pill on">mp3 320</div><div class="gm-pill">wav 44.1</div></div></div>
223
+ <div><div class="gm-label">Loudness</div><div class="gm-toggle on"><span class="box">✓</span> -14 LUFS</div></div>
224
+ </div>
225
+ <div class="gm-grid2">
226
+ <div><div class="gm-label">Seed</div><div class="gm-input" style="margin-bottom:0; font-family:monospace">7331</div></div>
227
+ <div><div class="gm-label">&nbsp;</div><div class="gm-toggle"><span class="box"></span> Lock seed</div></div>
228
+ </div>
229
+ </div>
230
+
231
+ <!-- LM planner -->
232
+ <div class="gm-section">
233
+ <div class="gm-section-h"><span>LM planner · Qwen3 thinking</span><span class="arrow">▾</span></div>
234
+ <div class="gm-toggle on"><span class="box">✓</span> Thinking enabled</div>
235
+ <div class="gm-toggle on"><span class="box">✓</span> Constrained decoding</div>
236
+ <div class="gm-grid4" style="margin-top:8px">
237
+ <div><div class="gm-label">Temp</div><div class="gm-input" style="margin-bottom:0">0.85</div></div>
238
+ <div><div class="gm-label">Top-k</div><div class="gm-input" style="margin-bottom:0">0</div></div>
239
+ <div><div class="gm-label">Top-p</div><div class="gm-input" style="margin-bottom:0">0.90</div></div>
240
+ <div><div class="gm-label">LM CFG</div><div class="gm-input" style="margin-bottom:0">2</div></div>
241
+ </div>
242
+ <div class="gm-label">CoT toggles</div>
243
+ <div class="gm-grid4">
244
+ <div class="gm-toggle"><span class="box"></span> metas</div>
245
+ <div class="gm-toggle"><span class="box"></span> caption</div>
246
+ <div class="gm-toggle on"><span class="box">✓</span> lyrics</div>
247
+ <div class="gm-toggle"><span class="box"></span> language</div>
248
+ </div>
249
+ </div>
250
+
251
+ <!-- DCW -->
252
+ <div class="gm-section">
253
+ <div class="gm-section-h"><span>DCW · dynamic CFG warping</span><span class="arrow">▾</span></div>
254
+ <div class="gm-toggle on"><span class="box">✓</span> DCW enabled</div>
255
+ <div class="gm-grid2">
256
+ <div><div class="gm-label">Mode</div><div class="gm-select">double <span class="arrow">▾</span></div></div>
257
+ <div><div class="gm-label">Wavelet</div><div class="gm-select">haar <span class="arrow">▾</span></div></div>
258
+ </div>
259
+ <div class="gm-slider-row"><span class="name">DCW scaler</span><span class="gm-slider p10"></span><span class="val">0.02</span></div>
260
+ <div class="gm-slider-row"><span class="name">High scaler</span><span class="gm-slider p10"></span><span class="val">0.06</span></div>
261
+ </div>
262
+
263
+ <div class="gm-btn">▶ Repaint segment 0:50 – 1:30 · est. ~25 s on M5 Max</div>
264
+ </div>
265
+
266
+ <!-- Output -->
267
+ <div class="gm-output">
268
+ <div class="gm-label" style="margin-bottom:10px">Output · edited · 2:30 · seed 7331 · segment 0:50 – 1:30</div>
269
+
270
+ <div class="gm-toggle on"><span class="box">✓</span> Show edited region (highlighted on waveform)</div>
271
+
272
+ <div class="gm-waveform">
273
+ <div class="gm-bar muted" style="height:32%"></div>
274
+ <div class="gm-bar muted" style="height:48%"></div>
275
+ <div class="gm-bar muted" style="height:60%"></div>
276
+ <div class="gm-bar muted" style="height:42%"></div>
277
+ <div class="gm-bar muted" style="height:54%"></div>
278
+ <div class="gm-bar highlight" style="height:78%"></div>
279
+ <div class="gm-bar highlight" style="height:92%"></div>
280
+ <div class="gm-bar highlight" style="height:84%"></div>
281
+ <div class="gm-bar highlight" style="height:70%"></div>
282
+ <div class="gm-bar highlight" style="height:88%"></div>
283
+ <div class="gm-bar highlight" style="height:62%"></div>
284
+ <div class="gm-bar muted" style="height:48%"></div>
285
+ <div class="gm-bar muted" style="height:36%"></div>
286
+ <div class="gm-bar muted" style="height:42%"></div>
287
+ <div class="gm-bar muted" style="height:30%"></div>
288
+ <div class="gm-bar muted" style="height:38%"></div>
289
+ </div>
290
+
291
+ <div class="gm-player-controls">
292
+ <span class="gm-play">▶</span>
293
+ <span>0:00 / 2:30</span>
294
+ <span style="margin-left:auto; cursor:pointer; color:#FFF">↻ retake segment</span>
295
+ </div>
296
+
297
+ <div class="gm-label">A / B comparison</div>
298
+ <div class="gm-grid2">
299
+ <div class="gm-secondary" style="text-align:center">▶ original</div>
300
+ <div class="gm-secondary" style="text-align:center; border-color:#FFF; color:#FFF">▶ edited</div>
301
+ </div>
302
+
303
+ <div class="gm-label" style="margin-top:10px">Stems · Demucs</div>
304
+ <div style="display:grid; grid-template-columns:1fr 1fr; gap:6px; margin-bottom:10px;">
305
+ <div style="background:#000; border:1px solid #2A2A2A; border-radius:4px; padding:6px 10px; font-size:10px; display:flex; justify-content:space-between;"><span>vocals</span><span style="color:#FFF">↓</span></div>
306
+ <div style="background:#000; border:1px solid #2A2A2A; border-radius:4px; padding:6px 10px; font-size:10px; display:flex; justify-content:space-between;"><span>drums</span><span style="color:#FFF">↓</span></div>
307
+ <div style="background:#000; border:1px solid #2A2A2A; border-radius:4px; padding:6px 10px; font-size:10px; display:flex; justify-content:space-between;"><span>bass</span><span style="color:#FFF">↓</span></div>
308
+ <div style="background:#000; border:1px solid #2A2A2A; border-radius:4px; padding:6px 10px; font-size:10px; display:flex; justify-content:space-between;"><span>other</span><span style="color:#FFF">↓</span></div>
309
+ </div>
310
+
311
+ <div class="gm-label">Export</div>
312
+ <div class="gm-actions">
313
+ <span class="gm-secondary">↓ full mp3</span>
314
+ <span class="gm-secondary">↓ segment-only mp3</span>
315
+ <span class="gm-secondary">↓ wav</span>
316
+ <span class="gm-secondary">↓ stems zip</span>
317
+ <span class="gm-secondary">{ } meta</span>
318
+ </div>
319
+
320
+ <div class="gm-label" style="margin-top:14px">Metadata</div>
321
+ <div class="gm-meta-block">
322
+ {<br>
323
+ &nbsp;&nbsp;"mode": "edit", "sub_mode": "repaint",<br>
324
+ &nbsp;&nbsp;"source_audio_sha256": "1a4f...8e7d",<br>
325
+ &nbsp;&nbsp;"segment_start_s": 50.0, "segment_end_s": 90.0,<br>
326
+ &nbsp;&nbsp;"repaint_mode": "balanced", "repaint_strength": 0.5,<br>
327
+ &nbsp;&nbsp;"latent_crossfade_frames": 10, "wav_crossfade_s": 0.0,<br>
328
+ &nbsp;&nbsp;"chunk_mask_mode": "auto",<br>
329
+ &nbsp;&nbsp;"source_lyrics_hash": "3c2e...44ab",<br>
330
+ &nbsp;&nbsp;"target_lyrics_first_line": "[chorus] new chorus replaces the old...",<br>
331
+ &nbsp;&nbsp;"bpm": 138, "key": "A minor", "sampler": "heun", "steps": 50,<br>
332
+ &nbsp;&nbsp;"loras": [{"name": "Lyric2Vocal", "scale": 0.7}],<br>
333
+ &nbsp;&nbsp;"seed": 7331,<br>
334
+ &nbsp;&nbsp;"output_sha256": "b7a2...c019"<br>
335
+ }
336
+ </div>
337
+ </div>
338
+ </div>
339
+ </div>
340
+ </div>
341
+
342
+
343
+ <h3 style="margin-top:30px">✍️ Lyrics: fully expanded · Qwen 2.5 7B Instruct</h3>
344
+
345
+ <div class="gm">
346
+ <div class="gm-header">
347
+ <div>
348
+ <div class="gm-brand">ACE Music Studio<span style="color:#FFF">.</span></div>
349
+ <div class="gm-cta" style="margin-top:2px">Built with <span style="color:#FFF">♥</span>. <strong>Drop a like</strong> · Follow <strong>@techfreakworm</strong></div>
350
+ </div>
351
+ <div class="gm-status">ready · MPS · M5 Max · Qwen 2.5 7B</div>
352
+ </div>
353
+
354
+ <div class="gm-row">
355
+ <div class="gm-sidebar">
356
+ <div class="gm-side"><span class="em">🎵</span>Generate</div>
357
+ <div class="gm-side"><span class="em">🎤</span>Cover</div>
358
+ <div class="gm-side"><span class="em">⏩</span>Extend</div>
359
+ <div class="gm-side"><span class="em">✏️</span>Edit</div>
360
+ <div class="gm-side active"><span class="em">✍️</span>Lyrics</div>
361
+ </div>
362
+
363
+ <div class="gm-main">
364
+ <div class="gm-form">
365
+
366
+ <div class="gm-label">1 · Brief <span class="hint">describe the song in plain language</span></div>
367
+ <div class="gm-input gm-textarea" style="min-height:80px">A driving psytrance anthem about losing yourself on the dancefloor at sunrise. First-person, present tense, references to lights, kick drum, transcendence. Avoid clichés like "feel the beat".</div>
368
+
369
+ <div class="gm-grid2">
370
+ <div>
371
+ <div class="gm-label">Target structure <span class="hint">section sequence</span></div>
372
+ <div class="gm-input" style="margin-bottom:0">intro, verse, chorus, verse, chorus, bridge, chorus, outro</div>
373
+ </div>
374
+ <div>
375
+ <div class="gm-label">Language</div>
376
+ <div class="gm-select" style="margin-bottom:0">English (en) <span class="arrow">▾</span></div>
377
+ </div>
378
+ </div>
379
+
380
+ <div class="gm-grid3">
381
+ <div>
382
+ <div class="gm-label">Verse lines</div>
383
+ <div class="gm-input" style="margin-bottom:0">6</div>
384
+ </div>
385
+ <div>
386
+ <div class="gm-label">Chorus lines</div>
387
+ <div class="gm-input" style="margin-bottom:0">4</div>
388
+ </div>
389
+ <div>
390
+ <div class="gm-label">Bridge lines</div>
391
+ <div class="gm-input" style="margin-bottom:0">2</div>
392
+ </div>
393
+ </div>
394
+
395
+ <div class="gm-label">Tone / mood <span class="hint">optional · comma-separated descriptors</span></div>
396
+ <div class="gm-input">euphoric, hypnotic, transcendent, not cheesy</div>
397
+
398
+ <div class="gm-label">Rhyme preference</div>
399
+ <div class="gm-pills">
400
+ <div class="gm-pill">Strict (AABB)</div>
401
+ <div class="gm-pill on">Loose (ABAB / free)</div>
402
+ <div class="gm-pill">None</div>
403
+ </div>
404
+
405
+ <!-- LM parameters, expanded -->
406
+ <div class="gm-section">
407
+ <div class="gm-section-h">
408
+ <span>LM parameters <span class="meta">· Qwen 2.5 7B Instruct (Apache 2.0)</span></span>
409
+ <span class="arrow">▾</span>
410
+ </div>
411
+
412
+ <div class="gm-grid4">
413
+ <div><div class="gm-label">Temperature</div><div class="gm-input" style="margin-bottom:0">0.85</div></div>
414
+ <div><div class="gm-label">Top-p</div><div class="gm-input" style="margin-bottom:0">0.90</div></div>
415
+ <div><div class="gm-label">Top-k</div><div class="gm-input" style="margin-bottom:0">40</div></div>
416
+ <div><div class="gm-label">Rep. penalty</div><div class="gm-input" style="margin-bottom:0">1.10</div></div>
417
+ </div>
418
+
419
+ <div class="gm-grid2">
420
+ <div><div class="gm-label">Max new tokens</div><div class="gm-input" style="margin-bottom:0">600</div></div>
421
+ <div><div class="gm-label">Seed</div><div class="gm-input" style="margin-bottom:0; font-family:monospace">42</div></div>
422
+ </div>
423
+
424
+ <div class="gm-toggle"><span class="box"></span> Show system prompt</div>
425
+ <div class="gm-toggle on"><span class="box">✓</span> Enforce structural-tag format <span style="color:#6B6B6B; font-size:9px; margin-left:auto">stop at [end]</span></div>
426
+ </div>
427
+
428
+ <div class="gm-btn">▶ Draft lyrics · est. ~8 s on M5 Max</div>
429
+ </div>
430
+
431
+ <!-- Lyrics output -->
432
+ <div class="gm-output">
433
+ <div class="gm-label" style="margin-bottom:10px">Draft · 1 of 1 · 312 tokens · 6.2 s</div>
434
+
435
+ <div class="gm-lyrics-output">
436
+ <span class="section-tag">[intro]</span>
437
+ <span class="body">the lights start low, the bass starts slow<br>another night, another holy show</span>
438
+
439
+ <span class="section-tag">[verse 1]</span>
440
+ <span class="body">six in the morning, the sun's still pretending<br>kick drum carries what the night was sending<br>shoes off, eyes closed, the city's still bending<br>no one's here for an ending<br>just one more lift, one more descending<br>the room is breathing, the floor is mending</span>
441
+
442
+ <span class="section-tag">[chorus]</span>
443
+ <span class="body">we let go, we let go, we let go<br>oh the morning, oh the morning<br>arms up, head down, no warning<br>we let go, we let go, we let go</span>
444
+
445
+ <span class="section-tag">[verse 2]</span>
446
+ <span class="body">...</span>
447
+
448
+ <span class="section-tag">[bridge]</span>
449
+ <span class="body">...</span>
450
+
451
+ <span class="section-tag">[outro]</span>
452
+ <span class="body">...</span>
453
+ </div>
454
+
455
+ <div class="gm-actions" style="margin-bottom:14px">
456
+ <span class="gm-secondary" style="border-color:#FFF; color:#FFF">↑ Use these in Generate</span>
457
+ <span class="gm-secondary">↻ regenerate</span>
458
+ <span class="gm-secondary">↻ continue from cursor</span>
459
+ <span class="gm-secondary">✎ edit inline</span>
460
+ <span class="gm-secondary">↓ .txt</span>
461
+ </div>
462
+
463
+ <div class="gm-label">Quick refinements <span class="hint">click to apply to next regeneration</span></div>
464
+ <div style="margin-bottom:14px;">
465
+ <span class="gm-chip">more cryptic</span>
466
+ <span class="gm-chip">less rhyme</span>
467
+ <span class="gm-chip">more concrete imagery</span>
468
+ <span class="gm-chip">shorter lines</span>
469
+ <span class="gm-chip">add chorus hook</span>
470
+ </div>
471
+
472
+ <div class="gm-label">Variants</div>
473
+ <div class="gm-grid2">
474
+ <div class="gm-secondary" style="text-align:center; border-color:#FFF; color:#FFF">v1 · current</div>
475
+ <div class="gm-secondary" style="text-align:center">+ generate v2</div>
476
+ </div>
477
+
478
+ <div class="gm-label" style="margin-top:14px">Metadata</div>
479
+ <div class="gm-meta-block">
480
+ {<br>
481
+ &nbsp;&nbsp;"mode": "lyrics",<br>
482
+ &nbsp;&nbsp;"model": "Qwen2.5-7B-Instruct",<br>
483
+ &nbsp;&nbsp;"brief_first_line": "A driving psytrance anthem about losing yourself...",<br>
484
+ &nbsp;&nbsp;"structure": ["intro", "verse", "chorus", "verse", "chorus", "bridge", "chorus", "outro"],<br>
485
+ &nbsp;&nbsp;"language": "en",<br>
486
+ &nbsp;&nbsp;"tone": "euphoric, hypnotic, transcendent, not cheesy",<br>
487
+ &nbsp;&nbsp;"verse_lines": 6, "chorus_lines": 4, "bridge_lines": 2,<br>
488
+ &nbsp;&nbsp;"rhyme_preference": "loose",<br>
489
+ &nbsp;&nbsp;"temperature": 0.85, "top_p": 0.9, "top_k": 40,<br>
490
+ &nbsp;&nbsp;"repetition_penalty": 1.1, "max_new_tokens": 600,<br>
491
+ &nbsp;&nbsp;"seed": 42,<br>
492
+ &nbsp;&nbsp;"tokens_generated": 312, "wall_seconds": 6.2,<br>
493
+ &nbsp;&nbsp;"output_sha256": "f1a3...88e2"<br>
494
+ }
495
+ </div>
496
+ </div>
497
+ </div>
498
+ </div>
499
+ </div>
500
+
501
+
502
+ <div class="options" style="margin-top:24px">
503
+ <div class="option" data-choice="approve" onclick="toggleSelect(this)">
504
+ <div class="letter">✓</div>
505
+ <div class="content">
506
+ <h3>Both look right: refresh Generate next, then mobile + error states</h3>
507
+ <p>Edit (with both sub-modes visible) and Lyrics (with LM params + quick-refinement chips) work. Continue.</p>
508
+ </div>
509
+ </div>
510
+ <div class="option" data-choice="revise" onclick="toggleSelect(this)">
511
+ <div class="letter">✎</div>
512
+ <div class="content">
513
+ <h3>Revise: tell me which control or section</h3>
514
+ <p>Reply in terminal with specifics.</p>
515
+ </div>
516
+ </div>
517
+ </div>
docs/superpowers/specs/mockups/README.md ADDED
@@ -0,0 +1,50 @@
1
+ # ACE Music Studio: UI mockups
2
+
3
+ Visual source-of-truth for the design spec at `../2026-05-18-ace-music-studio-design.md`. Open the HTML files in a browser to see the rendered Brutalist Mono interface.
4
+
5
+ | File | Tabs / screens covered | Source |
+ |---|---|---|
+ | [`01_generate_mobile_errors.html`](./01_generate_mobile_errors.html) | **Generate** tab fully expanded · 3 phone screens (Generate, Cover, Lyrics) · 6 error / edge-case states · in-progress generation banner | brainstorm session 24743 |
+ | [`02_cover_extend.html`](./02_cover_extend.html) | **Cover** tab fully expanded · **Extend** tab fully expanded | brainstorm session 24743 |
+ | [`03_edit_lyrics.html`](./03_edit_lyrics.html) | **Edit** tab fully expanded with both sub-modes (Repaint active, Flow Morph dimmed) · **Lyrics** tab fully expanded with Qwen 2.5 LM params | brainstorm session 24743 |
10
+
11
+ ## What every tab shares
12
+
13
+ - Sticky header with brand "ACE Music Studio." and CTA: *Built with ♥. Drop a like · Follow @techfreakworm for what's next.*
+ - Sidebar with 5 mode pills + session History list (desktop ≥ 1024 px)
+ - 2-column body: form on left, output on right
+ - LoRA stack section with 4 bundled preset chips + active stack rows (per-row strength slider + ×) + custom upload zone
+ - Advanced accordion: BPM, key/scale, time sig, sampler, language, steps, CFG, shift, negative prompt, audio format, loudness, fade in/out, seed + lock
+ - LM planner accordion: thinking, constrained decoding, temp / top-k / top-p / LM CFG, CoT toggles (metas / caption / lyrics / language), LM negative prompt, CoT override fields
+ - DCW accordion: enabled, mode (single / double), wavelet, scaler, high scaler
+ - Output panel: waveform · play/scrub · retake · stems (Demucs htdemucs_ft) · export (mp3 / wav / stems zip / meta JSON / share link) · full metadata JSON
21
+
22
+ ## What each tab adds
23
+
24
+ - **Generate**: duration slider, vocals/instrumental pills, CFG-interval start/end, latent shift/rescale
+ - **Cover**: reference-audio dropzone, cover-strength slider, cover-noise slider, compare-side-by-side toggle in output
+ - **Extend**: seed-audio dropzone with auto-detected BPM/key, extension prompt, extra-duration slider, repaint mode, repaint strength, latent crossfade frames, WAV crossfade seconds, chunk mask mode, seed-boundary marker on output waveform, separate "extension-only" download
+ - **Edit**: source audio + source/target lyrics, repaint-vs-flow-morph sub-mode pills, segment-selection bar with start/end timestamps, repaint sub-options (mode / chunk-mask / strength / crossfade), flow-morph sub-options (source caption / n_min / n_max / n_avg), A/B comparison in output
+ - **Lyrics**: brief, structure sequence, language, per-section line counts (verse / chorus / bridge), tone descriptors, rhyme preference pills (strict / loose / none), LM params accordion (temp / top-p / top-k / rep penalty / max tokens / seed / show system prompt / enforce-tag-format), quick-refinement chips (more cryptic, less rhyme, etc.), variants
29
+
30
+ ## Mobile (phone)
31
+
32
+ - Native `gr.Tabs` horizontal scroll strip at top (icons + first label visible)
33
+ - Sidebar hidden via CSS media query at `< 640 px`
34
+ - Output stacks below form
35
+ - Sliders bounded by parent width (the desktop's pixel-art `━` characters were replaced with proper CSS slider tracks for mobile)
36
+
37
+ ## Error / edge states
38
+
39
+ - **LoRAValidationError**: toast with module-mismatch diagnostics + "Remove from stack" / "View header diagnostics" actions
+ - **ZeroGPU timeout**: auto-retry once at 2× duration, then warning toast with "Lower steps" / "Reduce duration" hints
+ - **MPS op fallback**: info toast naming the op (e.g., `aten::_fft_r2c`), CPU fallback engaged via `PYTORCH_ENABLE_MPS_FALLBACK=1`
+ - **Audio format rejected**: clear constraints (wav/mp3/flac, ≤ 60 s for Cover, ≤ 50 MB) + "Auto-convert + trim" action
+ - **First-request warm-up**: informational banner ("Loading ACE-Step v1.5 XL SFT into MPS memory ~45 s")
+ - **In-progress generation**: `gr.Progress`-driven banner with step / total, ETA, elapsed, cancel link
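
The "Audio format rejected" state implies a pre-flight check on uploads before any model call. The sketch below is only an illustration of that check: `validate_cover_upload` and `UploadRejected` are hypothetical names, and passing the duration in as a number is an assumption (the mockups do not say how duration is measured). Only the three limits come from the list above.

```python
from pathlib import Path

# Limits taken from the "Audio format rejected" state above.
ALLOWED_SUFFIXES = {".wav", ".mp3", ".flac"}
MAX_COVER_SECONDS = 60.0
MAX_BYTES = 50 * 1024 * 1024  # assumption: "50 MB" is read as 50 MiB


class UploadRejected(ValueError):
    """Hypothetical error type that carries every violated constraint."""


def validate_cover_upload(filename: str, size_bytes: int, duration_s: float) -> None:
    """Raise UploadRejected listing all problems; return None if the file is fine."""
    problems = []
    if Path(filename).suffix.lower() not in ALLOWED_SUFFIXES:
        problems.append("format must be wav, mp3, or flac")
    if duration_s > MAX_COVER_SECONDS:
        problems.append(f"duration {duration_s:.1f} s exceeds {MAX_COVER_SECONDS:.0f} s")
    if size_bytes > MAX_BYTES:
        problems.append("file is larger than 50 MB")
    if problems:
        raise UploadRejected("; ".join(problems))
```

Collecting all violations before raising lets a single toast show every constraint at once, which matches the "clear constraints" wording of that state.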
45
+
46
+ ## Note on the "approve / revise" cards
47
+
48
+ Each HTML file has a card-options block at the bottom, vestigial from the visual-companion brainstorm flow. It's harmless when viewed outside the companion (the `toggleSelect` call is a no-op without the companion's helper.js).
49
+
50
+ If they bother you, delete the trailing `<div class="options">…</div>` block from each file. Otherwise leave them: they document which question each mockup answered.
research/00_executive_summary.md ADDED
@@ -0,0 +1,122 @@
1
+ # Open-Source Song Generation for a Suno-Like Platform: Executive Summary
2
+
3
+ *Research compiled 2026-05-18. Target hardware: Apple M5 Max, 128 GB unified memory, MPS backend. Deployment target: **free non-profit Hugging Face Space.** Commercial license is NOT a constraint.*
4
+
5
+ ---
6
+
7
+ ## TL;DR
8
+
9
+ **Use ACE-Step 1.5 XL as the default base model.** It is the open-source full-song-with-vocals foundation model that, as of May 2026, combines:
+
+ 1. **First-class Apple Silicon support** (hybrid MLX + PyTorch MPS, dedicated `clockworksquirrel/ace-step-apple-silicon` fork), which gives the best local-dev experience.
+ 2. **MIT license**: clean for forks, attribution, and weight redistribution on the HF Space.
+ 3. **State-of-the-art-or-better quality**: 4.4/5 vs Suno v4's 4.1/5 vocal naturalness in a 50-person blind test (folk, classical, jazz; Suno still wins pop/EDM polish).
+ 4. **Sub-minute generation** on M5 Max (projected ~30 – 50 s for a 4-min song). Sub-2 s/song on A100, which fits inside HF ZeroGPU's free 60 s budget.
+ 5. **Cheap LoRA fine-tuning**: 8 songs trainable in ~1 hour on a single 3090; LoRA training also works on MPS.
+ 6. **50+ languages**, vocals + instrumentation natively, **<4 GB VRAM minimum**, so it runs on free ZeroGPU Spaces.
+ 7. **Active 10.4 k-star repo**, native ComfyUI integration, AMD vendor-blessed for production.
18
+
19
+ **Now that commercial use is not a constraint** (free non-profit HF Space deployment), **SongGeneration 2 / LeVo 2** comes back into contention as a premium-quality alternative: its Tencent non-commercial license permits academic/research/education use. Vendor benchmarks (unverified) put it ahead of Suno v5 on lyric accuracy. The trade-off is **22 – 28 GB VRAM** (needs paid Space tier, not free ZeroGPU) and no first-party MPS path (only a buggy community `SongGen-Mac` fork), meaning M5 Max local dev is painful.
20
+
21
+ Pair the primary pick with **HeartMuLa-MLX** as an alternate-quality choice (Apache 2.0, 2.1× faster than ACE-Step on M-series via Apple's MLX) and **YuE on Replicate** as the multilingual fallback.
22
+
23
+ ---
24
+
25
+ ## Ranking (non-profit HF Space context)
26
+
27
+ | Rank | Model | Params | bf16 weights | License | MPS | Vocal Quality vs Suno | LoRA | Verdict |
+ |---|---|---|---|---|---|---|---|---|
+ | **1** | **ACE-Step 1.5 XL** | ~8 B (4 B DiT + 4 B planner) | ~16 GB | MIT | First-class | 4.4/5 vs Suno v4 4.1 (blind test) | ✅ 1h on 3090 | **Default base.** Fits free ZeroGPU. |
+ | **2** | **SongGeneration 2 / LeVo 2** | 4 B | ~8 GB | Tencent non-commercial (OK for non-profit Space) | Buggy community fork only | Vendor PER 8.55 % vs Suno v5 12.4 % | ❌ | Premium quality. Needs paid Space (22 – 28 GB VRAM). |
+ | **3** | **HeartMuLa** | ~6.8 B (4 B MuLa + 2 B Codec + 0.8 B ASR) | ~13.6 GB | Apache 2.0 | Strong MLX port | Vendor: lowest PER per-language, unverified | ❌ public | Strong A/B alternate. |
+ | **4** | **DiffRhythm 2** | ~1.17 B (1 B DiT + 170 M VAE-dec) | ~2.4 GB | Apache 2.0 | Likely OK, untested | Authors admit gap vs Suno v4.5 | ❌ no training code | Speed tier. 210 s ceiling. Cheapest to host. |
+ | **5** | **YuE** | ~8 B (7 B + 1 B + upsampler) | ~16 GB | Apache 2.0 | ❌ broken (flash-attn hard dep) | Vocal range matches Suno v4 | ✅ LoRA, CUDA-only | Multilingual specialist; via Replicate only. |
+ | - | SongBloom | 2 B | ~4 GB | Custom (likely NC) | Reported OK | unknown | ❌ | Research baseline. |
+ | - | InspireMusic / FunMusic | 1.5 B | ~3 GB | Apache 2.0 | ❌ CUDA-only deps | No vocals yet | n/a | Skip until vocal release. |
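
The "bf16 weights" column follows from the parameter counts at 2 bytes per parameter. A minimal sketch of that arithmetic, assuming the column counts weights only and uses GB as 10^9 bytes (`bf16_weight_gb` is an illustrative helper, not something from the research files):

```python
BYTES_PER_BF16_PARAM = 2  # bfloat16 stores each weight in 16 bits


def bf16_weight_gb(params_billion: float) -> float:
    """Approximate weight footprint in GB (10^9 bytes), weights only."""
    return params_billion * BYTES_PER_BF16_PARAM


# Parameter counts (in billions) copied from the ranking table above.
table_params = {
    "ACE-Step 1.5 XL": 8.0,
    "SongGeneration 2 / LeVo 2": 4.0,
    "HeartMuLa": 6.8,
    "DiffRhythm 2": 1.17,
    "YuE": 8.0,
}

for name, params in table_params.items():
    print(f"{name}: ~{bf16_weight_gb(params):.1f} GB")
```

The result is a floor, not a VRAM requirement: the table's 22 – 28 GB figure for SongGeneration 2 against ~8 GB of weights shows how much runtime memory can come on top.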
36
+
37
+ ---
38
+
39
+ ## Decision tree (non-profit HF Space deployment)
40
+
41
+ ```
42
+ HF Space tier?
+ ├── Free ZeroGPU (60s/req on shared A100) ─┐
+ │   ├── ACE-Step 1.5 (turbo workflow generates a song well under 60 s)
+ │   └── DiffRhythm 2 (smallest, fastest, fits easily)
+ │
+ └── Paid GPU Space (A10G / A100 dedicated) ─┐
+     ├── Default: ACE-Step 1.5 XL (best speed-quality, MPS for local dev)
+     ├── Premium tier: SongGeneration 2 v2-large (best vendor benchmarks)
+     ├── Multilingual breadth: YuE (50+ via Replicate; local broken)
+     └── Alternate: HeartMuLa via heartlib-mlx
52
+ ```
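
If the Space ever needs to make this choice in code (for example to hide models the current tier cannot run), the tree above reduces to a lookup. The sketch below is a hypothetical illustration: the tier labels `"free_zerogpu"` and `"paid_gpu"` and the function name are assumptions, and the model lists simply restate the tree.

```python
def candidate_models(space_tier: str) -> list[str]:
    """Return the candidates the decision tree lists for a Space tier.

    The tier labels are hypothetical; they are not Hugging Face identifiers.
    """
    tree = {
        "free_zerogpu": [
            "ACE-Step 1.5 (turbo workflow)",
            "DiffRhythm 2",
        ],
        "paid_gpu": [
            "ACE-Step 1.5 XL (default)",
            "SongGeneration 2 v2-large (premium tier)",
            "YuE via Replicate (multilingual breadth)",
            "HeartMuLa via heartlib-mlx (alternate)",
        ],
    }
    try:
        return tree[space_tier]
    except KeyError:
        raise ValueError(f"unknown Space tier: {space_tier!r}") from None
```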
53
+
54
+ ---
55
+
56
+ ## What the research surfaced that changes the picture
57
+
58
+ 1. **Non-profit HF Space deployment removes the Tencent-license blocker.** SongGeneration 2 / LeVo 2 is back in contention as a premium-quality alternative. Its custom license permits "academic, research, and education purposes", and a free non-profit Space sits comfortably inside that scope. Practical blockers remain (22 – 28 GB VRAM means paid Space tier, no working MPS) but the license is no longer a no-go.
+
+ 2. **The YuE team migrated to ACE-Step.** The ACE-Step paper (Jun 2025) explicitly critiques YuE for "slow inference and structural artifacts." YuE's repo has been dormant since 2025-06-04. Treat YuE as a frozen capability, not a developing one.
+
+ 3. **Vocal-support contradiction on ACE-Step is resolved: yes, it does vocals.** Several search results said "instrumental only", but that confuses the base model with the `Text2Samples` LoRA. The base model produces vocals + instruments natively, lyric-conditioned, with `[verse] [chorus] [bridge]` structural tags.
+
+ 4. **DiffRhythm 2's biggest fix is structural coherence**, not raw quality. The brutal Hacker News thread on v1 complained of "no identifiable chorus in any of the demo songs"; v2's block flow-matching (semi-autoregressive over 2 s blocks) closes that gap. Its **210 s ceiling is a regression** from v1-full's 4m45s.
+
+ 5. **HeartMuLa is the dark-horse 2026 entrant.** Apache 2.0, 4 B params, modular (CLAP + Transcriptor + Codec + MuLa LM), MLX port available. Vendor PER claims are aggressive (0.09 EN / 0.12 ZH) but are not in units comparable to LeVo's 8.55 %, so a direct comparison is unreliable until somebody runs a neutral A/B.
+
+ 6. **Every "beats Suno v5" claim is vendor-published.** The only neutral preference study located ([arXiv 2506.19085](https://arxiv.org/html/2506.19085v1)) stops at Suno v3.5. **Plan an in-house blind A/B before betting product positioning on any vendor number.**
+
+ 7. **Apple Silicon is fine for music gen, and much friendlier than LTX-Video 2.3.** No complex64, no SDPA-on-meta-tensor traps, no multimodal-Gemma gotchas. The mundane MPS issues here are: `flash-attn` substitution with SDPA, fp16 conv1d → fp32 in audio decoders, `PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0` for OOM tuning. Three of the five candidate models already ship a working MPS or MLX path.
+
+ 8. **HF Space hardware tier dictates the model choice as much as quality does.** Free ZeroGPU = 60 s budget per request, shared A100, so only ACE-Step or DiffRhythm 2 finish in time. Paid A10G/A100 Spaces unlock SongGeneration 2 v2-large but the user has to pay (or get an HF community grant).
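
The MPS tuning mentioned in item 7 mostly comes down to environment variables set before PyTorch loads. The sketch below is a hedged illustration rather than anything taken from the ACE-Step fork: `pick_device` is a hypothetical helper, and `PYTORCH_ENABLE_MPS_FALLBACK` comes from the mockups README in this same commit rather than from this summary.

```python
import os

# Set both before importing torch so they are in place when the backend starts.
# A ratio of 0.0 lifts the MPS allocator's upper limit: fewer early OOM errors,
# but the process may then exhaust system memory instead.
os.environ.setdefault("PYTORCH_MPS_HIGH_WATERMARK_RATIO", "0.0")
# Let ops that have no MPS kernel fall back to the CPU instead of raising.
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")


def pick_device() -> str:
    """Return "mps", "cuda", or "cpu"; report "cpu" if torch is not installed."""
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.backends.mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"
```

Using `setdefault` keeps any value already exported in the shell, so a developer can still override the ratio per run.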
73
+
74
+ ---
75
+
76
+ ## Recommended starting setup for the M5 Max (with HF Space deploy in mind)
77
+
78
+ ```bash
79
+ # 1. Primary base model: ACE-Step 1.5 XL via the Apple Silicon fork
80
+ git clone https://github.com/clockworksquirrel/ace-step-apple-silicon \
81
+ ~/Projects/llm/music-generator/ace-step
82
+ cd ~/Projects/llm/music-generator/ace-step
83
+ python3.11 -m venv .venv && source .venv/bin/activate
84
+ pip install -r requirements.txt
85
+ # Hybrid backend: Qwen3 planner → MLX, DiT decoder → PyTorch MPS, bf16 throughout
86
+ # ~16 GB bf16 weights for the XL stack; M5 Max 128 GB has massive headroom
87
+
88
+ # 2. Production UI: ace-step-ui (stem extraction, library, LAN access)
89
+ git clone https://github.com/fspecii/ace-step-ui \
90
+ ~/Projects/llm/music-generator/ace-step-ui
91
+
92
+ # 3. Alternate model: HeartMuLa via MLX port (~13.6 GB bf16)
93
+ git clone https://github.com/Acelogic/heartlib-mlx \
94
+ ~/Projects/llm/music-generator/heartlib-mlx
95
+
96
+ # 4. (Optional) Premium-quality experiment: SongGeneration 2 / LeVo 2
97
+ # Mac fork has a pre-chorus bug; only do this if you're OK developing on a rented
98
+ # Linux+CUDA box and the M5 Max becomes just your control plane.
99
+ git clone https://github.com/tencent-ailab/SongGeneration \
100
+ ~/Projects/llm/music-generator/songgeneration
101
+ ```
102
+
103
+ For the throughput-sensitive **multilingual fallback (YuE)**, use Replicate's `fofr/yue` endpoint; do *not* attempt local inference on M5 Max until somebody ports Stage-1 to MPS. Treat YuE as remote-only for now.
104
+
105
+ **HF Space deployment notes:**
106
+ - **Free ZeroGPU Space** → only ACE-Step or DiffRhythm 2 will finish a song inside the 60 s shared-A100 budget. Use ACE-Step's turbo workflow.
+ - **Paid GPU Space** → A10G (24 GB) handles ACE-Step XL comfortably; A100 (40 GB) opens the door to SongGeneration 2 v2-large.
+ - **Apply for a [Community GPU Grant](https://huggingface.co/docs/hub/en/spaces-gpus#community-gpu-grants)** if budget is the deciding factor; HF approves these regularly for non-profit demos.
109
+
110
+ ---
111
+
112
+ ## Sources
113
+
114
+ All claims are cited inline in the per-model deep-dives:
115
+
116
+ - [01_yue.md](./01_yue.md)
117
+ - [02_diffrhythm.md](./02_diffrhythm.md)
118
+ - [03_acestep.md](./03_acestep.md)
119
+ - [04_newcomers_and_survey.md](./04_newcomers_and_survey.md)
120
+ - [05_apple_silicon_mps_audit.md](./05_apple_silicon_mps_audit.md)
121
+ - [06_comparison_matrix.md](./06_comparison_matrix.md): side-by-side spec table
+ - [07_platform_architecture.md](./07_platform_architecture.md): Suno-clone system design with ACE-Step at the core
research/01_yue.md ADDED
@@ -0,0 +1,268 @@
1
+ # YuE: Open Full-Song Music Generation Foundation Model
2
+
3
+ *Research date: 2026-05-18*
4
+
5
+ ---
6
+
7
+ ## 1. Overview
8
+
9
+ **YuE** (乐, "yue", Chinese for "music") is an open-source family of long-form, lyrics-to-song foundation models that produce vocals + accompaniment end-to-end, explicitly positioned as the open competitor to Suno.ai and Udio. It was built by the **M-A-P (Multimodal Art Projection) collective**, led by researchers at **HKUST (Hong Kong University of Science and Technology)** with collaborators from multiple academic and industry institutions (58 authors are credited on the paper, with hardware support from Geely and Moonshot AI) ([arXiv 2503.08638](https://arxiv.org/abs/2503.08638), [HF model card](https://huggingface.co/m-a-p/YuE-s1-7B-anneal-en-icl)).
10
+
11
+ **Release timeline:**
12
+
13
+ - **2025-01-26**: Initial YuE-s1-7B series released ([GitHub README](https://github.com/multimodal-art-projection/YuE))
+ - **2025-01-30**: Apache 2.0 license adopted; dual-track ICL mode added
+ - **2025-02-07**: Windows / Pinokio support
+ - **2025-02-17**: Music continuation + Google Colab support
+ - **2025-03-11/12**: Anneal checkpoints + technical report on arXiv (v1)
+ - **2025-06-04**: LoRA fine-tuning code merged (PR #126)
+ - **ICLR 2026**: Paper presented
20
+
21
+ **Current status (May 2026): effectively frozen / community-maintained.** The official `multimodal-art-projection/YuE` repo's last commit is **2025-06-04** (GitHub API, retrieved 2026-05-18), nearly 12 months stale. There is no announced YuE-2 or successor from the M-A-P org. All forward development (quantization, ComfyUI, GUI, MPS attempts, exllama, mp3 extension) now happens in community forks like [YuEGP](https://github.com/deepbeepmeep/YuEGP), [YuE-exllamav2](https://github.com/sgsdxzy/YuE-exllamav2), and [YuE-extend](https://github.com/Mozer/YuE-extend). The team itself has moved on to **ACE-Step** (released January 2026), whose paper explicitly critiques YuE for "slow inference and structural artifacts" ([arXiv 2506.00045](https://arxiv.org/abs/2506.00045)).
22
+
23
+ ---
24
+
25
+ ## 2. Architecture
26
+
27
+ YuE is a **two-stage autoregressive LLM** pipeline built on the **LLaMA2** decoder-only transformer backbone, *not* a diffusion model ([paper](https://arxiv.org/html/2503.08638v1)).
28
+
29
+ **Stage-1 LM (the headline 7B model):**
30
+ - LLaMA2-style decoder, ~6B–7B parameters (HF metadata reports 6B for the s1 checkpoints).
31
+ - Performs **track-decoupled next-token prediction**: interleaves *vocal* and *instrumental* token streams in a single sequence, so a single AR pass produces both tracks rather than mixing them. This is YuE's central architectural innovation.
32
+ - Conditioned on (genre tags || lyrics) using **structural progressive conditioning**: lyrics are chunked per section (verse/chorus/bridge) and re-injected so attention does not lose alignment over a 5-minute generation.
33
+ - Native context: 8192 tokens (~163 s of mix-track audio, ~81 s of dual-track); extended to **16384** in the anneal phase.
34
+
35
+ **Stage-2 LM:**
36
+ - 1B-parameter LLaMA2 model (HF reports ~2B for `YuE-s2-1B-general`).
37
+ - Predicts the **residual RVQ codebooks (layers 1–7)** conditioned on Stage-1's codebook-0 output, restoring acoustic fidelity that the semantic-rich layer-0 tokens omit.
38
+ - Context length 8192.
39
+
40
+ **Audio tokenizer (X-Codec):**
41
+ - YuE uses **X-Codec** (from the same M-A-P lineage as MERT), a *semantic-acoustic fused* RVQ codec that bolts a HuBERT-based semantic stream onto an RVQ-VAE acoustic stream.
42
+ - 12 RVQ codebooks total; YuE uses the first **8** (codebook size 1024 each).
43
+ - 50 Hz frame rate over 16 kHz audio.
44
+ - A separate **YuE-upsampler** (GAN-based) converts the 16 kHz output up to higher sample rate / better fidelity for delivery ([paper Β§3](https://arxiv.org/html/2503.08638v1), [HF Transformers X-Codec docs](https://huggingface.co/docs/transformers/main/model_doc/xcodec)).
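
The context-length figures in the Stage-1 list follow from the 50 Hz frame rate. A minimal sketch of that arithmetic, assuming Stage-1 emits one codebook-0 token per frame per track and ignoring the text tokens that share the context (`context_seconds` is an illustrative helper, not YuE code):

```python
FRAME_RATE_HZ = 50  # X-Codec frames per second, from the list above


def context_seconds(context_tokens: int, tracks: int = 1) -> float:
    """Seconds of audio that fit in a Stage-1 context window.

    Assumes one codebook-0 token per frame per track, so interleaving
    vocal + instrumental tokens (tracks=2) halves the covered duration.
    Tag and lyric tokens are ignored, so real budgets are somewhat smaller.
    """
    return context_tokens / (FRAME_RATE_HZ * tracks)


print(context_seconds(8192))             # mix-track, 8192-token context
print(context_seconds(8192, tracks=2))   # dual-track, 8192-token context
print(context_seconds(16384, tracks=2))  # dual-track, anneal-phase context
```

The first two calls give 163.84 s and 81.92 s, matching the ~163 s and ~81 s quoted above. Under the same assumption, the 16384-token anneal context covers about 164 s of dual-track audio per session, which is consistent with long songs being generated in several sessions and stitched.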
45
+
46
+ **Track handling:** Dual-track. Vocal and accompaniment are *separately tokenized* via X-Codec, then interleaved in the AR sequence. This is the paper's claimed advantage over single-track-mixture baselines (less information loss, cleaner vocal/inst separation).
47
+
48
+ **Max generation length:** Up to **~5 minutes** per song, generated in chunks/sessions and stitched.
49
+
50
+ **Lyrics conditioning:** Plain text lyrics with section tags ([verse], [chorus], etc.) + a genre tag prompt (a vocabulary from `top_200_tags.json` such as "pop", "female vocal", "energetic", "120 bpm"). The progressive conditioning means each new section re-references the relevant lyric chunk.
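
To make the conditioning format concrete, the sketch below assembles a genre prompt and section-tagged lyrics as plain text. It is a hypothetical illustration: `build_yue_prompt` is not part of the YuE repo, and joining tags with spaces and separating sections with blank lines are assumptions about layout that this document does not specify.

```python
def build_yue_prompt(
    genre_tags: list[str], sections: list[tuple[str, str]]
) -> tuple[str, str]:
    """Return (genre_prompt, lyrics_text) in a tagged plain-text layout.

    Hypothetical helper: the exact layout the official inference script
    expects should be checked against the YuE README before use.
    """
    genre_prompt = " ".join(genre_tags)  # assumption: space-separated tags
    blocks = [f"[{name}]\n{text.strip()}" for name, text in sections]
    return genre_prompt, "\n\n".join(blocks)


genre, lyrics = build_yue_prompt(
    ["pop", "female vocal", "energetic", "120 bpm"],
    [
        ("verse", "six in the morning, the sun's still pretending"),
        ("chorus", "we let go, we let go, we let go"),
    ],
)
print(genre)
print(lyrics)
```

Keeping each section as its own tagged block mirrors the per-section chunking that progressive conditioning relies on.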
51
+
52
+ **Training scale:** Stage-1 used ~**2T tokens** across phases; data includes ~**650K hours of in-the-wild music** plus ~**70K hours of TTS** for vocal grounding ([paper](https://arxiv.org/html/2503.08638v1)).
53
+
54
+ ---
55
+
56
+ ## 3. Variants and Sizes
57
+
58
+ From the [M-A-P YuE collection on HuggingFace](https://huggingface.co/collections/m-a-p/yue-6797d55e22990ae89b90a3d6) (downloads accurate as of mid-2026):
59
+
60
+ | Model | Params | Stage | Language | Mode | Downloads (last month) |
61
+ |---|---|---|---|---|---|
62
+ | `YuE-s1-7B-anneal-en-cot` | 6B | 1 | English | Chain-of-Thought (default) | 8.48k |
63
+ | `YuE-s1-7B-anneal-en-icl` | 6B | 1 | English | In-Context Learning (style cloning) | 805 |
64
+ | `YuE-s1-7B-anneal-zh-cot` | 6B | 1 | Mandarin/Cantonese | CoT | 203 |
65
+ | `YuE-s1-7B-anneal-zh-icl` | 6B | 1 | Mandarin/Cantonese | ICL | 89 |
66
+ | `YuE-s1-7B-anneal-jp-kr-cot` | 6B | 1 | Japanese/Korean | CoT | 95 |
67
+ | `YuE-s1-7B-anneal-jp-kr-icl` | 6B | 1 | Japanese/Korean | ICL | 25 |
68
+ | `YuE-s2-1B-general` | 2B | 2 | language-agnostic | residual decoder | 6.01k |
69
+ | `YuE-s1-0.5B` | 0.5B | 1 | research/ablation | partial training | 94 |
70
+ | `YuE-upsampler` | – | post | n/a | GAN upsampler | – |
71
+ | `xcodec_mini_infer` | – | tokenizer | n/a | X-Codec encoder/decoder | – |
72
+
73
+ **Naming key:**
74
+ - `s1` / `s2` = Stage-1 (semantic) / Stage-2 (acoustic residual).
75
+ - `anneal` = checkpoints after the final "annealing" pretraining phase (highest quality public weights).
76
+ - `cot` = chain-of-thought prompting variant; `icl` = in-context learning variant (used for *style/voice cloning* from a reference audio).
77
+ - A community **GGUF quantization** of the Stage-2 model exists at [`multimodalart/YuE-s2-1B-general-Q8_0-GGUF`](https://huggingface.co/multimodalart/YuE-s2-1B-general-Q8_0-GGUF), which is useful for Mac llama.cpp paths.
78
+
79
+ There is **no official "YuE-2" or major version bump**. The team's successor effort is the separately branded ACE-Step.
80
+
81
+ ---
82
+
83
+ ## 4. License
84
+
85
+ **Apache License 2.0** for code *and* weights, switched on 2025-01-30 in response to community pressure ([GitHub README news entry](https://github.com/multimodal-art-projection/YuE), [HF model card](https://huggingface.co/m-a-p/YuE-s1-7B-anneal-en-icl)).
86
+
87
+ - **Commercial use:** *Permitted and explicitly encouraged.* The model card says: "Artists and content creators are encouraged to sample and incorporate outputs into their own works, and even monetize them, with attribution to the model's name (\"YuE by HKUST/M-A-P\")."
88
+ - **Attribution:** Required for public / commercial outputs.
89
+ - **Recommended labeling:** outputs should be marked "AI-generated", "YuE-generated", "AI-assisted", or "AI-auxiliated".
90
+ - **No training-data redistribution clause:** Apache 2.0 covers the code and the released weights; the training data itself was *not* released, so no redistribution permission is granted on data.
91
+ - **Liability:** users bear sole responsibility for any copyright infringement, plagiarism, or misuse. Outputs likely carry no built-in watermarking or content credentials (the docs do not confirm this either way).
92
+
93
+ Practical takeaway for the user's Suno-like platform: **YuE is one of the very few music-generation foundation models with a clean, attribution-only commercial license**, which is the single most valuable thing about it.
94
+
95
+ ---
96
+
97
+ ## 5. Languages Supported
98
+
99
+ Five officially: **English, Mandarin Chinese, Cantonese, Japanese, Korean** ([GitHub README](https://github.com/multimodal-art-projection/YuE), [demo page](https://map-yue.github.io/)).
100
+
101
+ - English has the deepest training and the most-downloaded checkpoint.
102
+ - `zh` covers Mandarin and Cantonese (sharing a checkpoint).
103
+ - `jp-kr` shares one checkpoint for Japanese and Korean.
104
+ - The demo site shows code-switching (English ↔ Mandarin within the same song) working.
105
+ - No official support for Spanish, French, German, Hindi, Arabic, etc.; outputs in those languages will likely be poor or accented (no direct user reports confirm this, but architecturally the model has never seen them at scale).
106
+
107
+ ---
108
+
109
+ ## 6. Quality Assessment
110
+
111
+ **Strengths (from paper + demos):**
112
+ - Wide vocal range: the paper reports YuE "closely matching top-performing closed-source systems like Suno V4" on vocal-range metrics ([WhiteFiber summary](https://www.whitefiber.com/blog/yue-ai-music-generator)).
113
+ - Strong **musical structure**: verse/chorus/bridge transitions are coherent over 3–5 min, which most diffusion music models still struggle with.
114
+ - Demos show death-growl metal, scatting jazz, Beijing opera, rap, ballad, country, and soul; the *genre breadth* is genuinely impressive ([map-yue.github.io](https://map-yue.github.io/)).
115
+ - ICL mode can clone the timbre/style of a reference clip, making it the closest open-source analogue to Suno's "cover" or Udio's style transfer.
116
+
117
+ **Weaknesses (from paper's own discussion + community feedback):**
118
+ - **Acoustic fidelity gap.** Multiple sources, including the paper itself, note "clear deficiencies in vocal and accompaniment acoustic quality, likely due to limitations of its current audio tokenization method"; the authors propose super-resolution / better decoders as future work.
119
+ - **Mono / narrow stereo image:** third-party reviews call out that output "lacks the production quality needed for commercial music platforms" and is essentially mono ([articlex review](https://www.articlex.com/open-source-ai-music-generation-breakthrough-with-yue-software/)).
120
+ - **Slow inference + structural artifacts:** the explicit critique from the ACE-Step authors (ICLR 2026 submission) is that "LLM-based models like YuE excel at lyrics alignment but suffer from slow inference and structural artifacts" ([ACE-Step paper](https://arxiv.org/abs/2506.00045)).
121
+ - **Mumbling / lyric drift** appears in long sections. No Reddit thread documenting it surfaced in search, but the paper's "Section 12 Unsuccessful Attempts" and the emphasis on `--repetition-penalty` and decoding temperature in the GitHub Issues suggest users hit it.
122
+
123
+ **Quality verdict vs Suno v4 / v5:**
124
+ - Suno v4 ≈ YuE on *vocal range and genre breadth.*
125
+ - Suno v4/v5 clearly ahead on *mix polish, stereo width, vocal clarity, and emotional nuance.*
126
+ - YuE ahead of Suno only on *openness, controllability via lyrics tags, and structural macro-form for niche genres*.
127
+
128
+ ---
129
+
130
+ ## 7. Inference Performance
131
+
132
+ From the README's official hardware table:
133
+
134
+ | GPU | Time for 30 s of audio (Stage-1 + Stage-2), or capacity note |
135
+ |---|---|
136
+ | NVIDIA H800 80GB | **~150 s** |
137
+ | NVIDIA RTX 4090 24GB | **~360 s** |
138
+ | ≤24GB GPU | Max ~2 concurrent sessions; cannot generate a full song in one pass |
139
+ | ≥80GB GPU (H100/A100/H800) | Recommended for a full 4+ session song |
140
+
141
+ Extrapolating to a **3-minute song** (~6× a 30 s clip, plus some overhead for stitching):
142
+ - H800: ~15–18 minutes
143
+ - A100 80GB: ~18–22 minutes (likely; close to H800 throughput)
144
+ - RTX 4090: ~36–45 minutes
145
+ - M5 Max MPS (user's machine): **no official support, no public benchmark.**
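
The extrapolation above is plain linear scaling. A minimal sketch of the arithmetic, assuming the README's per-30-s timings scale linearly with song length; the `overhead` fraction for stitching is an illustrative assumption, not a measured figure, and `estimate_minutes` is a hypothetical helper:

```python
# Per-30-s generation times (seconds) taken from the README hardware table.
SECONDS_PER_30S_CLIP = {"H800": 150, "RTX 4090": 360}

def estimate_minutes(gpu, song_seconds=180, overhead=0.0):
    """Linear extrapolation of wall-clock minutes for a song of song_seconds.

    overhead is a hypothetical fractional allowance for session stitching.
    """
    clips = song_seconds / 30
    return SECONDS_PER_30S_CLIP[gpu] * clips * (1 + overhead) / 60

for gpu in SECONDS_PER_30S_CLIP:
    low = estimate_minutes(gpu)                  # no stitching overhead
    high = estimate_minutes(gpu, overhead=0.2)   # assumed 20 % overhead
    print(f"{gpu}: {low:.0f}-{high:.0f} min")
```

With zero overhead the floor is 15 minutes on the H800 and 36 minutes on the RTX 4090; the upper ends of the ranges depend entirely on the assumed overhead.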
146
+
147
+ **VRAM:** Unquantized FP16 Stage-1 needs ~16–18 GB; Stage-2 + upsampler add ~4–6 GB. Single-pass full-song generation comfortably wants 40–80 GB.
148
+
149
+ **Quantized / community paths:**
150
+ - **YuEGP** ("YuE for the GPU Poor") brings VRAM down to **<10 GB** via 8-bit quantization and sequential offload ([YuEGP repo](https://github.com/deepbeepmeep/YuEGP)).
151
+ - **YuE-exllamav2** claims up to **5× speedup** via ExLlamaV2 + FlashAttention-2 + BF16 ([YuE-exllamav2](https://github.com/sgsdxzy/YuE-exllamav2)); NVIDIA-only.
152
+ - **GGUF Stage-2** exists ([multimodalart/YuE-s2-1B-general-Q8_0-GGUF](https://huggingface.co/multimodalart/YuE-s2-1B-general-Q8_0-GGUF)). Stage-1 7B GGUF is not officially published as of 2026-05.
153
+
154
+ **Apple Silicon / MPS:**
155
+ - **No official MPS support.** GitHub README references `--cuda_idx`, no `mps` or `mac` mentions.
156
+ - No HF Space or fork advertises working MPS inference. The architecture is plain LLaMA2 + standard transformer ops, so an MPS port is *technically feasible* (likely, since Stage-1 fits well within the user's 128GB unified memory), but the X-Codec encoder/decoder has Flash-Attention CUDA kernels that would need replacement. Realistic path on M5 Max today: run the Stage-2 GGUF via the llama.cpp Metal backend; Stage-1 has no public Metal/MPS port.
157
+ - No community attempt at an MPS port has surfaced in any search or GitHub issue as of May 2026.
158
+
159
+ ---
160
+
161
+ ## 8. Repo Health
162
+
163
+ Data from the GitHub API on 2026-05-18 for `multimodal-art-projection/YuE`:
164
+
165
+ - **Stars:** 6,219
166
+ - **Forks:** 741
167
+ - **Open issues:** 86
168
+ - **License:** Apache-2.0
169
+ - **Default branch last push:** `2025-06-04T13:08:48Z`, i.e. **~11 months stale**
170
+ - **Most-recent commits:** all README edits and the finetune-merge PRs on the same day (2025-06-04).
171
+ - **Recent issue traffic (sampled 2025-Q4 through 2026-Q2):** install errors (CUDA / `codecmanipulator` missing), ComfyUI integration questions, attention-mask warnings, "how do I generate a full song" basics, a Feb-2026 PR proposing `SDPA as default attention` that received zero engagement. Maintainer responses are essentially absent in 2026.
172
+ - **Fine-tuning support:** present, merged June 2025 via PR #126 (LoRA, no QLoRA, requires CUDA 12.1+, PyTorch 2.4, Megatron-formatted JSONL data).
173
+ - **vLLM / SGLang:** listed in TODO, never implemented.
174
+ - **llama.cpp:** community Stage-2 GGUF exists but no official integration; Stage-1 not converted.
175
+ - **Tensor parallel / Stemgen mode:** TODO, never shipped.
176
+
177
+ **Verdict:** The repo is in **maintenance/abandonment limbo.** Apache 2.0 + open weights mean anyone can fork; community forks are where the energy is.
178
+
179
+ ---
180
+
181
+ ## 9. Real-World Adoption
182
+
183
+ - **Replicate:** Hosted at [`fofr/yue`](https://replicate.com/fofr/yue/api) with an official cog wrapper at [`replicate/cog-yue`](https://github.com/replicate/cog-yue); a production-ready pay-per-second API.
184
+ - **HuggingFace Spaces:** at least three live demos: [`fffiloni/YuE`](https://huggingface.co/spaces/fffiloni/YuE), [`innova-ai/YuE-music-generator-demo`](https://huggingface.co/spaces/innova-ai/YuE-music-generator-demo), `Harveyu/YuE-music-generator-demo`.
185
+ - **ComfyUI:** community node [`smthemex/ComfyUI_YuE`](https://github.com/smthemex/ComfyUI_YuE) exposes YuE as a node graph (issue #148 confirms active users in 2026).
186
+ - **Pinokio:** one-click Windows installer ships in the official Pinokio script directory ([pinokio.co](https://pinokio.co/)).
187
+ - **GPU-poor / consumer forks:** `deepbeepmeep/YuEGP` (sub-10 GB VRAM), `sgsdxzy/YuE-exllamav2` (5× speedup), `Mozer/YuE-extend` (mp3 extension + GUI), `Sorrymakershen/YuE-for-windows`.
188
+ - **SiliconFlow:** no public listing found as of 2026-05 (search returned no SiliconFlow YuE endpoint).
189
+ - **Forks:** 741 total, dominated by consumer-VRAM optimization rather than research extension.
190
+
191
+ For a Suno-like platform, the **Replicate `fofr/yue` endpoint is the lowest-friction starting point** to test quality before self-hosting.
192
+
193
+ ---
194
+
195
+ ## 10. Fine-Tuning
196
+
197
+ - **LoRA fine-tuning is documented and supported** since June 2025, in the [`finetune/` directory](https://github.com/multimodal-art-projection/YuE/tree/main/finetune) with `scripts/preprocess_data.sh` and `scripts/run_finetune.sh`.
198
+ - Configurable `LORA_R`, `LORA_ALPHA`, `LORA_DROPOUT`.
199
+ - **Training scripts are open:** Megatron-style data pipeline; data must be converted to JSONL containing X-Codec tokens + lyric/structure/genre metadata, then to Megatron binary.
200
+ - **QLoRA: not documented.** No 4-bit fine-tuning path is described in the official repo (community forks may have hacked one together, but none is confirmed).
201
+ - Requires CUDA 12.1+, PyTorch 2.4, Python 3.10; GPU memory not explicitly stated but realistically wants ≥40 GB VRAM for the 7B Stage-1 LoRA.
202
+ - No published guide for full-parameter fine-tuning of Stage-1; it is implied to need multi-node H100.
203
+
204
+ ---
205
+
206
+ ## 11. Pros and Cons
207
+
208
+ **Pros**
209
+ - True open weights (Apache 2.0), commercial-use-friendly, with attribution as the only notable requirement.
210
+ - Genuine dual-track output (vocals + instrumentals as separable streams), not just a mix.
211
+ - Multilingual coverage of EN / ZH / Cantonese / JP / KR with code-switching demos.
212
+ - Strong macro-structure for 3–5 minute songs: verses, choruses, bridges hold together.
213
+ - Healthy ecosystem of quantized / consumer-VRAM forks and a turnkey Replicate endpoint.
214
+ - LoRA fine-tuning code is shipped and merged.
215
+ - Comparable vocal range to Suno v4 on the paper's metrics.
216
+
217
+ **Cons**
218
+ - **Repo is effectively dormant since June 2025**, with no maintainer engagement on 2026 issues/PRs.
219
+ - Acoustic fidelity is noticeably below Suno v4/v5: mono-ish, less polished mix, occasional vocal artifacts/mumbling on long passages.
220
+ - **No MPS / Apple Silicon support**, official or community, which is a real problem for the user's M5 Max workflow.
221
+ - Slow inference even on H800 (~150 s per 30 s clip → 15+ minutes per full song before quantization).
222
+ - VRAM hungry: full-song single-pass wants 80 GB; consumer GPUs need session-stitching tricks.
223
+ - No QLoRA / no vLLM / no SGLang / no tensor parallel; all in TODO purgatory.
224
+ - Training data not released → fine-tuning needs you to bring your own licensed corpus.
225
+ - Tokenizer (X-Codec) is the bottleneck for fidelity, and YuE inherits this ceiling; no upgrade path is planned in this codebase.
226
+ - An explicit successor effort (ACE-Step) from an adjacent team claims to fix YuE's specific weaknesses.
227
+
228
+ ---
229
+
230
+ ## 12. Verdict for the User's Suno-like Platform
231
+
232
+ **Best fit for the user's M5 Max / 128 GB platform if:**
233
+ - The product needs **commercial-grade licensing freedom** above all else: YuE is one of the very few open music models you can ship in a paid product without licensing carve-outs.
234
+ - You target **multilingual song generation (EN + Mandarin/Cantonese + JP/KR)** with code-switching; YuE is the strongest open option here.
235
+ - You can offload generation to a **rented H100/H800 (Replicate, Runpod, Lambda)** rather than insisting on local M5 Max inference; *the lack of MPS support is the blocker on the user's hardware.*
236
+ - You want a base to **LoRA fine-tune on a proprietary genre/voice corpus**: the official fine-tune scripts work today, and Apache 2.0 lets you keep your LoRA private and commercial.
237
+
238
+ **Where YuE will underperform competitors:**
239
+ - **Acoustic polish:** Suno v4/v5 and Udio will sound noticeably more professional out of the box. If your platform's selling point is "studio-quality vocals", YuE is not there.
240
+ - **Throughput per dollar:** diffusion-based ACE-Step and DiffRhythm-2 are dramatically faster (ACE-Step claims ~15× speedup); for a high-volume product, the AR-LLM architecture is expensive.
241
+ - **Real-time / interactive generation:** not viable; YuE is batch-only.
242
+ - **Local Mac inference:** until somebody ports Stage-1 to MPS or ships a Stage-1 GGUF, the user's M5 Max can at best play around with the Stage-2 model in llama.cpp Metal mode.
243
+
244
+ **Concrete recommendation for the user:** use YuE via Replicate's `fofr/yue` endpoint as the **commercial-license-clean fallback / multilingual specialist** in the platform's model router, and seriously evaluate ACE-Step in parallel for the throughput-sensitive default path. Plan a future LoRA fine-tune on YuE only after the platform has clear vertical (genre, language, or vocal-style) demand that the closed APIs cannot serve.
245
+
246
+ ---
247
+
248
+ ## References
249
+
250
+ - GitHub repo: <https://github.com/multimodal-art-projection/YuE>
251
+ - Paper (arXiv): <https://arxiv.org/abs/2503.08638>
252
+ - Paper (HTML): <https://arxiv.org/html/2503.08638v1>
253
+ - OpenReview: <https://openreview.net/forum?id=hZy6YG2Ij8>
254
+ - Project / demos: <https://map-yue.github.io/>
255
+ - HF collection: <https://huggingface.co/collections/m-a-p/yue-6797d55e22990ae89b90a3d6>
256
+ - HF s1 English ICL card: <https://huggingface.co/m-a-p/YuE-s1-7B-anneal-en-icl>
257
+ - Replicate: <https://replicate.com/fofr/yue/api>
258
+ - Replicate cog: <https://github.com/replicate/cog-yue>
259
+ - YuEGP fork: <https://github.com/deepbeepmeep/YuEGP>
260
+ - YuE-exllamav2 fork: <https://github.com/sgsdxzy/YuE-exllamav2>
261
+ - YuE-extend fork: <https://github.com/Mozer/YuE-extend>
262
+ - ComfyUI node: <https://github.com/smthemex/ComfyUI_YuE>
263
+ - GGUF Stage-2: <https://huggingface.co/multimodalart/YuE-s2-1B-general-Q8_0-GGUF>
264
+ - HF X-Codec docs: <https://huggingface.co/docs/transformers/main/model_doc/xcodec>
265
+ - ACE-Step paper (successor-style critique): <https://arxiv.org/abs/2506.00045>
266
+ - WhiteFiber technical summary: <https://www.whitefiber.com/blog/yue-ai-music-generator>
267
+ - HF Space demo (fffiloni): <https://huggingface.co/spaces/fffiloni/YuE>
268
+ - HF Space demo (innova-ai): <https://huggingface.co/spaces/innova-ai/YuE-music-generator-demo>
research/02_diffrhythm.md ADDED
@@ -0,0 +1,138 @@
1
+ # DiffRhythm and DiffRhythm 2: Deep Technical Review
2
+
3
+ *Compiled 2026-05-18. All claims cited; speculation flagged inline.*
4
+
5
+ ## 1. Overview
6
+
7
+ DiffRhythm is the first open-source **latent-diffusion full-song generator** (vocals + accompaniment, end-to-end, from lyrics and a style prompt), built by the **Audio, Speech and Language Processing (ASLP) Lab at Northwestern Polytechnical University (NWPU)** in Xi'an, China, with later contributions from **Xiaomi Research** ([arxiv.org/abs/2503.01183](https://arxiv.org/abs/2503.01183), [github.com/ASLP-lab/DiffRhythm](https://github.com/ASLP-lab/DiffRhythm)). DiffRhythm v1 dropped on **arXiv 3 Mar 2025**; the full 4m45s variant followed on **15 Mar 2025**, and an iterative v1.2 fixed repetition and audio-quality issues mid-2025 ([HF v1.2 commit](https://huggingface.co/spaces/ASLP-lab/DiffRhythm/commit/f5b749d65f62e30bdaad11e6866edc8d3b078b71)). **DiffRhythm 2** appeared on **arXiv 27 Oct 2025** (v3 revised 3 Feb 2026) under [arxiv.org/abs/2510.22950](https://arxiv.org/abs/2510.22950), and was open-sourced at [github.com/ASLP-lab/DiffRhythm2](https://github.com/ASLP-lab/DiffRhythm2) (forked from `xiaomi-research/diffrhythm2`) on **30 Oct 2025**, with HuggingFace weights at [huggingface.co/ASLP-lab/DiffRhythm2](https://huggingface.co/ASLP-lab/DiffRhythm2). The series is the leading **diffusion-side** alternative to the LLM-style approach taken by Suno, YuE, and SongBloom.
8
+
9
+ ## 2. Architecture
10
+
11
+ DiffRhythm v1 is a **non-autoregressive (NAR) latent diffusion** model with two pieces: a music **VAE** that compresses raw 44.1 kHz stereo audio into a latent grid, and a **DiT** (Diffusion Transformer) that denoises that grid conditioned on lyrics + style ([nzqian.github.io/DiffRhythm](https://nzqian.github.io/DiffRhythm/)). The DiT uses **16 LLaMA-style decoder layers, 2048 hidden dim, 32 heads × 64 dim, totaling ~1.1B parameters** ([arxiv.org/html/2503.01183](https://arxiv.org/html/2503.01183v1)). Vocals and accompaniment are produced **jointly in a single latent stream** (not dual-track), which is what makes it "embarrassingly simple" vs. cascaded systems. Lyric conditioning is **sentence-level via LRC (timestamped) phonemes**, with the diffusion model expected to align internally; style is conditioned either via a reference audio embedding or a text prompt. Inference uses a **32-step Euler ODE solver with CFG scale 4**; both conditions are dropped 20% of the time during training so that CFG is possible at inference ([diffrhythm.us](https://diffrhythm.us/)).
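
To make the sampler description concrete, here is a minimal pure-Python sketch of fixed-step Euler integration with classifier-free guidance. It is not DiffRhythm's code: `toy_velocity` is a hypothetical stand-in for the DiT, and the guidance formula `v_uncond + scale * (v_cond - v_uncond)` is one common CFG formulation that the paper is assumed, not confirmed, to use.

```python
def cfg_velocity(v_cond, v_uncond, scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one by `scale`."""
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]

def euler_sample(velocity_fn, x0, num_steps=32):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def toy_velocity(target):
    """Stand-in for the DiT (assumption): the velocity field that moves
    any point x to `target` by t=1."""
    return lambda x, t: [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]

cond = toy_velocity([1.0, -2.0])    # prediction with lyrics/style conditions
uncond = toy_velocity([0.0, 0.0])   # prediction with conditions dropped

def guided(x, t):
    return cfg_velocity(cond(x, t), uncond(x, t), scale=4.0)

# 32 Euler steps with CFG scale 4, starting from an arbitrary "noise" point.
sample = euler_sample(guided, x0=[0.3, 0.7], num_steps=32)
print(sample)
```

In this toy setup the guided sample lands at four times the conditional target, which shows how a CFG scale above 1 exaggerates the conditional direction. The unconditional branch only exists because the conditions were randomly dropped during training.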
12
+
13
+ **DiffRhythm 2** replaces the pure-NAR DiT with a **semi-autoregressive block flow-matching** transformer: the latent sequence is sliced into **blocks of 10 frames (2 s at 5 Hz)**, and "each block is generated with flow matching, while the dependency across blocks is handled autoregressively" ([alphaxiv.org/overview/2510.22950v3](https://www.alphaxiv.org/overview/2510.22950v3); quoted via search snippet). This is the key innovation: it preserves NAR-style fast within-block parallelism while letting the model attend to prior blocks for **structural coherence** (verse → chorus → verse) and **lyric alignment without any external aligner**. The audio codec is a new **music VAE at 5 Hz frame rate** (vs. the much higher rates of EnCodec/DAC) with a **170M-param decoder**, enabling 210s of latent context to fit on a single GPU ([arxiv abs](https://arxiv.org/abs/2510.22950)). The full DiT is **~1B parameters**. Two new training objectives appear: **Stochastic Block Representation Alignment (REPA) loss** to align hidden states of clean vs. noisy blocks (improves musicality/structure), and **Cross-Pair Preference Optimization**, an RLHF variant that groups the four preference dimensions (musicality, style similarity, lyric alignment, audio quality) into pairs to dodge the merging-induced regression that plain DPO causes. **Max song length: 210 s** in v2 vs. **4m45s (~285 s)** in v1-full ([github.com/ASLP-lab/DiffRhythm](https://github.com/ASLP-lab/DiffRhythm)).
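
The block bookkeeping can be sketched in a few lines. This only illustrates the 5 Hz / 10-frame slicing and the autoregressive outer loop; `block_schedule`, `generate_song`, and the `generate_block` callback are hypothetical names, and the flow-matching step inside each block is left as a stub.

```python
def block_schedule(duration_s, frame_rate_hz=5, block_frames=10):
    """Return (start_s, end_s) for each block of the latent sequence."""
    total_frames = int(duration_s * frame_rate_hz)
    block_s = block_frames / frame_rate_hz
    n_blocks = -(-total_frames // block_frames)  # ceiling division
    return [(i * block_s, min((i + 1) * block_s, duration_s))
            for i in range(n_blocks)]

def generate_song(duration_s, generate_block):
    """Semi-autoregressive loop: each block is produced in one shot but is
    conditioned on every block generated before it."""
    history = []
    for start_s, end_s in block_schedule(duration_s):
        # generate_block is a placeholder for per-block flow matching.
        history.append(generate_block(history, start_s, end_s))
    return history

blocks = block_schedule(210)
print(len(blocks), "blocks for a 210 s song")
```

A 210 s song at 5 Hz is 1,050 latent frames, i.e. 105 blocks of 2 s each, so the autoregressive dependency spans about a hundred steps rather than thousands of individual frames.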
14
+
15
+ ## 3. Variants and sizes
16
+
17
+ | Checkpoint | Duration | DiT params | Notes | Source |
18
+ |---|---|---|---|---|
19
+ | `DiffRhythm-base` | 1m35s | ~1.1B | Original Mar 2025 | [HF](https://huggingface.co/ASLP-lab/DiffRhythm-base) |
20
+ | `DiffRhythm-full` | 4m45s | ~1.1B | Released 15 Mar 2025 | [HF](https://huggingface.co/ASLP-lab/DiffRhythm-full) |
21
+ | `DiffRhythm-vae` | – | – | Shared audio VAE | [HF](https://huggingface.co/ASLP-lab/DiffRhythm-vae) |
22
+ | `DiffRhythm-1_2-base` | 1m35s | ~1.1B | v1.2 quality fix | [GH README](https://github.com/ASLP-lab/DiffRhythm) |
23
+ | `DiffRhythm-1_2-full` | 4m45s | ~1.1B | v1.2, text-style + instrumental | [HF](https://huggingface.co/ASLP-lab/DiffRhythm-1_2-full) |
24
+ | `DiffRhythm+` (paper) | full | ~1.1B | Adds DPO; not headlined as separate checkpoint | [arxiv 2507.12890](https://arxiv.org/html/2507.12890v2) |
25
+ | `DiffRhythm2` | 210 s | ~1B DiT + 170M VAE-dec | Block flow matching | [HF](https://huggingface.co/ASLP-lab/DiffRhythm2) |
26
+
27
+ (Speculation: I did not find an explicit param count posted for v2's DiT; the **~1B figure comes from a paper-extraction snippet** and aligns with v1's ~1.1B body. Treat as approximate.)
28
+
29
+ ## 4. License
30
+
31
+ **Apache 2.0** for both code and DiT weights, declared on the v1 GitHub README and reaffirmed on the v2 README ([github.com/ASLP-lab/DiffRhythm](https://github.com/ASLP-lab/DiffRhythm), [github.com/ASLP-lab/DiffRhythm2](https://github.com/ASLP-lab/DiffRhythm2)). **Commercial use is permitted** with attribution. The v2 model card adds a **non-binding ethical disclaimer** asking users to verify originality, disclose AI involvement, and respect stylistic copyright; this is a notice, not an enforceable license restriction ([HF model card](https://huggingface.co/ASLP-lab/DiffRhythm2)).
32
+
33
+ ## 5. Languages supported
34
+
35
+ Training is heavily **bilingual (Mandarin + English)**: v2's dataset is reported as **Chinese : English : Instrumental ≈ 4 : 5 : 1** ([alphaXiv extract](https://www.alphaxiv.org/overview/2510.22950v3)). The v1 README and several mirrors claim **cross-lingual capability** for Japanese, Korean, Spanish ([diffrhythm.us](https://diffrhythm.us/), [diffrhythmai.com](https://diffrhythmai.com/)), but these are demo-site marketing claims, **not benchmarked in the paper**. Verdict: production-safe for **EN and ZH**; treat JP/KR/ES as best-effort. Phoneme front-end is **espeak-ng**, which itself supports 100+ languages ([HF model card](https://huggingface.co/ASLP-lab/DiffRhythm2)).
36
+
37
+ ## 6. Quality assessment
38
+
39
+ **Objective (v2 paper, lower=better for PER, higher=better for Mulan-T):**
40
+
41
+ | Metric | DiffRhythm 2 | DiffRhythm+ | ACE-Step | LeVo |
42
+ |---|---|---|---|---|
43
+ | PER (lyric alignment) ↓ | **0.13** | 0.15 | 0.23 | 0.19 |
44
+ | Mulan-T (style match) ↑ | **0.40** | 0.25 | 0.28 | 0.35 |
45
+ | RTF (speed) ↓ | 0.213 | 0.153 | 0.127 | 1.225 |
46
+
47
+ So v2 has **best-in-open-source lyric alignment and style match**, slightly slower than DiffRhythm+ and ACE-Step but ~6× faster than LeVo ([arxiv 2510.22950](https://arxiv.org/abs/2510.22950)).
48
+
49
+ **Subjective:** v2 is the strongest open model by MOS in the paper's own user study, **but the authors explicitly state "in aspects such as musicality, it still shows a clear gap compared to commercial systems like SUNO V4.5"** ([arxiv 2510.22950](https://arxiv.org/abs/2510.22950)). The **block flow-matching does close the structural-coherence gap** that the original Hacker News thread criticized v1 for: multiple HN commenters complained "there's no identifiable chorus in any of the demo songs" and that rhythm was unstable ([news.ycombinator.com/item?id=43255467](https://news.ycombinator.com/item?id=43255467)). v2 demos show real verse/chorus structure ([aslp-lab.github.io/DiffRhythm2.github.io](https://aslp-lab.github.io/DiffRhythm2.github.io/)). Specific Reddit reception threads in r/LocalLLaMA/r/StableDiffusion were not surfaced by search (low signal).
50
+
51
+ ## 7. Inference performance
52
+
53
+ - v1-full: **~10 s for a 4m45s song on a single RTX 4090** (claimed in paper abstract, [arxiv 2503.01183](https://arxiv.org/abs/2503.01183)), using 32 ODE steps. Real-world ComfyUI users report **~62 s for 4 min** on consumer GPUs ([comfyui.org](https://comfyui.org/en/generate-music-with-comfyui-diffrhythm)).
54
+ - **VRAM:** DiffRhythm-base needs ≥ **8 GB** with `--chunked`; full needs **24 GB** for headroom ([chutes.ai docs](https://chutes.ai/docs/examples/music-generation)).
55
+ - v2: **RTF 0.213 on RTX 4090** → ~45 s for a 210 s song ([arxiv 2510.22950](https://arxiv.org/abs/2510.22950)).
56
+ - **Apple Silicon / MPS:** The v1 README claims Apple Silicon is "supported as of March 2025" but the GitHub issues list does not surface dedicated MPS benchmarks, and the Pinokio launcher ([github.com/pinokiofactory/diffrhythm](https://github.com/pinokiofactory/diffrhythm)) does not advertise macOS in its description. **No published M3/M4/M5 numbers exist.** Speculation: on the user's **M5 Max with 128 GB unified memory**, v1-full should run via `PYTORCH_ENABLE_MPS_FALLBACK=1`, likely 3–5× slower than a 4090; this needs hands-on validation. v2 is newer and has not been tested on MPS publicly.
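
The RTF figures convert to wall-clock time by simple multiplication. A small sketch, assuming the usual definition of real-time factor (generation time divided by audio duration), which matches the ~45 s figure quoted above; the values are copied from the v2 paper's table in section 6 and `generation_seconds` is an illustrative helper:

```python
# RTF values as reported in the v2 paper's comparison table.
RTF = {"DiffRhythm 2": 0.213, "DiffRhythm+": 0.153, "ACE-Step": 0.127, "LeVo": 1.225}

def generation_seconds(rtf, audio_seconds):
    """RTF = generation time / audio duration, so time = RTF * duration."""
    return rtf * audio_seconds

v2_time = generation_seconds(RTF["DiffRhythm 2"], 210)
levo_vs_v2 = RTF["LeVo"] / RTF["DiffRhythm 2"]
print(f"v2, 210 s song: {v2_time:.1f} s; LeVo/v2 RTF ratio: {levo_vs_v2:.2f}")
```

By the same definition, the v1 claim of ~10 s for a 285 s song corresponds to an RTF of roughly 0.035, and the ComfyUI report of ~62 s for 4 minutes to roughly 0.26.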
57
+
58
+ ## 8. DiffRhythm 2 specifics
59
+
60
+ What changed from v1 β†’ v2 ([arxiv 2510.22950](https://arxiv.org/abs/2510.22950), [alphaxiv overview](https://www.alphaxiv.org/overview/2510.22950v3)):
61
+
62
+ 1. **Architecture shift:** pure NAR DiT → **semi-AR block flow-matching** (2 s blocks).
63
+ 2. **New 5 Hz music VAE** (vs. v1's higher-rate codec), which enables 210 s context within budget.
64
+ 3. **Stochastic Block REPA loss:** aligns clean vs. noisy hidden states → better musicality + structure.
65
+ 4. **Cross-Pair Preference Optimization:** four-dim RLHF without the model-merging regression that plain DPO causes.
66
+ 5. **Dataset scaling:** **~1.4 M songs / ~70,000 hours**, with a **20 k-hour high-quality subset** for SFT and **40 k preference pairs** for DPO; a step-change from v1's undisclosed-but-smaller corpus.
67
+ 6. **Lyric alignment without external constraints:** v1 needed LRC timestamps; v2 learns alignment end-to-end via the AR block dependency.
68
+ 7. **Quality numbers (paper):** PER **0.15 → 0.13**, Mulan-T **0.25 → 0.40** vs. DiffRhythm+, i.e. **lyric-error reduced ~13 % and style-match improved by 60 %**.
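
The relative changes in item 7 follow directly from the paper's table values. A quick check (`relative_change` is an illustrative helper):

```python
def relative_change(old, new):
    """Fractional change from old to new (negative means a reduction)."""
    return (new - old) / old

per = relative_change(0.15, 0.13)       # phoneme error rate, lower is better
mulan_t = relative_change(0.25, 0.40)   # style-match score, higher is better
print(f"PER: {per:+.1%}, Mulan-T: {mulan_t:+.1%}")
```

PER falls by about 13 % and Mulan-T rises by 60 %, a 1.6× ratio.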
69
+
70
+ ## 9. Repo health
71
+
72
+ - **DiffRhythm v1:** ~**2.2–2.3 k stars**, **268 forks**, active through 2025, last major release Mar 2025 ([github.com/ASLP-lab/DiffRhythm](https://github.com/ASLP-lab/DiffRhythm)).
73
+ - **DiffRhythm 2:** **157 stars / 11 forks / 27 commits** as of late Oct 2025; a young, recently pushed repo ([github.com/ASLP-lab/DiffRhythm2](https://github.com/ASLP-lab/DiffRhythm2)).
74
+ - Training/fine-tuning scripts: **"Coming soon"** is the status on v1; community has filed [Issue #46](https://github.com/ASLP-lab/DiffRhythm/issues/46) asking for fine-tuning docs. v2 ships **inference only** in the public repo as of writing.
75
+
76
+ ## 10. Real-world adoption
77
+
78
+ - **ComfyUI:** [billwuhao/ComfyUI_DiffRhythm](https://github.com/billwuhao/ComfyUI_DiffRhythm): 153 stars, supports v1.2 + full, includes bilingual subtitle gen ([runcomfy.com node](https://www.runcomfy.com/comfyui-nodes/ComfyUI_DiffRhythm)).
79
+ - **Pinokio:** [pinokiofactory/diffrhythm](https://github.com/pinokiofactory/diffrhythm): 19 stars, 69 commits, one-click installer.
80
+ - **Chutes.ai:** Public serverless endpoint for DiffRhythm-full ([chutes.ai/docs/examples/music-generation](https://chutes.ai/docs/examples/music-generation)).
81
+ - **Replicate:** No first-party DiffRhythm 2 model found in search; this looks like a gap in the ecosystem (speculation).
82
+ - Multiple unofficial web frontends: diffrhythm.com, diffrhythm.us, diffrhythm.ai, diffrhythmai.com. Quality and origin are unverified; they are likely wrappers over the HF Space.
83
+
84
+ ## 11. Fine-tuning
85
+
86
+ The official answer is **none yet**. The v1 repo's training code is listed as "Coming soon," and v2 only ships inference. There is no LoRA support, no published fine-tuning recipe, and no `transformers`/`diffusers` integration as of May 2026. A community workaround would require reverse-engineering the DiT class, which is non-trivial for a 1 B-param flow-matching model. **For the user's Suno-clone platform, fine-tuning DiffRhythm today means forking + writing your own training loop.** This is the single biggest practical weakness.
87
+
88
+ ## 12. Pros and cons
89
+
90
+ **Pros**
91
+ - Permissive **Apache 2.0** for code + weights; a clean commercial path.
92
+ - **Fastest open full-song model** (~10 s for a 4m45s song on a 4090 per the v1 paper; v2's block-FM is competitive even with AR-like coherence).
93
+ - v2 has **state-of-the-art lyric alignment (PER 0.13)** in open source.
94
+ - Lightweight: 8 GB VRAM possible with chunking, so it runs on consumer GPUs.
95
+ - Strong ecosystem: ComfyUI nodes, Pinokio installer, Chutes serverless.
96
+ - v2's block flow-matching meaningfully **closes the structural-coherence gap** that doomed v1 demos on HN.
97
+
98
+ **Cons**
99
+ - Still a **clear musicality gap vs. Suno v4.5** (authors admit it; [arxiv 2510.22950](https://arxiv.org/abs/2510.22950)).
100
+ - **No fine-tuning / LoRA path**: training code unreleased.
101
+ - v2's max length is **210 s** (3m30s), *shorter* than v1-full's 4m45s, which is a regression for radio-length pop.
102
+ - Multilingual claims (JP/KR/ES) are **unbenchmarked**; only EN/ZH have paper-backed quality.
103
+ - **No published MPS benchmarks** for Apple Silicon; v2 untested on Mac.
104
+ - Demo-site proliferation (`diffrhythm.us`, etc.) muddies the brand, which is confusing for product positioning.
105
+ - License disclaimer adds soft ethical obligations re. copyright that legal review may flag.
106
+
107
+ ## 13. Verdict for the user's platform
108
+
109
+ For a Suno-style platform on an **M5 Max (128 GB unified, MPS)**, DiffRhythm 2 is the **best diffusion-side open option in May 2026**, *but* it should be paired with an **AR-style backup** (YuE / SongBloom / LeVo) covering its weak points.
110
+
111
+ **Where DiffRhythm 2 wins:**
112
+ - Fast, cheap inference per song; viable for high-throughput web generation.
113
+ - Best-in-open lyric intelligibility; critical for a karaoke / lyrics-first UX.
114
+ - Stereo 44.1 kHz output out of the box.
115
+ - Apache-2.0 + commercial freedom.
116
+
117
+ **Where it underperforms:**
118
+ - **Pop musicality, hook quality, vocal timbre** are still below Suno v4.5; premium-tier output is not there.
119
+ - **No fine-tuning** means you cannot specialize on a target sound or your platform's curated catalog without doing R&D.
120
+ - **210 s ceiling on v2** limits "full album track" formats; you'd fall back to v1-full (4m45s) at a quality cost.
121
+ - **MPS path is unproven**: the user should plan a same-week feasibility test on the M5 Max before committing v2 to the inference layer; CUDA cloud (Chutes / a 4090 server) is the safer near-term backend.
122
+
123
+ **Recommended posture:** ship v2 as the default *fast* generator behind a feature flag, keep v1.2-full for >3.5 min songs, evaluate Suno / YuE / SongBloom as quality-tier alternatives, and track the v2 repo for an eventual training-code release that would unlock fine-tuning on your platform's data.
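
The length-based fallback in this posture could be expressed as a small routing function. This is a hypothetical sketch for the platform's own inference layer: `pick_generator` and the `v2_enabled` flag are illustrative names, not part of any DiffRhythm API, and the ceilings are the 210 s and 4m45s limits cited above.

```python
V2_MAX_S = 210        # DiffRhythm 2 maximum song length
V1_FULL_MAX_S = 285   # DiffRhythm v1.2-full maximum song length (4m45s)

def pick_generator(duration_s, v2_enabled=True):
    """Choose a checkpoint by requested duration; v2 sits behind a feature flag."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    if v2_enabled and duration_s <= V2_MAX_S:
        return "DiffRhythm2"
    if duration_s <= V1_FULL_MAX_S:
        return "DiffRhythm-1_2-full"
    raise ValueError(f"{duration_s} s exceeds the {V1_FULL_MAX_S} s ceiling")

print(pick_generator(180), pick_generator(240), pick_generator(180, v2_enabled=False))
```

Turning the flag off routes every request to v1.2-full, which gives a quick rollback path if v2 misbehaves on the chosen backend.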
124
+
125
+ ---
126
+
127
+ ### Primary sources
128
+ - [DiffRhythm 2 paper (arxiv 2510.22950)](https://arxiv.org/abs/2510.22950)
129
+ - [DiffRhythm v1 paper (arxiv 2503.01183)](https://arxiv.org/abs/2503.01183)
130
+ - [DiffRhythm v1 GitHub](https://github.com/ASLP-lab/DiffRhythm)
131
+ - [DiffRhythm 2 GitHub](https://github.com/ASLP-lab/DiffRhythm2)
132
+ - [DiffRhythm 2 HF model card](https://huggingface.co/ASLP-lab/DiffRhythm2)
133
+ - [alphaXiv overview v3](https://www.alphaxiv.org/overview/2510.22950v3)
134
+ - [HN thread on v1](https://news.ycombinator.com/item?id=43255467)
135
+ - [ComfyUI_DiffRhythm](https://github.com/billwuhao/ComfyUI_DiffRhythm)
136
+ - [Pinokio DiffRhythm](https://github.com/pinokiofactory/diffrhythm)
137
+ - [Chutes serving docs](https://chutes.ai/docs/examples/music-generation)
138
+ - [DiffRhythm+ paper (arxiv 2507.12890)](https://arxiv.org/html/2507.12890v2)
research/03_acestep.md ADDED
@@ -0,0 +1,224 @@
1
+ # ACE-Step: Deep Technical Report
2
+
3
+ *Researched 2026-05-18 for a Suno-like platform build on M5 Max (128 GB unified) / MPS.*
4
+
5
+ ---
6
+
7
+ ## 1. Overview
8
+
9
+ ACE-Step is a foundation model for music generation jointly built by **ACE Studio** (the consumer music-tech company behind the vocal synthesizer of the same name) and **StepFun** ("Step-AI"), a Beijing-based foundation-model lab. Core authors: Junmin Gong, Sean Zhao, Sen Wang, Shengyuan Xu, Joe Guo ([ace-step.github.io](https://ace-step.github.io/)).
10
+
11
+ Release timeline:
12
+ - **v1 (3.5B):** open-sourced May 2025; technical report posted on arXiv on 2 Jun 2025 as 2506.00045 ([arxiv.org/abs/2506.00045](https://arxiv.org/abs/2506.00045)).
13
+ - **v1.5:** released **28 Jan 2026** as a separate repo, [`ace-step/ACE-Step-1.5`](https://github.com/ace-step/ACE-Step-1.5). Adds a hybrid Language-Model + Diffusion-Transformer planner.
14
+ - **XL series (4B DiT decoder):** released 2 Apr 2026 as a higher-quality variant inside the v1.5 family.
15
+ - **Latest tag:** v0.1.7 on 24 Apr 2026 ([ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)).
16
+ - **v2:** **no public roadmap or announcement** as of 18 May 2026.
17
+
18
+ Current status: actively maintained, 10.4k stars on the v1.5 repo and 4.5k on the original v1 repo, with a thriving ComfyUI ecosystem and third-party UIs ([ace-step/ACE-Step](https://github.com/ace-step/ACE-Step), [ace-step/ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)).
19
+
20
+ ---
21
+
22
+ ## 2. Architecture
23
+
24
+ **v1 (3.5B):** a hybrid that fuses three pieces (per the paper, [arxiv.org/abs/2506.00045](https://arxiv.org/abs/2506.00045)):
25
+ 1. **Sana Deep Compression AutoEncoder (DCAE):** high-compression audio latent space borrowed from NVIDIA's Sana image work.
26
+ 2. **Lightweight linear transformer:** the diffusion backbone, deliberately built on linear attention to keep inference fast.
27
+ 3. **Diffusion training** with **MERT + m-HuBERT** providing semantic-alignment supervision (REPA-style) during training so latents stay musically coherent.
28
+
29
+ This sits between LLM-token approaches (Suno/YuE, slow but lyric-tight) and pure diffusion (DiffRhythm, fast but structurally weak). The design goal stated in the paper is "a fast, general-purpose, efficient yet flexible architecture", explicitly a *foundation model*, not just a text-to-song pipeline ([arxiv.org/abs/2506.00045](https://arxiv.org/abs/2506.00045)).
30
+
31
+ **v1.5:** a hybrid **LM-as-planner + Diffusion-Transformer (DiT)**. A small Qwen3-based LM (0.6B / 1.7B / 4B) turns the user prompt into a structured "song blueprint" (sections, key, bpm, lyrics, vocal style), which the DiT (2B standard or 4B XL) decodes into audio. This brings chain-of-thought reasoning to music structure and lifts long-range coherence, historically Suno's main advantage ([ACE-Step-1.5 README](https://github.com/ace-step/ACE-Step-1.5)).
32
+
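To make the planner/decoder split concrete, here is a minimal sketch of what a "song blueprint" could look like as a data structure. The source only names the kinds of fields (sections, key, bpm, lyrics, vocal style); the class names, field names and rendering format below are assumptions for illustration, not the actual ACE-Step 1.5 schema.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    name: str    # e.g. "verse", "chorus", "bridge"
    lyrics: str  # lyric text sung in this section


@dataclass
class SongBlueprint:
    """Hypothetical container for the planner LM's structured output."""
    key: str
    bpm: int
    vocal_style: str
    sections: list[Section] = field(default_factory=list)

    def full_lyrics(self) -> str:
        # Render the sections as tagged lyrics, one tagged block per section.
        return "\n\n".join(f"[{s.name}]\n{s.lyrics}" for s in self.sections)


blueprint = SongBlueprint(
    key="A minor",
    bpm=92,
    vocal_style="breathy alto",
    sections=[Section("verse", "first line"), Section("chorus", "hook line")],
)
print(blueprint.full_lyrics())
```

A typed intermediate like this is one way to let a product UI inspect or edit the plan before any audio is decoded.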
33
+ **Parameter counts:**
34
+ | Variant | DiT | LM planner | Total |
35
+ |---|---|---|---|
36
+ | v1-3.5B | 3.5B (DiT only) | n/a | 3.5B |
37
+ | v1.5 standard | 2B | 0.6B / 1.7B | ~2.6–3.7B |
38
+ | v1.5 XL | 4B | up to 4B | up to 8B |
39
+
40
+ ---
41
+
42
+ ## 3. Variants and checkpoints
43
+
44
+ All on Hugging Face under the `ACE-Step/` org ([ACE-Step org on HF](https://huggingface.co/ACE-Step)):
45
+ - `ACE-Step-v1-3.5B`: the original generalist model ([HF card](https://huggingface.co/ACE-Step/ACE-Step-v1-3.5B)).
46
+ - `ACE-Step-v1-chinese-rap-LoRA` ("RapMachine"): genre-specific LoRA.
47
+ - **LoRA family** shipped by the team: `RapMachine`, `Lyric2Vocal` (vocal-only stem from lyrics), `Text2Samples` (instrumental loops/samples) ([ace-step.github.io](https://ace-step.github.io/)).
48
+ - **v1.5 DiT checkpoints:** 2B standard and 4B XL.
49
+ - **v1.5 LM planners:** 0.6B, 1.7B, 4B.
50
+ - A public **Space demo** at [huggingface.co/spaces/ACE-Step/ACE-Step](https://huggingface.co/spaces/ACE-Step/ACE-Step).
51
+
52
+ No v2 checkpoint exists yet.
53
+
54
+ ---
55
+
56
+ ## 4. License
57
+
58
+ **Apache 2.0** for v1 ([ace-step/ACE-Step](https://github.com/ace-step/ACE-Step)) and **MIT** for v1.5 ([ace-step/ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)). Both are unambiguously **commercial-use-permitted, royalty-free**. This is the single biggest licensing advantage over Suno/Udio and even over YuE (which carries non-commercial clauses in parts of its weights chain).
59
+
60
+ ---
61
+
62
+ ## 5. Vocal support: CRITICAL VERIFICATION
63
+
64
+ **Verdict: YES. ACE-Step generates vocals natively. The "instrumental-only" claim circulating in some reviews is wrong (likely conflating it with the `Text2Samples` LoRA or with DiffRhythm).**
65
+
66
+ Evidence:
67
+ - The **v1 HF model card** describes the model as full-song (vocals + instruments) with the explicit caveat: *"Coarse vocal synthesis lacking nuance"* and *"Rare instruments may not render perfectly"* ([HF card](https://huggingface.co/ACE-Step/ACE-Step-v1-3.5B)).
68
+ - The paper claims **lyric alignment across melody/harmony/rhythm metrics**, which is only meaningful for sung vocals ([arxiv.org/abs/2506.00045](https://arxiv.org/abs/2506.00045)).
69
+ - The ComfyUI native node `TextEncodeAceStepAudio` accepts lyrics with `[verse] [chorus] [bridge]` structural tags ([comfyui-wiki guide](https://comfyui-wiki.com/en/tutorial/advanced/audio/ace-step/ace-step-v1)).
70
+ - `Lyric2Vocal` LoRA exists *because* the base model already does vocals; the LoRA isolates the vocal stem ([ace-step.github.io](https://ace-step.github.io/)).
71
+ - Blind-listening review of 50 participants scored ACE-Step v1.5 **4.4/5 on SongEval Vocal vs Suno v4 at 4.1/5** ([fm9.ai/ace-step/vs-suno](https://fm9.ai/ace-step/vs-suno)).
72
+
73
+ **Quality reality check:** v1 vocals are admitted to be "coarse"; v1.5 markedly improves vocal clarity and now beats Suno v4 in blind tests on naturalness for folk/classical/jazz, while Suno still wins on "radio-ready polish" for pop/EDM ([fm9.ai/ace-step/vs-suno](https://fm9.ai/ace-step/vs-suno)).
74
+
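The structural lyric tags mentioned in the evidence list (`[verse]`, `[chorus]`, `[bridge]`) are easy to validate before a prompt is sent to the model. The parser below is a hypothetical pre-processing helper, not part of ACE-Step or ComfyUI; it assumes each tag sits alone on its own line and silently drops any text that appears before the first tag.

```python
import re

# A tag is a single bracketed word on its own line, e.g. "[verse]".
TAG = re.compile(r"^\[(\w+)\]\s*$")


def split_sections(lyrics: str) -> list[tuple[str, str]]:
    """Split tagged lyrics into (section_name, text) pairs, in order."""
    sections: list[tuple[str, str]] = []
    name, lines = None, []
    for raw in lyrics.splitlines():
        match = TAG.match(raw.strip())
        if match:
            if name is not None:
                sections.append((name, "\n".join(lines).strip()))
            name, lines = match.group(1).lower(), []
        elif name is not None:
            lines.append(raw)
    if name is not None:
        sections.append((name, "\n".join(lines).strip()))
    return sections


demo = "[verse]\nline one\nline two\n[chorus]\nhook\n"
for section, text in split_sections(demo):
    print(section, "->", repr(text))
```

A check like this lets a front end reject lyrics with no recognised sections before spending any generation time.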
75
+ ---
76
+
77
+ ## 6. Languages supported
78
+
79
+ - **v1:** 19 languages, with the top 10 (English, Mandarin Chinese, Russian, Spanish, Japanese, German, French, Portuguese, Italian, Korean) performing best ([ace-step/ACE-Step](https://github.com/ace-step/ACE-Step)). Less-represented languages underperform due to training-data imbalance.
80
+ - **v1.5:** Expanded to **50+ languages** with lyric control, alongside the planner LM ([ace-step/ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)).
81
+
82
+ Known weakness from the team itself: Chinese rap was historically weak, motivating the `chinese-rap-LoRA` ([ace-step.github.io](https://ace-step.github.io/)).
83
+
84
+ ---
85
+
86
+ ## 7. Speed claims: verified
87
+
88
+ The famous claim: *"synthesizes up to 4 minutes of music in just 20 seconds on an A100 GPU - 15× faster than LLM-based baselines"* ([arxiv.org/abs/2506.00045](https://arxiv.org/abs/2506.00045), [ace-step.github.io](https://ace-step.github.io/)). Hardware: **NVIDIA A100 80GB**.
89
+
90
+ Published RTF table from the v1 HF card ([HF card](https://huggingface.co/ACE-Step/ACE-Step-v1-3.5B)):
91
+
92
+ | Device | 27 steps RTF | 60 steps RTF |
93
+ |---|---|---|
94
+ | RTX 4090 | 34.48× | 15.63× |
95
+ | A100 | 27.27× | 12.27× |
96
+ | RTX 3090 | 12.76× | 6.48× |
97
+ | **M2 Max** | **2.27×** | **1.03×** |
98
+
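The RTF figures in the table translate directly into wall-clock time. The sketch below assumes RTF here means seconds of audio produced per second of compute (higher is faster), which is how the table reads; the function and dictionary names are illustrative only.

```python
# RTF values copied from the v1 table above (audio seconds per compute second).
RTF_27_STEPS = {"RTX 4090": 34.48, "A100": 27.27, "RTX 3090": 12.76, "M2 Max": 2.27}
RTF_60_STEPS = {"RTX 4090": 15.63, "A100": 12.27, "RTX 3090": 6.48, "M2 Max": 1.03}


def generation_seconds(audio_seconds: float, rtf: float) -> float:
    """Compute time needed to render `audio_seconds` of audio at a given RTF."""
    if rtf <= 0:
        raise ValueError("RTF must be positive")
    return audio_seconds / rtf


four_minutes = 240.0
for device, rtf in RTF_60_STEPS.items():
    print(f"{device}: {generation_seconds(four_minutes, rtf):.1f} s at 60 steps")
```

At 60 steps the A100 row works out to roughly 19.6 s for a 4-minute song, which lines up with the "4 minutes in 20 seconds" headline claim.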
99
+ v1.5 is faster still: *"under 2 seconds per full song on A100 and under 10 seconds on an RTX 3090"* ([ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)).
100
+
101
+ **Apple-Silicon equivalents** (from the dedicated [clockworksquirrel/ace-step-apple-silicon](https://github.com/clockworksquirrel/ace-step-apple-silicon) port):
102
+
103
+ | Task | M1 Pro 16 GB | M3 Pro 36 GB | A100 |
104
+ |---|---|---|---|
105
+ | 30 s turbo | ~45 s | ~25 s | ~2 s |
106
+ | 30 s SFT (full) | ~3 min | ~1.5 min | ~8 s |
107
+
108
+ **M5 Max projection:** The M5 Max's GPU TFLOPS lineage (MPS SGEMM scaled M1→M4: 1.36 → 2.24 → 2.47 → 2.9 TFLOPS, per [arxiv 2502.05317](https://arxiv.org/html/2502.05317v1)) plus the M5 generation's ~30 % uplift suggests roughly **3.5–4× the throughput of M2 Max**, i.e. an **estimated 8–10× RTF at 27 steps** for v1, and full-song generation in **~30–50 s for a 4-minute song**. No M5-specific public benchmark exists yet.
109
+
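The scaling arithmetic behind a projection like this can be reproduced in a few lines. The sketch treats the 3.5–4× throughput multiplier as a given assumption and counts sampling time only, so the printed times are a lower bound that excludes model load, planner and decode overhead; the function names are illustrative.

```python
M2_MAX_RTF_27_STEPS = 2.27  # measured M2 Max figure from the v1 RTF table


def project_rtf(base_rtf: float, speedup: float) -> float:
    """Scale a measured RTF by an assumed hardware speedup factor."""
    return base_rtf * speedup


def sampling_seconds(audio_seconds: float, rtf: float) -> float:
    """Pure sampling time implied by an RTF (audio seconds / compute seconds)."""
    return audio_seconds / rtf


for speedup in (3.5, 4.0):  # assumed M5 Max vs M2 Max multiplier range
    rtf = project_rtf(M2_MAX_RTF_27_STEPS, speedup)
    seconds = sampling_seconds(240.0, rtf)
    print(f"{speedup}x -> RTF {rtf:.2f}, {seconds:.0f} s for a 4-minute song")
```

Because the multiplier is itself an estimate, the output is best read as a planning range to confirm with a real benchmark on the target machine.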
110
+ ---
111
+
112
+ ## 8. Quality assessment
113
+
114
+ From the cross-model evaluation summarised in research-aggregator coverage ([researchgate paper page](https://www.researchgate.net/publication/392334894_ACE-Step_A_Step_Towards_Music_Generation_Foundation_Model), [fm9.ai/ace-step/vs-suno](https://fm9.ai/ace-step/vs-suno)):
115
+
116
+ | Dimension | Leader | Where ACE-Step sits |
117
+ |---|---|---|
118
+ | Aesthetic quality | Hailuo > DiffRhythm | mid-upper |
119
+ | Musicality (coherence) | Suno v3 | competitive, strong on memorability/clarity |
120
+ | Style alignment | Udio v1 > Hailuo | 3rd |
121
+ | Lyric alignment | Hailuo | strong, beats Suno v3, Udio, YuE |
122
+ | **Vocal naturalness (v1.5)** | **ACE-Step 4.4/5** | beats Suno v4 (4.1/5) |
123
+ | Speed (RTF) | **ACE-Step 15.63×** | best in class; DiffRhythm 10.03×, YuE 0.083× |
124
+
125
+ User-facing reception is positive on customisability and speed; the most-cited weakness is "gacha"-style seed sensitivity: re-rolls produce noticeably different outputs ([ace-step.github.io](https://ace-step.github.io/)).
126
+
127
+ ---
128
+
129
+ ## 9. Inference performance & Apple Silicon
130
+
131
+ - **VRAM (v1):** minimum **8 GB with CPU offload**; comfortable on 12 GB+ ([ace-step/ACE-Step](https://github.com/ace-step/ACE-Step)).
132
+ - **VRAM (v1.5):** **<4 GB** for 2B-turbo with offload; **≥12 GB** for XL with offload; **≥20 GB** without offload; **≥24 GB optimal** ([ACE-Step-1.5 README](https://github.com/ace-step/ACE-Step-1.5)).
133
+ - **MPS support:** **first-class.** Use `--bf16 false` on M-series to avoid kernel issues ([ace-step/ACE-Step](https://github.com/ace-step/ACE-Step)). The dedicated [clockworksquirrel/ace-step-apple-silicon](https://github.com/clockworksquirrel/ace-step-apple-silicon) fork adds: bfloat16 throughout, MPS-safe pipeline with `torch.mps.empty_cache()` synchronisation, **MLX backend (567 LoC)** that auto-converts the Qwen3 planner LM to MLX with quantisation, and **LoRA training on MPS**.
134
+ - **ComfyUI:** **native nodes** ship in upstream ComfyUI (`TextEncodeAceStepAudio` etc.) plus the official [`ace-step/ACE-Step-ComfyUI`](https://github.com/ace-step/ACE-Step-ComfyUI). v1.5 has dedicated workflows (split-LLM and AIO checkpoint variants) on comfy.org ([Purz blog post](https://blog.comfy.org/p/ace-step-15-is-now-available-in-comfyui)).
135
+ - **128 GB unified on M5 Max** comfortably fits the full XL stack plus the 4B planner LM with no offload needed; user's hardware is essentially overkill for ACE-Step.
136
+
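A back-of-the-envelope check of the "fits comfortably" claim: weight memory is roughly parameter count times bytes per parameter. The sketch below counts weights only (activations, the audio autoencoder, text encoders and framework overhead are extra) and uses decimal gigabytes; the helper name is illustrative.

```python
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2}


def weight_gb(params: float, dtype: str = "bf16") -> float:
    """Approximate weight footprint in decimal GB (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


xl_stack_params = 4e9 + 4e9  # 4B XL DiT plus the 4B planner LM
unified_memory_gb = 128

for dtype in ("bf16", "fp32"):
    gb = weight_gb(xl_stack_params, dtype)
    share = gb / unified_memory_gb
    print(f"{dtype}: {gb:.0f} GB of weights, {share:.0%} of unified memory")
```

Even in fp32 the weights of the largest configuration take about a quarter of 128 GB, which is why no offload should be needed on this machine.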
137
+ ---
138
+
139
+ ## 10. Repo health
140
+
141
+ | Repo | Stars | Forks | Last release |
142
+ |---|---|---|---|
143
+ | `ace-step/ACE-Step` (v1) | 4.5k | 568 | quiet since v1.5 fork |
144
+ | `ace-step/ACE-Step-1.5` | **10.4k** | 1.3k | v0.1.7 on 24 Apr 2026 |
145
+ | `fspecii/ace-step-ui` (popular community UI) | 3.8k | 561 | active |
146
+ | `clockworksquirrel/ace-step-apple-silicon` | n/a (smaller) | n/a | active |
147
+
148
+ The team also curates [`ace-step/awesome-ace-step`](https://github.com/ace-step/awesome-ace-step). Issue activity, ComfyUI integration cadence, and the LM-planner architectural jump in v1.5 all indicate a project that is healthier and growing faster than YuE or DiffRhythm.
149
+
150
+ ---
151
+
152
+ ## 11. Real-world adoption
153
+
154
+ - **AMD vendor-backed deployment:** AMD published a blog *"Commercial-grade AI music generation on AMD Ryzen AI processors and Radeon graphics with ACE Step 1.5"* in 2026, explicitly endorsing it for Ryzen AI / Radeon production stacks ([AMD blog](https://www.amd.com/en/blogs/2026/commercial-grade-ai-music-generation-on-amd-ryzen-ai-and-radeon-ace-step-1-5.html)).
155
+ - **Third-party SaaS:** `acestep.io` and `ace-step.app` run hosted song-generation services on the open weights ([acestep.io](https://acestep.io/), [ace-step.app](https://ace-step.app/)).
156
+ - **Production-grade UI:** `fspecii/ace-step-ui` brands itself as *"the Ultimate Open Source Suno Alternative"* with stem extraction (Demucs), batch generation, library/playlist management, LAN access ([fspecii/ace-step-ui](https://github.com/fspecii/ace-step-ui)).
157
+ - HeartMuLa and similar music platforms cite ACE-Step 1.5 in their stack comparisons ([heart-mula.com/ace-step](https://heart-mula.com/ace-step)).
158
+
159
+ ---
160
+
161
+ ## 12. Fine-tuning + LoRA
162
+
163
+ - **Training code released**; documented in [`TRAIN_INSTRUCTION.md`](https://github.com/ace-step/ACE-Step) and `ZH_RAP_LORA.md` ([ace-step/ACE-Step](https://github.com/ace-step/ACE-Step)).
164
+ - **Genre / task LoRAs from the team:** `RapMachine` (the `ACE-Step-v1-chinese-rap-LoRA` checkpoint), `Lyric2Vocal`, `Text2Samples` ([HF org](https://huggingface.co/ACE-Step), [ace-step.github.io](https://ace-step.github.io/)).
165
+ - v1.5 quotes **"8 songs trainable in ~1 hour on a single RTX 3090"** for LoRA personalisation ([ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)).
166
+ - LoRA training is verified working on **MPS** via the Apple-Silicon fork ([clockworksquirrel/ace-step-apple-silicon](https://github.com/clockworksquirrel/ace-step-apple-silicon)).
167
+
168
+ ---
169
+
170
+ ## 13. Pros and cons
171
+
172
+ **Pros**
173
+ - Apache-2.0 / MIT: **fully commercial-friendly**, unique in this tier.
174
+ - **Fastest open music model**: 15.63× RTF on a 4090 (60 steps); sub-2 s/song on A100 (v1.5).
175
+ - Vocals **and** instruments natively; v1.5 vocal quality now beats Suno v4 in blind tests.
176
+ - 50+ languages with lyric structural tags.
177
+ - First-class **MPS + MLX** support and a dedicated Apple-Silicon fork.
178
+ - ComfyUI native + thriving UI ecosystem (`ace-step-ui`).
179
+ - LoRA training is cheap (~1 hour for 8 songs on 3090), well-documented.
180
+ - Hybrid LM-planner (v1.5) closes the long-range structure gap with Suno.
181
+
182
+ **Cons**
183
+ - v1 vocals are admitted "coarse"; even v1.5 trails Suno on pop/EDM polish.
184
+ - High **seed sensitivity** → "gacha" outputs; multiple re-rolls needed in production.
185
+ - Less-represented languages underperform.
186
+ - Memory for XL series can exceed 24 GB without offload.
187
+ - No official **v2** announced; the rapid v1 → v1.5 → XL fork hints at API/checkpoint churn.
188
+ - Smaller benchmark literature than Suno/YuE; some metrics still self-reported.
189
+
190
+ ---
191
+
192
+ ## 14. Verdict for the user's platform
193
+
194
+ For a **Suno-like platform on M5 Max with 128 GB unified memory**, ACE-Step is currently the **single strongest open-source choice** and should be the **default base model**:
195
+
196
+ - **Best for:** full-song generation with vocals in 50+ languages, fast iteration (sub-minute per song expected on M5 Max), genre-specific LoRA fine-tuning, and any deployment where commercial rights matter (Apache/MIT vs Suno's locked-down terms).
197
+ - **Recommended stack:** ACE-Step **v1.5 XL (4B DiT) + 1.7B Qwen3 planner**, run via the `clockworksquirrel/ace-step-apple-silicon` MPS/MLX fork, served behind the `fspecii/ace-step-ui` frontend, with ComfyUI workflows for power-user editing.
198
+ - **Weaknesses to mitigate:** budget for **n-of-k re-roll selection** in the product UX (the gacha problem); pair with a **Demucs stem-extraction post-process** (already in `ace-step-ui`) so users can mix-down; do not pitch the platform on pop/EDM polish alone, and lean into folk/classical/jazz and rap, where ACE-Step now leads.
199
+ - **Where you may still need Suno-style commercial APIs:** clients demanding broadcast-radio pop polish; otherwise, ACE-Step is sufficient.
200
+
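The n-of-k re-roll idea can be sketched independently of any model API. In the example below, `generate` and `score` are hypothetical callables supplied by the platform (for instance a wrapper around the inference backend and an automatic quality metric); neither name comes from ACE-Step itself.

```python
from typing import Callable, TypeVar

Clip = TypeVar("Clip")


def best_of_k(
    generate: Callable[[int], Clip],
    score: Callable[[Clip], float],
    seeds: list[int],
    n: int = 1,
) -> list[tuple[int, Clip]]:
    """Generate one candidate per seed and keep the n highest-scoring ones."""
    if not 1 <= n <= len(seeds):
        raise ValueError("n must be between 1 and the number of seeds")
    candidates = [(seed, generate(seed)) for seed in seeds]
    ranked = sorted(candidates, key=lambda pair: score(pair[1]), reverse=True)
    return ranked[:n]


# Stand-in callables so the sketch runs without a model.
def fake_generate(seed: int) -> int:
    return (seed * seed) % 7


def fake_score(clip: int) -> float:
    return float(clip)


print(best_of_k(fake_generate, fake_score, seeds=[1, 2, 3, 4], n=1))
```

Returning the seed alongside each clip keeps the winning output reproducible, which matters when a user wants to extend or edit the take they picked.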
201
+ ---
202
+
203
+ ### Sources
204
+
205
+ - [ACE-Step paper, arXiv 2506.00045](https://arxiv.org/abs/2506.00045)
206
+ - [ace-step.github.io](https://ace-step.github.io/)
207
+ - [ace-step/ACE-Step (v1 repo)](https://github.com/ace-step/ACE-Step)
208
+ - [ace-step/ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)
209
+ - [ACE-Step v1-3.5B model card](https://huggingface.co/ACE-Step/ACE-Step-v1-3.5B)
210
+ - [ACE-Step org on Hugging Face](https://huggingface.co/ACE-Step)
211
+ - [clockworksquirrel/ace-step-apple-silicon](https://github.com/clockworksquirrel/ace-step-apple-silicon)
212
+ - [fspecii/ace-step-ui](https://github.com/fspecii/ace-step-ui)
213
+ - [ace-step/ACE-Step-ComfyUI](https://github.com/ace-step/ACE-Step-ComfyUI)
214
+ - [ace-step/awesome-ace-step](https://github.com/ace-step/awesome-ace-step)
215
+ - [ComfyUI native ACE-Step tutorial](https://docs.comfy.org/tutorials/audio/ace-step/ace-step-v1)
216
+ - [ComfyUI Wiki ACE-Step guide](https://comfyui-wiki.com/en/tutorial/advanced/audio/ace-step/ace-step-v1)
217
+ - [Purz blog – ACE-Step 1.5 in ComfyUI](https://blog.comfy.org/p/ace-step-15-is-now-available-in-comfyui)
218
+ - [AMD blog – ACE-Step 1.5 on Ryzen AI / Radeon](https://www.amd.com/en/blogs/2026/commercial-grade-ai-music-generation-on-amd-ryzen-ai-and-radeon-ace-step-1-5.html)
219
+ - [FM9 – ACE-Step vs Suno blind test](https://fm9.ai/ace-step/vs-suno)
220
+ - [HeartMuLa – ACE-Step 1.5 review](https://heart-mula.com/ace-step)
221
+ - [ResearchGate – ACE-Step paper page](https://www.researchgate.net/publication/392334894_ACE-Step_A_Step_Towards_Music_Generation_Foundation_Model)
222
+ - [Apple Silicon HPC benchmark, arXiv 2502.05317](https://arxiv.org/html/2502.05317v1)
223
+ - [acestep.io – hosted service](https://acestep.io/)
224
+ - [ace-step.app – hosted service](https://ace-step.app/)
research/04_newcomers_and_survey.md ADDED
@@ -0,0 +1,161 @@
1
+ # 2026 Open-Source Music Generation Models: Newcomers and Survey
2
+
3
+ *Date: 2026-05-18. Target hardware: M5 Max, 128 GB unified memory, MPS backend.*
4
+
5
+ This report investigates the freshest 2026 open-source song-with-vocals generators relevant to building a Suno-like platform locally. Primary focus: **SongGeneration 2 / LeVo 2** (Tencent, March 2026) and **HeartMuLa** (Jan 2026). Also covered: DiffRhythm 2, ACE-Step 1.5 XL, SongBloom, YuE, FunMusic/InspireMusic, NotaGen. Independent benchmark sources are sparse for releases this fresh; vendor claims are flagged.
6
+
7
+ ---
8
+
9
+ ## 1. SongGeneration 2 / LeVo 2 (Tencent AI Lab)
10
+
11
+ **Overview.** Builder: Tencent AI Lab. Release: 2026-03-01 (v2-large weights), arXiv paper "LeVo" appeared 2025-06-09 (2506.07520). Status: actively updated, v2 is the headline model on the repo ([GitHub](https://github.com/tencent-ailab/SongGeneration), [HF](https://huggingface.co/tencent/SongGeneration)).
12
+
13
+ **Architecture.** Hybrid LLM + Diffusion. The **LeLM** language model handles global structure and performance details with a hierarchical scheme that parallel-models *Mixed Tokens* (melody/structure) and *Dual-Track Tokens* (separate vocal vs. accompaniment streams). A downstream diffusion module synthesises the high-fidelity acoustic waveform from those tokens. Multi-preference DPO alignment (~200k positive/negative pairs) is applied offline ([repo README](https://github.com/tencent-ailab/SongGeneration/blob/main/README.md)).
14
+
15
+ **Variants and sizes.** Five tiers ([HF model card](https://huggingface.co/tencent/SongGeneration/blob/main/README.md)):
16
+ - `base` (2:30 max, zh): 10/16 GB VRAM, RTF 0.67
17
+ - `base-new` (zh + en): same VRAM
18
+ - `base-full` (4:30, zh + en): 12/18 GB VRAM, RTF 0.69
19
+ - `large` (zh + en): 22/28 GB VRAM, RTF 0.82
20
+ - **`v2-large`: 4 B params, multilingual (zh/en/es/ja/…), 22/28 GB VRAM, RTF 0.82, 4:30 max length**
21
+
22
+ **License.** Custom Tencent "academic, research and education purposes" license, **commercial use explicitly prohibited** ([LICENSE](https://github.com/tencent-ailab/SongGeneration/blob/main/LICENSE)). This is the headline blocker for a Suno-like SaaS product.
23
+
24
+ **Languages.** v2-large: Chinese, English, Spanish, Japanese plus others (multilingual lyrics input).
25
+
26
+ **Vocals.** Yes. Separable dual-track output (vocals + accompaniment, instrumental-only, or a cappella).
27
+
28
+ **Speed and hardware.** Reference numbers measured on Tencent's H20 (96 GB) GPU: RTF 0.82 for v2-large. No first-party MPS code path, but a community fork **[SongGen-Mac](https://github.com/Rdx-ai-art/SongGen-Mac)** runs the older base/large models via PyTorch MPS on M-series Macs. The author reports **~6 min wall-clock per ~2 min song on M1 Max 64 GB (base), ~12 min for large**, and notes RAM+swap usage hits ~70 GB during inference. The fork is tiny (9 GitHub stars) and does **not** yet wrap v2-large; porting that to MPS on the M5 Max 128 GB is a real engineering task and will likely need careful attention to bf16 casts (LeLM) plus diffusion sampler patches.
29
+
30
+ **Benchmarks.** Vendor claims ([repo README](https://github.com/tencent-ailab/SongGeneration)): Phoneme Error Rate **8.55 %** vs. Suno v5 12.4 % and Mureka v8 9.96 %. Subjective panel: 20 industry professionals scored Overall Quality, Melody, Arrangement, Sound Quality (instrument and vocal), and Structure on 100 songs/model; Tencent reports v2-large above all open-source baselines and at parity with the top commercial systems. **All numbers vendor-reported; no independent re-run located.** The arXiv "Benchmarking Music Generation Models via Human Preference Studies" paper (2506.19085) precedes v2 and tops out at Suno v3.5 / Udio, so it does not cover LeVo ([arXiv](https://arxiv.org/html/2506.19085v1)).
31
+
32
+ **Repo health.** 1.6 k stars / 191 forks, last meaningful update 2026-03-01. 12 active discussion threads ([repo](https://github.com/tencent-ailab/SongGeneration)).
33
+
34
+ **Adoption.** Hugging Face Space (free demo), WaveSpeed AI hosted endpoint ([WaveSpeed](https://wavespeed.ai/models/wavespeed-ai/song-generation)), SECourses Patreon GUI wrapper, vllm-omni issue tracking integration ([HF Space](https://huggingface.co/spaces/tencent/SongGeneration)). No production SaaS adoption seen.
35
+
36
+ **Pros.** State-of-the-art lyric accuracy (vendor); dual-track outputs ready for mixing; multilingual; clear inference budget; 4 B params fits comfortably in 128 GB unified memory in fp16.
37
+
38
+ **Cons.** **License kills commercial use** for a Suno-clone product. No official MPS path. Community Mac fork lags v2. Inference time on Apple Silicon is multi-minute per song. No independent benchmark verification.
39
+
40
+ ---
41
+
42
+ ## 2. HeartMuLa (HeartMuLa team / academic group)
43
+
44
+ **Overview.** Builder: HeartMuLa research collective, paper credited to Jordi Pons-affiliated group ([Substack explainer](https://artintech.substack.com/p/heartmula-explained)). First weights: 2026-01-19 (`HeartMuLa-oss-3B`), latest: 2026-02-13 (`HeartMuLa-oss-3B-happy-new-year`). arXiv 2601.10547 ([abs](https://arxiv.org/abs/2601.10547)).
45
+
46
+ **Architecture.** Four-stage family ([landing page](https://heartmula.github.io/)): **HeartCLAP** (audio-text alignment / retrieval), **HeartTranscriptor** (Whisper-style lyric ASR), **HeartCodec** (12.5 Hz neural audio codec, low frame rate but high-fi), **HeartMuLa** (LLM-based song generator conditioned on lyrics, tags, and reference audio). Section-level fine-grained control (intro/verse/chorus) is a stated feature.
47
+
48
+ **Variants and sizes.** Six checkpoints in the family, five published on [HF](https://huggingface.co/HeartMuLa) and one unreleased:
49
+ - `HeartMuLa-oss-3B`: 4 B text-to-audio (1.21 k downloads, 255 likes)
50
+ - `HeartMuLa-RL-oss-3B-20260123`: 4 B RL-tuned variant
51
+ - `HeartMuLa-oss-3B-happy-new-year`: 4 B latest checkpoint
52
+ - `HeartCodec-oss-20260123`: 2 B codec
53
+ - `HeartTranscriptor-oss`: 0.8 B ASR
54
+ - `HeartMuLa-7B`: internal/unreleased
55
+
56
+ (Note the naming oddity: HF model card lists "3B" name but 4 B parameter size; treat as ~4 B.)
57
+
58
+ **License.** **Apache 2.0**, confirmed via [LICENSE](https://github.com/HeartMuLa/heartlib/blob/main/LICENSE). Commercial use permitted. This is among the strongest licensing positions of any model in this report.
59
+
60
+ **Languages.** Multilingual; demo page covers en, zh, ja, ko, es. Paper claims "almost all languages."
61
+
62
+ **Vocals.** Yes. Lyric-conditioned vocal synthesis is the core capability. The paper claims best-in-class lyric intelligibility.
63
+
64
+ **Speed and hardware.** RTF ≈ 1.0 (paper). VRAM via the ComfyUI integration ([FL-HeartMuLa](https://github.com/filliptm/ComfyUI_FL-HeartMuLa)): 3 B model needs **12 GB+ VRAM** at full precision, **6 GB with 4-bit bnb quantisation** (CUDA-only). 7 B will need 24 GB / 12 GB quantised. **MPS supported** on M1/M2/M3/M4 (M5 implied), but 4-bit quantisation does not work on MPS, so the M5 Max will run native bf16. 128 GB unified memory is plenty of headroom for the 4 B model and an eventual 7 B release.
65
+
66
+ **Benchmarks.** Vendor PER claims: **0.09 (English), 0.12 (Chinese)**, flagged "lowest across every language tested," beating Suno v5 and MiniMax Music 2.0 ([blog](https://huggingface.co/blog/azhan77168/heartmula)). **Note PER unit mismatch with SongGeneration's 8.55 %: these are likely measured on different scales (HeartMuLa percentages may be normalised differently); direct comparison unreliable.** Demo page compares against Suno v4.5, Mureka v7.6, YuE, DiffRhythm 2, ACE-Step ([demos](https://heartmula.github.io/)). The single HN comment ([46691275](https://news.ycombinator.com/item?id=46691275)) said "initial results promising, more so than recent ACE-Step 1.5." Otherwise **no independent A/B tests located**; the HF promo blog is vendor-aligned content.
67
+
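The unit mismatch is easy to see once both figures are put on one scale. The sketch below does not settle which reading of HeartMuLa's 0.09 is correct; it only shows, with a hypothetical helper, that the ranking against SongGeneration's 8.55 % flips depending on whether 0.09 is a fraction or already a percentage.

```python
def per_as_percent(value: float, unit: str) -> float:
    """Express a phoneme error rate as a percentage, given its stated unit."""
    if unit == "percent":
        return value
    if unit == "fraction":
        return value * 100.0
    raise ValueError(f"unknown unit: {unit!r}")


songgeneration_per = per_as_percent(8.55, "percent")

# Two possible readings of the same vendor-reported number.
for unit in ("fraction", "percent"):
    heartmula_per = per_as_percent(0.09, unit)
    better = "HeartMuLa" if heartmula_per < songgeneration_per else "SongGeneration"
    print(f"0.09 read as {unit}: {heartmula_per:.2f} % -> lower PER: {better}")
```

Until the measurement scale and test set are confirmed, neither vendor figure should be used to rank the two models.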
68
+ **Repo health.** [github.com/HeartMuLa/heartlib](https://github.com/HeartMuLa/heartlib): 3.6 k stars / 396 forks / 71 open issues. Last release Feb 2026. Larger and more active than SongGeneration's repo.
69
+
70
+ **Adoption.** WaveSpeed AI hosted endpoint ([blog](https://wavespeed.ai/blog/posts/introducing-wavespeed-ai-heartmula-generate-music-on-wavespeedai/)); ComfyUI node `FL-HeartMuLa`; HeartMuse local app integrating Ollama for lyric writing ([HN](https://news.ycombinator.com/item?id=46871828)).
71
+
72
+ **Pros.** Apache 2.0, so usable for a commercial product. Modular architecture (codec + ASR + CLAP + gen) is reusable. Strong lyric intelligibility claim. Active repo. Explicit MPS support documented downstream.
73
+
74
+ **Cons.** Heavy marketing tone in third-party coverage; benchmarks all vendor-published. 7 B not yet released. No standardised MOS or ELO numbers from a neutral evaluator. PER values reported in non-comparable units to peers.
75
+
76
+ ---
77
+
78
+ ## 3. DiffRhythm 2 (ASLP-Lab)
79
+
80
+ **Overview.** Successor to DiffRhythm v1.2. arXiv 2510.22950, v3 2026-02-03 ([arXiv](https://arxiv.org/abs/2510.22950)). Original repo: [ASLP-lab/DiffRhythm](https://github.com/ASLP-lab/DiffRhythm).
81
+
82
+ **Architecture.** Music VAE at 5 Hz frame rate + Diffusion Transformer with **block flow matching** for lyric-to-vocal alignment. Adds cross-pair preference optimisation (RLHF) and a stochastic block representation alignment loss for musicality. Semi-autoregressive blockwise generation.
83
+
84
+ **License.** Apache 2.0 (inherited from v1, confirmed 2025-03-07).
85
+
86
+ **Languages, vocals, hardware.** Multilingual; full vocals + instrumental; uses 44.1 kHz stereo; up to 4:45 song length. DiffRhythm v1 can generate a full song in ~10 s on a single A100, and v2 should be in the same ballpark. MPS not officially stated but PyTorch DiT models port relatively cleanly. Parameter count not disclosed in v2 abstract.
87
+
88
+ **Benchmarks.** Vendor claims top-of-class fidelity; no independent verification specific to v2.
89
+
90
+ **Pros/cons.** Pros: very fast, permissive license, mature codebase. Cons: no public param count, no first-party MPS path, lyric clarity historically the weak spot vs LeVo/HeartMuLa.
91
+
92
+ ---
93
+
94
+ ## 4. ACE-Step 1.5 XL (ACE Studio × StepFun)
95
+
96
+ **Overview.** [github.com/ace-step/ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5). arXiv 2602.00744. Most user-tested local-first option. 10.4 k stars / 1.3 k forks, the **biggest community by far**.
97
+
98
+ **Architecture.** LM planner (0.6 B / 1.7 B / 4 B selectable) + DiT decoder (2 B or 4 B XL). XL DiT ~9 GB bf16.
99
+
100
+ **License.** **MIT**. Commercial use allowed.
101
+
102
+ **Languages.** 50+.
103
+
104
+ **Speed and hardware.** Under 2 s/song on A100, under 10 s on RTX 3090, **<4 GB VRAM** for DiT-only minimum. **Explicit Mac MPS support** with `start_gradio_ui_macos.sh`; MLX backend optimisation noted. Easiest M5 Max install of any model in this list.
105
+
106
+ **Benchmarks.** Vendor: SongEval 8.12, AudioBox 7.76, claims to beat Suno v5 and MiniMax 2.5 across 11 dimensions ([project page](https://ace-step.github.io/ace-step-v1.5.github.io/)). DEV Community write-up positions it "between Suno v4.5 and v5", a more honest framing.
107
+
108
+ **Pros.** Best Mac story, MIT licence, LoRA personalisation in days, tiny VRAM. **Cons.** Vocal naturalness still trails Suno v5 in casual user tests.
109
+
110
+ ---
111
+
112
+ ## 5. SongBloom (Tencent AI Lab)
113
+
114
+ [github.com/tencent-ailab/SongBloom](https://github.com/tencent-ailab/SongBloom). 778 stars. Interleaved autoregressive sketch + diffusion refinement, 2 B params, MPS supported. Generates songs of up to 150 s from lyrics + 10 s reference audio, with lengths up to 240 s in the Oct 2025 update. Same Tencent academic-only LICENSE pattern (not Apache). Useful as a research baseline; **same commercial-use prohibition as SongGeneration** likely applies, so verify before deploying.
115
+
116
+ ---
117
+
118
+ ## 6. YuE (M-A-P / HKUST)
119
+
120
+ [github.com/multimodal-art-projection/YuE](https://github.com/multimodal-art-projection/YuE). LLaMA-2 backbone, lyric-to-song, **Apache 2.0** since 2025-01-30, 5 min max length, dual-track ICL mode, no v2 announced. Strong vocal emotion for ballads/R&B. Llama.cpp issue 11467 still tracks GGUF support. Solid permissive fallback if HeartMuLa underperforms.
121
+
122
+ ---
123
+
124
+ ## 7. FunMusic / InspireMusic (Alibaba FunAudioLLM)
125
+
126
+ [github.com/FunAudioLLM/FunMusic](https://github.com/FunAudioLLM/FunMusic). Qwen2.5 backbone + flow-matching super-res. 1.3 k stars. Apache 2.0. **No MPS support, requires Flash Attention 2.6 + CUDA 11.8+**, so it is effectively NVIDIA-only. Song-with-vocals models announced but not yet released; current releases are music-only/audio.
127
+
128
+ ---
129
+
130
+ ## Survey table: 2026 open-source song generators
131
+
132
+ | Model | Builder | Release | Params | License | Vocals | Repo |
133
+ |---|---|---|---|---|---|---|
134
+ | SongGeneration 2 / LeVo 2 | Tencent AI Lab | 2026-03 | 4 B | Custom non-commercial | Yes, dual-track | [link](https://github.com/tencent-ailab/SongGeneration) |
135
+ | HeartMuLa-oss-3B | HeartMuLa | 2026-01 | ~4 B + 2 B codec + 0.8 B ASR | Apache 2.0 | Yes, multilingual | [link](https://github.com/HeartMuLa/heartlib) |
136
+ | DiffRhythm 2 | ASLP-Lab | 2025-10 → 2026-02 (v3) | undisclosed | Apache 2.0 | Yes | [link](https://github.com/ASLP-lab/DiffRhythm) |
137
+ | ACE-Step 1.5 XL | ACE Studio × StepFun | 2026-01 | LM 0.6–4 B + DiT 2–4 B | MIT | Yes | [link](https://github.com/ace-step/ACE-Step-1.5) |
138
+ | SongBloom | Tencent AI Lab | 2025-06 → 2025-10 | 2 B | Custom (likely non-commercial) | Yes | [link](https://github.com/tencent-ailab/SongBloom) |
139
+ | YuE | M-A-P / HKUST | 2025-01 | up to 7 B | Apache 2.0 | Yes | [link](https://github.com/multimodal-art-projection/YuE) |
140
+ | InspireMusic (FunMusic) | Alibaba FunAudioLLM | 2025-01 | 1.5 B | Apache 2.0 | Coming (music only today) | [link](https://github.com/FunAudioLLM/FunMusic) |
141
+ | NotaGen / NotaGen-X | Central Conservatory + ElectricAlexis | 2025 | symbolic-only | MIT | n/a (ABC/XML) | [link](https://github.com/ElectricAlexis/NotaGen) |
142
+
143
+ ---
144
+
145
+ ## Dark horses / experimental
146
+
147
+ - **NotaGen-X**: DeepSeek-R1-style RL on symbolic music. Outputs ABC/MusicXML (not audio). Could feed a TTS-vocal model for a hybrid composer → singer pipeline ([repo](https://github.com/ElectricAlexis/NotaGen), [arXiv](https://arxiv.org/abs/2502.18008)).
148
+ - **LLaSA / LLaSA+**: Llama-3B-backbone TTS pipeline ([arXiv](https://arxiv.org/html/2508.06262v1)); not music, but emergent prosody good enough to consider as the vocal layer behind a NotaGen score.
149
+ - **DiffRhythm+**: preference-optimised DiffRhythm variant, arXiv 2507.12890; mid-stage between v1 and v2.
150
+ - **AudioX**: anything-to-audio DiT, arXiv 2503.10522; useful for sound design and SFX layering, not full-song.
151
+ - **MelodyFlow**: text-controllable DiT with flow-matching for music editing.
152
+ - **HeartMuse**: local Ollama-orchestrated lyric → HeartMuLa song app ([HN](https://news.ycombinator.com/item?id=46871828)); reference for building a thin product wrapper.
153
+
154
+ ---
155
+
156
+ ## Skeptic's bottom line for the M5 Max 128 GB build
157
+
158
+ 1. **For a commercial Suno-clone**: **HeartMuLa** (Apache 2.0, native MPS, 4 B fits easily, Feb-2026 checkpoint, modular components reusable) is the strongest pick. Verify their PER claims yourself before fundraising-style messaging.
159
+ 2. **For best raw quality, research only**: **SongGeneration 2 v2-large**, but the Tencent licence forbids commercial deployment and the v2 weights don't yet have a maintained MPS port. The community SongGen-Mac fork targets the older base/large.
160
+ 3. **For fastest iteration / smallest VRAM**: **ACE-Step 1.5 XL** (MIT, native Mac script, <4 GB VRAM). It promises less on vocal naturalness than HeartMuLa but ships today on Apple Silicon with the cleanest licence story.
161
+ 4. Reliable independent benchmark for these specific 2026 releases does not yet exist; the only neutral preference study found ([arXiv 2506.19085](https://arxiv.org/html/2506.19085v1)) stops at Suno v3.5 and does not cover LeVo, HeartMuLa, or ACE-Step. **Run your own blind A/B before betting a product on any vendor PER number.**
research/05_apple_silicon_mps_audit.md ADDED
@@ -0,0 +1,105 @@
1
+ # Apple Silicon / MPS Compatibility Audit: Music Generation Models
2
+
3
+ Hardware target: **M5 Max, 128 GB unified memory**. Date: 2026-05-18.
4
+
5
+ Honest read: MPS is a second-class citizen for almost every music-gen repo. CUDA is the assumed default; Mac support, when it exists, is community-driven. Below is the per-model evidence with verdicts.
6
+
7
+ ---
8
+
9
+ ## 1. YuE (multimodal-art-projection/YuE)
10
+
11
+ - **Official MPS support:** None. The README requires `cuda >= 11.8`, conda-installed `cudatoolkit=11.8`, and **flash-attn 2 is mandatory** to avoid OOM on long sequences ([YuE README](https://github.com/multimodal-art-projection/YuE/blob/main/README.md)).
12
+ - **Community reports:** Issue #51 ("Instructions to run on Mac") is open and **unanswered** ([#51](https://github.com/multimodal-art-projection/YuE/issues/51)). No working Mac fork.
13
+ - **Backend compatibility:** Hard CUDA dependency through flash-attn; xformers/triton flash paths are CUDA-only ([HF forum thread](https://discuss.huggingface.co/t/best-practices-to-use-models-requiring-flash-attn-on-apple-silicon-macs-or-non-cuda/97562)). Stage 1 (7B LLaMA-2-style) and Stage 2 (1B) both transformer-based; in principle portable, but no one has shipped it.
14
+ - **Memory:** 7B + 1B + upsampler. Author recommends **≥80 GB VRAM** for full song; 24 GB OK for short clips. On 128 GB unified memory this fits, *if* you can swap flash-attn for SDPA.
15
+ - **Apple-Silicon timing:** None reported.
16
+ - **Verdict:** **Doesn't work out of the box. Likely broken on MPS.** Would need a non-trivial fork: strip flash-attn, replace with `torch.nn.functional.scaled_dot_product_attention`, and audit RoPE/KV-cache for MPS dtype quirks. There is also a "GPU Poor" fork ([deepbeepmeep/YuEGP](https://github.com/deepbeepmeep/YuEGP)) but it targets CUDA/ROCm with 8-bit quant, so there is **no Mac path**.
17
+
18
+ ## 2. DiffRhythm v1 and v2 (ASLP-lab)
19
+
20
+ - **Official MPS support:** DiffRhythm v1 explicitly states *"DiffRhythm can now run on MacOS!"* with `brew install espeak-ng` ([Readme](https://github.com/ASLP-lab/DiffRhythm/blob/main/Readme.md)). No specific MPS notes, but it works.
21
+ - **DiffRhythm 2:** `requirements.txt` is **clean of CUDA-only packages**: no flash-attn, xformers, triton, mamba_ssm, deepspeed, bitsandbytes ([requirements.txt](https://github.com/ASLP-lab/DiffRhythm2/blob/main/requirements.txt)). Just `torch==2.7`, `torchaudio==2.7`, `transformers`, `safetensors`, `muq`, `librosa`. The 3.9 % "CUDA" language stat in the repo is benign: it is auto-detected from a small kernel file, and there are no compiled extensions in the pip install path.
22
+ - **Community reports:** No GitHub issues or Reddit threads surface specific MPS bugs for DiffRhythm, implying it either works quietly or no one has tried at scale. The architecture (latent diffusion + DiT with flow matching, very similar to Stable Audio Open / SD3) is the same class that *does* work on MPS via diffusers.
23
+ - **Memory:** DiffRhythm-base needs **≥8 GB VRAM**; `--chunked` decoding reduces it further. Trivial on 128 GB.
24
+ - **Apple-Silicon timing:** Not benchmarked publicly, but extrapolating from Stable Audio Open MPS (≈3× CPU speedup) the 285-second full-song run should land in the low minutes on M5 Max.
25
+ - **Verdict:** **Just works on MPS (likely) / Works with workarounds.** Highest confidence pick.
26
+
27
+ ## 3. ACE-Step 1.5 (ace-step/ACE-Step)
28
+
29
+ - **Official MPS support:** **First-class.** README explicitly advertises Mac + AMD + Intel + CUDA. macOS scripts auto-set `ACESTEP_LM_BACKEND=mlx --backend mlx`: the language-model side runs on Apple's **MLX**, the DiT side on **PyTorch MPS** ([INSTALL.md](https://github.com/ace-step/ACE-Step-1.5/blob/main/docs/en/INSTALL.md)). bfloat16 supported on MPS since PyTorch 2.4.
30
+ - **Community reports:** Real-world M2 Air 16 GB run: 5–10 min per song, hit MPS-OOM, fixed with `PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0` ([bioerrorlog](https://en.bioerrorlog.work/entry/ace-step-15-local-m2-macbook)). A dedicated [clockworksquirrel/ace-step-apple-silicon](https://github.com/clockworksquirrel/ace-step-apple-silicon) fork already centralised MPS detection, swapped CUDA cache calls for `torch.mps.empty_cache()` / `torch.mps.synchronize()`, and tuned VAE conv1d tile sizes for Metal limits.
31
+ - **Backend compatibility:** Flash-attn auto-disabled on MPS. `torch.compile` disabled on MPS. nanovllm not on Mac. Otherwise clean.
32
+ - **Memory:** 4 GB DiT-only / 6 GB LLM+DiT minimum; ~10 GB total install.
33
+ - **Apple-Silicon timing (M1 Pro 16 GB vs M3 Pro 36 GB vs A100, from the AS fork's benchmarks):**
34
+
35
+ | Task | M1 Pro | M3 Pro | A100 |
36
+ | --- | --- | --- | --- |
37
+ | 30 s turbo song | ~45 s | ~25 s | ~2 s |
38
+ | 30 s SFT song | ~3 min | ~1.5 min | ~8 s |
39
+
40
+ **Extrapolated M5 Max:** turbo ~10–15 s, SFT ~45–60 s for 30 s output. Best Mac-citizen of the bunch.
41
+
42
+ - **Verdict:** **Just works on MPS.** Already production-grade on M-series.
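The MPS handling described for the Apple Silicon fork (central device detection, MPS cache calls in place of CUDA ones) can be sketched roughly as follows. This is an illustrative sketch, not code taken from either repository: `pick_device` and `free_cache` are hypothetical helper names, and the `torch` import is guarded so the snippet also runs where PyTorch is not installed.

```python
def pick_device() -> str:
    """Return "mps", "cuda", or "cpu" depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        return "cpu"  # assumption: fall back to CPU when PyTorch is absent
    if torch.backends.mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


def free_cache(device: str) -> None:
    """Release cached allocator memory on the active backend (no-op on CPU)."""
    if device == "cpu":
        return
    import torch
    if device == "mps":
        torch.mps.synchronize()
        torch.mps.empty_cache()
    elif device == "cuda":
        torch.cuda.synchronize()
        torch.cuda.empty_cache()


device = pick_device()
free_cache(device)
print(device)
```

Keeping the backend choice in one function means the rest of the pipeline never hard-codes `"cuda"`, which is the kind of change the fork is reported to have made.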
43
+
44
+ ## 4. SongGeneration 2 / LeVo 2 (Tencent)
45
+
46
+ - **Official MPS support:** None. Official repo pins `flash-attn 2.7.4.post1` for CUDA 12 + torch 2.6, though `--not_use_flash_attn` flag exists ([Tencent SongGeneration](https://github.com/tencent-ailab/SongGeneration)).
47
+ - **Community reports:** [Rdx-ai-art/SongGen-Mac](https://github.com/Rdx-ai-art/SongGen-Mac) fork: "Runs completely on your Mac's GPU via MPS on PyTorch." Tested on M1 Max 64 GB / macOS 15.7.2. **Pre-chorus block produces gibberish vocals**, a known regression vs CUDA.
48
+ - **Backend compatibility:** Hybrid LLM + diffusion architecture. Once flash-attn is stripped, the LLM side uses SDPA fine on MPS.
49
+ - **Memory (Mac fork):** Base ≥24 GB RAM, ~70 GB total app RAM including swap during inference. Large ≥32 GB, hits ~80 GB. **On 128 GB M5 Max this fits cleanly without swap.**
50
+ - **Apple-Silicon timing (M1 Max 64 GB):** Base ~4–6 min for ~2 min of audio. Large ~10–25 min for ~2:30. M5 Max should be roughly 2–3× faster (better mem bandwidth + more GPU cores).
51
+ - **Verdict:** **Works with workarounds (community fork only).** Functional but watch the pre-chorus bug.
52
+
53
+ ## 5. HeartMuLa (HeartMuLa/heartlib)
54
+
55
+ - **Official MPS support:** Not in the README. CUDA-first design with `--mula_device` / `--codec_device` flags ([heartlib](https://github.com/HeartMuLa/heartlib)). RTF ≈ 1.0 on CUDA.
56
+ - **Community reports:** **Strong MLX port exists**: [Acelogic/heartlib-mlx](https://github.com/Acelogic/heartlib-mlx). Claims **2.1× faster than PyTorch MPS** on M2 Max (13.4 s vs 27.9 s end-to-end), 8.7× faster model load, 100 % numerical parity with PyTorch.
57
+ - **Backend compatibility:** No flash-attn / mamba / triton in the official deps; it is a clean transformer + neural codec. MLX port supports bfloat16.
58
+ - **Memory (MLX port):** 3B model ~6 GB, HeartCodec ~2 GB, KV-cache ~1 GB/min of audio. **Full 1-min song ≈ 11 GB.** 32 GB minimum recommended; M5 Max 128 GB blows past this. 7B variant not yet released as of Feb 2026.
59
+ - **Apple-Silicon timing:** M2 Max ≈ 11.6 s to generate 50 frames; M5 Max should comfortably exceed real-time for the 3B model.
60
+ - **Verdict:** **Just works on MPS via MLX port.** Second-best Mac story after ACE-Step. The official PyTorch path is untested but should run on MPS once you bypass any CUDA cache calls.
61
+
62
+ ## 6. MusicGen (Meta / audiocraft): reference
63
+
64
+ - **Official MPS support:** None. AudioCraft officially supports CUDA or CPU only ([audiocraft README](https://github.com/facebookresearch/audiocraft)). Issues [#13](https://github.com/facebookresearch/audiocraft/issues/13) and [#31](https://github.com/facebookresearch/audiocraft/issues/31) are open requests, no merged PR. EnCodec decoder ops misbehave on MPS; the common workaround is to **move decoder to CPU** while keeping the LM on MPS.
65
+ - **Community / MLX:** Multiple solid ports: [Andrade Olivier's port](https://medium.com/@andradeolivier/i-ported-musicgen-to-apple-silicon-generate-music-from-text-on-your-macbook-9eaf95992053), [Nat Taylor's MusicGen MLX test](https://nattaylor.com/blog/2024/musicgen-via-mlx/). M4 Max: small model 8 s audio in ~6 s (faster than realtime). M1: ~60 s for 9 s of audio at 500 steps. AudioGen (sibling model) [works on MPS](https://blog.peddals.com/en/apple-mps-to-generate-audio-with-meta-audiogen/) by moving decoder ops to CPU.
66
+ - **Memory:** 300 M small / 1.5 B medium / 3.3 B large. Trivial on 128 GB.
67
+ - **Verdict:** **Partial on raw PyTorch MPS (CPU fallback for decoder); Just works via MLX port.**
68
+
69
+ ## 7. Stable Audio Open (Stability AI): reference
70
+
71
+ - **Official MPS support:** Diffusers supports `device="mps"` for the SAO pipeline ([Stable Audio docs](https://huggingface.co/docs/diffusers/en/api/pipelines/stable_audio)).
72
+ - **Community reports:** [phlo.info](https://phlo.info/posts/using-stable-audio-tools-on-apple-silicon/) reports 51 s CPU → 17 s MPS by swapping `cuda` → `mps` in two files. **fp16 conv1d in the decoder is pathologically slow on MPS**; the fix is `model.pretransform.model_half = False; model.to(torch.float32)` ([HF discussion](https://huggingface.co/stabilityai/stable-audio-open-small/discussions/1)).
73
+ - **Memory:** 1.21 B params. Trivial.
74
+ - **Apple-Silicon timing:** ~17 s per 3-s sample on M1-class; M5 Max should be a few seconds.
75
+ - **Verdict:** **Works with workarounds** (force fp32 in decoder).
76
+
77
+ ---
78
+
79
+ ## Metal / MLX Apple-Native Equivalents
80
+
81
+ - **ACE-Step**: Native MLX backend in the official repo for the LM side. **Closest thing to a first-party Mac music model.**
82
+ - **HeartMuLa**: [heartlib-mlx](https://github.com/Acelogic/heartlib-mlx), 2.1× speedup over PyTorch MPS, full numerical parity.
83
+ - **MusicGen**: Multiple MLX ports, faster than real-time on M4 Max small model.
84
+ - **Stable Audio Open**: MLX-Audio family ([Blaizzy/mlx-audio](https://github.com/Blaizzy/mlx-audio)) covers TTS/STT; SAO has unofficial MLX ports.
85
+ - **YuE / DiffRhythm / SongGeneration**: **No MLX ports** as of May 2026.
86
+
87
+ There is no umbrella "MLX-music" framework; each project rolls its own port.
88
+
89
+ ---
90
+
91
+ ## Practical Recommendation
92
+
93
+ **Start with ACE-Step 1.5.** It is the only model with first-party Apple Silicon support, hybrid MLX + MPS execution, published M-series benchmarks, and no CUDA-only dependencies. The user's 128 GB unified memory completely eliminates the OOM workaround other Mac users hit on 16–36 GB machines.
94
+
95
+ **Second pick: HeartMuLa via the MLX port** ([heartlib-mlx](https://github.com/Acelogic/heartlib-mlx)). Faster than the PyTorch MPS path, bfloat16, well-benchmarked. 3B only for now; 7B unreleased.
96
+
97
+ **Third pick: DiffRhythm v2.** Clean deps, the README claims macOS support, and it sits in a similar architecture class to Stable Audio Open, which is known to work on MPS with the fp32 decoder workaround.
98
+
99
+ **Avoid on MPS unless you enjoy yak-shaving:**
100
+ - **YuE**: flash-attn-mandatory, no Mac fork, no MLX port.
101
+ - **SongGeneration / LeVo**: only via [SongGen-Mac](https://github.com/Rdx-ai-art/SongGen-Mac) fork, pre-chorus bug, 70+ GB RAM pressure with swap. Workable on 128 GB but not pleasant.
102
+
103
+ **Remote-dev path:** For YuE specifically, **train/develop on a rented H100 or A100** (RunPod, Lambda, Modal, Replicate) and pull weights for inference on M5 Max **only if** you fork it to drop flash-attn. Otherwise treat YuE as a remote-only model. For everything else on this list, M5 Max is sufficient as the primary development machine.
104
+
105
+ **On the user's prior LTX-Video burns:** music models are LM/diffusion stacks without the multi-modal Gemma + complex64 + SDPA-on-meta-tensor traps that bit LTX-2.3. The main MPS gotchas here are mundane: flash-attn substitution, fp16 conv1d in audio decoders, and `PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0` for high-watermark allocator behaviour.
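The allocator setting mentioned above is an environment variable, so the safe habit is to set it before PyTorch is imported. A minimal sketch, assuming the `0.0` value from the bioerrorlog report; `configure_mps_env` is a hypothetical helper name. Note that, per the PyTorch MPS documentation, `0.0` disables the high-watermark limit entirely, which trades allocator OOM errors for possible system-wide memory pressure.

```python
import os


def configure_mps_env(ratio: str = "0.0") -> dict:
    """Set the MPS allocator high-watermark ratio unless the user already set one."""
    # setdefault keeps any value exported in the shell, so this never overrides the user.
    os.environ.setdefault("PYTORCH_MPS_HIGH_WATERMARK_RATIO", ratio)
    return {
        "PYTORCH_MPS_HIGH_WATERMARK_RATIO": os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"]
    }


# Call this at the top of the entry point, before `import torch`.
print(configure_mps_env())
```

On a 128 GB machine the override may never be needed; it matters mostly on the 16–36 GB machines cited in the community reports.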
research/06_comparison_matrix.md ADDED
@@ -0,0 +1,93 @@
1
+ # Open-Source Song Generation Models: Side-by-Side Comparison
2
+
3
+ *Compiled 2026-05-18 for M5 Max / 128 GB unified memory target.*
4
+
5
+ ---
6
+
7
+ ## Headline matrix
8
+
9
+ | Property | **ACE-Step 1.5 XL** | **HeartMuLa 4B** | **DiffRhythm 2** | **YuE 7B** | SongGeneration 2 |
10
+ |---|---|---|---|---|---|
11
+ | **Builder** | ACE Studio × StepFun | HeartMuLa | NWPU ASLP-lab + Xiaomi | M-A-P / HKUST | Tencent AI Lab |
12
+ | **Release** | 2026-01-28 | 2026-01-19 | 2025-10-27 → 2026-02-03 (v3) | 2025-01-26 | 2026-03-01 |
13
+ | **License** | **MIT** | **Apache 2.0** | **Apache 2.0** | **Apache 2.0** | **Custom NON-commercial** |
14
+ | **Repo stars** | 10.4 k | 3.6 k | ~2.3 k (v1) + 0.16 k (v2) | 6.2 k | 1.6 k |
15
+ | **Last major commit** | v0.1.7 (2026-04-24) | 2026-02 | 2026-02 | 2025-06-04 (stale) | 2026-03-01 |
16
+ | **Architecture** | LM-planner (Qwen3 0.6/1.7/4 B) + DiT (2/4 B) | CLAP + ASR + 12.5 Hz Codec + 4 B LLM | 5 Hz Music VAE + DiT w/ block flow matching | LLaMA2 7B AR Stage-1 + 1B Stage-2 + X-Codec | LeLM hybrid + diffusion decoder |
17
+ | **Params (largest)** | up to 8 B (4 B DiT + 4 B LM) | ~4 B + 2 B codec + 0.8 B ASR | ~1 B DiT + 170 M VAE-dec | 7 B + 1 B + upsampler | 4 B (v2-large) |
18
+ | **Audio rate** | 44.1 kHz stereo | 24 kHz neural codec | 44.1 kHz stereo | 16 kHz then upsampled | High-fi via diffusion |
19
+ | **Max length** | 4+ min | ≥1 min, scaling | **210 s (regression from v1)** | 5 min | 4:30 |
20
+ | **Vocals + Instruments** | ✅ Native | ✅ Native | ✅ Native, single stream | ✅ Native, dual-track AR | ✅ Dual-track |
21
+ | **Languages** | 50+ | 5+ (en/zh/ja/ko/es benchmarked) | Bilingual EN/ZH + JP/KR/ES marketing-only | EN, Mandarin, Cantonese, JP, KR | zh/en/es/ja + others |
22
+ | **VRAM (minimum)** | **<4 GB** with offload (turbo) | 6 GB 4-bit / 12 GB bf16 | 8 GB v1 with `--chunked` | 24 GB consumer / 80 GB single-pass | 22–28 GB |
23
+ | **VRAM (recommended)** | 12 GB+ offload, 24 GB optimal | 24 GB for 7B (unreleased) | 24 GB | 80 GB H100/H800 | 28 GB |
24
+ | **MPS / Apple Silicon** | **First-class, MLX + MPS, dedicated fork** | **MLX port, 2.1× PyTorch MPS** | Likely OK; clean deps; untested | ❌ Mandatory flash-attn | Community fork, pre-chorus bug |
25
+ | **MPS bench M-series (30 s clip)** | M3 Pro 25 s turbo / 1.5 min SFT | M2 Max 11.6 s for 50 frames | not published | not published | M1 Max 4–6 min for 2 min |
26
+ | **MPS bench M5 Max (projected)** | turbo ~10–15 s / SFT ~45–60 s | <real-time | low-minute range | n/a | ~2–3× M1 Max |
27
+ | **Speed (RTF on A100 / 4090)** | sub-2 s/song on A100 (v1.5) | RTF ≈ 1.0 | v2 RTF 0.213 (4090) → ~45 s for 210 s | 27 steps RTF 27.27× on A100 (v1, ~15 min/song) | RTF 0.82 (H20) |
28
+ | **Vocal naturalness vs Suno v4** | **4.4/5 vs 4.1/5** (blind 50-person test) | Vendor only, unverified | Authors admit clear gap vs v4.5 | Comparable vocal range; weaker mix | Vendor claim parity, unverified |
29
+ | **Lyric alignment (PER)** | Strong (lyric tags) | Vendor: 0.09 EN / 0.12 ZH (unit mismatch) | **0.13 (open-source SOTA)** | Strong from lyric tags | Vendor: 8.55 % |
30
+ | **Fine-tuning support** | ✅ LoRA, 8 songs/1h on 3090, **MPS-validated** | ❌ no public training code | ❌ "Coming soon" since Mar 2025 | ✅ LoRA (Megatron pipeline, CUDA 12.1+) | ❌ |
31
+ | **ComfyUI integration** | ✅ Native, official workflows | ✅ FL-HeartMuLa | ✅ billwuhao/ComfyUI_DiffRhythm | ✅ smthemex/ComfyUI_YuE | ✅ |
32
+ | **Replicate hosted** | ❌ no first-party | ❌ | ❌ | ✅ fofr/yue | ❌ |
33
+ | **Style/audio reference** | LoRA + lyric tags | Reference audio supported | Reference audio supported | ICL mode (style cloning) | Limited |
34
+ | **Stem separation** | Built into `fspecii/ace-step-ui` via Demucs | Modular Codec is reusable | ❌ single stream | ✅ AR dual-track is inherently separable | ✅ Dual-track output |
35
+ | **Continuation / extension** | Supported in workflows | Limited | Supported | ✅ explicit continuation mode | Supported |
36
+ | **Production deployments** | acestep.io, ace-step.app, fspecii/ace-step-ui, AMD-blessed | WaveSpeed AI, HeartMuse local app | Chutes serverless | Replicate fofr/yue, HF Spaces | WaveSpeed AI, HF Space |
37
+ | **Watermarking / content credentials** | None baked-in | None baked-in | None baked-in | None baked-in | None baked-in |
38
+ | **License gotchas** | None (MIT) | None (Apache 2.0) | Ethical disclaimer (non-binding) | Attribution required ("YuE by HKUST/M-A-P"), label "AI-generated" | **Commercial use prohibited** |
39
+ | **Independent benchmarks** | Yes: 50-person blind test, AMD vendor-validated | None located | Internal MOS only | Paper + community | None (Tencent only) |
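
The RTF figures in the speed row convert to wall-clock time by multiplying by the clip length, taking RTF as generation time divided by audio duration (the convention the DiffRhythm 2 cell implies). A minimal sketch of that arithmetic; the function name is illustrative.

```python
def generation_seconds(rtf: float, audio_seconds: float) -> float:
    """Wall-clock generation time for a clip, with RTF = generation time / audio time."""
    return rtf * audio_seconds


# DiffRhythm 2 on a 4090: RTF 0.213 for a 210 s song.
print(round(generation_seconds(0.213, 210), 1))
# SongGeneration 2 on an H20: RTF 0.82 for a 4:30 (270 s) song.
print(round(generation_seconds(0.82, 270), 1))
```

Vendors do not all define RTF the same way (some quote the inverse, as a speed-up factor), so the definition should be checked per model before comparing cells.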
40
+
41
+ ---
42
+
43
+ ## Quality dimensions (qualitative)
44
+
45
+ | Dimension | Best (open source) | Notes |
46
+ |---|---|---|
47
+ | **Pop / EDM polish** | (none; Suno v4/v5 still wins) | All open models lag commercial. |
48
+ | **Folk / classical / jazz vocal naturalness** | **ACE-Step 1.5 XL** | Wins blind test vs Suno v4 in these genres. |
49
+ | **Lyric intelligibility (PER)** | **DiffRhythm 2** (0.13) | HeartMuLa claims lower but unit-incomparable. |
50
+ | **Musical macro-structure (verse/chorus/bridge over 3-5 min)** | **YuE** or **ACE-Step 1.5** (planner) | LM-planner models lead diffusion-only here. |
51
+ | **Stereo image, mix depth** | **DiffRhythm 2** (44.1 kHz stereo native) | YuE is mono-ish; ACE-Step is stereo but variable. |
52
+ | **Genre breadth** | **YuE** | Death-growl metal to Beijing opera to rap. |
53
+ | **Multilingual breadth** | **ACE-Step 1.5** | 50+ languages w/ lyric tags; YuE deep on 5 only. |
54
+ | **Code-switching (English ↔ Mandarin in one song)** | **YuE** | Explicit demos. |
55
+ | **Speed / cost per song** | **ACE-Step 1.5** | Sub-2 s/song on A100; <minute on M5 Max. |
56
+ | **Modular reusability of components** | **HeartMuLa** | Codec/ASR/CLAP separately exportable. |
57
+
58
+ ---
59
+
60
+ ## Cost model (rough)
61
+
62
+ | Path | Per-song cost | Latency | Best for |
63
+ |---|---|---|---|
64
+ | Self-host ACE-Step 1.5 on M5 Max | $0 marginal (electricity) | ~30-50 s | Dev, beta, low-volume |
65
+ | Self-host ACE-Step 1.5 on rented A100 80 GB | ~$0.0008 (2 s × $1.50/hr ÷ 3600 s/hr) | <2 s | Production, paid SaaS |
66
+ | Replicate `fofr/yue` | ~$0.30-1.00 per song (estimated from 4090 cog runtime) | 5-15 min | Multilingual fallback, occasional |
67
+ | Self-host DiffRhythm 2 on 4090 | $0 marginal on owned 4090 | ~45 s | Speed tier, instrumentals |
68
+ | Replicate / WaveSpeed managed endpoints | varies | varies | Cold-start / spike capacity |
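
The rented-GPU figure is a straight proration of the hourly rate over the generation time. A sketch of the arithmetic, assuming the $1.50/hr rate and the 2 s upper bound quoted for the A100, and ignoring idle time, cold starts, and storage; the function name is illustrative.

```python
def per_song_cost(generation_seconds: float, hourly_rate_usd: float) -> float:
    """Marginal GPU cost of one song when billed by the second at an hourly rate."""
    return generation_seconds / 3600 * hourly_rate_usd


cost = per_song_cost(2, 1.50)
print(f"${cost:.4f} per song")             # 2 s at $1.50/hr
print(f"{1 / cost:.0f} songs per dollar")  # throughput view of the same number
```

Real utilisation is rarely 100 %, so the effective per-song cost on a reserved instance will be higher than this lower bound.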
69
+
70
+ ---
71
+
72
+ ## License risk matrix
73
+
74
+ | License | Commercial SaaS | Output ownership | Risk |
75
+ |---|---|---|---|
76
+ | MIT (ACE-Step 1.5) | ✅ | User owns | Lowest |
77
+ | Apache 2.0 (ACE-Step v1, HeartMuLa, DiffRhythm v1/v2, YuE) | ✅ with attribution | User owns | Low |
78
+ | Tencent custom (SongGeneration, SongBloom) | ❌ **prohibited** | n/a | **Blocks SaaS** |
79
+ | Suno API (closed-source baseline) | $ paid tier | platform terms | Medium |
80
+
81
+ ---
82
+
83
+ ## Hardware sizing on M5 Max (128 GB unified memory)
84
+
85
+ | Model | Fits? | Headroom | Notes |
86
+ |---|---|---|---|
87
+ | ACE-Step 1.5 XL (4 B DiT + 4 B planner) | ✅ huge | ~120 GB free | Overkill; LoRA training viable in-RAM |
88
+ | HeartMuLa 4B + 2 B codec + 0.8 B ASR | ✅ huge | ~120 GB free | 7 B variant when released will also fit |
89
+ | DiffRhythm 2 (~1 B + 170 M VAE-dec) | ✅ trivial | ~125 GB free | Tiny by 2026 standards |
90
+ | YuE 7B Stage-1 + 1B Stage-2 + upsampler | ✅ but blocked | n/a | Memory fine, **flash-attn dep blocks MPS** |
91
+ | SongGeneration 2-large (4 B + diffusion) | ✅ comfortable | ~100 GB free | Community fork bug aside, fits |
92
+
93
+ **Conclusion:** the user's 128 GB unified memory completely eliminates memory pressure for every model in this list. The constraint is software (MPS kernel compat, flash-attn substitution), not hardware.
research/07_platform_architecture.md ADDED
@@ -0,0 +1,200 @@
1
+ # Suno-Clone Platform Architecture: Build Plan
2
+
3
+ *Compiled 2026-05-18. Target hardware: Apple M5 Max, 128 GB unified memory. Core model decision: ACE-Step 1.5 XL.*
4
+
5
+ ---
6
+
7
+ ## Mental model
8
+
9
+ Suno (and Udio) are not just a song-generation model. They are a **product stack** with at least five distinct AI components and a few non-AI scaffolds. If we want to replicate the product experience, we have to plan for all of them. The song-gen model is the headline; everything else is what makes it usable.
10
+
11
+ ```
12
+             ┌─────────────────────────────────────┐
13
+             │           Web / mobile UI           │
14
+             │   (text prompt + style + lyrics)    │
15
+             └─────────────────────────────────────┘
16
+                                │
17
+                                ▼
18
+ ┌─────────────────────────────────────────────────────────────┐
19
+ │                      Orchestrator API                       │
20
+ │  - prompt routing, queue, billing, history, sharing         │
21
+ └─────────────────────────────────────────────────────────────┘
22
+        │               │               │               │
23
+        ▼               ▼               ▼               ▼
24
+ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
25
+ │ Lyrics LLM  │ │ Style/Tag   │ │ Song-gen    │ │ Voice       │
26
+ │ (Llama 3.3  │ │ rewriter    │ │ router      │ │ cloning     │
27
+ │  or Qwen)   │ │ (small LM)  │ │             │ │ (RVC)       │
28
+ └─────────────┘ └─────────────┘ └──────┬──────┘ └─────────────┘
29
+                                        │
30
+                                        ▼
31
+                       ┌─────────────────────────────────┐
32
+                       │ Model pool (the actual research)│
33
+                       │ - ACE-Step 1.5 XL (default)     │
34
+                       │ - HeartMuLa-MLX (A/B)           │
35
+                       │ - DiffRhythm 2 (speed tier)     │
36
+                       │ - YuE on Replicate (intl.)      │
37
+                       └─────────────────────────────────┘
38
+                                        │
39
+                                        ▼
40
+                       ┌─────────────────────────────────┐
41
+                       │ Post-processing pipeline        │
42
+                       │ - Loudness normalization        │
43
+                       │ - Demucs stem separation        │
44
+                       │ - Watermarking (audible+meta)   │
45
+                       │ - FFmpeg encoding → m4a/mp3     │
46
+                       └─────────────────────────────────┘
47
+                                        │
48
+                                        ▼
49
+                       ┌─────────────────────────────────┐
50
+                       │ Storage + streaming             │
51
+                       │ - S3 / R2 origin                │
52
+                       │ - HLS for in-browser playback   │
53
+                       │ - CDN                           │
54
+                       └─────────────────────────────────┘
55
+ ```
56
+
57
+ ---
58
+
59
+ ## Component-by-component plan
60
+
61
+ ### 1. Song generation: primary model
62
+
63
+ - **ACE-Step 1.5 XL** via [`clockworksquirrel/ace-step-apple-silicon`](https://github.com/clockworksquirrel/ace-step-apple-silicon) on M5 Max.
64
+ - Hybrid backend: Qwen3 planner on **MLX**, DiT decoder on **PyTorch MPS**, bf16 throughout.
65
+ - Why XL over standard 2B: 128 GB unified eats the cost, and the 4 B DiT closes meaningful quality gaps for paying users.
66
+
67
+ **LoRA fine-tuning path (when needed):**
68
+ - Document the platform's target genres → curate ~50–200 song lyric/audio pairs per genre.
69
+ - Train a per-genre LoRA on the 3090-class budget (~1 hour per LoRA per [`ace-step-1.5 README`](https://github.com/ace-step/ACE-Step-1.5)).
70
+ - Serve via the same inference pipeline with LoRA hot-swap.
71
+
72
+ **Fallback / A-B candidates:**
73
+ - **HeartMuLa-MLX** ([`Acelogic/heartlib-mlx`](https://github.com/Acelogic/heartlib-mlx)): 2.1× faster than PyTorch MPS, full numerical parity, Apache 2.0.
74
+ - **DiffRhythm 2** ([`ASLP-lab/DiffRhythm`](https://github.com/ASLP-lab/DiffRhythm)): for the speed/instrumental tier (210 s ceiling acceptable for short-form features like background loops).
75
+ - **YuE via Replicate** ([`replicate.com/fofr/yue`](https://replicate.com/fofr/yue/api)): only for EN+Mandarin+Cantonese+JP+KR generations where ACE-Step underperforms; pay-per-second, no local infra cost.
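
The primary/fallback split above amounts to a small routing function. The sketch below is one possible reading of those rules, not a specification: the `SongRequest` fields, the backend identifiers, and the `ace_step_underperforms` flag (which in practice would come from an evaluation set or user feedback) are all hypothetical.

```python
from dataclasses import dataclass

# Languages listed for the YuE fallback (ISO-style codes chosen for illustration).
YUE_LANGUAGES = {"en", "zh", "yue", "ja", "ko"}
DIFFRHYTHM_MAX_SECONDS = 210  # length ceiling quoted for DiffRhythm 2


@dataclass(frozen=True)
class SongRequest:
    language: str
    duration_s: int
    instrumental: bool = False
    ace_step_underperforms: bool = False  # hypothetical quality signal
    ab_heartmula: bool = False            # hypothetical A/B bucket flag


def route(req: SongRequest) -> str:
    """Pick a backend name for a request; ACE-Step is the default."""
    if req.ace_step_underperforms and req.language in YUE_LANGUAGES:
        return "yue-replicate"
    if req.instrumental and req.duration_s <= DIFFRHYTHM_MAX_SECONDS:
        return "diffrhythm-2"
    if req.ab_heartmula:
        return "heartmula-mlx"
    return "ace-step-1.5-xl"


print(route(SongRequest(language="en", duration_s=180)))
print(route(SongRequest(language="ja", duration_s=120, ace_step_underperforms=True)))
```

Keeping the rules in one pure function makes the routing policy easy to unit-test and to change as benchmarks come in.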
76
+
77
+ ### 2. Lyrics generation: separate LLM
78
+
79
+ The song-gen model takes **lyrics + style** as input, not raw user prompts. Suno's "song description" flow is actually two stages: prompt → lyrics LLM → lyrics → song model.
80
+
81
+ - Use any decent open LLM running on the user's M5 Max. Candidates:
82
+   - **Qwen 2.5 Coder 32B / Qwen 3 7B**: good multilingual chops, fast on MPS via Ollama or mlx-lm.
83
+   - **Llama 3.3 70B 4-bit**: premium tier; fits comfortably in 128 GB unified.
84
+   - **GPT-OSS-20B**: Apache 2.0, sturdy English.
85
+ - Prompt template should:
86
+   1. Parse user style hint into tags (genre, tempo, mood, instruments).
87
+   2. Output structured lyrics with `[verse]`, `[chorus]`, `[bridge]`, `[outro]` markers; these are **exactly the structural tags ACE-Step's `TextEncodeAceStepAudio` consumes**.
88
+   3. Constrain section count and line count to roughly match the target song duration.
89
+
90
+ **This LLM is independent of the song-gen model and can be swapped freely.**
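
The prompt-template requirements above can be sketched as a prompt builder plus a validator for whatever the LLM returns. This is a minimal sketch under stated assumptions: the prompt wording, the function names, and the default limits are illustrative, and the actual LLM call is left out.

```python
import re

SECTION_TAGS = ("verse", "chorus", "bridge", "outro")
_TAG_LINE = re.compile(r"^\[(\w+)\]$")


def build_lyrics_prompt(description, style_tags, max_sections=6, max_lines=8):
    """Compose the instruction sent to the lyrics LLM (wording is illustrative)."""
    markers = ", ".join(f"[{tag}]" for tag in SECTION_TAGS)
    return (
        f"Write song lyrics for: {description}\n"
        f"Style tags: {', '.join(style_tags)}\n"
        f"Start every section with one of these markers on its own line: {markers}\n"
        f"Use at most {max_sections} sections and {max_lines} lines per section."
    )


def parse_sections(lyrics):
    """Split LLM output into (tag, lines) pairs, rejecting unknown or missing markers."""
    sections = []
    for raw in lyrics.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = _TAG_LINE.match(line)
        if match:
            tag = match.group(1).lower()
            if tag not in SECTION_TAGS:
                raise ValueError(f"unknown section marker: [{tag}]")
            sections.append((tag, []))
        elif not sections:
            raise ValueError("lyrics must start with a section marker")
        else:
            sections[-1][1].append(line)
    return sections


sample = "[verse]\nCity lights are fading\n\n[chorus]\nHold on, hold on\n"
print(parse_sections(sample))
```

Validating the structure before handing lyrics to the song model lets the orchestrator retry the LLM cheaply instead of wasting a generation run on malformed input.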
91
+
92
+ ### 3. Style / tag normalization
93
+
94
+ A small classifier or 3 B LM that normalizes user free-text into the controlled-vocabulary tag set the song model was trained on (per genre, BPM bucket, vocal gender, mood). For ACE-Step this maps to its lyric-tag schema; for YuE it maps to `top_200_tags.json`.
95
+
96
+ Implementation: 1-shot prompt to the lyrics LLM with examples; cache results.
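
To show the shape of the normalization step (free text in, controlled tags out, results cached), here is a self-contained sketch that swaps the LLM call for a tiny lookup table. The vocabulary below is invented for illustration and is not ACE-Step's or YuE's actual tag set; `normalize_tags` and `bpm_bucket` are hypothetical names.

```python
import re
from functools import lru_cache

# Illustrative controlled vocabulary (an assumption, not a real model's tag list).
GENRE_ALIASES = {
    "lofi": "lo-fi", "lo-fi": "lo-fi",
    "hiphop": "hip-hop", "hip-hop": "hip-hop",
    "edm": "electronic",
}
MOODS = {"sad", "happy", "chill", "angry"}


def bpm_bucket(bpm: int) -> str:
    """Map a tempo to a coarse bucket; thresholds are arbitrary for this sketch."""
    if bpm < 90:
        return "slow"
    if bpm < 130:
        return "mid-tempo"
    return "fast"


@lru_cache(maxsize=4096)
def normalize_tags(free_text: str) -> tuple:
    """Return de-duplicated controlled tags; identical inputs are served from cache."""
    text = free_text.lower()
    tags = []
    bpm = re.search(r"(\d{2,3})\s*bpm", text)
    if bpm:
        tags.append(bpm_bucket(int(bpm.group(1))))
    for word in re.findall(r"[a-z\-]+", text):
        if word in GENRE_ALIASES:
            tags.append(GENRE_ALIASES[word])
        elif word in MOODS:
            tags.append(word)
    return tuple(dict.fromkeys(tags))  # preserves order, drops duplicates


print(normalize_tags("Chill lofi beat, 80 BPM"))
```

In the LLM-backed version, the body of `normalize_tags` would become the 1-shot prompt call, while the cache decorator and the tuple return type could stay as they are.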
97
+
98
+ ### 4. Voice cloning / personas (optional but Suno-equivalent)
99
+
100
+ To match Suno's "Personas" feature:
101
+ - **RVC v2** (Retrieval-based Voice Conversion): open source, fast, runs on MPS, well-supported.
102
+ - Train on a 5-minute reference clip → 10–15 min on M5 Max → speaker embedding.
103
+ - Apply to the generated vocal stem (Demucs-extracted) → remix.
104
+
105
+ ACE-Step's **ICL mode** (in-context learning from a reference clip) and YuE's ICL variants partly cover this too, but RVC gives explicit per-speaker control.
106
+
107
+ ### 5. Stem separation
108
+
109
+ For Suno's "download stems" feature:
110
+ - **Demucs v4 / HTDemucs**: open source (MIT licence), runs on MPS, separates into vocals / drums / bass / other.
111
+ - Already bundled in [`fspecii/ace-step-ui`](https://github.com/fspecii/ace-step-ui).
112
+
113
+ ### 6. Mastering / loudness normalization
114
+
115
+ - **pyloudnorm** for LUFS normalization to streaming spec (-14 LUFS for Spotify, -16 LUFS for Apple Music).
116
+ - **ffmpeg-normalize** as a CLI wrapper.
117
+ - **Optional: TBProAudio mvMeter / Voxengo Span equivalents** via web-audio for UI metering.
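
Once a tool such as pyloudnorm has measured the integrated loudness, normalization itself is a single gain. A minimal sketch of that arithmetic, assuming a -14 LUFS default target; it deliberately leaves out measurement and true-peak limiting, and the function name is illustrative.

```python
def gain_to_target(measured_lufs: float, target_lufs: float = -14.0):
    """Return (gain in dB, linear amplitude multiplier) needed to hit the target."""
    gain_db = target_lufs - measured_lufs
    return gain_db, 10 ** (gain_db / 20)


gain_db, multiplier = gain_to_target(-20.0)
print(f"{gain_db:+.1f} dB, x{multiplier:.3f}")
```

A positive gain can push peaks past full scale, so a real mastering chain would follow this with a limiter or a peak check.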
118
+
119
+ ### 7. Watermarking + content credentials
120
+
121
+ This is a **legal must-have** for any 2026 generative-music product (training-data lawsuits against Suno/Udio set the precedent).
122
+
123
+ - **Inaudible audio watermark**: AudioSeal (Meta) or SilentCipher (Sony), both open source and designed to survive MP3 transcoding.
124
+ - **C2PA metadata**: sign the m4a with model name + version + prompt + timestamp via the C2PA SDK.
125
+ - **Visible "AI-generated" tag** in UI per the YuE model card's recommendation (and increasingly per platform policy).
126
+
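The fields listed for the content credential can be gathered into one record before signing. The sketch below is an assumption-heavy illustration: it is not the C2PA manifest schema and does not call the C2PA SDK; `provenance_record` and its field names are hypothetical, and the output is simply a JSON payload a signer could consume.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Optional


def provenance_record(
    audio: bytes,
    model: str,
    version: str,
    prompt: str,
    now: Optional[datetime] = None,
) -> str:
    """Build a JSON provenance payload (hypothetical layout, not C2PA's)."""
    created = (now or datetime.now(timezone.utc)).isoformat()
    record = {
        "ai_generated": True,
        "model": model,
        "model_version": version,
        "prompt": prompt,
        "created_at": created,
        # The hash binds the record to one specific rendered file.
        "audio_sha256": hashlib.sha256(audio).hexdigest(),
    }
    return json.dumps(record, sort_keys=True)
```

Hashing the final encoded file rather than the raw model output keeps the record valid for the bytes users actually download.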
127
+ ### 8. Storage and streaming
128
+
129
+ - **S3-compatible object store** (R2, Backblaze B2, or self-hosted MinIO on the M5 Max if dev-only).
130
+ - **HLS encoding pipeline**: ffmpeg → m3u8 + 4 s segments; serve via NGINX or Cloudflare.
131
+ - For local dev, plain m4a + range requests are fine.
132
+
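One way to express the HLS step is to build the ffmpeg argument list in Python and hand it to `subprocess`. This is a minimal sketch: the output layout (`index.m3u8`, `seg_%03d.ts`), the AAC bitrate, and the `hls_command` helper are assumptions chosen for illustration.

```python
from pathlib import Path
from typing import List


def hls_command(src: str, out_dir: str, segment_seconds: int = 4) -> List[str]:
    """Build an ffmpeg command that packages an audio file as a VOD HLS stream."""
    out = Path(out_dir)
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-vn",                              # audio only, drop any cover-art stream
        "-c:a", "aac", "-b:a", "192k",      # assumed bitrate
        "-f", "hls",
        "-hls_time", str(segment_seconds),  # target segment length in seconds
        "-hls_playlist_type", "vod",
        "-hls_segment_filename", str(out / "seg_%03d.ts"),
        str(out / "index.m3u8"),
    ]
```

Running it would look like `subprocess.run(hls_command("song.m4a", "hls/song-1"), check=True)`, with the output directory created beforehand.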
133
+ ### 9. Orchestrator API
134
+
135
+ - **FastAPI** for the request-handling layer.
136
+ - **Redis Streams** or **Hatchet** for the generation queue (songs are 30 s–2 min jobs on M5 Max β€” non-trivial latency, must be async).
137
+ - **PostgreSQL** for users, songs, lyrics, LoRAs, billing.
138
+ - **Server-Sent Events** for progress streaming back to the UI ("planner stage", "DiT denoising step 14/27", "mastering...").
139
+
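The progress messages mentioned above map naturally onto Server-Sent Events frames. The sketch below only covers the wire format; the event names and payload fields are hypothetical, and the generator simulates progress instead of reading from a real queue.

```python
import json
from typing import Iterator


def sse_event(event: str, data: dict) -> str:
    """Serialize one SSE frame: event name, one JSON data line, blank line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def progress_stream(total_steps: int) -> Iterator[str]:
    """Yield frames for a simulated job: planner, denoising steps, mastering."""
    yield sse_event("stage", {"name": "planner"})
    for step in range(1, total_steps + 1):
        yield sse_event(
            "progress",
            {"stage": "dit_denoising", "step": step, "total": total_steps},
        )
    yield sse_event("stage", {"name": "mastering"})
    yield sse_event("done", {})
```

In FastAPI, a generator like this can be returned through `StreamingResponse(..., media_type="text/event-stream")`; in a real deployment the frames would be driven by queue updates.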
140
+ ### 10. Frontend
141
+
142
+ - **Next.js 16** + Cache Components for the user dashboard / library.
143
+ - **Wavesurfer.js** for waveform display and scrubbing.
144
+ - **Tone.js** for any in-browser preview / mixing.
145
+ - Auth via Clerk or Auth0; the user's portfolio revamp may already include this.
146
+
147
+ ---
148
+
149
+ ## Build order (incremental milestones)
150
+
151
+ | Milestone | Scope | Validates |
152
+ |---|---|---|
153
+ | **M0: Spike** | Get ACE-Step 1.5 XL running locally via the clockworksquirrel fork; generate one 30 s song end-to-end | Hardware compatibility, RTF on M5 Max |
154
+ | **M1: CLI MVP** | Wrap in a Python CLI: `genmusic --prompt "..." --lyrics "..." --out song.m4a` | Headless generation, mastering chain, file output |
155
+ | **M2: Local UI** | Replace UI with `fspecii/ace-step-ui` initially (fastest path); add Demucs stem download | Browser flow, multi-song library, LAN access |
156
+ | **M3: Lyrics LLM integration** | Plug Qwen 3 / Llama 3.3 as the lyrics generator; produce structured lyrics from a one-line prompt | Suno-equivalent prompt UX |
157
+ | **M4: Multi-model router** | Add HeartMuLa-MLX as alternate; add Replicate YuE as multilingual fallback; user can pick or auto-route | A/B capability, breadth |
158
+ | **M5: LoRA pipeline** | First custom LoRA on a target genre (e.g., user's preferred style); hot-swap at inference | Differentiation vs Suno |
159
+ | **M6: Production wrapper** | FastAPI + Postgres + queue + auth + watermarking + C2PA signing | Real product surface |
160
+ | **M7: Deploy** | Move heavy inference behind a rented A100 endpoint for paid users; keep M5 Max for free tier / personal use | Paid-tier economics |
161
+
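The M1 command line can be sketched with `argparse`. Only `--prompt`, `--lyrics`, and `--out` come from the milestone table; the `--duration` flag, the defaults, and the placeholder body of `main` are assumptions for illustration.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="genmusic",
        description="Generate a song from a style prompt and lyrics.",
    )
    parser.add_argument("--prompt", required=True, help="Style description")
    parser.add_argument("--lyrics", default="", help="Lyrics text (assumed optional)")
    parser.add_argument("--out", default="song.m4a", help="Output file path")
    # Hypothetical flag, not named in the milestone table.
    parser.add_argument("--duration", type=float, default=30.0, help="Length in seconds")
    return parser


def main(argv: Optional[List[str]] = None) -> int:
    args = build_parser().parse_args(argv)
    # Placeholder: a real CLI would run the model and mastering chain here.
    print(f"would generate a {args.duration:.0f} s song -> {args.out}")
    return 0
```

Keeping `main` free of model imports makes the argument handling testable without loading any weights.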
162
+ ---
163
+
164
+ ## Open questions for the user before M0
165
+
166
+ 1. **Commercial intent.** Is this a personal portfolio project (research mode → SongGeneration 2 is fair game) or a real SaaS (must stay Apache/MIT)? The license map changes drastically.
167
+ 2. **Target audience.** Western pop (where Suno still wins polish) vs world music / experimental genres (where ACE-Step / YuE compete fairly)?
168
+ 3. **Latency target.** Suno generates in ~30 s; users tolerate up to 90 s. ACE-Step on M5 Max hits this; YuE local does not.
169
+ 4. **Hosting plan.** Local-only for personal use? Or eventually paid tier on rented GPU?
170
+ 5. **Vocal cloning.** Is Suno-style "Persona" upload a must-have v1 feature, or v2?
171
+ 6. **Catalog / training data.** Any in-house licensed song catalog for LoRA fine-tuning, or strictly the openly licensed base model out of the box?
172
+
173
+ ---
174
+
175
+ ## Risks and mitigations
176
+
177
+ | Risk | Likelihood | Mitigation |
178
+ |---|---|---|
179
+ | MPS regression in a future PyTorch release breaks ACE-Step | medium | Pin torch version; keep CPU fallback path. |
180
+ | ACE-Step releases v2 with breaking API mid-build | medium | Wrap inference in a thin adapter; abstract model behind a single `Generator.generate()` interface. |
181
+ | Vendor PER claims (HeartMuLa, LeVo) overstated β†’ quality disappointment | medium | Run internal blind A/B on 20+ prompts before featuring a model in the UI. |
182
+ | Output watermark stripped by transcoding | low | Use AudioSeal which survives MP3; double-stamp with C2PA metadata. |
183
+ | Lyrics LLM hallucinates copyrighted hooks | medium | Run a similarity check against an embeddings index of known songs; flag for human review. |
184
+ | Training-data IP suit (Suno-style) | low for derivative usage | Use models with documented public-data training (ACE-Step's paper is reasonably transparent); avoid Tencent's non-commercial weights. |
185
+ | MPS OOM on long sequences | low (128 GB) | `PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0`; chunk generation; offload non-active LoRAs. |
186
+
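The thin-adapter mitigation in the table can be sketched as a small interface with interchangeable backends. Only the `Generator.generate()` name comes from the text; `SongRequest`, `FakeGenerator`, and `Router` are hypothetical, and no real ACE-Step call is shown.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Protocol


@dataclass(frozen=True)
class SongRequest:
    prompt: str
    lyrics: str = ""
    duration_s: float = 30.0


class Generator(Protocol):
    """The single interface every model backend is wrapped behind."""

    name: str

    def generate(self, request: SongRequest) -> bytes: ...


class FakeGenerator:
    """Stand-in backend for tests; a real adapter would wrap model inference."""

    name = "fake"

    def generate(self, request: SongRequest) -> bytes:
        return f"{self.name}:{request.prompt}:{request.duration_s}".encode()


class Router:
    """Dispatch to a named backend, falling back to the default."""

    def __init__(self, backends: Dict[str, Generator], default: str) -> None:
        if default not in backends:
            raise KeyError(f"unknown default backend: {default}")
        self._backends = backends
        self._default = default

    def generate(self, request: SongRequest, model: Optional[str] = None) -> bytes:
        return self._backends[model or self._default].generate(request)
```

With this shape, a breaking upstream API change touches one adapter class, and callers of `Router.generate` stay unchanged.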
187
+ ---
188
+
189
+ ## Why ACE-Step 1.5 XL is the foundation (not just a model pick)
190
+
191
+ This is worth saying explicitly. Choosing the base model determines:
192
+
193
+ 1. **Inference budget and unit economics**: ACE-Step is the only model where <2 s/song on A100 makes a paid tier economically obvious.
194
+ 2. **Mac developer ergonomics**: first-class MPS means the user can iterate on the M5 Max for weeks without renting cloud GPU.
195
+ 3. **License-clean output ownership**: MIT means users own their songs unambiguously.
196
+ 4. **Future-proof on multilingual**: 50+ languages out of the box matters if the platform grows beyond an English audience.
197
+ 5. **LoRA personalization is the differentiator**: fine-tuning support that works on MPS lets the user ship genre-specialist sub-models that Suno can't, because Suno's weights are locked.
198
+ 6. **Production deployments exist**: AMD vendor-backed, `fspecii/ace-step-ui` running at scale, multiple SaaS already on the open weights. This is not betting on a research artifact.
199
+
200
+ The compound effect of those six is why ACE-Step is recommended as the platform foundation rather than just "the model to start with."