docs: track spec + mockups + model research
The docs/ and research/ directories existed on disk from before
git init (they hold the brainstorming + research output) but were
never explicitly added by the early A-series commits. Tracking them
now so the spec, UI mockups, and base-model research are part of
the repo history alongside the implementation plan.
Contents:
- docs/superpowers/specs/2026-05-18-ace-music-studio-design.md
- docs/superpowers/specs/mockups/01_generate_mobile_errors.html
- docs/superpowers/specs/mockups/02_cover_extend.html
- docs/superpowers/specs/mockups/03_edit_lyrics.html
- docs/superpowers/specs/mockups/README.md
- research/00_executive_summary.md
- research/01_yue.md
- research/02_diffrhythm.md
- research/03_acestep.md
- research/04_newcomers_and_survey.md
- research/05_apple_silicon_mps_audit.md
- research/06_comparison_matrix.md
- research/07_platform_architecture.md
- docs/superpowers/specs/2026-05-18-ace-music-studio-design.md +550 -0
- docs/superpowers/specs/mockups/01_generate_mobile_errors.html +604 -0
- docs/superpowers/specs/mockups/02_cover_extend.html +572 -0
- docs/superpowers/specs/mockups/03_edit_lyrics.html +517 -0
- docs/superpowers/specs/mockups/README.md +50 -0
- research/00_executive_summary.md +122 -0
- research/01_yue.md +268 -0
- research/02_diffrhythm.md +138 -0
- research/03_acestep.md +224 -0
- research/04_newcomers_and_survey.md +161 -0
- research/05_apple_silicon_mps_audit.md +105 -0
- research/06_comparison_matrix.md +93 -0
- research/07_platform_architecture.md +200 -0
@@ -0,0 +1,550 @@
# ACE Music Studio: Design Spec

**Date:** 2026-05-18
**Status:** Approved, ready for implementation plan
**Repo:** `~/Projects/llm/music-generator/` → GitHub `techfreakworm/ace-music-studio` (to be created)
**HF Space:** `huggingface.co/spaces/techfreakworm/ace-music-studio` (to be created)
**Companion docs:** `research/00_executive_summary.md` (model selection rationale)

---

## 1. Goal

A single-process Gradio app that wraps **ACE-Step 1.5 XL SFT** for full-song generation with vocals, deployable both to a free non-profit **Hugging Face ZeroGPU Space** and locally on **Apple M5 Max (MPS / MLX)** or **NVIDIA (CUDA)** workstations. It supports the full ACE-Step feature surface: text-to-song, audio-reference cover, song extension, segment-level edit/repaint, plus an in-app lyrics writer powered by a bundled small LM. Users can stack any number of LoRAs from a curated preset library or upload custom `.safetensors` files at runtime.

Non-goals (v1): commercial-tier SaaS, multi-user accounts, persistent storage across sessions, social features, payment integration.

---

## 2. Locked product decisions

| Decision | Value | Source |
|---|---|---|
| Product name | **ACE Music Studio** (slug `ace-music-studio`) | brainstorming Q1 |
| Base model | ACE-Step 1.5 XL SFT (4 B DiT + 4 B Qwen3 planner) | research bundle `03_acestep.md` |
| Backend pattern | Direct ACE-Step Python API, single Gradio process | brainstorming Q architecture |
| UI layout | Sidebar nav + form + output (3 columns on desktop) | brainstorming Q layout = B |
| Theme | Brutalist Mono (pure black/white, no accent) | brainstorming Q palette = E |
| Tab set | Generate · Cover · Extend · Edit · Lyrics | brainstorming Q scope = all |
| LoRA capability | Multi-stack via PEFT + bundled presets + custom upload | brainstorming Q scope |
| Lyrics LM | Qwen 2.5 7B Instruct (Apache-2.0, ~14 GB bf16) | brainstorming Q lyrics LLM |
| Hosting | Free ZeroGPU (community grant if needed) | brainstorming Q hosting |
| License | MIT, public GitHub | brainstorming Q license |
| Mobile | Horizontal scroll tabs at top, ≤ 640 px | brainstorming Q responsive = A |
| Authorship rule | Mayank Gupta sole author on every commit | user prior memory `feedback_git_authorship.md` |

---

## 3. Architecture

### 3.1 Top-level shape

```
            ┌──────────────────────────────────────────┐
 browser ─▶ │ app.py: Gradio Blocks                    │
            │ header · sidebar · 5 tabs · CTA footer   │
            └────────────────────┬─────────────────────┘
                                 │
                                 ▼
            ┌──────────────────────────────────────────┐
            │ backend.py: ACEStepStudioBackend         │
            │ @spaces.GPU(duration=callable)           │
            │ lazy singletons; one mode-dispatch fn    │
            └────────────────────┬─────────────────────┘
                                 │
        ┌─────────────────┬──────┴──────────┬─────────────────┐
        ▼                 ▼                 ▼                 ▼
 ace_pipeline.py    lora_stack.py     lyrics_lm.py     post_process.py
 ACEStepPipeline    preset registry   Qwen 2.5 7B      Demucs stems
 device/cache       PEFT adapters     MLX or PyTorch   pyloudnorm
                    sniff + validate  lazy load
```

### 3.2 Backend singleton: `ACEStepStudioBackend`

One per-process instance, constructed lazily on first request. Owns three independently-lazy sub-singletons:

| Sub-singleton | Loads when | Holds |
|---|---|---|
| `ACEStepPipeline` instance | first generation request | DiT, Qwen3 planner, audio codec, VAE |
| `LyricsLM` instance | first lyrics-tab request | Qwen 2.5 7B weights, tokenizer |
| `Demucs` instance | first stem-separation request | `htdemucs_ft` weights |

Boot cost: only `_bootstrap()` (cache mirror + symlinks), ~1–5 s. First gen request: +30–60 s warm-up. First lyrics request: +20–40 s. First stem request: +10 s. All amortised across the session.
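
The independently-lazy sub-singletons described above can be sketched as follows. This is a minimal illustration, not the backend itself: `Lazy` and `StudioBackendSketch` are hypothetical names, and the loaders are stand-in lambdas rather than real model loads.

```python
from typing import Callable, Dict, Generic, List, Optional, TypeVar

T = TypeVar("T")


class Lazy(Generic[T]):
    """Holds a loader and runs it at most once, on first access."""

    def __init__(self, loader: Callable[[], T]) -> None:
        self._loader = loader
        self._value: Optional[T] = None
        self.loaded = False

    def get(self) -> T:
        if not self.loaded:
            self._value = self._loader()
            self.loaded = True
        return self._value  # type: ignore[return-value]


class StudioBackendSketch:
    """Hypothetical stand-in for ACEStepStudioBackend."""

    def __init__(self, loaders: Dict[str, Callable[[], object]]) -> None:
        # Nothing is loaded here, so construction stays cheap.
        self._parts = {name: Lazy(fn) for name, fn in loaders.items()}

    def part(self, name: str) -> object:
        return self._parts[name].get()

    def loaded_parts(self) -> List[str]:
        return sorted(n for n, p in self._parts.items() if p.loaded)


# Stand-in loaders that record when they run.
calls: List[str] = []
backend = StudioBackendSketch({
    "pipeline": lambda: calls.append("pipeline") or "pipe",
    "lyrics_lm": lambda: calls.append("lyrics_lm") or "lm",
    "demucs": lambda: calls.append("demucs") or "separator",
})
backend.part("lyrics_lm")  # first lyrics request triggers the load
backend.part("lyrics_lm")  # second request reuses the cached instance
```

A lyrics-only session in this sketch never pays for the pipeline or Demucs loads, which is the property the table above relies on.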

### 3.3 Device autodetect (`ace_pipeline.py`)

Priority: **CUDA → MPS → CPU**.

Apple Silicon path:

- Set `PYTORCH_ENABLE_MPS_FALLBACK=1` before any torch import (in `app.py` module preamble, before backend imports torch).
- Use the **Apple-Silicon fork of ACE-Step** (`clockworksquirrel/ace-step-apple-silicon`) on Mac, pinned via `requirements-mac.txt` extra. Hybrid MLX (LM planner) + PyTorch MPS (DiT decoder).
- Skip the CUDA-only `torch.cuda.mem_get_info` gate: `vram_limit_for("mps")` returns `None` so ACE-Step's free-VRAM check short-circuits.
- bf16 throughout; `--bf16 false` only if a specific kernel falls back.

CUDA path:

- Vanilla `ace-step` from git (or PyPI when published).
- bf16; allow flash-attn if installed.
- `vram_limit_for("cuda")` returns the safe cap from `torch.cuda.mem_get_info`.

CPU path (warning only, not blocked):

- Single warning banner on app load if no GPU detected: "CPU inference: expect ~10× slower."
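
A minimal sketch of the priority rule and the VRAM gate, kept free of torch so it runs anywhere. The spec names `vram_limit_for`; the extra `free_bytes` and `safety_percent` parameters and the `pick_device` helper are assumptions made for illustration. In the app the booleans would presumably come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`, and `free_bytes` from the first element of `torch.cuda.mem_get_info()`.

```python
from typing import Optional


def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Apply the CUDA -> MPS -> CPU priority."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def vram_limit_for(device: str,
                   free_bytes: Optional[int] = None,
                   safety_percent: int = 90) -> Optional[int]:
    """Return a byte cap on CUDA; None elsewhere so the VRAM check is skipped.

    safety_percent is a hypothetical knob, not a value from the spec.
    """
    if device != "cuda" or free_bytes is None:
        return None
    return free_bytes * safety_percent // 100


def cpu_warning(device: str) -> Optional[str]:
    """Banner text for the CPU path; None when a GPU backend was found."""
    return "CPU inference: expect ~10× slower." if device == "cpu" else None
```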

### 3.4 HF Spaces bootstrap (`app.py:_bootstrap()`)

Direct port of z-image-studio's pattern, with model paths swapped:

1. If `on_spaces()`, mirror the read-only `HF_HOME` (build cache) to `~/hf-cache-rw/`.
2. Repoint `HF_HOME` and `HF_HUB_CACHE` env vars at the writable copy.
3. Set `ACESTEP_MODEL_BASE_PATH` (or whatever the fork's env var is) to a project-local `./models/`.
4. Symlink each cached HF snapshot into `./models/<repo>/` so the pipeline's loader finds them locally.

This avoids re-downloads on every cold container start and works around HF's read-only build cache layer.
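
The four steps above might look roughly like this. It is a simplified sketch under stated assumptions: `bootstrap_cache` is a hypothetical name, the environment is passed in as a plain dict (the app would pass `os.environ`), and each cached repo directory is linked as a whole instead of resolving individual snapshot revisions.

```python
import shutil
import tempfile
from pathlib import Path


def bootstrap_cache(ro_cache: Path, rw_cache: Path,
                    models_dir: Path, env: dict) -> None:
    # 1. Mirror the read-only build cache into a writable location.
    shutil.copytree(ro_cache, rw_cache, dirs_exist_ok=True)
    # 2. Repoint the Hugging Face cache variables at the writable copy.
    env["HF_HOME"] = str(rw_cache)
    env["HF_HUB_CACHE"] = str(rw_cache / "hub")
    # 3. Tell the pipeline where its local models live (variable name as in the spec).
    env["ACESTEP_MODEL_BASE_PATH"] = str(models_dir)
    # 4. Expose each cached repo under models_dir via a symlink.
    models_dir.mkdir(parents=True, exist_ok=True)
    hub = rw_cache / "hub"
    if not hub.is_dir():
        return
    for repo_dir in hub.iterdir():
        link = models_dir / repo_dir.name
        if repo_dir.is_dir() and not link.exists():
            link.symlink_to(repo_dir, target_is_directory=True)


# Demo against a throwaway directory tree.
demo_root = Path(tempfile.mkdtemp())
ro = demo_root / "ro-cache"
(ro / "hub" / "models--demo--repo").mkdir(parents=True)
(ro / "hub" / "models--demo--repo" / "config.json").write_text("{}")
demo_env: dict = {}
bootstrap_cache(ro, demo_root / "hf-cache-rw", demo_root / "models", demo_env)
```

Running the function twice is harmless in this sketch: `copytree` tolerates an existing target and existing links are left alone.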

### 3.5 ZeroGPU integration

- `@spaces.GPU(duration=…)` decorates `backend.generate(mode, params)` at module load time. The decorator is a no-op identity off Spaces.
- `duration` is a callable that estimates per-call timeout from `(mode, params)`, clamped to `[60, 180] s`:
  - Generate / Cover at default settings → 60 s
  - Long Generate (>120 s output) or Edit → 90–120 s
  - Extend with large repaint window → 120–180 s
  - Lyrics (separate decoration) → 30 s
- On `"GPU task aborted"` exception, auto-retry once at 2× duration. After second failure, return `gr.Warning` with timing diagnostics.
- `requirements.txt` **must not pin `spaces`** (HF injects its own version).
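
A sketch of the duration callable and the one-shot retry, under stated assumptions: the tier arithmetic in `estimate_duration` is a made-up heuristic that merely lands inside the ranges listed above, `GpuAborted` stands in for whatever exception ZeroGPU actually raises, and `run_with_retry` passes the budget explicitly instead of going through the `spaces` decorator.

```python
def estimate_duration(mode: str, params: dict) -> int:
    """Hypothetical heuristic mirroring the tiers above (seconds)."""
    if mode == "lyrics":
        return 30  # separate decoration, outside the [60, 180] clamp
    if mode == "extend":
        guess = 120 + params.get("extra_duration_s", 0) // 2
    elif mode == "edit" or params.get("duration_s", 0) > 120:
        guess = 90
    else:
        guess = 60
    return max(60, min(180, guess))


class GpuAborted(RuntimeError):
    """Stand-in for the 'GPU task aborted' failure."""


def run_with_retry(fn, mode: str, params: dict):
    """Call fn once; on abort, retry a single time at twice the budget."""
    budget = estimate_duration(mode, params)
    try:
        return fn(mode, params, budget)
    except GpuAborted:
        # A second abort propagates; the caller turns it into gr.Warning.
        return fn(mode, params, budget * 2)


# Demo: the first attempt aborts, the retry succeeds.
attempts = []


def flaky(mode, params, budget):
    attempts.append(budget)
    if len(attempts) == 1:
        raise GpuAborted("GPU task aborted")
    return "ok"


result = run_with_retry(flaky, "generate", {"duration_s": 30})
```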

---

## 4. The five modes

All mode handlers live in `modes.py` as pure functions over `(backend, params) → (audio_path, meta_dict)`. They share the **LoRA stack** and **advanced opts** code paths via shared helpers.

### 4.1 Generate (text → song)

**Inputs**: `prompt` (style), `lyrics`, `duration_s` (5–240), `instrumental` (bool), `lora_stack`, `advanced`.

**ACE-Step params**: `audio_cover_strength=0`, `repaint_mode=None`, `flow_edit_morph=False`, `cot_*` controlled by advanced "LM thinking" toggle.

**Output**: WAV (44.1 kHz stereo) + metadata JSON.

### 4.2 Cover (audio reference → song in that style)

**Inputs**: `prompt` (new style hint, optional), `ref_audio` file (any of mp3/wav/flac, ≤ 60 s), `lyrics` (new lyrics), `duration_s`, `lora_stack`, `advanced`.

**ACE-Step params**: `audio_cover_strength≈0.93` (configurable in advanced), `cover_noise_strength=0`, `infer_method="ode"`.

**Output**: WAV.

### 4.3 Extend (continue an existing song)

**Inputs**: `seed_audio` (≤ 240 s), `extra_prompt`, `extra_duration_s` (5–120), `lora_stack`, `advanced`.

**ACE-Step params**: `repaint_mode="balanced"`, `repaint_strength` configurable, `repainting_start` set to the seed-audio end timestamp, `repainting_end` set to seed-end + `extra_duration_s`. Exact param names + sentinels for "append-after-end" must be verified against the current ACE-Step Python API during M3 implementation; see §14, open question 7.

**Output**: WAV (seed + extension concatenated).

### 4.4 Edit (repaint / flow morph a segment)

**Inputs**: `source_audio`, `source_lyrics`, `target_lyrics`, `segment_start_s`, `segment_end_s`, `mode` ∈ {`repaint`, `flow_edit`}, `lora_stack`, `advanced`.

**ACE-Step params**:

- repaint sub-mode: `repaint_mode="balanced"`, `repainting_start=segment_start_s`, `repainting_end=segment_end_s`, `repaint_strength=0.5`.
- flow_edit sub-mode: `flow_edit_morph=True`, `flow_edit_source_caption`, `flow_edit_source_lyrics`, `flow_edit_n_min=0.0`, `flow_edit_n_max=1.0`, `flow_edit_n_avg=1`.

**Output**: WAV.
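
The two Edit sub-modes map to keyword arguments roughly as sketched below. The kwarg names are the ones listed in this section and still need checking against the real ACE-Step API; `edit_kwargs` and the `source_caption` input key are hypothetical.

```python
def edit_kwargs(params: dict) -> dict:
    """Build mode-specific kwargs for the Edit tab (names taken from §4.4)."""
    mode = params["mode"]
    if mode == "repaint":
        start, end = params["segment_start_s"], params["segment_end_s"]
        if not 0 <= start < end:
            raise ValueError("segment_start_s must be >= 0 and < segment_end_s")
        return {
            "repaint_mode": "balanced",
            "repainting_start": start,
            "repainting_end": end,
            "repaint_strength": 0.5,
        }
    if mode == "flow_edit":
        return {
            "flow_edit_morph": True,
            # 'source_caption' is an assumed input key for illustration.
            "flow_edit_source_caption": params.get("source_caption", ""),
            "flow_edit_source_lyrics": params["source_lyrics"],
            "flow_edit_n_min": 0.0,
            "flow_edit_n_max": 1.0,
            "flow_edit_n_avg": 1,
        }
    raise ValueError(f"unknown edit mode: {mode!r}")


repaint = edit_kwargs({"mode": "repaint", "segment_start_s": 12.0, "segment_end_s": 20.0})
```

Keeping this mapping in a pure function makes it straightforward to assert the kwargs in the mocked-pipeline test layer without loading any model.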

### 4.5 Lyrics (Qwen 2.5 → structured lyrics)

**Inputs**: `brief` (free-text prompt), `target_structure` (e.g., "intro, verse, chorus, verse, chorus, bridge, chorus, outro"), `language`, `tone` (optional).

**System prompt** (locked):

```
You are a songwriter. Output ONLY structured lyrics for an AI music generator. Use these section tags exactly:
[intro] [verse 1] [verse 2] [chorus] [bridge] [outro] (etc.)

Each section is on its own line, followed by the lyrics for that section. Keep verses 4-8 lines, choruses 4 lines, bridges 2-4 lines. Match the requested tone and language. Do not include commentary, headers, or markdown.
```

**Output**: plain text with structural tags. A "Use these in Generate" button populates the Generate tab's `lyrics` field.

### 4.6 Retake button

Every mode's output panel has a "↻ retake" button. It re-runs the same mode handler with a new random seed, all other params unchanged.

---

## 5. LoRA stack (`lora_stack.py`)

### 5.1 Preset registry

`presets/manifest.json`:

```json
[
  {"name":"RapMachine","hf_id":"ACE-Step/ACE-Step-v1-RapMachine-LoRA","kind":"genre"},
  {"name":"Chinese Rap","hf_id":"ACE-Step/ACE-Step-v1-Chinese-Rap-LoRA","kind":"genre"},
  {"name":"Lyric2Vocal","hf_id":"ACE-Step/ACE-Step-v1-Lyric2Vocal-LoRA","kind":"voice"},
  {"name":"Text2Samples","hf_id":"ACE-Step/ACE-Step-v1-Text2Samples-LoRA","kind":"instrumental"}
]
```

Presets are downloaded from HF on first preset-click, cached, and registered as PEFT adapters with the preset name. The four preset chips appear in every song-mode tab.

### 5.2 Custom upload

User drops a `.safetensors` file into the upload zone:

1. `sniff(path)` reads the safetensors header (no full load, just metadata).
2. Verifies key naming matches ACE-Step 1.5 XL DiT (`*.to_q.lora_A.weight`, etc.) and rank ≤ 256, alpha set, file ≤ 500 MB.
3. On success, registers as a new PEFT adapter under `Path(path).stem` as adapter name; appears in the active stack.
4. On failure, raises `LoRAValidationError` → `gr.Error` toast: "This LoRA isn't compatible with ACE-Step 1.5 XL SFT. Expected DiT modules: to_q, to_k, to_v, to_out.0, ff.net.0.proj, ff.net.2."
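
The header sniff in steps 1 and 2 can be sketched with the standard library alone, because a safetensors file starts with an 8-byte little-endian header length followed by a JSON header describing every tensor. The sketch below is an assumption-laden illustration: it checks size, presence of `lora_A` weights, and rank (taken as the first dimension of each `lora_A.weight`), and leaves out the alpha and module-name checks. `write_fake` exists only to build demo files.

```python
import json
import struct
import tempfile
from pathlib import Path

MAX_BYTES = 500 * 1024 * 1024  # file size cap from the spec
MAX_RANK = 256                 # rank cap from the spec


class LoRAValidationError(ValueError):
    pass


def read_header(path: Path) -> dict:
    """Read only the JSON header; tensor data is never loaded."""
    with path.open("rb") as fh:
        (length,) = struct.unpack("<Q", fh.read(8))
        return json.loads(fh.read(length))


def sniff(path: Path) -> dict:
    if path.stat().st_size > MAX_BYTES:
        raise LoRAValidationError("file larger than 500 MB")
    header = read_header(path)
    header.pop("__metadata__", None)  # optional free-form metadata entry
    lora_a = {k: v for k, v in header.items() if k.endswith("lora_A.weight")}
    if not lora_a:
        raise LoRAValidationError("no lora_A weights found")
    rank = max(entry["shape"][0] for entry in lora_a.values())
    if rank > MAX_RANK:
        raise LoRAValidationError(f"rank {rank} exceeds {MAX_RANK}")
    return {"rank": rank, "modules": len(lora_a)}


def write_fake(path: Path, tensors: dict) -> None:
    """Demo helper: writes a header-only file, with no tensor payload."""
    header = json.dumps(tensors).encode("utf-8")
    path.write_bytes(struct.pack("<Q", len(header)) + header)


demo_dir = Path(tempfile.mkdtemp())
good = demo_dir / "good.safetensors"
write_fake(good, {
    "blocks.0.attn.to_q.lora_A.weight":
        {"dtype": "BF16", "shape": [16, 2560], "data_offsets": [0, 81920]},
})
report = sniff(good)
```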

### 5.3 Active stack management

UI shows a list of active LoRAs with per-row strength slider (0.0–1.5) and × button. State held in `gr.State` per tab. On generate:

```python
backend.apply_lora_stack(active_adapters)  # pipe.set_adapters(names, weights=scales)
audio, meta = backend.generate(mode, params)
meta["loras"] = [{"name":n, "scale":s, "sha256":h} for n,s,h in active_adapters]
```

After generation the adapters stay loaded (cheap memory cost) but are deactivated via `pipe.disable_adapters()` if the user clears the stack.

### 5.4 Sole-LoRA edge cases

- All chips off + no upload → `pipe.disable_adapters()` (vanilla SFT XL output).
- One LoRA with scale 0.0 → effectively disabled but still listed (UX: don't surprise the user by silently dropping it).
- Same LoRA loaded twice (user dragged the same file twice) → dedupe by file sha256; UI flash: "already in stack."
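
The sha256 dedupe rule can be illustrated in a few lines. `add_to_stack` is a hypothetical helper operating on a plain list of dicts shaped like the `meta["loras"]` entries; the real stack lives in `gr.State`.

```python
import hashlib
from typing import Dict, List


def sha256_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def add_to_stack(stack: List[Dict], name: str, data: bytes,
                 scale: float = 1.0) -> bool:
    """Append a LoRA unless a file with the same content is already present.

    Returns False for a duplicate so the UI can flash "already in stack."
    """
    digest = sha256_of(data)
    if any(entry["sha256"] == digest for entry in stack):
        return False
    stack.append({"name": name, "scale": scale, "sha256": digest})
    return True


stack: List[Dict] = []
first = add_to_stack(stack, "psy", b"fake-lora-bytes")
second = add_to_stack(stack, "psy-copy", b"fake-lora-bytes")       # same content, new name
third = add_to_stack(stack, "other", b"different-bytes", scale=0.0)  # scale 0.0 stays listed
```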

---

## 6. Lyrics LM (`lyrics_lm.py`)

### 6.1 Backend selection

| Device | Backend | Weights size |
|---|---|---|
| `mps` (Mac) | `mlx-lm` with quantised Qwen 2.5 7B 4-bit | ~4 GB |
| `cuda` | `transformers` with bf16 | ~14 GB |
| ZeroGPU | `transformers` bf16, sliced into the `@spaces.GPU` lifetime | ~14 GB |

Quantisation on Mac is the practical choice: 4-bit MLX-quant Qwen 2.5 7B runs ~3× faster than full-precision PyTorch MPS and barely affects lyric quality.

### 6.2 Generation

- `max_new_tokens=600`, `temperature=0.85`, `top_p=0.9`, `repetition_penalty=1.1`.
- Stop sequences: `\n\n[end]`, `</lyrics>`.
- Post-process: strip leading/trailing whitespace, normalize section tags to lowercase (e.g., `[Verse 1]` → `[verse 1]`).
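
The stop-sequence cut and the tag normalisation can be sketched as two small pure functions. The names `cut_at_stop` and `normalize_lyrics` are illustrative, and the regex only touches lines that consist of a single bracketed tag, so bracketed words inside a lyric line are left alone.

```python
import re

STOPS = ("\n\n[end]", "</lyrics>")

# A tag line: optional trailing spaces or tabs, nothing else on the line.
_TAG_LINE = re.compile(r"^\[([^\]\n]+)\][ \t]*$", re.MULTILINE)


def cut_at_stop(text: str) -> str:
    """Truncate at the earliest stop sequence, if any is present."""
    hits = [i for i in (text.find(s) for s in STOPS) if i != -1]
    return text[:min(hits)] if hits else text


def normalize_lyrics(text: str) -> str:
    """Strip outer whitespace and lowercase section tags."""
    def lower_tag(match: re.Match) -> str:
        return "[" + " ".join(match.group(1).lower().split()) + "]"

    return _TAG_LINE.sub(lower_tag, cut_at_stop(text).strip())


raw = "  [Verse 1]\nHello [World]\n\n[CHORUS]  \nLa la\n\n[end] trailing junk"
clean = normalize_lyrics(raw)
```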

### 6.3 Lazy loading

```python
class LyricsLM:
    """Process-wide lazy singleton for the lyrics model."""

    _instance = None  # populated on the first get()

    @classmethod
    def get(cls):
        # Load the weights once, on the first lyrics request.
        if cls._instance is None:
            cls._instance = cls._load()  # _load() is defined elsewhere in this module
        return cls._instance
```

First call cost: ~20–40 s on Mac, ~10 s on CUDA. Surfaced to the user via `gr.Progress` on the Lyrics tab.

---

## 7. Post-processing (`post_process.py`)

### 7.1 Stem separation

- `demucs.api.Separator(model="htdemucs_ft")` lazy singleton.
- Output: 4 WAV files (vocals, drums, bass, other).
- Runs synchronously after generation if the user expands the Stems section, or on-demand via a "Separate stems" button in the output panel.
- On ZeroGPU, counted in the same `@spaces.GPU` lifetime as the generation that produced the audio.

### 7.2 Loudness normalization

- `pyloudnorm` normalises to **-14 LUFS** (streaming spec).
- Toggled by an `Advanced` checkbox per mode (default ON).
- Applied to the final WAV before MP3 encoding.
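
The arithmetic behind normalising to a loudness target is a fixed gain: the difference between target and measured loudness in dB, converted to a linear factor. The sketch below shows only that arithmetic on a plain list of samples; the measured value is assumed to come from a loudness meter such as the one in `pyloudnorm`, and the function names are illustrative.

```python
from typing import List

TARGET_LUFS = -14.0  # streaming target from the spec


def gain_db(measured_lufs: float, target_lufs: float = TARGET_LUFS) -> float:
    """Gain needed to move the measured loudness onto the target."""
    return target_lufs - measured_lufs


def linear_gain(db: float) -> float:
    """Convert a dB gain into an amplitude multiplier."""
    return 10 ** (db / 20)


def apply_gain(samples: List[float], measured_lufs: float) -> List[float]:
    factor = linear_gain(gain_db(measured_lufs))
    return [s * factor for s in samples]


# A track measured at -20 LUFS needs +6 dB, roughly a 2x amplitude boost.
boosted = apply_gain([0.1, -0.2, 0.25], measured_lufs=-20.0)
```

A positive gain can push peaks past full scale, so a real implementation would likely check for clipping after this step.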

### 7.3 MP3 export

- `ffmpeg` via `subprocess` → 320 kbps CBR, 44.1 kHz, stereo.
- Embeds metadata as ID3 tags (prompt, lora hashes, seed).
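
A sketch of how the export command might be assembled before being handed to `subprocess.run`. The function name and the choice of metadata keys are assumptions; the flags themselves (`-codec:a libmp3lame`, `-b:a`, `-ar`, `-ac`, `-metadata`) are standard ffmpeg options. Building the argument list separately keeps it testable without ffmpeg installed.

```python
from typing import Dict, List


def mp3_command(wav_path: str, mp3_path: str, meta: Dict[str, str]) -> List[str]:
    """Build an ffmpeg argv for a 320 kbps, 44.1 kHz, stereo MP3."""
    cmd = [
        "ffmpeg", "-y",            # overwrite the output if it exists
        "-i", wav_path,
        "-codec:a", "libmp3lame",
        "-b:a", "320k",            # fixed bitrate
        "-ar", "44100",            # sample rate in Hz
        "-ac", "2",                # stereo
    ]
    for key, value in meta.items():
        # Each pair becomes a tag in the output file.
        cmd += ["-metadata", f"{key}={value}"]
    cmd.append(mp3_path)
    return cmd


argv = mp3_command(
    "out.wav", "out.mp3",
    {"title": "demo", "comment": "seed=42; loras=none"},  # illustrative keys
)
# In the app this list would be run with subprocess.run(argv, check=True).
```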

---

## 8. Frontend (`app.py` + `ui.py` + `theme.py`)

> **Reference mockups (visual source of truth):**
>
> | File | Covers |
> |---|---|
> | [`mockups/01_generate_mobile_errors.html`](./mockups/01_generate_mobile_errors.html) | Generate tab (fully expanded), mobile phone screens, error / edge-case states |
> | [`mockups/02_cover_extend.html`](./mockups/02_cover_extend.html) | Cover tab + Extend tab (both fully expanded) |
> | [`mockups/03_edit_lyrics.html`](./mockups/03_edit_lyrics.html) | Edit tab (Repaint + Flow Morph sub-modes) + Lyrics tab (Qwen LM params) |
> | [`mockups/README.md`](./mockups/README.md) | What's shared across tabs + what each tab adds |
>
> The mockups define the **layout, spacing, control surface, and disclosure hierarchy.** The prose below defines the **semantics**: what each control does, what the defaults are, what the responsive breakpoints are. If a discrepancy ever shows up, the mockups are the source for layout, and §3–§7 of this spec are the source for behaviour.

### 8.1 Page chrome

```html
HEADER (sticky):
  [brand: "ACE Music Studio." in 15px white, "." in #FFF as period]
  [status: "ready · MPS · M5 Max" in 10px muted]

CTA (below header, separator below):
  Built with ♥. Drop a like · Follow @techfreakworm for what's next.

(Tab content)
```

### 8.2 Sidebar (desktop ≥ 1024 px)

5 mode items + History section below. Active item: white left border + brighter text. Width: 170 px.

### 8.3 Tablet (640–1024 px)

Sidebar collapses to a 30 px wide **icon rail**. Hover shows tooltip with full label. Same active treatment.

### 8.4 Mobile (< 640 px)

Native `gr.Tabs` (horizontal scroll) replaces the sidebar entirely. The swap is done with a CSS media query: `display: none` on `.ms-sidebar`, `display: flex` on `.ms-mobile-tabs`. No JS.

### 8.5 Tab body

Two-column on desktop (form 60% / output 40%), stacks vertically on tablet and mobile.

Form layer order (top to bottom, always-visible by default):

1. Style prompt (textarea, ~3 rows)
2. Lyrics (textarea, ~6 rows); the Lyrics tab replaces this with brief + structure inputs
3. Mode-specific: ref audio (Cover), seed audio (Extend), source + segment (Edit)
4. Duration slider + vocals/instrumental toggle (Generate only)
5. LoRA section (collapsed by default; chip row visible if any preset is "on")
6. Advanced accordion (collapsed by default)
7. LM-planner accordion (collapsed by default)
8. Generate button (primary; white-on-black; full-width on mobile)

### 8.6 Output panel

- Audio player with built-in waveform (Gradio 5 native)
- Retake button (↻)
- Stems grid (Demucs), only visible after Demucs runs
- Action row: ↓ mp3 · ↓ wav · `{ }` meta · ↗ share (copies a permalink with prompt+seed in URL params)
- Metadata JSON viewer (collapsible, default closed)

### 8.7 Theme tokens (`theme.py`)

```python
BG              = "#0A0A0A"
SURFACE         = "#141414"
SURFACE_STRONG  = "#000000"
BORDER          = "#1F1F1F"
BORDER_STRONG   = "#2A2A2A"
INK             = "#E5E5E5"
INK_MUTED       = "#6B6B6B"
PRIMARY         = "#FFFFFF"
ERROR           = "#E5E5E5"  # high-contrast white in Brutalist Mono; gradio error background still red-ish but our text is white
RADIUS          = "6px"
FONT_STACK      = '"Inter", -apple-system, BlinkMacSystemFont, "Segoe UI", system-ui, sans-serif'
```

CSS injected via `gr.Blocks(css=…)` covers sidebar layout, responsive media queries, LoRA chip pill, waveform tightening, accordion arrow customization, hide-Gradio-footer.

---

## 9. Data flow per generation

```
1. User clicks "Generate" button on the Generate tab.
2. app.py:on_generate(...) handler reads all gr inputs, coerces types.
3. Handler validates active LoRAs (cheap header sniff) → raises gr.Error on failure.
4. Handler calls backend.generate_with_retry(mode="generate", params={...}).
5. backend.generate_with_retry is the @spaces.GPU-decorated entrypoint.
6. Inside the GPU lifetime:
   a. _ensure_pipeline() → lazy load on first call
   b. _apply_lora_stack(params.loras) → pipe.set_adapters(names, weights)
   c. _dispatch_mode("generate", params) → calls pipe(...) with mode-specific kwargs
   d. _post_process(audio, params) → loudness norm, optionally stems
   e. _emit_meta(params, audio) → build metadata JSON, sha256s
7. Returns (audio_path, meta_dict).
8. Handler updates UI: audio player, metadata JSON viewer.
9. History entry appended (in-memory, last 10).
```

ZeroGPU abort handling wraps step 5 in a one-shot retry at 2× duration. Beyond that: `gr.Warning` with the suggestion to reduce duration or steps.
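
Step 9 of the flow, a bounded in-memory history, can be sketched with a `deque`. The entry shape and the `record` helper are assumptions for illustration; in the app the container would be held in `gr.State`.

```python
from collections import deque
from typing import Deque, Dict

HISTORY_LIMIT = 10  # "last 10" from step 9


def record(history: Deque[Dict], audio_path: str, meta: Dict) -> None:
    """Put the newest generation first; the oldest falls off automatically."""
    history.appendleft({
        "audio_path": audio_path,
        "seed": meta.get("seed"),
        "mode": meta.get("mode"),
    })


history: Deque[Dict] = deque(maxlen=HISTORY_LIMIT)
for i in range(12):
    record(history, f"/tmp/out_{i}.wav", {"seed": i, "mode": "generate"})
```

With `maxlen` set, no explicit trimming is needed: after twelve generations only the ten most recent remain, newest first.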

---

## 10. Error handling matrix

| Trigger | User-facing | Logs |
|---|---|---|
| LoRA file invalid (rank, modules, size) | `gr.Error("This LoRA isn't compatible with ACE-Step 1.5 XL SFT. …")` | full traceback to stderr |
| Audio input wrong format | `gr.Error("Audio must be wav/mp3/flac, ≤ 240 s.")` | format diagnostics |
| Cover/Extend/Edit missing required input | `gr.Error("Reference audio is required for Cover mode.")` | param dump |
| ZeroGPU abort | auto-retry once at 2× duration; if still aborts: `gr.Warning("Generation timed out. Try a shorter duration or fewer steps.")` | timing info |
| Lyrics LM cold-load fails (OOM) | `gr.Error("Couldn't load lyrics model. Free some memory and retry.")` | full traceback |
| MPS op not implemented | falls back to CPU via env var; if still crashes: `gr.Error("This ACE-Step op isn't yet supported on Apple Silicon. Generation aborted.")` | op name + diagnostics |
| Demucs separator fails on weird audio | `gr.Warning("Stem separation failed, audio still saved.")` | traceback |
| Preset LoRA download fails | `gr.Error("Couldn't download preset 'X'. Check network.")` | network log |
| Out-of-disk on cache mirror | `gr.Error("Disk full. Free space and reload.")` | mount stats |

---

## 11. Testing

### 11.1 Layers

- **L1 (no GPU, no models)**: module structure, type signatures, theme CSS asserts, LoRA-header sniff unit tests, metadata JSON shape, preset manifest schema. ~30 tests, runs in < 5 s.
- **L2 (mocked pipeline)**: each mode handler calls the backend with the right kwargs; `set_adapters` invoked with correct order/weights; lyrics LM prompt template asserted. ~25 tests, runs in < 30 s.
- **GPU smoke (`@pytest.mark.gpu`, skipped by default)**: one Generate + one Cover + one Extend + one Lyrics at minimum settings, asserts output exists and is non-zero size. ~4 tests, runs in 5–10 min on M5 Max.

### 11.2 CI

- GitHub Actions: Python 3.11, run L1 + L2 with `pytest -m "not gpu"`.
- ruff format + ruff check both pass.
- No GPU testing in CI (cost). The user runs `pytest -m gpu` locally on the M5 Max before each release tag.

### 11.3 Manual verification before merge

- Each new mode handler: at least one end-to-end on M5 Max with a real prompt + the psytrance LoRA loaded.
- LoRA upload: at least one bad-file rejection (rank mismatch) + one good-file success.
- Responsive: open on phone (Safari iOS), verify horizontal tab strip, verify generate end-to-end.

---

## 12. Deployment

### 12.1 HF Spaces

`README.md` frontmatter:

```yaml
---
title: ACE Music Studio
emoji: 🎵
colorFrom: gray
colorTo: gray
sdk: gradio
sdk_version: "5.50.0"
app_file: app.py
python_version: "3.11"
suggested_hardware: zero-a10g
hf_oauth: false
preload_from_hub:
  - ACE-Step/ACE-Step-v1.5-XL-SFT *.safetensors,config.json,scheduler/*,vae/*,tokenizer/*
  - Qwen/Qwen2.5-7B-Instruct *.safetensors,config.json,tokenizer*
  - facebook/htdemucs_ft *.th
  - ACE-Step/ACE-Step-v1-RapMachine-LoRA *.safetensors
  - ACE-Step/ACE-Step-v1-Chinese-Rap-LoRA *.safetensors
  - ACE-Step/ACE-Step-v1-Lyric2Vocal-LoRA *.safetensors
  - ACE-Step/ACE-Step-v1-Text2Samples-LoRA *.safetensors
---
```

Preload size estimate: ACE-Step XL SFT ~16 GB + Qwen 2.5 ~14 GB + htdemucs ~250 MB + 4 LoRAs ~400 MB = **~31 GB**, well under HF's 150 GB cap.

### 12.2 GitHub

- Repo: `techfreakworm/ace-music-studio` (public).
- License: MIT.
- HF Space mirror via dedicated git remote (`git push space main`).
- README badges: HF Space, GitHub stars, MIT license, Python 3.11, backend ACE-Step.

### 12.3 Local install

```bash
git clone https://github.com/techfreakworm/ace-music-studio
cd ace-music-studio
bash setup.sh                  # creates .venv (Python 3.11), installs requirements
source .venv/bin/activate
python app.py                  # http://127.0.0.1:7860
```

`setup.sh` detects Mac vs CUDA and installs the right ACE-Step branch + Qwen backend (mlx-lm on Mac).

---
|
| 475 |
+
|
| 476 |
+
## 13. Out of scope for v1
|
| 477 |
+
|
| 478 |
+
These are deferred to v2+ β do **not** implement without explicit user OK:
|
| 479 |
+
|
| 480 |
+
- Multi-prompt batch queue (generate 5 variants in a row)
|
| 481 |
+
- Persistent generation history across sessions (DB-backed)
|
| 482 |
+
- User accounts / auth
|
| 483 |
+
- Telemetry dashboard
|
| 484 |
+
- Voice cloning ("Persona" feature β RVC integration)
|
| 485 |
+
- LoRA training inside the app
|
| 486 |
+
- ControlNet-style conditioning (rhythm tracks, MIDI input)
|
| 487 |
+
- Spectrogram visualization (waveform is enough for v1)
|
| 488 |
+
- Multi-language UI strings (English only; song content can be any language)
|
| 489 |
+
- Watermarking output audio
|
| 490 |
+
- Browser-side audio editing (cut, paste, fade)
|
| 491 |
+
- Multi-tenant rate limiting
|
| 492 |
+
- Export to DAW format (stem zip is enough for v1)
|
| 493 |
+
- Visual regression tests for the Gradio UI
|
| 494 |
+
|
| 495 |
+
---
|
| 496 |
+
|
| 497 |
+
## 14. Open implementation questions (defer to writing-plans)

1. **ACE-Step package: git or PyPI?** As of 2026-05-18, the official `ace-step` PyPI package exists for v1.5 but the Apple-Silicon fork is git-only. Decision: `pip install ace-step` on CUDA, `pip install git+https://github.com/clockworksquirrel/ace-step-apple-silicon` on Mac (detected by `setup.sh`).
2. **Demucs model: `htdemucs` or `htdemucs_ft`?** `htdemucs_ft` is the fine-tuned variant with slightly better separation. Larger weight (~250 MB) but trivial in our budget. Default: `htdemucs_ft`.
3. **LoRA preset HF IDs**: placeholder paths above (`ACE-Step/ACE-Step-v1-*-LoRA`) may not match the exact HF org/repo naming when this is implemented; the plan should verify each preset's actual canonical HF path before the preload directive is finalised.
4. **Qwen 2.5 7B vs 3B for ZeroGPU comfort**: 7B is correct per the brainstorming answer. If ZeroGPU's 60 s budget is too tight for cold-load + generate, fall back to **Qwen 2.5 3B Instruct** (~6 GB) without UI changes.
5. **Edit-mode UX for segment selection**: start with two numeric inputs (`start_s`, `end_s`). v1.5 can add a waveform-clickable selector if user feedback demands it.
6. **History persistence**: v1 is in-memory only. The sidebar history list is `gr.State`-backed and wipes on page reload. Persistent history is v2.
7. **ACE-Step Extend / Repaint exact API surface**: the psytrance LoRA generation config shows the relevant kwargs (`repainting_start`, `repainting_end`, `repaint_mode`, `repaint_strength`, `chunk_mask_mode`, `repaint_latent_crossfade_frames`, `repaint_wav_crossfade_sec`). Verify the conventions for "append after end of seed audio" (e.g., does `repainting_end > audio_length` extend, or do we need a different sentinel?) before M3 ships.
8. **MLX-quant Qwen 2.5 7B availability**: confirm `mlx-community/Qwen2.5-7B-Instruct-4bit` exists and produces acceptable lyric quality. If not, use `mlx-community/Qwen2.5-3B-Instruct-4bit` as the Mac path (the model card under §6.1's table moves to 3B-on-Mac, 7B-on-CUDA).
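
For question 5, the two numeric inputs need some validation before they reach the pipeline. A minimal sketch follows, assuming a hypothetical helper `validate_segment`. It deliberately rejects `end_s` beyond the clip length, because question 7 leaves the "extend past the end" convention unverified.

```python
def validate_segment(start_s: float, end_s: float, audio_length_s: float) -> tuple[float, float]:
    """Check an Edit-mode segment against the clip length; return it as floats.

    Hypothetical helper: raises ValueError instead of guessing how ACE-Step
    treats out-of-range repaint bounds.
    """
    if audio_length_s <= 0:
        raise ValueError("audio length must be positive")
    if start_s < 0:
        raise ValueError("start_s must be >= 0")
    if end_s <= start_s:
        raise ValueError("end_s must be greater than start_s")
    if end_s > audio_length_s:
        # Whether an end past the clip means "extend" is still an open question.
        raise ValueError("end_s exceeds the audio length")
    return float(start_s), float(end_s)


print(validate_segment(5, 10, 30))
```

Failing loudly here keeps the Edit tab's behaviour predictable until the repaint conventions are confirmed; the check can be relaxed later if an out-of-range end turns out to be the supported way to extend.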

---

## 15. Sole-author rule

Per the user's permanent feedback (memory `feedback_git_authorship.md`):

- Mayank Gupta is sole author on every commit.
- **NO** `Co-Authored-By: Claude…` trailer.
- **NO** `Generated with Claude Code` footer.
- **NO** `--author=…` flag.
- This applies to commits made by any AI assistant working on this repo.

Encoded in `CLAUDE.md`, `AGENTS.md`, and `SKILLS.md` at the top of the repo so every assistant sees it on first read.
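
The two message-level rules could also be checked mechanically, for example from a commit-msg hook. The sketch below is an illustration only: `FORBIDDEN_PATTERNS` and `find_violations` are hypothetical names, no such hook is part of this spec, and the `--author=…` rule cannot be detected from the message text at all.

```python
import re

# Patterns for the two message-level rules listed above.
FORBIDDEN_PATTERNS = {
    "co-author trailer": re.compile(r"^Co-Authored-By:\s*Claude", re.IGNORECASE | re.MULTILINE),
    "generated-with footer": re.compile(r"Generated with Claude Code", re.IGNORECASE),
}


def find_violations(message: str) -> list[str]:
    """Return the names of the rules that a commit message breaks."""
    return [name for name, pattern in FORBIDDEN_PATTERNS.items() if pattern.search(message)]


sample = "docs: track spec\n\nCo-Authored-By: Claude <assistant@example.com>\n"
print(find_violations(sample))
```

A hook built on this would read the message file that git passes to it and exit non-zero when the returned list is non-empty.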

---

## 16. Implementation milestones (rough)

(Detailed sequencing belongs in the implementation plan; see `docs/superpowers/plans/`.)

| Milestone | Deliverable | Validates |
|---|---|---|
| M0: Bootstrap | `app.py:_bootstrap()` + device autodetect + Gradio Blocks skeleton + theme | App boots on M5 Max and on a Space-equivalent CPU env |
| M1: Generate mode (no LoRA) | `modes.generate` + `ace_pipeline.py` + audio player output | End-to-end "psytrance, 30 s" generation on M5 Max |
| M2: LoRA stack | `lora_stack.py` + preset chips + custom upload + active stack UI | Psytrance v2 + RapMachine stacked at 0.95 / 0.85 produce visibly different output |
| M3: Cover, Extend, Edit | Three more handlers + their tab UIs | Each mode produces a non-trivial output |
| M4: Lyrics LM | `lyrics_lm.py` + Lyrics tab + "use these" flow | Qwen 2.5 emits valid structural-tag lyrics; round-trip into Generate works |
| M5: Post-processing | Demucs + pyloudnorm + mp3 export | Stems download, normalised output, ID3-tagged MP3 |
| M6: Responsive + polish | Mobile media queries + tooltips + error UX + history sidebar | Phone Safari renders + generates end-to-end |
| M7: Deploy | Preload directive + ZeroGPU decorator + retry logic + Space mirror | Public Space serves requests at parity with local |

---

## 17. References

- ACE-Step 1.5 paper: [arXiv 2506.00045](https://arxiv.org/abs/2506.00045)
- ACE-Step 1.5 repo: [github.com/ace-step/ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)
- Apple Silicon fork: [github.com/clockworksquirrel/ace-step-apple-silicon](https://github.com/clockworksquirrel/ace-step-apple-silicon)
- ACE-Step LoRA family: [ace-step.github.io](https://ace-step.github.io/)
- Qwen 2.5: [huggingface.co/Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)
- Demucs: [github.com/facebookresearch/demucs](https://github.com/facebookresearch/demucs)
- z-image-studio (architectural precedent): `~/Projects/llm/z-image-studio/`
- Research bundle: `research/00_executive_summary.md` and siblings

@@ -0,0 +1,604 @@
| 1 |
+
<h2>Generate fully expanded · mobile · error states</h2>
|
| 2 |
+
<p class="subtitle">Last batch. Generate tab with every control surfaced. Mobile phone screens for Generate + Cover + Lyrics. Six error/edge-case states.</p>
|
| 3 |
+
|
| 4 |
+
<style>
|
| 5 |
+
.gm { background:#0A0A0A; color:#E5E5E5; border:1px solid #1F1F1F; border-radius:10px; padding:18px; font-size:12px; line-height:1.5; margin-top:14px; }
|
| 6 |
+
.gm-header { display:flex; justify-content:space-between; align-items:center; padding-bottom:10px; border-bottom:1px solid #1F1F1F; margin-bottom:14px; }
|
| 7 |
+
.gm-brand { font-size:15px; font-weight:600; }
|
| 8 |
+
.gm-cta { font-size:11px; color:#6B6B6B; }
|
| 9 |
+
.gm-cta strong { color:#E5E5E5; }
|
| 10 |
+
.gm-status { font-size:10px; color:#6B6B6B; letter-spacing:0.08em; text-transform:uppercase; }
|
| 11 |
+
.gm-row { display:flex; gap:16px; align-items:flex-start; }
|
| 12 |
+
.gm-sidebar { background:#000; padding:14px 10px; border-radius:6px; min-width:170px; }
|
| 13 |
+
.gm-side { display:block; padding:8px 10px; border-radius:4px; margin-bottom:3px; font-size:12px; color:#6B6B6B; }
|
| 14 |
+
.gm-side.active { background:#1A1A1A; color:#FFF; border-left:2px solid #FFF; padding-left:8px; }
|
| 15 |
+
.gm-side .em { margin-right:6px; }
|
| 16 |
+
.gm-main { flex:1; display:flex; gap:14px; align-items:flex-start; }
|
| 17 |
+
.gm-form { flex:1.3; background:#141414; padding:16px; border-radius:6px; }
|
| 18 |
+
.gm-output { flex:1; background:#141414; padding:16px; border-radius:6px; min-width:260px; }
|
| 19 |
+
.gm-label { font-size:10px; text-transform:uppercase; letter-spacing:0.08em; color:#6B6B6B; margin-bottom:6px; display:flex; justify-content:space-between; align-items:center; }
|
| 20 |
+
.gm-label .hint { color:#5A5048; font-size:9px; text-transform:none; letter-spacing:normal; font-weight:400; }
|
| 21 |
+
.gm-input { background:#000; border:1px solid #2A2A2A; padding:8px 10px; border-radius:4px; color:#E5E5E5; margin-bottom:12px; font-size:11px; }
|
| 22 |
+
.gm-textarea { min-height:46px; }
|
| 23 |
+
.gm-grid2 { display:grid; grid-template-columns:1fr 1fr; gap:12px; margin-bottom:12px; }
|
| 24 |
+
.gm-grid3 { display:grid; grid-template-columns:1fr 1fr 1fr; gap:10px; margin-bottom:12px; }
|
| 25 |
+
.gm-grid4 { display:grid; grid-template-columns:1fr 1fr 1fr 1fr; gap:8px; margin-bottom:12px; }
|
| 26 |
+
.gm-slider-row { display:flex; align-items:center; gap:10px; padding:6px 8px; background:#000; border:1px solid #2A2A2A; border-radius:4px; margin-bottom:8px; font-size:11px; }
|
| 27 |
+
.gm-slider-row .name { color:#6B6B6B; font-size:10px; min-width:130px; }
|
| 28 |
+
.gm-slider { flex:1; height:3px; background:#2A2A2A; border-radius:2px; position:relative; }
|
| 29 |
+
.gm-slider::after { content:""; position:absolute; top:-4px; width:10px; height:10px; background:#FFF; border-radius:50%; }
|
| 30 |
+
.gm-slider.p5::after { left:5%; }
|
| 31 |
+
.gm-slider.p10::after { left:10%; }
|
| 32 |
+
.gm-slider.p15::after { left:15%; }
|
| 33 |
+
.gm-slider.p20::after { left:20%; }
|
| 34 |
+
.gm-slider.p25::after { left:25%; }
|
| 35 |
+
.gm-slider.p33::after { left:33%; }
|
| 36 |
+
.gm-slider.p40::after { left:40%; }
|
| 37 |
+
.gm-slider.p50::after { left:50%; }
|
| 38 |
+
.gm-slider.p60::after { left:60%; }
|
| 39 |
+
.gm-slider.p65::after { left:65%; }
|
| 40 |
+
.gm-slider.p70::after { left:70%; }
|
| 41 |
+
.gm-slider.p85::after { left:85%; }
|
| 42 |
+
.gm-slider.p90::after { left:90%; }
|
| 43 |
+
.gm-slider.p95::after { left:95%; }
|
| 44 |
+
.gm-slider-row .val { color:#FFF; font-family:monospace; font-size:11px; min-width:42px; text-align:right; }
|
| 45 |
+
.gm-toggle { display:flex; align-items:center; gap:8px; padding:6px 10px; background:#000; border:1px solid #2A2A2A; border-radius:4px; margin-bottom:8px; font-size:11px; cursor:pointer; }
|
| 46 |
+
.gm-toggle .box { width:14px; height:14px; border:1px solid #2A2A2A; border-radius:3px; display:inline-flex; align-items:center; justify-content:center; font-size:9px; }
|
| 47 |
+
.gm-toggle.on { color:#FFF; border-color:#FFF; }
|
| 48 |
+
.gm-toggle.on .box { background:#FFF; color:#0A0A0A; border-color:#FFF; }
|
| 49 |
+
.gm-pills { display:flex; gap:0; background:#000; border:1px solid #2A2A2A; border-radius:4px; padding:2px; margin-bottom:12px; }
|
| 50 |
+
.gm-pill { flex:1; text-align:center; padding:6px 10px; font-size:11px; color:#6B6B6B; border-radius:3px; cursor:pointer; }
|
| 51 |
+
.gm-pill.on { background:#FFF; color:#0A0A0A; }
|
| 52 |
+
.gm-select { background:#000; border:1px solid #2A2A2A; padding:8px 10px; border-radius:4px; color:#E5E5E5; font-size:11px; display:flex; justify-content:space-between; align-items:center; margin-bottom:8px; }
|
| 53 |
+
.gm-select .arrow { color:#6B6B6B; }
|
| 54 |
+
.gm-section { border:1px solid #2A2A2A; border-radius:4px; padding:14px; margin-top:14px; background:#0F0F0F; }
|
| 55 |
+
.gm-section-h { display:flex; justify-content:space-between; align-items:center; margin-bottom:12px; font-size:11px; font-weight:600; }
|
| 56 |
+
.gm-section-h .arrow { color:#FFF; }
|
| 57 |
+
.gm-section-h .meta { color:#6B6B6B; font-weight:400; font-size:10px; }
|
| 58 |
+
.gm-chip { display:inline-block; padding:5px 10px; border-radius:14px; font-size:10px; margin-right:5px; margin-bottom:5px; background:#000; border:1px solid #2A2A2A; color:#6B6B6B; cursor:pointer; }
|
| 59 |
+
.gm-chip.on { border-color:#FFF; color:#FFF; }
|
| 60 |
+
.gm-chip.upload { border-style:dashed; color:#FFF; }
|
| 61 |
+
.gm-lora-row { display:flex; align-items:center; gap:10px; padding:8px 10px; background:#000; border:1px solid #2A2A2A; border-radius:4px; margin-bottom:6px; font-size:11px; }
|
| 62 |
+
.gm-lora-name { flex:1; }
|
| 63 |
+
.gm-lora-name small { color:#6B6B6B; font-weight:400; margin-left:4px; }
|
| 64 |
+
.gm-x { color:#6B6B6B; cursor:pointer; padding:0 4px; }
|
| 65 |
+
.gm-btn { background:#FFF; color:#0A0A0A; padding:12px 18px; border-radius:4px; font-weight:600; display:block; font-size:13px; text-align:center; cursor:pointer; margin-top:16px; }
|
| 66 |
+
.gm-waveform { height:60px; background:#000; border:1px solid #2A2A2A; border-radius:4px; display:flex; align-items:center; justify-content:center; gap:2px; padding:8px; margin-bottom:10px; }
|
| 67 |
+
.gm-bar { width:2px; background:#E5E5E5; }
|
| 68 |
+
.gm-player-controls { display:flex; align-items:center; gap:10px; color:#6B6B6B; font-size:10px; margin-bottom:14px; }
|
| 69 |
+
.gm-play { width:28px; height:28px; border-radius:50%; background:#FFF; color:#0A0A0A; display:flex; align-items:center; justify-content:center; font-size:11px; }
|
| 70 |
+
.gm-meta-block { background:#000; border:1px solid #2A2A2A; border-radius:4px; padding:8px 10px; font-size:9px; color:#6B6B6B; font-family:monospace; line-height:1.6; max-height:160px; overflow:hidden; margin-top:8px; }
|
| 71 |
+
.gm-actions { display:flex; flex-wrap:wrap; gap:6px; margin-bottom:10px; }
|
| 72 |
+
.gm-secondary { border:1px solid #2A2A2A; color:#E5E5E5; padding:6px 12px; border-radius:4px; font-size:10px; cursor:pointer; }
|
| 73 |
+
.gm-stems { display:grid; grid-template-columns:1fr 1fr; gap:6px; margin-bottom:10px; }
|
| 74 |
+
.gm-stem { background:#000; border:1px solid #2A2A2A; border-radius:4px; padding:6px 10px; font-size:10px; display:flex; justify-content:space-between; align-items:center; }
|
| 75 |
+
.gm-stem .dl { color:#FFF; cursor:pointer; }
|
| 76 |
+
</style>
|
| 77 |
+
|
| 78 |
+
|
| 79 |
+
<h3 style="margin-top:14px">🎵 Generate: fully expanded · psytrance preset stacked with custom LoRA</h3>
|
| 80 |
+
|
| 81 |
+
<div class="gm">
|
| 82 |
+
<div class="gm-header">
<div>
<div class="gm-brand">ACE Music Studio<span style="color:#FFF">.</span></div>
<div class="gm-cta" style="margin-top:2px">Built with <span style="color:#FFF">♥</span>. <strong>Drop a like</strong> · Follow <strong>@techfreakworm</strong> for what's next.</div>
</div>
<div class="gm-status">ready · MPS · M5 Max</div>
</div>
|
| 89 |
+
|
| 90 |
+
<div class="gm-row">
|
| 91 |
+
<div class="gm-sidebar">
<div class="gm-side active"><span class="em">🎵</span>Generate</div>
<div class="gm-side"><span class="em">🎤</span>Cover</div>
<div class="gm-side"><span class="em">⏩</span>Extend</div>
<div class="gm-side"><span class="em">✂️</span>Edit</div>
<div class="gm-side"><span class="em">✍️</span>Lyrics</div>
<div style="border-top:1px solid #1F1F1F; margin:14px 0 10px; padding-top:10px; font-size:9px; color:#6B6B6B; text-transform:uppercase; letter-spacing:0.1em;">History · session</div>
<div style="font-size:10px; color:#6B6B6B; padding:4px 8px;">▶ psytrance · just now</div>
<div style="font-size:10px; color:#6B6B6B; padding:4px 8px;">▶ ambient_v4 · 2m</div>
<div style="font-size:10px; color:#6B6B6B; padding:4px 8px;">▶ chinese_rap · 7m</div>
<div style="font-size:10px; color:#6B6B6B; padding:4px 8px;">▶ lofi_vocal · 14m</div>
</div>
|
| 103 |
+
|
| 104 |
+
<div class="gm-main">
|
| 105 |
+
<div class="gm-form">
|
| 106 |
+
|
| 107 |
+
<div class="gm-label">1 · Style prompt <span class="hint">describe the song · genre, instruments, mood</span></div>
|
| 108 |
+
<div class="gm-input">psytrance, rolling triplet bassline, acid squelch, metallic leads, atmospheric pads, high quality</div>
|
| 109 |
+
|
| 110 |
+
<div class="gm-label">2 · Lyrics <span class="hint">use [verse] [chorus] [bridge] tags · → open Lyrics tab to draft with Qwen 2.5</span></div>
|
| 111 |
+
<div class="gm-input gm-textarea" style="min-height:64px">[intro - atmospheric pads & ambient synth]<br><br>[verse 1] six in the morning, the sun's still pretending<br>kick drum carries what the night was sending<br>shoes off, eyes closed, the city's still bending<br><br>[chorus] we let go, we let go, we let go</div>
|
| 112 |
+
|
| 113 |
+
<div class="gm-grid2">
|
| 114 |
+
<div>
|
| 115 |
+
<div class="gm-label">Duration <span class="hint">5 to 240 s</span></div>
|
| 116 |
+
<div class="gm-slider-row"><span class="name">seconds</span><span class="gm-slider p15"></span><span class="val">30</span></div>
|
| 117 |
+
</div>
|
| 118 |
+
<div>
|
| 119 |
+
<div class="gm-label">Vocal mode</div>
|
| 120 |
+
<div class="gm-pills">
|
| 121 |
+
<div class="gm-pill on">With vocals</div>
|
| 122 |
+
<div class="gm-pill">Instrumental</div>
|
| 123 |
+
</div>
|
| 124 |
+
</div>
|
| 125 |
+
</div>
|
| 126 |
+
|
| 127 |
+
<!-- LoRA section, expanded -->
<div class="gm-section">
<div class="gm-section-h">
<span>LoRA stack <span class="meta">· 2 active · order matters</span></span>
<span class="arrow">▾</span>
</div>

<div class="gm-label">Bundled presets <span class="hint">click to toggle</span></div>
<div style="margin-bottom:12px;">
<span class="gm-chip">RapMachine</span>
<span class="gm-chip">Chinese Rap</span>
<span class="gm-chip on">Lyric2Vocal</span>
<span class="gm-chip">Text2Samples</span>
</div>

<div class="gm-label">Active stack <span class="hint">↑↓ to reorder · × to remove</span></div>
<div class="gm-lora-row">
<span class="gm-lora-name">Lyric2Vocal <small>· preset · 28 MB</small></span>
<span class="gm-slider p65" style="width:100px"></span>
<span class="val" style="color:#FFF; font-family:monospace; font-size:11px;">0.65</span>
<span class="gm-x">×</span>
</div>
<div class="gm-lora-row">
<span class="gm-lora-name">psytrance_v2 <small>· custom · 47 MB · rank 64 · sha 0c94…</small></span>
<span class="gm-slider p95" style="width:100px"></span>
<span class="val" style="color:#FFF; font-family:monospace; font-size:11px;">0.95</span>
<span class="gm-x">×</span>
</div>

<div style="margin-top:10px;">
<span class="gm-chip upload">↑ drop .safetensors here or click</span>
</div>
</div>
|
| 160 |
+
|
| 161 |
+
<!-- Advanced section, expanded -->
|
| 162 |
+
<div class="gm-section">
|
| 163 |
+
<div class="gm-section-h">
<span>Advanced <span class="meta">· generation parameters</span></span>
<span class="arrow">▾</span>
</div>
|
| 167 |
+
|
| 168 |
+
<div class="gm-grid3">
|
| 169 |
+
<div><div class="gm-label">BPM</div><div class="gm-input" style="margin-bottom:0">135</div></div>
|
| 170 |
+
<div><div class="gm-label">Key / scale</div><div class="gm-input" style="margin-bottom:0">auto</div></div>
|
| 171 |
+
<div><div class="gm-label">Time signature</div><div class="gm-input" style="margin-bottom:0">4 / 4</div></div>
|
| 172 |
+
</div>
|
| 173 |
+
|
| 174 |
+
<div class="gm-grid2">
<div><div class="gm-label">Sampler</div><div class="gm-select">heun <span class="arrow">▾</span></div></div>
<div><div class="gm-label">Vocal language</div><div class="gm-select">auto <span class="arrow">▾</span></div></div>
</div>
|
| 178 |
+
|
| 179 |
+
<div class="gm-slider-row"><span class="name">Inference steps</span><span class="gm-slider p25"></span><span class="val">50</span></div>
|
| 180 |
+
<div class="gm-slider-row"><span class="name">CFG scale</span><span class="gm-slider p40"></span><span class="val">5.0</span></div>
|
| 181 |
+
<div class="gm-slider-row"><span class="name">Shift</span><span class="gm-slider p33"></span><span class="val">3</span></div>
|
| 182 |
+
<div class="gm-slider-row"><span class="name">CFG interval start</span><span class="gm-slider p5"></span><span class="val">0.0</span></div>
|
| 183 |
+
<div class="gm-slider-row"><span class="name">CFG interval end</span><span class="gm-slider p95"></span><span class="val">1.0</span></div>
|
| 184 |
+
|
| 185 |
+
<div class="gm-label" style="margin-top:8px">Negative prompt <span class="hint">things to avoid</span></div>
|
| 186 |
+
<div class="gm-input gm-textarea" style="font-size:10px">bitcrushed, aliasing, quantizing noise, digital clipping, glitchy, mp3 artifacts, jazz, funk, pop, acoustic, lo-fi, orchestral, dubstep, vocal hooks, electric guitar, slow tempo, jazz chords, blues scale</div>
|
| 187 |
+
|
| 188 |
+
<div class="gm-grid2">
|
| 189 |
+
<div><div class="gm-label">Audio format</div><div class="gm-pills"><div class="gm-pill on">mp3 320</div><div class="gm-pill">wav 44.1</div></div></div>
|
| 190 |
+
<div><div class="gm-label">Loudness</div><div class="gm-toggle on"><span class="box">✓</span> -14 LUFS</div></div>
|
| 191 |
+
</div>
|
| 192 |
+
|
| 193 |
+
<div class="gm-grid2">
|
| 194 |
+
<div><div class="gm-label">Fade in</div><div class="gm-slider-row"><span class="name">seconds</span><span class="gm-slider p5"></span><span class="val">0.0</span></div></div>
|
| 195 |
+
<div><div class="gm-label">Fade out</div><div class="gm-slider-row"><span class="name">seconds</span><span class="gm-slider p5"></span><span class="val">0.0</span></div></div>
|
| 196 |
+
</div>
|
| 197 |
+
|
| 198 |
+
<div class="gm-grid2">
|
| 199 |
+
<div><div class="gm-label">Latent shift</div><div class="gm-input" style="margin-bottom:0">0</div></div>
|
| 200 |
+
<div><div class="gm-label">Latent rescale</div><div class="gm-input" style="margin-bottom:0">1</div></div>
|
| 201 |
+
</div>
|
| 202 |
+
|
| 203 |
+
<div class="gm-grid2">
|
| 204 |
+
<div><div class="gm-label">Seed</div><div class="gm-input" style="margin-bottom:0; font-family:monospace">1297183202</div></div>
|
| 205 |
+
<div><div class="gm-label"> </div><div class="gm-toggle"><span class="box"></span> Lock seed</div></div>
|
| 206 |
+
</div>
|
| 207 |
+
</div>
|
| 208 |
+
|
| 209 |
+
<!-- LM planner section, expanded -->
|
| 210 |
+
<div class="gm-section">
|
| 211 |
+
<div class="gm-section-h">
<span>LM planner · Qwen3 thinking <span class="meta">· chain-of-thought structure</span></span>
<span class="arrow">▾</span>
</div>
|
| 215 |
+
|
| 216 |
+
<div class="gm-toggle on"><span class="box">✓</span> Thinking enabled <span style="color:#6B6B6B; font-size:9px; margin-left:auto">+ slower but better structure</span></div>
<div class="gm-toggle on"><span class="box">✓</span> Constrained decoding</div>
|
| 218 |
+
|
| 219 |
+
<div class="gm-grid4" style="margin-top:8px">
|
| 220 |
+
<div><div class="gm-label">Temperature</div><div class="gm-input" style="margin-bottom:0">0.85</div></div>
|
| 221 |
+
<div><div class="gm-label">Top-k</div><div class="gm-input" style="margin-bottom:0">0</div></div>
|
| 222 |
+
<div><div class="gm-label">Top-p</div><div class="gm-input" style="margin-bottom:0">0.90</div></div>
|
| 223 |
+
<div><div class="gm-label">LM CFG</div><div class="gm-input" style="margin-bottom:0">2</div></div>
|
| 224 |
+
</div>
|
| 225 |
+
|
| 226 |
+
<div class="gm-label">CoT pipeline toggles <span class="hint">which fields the LM rewrites pre-generation</span></div>
|
| 227 |
+
<div class="gm-grid4">
|
| 228 |
+
<div class="gm-toggle"><span class="box"></span> metas</div>
|
| 229 |
+
<div class="gm-toggle"><span class="box"></span> caption</div>
|
| 230 |
+
<div class="gm-toggle"><span class="box"></span> lyrics</div>
|
| 231 |
+
<div class="gm-toggle"><span class="box"></span> language</div>
|
| 232 |
+
</div>
|
| 233 |
+
|
| 234 |
+
<div class="gm-label">LM negative prompt</div>
|
| 235 |
+
<div class="gm-input" style="font-size:10px">happy chords, major scale, uplifting melody</div>
|
| 236 |
+
|
| 237 |
+
<div class="gm-label">CoT override fields <span class="hint">if a CoT toggle is on, the LM rewrites these</span></div>
|
| 238 |
+
<div class="gm-grid2">
<div><div class="gm-label">cot_bpm</div><div class="gm-input" style="margin-bottom:0; opacity:0.5">(blank: use main BPM)</div></div>
<div><div class="gm-label">cot_keyscale</div><div class="gm-input" style="margin-bottom:0; opacity:0.5">(blank: use main key)</div></div>
</div>
|
| 242 |
+
</div>
|
| 243 |
+
|
| 244 |
+
<!-- DCW section, expanded -->
|
| 245 |
+
<div class="gm-section">
|
| 246 |
+
<div class="gm-section-h">
<span>DCW · dynamic CFG warping <span class="meta">· wavelet-based</span></span>
<span class="arrow">▾</span>
</div>
|
| 250 |
+
|
| 251 |
+
<div class="gm-toggle on"><span class="box">✓</span> DCW enabled</div>
|
| 252 |
+
|
| 253 |
+
<div class="gm-grid3">
|
| 254 |
+
<div><div class="gm-label">Mode</div><div class="gm-select">double <span class="arrow">▾</span></div></div>
<div><div class="gm-label">Wavelet</div><div class="gm-select">haar <span class="arrow">▾</span></div></div>
|
| 256 |
+
<div><div class="gm-label"> </div><div style="font-size:9px; color:#6B6B6B; padding-top:8px;">leave defaults if unsure</div></div>
|
| 257 |
+
</div>
|
| 258 |
+
|
| 259 |
+
<div class="gm-slider-row"><span class="name">DCW scaler</span><span class="gm-slider p5"></span><span class="val">0.02</span></div>
|
| 260 |
+
<div class="gm-slider-row"><span class="name">High scaler</span><span class="gm-slider p10"></span><span class="val">0.06</span></div>
|
| 261 |
+
</div>
|
| 262 |
+
|
| 263 |
+
<div class="gm-btn">▶ Generate · est. ~30 s on M5 Max</div>
|
| 264 |
+
</div>
|
| 265 |
+
|
| 266 |
+
<!-- Output panel -->
|
| 267 |
+
<div class="gm-output">
|
| 268 |
+
<div class="gm-label" style="margin-bottom:10px">Output · psytrance · 30 s · seed 1297183202</div>
|
| 269 |
+
|
| 270 |
+
<div class="gm-waveform">
|
| 271 |
+
<div class="gm-bar" style="height:18%"></div><div class="gm-bar" style="height:32%"></div><div class="gm-bar" style="height:54%"></div><div class="gm-bar" style="height:72%"></div><div class="gm-bar" style="height:88%"></div><div class="gm-bar" style="height:62%"></div><div class="gm-bar" style="height:42%"></div><div class="gm-bar" style="height:78%"></div><div class="gm-bar" style="height:92%"></div><div class="gm-bar" style="height:66%"></div><div class="gm-bar" style="height:48%"></div><div class="gm-bar" style="height:30%"></div><div class="gm-bar" style="height:58%"></div><div class="gm-bar" style="height:80%"></div><div class="gm-bar" style="height:70%"></div><div class="gm-bar" style="height:44%"></div><div class="gm-bar" style="height:24%"></div><div class="gm-bar" style="height:50%"></div>
|
| 272 |
+
</div>
|
| 273 |
+
|
| 274 |
+
<div class="gm-player-controls">
<span class="gm-play">▶</span>
<span>0:00 / 0:30</span>
<span style="margin-left:auto; cursor:pointer; color:#FFF">↻ retake · new seed</span>
</div>
|
| 279 |
+
|
| 280 |
+
<div class="gm-label">Stems · Demucs htdemucs_ft</div>
<div class="gm-stems">
<div class="gm-stem"><span>vocals · 1.8 MB</span><span class="dl">↓</span></div>
<div class="gm-stem"><span>drums · 1.6 MB</span><span class="dl">↓</span></div>
<div class="gm-stem"><span>bass · 1.4 MB</span><span class="dl">↓</span></div>
<div class="gm-stem"><span>other · 1.7 MB</span><span class="dl">↓</span></div>
</div>
|
| 287 |
+
|
| 288 |
+
<div class="gm-label">Export</div>
|
| 289 |
+
<div class="gm-actions">
<span class="gm-secondary">↓ mp3 · 1.2 MB</span>
<span class="gm-secondary">↓ wav · 5.3 MB</span>
<span class="gm-secondary">↓ stems zip</span>
<span class="gm-secondary">{ } meta</span>
<span class="gm-secondary">↗ share</span>
</div>
|
| 296 |
+
|
| 297 |
+
<div class="gm-label" style="margin-top:14px">Metadata</div>
|
| 298 |
+
<div class="gm-meta-block">
|
| 299 |
+
{<br>
|
| 300 |
+
"mode": "generate",<br>
|
| 301 |
+
"prompt": "psytrance, rolling triplet bassline...",<br>
|
| 302 |
+
"lyrics_first_line": "[intro - atmospheric pads...",<br>
|
| 303 |
+
"duration_s": 30, "instrumental": false,<br>
|
| 304 |
+
"bpm": 135, "key": "auto", "time_sig": "4/4",<br>
|
| 305 |
+
"sampler": "heun", "steps": 50, "cfg": 5.0, "shift": 3,<br>
|
| 306 |
+
"cfg_interval": [0.0, 1.0],<br>
|
| 307 |
+
"lm": {"thinking": true, "temp": 0.85, "top_p": 0.9, "cfg": 2,<br>
|
| 308 |
+
"cot": {"metas":false,"caption":false,"lyrics":false,"language":false}},<br>
|
| 309 |
+
"dcw": {"enabled":true,"mode":"double","scaler":0.02,"high_scaler":0.06,"wavelet":"haar"},<br>
|
| 310 |
+
"loras": [<br>
|
| 311 |
+
{"name":"Lyric2Vocal","scale":0.65,"sha256":"7e1f..."},<br>
|
| 312 |
+
{"name":"psytrance_v2","scale":0.95,"sha256":"0c94..."}<br>
|
| 313 |
+
],<br>
|
| 314 |
+
"seed": 1297183202,<br>
|
| 315 |
+
"output_sha256": "f33a..."<br>
|
| 316 |
+
}
|
| 317 |
+
</div>
|
| 318 |
+
</div>
|
| 319 |
+
</div>
|
| 320 |
+
</div>
|
| 321 |
+
</div>
|
| 322 |
+
|
| 323 |
+
|
| 324 |
+
<h3 style="margin-top:30px">📱 Mobile: phone screens</h3>
|
| 325 |
+
<p class="subtitle">A horizontally scrolling tab strip at the top replaces the sidebar. Output stacks below the form. Same Brutalist Mono styling.</p>
|
| 326 |
+
|
| 327 |
+
<style>
|
| 328 |
+
.mob-frame { display:flex; gap:24px; flex-wrap:wrap; justify-content:center; align-items:flex-start; }
|
| 329 |
+
.mob-phone { background:#222; border-radius:18px; padding:8px; }
|
| 330 |
+
.mob-screen { width:200px; background:#0A0A0A; color:#E5E5E5; border-radius:12px; padding:10px; }
|
| 331 |
+
.mob-header { display:flex; justify-content:space-between; align-items:center; padding-bottom:6px; border-bottom:1px solid #1F1F1F; margin-bottom:8px; }
|
| 332 |
+
.mob-brand { font-size:11px; font-weight:600; }
|
| 333 |
+
.mob-cta { font-size:8px; color:#6B6B6B; }
|
| 334 |
+
.mob-tabs { display:flex; gap:6px; overflow-x:auto; padding:4px 0; margin-bottom:8px; border-bottom:1px solid #1F1F1F; }
|
| 335 |
+
.mob-tab { font-size:9px; color:#6B6B6B; white-space:nowrap; padding:4px 6px; }
|
| 336 |
+
.mob-tab.active { color:#FFF; border-bottom:1px solid #FFF; }
|
| 337 |
+
.mob-form { background:#141414; padding:10px; border-radius:5px; }
|
| 338 |
+
.mob-label { font-size:8px; text-transform:uppercase; letter-spacing:0.06em; color:#6B6B6B; margin-bottom:4px; }
|
| 339 |
+
.mob-input { background:#000; border:1px solid #2A2A2A; padding:5px 8px; border-radius:3px; font-size:9px; margin-bottom:8px; }
|
| 340 |
+
.mob-textarea { min-height:30px; }
|
| 341 |
+
.mob-chips { margin-bottom:8px; }
|
| 342 |
+
.mob-chip { display:inline-block; padding:2px 7px; border-radius:9px; font-size:8px; margin-right:3px; margin-bottom:3px; background:#000; border:1px solid #2A2A2A; color:#6B6B6B; }
|
| 343 |
+
.mob-chip.on { border-color:#FFF; color:#FFF; }
|
| 344 |
+
.mob-accordion { background:#000; border:1px solid #2A2A2A; border-radius:3px; padding:5px 8px; margin-bottom:6px; font-size:9px; color:#6B6B6B; display:flex; justify-content:space-between; }
|
| 345 |
+
.mob-btn { background:#FFF; color:#0A0A0A; padding:6px 10px; border-radius:3px; font-weight:600; font-size:9px; text-align:center; }
|
| 346 |
+
.mob-output { background:#141414; padding:10px; border-radius:5px; margin-top:8px; }
|
| 347 |
+
.mob-wave { height:30px; background:#000; border:1px solid #2A2A2A; border-radius:3px; display:flex; align-items:center; gap:1px; padding:4px; margin-bottom:6px; }
|
| 348 |
+
.mob-wave-bar { width:1px; background:#FFF; }
|
| 349 |
+
.mob-controls { display:flex; align-items:center; gap:6px; font-size:8px; color:#6B6B6B; margin-bottom:8px; }
|
| 350 |
+
.mob-play { width:20px; height:20px; border-radius:50%; background:#FFF; color:#0A0A0A; display:flex; align-items:center; justify-content:center; font-size:9px; }
|
| 351 |
+
.mob-export { display:flex; flex-wrap:wrap; gap:3px; }
|
| 352 |
+
.mob-secondary { border:1px solid #2A2A2A; padding:3px 8px; border-radius:3px; font-size:8px; color:#E5E5E5; }
|
| 353 |
+
.mob-dropzone { background:#000; border:2px solid #FFF; border-radius:4px; padding:6px 8px; margin-bottom:8px; font-size:8px; }
|
| 354 |
+
.mob-caption { text-align:center; color:#6B6B6B; font-size:10px; margin-top:8px; }
|
| 355 |
+
/* Mobile slider: bounded inside its container, no overflow */
|
| 356 |
+
.mob-slider { height:3px; background:#2A2A2A; border-radius:2px; position:relative; margin:6px 0 10px; box-sizing:border-box; }
|
| 357 |
+
.mob-slider::after { content:""; position:absolute; top:-3px; width:9px; height:9px; background:#FFF; border-radius:50%; transform:translateX(-50%); }
|
| 358 |
+
.mob-slider.p15::after { left:15%; }
|
| 359 |
+
.mob-slider.p93::after { left:93%; }
|
| 360 |
+
</style>
|
| 361 |
+
|
| 362 |
+
<div class="mob-frame">
|
| 363 |
+
|
| 364 |
+
<!-- Phone 1: Generate -->
|
| 365 |
+
<div>
|
| 366 |
+
<div class="mob-phone">
|
| 367 |
+
<div class="mob-screen">
|
| 368 |
+
<div class="mob-header">
|
| 369 |
+
<div class="mob-brand">ACE Music.</div>
|
| 370 |
+
<div class="mob-cta">♥ @tfw</div>
|
| 371 |
+
</div>
|
| 372 |
+
<div class="mob-tabs">
|
| 373 |
+
<div class="mob-tab active">🎵 Generate</div>
|
| 374 |
+
<div class="mob-tab">🎤 Cover</div>
|
| 375 |
+
<div class="mob-tab">⏩</div>
|
| 376 |
+
<div class="mob-tab">✏️</div>
|
| 377 |
+
<div class="mob-tab">✍️</div>
|
| 378 |
+
</div>
|
| 379 |
+
<div class="mob-form">
|
| 380 |
+
<div class="mob-label">Style</div>
|
| 381 |
+
<div class="mob-input">psytrance, acid leads</div>
|
| 382 |
+
<div class="mob-label">Lyrics</div>
|
| 383 |
+
<div class="mob-input mob-textarea">[verse] six in the morning...</div>
|
| 384 |
+
<div class="mob-label">Duration · 30 s</div>
|
| 385 |
+
<div class="mob-slider p15"></div>
|
| 386 |
+
<div class="mob-chips">
|
| 387 |
+
<span class="mob-chip on">psytrance_v2</span>
|
| 388 |
+
<span class="mob-chip">+ upload</span>
|
| 389 |
+
</div>
|
| 390 |
+
<div class="mob-accordion">▸ Advanced · BPM 135, sampler heun</div>
|
| 391 |
+
<div class="mob-accordion">▸ LM planner</div>
|
| 392 |
+
<div class="mob-accordion">▸ DCW</div>
|
| 393 |
+
<div class="mob-btn" style="margin-top:6px">▶ Generate</div>
|
| 394 |
+
</div>
|
| 395 |
+
<div class="mob-output">
|
| 396 |
+
<div class="mob-wave">
|
| 397 |
+
<div class="mob-wave-bar" style="height:30%"></div><div class="mob-wave-bar" style="height:60%"></div><div class="mob-wave-bar" style="height:80%"></div><div class="mob-wave-bar" style="height:50%"></div><div class="mob-wave-bar" style="height:70%"></div><div class="mob-wave-bar" style="height:90%"></div><div class="mob-wave-bar" style="height:40%"></div><div class="mob-wave-bar" style="height:65%"></div><div class="mob-wave-bar" style="height:80%"></div><div class="mob-wave-bar" style="height:55%"></div><div class="mob-wave-bar" style="height:75%"></div><div class="mob-wave-bar" style="height:45%"></div><div class="mob-wave-bar" style="height:35%"></div><div class="mob-wave-bar" style="height:60%"></div><div class="mob-wave-bar" style="height:25%"></div>
|
| 398 |
+
</div>
|
| 399 |
+
<div class="mob-controls">
|
| 400 |
+
<span class="mob-play">▶</span>
|
| 401 |
+
<span>0:00 / 0:30</span>
|
| 402 |
+
<span style="margin-left:auto; color:#FFF">↻</span>
|
| 403 |
+
</div>
|
| 404 |
+
<div class="mob-export">
|
| 405 |
+
<span class="mob-secondary">↓ mp3</span>
|
| 406 |
+
<span class="mob-secondary">↓ wav</span>
|
| 407 |
+
<span class="mob-secondary">stems</span>
|
| 408 |
+
</div>
|
| 409 |
+
</div>
|
| 410 |
+
</div>
|
| 411 |
+
</div>
|
| 412 |
+
<div class="mob-caption">Generate · 360 × 720 mobile</div>
|
| 413 |
+
</div>
|
| 414 |
+
|
| 415 |
+
<!-- Phone 2: Cover with file picked -->
|
| 416 |
+
<div>
|
| 417 |
+
<div class="mob-phone">
|
| 418 |
+
<div class="mob-screen">
|
| 419 |
+
<div class="mob-header">
|
| 420 |
+
<div class="mob-brand">ACE Music.</div>
|
| 421 |
+
<div class="mob-cta">♥ @tfw</div>
|
| 422 |
+
</div>
|
| 423 |
+
<div class="mob-tabs">
|
| 424 |
+
<div class="mob-tab">🎵</div>
|
| 425 |
+
<div class="mob-tab active">🎤 Cover</div>
|
| 426 |
+
<div class="mob-tab">⏩</div>
|
| 427 |
+
<div class="mob-tab">✏️</div>
|
| 428 |
+
<div class="mob-tab">✍️</div>
|
| 429 |
+
</div>
|
| 430 |
+
<div class="mob-form">
|
| 431 |
+
<div class="mob-label">1 · Reference</div>
|
| 432 |
+
<div class="mob-dropzone">
|
| 433 |
+
<strong style="color:#FFF">✓ ref_psy.wav</strong><br>
|
| 434 |
+
<span style="color:#6B6B6B">44.1k · 28 s · 2.1 MB</span>
|
| 435 |
+
</div>
|
| 436 |
+
<div class="mob-label">2 · New prompt</div>
|
| 437 |
+
<div class="mob-input">faster, more aggressive</div>
|
| 438 |
+
<div class="mob-label">3 · New lyrics</div>
|
| 439 |
+
<div class="mob-input mob-textarea">[verse] new lyrics over ref...</div>
|
| 440 |
+
<div class="mob-label">Cover strength · 0.93</div>
|
| 441 |
+
<div class="mob-slider p93"></div>
|
| 442 |
+
<div class="mob-chips">
|
| 443 |
+
<span class="mob-chip on">RapMachine</span>
|
| 444 |
+
</div>
|
| 445 |
+
<div class="mob-accordion">▸ Advanced</div>
|
| 446 |
+
<div class="mob-accordion">▸ LM planner</div>
|
| 447 |
+
<div class="mob-btn" style="margin-top:6px">▶ Cover</div>
|
| 448 |
+
</div>
|
| 449 |
+
</div>
|
| 450 |
+
</div>
|
| 451 |
+
<div class="mob-caption">Cover · with ref audio loaded</div>
|
| 452 |
+
</div>
|
| 453 |
+
|
| 454 |
+
<!-- Phone 3: Lyrics output -->
|
| 455 |
+
<div>
|
| 456 |
+
<div class="mob-phone">
|
| 457 |
+
<div class="mob-screen">
|
| 458 |
+
<div class="mob-header">
|
| 459 |
+
<div class="mob-brand">ACE Music.</div>
|
| 460 |
+
<div class="mob-cta">♥ @tfw</div>
|
| 461 |
+
</div>
|
| 462 |
+
<div class="mob-tabs">
|
| 463 |
+
<div class="mob-tab">🎵</div>
|
| 464 |
+
<div class="mob-tab">🎤</div>
|
| 465 |
+
<div class="mob-tab">⏩</div>
|
| 466 |
+
<div class="mob-tab">✏️</div>
|
| 467 |
+
<div class="mob-tab active">✍️ Lyrics</div>
|
| 468 |
+
</div>
|
| 469 |
+
<div class="mob-form">
|
| 470 |
+
<div class="mob-label">Brief</div>
|
| 471 |
+
<div class="mob-input mob-textarea">psytrance anthem about sunrise...</div>
|
| 472 |
+
<div class="mob-label">Structure</div>
|
| 473 |
+
<div class="mob-input">intro, verse, chorus...</div>
|
| 474 |
+
<div class="mob-label">Language · en · 0.85 temp</div>
|
| 475 |
+
<div class="mob-accordion">▸ LM parameters</div>
|
| 476 |
+
<div class="mob-btn" style="margin-top:6px">▶ Draft</div>
|
| 477 |
+
</div>
|
| 478 |
+
<div class="mob-output">
|
| 479 |
+
<div style="font-size:9px; line-height:1.5;">
|
| 480 |
+
<strong style="color:#FFF">[intro]</strong><br>
|
| 481 |
+
<span style="color:#B8B0A4">the lights start low...</span><br>
|
| 482 |
+
<strong style="color:#FFF">[verse 1]</strong><br>
|
| 483 |
+
<span style="color:#B8B0A4">six in the morning,<br>the sun's still pretending...</span>
|
| 484 |
+
</div>
|
| 485 |
+
<div class="mob-export" style="margin-top:8px">
|
| 486 |
+
<span class="mob-secondary" style="border-color:#FFF; color:#FFF">→ Use in Generate</span>
|
| 487 |
+
<span class="mob-secondary">↻</span>
|
| 488 |
+
</div>
|
| 489 |
+
</div>
|
| 490 |
+
</div>
|
| 491 |
+
</div>
|
| 492 |
+
<div class="mob-caption">Lyrics · draft visible</div>
|
| 493 |
+
</div>
|
| 494 |
+
|
| 495 |
+
</div>
|
| 496 |
+
|
| 497 |
+
|
| 498 |
+
<h3 style="margin-top:30px">⚠️ Error and edge-case states</h3>
|
| 499 |
+
|
| 500 |
+
<style>
|
| 501 |
+
.err { background:#0A0A0A; border:1px solid #1F1F1F; border-radius:8px; padding:14px; margin-bottom:10px; }
|
| 502 |
+
.err-row { display:flex; align-items:flex-start; gap:14px; }
|
| 503 |
+
.err-icon { width:28px; height:28px; flex-shrink:0; border-radius:50%; background:#FFF; color:#0A0A0A; display:flex; align-items:center; justify-content:center; font-size:14px; font-weight:600; }
|
| 504 |
+
.err-icon.warn { background:#0A0A0A; color:#FFF; border:1px solid #FFF; }
|
| 505 |
+
.err-icon.info { background:transparent; color:#6B6B6B; border:1px solid #6B6B6B; }
|
| 506 |
+
.err-body { flex:1; }
|
| 507 |
+
.err-title { font-size:12px; font-weight:600; color:#FFF; margin-bottom:4px; }
|
| 508 |
+
.err-msg { font-size:11px; color:#B8B0A4; line-height:1.5; margin-bottom:6px; }
|
| 509 |
+
.err-action { display:inline-block; border:1px solid #FFF; color:#FFF; padding:4px 10px; border-radius:3px; font-size:10px; cursor:pointer; margin-right:4px; }
|
| 510 |
+
.err-action.secondary { border-color:#2A2A2A; color:#B8B0A4; }
|
| 511 |
+
.err-tag { display:inline-block; background:#1A1A1A; color:#6B6B6B; padding:2px 6px; border-radius:3px; font-size:9px; font-family:monospace; margin-left:6px; }
|
| 512 |
+
|
| 513 |
+
.progress { background:#0A0A0A; border:1px solid #1F1F1F; border-radius:8px; padding:18px; margin-bottom:10px; }
|
| 514 |
+
.progress-bar { height:4px; background:#1A1A1A; border-radius:2px; overflow:hidden; margin:10px 0 6px; }
|
| 515 |
+
.progress-bar .fill { height:100%; background:#FFF; width:42%; }
|
| 516 |
+
.progress-meta { display:flex; justify-content:space-between; font-size:10px; color:#6B6B6B; font-family:monospace; }
|
| 517 |
+
.progress-title { font-size:12px; font-weight:600; color:#FFF; margin-bottom:4px; }
|
| 518 |
+
.progress-sub { font-size:10px; color:#6B6B6B; }
|
| 519 |
+
</style>
|
| 520 |
+
|
| 521 |
+
<div class="err">
|
| 522 |
+
<div class="err-row">
|
| 523 |
+
<div class="err-icon">!</div>
|
| 524 |
+
<div class="err-body">
|
| 525 |
+
<div class="err-title">LoRA not compatible <span class="err-tag">LoRAValidationError</span></div>
|
| 526 |
+
<div class="err-msg">This LoRA was trained against <code>SDXL</code>, not ACE-Step 1.5 XL SFT. Expected DiT modules: <code>to_q, to_k, to_v, to_out.0, ff.net.0.proj, ff.net.2</code>. Got: <code>unet.down_blocks…</code>.</div>
|
| 527 |
+
<div class="err-action">Remove from stack</div>
|
| 528 |
+
<span class="err-action secondary">View header diagnostics</span>
|
| 529 |
+
</div>
|
| 530 |
+
</div>
|
| 531 |
+
</div>
|
| 532 |
+
|
| 533 |
+
<div class="err">
|
| 534 |
+
<div class="err-row">
|
| 535 |
+
<div class="err-icon warn">⚠</div>
|
| 536 |
+
<div class="err-body">
|
| 537 |
+
<div class="err-title">ZeroGPU timed out · auto-retried at 2× duration</div>
|
| 538 |
+
<div class="err-msg">First attempt aborted at the 60 s shared-A10G cap. Second attempt at 120 s also aborted. Try a shorter duration, fewer steps, or fewer active LoRAs. <span style="color:#6B6B6B">last seen: 70 s wall, step 41/50</span></div>
|
| 539 |
+
<div class="err-action">Lower steps to 30</div>
|
| 540 |
+
<span class="err-action secondary">Reduce duration to 20 s</span>
|
| 541 |
+
</div>
|
| 542 |
+
</div>
|
| 543 |
+
</div>
|
| 544 |
+
|
| 545 |
+
<div class="err">
|
| 546 |
+
<div class="err-row">
|
| 547 |
+
<div class="err-icon warn">⚠</div>
|
| 548 |
+
<div class="err-body">
|
| 549 |
+
<div class="err-title">MPS op not implemented · falling back to CPU <span class="err-tag">aten::_fft_r2c</span></div>
|
| 550 |
+
<div class="err-msg">An ACE-Step kernel hit a PyTorch MPS gap. CPU fallback engaged via <code>PYTORCH_ENABLE_MPS_FALLBACK=1</code>. Generation will continue but be ~2–3× slower for the affected segments.</div>
|
| 551 |
+
<div class="err-action secondary">Continue anyway</div>
|
| 552 |
+
<span class="err-action secondary">Open issue on GitHub</span>
|
| 553 |
+
</div>
|
| 554 |
+
</div>
|
| 555 |
+
</div>
|
| 556 |
+
|
| 557 |
+
<div class="err">
|
| 558 |
+
<div class="err-row">
|
| 559 |
+
<div class="err-icon">!</div>
|
| 560 |
+
<div class="err-body">
|
| 561 |
+
<div class="err-title">Reference audio rejected <span class="err-tag">unsupported format</span></div>
|
| 562 |
+
<div class="err-msg">Cover mode needs <code>wav</code>, <code>mp3</code>, or <code>flac</code>, ≤ 60 s, ≤ 50 MB. Got <code>m4a</code>, 4:12 long, 87 MB.</div>
|
| 563 |
+
<div class="err-action">Pick a different file</div>
|
| 564 |
+
<span class="err-action secondary">Auto-convert + trim to first 60 s</span>
|
| 565 |
+
</div>
|
| 566 |
+
</div>
|
| 567 |
+
</div>
|
| 568 |
+
|
| 569 |
+
<div class="err">
|
| 570 |
+
<div class="err-row">
|
| 571 |
+
<div class="err-icon info">i</div>
|
| 572 |
+
<div class="err-body">
|
| 573 |
+
<div class="err-title">First request: warming up the pipeline (~45 s)</div>
|
| 574 |
+
<div class="err-msg">Loading <code>ACE-Step v1.5 XL SFT</code> weights into MPS memory. Subsequent generations in this session start instantly.</div>
|
| 575 |
+
</div>
|
| 576 |
+
</div>
|
| 577 |
+
</div>
|
| 578 |
+
|
| 579 |
+
<div class="progress">
|
| 580 |
+
<div class="progress-title">Generating… <span style="color:#6B6B6B; font-weight:400; font-size:10px;">step 21 / 50 · ETA 14 s</span></div>
|
| 581 |
+
<div class="progress-sub">heun sampler · CFG 5.0 · 2 LoRAs active · seed 1297183202</div>
|
| 582 |
+
<div class="progress-bar"><div class="fill"></div></div>
|
| 583 |
+
<div class="progress-meta">
|
| 584 |
+
<span>0:08 elapsed</span>
|
| 585 |
+
<span>↻ cancel</span>
|
| 586 |
+
</div>
|
| 587 |
+
</div>
|
| 588 |
+
|
| 589 |
+
<div class="options" style="margin-top:24px">
|
| 590 |
+
<div class="option" data-choice="approve" onclick="toggleSelect(this)">
|
| 591 |
+
<div class="letter">✓</div>
|
| 592 |
+
<div class="content">
|
| 593 |
+
<h3>All mockups approved: bake them into the spec</h3>
|
| 594 |
+
<p>Move every approved mockup into <code>docs/superpowers/specs/mockups/</code> and reference them from §8 of the spec. Then hand off to writing-plans.</p>
|
| 595 |
+
</div>
|
| 596 |
+
</div>
|
| 597 |
+
<div class="option" data-choice="revise" onclick="toggleSelect(this)">
|
| 598 |
+
<div class="letter">✎</div>
|
| 599 |
+
<div class="content">
|
| 600 |
+
<h3>Revise something specific</h3>
|
| 601 |
+
<p>Tell me which mockup / control / error needs work.</p>
|
| 602 |
+
</div>
|
| 603 |
+
</div>
|
| 604 |
+
</div>
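The reference-audio error card in the mockup above states the Cover-mode limits (wav, mp3, or flac; at most 60 s; at most 50 MB). A minimal Python sketch of that check follows, assuming the format, duration, and size have already been probed elsewhere. `AudioInfo` and `validate_reference` are hypothetical names, not part of the spec.

```python
from dataclasses import dataclass

# Limits copied from the error card; the names are illustrative.
ALLOWED_FORMATS = {"wav", "mp3", "flac"}
MAX_DURATION_S = 60.0
MAX_SIZE_MB = 50.0


@dataclass
class AudioInfo:
    """Hypothetical container for already-probed file properties."""
    fmt: str
    duration_s: float
    size_mb: float


def validate_reference(info: AudioInfo) -> list[str]:
    """Return human-readable problems; an empty list means accepted."""
    problems = []
    if info.fmt.lower() not in ALLOWED_FORMATS:
        problems.append(f"unsupported format: {info.fmt}")
    if info.duration_s > MAX_DURATION_S:
        problems.append(f"too long: {info.duration_s:.0f} s > {MAX_DURATION_S:.0f} s")
    if info.size_mb > MAX_SIZE_MB:
        problems.append(f"too large: {info.size_mb:.0f} MB > {MAX_SIZE_MB:.0f} MB")
    return problems
```

With the values shown on the card (`m4a`, 4:12 long, 87 MB), all three checks fail, which is consistent with the card reporting format, length, and size together.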
|
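The ZeroGPU timeout card in the mockup above describes one automatic retry at twice the time budget (60 s, then 120 s) before the error is shown. That policy could be sketched as below. `BudgetExceeded`, `run_with_retry`, and the `job(budget_s)` callable are illustrative assumptions; the sketch does not call any ZeroGPU API.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class BudgetExceeded(Exception):
    """Hypothetical signal that a job ran past its time budget."""


def run_with_retry(job: Callable[[float], T],
                   base_budget_s: float = 60.0,
                   attempts: int = 2) -> T:
    """Call job(budget_s); on BudgetExceeded, retry with double the budget."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    budget_s = base_budget_s
    for attempt in range(attempts):
        try:
            return job(budget_s)
        except BudgetExceeded:
            if attempt == attempts - 1:
                raise  # out of retries: let the UI show the error card
            budget_s *= 2
    raise AssertionError("unreachable")
```

Re-raising on the last attempt keeps the original exception available to whatever renders the card, so details such as the last completed step can be reported.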
|
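The LoRA compatibility card in the mockup above lists the DiT module names a compatible adapter is expected to target. One way such a check might look is sketched below, assuming the adapter's tensor names are available as dotted strings (for example, the keys of a loaded state dict). The mockup does not give the exact key layout of ACE-Step LoRAs, so the matching rule here is a guess; only the module list and the `LoRAValidationError` name come from the card.

```python
# Module names copied from the error card.
EXPECTED_MODULES = ("to_q", "to_k", "to_v", "to_out.0", "ff.net.0.proj", "ff.net.2")


class LoRAValidationError(ValueError):
    """Name taken from the card's tag; the behaviour is an assumption."""


def targets_expected_module(key: str) -> bool:
    # Match whole dotted segments so "to_q" does not match "to_qkv".
    padded = f".{key}."
    return any(f".{module}." in padded for module in EXPECTED_MODULES)


def validate_lora_keys(keys: list[str]) -> None:
    """Raise if any tensor name targets a module outside the expected set."""
    unmatched = [k for k in keys if not targets_expected_module(k)]
    if not keys or unmatched:
        sample = unmatched[0] if unmatched else "<no tensors>"
        raise LoRAValidationError(
            f"{len(unmatched)} of {len(keys)} keys do not target expected "
            f"DiT modules, e.g. {sample}"
        )
```

Requiring every key to match is deliberately strict: an adapter trained for another architecture can still contain a few attention projections with familiar names, so a single foreign key is treated as a mismatch.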
@@ -0,0 +1,572 @@
|
| 1 |
+
<h2>Cover and Extend · everything expanded</h2>
|
| 2 |
+
<p class="subtitle">Every accordion is open and every control is visible, showing the actual depth of options. In production, "Advanced", "LM planner", and "DCW" stay collapsed by default; this is the full surface so you can verify nothing is missing.</p>
|
| 3 |
+
|
| 4 |
+
<style>
|
| 5 |
+
/* base */
|
| 6 |
+
.gm { background:#0A0A0A; color:#E5E5E5; border:1px solid #1F1F1F; border-radius:10px; padding:18px; font-size:12px; line-height:1.5; margin-top:14px; }
|
| 7 |
+
.gm-header { display:flex; justify-content:space-between; align-items:center; padding-bottom:10px; border-bottom:1px solid #1F1F1F; margin-bottom:14px; }
|
| 8 |
+
.gm-brand { font-size:15px; font-weight:600; }
|
| 9 |
+
.gm-cta { font-size:11px; color:#6B6B6B; }
|
| 10 |
+
.gm-cta strong { color:#E5E5E5; }
|
| 11 |
+
.gm-status { font-size:10px; color:#6B6B6B; letter-spacing:0.08em; text-transform:uppercase; }
|
| 12 |
+
.gm-row { display:flex; gap:16px; align-items:flex-start; }
|
| 13 |
+
.gm-sidebar { background:#000; padding:14px 10px; border-radius:6px; min-width:170px; position:sticky; top:0; }
|
| 14 |
+
.gm-side { display:block; padding:8px 10px; border-radius:4px; margin-bottom:3px; font-size:12px; color:#6B6B6B; }
|
| 15 |
+
.gm-side.active { background:#1A1A1A; color:#FFF; border-left:2px solid #FFF; padding-left:8px; }
|
| 16 |
+
.gm-side .em { margin-right:6px; }
|
| 17 |
+
.gm-main { flex:1; display:flex; gap:14px; align-items:flex-start; }
|
| 18 |
+
.gm-form { flex:1.3; background:#141414; padding:16px; border-radius:6px; }
|
| 19 |
+
.gm-output { flex:1; background:#141414; padding:16px; border-radius:6px; min-width:260px; position:sticky; top:0; }
|
| 20 |
+
|
| 21 |
+
/* form controls */
|
| 22 |
+
.gm-label { font-size:10px; text-transform:uppercase; letter-spacing:0.08em; color:#6B6B6B; margin-bottom:6px; display:flex; justify-content:space-between; align-items:center; }
|
| 23 |
+
.gm-label .hint { color:#5A5048; font-size:9px; text-transform:none; letter-spacing:normal; font-weight:400; }
|
| 24 |
+
.gm-input { background:#000; border:1px solid #2A2A2A; padding:8px 10px; border-radius:4px; color:#E5E5E5; margin-bottom:12px; font-size:11px; }
|
| 25 |
+
.gm-textarea { min-height:46px; }
|
| 26 |
+
.gm-grid2 { display:grid; grid-template-columns:1fr 1fr; gap:12px; margin-bottom:12px; }
|
| 27 |
+
.gm-grid3 { display:grid; grid-template-columns:1fr 1fr 1fr; gap:10px; margin-bottom:12px; }
|
| 28 |
+
.gm-grid4 { display:grid; grid-template-columns:1fr 1fr 1fr 1fr; gap:8px; margin-bottom:12px; }
|
| 29 |
+
|
| 30 |
+
/* slider */
|
| 31 |
+
.gm-slider-row { display:flex; align-items:center; gap:10px; padding:6px 8px; background:#000; border:1px solid #2A2A2A; border-radius:4px; margin-bottom:8px; font-size:11px; }
|
| 32 |
+
.gm-slider-row .name { color:#6B6B6B; font-size:10px; min-width:90px; }
|
| 33 |
+
.gm-slider { flex:1; height:3px; background:#2A2A2A; border-radius:2px; position:relative; }
|
| 34 |
+
.gm-slider::after { content:""; position:absolute; top:-4px; width:10px; height:10px; background:#FFF; border-radius:50%; }
|
| 35 |
+
.gm-slider.p10::after { left:10%; }
|
| 36 |
+
.gm-slider.p20::after { left:20%; }
|
| 37 |
+
.gm-slider.p30::after { left:30%; }
|
| 38 |
+
.gm-slider.p40::after { left:40%; }
|
| 39 |
+
.gm-slider.p50::after { left:50%; }
|
| 40 |
+
.gm-slider.p60::after { left:60%; }
|
| 41 |
+
.gm-slider.p70::after { left:70%; }
|
| 42 |
+
.gm-slider.p85::after { left:85%; }
|
| 43 |
+
.gm-slider.p93::after { left:93%; }
|
| 44 |
+
.gm-slider.p95::after { left:95%; }
|
| 45 |
+
.gm-slider-row .val { color:#FFF; font-family:monospace; font-size:11px; min-width:42px; text-align:right; }
|
| 46 |
+
|
| 47 |
+
/* toggle */
|
| 48 |
+
.gm-toggle { display:flex; align-items:center; gap:8px; padding:6px 10px; background:#000; border:1px solid #2A2A2A; border-radius:4px; margin-bottom:8px; font-size:11px; cursor:pointer; }
|
| 49 |
+
.gm-toggle .box { width:14px; height:14px; border:1px solid #2A2A2A; border-radius:3px; display:inline-flex; align-items:center; justify-content:center; font-size:9px; }
|
| 50 |
+
.gm-toggle.on { color:#FFF; border-color:#FFF; }
|
| 51 |
+
.gm-toggle.on .box { background:#FFF; color:#0A0A0A; border-color:#FFF; }
|
| 52 |
+
|
| 53 |
+
/* radio pill */
|
| 54 |
+
.gm-pills { display:flex; gap:0; background:#000; border:1px solid #2A2A2A; border-radius:4px; padding:2px; margin-bottom:12px; }
|
| 55 |
+
.gm-pill { flex:1; text-align:center; padding:6px 10px; font-size:11px; color:#6B6B6B; border-radius:3px; cursor:pointer; }
|
| 56 |
+
.gm-pill.on { background:#FFF; color:#0A0A0A; }
|
| 57 |
+
|
| 58 |
+
/* select */
|
| 59 |
+
.gm-select { background:#000; border:1px solid #2A2A2A; padding:8px 10px; border-radius:4px; color:#E5E5E5; font-size:11px; display:flex; justify-content:space-between; align-items:center; margin-bottom:8px; }
|
| 60 |
+
.gm-select .arrow { color:#6B6B6B; }
|
| 61 |
+
|
| 62 |
+
/* section divider */
|
| 63 |
+
.gm-section { border:1px solid #2A2A2A; border-radius:4px; padding:14px; margin-top:14px; background:#0F0F0F; }
|
| 64 |
+
.gm-section-h { display:flex; justify-content:space-between; align-items:center; margin-bottom:12px; font-size:11px; font-weight:600; }
|
| 65 |
+
.gm-section-h .arrow { color:#FFF; }
|
| 66 |
+
.gm-section-h .meta { color:#6B6B6B; font-weight:400; font-size:10px; }
|
| 67 |
+
|
| 68 |
+
.gm-chip { display:inline-block; padding:5px 10px; border-radius:14px; font-size:10px; margin-right:5px; margin-bottom:5px; background:#000; border:1px solid #2A2A2A; color:#6B6B6B; cursor:pointer; }
|
| 69 |
+
.gm-chip.on { border-color:#FFF; color:#FFF; }
|
| 70 |
+
.gm-chip.upload { border-style:dashed; color:#FFF; }
|
| 71 |
+
|
| 72 |
+
.gm-lora-row { display:flex; align-items:center; gap:10px; padding:8px 10px; background:#000; border:1px solid #2A2A2A; border-radius:4px; margin-bottom:6px; font-size:11px; }
|
| 73 |
+
.gm-lora-name { flex:1; }
|
| 74 |
+
.gm-lora-name small { color:#6B6B6B; font-weight:400; margin-left:4px; }
|
| 75 |
+
.gm-x { color:#6B6B6B; cursor:pointer; padding:0 4px; }
|
| 76 |
+
|
| 77 |
+
.gm-btn { background:#FFF; color:#0A0A0A; padding:12px 18px; border-radius:4px; font-weight:600; display:block; font-size:13px; text-align:center; cursor:pointer; margin-top:16px; }
|
| 78 |
+
|
| 79 |
+
/* drop zone */
|
| 80 |
+
.gm-dropzone { background:#000; border:2px dashed #2A2A2A; border-radius:6px; padding:14px; margin-bottom:12px; text-align:center; font-size:11px; color:#6B6B6B; }
|
| 81 |
+
.gm-dropzone.has-file { border-style:solid; border-color:#FFF; color:#FFF; text-align:left; padding:10px 12px; }
|
| 82 |
+
.gm-dropzone .filename { font-weight:600; }
|
| 83 |
+
.gm-dropzone .meta { color:#6B6B6B; font-size:9px; margin-top:2px; font-weight:400; }
|
| 84 |
+
.gm-dropzone .miniwave { height:18px; background:repeating-linear-gradient(90deg, currentColor 0 1px, transparent 1px 3px); margin-top:6px; opacity:0.5; }
|
| 85 |
+
|
| 86 |
+
/* output */
|
| 87 |
+
.gm-waveform { height:60px; background:#000; border:1px solid #2A2A2A; border-radius:4px; display:flex; align-items:center; justify-content:center; gap:2px; padding:8px; margin-bottom:10px; }
|
| 88 |
+
.gm-bar { width:2px; background:#E5E5E5; }
|
| 89 |
+
.gm-player-controls { display:flex; align-items:center; gap:10px; color:#6B6B6B; font-size:10px; margin-bottom:14px; }
|
| 90 |
+
.gm-play { width:28px; height:28px; border-radius:50%; background:#FFF; color:#0A0A0A; display:flex; align-items:center; justify-content:center; font-size:11px; }
|
| 91 |
+
.gm-stems { display:grid; grid-template-columns:1fr 1fr; gap:6px; margin-bottom:10px; }
|
| 92 |
+
.gm-stem { background:#000; border:1px solid #2A2A2A; border-radius:4px; padding:6px 10px; font-size:10px; display:flex; justify-content:space-between; align-items:center; }
|
| 93 |
+
.gm-stem .dl { color:#FFF; cursor:pointer; }
|
| 94 |
+
.gm-meta-block { background:#000; border:1px solid #2A2A2A; border-radius:4px; padding:8px 10px; font-size:9px; color:#6B6B6B; font-family:monospace; line-height:1.6; max-height:140px; overflow:hidden; margin-top:8px; }
|
| 95 |
+
.gm-actions { display:flex; flex-wrap:wrap; gap:6px; margin-bottom:10px; }
|
| 96 |
+
.gm-secondary { border:1px solid #2A2A2A; color:#E5E5E5; padding:6px 12px; border-radius:4px; font-size:10px; cursor:pointer; }
|
| 97 |
+
</style>
|
| 98 |
+
|
| 99 |
+
|
| 100 |
+
<h3 style="margin-top:14px">🎤 Cover: fully expanded</h3>
|
| 101 |
+
|
| 102 |
+
<div class="gm">
|
| 103 |
+
<div class="gm-header">
|
| 104 |
+
<div>
|
| 105 |
+
<div class="gm-brand">ACE Music Studio<span style="color:#FFF">.</span></div>
|
| 106 |
+
<div class="gm-cta" style="margin-top:2px">Built with <span style="color:#FFF">♥</span>. <strong>Drop a like</strong> · Follow <strong>@techfreakworm</strong> for what's next.</div>
|
| 107 |
+
</div>
|
| 108 |
+
<div class="gm-status">ready · MPS · M5 Max</div>
|
| 109 |
+
</div>
|
| 110 |
+
|
| 111 |
+
<div class="gm-row">
|
| 112 |
+
<div class="gm-sidebar">
|
| 113 |
+
<div class="gm-side"><span class="em">🎵</span>Generate</div>
|
| 114 |
+
<div class="gm-side active"><span class="em">🎤</span>Cover</div>
|
| 115 |
+
<div class="gm-side"><span class="em">⏩</span>Extend</div>
|
| 116 |
+
<div class="gm-side"><span class="em">✏️</span>Edit</div>
|
| 117 |
+
<div class="gm-side"><span class="em">✍️</span>Lyrics</div>
|
| 118 |
+
<div style="border-top:1px solid #1F1F1F; margin:14px 0 10px; padding-top:10px; font-size:9px; color:#6B6B6B; text-transform:uppercase; letter-spacing:0.1em;">History · session</div>
|
| 119 |
+
<div style="font-size:10px; color:#6B6B6B; padding:4px 8px;">▶ psy_cover · just now</div>
|
| 120 |
+
<div style="font-size:10px; color:#6B6B6B; padding:4px 8px;">▶ lofi_remix · 3m</div>
|
| 121 |
+
<div style="font-size:10px; color:#6B6B6B; padding:4px 8px;">▶ ambient_v2 · 12m</div>
|
| 122 |
+
</div>
|
| 123 |
+
|
| 124 |
+
<div class="gm-main">
|
| 125 |
+
<div class="gm-form">
|
| 126 |
+
|
| 127 |
+
<div class="gm-label">1 · Reference audio <span class="hint">wav / mp3 / flac · ≤ 60 s · matters most for first 12 s</span></div>
|
| 128 |
+
<div class="gm-dropzone has-file">
|
| 129 |
+
<div class="filename">✓ reference_psy_track.wav</div>
|
| 130 |
+
<div class="meta">44.1 kHz · stereo · 28.4 s · 2.1 MB</div>
|
| 131 |
+
<div class="miniwave"></div>
|
| 132 |
+
</div>
|
| 133 |
+
|
| 134 |
+
<div class="gm-label">2 · New style prompt <span class="hint">leave blank to fully inherit reference style</span></div>
|
| 135 |
+
<div class="gm-input">faster, more aggressive leads, club-ready</div>
|
| 136 |
+
|
| 137 |
+
<div class="gm-label">3 · New lyrics <span class="hint">use [verse] [chorus] [bridge] tags · open Lyrics tab to draft with AI</span></div>
|
| 138 |
+
<div class="gm-input gm-textarea">[intro] driving acid bassline<br>[verse] new lyrics over the reference style<br>[chorus] one more time, one more time<br>[outro] ...</div>
|
| 139 |
+
|
| 140 |
+
<div class="gm-grid2">
|
| 141 |
+
<div>
|
| 142 |
+
<div class="gm-label">Duration <span class="hint">seconds</span></div>
|
| 143 |
+
<div class="gm-slider-row"><span class="name">5 – 240 s</span><span class="gm-slider p10"></span><span class="val">30</span></div>
|
| 144 |
+
</div>
|
| 145 |
+
<div>
|
| 146 |
+
<div class="gm-label">Vocal mode</div>
|
| 147 |
+
<div class="gm-pills">
|
| 148 |
+
<div class="gm-pill on">With vocals</div>
|
| 149 |
+
<div class="gm-pill">Instrumental</div>
|
| 150 |
+
</div>
|
| 151 |
+
</div>
|
| 152 |
+
</div>
|
| 153 |
+
|
| 154 |
+
<div class="gm-label">Cover-specific <span class="hint">how the reference influences the output</span></div>
|
| 155 |
+
<div class="gm-slider-row"><span class="name">Cover strength</span><span class="gm-slider p93"></span><span class="val">0.93</span></div>
|
| 156 |
+
<div class="gm-slider-row"><span class="name">Cover noise</span><span class="gm-slider p10" style="--p:0.05;"></span><span class="val">0.00</span></div>
|
| 157 |
+
|
| 158 |
+
<!-- LoRA section, expanded -->
|
| 159 |
+
<div class="gm-section">
|
| 160 |
+
<div class="gm-section-h">
|
| 161 |
+
<span>LoRA stack <span class="meta">· 2 active</span></span>
|
| 162 |
+
<span class="arrow">▾</span>
|
| 163 |
+
</div>
|
| 164 |
+
|
| 165 |
+
<div class="gm-label">Bundled presets <span class="hint">click to toggle</span></div>
|
| 166 |
+
<div style="margin-bottom:12px;">
|
| 167 |
+
<span class="gm-chip on">RapMachine</span>
|
| 168 |
+
<span class="gm-chip">Chinese Rap</span>
|
| 169 |
+
<span class="gm-chip">Lyric2Vocal</span>
|
| 170 |
+
<span class="gm-chip">Text2Samples</span>
|
| 171 |
+
</div>
|
| 172 |
+
|
| 173 |
+
<div class="gm-label">Active stack <span class="hint">applied in order, top first</span></div>
|
| 174 |
+
<div class="gm-lora-row">
|
| 175 |
+
<span class="gm-lora-name">RapMachine <small>· preset</small></span>
|
| 176 |
+
<span class="gm-slider p85" style="width:100px"></span>
|
| 177 |
+
<span class="val" style="color:#FFF; font-family:monospace; font-size:11px;">0.85</span>
|
| 178 |
+
<span class="gm-x">×</span>
|
| 179 |
+
</div>
|
| 180 |
+
<div class="gm-lora-row">
|
| 181 |
+
<span class="gm-lora-name">psytrance_v2 <small>· custom · 47 MB · rank 64</small></span>
|
| 182 |
+
<span class="gm-slider p95" style="width:100px"></span>
|
| 183 |
+
<span class="val" style="color:#FFF; font-family:monospace; font-size:11px;">0.95</span>
|
| 184 |
+
<span class="gm-x">×</span>
|
| 185 |
+
</div>
|
| 186 |
+
|
| 187 |
+
<div style="margin-top:10px;">
|
| 188 |
+
<span class="gm-chip upload">β drop .safetensors here or click</span>
|
| 189 |
+
</div>
|
| 190 |
+
</div>
|
| 191 |
+
|
| 192 |
+
<!-- Advanced section, expanded -->
|
| 193 |
+
<div class="gm-section">
|
| 194 |
+
<div class="gm-section-h">
|
| 195 |
+
<span>Advanced</span>
|
| 196 |
+
<span class="arrow">▾</span>
|
| 197 |
+
</div>
|
| 198 |
+
|
| 199 |
+
<div class="gm-grid3">
|
| 200 |
+
<div><div class="gm-label">BPM</div><div class="gm-input" style="margin-bottom:0">135</div></div>
|
| 201 |
+
<div><div class="gm-label">Key / scale</div><div class="gm-input" style="margin-bottom:0">auto</div></div>
|
| 202 |
+
<div><div class="gm-label">Time sig</div><div class="gm-input" style="margin-bottom:0">4 / 4</div></div>
|
| 203 |
+
</div>
|
| 204 |
+
|
| 205 |
+
<div class="gm-grid2">
|
| 206 |
+
<div><div class="gm-label">Sampler</div><div class="gm-select">heun <span class="arrow">▾</span></div></div>
|
| 207 |
+
<div><div class="gm-label">Vocal language</div><div class="gm-select">auto <span class="arrow">▾</span></div></div>
|
| 208 |
+
</div>
|
| 209 |
+
|
| 210 |
+
<div class="gm-slider-row"><span class="name">Inference steps</span><span class="gm-slider p20"></span><span class="val">50</span></div>
|
| 211 |
+
<div class="gm-slider-row"><span class="name">CFG scale</span><span class="gm-slider p40"></span><span class="val">5.0</span></div>
|
| 212 |
+
<div class="gm-slider-row"><span class="name">Shift</span><span class="gm-slider p30"></span><span class="val">3</span></div>
|
| 213 |
+
|
| 214 |
+
<div class="gm-label" style="margin-top:8px">Negative prompt <span class="hint">things to avoid in the output</span></div>
|
| 215 |
+
<div class="gm-input gm-textarea" style="font-size:10px">bitcrushed, aliasing, jazz, pop, vocal hooks, slow tempo</div>
|
| 216 |
+
|
| 217 |
+
<div class="gm-grid2">
|
| 218 |
+
<div><div class="gm-label">Audio format</div><div class="gm-pills"><div class="gm-pill on">mp3 320</div><div class="gm-pill">wav 44.1</div></div></div>
|
| 219 |
+
<div><div class="gm-label">Loudness</div><div class="gm-toggle on"><span class="box">✓</span> Normalize to -14 LUFS</div></div>
|
| 220 |
+
</div>
|
| 221 |
+
|
| 222 |
+
<div class="gm-grid2">
|
| 223 |
+
<div><div class="gm-label">Fade in</div><div class="gm-input" style="margin-bottom:0">0.0 s</div></div>
|
| 224 |
+
<div><div class="gm-label">Fade out</div><div class="gm-input" style="margin-bottom:0">0.0 s</div></div>
|
| 225 |
+
</div>
|
| 226 |
+
|
| 227 |
+
<div class="gm-grid2">
|
| 228 |
+
<div><div class="gm-label">Seed</div><div class="gm-input" style="margin-bottom:0; font-family:monospace">1297183202</div></div>
|
| 229 |
+
<div><div class="gm-label"> </div><div class="gm-toggle"><span class="box"></span> Lock seed · re-use across retakes</div></div>
|
| 230 |
+
</div>
|
| 231 |
+
</div>
|
| 232 |
+
|
| 233 |
+
<!-- LM planner section, expanded -->
|
| 234 |
+
<div class="gm-section">
|
| 235 |
+
<div class="gm-section-h">
|
| 236 |
+
<span>LM planner · Qwen3 thinking</span>
|
| 237 |
+
<span class="arrow">▾</span>
|
| 238 |
+
</div>
|
| 239 |
+
|
| 240 |
+
<div class="gm-toggle on"><span class="box">✓</span> Thinking enabled <span style="color:#6B6B6B; font-size:9px; margin-left:auto">+ slower but better structure</span></div>
|
| 241 |
+
<div class="gm-toggle on"><span class="box">✓</span> Constrained decoding</div>
|
| 242 |
+
|
| 243 |
+
<div class="gm-grid4" style="margin-top:8px">
|
| 244 |
+
<div><div class="gm-label">Temp</div><div class="gm-input" style="margin-bottom:0">0.85</div></div>
|
| 245 |
+
<div><div class="gm-label">Top-k</div><div class="gm-input" style="margin-bottom:0">0</div></div>
|
| 246 |
+
<div><div class="gm-label">Top-p</div><div class="gm-input" style="margin-bottom:0">0.90</div></div>
|
| 247 |
+
<div><div class="gm-label">LM CFG</div><div class="gm-input" style="margin-bottom:0">2</div></div>
|
| 248 |
+
</div>
|
| 249 |
+
|
| 250 |
+
<div class="gm-label">CoT pipeline toggles <span class="hint">which fields the LM rewrites</span></div>
|
| 251 |
+
<div class="gm-grid4">
|
| 252 |
+
<div class="gm-toggle"><span class="box"></span> metas</div>
|
| 253 |
+
<div class="gm-toggle"><span class="box"></span> caption</div>
|
| 254 |
+
<div class="gm-toggle"><span class="box"></span> lyrics</div>
|
| 255 |
+
<div class="gm-toggle"><span class="box"></span> language</div>
|
| 256 |
+
</div>
|
| 257 |
+
|
| 258 |
+
<div class="gm-label">LM negative prompt</div>
|
| 259 |
+
<div class="gm-input" style="font-size:10px">happy chords, major scale</div>
|
| 260 |
+
</div>
|
| 261 |
+
|
| 262 |
+
<!-- DCW section, expanded -->
|
| 263 |
+
<div class="gm-section">
|
| 264 |
+
<div class="gm-section-h">
|
| 265 |
+
<span>DCW · dynamic CFG warping</span>
|
| 266 |
+
<span class="arrow">▾</span>
|
| 267 |
+
</div>
|
| 268 |
+
|
| 269 |
+
<div class="gm-toggle on"><span class="box">✓</span> DCW enabled</div>
|
| 270 |
+
|
| 271 |
+
<div class="gm-grid3">
|
| 272 |
+
<div><div class="gm-label">Mode</div><div class="gm-select">double <span class="arrow">▾</span></div></div>
|
| 273 |
+
<div><div class="gm-label">Wavelet</div><div class="gm-select">haar <span class="arrow">▾</span></div></div>
|
| 274 |
+
<div><div class="gm-label"> </div><div style="font-size:9px; color:#6B6B6B; padding-top:8px;">leave defaults if unsure</div></div>
|
| 275 |
+
</div>
|
| 276 |
+
|
| 277 |
+
<div class="gm-slider-row"><span class="name">DCW scaler</span><span class="gm-slider p10"></span><span class="val">0.02</span></div>
|
| 278 |
+
<div class="gm-slider-row"><span class="name">High scaler</span><span class="gm-slider p10"></span><span class="val">0.06</span></div>
|
| 279 |
+
</div>
|
| 280 |
+
|
| 281 |
+
<div class="gm-btn">▶ Generate cover · est. ~35 s on M5 Max</div>
|
| 282 |
+
</div>
|
| 283 |
+
|
| 284 |
+
<!-- Output panel -->
|
| 285 |
+
<div class="gm-output">
|
| 286 |
+
<div class="gm-label" style="margin-bottom:10px">Output · cover · 30 s · seed 1297183202</div>
|
| 287 |
+
|
| 288 |
+
<div class="gm-toggle"><span class="box"></span> Compare side-by-side with reference</div>
|
| 289 |
+
|
| 290 |
+
<div class="gm-waveform">
|
| 291 |
+
<div class="gm-bar" style="height:22%"></div><div class="gm-bar" style="height:54%"></div><div class="gm-bar" style="height:78%"></div><div class="gm-bar" style="height:42%"></div><div class="gm-bar" style="height:62%"></div><div class="gm-bar" style="height:88%"></div><div class="gm-bar" style="height:32%"></div><div class="gm-bar" style="height:70%"></div><div class="gm-bar" style="height:50%"></div><div class="gm-bar" style="height:84%"></div><div class="gm-bar" style="height:64%"></div><div class="gm-bar" style="height:38%"></div><div class="gm-bar" style="height:74%"></div><div class="gm-bar" style="height:46%"></div><div class="gm-bar" style="height:58%"></div><div class="gm-bar" style="height:80%"></div><div class="gm-bar" style="height:36%"></div><div class="gm-bar" style="height:68%"></div>
|
| 292 |
+
</div>
|
| 293 |
+
|
| 294 |
+
<div class="gm-player-controls">
|
| 295 |
+
<span class="gm-play">▶</span>
|
| 296 |
+
<span>0:00 / 0:30</span>
|
| 297 |
+
<span style="margin-left:auto; cursor:pointer; color:#FFF">↻ retake · new seed</span>
|
| 298 |
+
</div>
|
| 299 |
+
|
| 300 |
+
<div class="gm-label">Stems · Demucs htdemucs_ft</div>
|
| 301 |
+
<div class="gm-stems">
|
| 302 |
+
<div class="gm-stem"><span>vocals · 1.8 MB</span><span class="dl">↓</span></div>
|
| 303 |
+
<div class="gm-stem"><span>drums · 1.6 MB</span><span class="dl">↓</span></div>
|
| 304 |
+
<div class="gm-stem"><span>bass · 1.4 MB</span><span class="dl">↓</span></div>
|
| 305 |
+
<div class="gm-stem"><span>other · 1.7 MB</span><span class="dl">↓</span></div>
|
| 306 |
+
</div>
|
| 307 |
+
|
| 308 |
+
<div class="gm-label">Export</div>
|
| 309 |
+
<div class="gm-actions">
|
| 310 |
+
<span class="gm-secondary">↓ mp3 · 320k · 1.2 MB</span>
|
| 311 |
+
<span class="gm-secondary">↓ wav · 44.1k · 5.3 MB</span>
|
| 312 |
+
<span class="gm-secondary">↓ stems zip</span>
|
| 313 |
+
<span class="gm-secondary">{ } meta json</span>
|
| 314 |
+
<span class="gm-secondary">β copy share link</span>
|
| 315 |
+
</div>
|
| 316 |
+
|
| 317 |
+
<div class="gm-label" style="margin-top:14px">Metadata · for reproducibility</div>
|
| 318 |
+
<div class="gm-meta-block">
|
| 319 |
+
{<br>
|
| 320 |
+
"mode": "cover",<br>
|
| 321 |
+
"prompt": "faster, more aggressive leads, club-ready",<br>
|
| 322 |
+
"lyrics_first_line": "[intro] driving acid bassline...",<br>
|
| 323 |
+
"ref_audio_sha256": "a4f1...d29c",<br>
|
| 324 |
+
"duration_s": 30, "bpm": 135, "key": "auto",<br>
|
| 325 |
+
"sampler": "heun", "steps": 50, "cfg": 5.0, "shift": 3,<br>
|
| 326 |
+
"audio_cover_strength": 0.93, "cover_noise_strength": 0.0,<br>
|
| 327 |
+
"lm": {"thinking": true, "temp": 0.85, "top_p": 0.9, "cfg": 2,<br>
|
| 328 |
+
"cot": {"metas": false, "caption": false, "lyrics": false}},<br>
|
| 329 |
+
"dcw": {"enabled": true, "mode": "double", "scaler": 0.02, "high_scaler": 0.06, "wavelet": "haar"},<br>
|
| 330 |
+
"loras": [<br>
|
| 331 |
+
{"name": "RapMachine", "scale": 0.85, "sha256": "b7e2..."},<br>
|
| 332 |
+
{"name": "psytrance_v2", "scale": 0.95, "sha256": "0c94..."}<br>
|
| 333 |
+
],<br>
|
| 334 |
+
"seed": 1297183202,<br>
|
| 335 |
+
"output_sha256": "f33a...19b8"<br>
|
| 336 |
+
}
|
| 337 |
+
</div>
|
| 338 |
+
</div>
|
| 339 |
+
</div>
|
| 340 |
+
</div>
|
| 341 |
+
</div>
|
| 342 |
+
|
| 343 |
+
<h3 style="margin-top:30px">⏩ Extend: fully expanded</h3>
|
| 344 |
+
|
| 345 |
+
<div class="gm">
|
| 346 |
+
<div class="gm-header">
|
| 347 |
+
<div>
|
| 348 |
+
<div class="gm-brand">ACE Music Studio<span style="color:#FFF">.</span></div>
|
| 349 |
+
<div class="gm-cta" style="margin-top:2px">Built with <span style="color:#FFF">♥</span>. <strong>Drop a like</strong> · Follow <strong>@techfreakworm</strong> for what's next.</div>
|
| 350 |
+
</div>
|
| 351 |
+
<div class="gm-status">ready · MPS · M5 Max</div>
|
| 352 |
+
</div>
|
| 353 |
+
|
| 354 |
+
<div class="gm-row">
|
| 355 |
+
<div class="gm-sidebar">
|
| 356 |
+
<div class="gm-side"><span class="em">🎵</span>Generate</div>
|
| 357 |
+
<div class="gm-side"><span class="em">🎤</span>Cover</div>
|
| 358 |
+
<div class="gm-side active"><span class="em">⏩</span>Extend</div>
|
| 359 |
+
<div class="gm-side"><span class="em">βοΈ</span>Edit</div>
|
| 360 |
+
<div class="gm-side"><span class="em">βοΈ</span>Lyrics</div>
|
| 361 |
+
</div>
|
| 362 |
+
|
| 363 |
+
<div class="gm-main">
|
| 364 |
+
<div class="gm-form">
|
| 365 |
+
|
| 366 |
+
<div class="gm-label">1 · Seed audio <span class="hint">what to continue · wav / mp3 / flac · ≤ 240 s</span></div>
|
| 367 |
+
<div class="gm-dropzone has-file">
|
| 368 |
+
<div class="filename">β unfinished_track_v3.wav</div>
|
| 369 |
+
<div class="meta">44.1 kHz · stereo · 1:42 · 18.0 MB · BPM detected 135 · key C minor</div>
|
| 370 |
+
<div class="miniwave"></div>
|
| 371 |
+
</div>
|
| 372 |
+
|
| 373 |
+
<div class="gm-label">2 · Extension prompt <span class="hint">style hint for what comes next</span></div>
|
| 374 |
+
<div class="gm-input">build to climax, layered acid leads, then breakdown</div>
|
| 375 |
+
|
| 376 |
+
<div class="gm-label">3 · Extension lyrics <span class="hint">optional · use [verse] [chorus] tags · blank = instrumental continuation</span></div>
|
| 377 |
+
<div class="gm-input gm-textarea">[bridge] the drop is coming...<br>[chorus] one more time, one more time</div>
|
| 378 |
+
|
| 379 |
+
<div class="gm-grid2">
|
| 380 |
+
<div>
|
| 381 |
+
<div class="gm-label">Extra duration <span class="hint">seconds</span></div>
|
| 382 |
+
<div class="gm-slider-row"><span class="name">5 – 120 s</span><span class="gm-slider p50"></span><span class="val">60</span></div>
|
| 383 |
+
</div>
|
| 384 |
+
<div>
|
| 385 |
+
<div class="gm-label">Vocal mode</div>
|
| 386 |
+
<div class="gm-pills"><div class="gm-pill on">With vocals</div><div class="gm-pill">Instrumental</div></div>
|
| 387 |
+
</div>
|
| 388 |
+
</div>
|
| 389 |
+
|
| 390 |
+
<div class="gm-label">Extend-specific <span class="hint">how the seam is handled</span></div>
|
| 391 |
+
<div class="gm-grid2">
|
| 392 |
+
<div><div class="gm-label">Repaint mode</div><div class="gm-select">balanced <span class="arrow">▾</span></div></div>
|
| 393 |
+
<div><div class="gm-label">Chunk mask</div><div class="gm-select">auto <span class="arrow">▾</span></div></div>
|
| 394 |
+
</div>
|
| 395 |
+
<div class="gm-slider-row"><span class="name">Repaint strength</span><span class="gm-slider p50"></span><span class="val">0.50</span></div>
|
| 396 |
+
<div class="gm-slider-row"><span class="name">Latent crossfade frames</span><span class="gm-slider p20"></span><span class="val">10</span></div>
|
| 397 |
+
<div class="gm-slider-row"><span class="name">WAV crossfade seconds</span><span class="gm-slider p10"></span><span class="val">2.0</span></div>
|
| 398 |
+
|
| 399 |
+
<!-- LoRA section, expanded -->
|
| 400 |
+
<div class="gm-section">
|
| 401 |
+
<div class="gm-section-h"><span>LoRA stack <span class="meta">· 1 active</span></span><span class="arrow">▾</span></div>
|
| 402 |
+
|
| 403 |
+
<div class="gm-label">Bundled presets</div>
|
| 404 |
+
<div style="margin-bottom:12px;">
|
| 405 |
+
<span class="gm-chip">RapMachine</span>
|
| 406 |
+
<span class="gm-chip">Chinese Rap</span>
|
| 407 |
+
<span class="gm-chip">Lyric2Vocal</span>
|
| 408 |
+
<span class="gm-chip">Text2Samples</span>
|
| 409 |
+
</div>
|
| 410 |
+
|
| 411 |
+
<div class="gm-label">Active stack</div>
|
| 412 |
+
<div class="gm-lora-row">
|
| 413 |
+
<span class="gm-lora-name">psytrance_v2 <small>· custom · 47 MB</small></span>
|
| 414 |
+
<span class="gm-slider p95" style="width:100px"></span>
|
| 415 |
+
<span class="val" style="color:#FFF; font-family:monospace; font-size:11px;">0.95</span>
|
| 416 |
+
<span class="gm-x">×</span>
|
| 417 |
+
</div>
|
| 418 |
+
|
| 419 |
+
<div style="margin-top:10px;">
|
| 420 |
+
<span class="gm-chip upload">β drop .safetensors here</span>
|
| 421 |
+
</div>
|
| 422 |
+
</div>
|
| 423 |
+
|
| 424 |
+
<!-- Advanced section, expanded -->
|
| 425 |
+
<div class="gm-section">
|
| 426 |
+
<div class="gm-section-h"><span>Advanced</span><span class="arrow">▾</span></div>
|
| 427 |
+
|
| 428 |
+
<div class="gm-grid3">
|
| 429 |
+
<div><div class="gm-label">BPM <span class="hint">inherits from seed if blank</span></div><div class="gm-input" style="margin-bottom:0">135</div></div>
|
| 430 |
+
<div><div class="gm-label">Key / scale</div><div class="gm-input" style="margin-bottom:0">C minor</div></div>
|
| 431 |
+
<div><div class="gm-label">Time sig</div><div class="gm-input" style="margin-bottom:0">4 / 4</div></div>
|
| 432 |
+
</div>
|
| 433 |
+
|
| 434 |
+
<div class="gm-grid2">
|
| 435 |
+
<div><div class="gm-label">Sampler</div><div class="gm-select">heun <span class="arrow">▾</span></div></div>
|
| 436 |
+
<div><div class="gm-label">Vocal language</div><div class="gm-select">en <span class="arrow">▾</span></div></div>
|
| 437 |
+
</div>
|
| 438 |
+
|
| 439 |
+
<div class="gm-slider-row"><span class="name">Inference steps</span><span class="gm-slider p20"></span><span class="val">50</span></div>
|
| 440 |
+
<div class="gm-slider-row"><span class="name">CFG scale</span><span class="gm-slider p40"></span><span class="val">5.0</span></div>
|
| 441 |
+
<div class="gm-slider-row"><span class="name">Shift</span><span class="gm-slider p30"></span><span class="val">3</span></div>
|
| 442 |
+
|
| 443 |
+
<div class="gm-label" style="margin-top:8px">Negative prompt</div>
|
| 444 |
+
<div class="gm-input gm-textarea" style="font-size:10px">bitcrushed, aliasing, lo-fi hiss</div>
|
| 445 |
+
|
| 446 |
+
<div class="gm-grid2">
|
| 447 |
+
<div><div class="gm-label">Audio format</div><div class="gm-pills"><div class="gm-pill on">mp3 320</div><div class="gm-pill">wav 44.1</div></div></div>
|
| 448 |
+
<div><div class="gm-label">Loudness</div><div class="gm-toggle on"><span class="box">✓</span> -14 LUFS</div></div>
|
| 449 |
+
</div>
|
| 450 |
+
|
| 451 |
+
<div class="gm-grid2">
|
| 452 |
+
<div><div class="gm-label">Seed</div><div class="gm-input" style="margin-bottom:0; font-family:monospace">9911</div></div>
|
| 453 |
+
<div><div class="gm-label"> </div><div class="gm-toggle"><span class="box"></span> Lock seed</div></div>
|
| 454 |
+
</div>
|
| 455 |
+
</div>
|
| 456 |
+
|
| 457 |
+
<!-- LM planner section, expanded -->
|
| 458 |
+
<div class="gm-section">
|
| 459 |
+
<div class="gm-section-h"><span>LM planner · Qwen3 thinking</span><span class="arrow">▾</span></div>
|
| 460 |
+
<div class="gm-toggle on"><span class="box">✓</span> Thinking enabled</div>
|
| 461 |
+
<div class="gm-toggle on"><span class="box">✓</span> Constrained decoding</div>
|
| 462 |
+
<div class="gm-grid4" style="margin-top:8px">
|
| 463 |
+
<div><div class="gm-label">Temp</div><div class="gm-input" style="margin-bottom:0">0.85</div></div>
|
| 464 |
+
<div><div class="gm-label">Top-k</div><div class="gm-input" style="margin-bottom:0">0</div></div>
|
| 465 |
+
<div><div class="gm-label">Top-p</div><div class="gm-input" style="margin-bottom:0">0.90</div></div>
|
| 466 |
+
<div><div class="gm-label">LM CFG</div><div class="gm-input" style="margin-bottom:0">2</div></div>
|
| 467 |
+
</div>
|
| 468 |
+
<div class="gm-label">CoT pipeline toggles</div>
|
| 469 |
+
<div class="gm-grid4">
|
| 470 |
+
<div class="gm-toggle"><span class="box"></span> metas</div>
|
| 471 |
+
<div class="gm-toggle"><span class="box"></span> caption</div>
|
| 472 |
+
<div class="gm-toggle"><span class="box"></span> lyrics</div>
|
| 473 |
+
<div class="gm-toggle"><span class="box"></span> language</div>
|
| 474 |
+
</div>
|
| 475 |
+
</div>
|
| 476 |
+
|
| 477 |
+
<!-- DCW section, expanded -->
|
| 478 |
+
<div class="gm-section">
|
| 479 |
+
<div class="gm-section-h"><span>DCW · dynamic CFG warping</span><span class="arrow">▾</span></div>
|
| 480 |
+
<div class="gm-toggle on"><span class="box">✓</span> DCW enabled</div>
|
| 481 |
+
<div class="gm-grid2">
|
| 482 |
+
<div><div class="gm-label">Mode</div><div class="gm-select">double <span class="arrow">▾</span></div></div>
|
| 483 |
+
<div><div class="gm-label">Wavelet</div><div class="gm-select">haar <span class="arrow">▾</span></div></div>
|
| 484 |
+
</div>
|
| 485 |
+
<div class="gm-slider-row"><span class="name">DCW scaler</span><span class="gm-slider p10"></span><span class="val">0.02</span></div>
|
| 486 |
+
<div class="gm-slider-row"><span class="name">High scaler</span><span class="gm-slider p10"></span><span class="val">0.06</span></div>
|
| 487 |
+
</div>
|
| 488 |
+
|
| 489 |
+
<div class="gm-btn">▶ Extend · est. ~50 s · output 2:42 total</div>
|
| 490 |
+
</div>
|
| 491 |
+
|
| 492 |
+
<!-- Output panel -->
|
| 493 |
+
<div class="gm-output">
|
| 494 |
+
<div class="gm-label" style="margin-bottom:10px">Output · extended · 2:42 · seed 9911</div>
|
| 495 |
+
|
| 496 |
+
<div class="gm-toggle on"><span class="box">✓</span> Show seed boundary marker</div>
|
| 497 |
+
|
| 498 |
+
<div class="gm-waveform" style="position:relative">
|
| 499 |
+
<div class="gm-bar" style="height:32%"></div><div class="gm-bar" style="height:48%"></div><div class="gm-bar" style="height:64%"></div><div class="gm-bar" style="height:42%"></div><div class="gm-bar" style="height:58%"></div><div class="gm-bar" style="height:38%"></div><div class="gm-bar" style="height:52%"></div><div class="gm-bar" style="height:46%"></div><div class="gm-bar" style="height:34%; opacity:0.5"></div>
|
| 500 |
+
<div style="border-left:1px dashed #FFF; height:48px;"></div>
|
| 501 |
+
<div class="gm-bar" style="height:62%"></div><div class="gm-bar" style="height:78%"></div><div class="gm-bar" style="height:92%"></div><div class="gm-bar" style="height:84%"></div><div class="gm-bar" style="height:70%"></div><div class="gm-bar" style="height:58%"></div><div class="gm-bar" style="height:40%"></div>
|
| 502 |
+
<div style="position:absolute; bottom:-2px; left:50%; transform:translateX(-50%); font-size:8px; color:#FFF; background:#0A0A0A; padding:0 4px;">↑ seed ends · 1:42</div>
|
| 503 |
+
</div>
|
| 504 |
+
|
| 505 |
+
<div class="gm-player-controls">
|
| 506 |
+
<span class="gm-play">▶</span>
|
| 507 |
+
<span>0:00 / 2:42</span>
|
| 508 |
+
<span style="margin-left:auto; cursor:pointer; color:#FFF">↻ retake</span>
|
| 509 |
+
</div>
|
| 510 |
+
|
| 511 |
+
<div class="gm-label">Stems · Demucs</div>
|
| 512 |
+
<div class="gm-stems">
|
| 513 |
+
<div class="gm-stem"><span>vocals</span><span class="dl">↓</span></div>
|
| 514 |
+
<div class="gm-stem"><span>drums</span><span class="dl">↓</span></div>
|
| 515 |
+
<div class="gm-stem"><span>bass</span><span class="dl">↓</span></div>
|
| 516 |
+
<div class="gm-stem"><span>other</span><span class="dl">↓</span></div>
|
| 517 |
+
</div>
|
| 518 |
+
|
| 519 |
+
<div class="gm-label">Export</div>
|
| 520 |
+
<div class="gm-actions">
|
| 521 |
+
<span class="gm-secondary">↓ full mp3 · 6.3 MB</span>
|
| 522 |
+
<span class="gm-secondary">↓ extension-only mp3 · 2.4 MB</span>
|
| 523 |
+
<span class="gm-secondary">↓ full wav</span>
|
| 524 |
+
<span class="gm-secondary">↓ stems zip</span>
|
| 525 |
+
<span class="gm-secondary">{ } meta json</span>
|
| 526 |
+
<span class="gm-secondary">β share link</span>
|
| 527 |
+
</div>
|
| 528 |
+
|
| 529 |
+
<div class="gm-label" style="margin-top:14px">Metadata</div>
|
| 530 |
+
<div class="gm-meta-block">
|
| 531 |
+
{<br>
|
| 532 |
+
"mode": "extend",<br>
|
| 533 |
+
"seed_audio_sha256": "e5c0...21ed",<br>
|
| 534 |
+
"seed_duration_s": 102,<br>
|
| 535 |
+
"extension_prompt": "build to climax, layered acid leads...",<br>
|
| 536 |
+
"extension_lyrics_first_line": "[bridge] the drop is coming...",<br>
|
| 537 |
+
"extra_duration_s": 60,<br>
|
| 538 |
+
"repaint_mode": "balanced",<br>
|
| 539 |
+
"repaint_strength": 0.5,<br>
|
| 540 |
+
"latent_crossfade_frames": 10,<br>
|
| 541 |
+
"wav_crossfade_s": 2.0,<br>
|
| 542 |
+
"chunk_mask_mode": "auto",<br>
|
| 543 |
+
"bpm": 135, "key": "C minor",<br>
|
| 544 |
+
"sampler": "heun", "steps": 50, "cfg": 5.0, "shift": 3,<br>
|
| 545 |
+
"lm": {"thinking": true, "temp": 0.85, "top_p": 0.9},<br>
|
| 546 |
+
"dcw": {"enabled": true, "mode": "double", "scaler": 0.02},<br>
|
| 547 |
+
"loras": [{"name": "psytrance_v2", "scale": 0.95, "sha256": "0c94..."}],<br>
|
| 548 |
+
"seed": 9911,<br>
|
| 549 |
+
"output_sha256": "9fbc...4071"<br>
|
| 550 |
+
}
|
| 551 |
+
</div>
|
| 552 |
+
</div>
|
| 553 |
+
</div>
|
| 554 |
+
</div>
|
| 555 |
+
</div>
|
| 556 |
+
|
| 557 |
+
<div class="options" style="margin-top:24px">
|
| 558 |
+
<div class="option" data-choice="approve" onclick="toggleSelect(this)">
|
| 559 |
+
<div class="letter">β</div>
|
| 560 |
+
<div class="content">
|
| 561 |
+
<h3>Both look right: show Edit + Lyrics + Generate (refreshed) next</h3>
|
| 562 |
+
<p>Cover and Extend hierarchies + control depth are correct. Continue.</p>
|
| 563 |
+
</div>
|
| 564 |
+
</div>
|
| 565 |
+
<div class="option" data-choice="revise" onclick="toggleSelect(this)">
|
| 566 |
+
<div class="letter">β</div>
|
| 567 |
+
<div class="content">
|
| 568 |
+
<h3>Revise: tell me which control / section</h3>
|
| 569 |
+
<p>Reply in terminal with specifics. I'll redo a single section without re-pushing the whole thing.</p>
|
| 570 |
+
</div>
|
| 571 |
+
</div>
|
| 572 |
+
</div>
|
|
@@ -0,0 +1,517 @@
|
| 1 |
+
<h2>Edit and Lyrics · everything expanded</h2>
|
| 2 |
+
<p class="subtitle">Edit has two sub-modes: Repaint (segment regeneration) and Flow Morph (caption-to-caption transformation). The Lyrics tab uses Qwen 2.5 7B Instruct to draft structurally tagged lyrics for the song modes.</p>
|
| 3 |
+
|
| 4 |
+
<style>
|
| 5 |
+
.gm { background:#0A0A0A; color:#E5E5E5; border:1px solid #1F1F1F; border-radius:10px; padding:18px; font-size:12px; line-height:1.5; margin-top:14px; }
|
| 6 |
+
.gm-header { display:flex; justify-content:space-between; align-items:center; padding-bottom:10px; border-bottom:1px solid #1F1F1F; margin-bottom:14px; }
|
| 7 |
+
.gm-brand { font-size:15px; font-weight:600; }
|
| 8 |
+
.gm-cta { font-size:11px; color:#6B6B6B; }
|
| 9 |
+
.gm-cta strong { color:#E5E5E5; }
|
| 10 |
+
.gm-status { font-size:10px; color:#6B6B6B; letter-spacing:0.08em; text-transform:uppercase; }
|
| 11 |
+
.gm-row { display:flex; gap:16px; align-items:flex-start; }
|
| 12 |
+
.gm-sidebar { background:#000; padding:14px 10px; border-radius:6px; min-width:170px; }
|
| 13 |
+
.gm-side { display:block; padding:8px 10px; border-radius:4px; margin-bottom:3px; font-size:12px; color:#6B6B6B; }
|
| 14 |
+
.gm-side.active { background:#1A1A1A; color:#FFF; border-left:2px solid #FFF; padding-left:8px; }
|
| 15 |
+
.gm-side .em { margin-right:6px; }
|
| 16 |
+
.gm-main { flex:1; display:flex; gap:14px; align-items:flex-start; }
|
| 17 |
+
.gm-form { flex:1.3; background:#141414; padding:16px; border-radius:6px; }
|
| 18 |
+
.gm-output { flex:1; background:#141414; padding:16px; border-radius:6px; min-width:260px; }
|
| 19 |
+
.gm-label { font-size:10px; text-transform:uppercase; letter-spacing:0.08em; color:#6B6B6B; margin-bottom:6px; display:flex; justify-content:space-between; align-items:center; }
|
| 20 |
+
.gm-label .hint { color:#5A5048; font-size:9px; text-transform:none; letter-spacing:normal; font-weight:400; }
|
| 21 |
+
.gm-input { background:#000; border:1px solid #2A2A2A; padding:8px 10px; border-radius:4px; color:#E5E5E5; margin-bottom:12px; font-size:11px; }
|
| 22 |
+
.gm-textarea { min-height:46px; }
|
| 23 |
+
.gm-grid2 { display:grid; grid-template-columns:1fr 1fr; gap:12px; margin-bottom:12px; }
|
| 24 |
+
.gm-grid3 { display:grid; grid-template-columns:1fr 1fr 1fr; gap:10px; margin-bottom:12px; }
|
| 25 |
+
.gm-grid4 { display:grid; grid-template-columns:1fr 1fr 1fr 1fr; gap:8px; margin-bottom:12px; }
|
| 26 |
+
.gm-slider-row { display:flex; align-items:center; gap:10px; padding:6px 8px; background:#000; border:1px solid #2A2A2A; border-radius:4px; margin-bottom:8px; font-size:11px; }
|
| 27 |
+
.gm-slider-row .name { color:#6B6B6B; font-size:10px; min-width:130px; }
|
| 28 |
+
.gm-slider { flex:1; height:3px; background:#2A2A2A; border-radius:2px; position:relative; }
|
| 29 |
+
.gm-slider::after { content:""; position:absolute; top:-4px; width:10px; height:10px; background:#FFF; border-radius:50%; }
|
| 30 |
+
.gm-slider.p10::after { left:10%; }
|
| 31 |
+
.gm-slider.p20::after { left:20%; }
|
| 32 |
+
.gm-slider.p25::after { left:25%; }
|
| 33 |
+
.gm-slider.p33::after { left:33%; }
|
| 34 |
+
.gm-slider.p40::after { left:40%; }
|
| 35 |
+
.gm-slider.p50::after { left:50%; }
|
| 36 |
+
.gm-slider.p60::after { left:60%; }
|
| 37 |
+
.gm-slider.p70::after { left:70%; }
|
| 38 |
+
.gm-slider.p85::after { left:85%; }
|
| 39 |
+
.gm-slider.p95::after { left:95%; }
|
| 40 |
+
.gm-slider-row .val { color:#FFF; font-family:monospace; font-size:11px; min-width:42px; text-align:right; }
|
| 41 |
+
.gm-toggle { display:flex; align-items:center; gap:8px; padding:6px 10px; background:#000; border:1px solid #2A2A2A; border-radius:4px; margin-bottom:8px; font-size:11px; cursor:pointer; }
|
| 42 |
+
.gm-toggle .box { width:14px; height:14px; border:1px solid #2A2A2A; border-radius:3px; display:inline-flex; align-items:center; justify-content:center; font-size:9px; }
|
| 43 |
+
.gm-toggle.on { color:#FFF; border-color:#FFF; }
|
| 44 |
+
.gm-toggle.on .box { background:#FFF; color:#0A0A0A; border-color:#FFF; }
|
| 45 |
+
.gm-pills { display:flex; gap:0; background:#000; border:1px solid #2A2A2A; border-radius:4px; padding:2px; margin-bottom:12px; }
|
| 46 |
+
.gm-pill { flex:1; text-align:center; padding:6px 10px; font-size:11px; color:#6B6B6B; border-radius:3px; cursor:pointer; }
|
| 47 |
+
.gm-pill.on { background:#FFF; color:#0A0A0A; }
|
| 48 |
+
.gm-select { background:#000; border:1px solid #2A2A2A; padding:8px 10px; border-radius:4px; color:#E5E5E5; font-size:11px; display:flex; justify-content:space-between; align-items:center; margin-bottom:8px; }
|
| 49 |
+
.gm-select .arrow { color:#6B6B6B; }
|
| 50 |
+
.gm-section { border:1px solid #2A2A2A; border-radius:4px; padding:14px; margin-top:14px; background:#0F0F0F; }
|
| 51 |
+
.gm-section.dim { opacity:0.4; }
|
| 52 |
+
.gm-section-h { display:flex; justify-content:space-between; align-items:center; margin-bottom:12px; font-size:11px; font-weight:600; }
|
| 53 |
+
.gm-section-h .arrow { color:#FFF; }
|
| 54 |
+
.gm-section-h .meta { color:#6B6B6B; font-weight:400; font-size:10px; }
|
| 55 |
+
.gm-chip { display:inline-block; padding:5px 10px; border-radius:14px; font-size:10px; margin-right:5px; margin-bottom:5px; background:#000; border:1px solid #2A2A2A; color:#6B6B6B; cursor:pointer; }
|
| 56 |
+
.gm-chip.on { border-color:#FFF; color:#FFF; }
|
| 57 |
+
.gm-chip.upload { border-style:dashed; color:#FFF; }
|
| 58 |
+
.gm-lora-row { display:flex; align-items:center; gap:10px; padding:8px 10px; background:#000; border:1px solid #2A2A2A; border-radius:4px; margin-bottom:6px; font-size:11px; }
|
| 59 |
+
.gm-lora-name { flex:1; }
|
| 60 |
+
.gm-lora-name small { color:#6B6B6B; font-weight:400; margin-left:4px; }
|
| 61 |
+
.gm-x { color:#6B6B6B; cursor:pointer; padding:0 4px; }
|
| 62 |
+
.gm-btn { background:#FFF; color:#0A0A0A; padding:12px 18px; border-radius:4px; font-weight:600; display:block; font-size:13px; text-align:center; cursor:pointer; margin-top:16px; }
|
| 63 |
+
.gm-dropzone { background:#000; border:2px dashed #2A2A2A; border-radius:6px; padding:14px; margin-bottom:12px; text-align:center; font-size:11px; color:#6B6B6B; }
|
| 64 |
+
.gm-dropzone.has-file { border-style:solid; border-color:#FFF; color:#FFF; text-align:left; padding:10px 12px; }
|
| 65 |
+
.gm-dropzone .filename { font-weight:600; }
|
| 66 |
+
.gm-dropzone .meta { color:#6B6B6B; font-size:9px; margin-top:2px; font-weight:400; }
|
| 67 |
+
.gm-dropzone .miniwave { height:18px; background:repeating-linear-gradient(90deg, currentColor 0 1px, transparent 1px 3px); margin-top:6px; opacity:0.5; }
|
| 68 |
+
.gm-waveform { height:60px; background:#000; border:1px solid #2A2A2A; border-radius:4px; display:flex; align-items:center; justify-content:center; gap:2px; padding:8px; margin-bottom:10px; position:relative; }
|
| 69 |
+
.gm-bar { width:2px; background:#E5E5E5; }
|
| 70 |
+
.gm-bar.muted { opacity:0.35; }
|
| 71 |
+
.gm-bar.highlight { background:#FFF; }
|
| 72 |
+
.gm-player-controls { display:flex; align-items:center; gap:10px; color:#6B6B6B; font-size:10px; margin-bottom:14px; }
|
| 73 |
+
.gm-play { width:28px; height:28px; border-radius:50%; background:#FFF; color:#0A0A0A; display:flex; align-items:center; justify-content:center; font-size:11px; }
|
| 74 |
+
.gm-meta-block { background:#000; border:1px solid #2A2A2A; border-radius:4px; padding:8px 10px; font-size:9px; color:#6B6B6B; font-family:monospace; line-height:1.6; max-height:160px; overflow:hidden; margin-top:8px; }
|
| 75 |
+
.gm-actions { display:flex; flex-wrap:wrap; gap:6px; margin-bottom:10px; }
|
| 76 |
+
.gm-secondary { border:1px solid #2A2A2A; color:#E5E5E5; padding:6px 12px; border-radius:4px; font-size:10px; cursor:pointer; }
|
| 77 |
+
.gm-segment-bar { position:relative; height:18px; background:#0F0F0F; border:1px solid #2A2A2A; border-radius:3px; margin:8px 0 12px; }
|
| 78 |
+
.gm-segment-bar .sel { position:absolute; top:0; bottom:0; background:#FFF; opacity:0.85; }
|
| 79 |
+
.gm-segment-bar .ticks { position:absolute; top:0; left:0; right:0; bottom:0; display:flex; justify-content:space-between; padding:0 2px; align-items:center; font-size:8px; color:#6B6B6B; font-family:monospace; pointer-events:none; }
|
| 80 |
+
.gm-segment-bar .label { position:absolute; top:-14px; font-size:8px; color:#FFF; font-family:monospace; }
|
| 81 |
+
|
| 82 |
+
/* Lyrics-specific */
|
| 83 |
+
.gm-lyrics-output { background:#000; border:1px solid #2A2A2A; border-radius:4px; padding:14px; margin-bottom:10px; font-family:Inter, system-ui, sans-serif; font-size:11px; line-height:1.7; color:#E5E5E5; min-height:240px; }
|
| 84 |
+
.gm-lyrics-output .section-tag { color:#FFF; font-weight:600; display:block; margin-top:10px; }
|
| 85 |
+
.gm-lyrics-output .section-tag:first-child { margin-top:0; }
|
| 86 |
+
.gm-lyrics-output .body { color:#B8B0A4; margin-left:0; }
|
| 87 |
+
</style>
|
| 88 |
+
|
| 89 |
+
|
| 90 |
+
<h3 style="margin-top:14px">✂️ Edit - fully expanded · Repaint sub-mode active</h3>
|
| 91 |
+
|
| 92 |
+
<div class="gm">
|
| 93 |
+
<div class="gm-header">
|
| 94 |
+
<div>
|
| 95 |
+
<div class="gm-brand">ACE Music Studio<span style="color:#FFF">.</span></div>
|
| 96 |
+
<div class="gm-cta" style="margin-top:2px">Built with <span style="color:#FFF">♥</span>. <strong>Drop a like</strong> · Follow <strong>@techfreakworm</strong></div>
|
| 97 |
+
</div>
|
| 98 |
+
<div class="gm-status">ready · MPS · M5 Max</div>
|
| 99 |
+
</div>
|
| 100 |
+
|
| 101 |
+
<div class="gm-row">
|
| 102 |
+
<div class="gm-sidebar">
|
| 103 |
+
<div class="gm-side"><span class="em">🎵</span>Generate</div>
|
| 104 |
+
<div class="gm-side"><span class="em">🎤</span>Cover</div>
|
| 105 |
+
<div class="gm-side"><span class="em">⏩</span>Extend</div>
|
| 106 |
+
<div class="gm-side active"><span class="em">✂️</span>Edit</div>
|
| 107 |
+
<div class="gm-side"><span class="em">✍️</span>Lyrics</div>
|
| 108 |
+
</div>
|
| 109 |
+
|
| 110 |
+
<div class="gm-main">
|
| 111 |
+
<div class="gm-form">
|
| 112 |
+
|
| 113 |
+
<div class="gm-label">1 · Source audio <span class="hint">the song you want to modify · ≤ 240 s</span></div>
|
| 114 |
+
<div class="gm-dropzone has-file">
|
| 115 |
+
<div class="filename">✓ my_song_draft.wav</div>
|
| 116 |
+
<div class="meta">44.1 kHz · stereo · 2:30 · 26.4 MB · BPM 138 · key A minor</div>
|
| 117 |
+
<div class="miniwave"></div>
|
| 118 |
+
</div>
|
| 119 |
+
|
| 120 |
+
<div class="gm-label">2 · Edit sub-mode</div>
|
| 121 |
+
<div class="gm-pills">
|
| 122 |
+
<div class="gm-pill on">Repaint segment</div>
|
| 123 |
+
<div class="gm-pill">Flow morph</div>
|
| 124 |
+
</div>
|
| 125 |
+
|
| 126 |
+
<div class="gm-label">3 · Source lyrics <span class="hint">paste the existing lyrics for context</span></div>
|
| 127 |
+
<div class="gm-input gm-textarea">[verse 1] original lyric line one<br>[chorus] original chorus<br>[verse 2] original lyric line two<br>[bridge] ...</div>
|
| 128 |
+
|
| 129 |
+
<div class="gm-label">4 · Target lyrics <span class="hint">replace only the segment selected below</span></div>
|
| 130 |
+
<div class="gm-input gm-textarea">[chorus] new chorus replaces the old<br>more punchy, more melodic</div>
|
| 131 |
+
|
| 132 |
+
<div class="gm-label">5 · Segment selection <span class="hint">drag handles on the waveform · or set timestamps</span></div>
|
| 133 |
+
<div class="gm-segment-bar">
|
| 134 |
+
<div class="sel" style="left:33%; width:25%;"></div>
|
| 135 |
+
<div class="ticks"><span>0:00</span><span>0:30</span><span>1:00</span><span>1:30</span><span>2:00</span><span>2:30</span></div>
|
| 136 |
+
<div class="label" style="left:33%">0:50</div>
|
| 137 |
+
<div class="label" style="left:58%">1:30</div>
|
| 138 |
+
</div>
|
| 139 |
+
<div class="gm-grid2">
|
| 140 |
+
<div><div class="gm-label">Segment start</div><div class="gm-input" style="margin-bottom:0; font-family:monospace">50.0 s</div></div>
|
| 141 |
+
<div><div class="gm-label">Segment end</div><div class="gm-input" style="margin-bottom:0; font-family:monospace">90.0 s</div></div>
|
| 142 |
+
</div>
|
| 143 |
+
|
| 144 |
+
<!-- Repaint sub-options -->
|
| 145 |
+
<div class="gm-section">
|
| 146 |
+
<div class="gm-section-h">
|
| 147 |
+
<span>Repaint options <span class="meta">· segment regeneration</span></span>
|
| 148 |
+
<span class="arrow">▾</span>
|
| 149 |
+
</div>
|
| 150 |
+
|
| 151 |
+
<div class="gm-grid2">
|
| 152 |
+
<div><div class="gm-label">Repaint mode</div><div class="gm-select">balanced <span class="arrow">▾</span></div></div>
|
| 153 |
+
<div><div class="gm-label">Chunk mask</div><div class="gm-select">auto <span class="arrow">▾</span></div></div>
|
| 154 |
+
</div>
|
| 155 |
+
<div class="gm-slider-row"><span class="name">Repaint strength</span><span class="gm-slider p50"></span><span class="val">0.50</span></div>
|
| 156 |
+
<div class="gm-slider-row"><span class="name">Latent crossfade frames</span><span class="gm-slider p20"></span><span class="val">10</span></div>
|
| 157 |
+
<div class="gm-slider-row"><span class="name">WAV crossfade seconds</span><span class="gm-slider p10"></span><span class="val">0.0</span></div>
|
| 158 |
+
<div class="gm-toggle on"><span class="box">✓</span> Preserve segment boundary phase</div>
|
| 159 |
+
</div>
|
| 160 |
+
|
| 161 |
+
<!-- Flow Morph sub-options, dimmed since Repaint is active -->
|
| 162 |
+
<div class="gm-section dim">
|
| 163 |
+
<div class="gm-section-h">
|
| 164 |
+
<span>Flow morph options <span class="meta">· caption-to-caption transformation · select "Flow morph" above to use</span></span>
|
| 165 |
+
<span class="arrow">▾</span>
|
| 166 |
+
</div>
|
| 167 |
+
|
| 168 |
+
<div class="gm-label">Source caption <span class="hint">describes what the segment currently is</span></div>
|
| 169 |
+
<div class="gm-input">acoustic ballad, gentle piano</div>
|
| 170 |
+
|
| 171 |
+
<div class="gm-label">Target caption <span class="hint">what to morph it into · prompt above is reused</span></div>
|
| 172 |
+
<div class="gm-input" style="opacity:0.5">(uses style prompt from step 2)</div>
|
| 173 |
+
|
| 174 |
+
<div class="gm-grid3">
|
| 175 |
+
<div><div class="gm-label">n_min</div><div class="gm-input" style="margin-bottom:0">0.0</div></div>
|
| 176 |
+
<div><div class="gm-label">n_max</div><div class="gm-input" style="margin-bottom:0">1.0</div></div>
|
| 177 |
+
<div><div class="gm-label">n_avg</div><div class="gm-input" style="margin-bottom:0">1</div></div>
|
| 178 |
+
</div>
|
| 179 |
+
<div class="gm-toggle"><span class="box"></span> Enable flow_edit_morph</div>
|
| 180 |
+
</div>
|
| 181 |
+
|
| 182 |
+
<!-- LoRA section, expanded -->
|
| 183 |
+
<div class="gm-section">
|
| 184 |
+
<div class="gm-section-h"><span>LoRA stack <span class="meta">· 1 active</span></span><span class="arrow">▾</span></div>
|
| 185 |
+
<div class="gm-label">Bundled presets</div>
|
| 186 |
+
<div style="margin-bottom:12px;">
|
| 187 |
+
<span class="gm-chip">RapMachine</span>
|
| 188 |
+
<span class="gm-chip">Chinese Rap</span>
|
| 189 |
+
<span class="gm-chip on">Lyric2Vocal</span>
|
| 190 |
+
<span class="gm-chip">Text2Samples</span>
|
| 191 |
+
</div>
|
| 192 |
+
<div class="gm-label">Active stack</div>
|
| 193 |
+
<div class="gm-lora-row">
|
| 194 |
+
<span class="gm-lora-name">Lyric2Vocal <small>· preset</small></span>
|
| 195 |
+
<span class="gm-slider p70" style="width:100px"></span>
|
| 196 |
+
<span class="val" style="color:#FFF; font-family:monospace; font-size:11px;">0.70</span>
|
| 197 |
+
<span class="gm-x">×</span>
|
| 198 |
+
</div>
|
| 199 |
+
<div style="margin-top:10px;">
|
| 200 |
+
<span class="gm-chip upload">↑ drop .safetensors here</span>
|
| 201 |
+
</div>
|
| 202 |
+
</div>
|
| 203 |
+
|
| 204 |
+
<!-- Advanced section, expanded -->
|
| 205 |
+
<div class="gm-section">
|
| 206 |
+
<div class="gm-section-h"><span>Advanced</span><span class="arrow">▾</span></div>
|
| 207 |
+
<div class="gm-grid3">
|
| 208 |
+
<div><div class="gm-label">BPM <span class="hint">inherits from source</span></div><div class="gm-input" style="margin-bottom:0">138</div></div>
|
| 209 |
+
<div><div class="gm-label">Key / scale</div><div class="gm-input" style="margin-bottom:0">A minor</div></div>
|
| 210 |
+
<div><div class="gm-label">Time sig</div><div class="gm-input" style="margin-bottom:0">4 / 4</div></div>
|
| 211 |
+
</div>
|
| 212 |
+
<div class="gm-grid2">
|
| 213 |
+
<div><div class="gm-label">Sampler</div><div class="gm-select">heun <span class="arrow">▾</span></div></div>
|
| 214 |
+
<div><div class="gm-label">Vocal language</div><div class="gm-select">en <span class="arrow">▾</span></div></div>
|
| 215 |
+
</div>
|
| 216 |
+
<div class="gm-slider-row"><span class="name">Inference steps</span><span class="gm-slider p20"></span><span class="val">50</span></div>
|
| 217 |
+
<div class="gm-slider-row"><span class="name">CFG scale</span><span class="gm-slider p40"></span><span class="val">5.0</span></div>
|
| 218 |
+
<div class="gm-slider-row"><span class="name">Shift</span><span class="gm-slider p33"></span><span class="val">3</span></div>
|
| 219 |
+
<div class="gm-label" style="margin-top:8px">Negative prompt</div>
|
| 220 |
+
<div class="gm-input" style="font-size:10px; margin-bottom:8px">bitcrushed, aliasing, off-key</div>
|
| 221 |
+
<div class="gm-grid2">
|
| 222 |
+
<div><div class="gm-label">Audio format</div><div class="gm-pills"><div class="gm-pill on">mp3 320</div><div class="gm-pill">wav 44.1</div></div></div>
|
| 223 |
+
<div><div class="gm-label">Loudness</div><div class="gm-toggle on"><span class="box">✓</span> -14 LUFS</div></div>
|
| 224 |
+
</div>
|
| 225 |
+
<div class="gm-grid2">
|
| 226 |
+
<div><div class="gm-label">Seed</div><div class="gm-input" style="margin-bottom:0; font-family:monospace">7331</div></div>
|
| 227 |
+
<div><div class="gm-label"> </div><div class="gm-toggle"><span class="box"></span> Lock seed</div></div>
|
| 228 |
+
</div>
|
| 229 |
+
</div>
|
| 230 |
+
|
| 231 |
+
<!-- LM planner -->
|
| 232 |
+
<div class="gm-section">
|
| 233 |
+
<div class="gm-section-h"><span>LM planner · Qwen3 thinking</span><span class="arrow">▾</span></div>
|
| 234 |
+
<div class="gm-toggle on"><span class="box">✓</span> Thinking enabled</div>
|
| 235 |
+
<div class="gm-toggle on"><span class="box">✓</span> Constrained decoding</div>
|
| 236 |
+
<div class="gm-grid4" style="margin-top:8px">
|
| 237 |
+
<div><div class="gm-label">Temp</div><div class="gm-input" style="margin-bottom:0">0.85</div></div>
|
| 238 |
+
<div><div class="gm-label">Top-k</div><div class="gm-input" style="margin-bottom:0">0</div></div>
|
| 239 |
+
<div><div class="gm-label">Top-p</div><div class="gm-input" style="margin-bottom:0">0.90</div></div>
|
| 240 |
+
<div><div class="gm-label">LM CFG</div><div class="gm-input" style="margin-bottom:0">2</div></div>
|
| 241 |
+
</div>
|
| 242 |
+
<div class="gm-label">CoT toggles</div>
|
| 243 |
+
<div class="gm-grid4">
|
| 244 |
+
<div class="gm-toggle"><span class="box"></span> metas</div>
|
| 245 |
+
<div class="gm-toggle"><span class="box"></span> caption</div>
|
| 246 |
+
<div class="gm-toggle on"><span class="box">✓</span> lyrics</div>
|
| 247 |
+
<div class="gm-toggle"><span class="box"></span> language</div>
|
| 248 |
+
</div>
|
| 249 |
+
</div>
|
| 250 |
+
|
| 251 |
+
<!-- DCW -->
|
| 252 |
+
<div class="gm-section">
|
| 253 |
+
<div class="gm-section-h"><span>DCW · dynamic CFG warping</span><span class="arrow">▾</span></div>
|
| 254 |
+
<div class="gm-toggle on"><span class="box">✓</span> DCW enabled</div>
|
| 255 |
+
<div class="gm-grid2">
|
| 256 |
+
<div><div class="gm-label">Mode</div><div class="gm-select">double <span class="arrow">▾</span></div></div>
|
| 257 |
+
<div><div class="gm-label">Wavelet</div><div class="gm-select">haar <span class="arrow">▾</span></div></div>
|
| 258 |
+
</div>
|
| 259 |
+
<div class="gm-slider-row"><span class="name">DCW scaler</span><span class="gm-slider p10"></span><span class="val">0.02</span></div>
|
| 260 |
+
<div class="gm-slider-row"><span class="name">High scaler</span><span class="gm-slider p10"></span><span class="val">0.06</span></div>
|
| 261 |
+
</div>
|
| 262 |
+
|
| 263 |
+
<div class="gm-btn">▶ Repaint segment 0:50 → 1:30 · est. ~25 s on M5 Max</div>
|
| 264 |
+
</div>
|
| 265 |
+
|
| 266 |
+
<!-- Output -->
|
| 267 |
+
<div class="gm-output">
|
| 268 |
+
<div class="gm-label" style="margin-bottom:10px">Output · edited · 2:30 · seed 7331 · segment 0:50 → 1:30</div>
|
| 269 |
+
|
| 270 |
+
<div class="gm-toggle on"><span class="box">✓</span> Show edited region (highlighted on waveform)</div>
|
| 271 |
+
|
| 272 |
+
<div class="gm-waveform">
|
| 273 |
+
<div class="gm-bar muted" style="height:32%"></div>
|
| 274 |
+
<div class="gm-bar muted" style="height:48%"></div>
|
| 275 |
+
<div class="gm-bar muted" style="height:60%"></div>
|
| 276 |
+
<div class="gm-bar muted" style="height:42%"></div>
|
| 277 |
+
<div class="gm-bar muted" style="height:54%"></div>
|
| 278 |
+
<div class="gm-bar highlight" style="height:78%"></div>
|
| 279 |
+
<div class="gm-bar highlight" style="height:92%"></div>
|
| 280 |
+
<div class="gm-bar highlight" style="height:84%"></div>
|
| 281 |
+
<div class="gm-bar highlight" style="height:70%"></div>
|
| 282 |
+
<div class="gm-bar highlight" style="height:88%"></div>
|
| 283 |
+
<div class="gm-bar highlight" style="height:62%"></div>
|
| 284 |
+
<div class="gm-bar muted" style="height:48%"></div>
|
| 285 |
+
<div class="gm-bar muted" style="height:36%"></div>
|
| 286 |
+
<div class="gm-bar muted" style="height:42%"></div>
|
| 287 |
+
<div class="gm-bar muted" style="height:30%"></div>
|
| 288 |
+
<div class="gm-bar muted" style="height:38%"></div>
|
| 289 |
+
</div>
|
| 290 |
+
|
| 291 |
+
<div class="gm-player-controls">
|
| 292 |
+
<span class="gm-play">▶</span>
|
| 293 |
+
<span>0:00 / 2:30</span>
|
| 294 |
+
<span style="margin-left:auto; cursor:pointer; color:#FFF">↻ retake segment</span>
|
| 295 |
+
</div>
|
| 296 |
+
|
| 297 |
+
<div class="gm-label">A / B comparison</div>
|
| 298 |
+
<div class="gm-grid2">
|
| 299 |
+
<div class="gm-secondary" style="text-align:center">▶ original</div>
|
| 300 |
+
<div class="gm-secondary" style="text-align:center; border-color:#FFF; color:#FFF">▶ edited</div>
|
| 301 |
+
</div>
|
| 302 |
+
|
| 303 |
+
<div class="gm-label" style="margin-top:10px">Stems · Demucs</div>
|
| 304 |
+
<div style="display:grid; grid-template-columns:1fr 1fr; gap:6px; margin-bottom:10px;">
|
| 305 |
+
<div style="background:#000; border:1px solid #2A2A2A; border-radius:4px; padding:6px 10px; font-size:10px; display:flex; justify-content:space-between;"><span>vocals</span><span style="color:#FFF">↓</span></div>
|
| 306 |
+
<div style="background:#000; border:1px solid #2A2A2A; border-radius:4px; padding:6px 10px; font-size:10px; display:flex; justify-content:space-between;"><span>drums</span><span style="color:#FFF">↓</span></div>
|
| 307 |
+
<div style="background:#000; border:1px solid #2A2A2A; border-radius:4px; padding:6px 10px; font-size:10px; display:flex; justify-content:space-between;"><span>bass</span><span style="color:#FFF">↓</span></div>
|
| 308 |
+
<div style="background:#000; border:1px solid #2A2A2A; border-radius:4px; padding:6px 10px; font-size:10px; display:flex; justify-content:space-between;"><span>other</span><span style="color:#FFF">↓</span></div>
|
| 309 |
+
</div>
|
| 310 |
+
|
| 311 |
+
<div class="gm-label">Export</div>
|
| 312 |
+
<div class="gm-actions">
|
| 313 |
+
<span class="gm-secondary">↓ full mp3</span>
|
| 314 |
+
<span class="gm-secondary">↓ segment-only mp3</span>
|
| 315 |
+
<span class="gm-secondary">↓ wav</span>
|
| 316 |
+
<span class="gm-secondary">↓ stems zip</span>
|
| 317 |
+
<span class="gm-secondary">{ } meta</span>
|
| 318 |
+
</div>
|
| 319 |
+
|
| 320 |
+
<div class="gm-label" style="margin-top:14px">Metadata</div>
|
| 321 |
+
<div class="gm-meta-block">
|
| 322 |
+
{<br>
|
| 323 |
+
"mode": "edit", "sub_mode": "repaint",<br>
|
| 324 |
+
"source_audio_sha256": "1a4f...8e7d",<br>
|
| 325 |
+
"segment_start_s": 50.0, "segment_end_s": 90.0,<br>
|
| 326 |
+
"repaint_mode": "balanced", "repaint_strength": 0.5,<br>
|
| 327 |
+
"latent_crossfade_frames": 10, "wav_crossfade_s": 0.0,<br>
|
| 328 |
+
"chunk_mask_mode": "auto",<br>
|
| 329 |
+
"source_lyrics_hash": "3c2e...44ab",<br>
|
| 330 |
+
"target_lyrics_first_line": "[chorus] new chorus replaces the old...",<br>
|
| 331 |
+
"bpm": 138, "key": "A minor", "sampler": "heun", "steps": 50,<br>
|
| 332 |
+
"loras": [{"name": "Lyric2Vocal", "scale": 0.7}],<br>
|
| 333 |
+
"seed": 7331,<br>
|
| 334 |
+
"output_sha256": "b7a2...c019"<br>
|
| 335 |
+
}
|
| 336 |
+
</div>
|
| 337 |
+
</div>
|
| 338 |
+
</div>
|
| 339 |
+
</div>
|
| 340 |
+
</div>
|
| 341 |
+
|
| 342 |
+
|
| 343 |
+
<h3 style="margin-top:30px">✍️ Lyrics - fully expanded · Qwen 2.5 7B Instruct</h3>
|
| 344 |
+
|
| 345 |
+
<div class="gm">
|
| 346 |
+
<div class="gm-header">
|
| 347 |
+
<div>
|
| 348 |
+
<div class="gm-brand">ACE Music Studio<span style="color:#FFF">.</span></div>
|
| 349 |
+
<div class="gm-cta" style="margin-top:2px">Built with <span style="color:#FFF">♥</span>. <strong>Drop a like</strong> · Follow <strong>@techfreakworm</strong></div>
|
| 350 |
+
</div>
|
| 351 |
+
<div class="gm-status">ready · MPS · M5 Max · Qwen 2.5 7B</div>
|
| 352 |
+
</div>
|
| 353 |
+
|
| 354 |
+
<div class="gm-row">
|
| 355 |
+
<div class="gm-sidebar">
|
| 356 |
+
<div class="gm-side"><span class="em">🎵</span>Generate</div>
|
| 357 |
+
<div class="gm-side"><span class="em">🎤</span>Cover</div>
|
| 358 |
+
<div class="gm-side"><span class="em">⏩</span>Extend</div>
|
| 359 |
+
<div class="gm-side"><span class="em">✂️</span>Edit</div>
|
| 360 |
+
<div class="gm-side active"><span class="em">✍️</span>Lyrics</div>
|
| 361 |
+
</div>
|
| 362 |
+
|
| 363 |
+
<div class="gm-main">
|
| 364 |
+
<div class="gm-form">
|
| 365 |
+
|
| 366 |
+
<div class="gm-label">1 · Brief <span class="hint">describe the song in plain language</span></div>
|
| 367 |
+
<div class="gm-input gm-textarea" style="min-height:80px">A driving psytrance anthem about losing yourself on the dancefloor at sunrise. First-person, present tense, references to lights, kick drum, transcendence. Avoid clichés like "feel the beat".</div>
|
| 368 |
+
|
| 369 |
+
<div class="gm-grid2">
|
| 370 |
+
<div>
|
| 371 |
+
<div class="gm-label">Target structure <span class="hint">section sequence</span></div>
|
| 372 |
+
<div class="gm-input" style="margin-bottom:0">intro, verse, chorus, verse, chorus, bridge, chorus, outro</div>
|
| 373 |
+
</div>
|
| 374 |
+
<div>
|
| 375 |
+
<div class="gm-label">Language</div>
|
| 376 |
+
<div class="gm-select" style="margin-bottom:0">English (en) <span class="arrow">▾</span></div>
|
| 377 |
+
</div>
|
| 378 |
+
</div>
|
| 379 |
+
|
| 380 |
+
<div class="gm-grid3">
|
| 381 |
+
<div>
|
| 382 |
+
<div class="gm-label">Verse lines</div>
|
| 383 |
+
<div class="gm-input" style="margin-bottom:0">6</div>
|
| 384 |
+
</div>
|
| 385 |
+
<div>
|
| 386 |
+
<div class="gm-label">Chorus lines</div>
|
| 387 |
+
<div class="gm-input" style="margin-bottom:0">4</div>
|
| 388 |
+
</div>
|
| 389 |
+
<div>
|
| 390 |
+
<div class="gm-label">Bridge lines</div>
|
| 391 |
+
<div class="gm-input" style="margin-bottom:0">2</div>
|
| 392 |
+
</div>
|
| 393 |
+
</div>
|
| 394 |
+
|
| 395 |
+
<div class="gm-label">Tone / mood <span class="hint">optional · comma-separated descriptors</span></div>
|
| 396 |
+
<div class="gm-input">euphoric, hypnotic, transcendent, not cheesy</div>
|
| 397 |
+
|
| 398 |
+
<div class="gm-label">Rhyme preference</div>
|
| 399 |
+
<div class="gm-pills">
|
| 400 |
+
<div class="gm-pill">Strict (AABB)</div>
|
| 401 |
+
<div class="gm-pill on">Loose (ABAB / free)</div>
|
| 402 |
+
<div class="gm-pill">None</div>
|
| 403 |
+
</div>
|
| 404 |
+
|
| 405 |
+
<!-- LM parameters, expanded -->
|
| 406 |
+
<div class="gm-section">
|
| 407 |
+
<div class="gm-section-h">
|
| 408 |
+
<span>LM parameters <span class="meta">· Qwen 2.5 7B Instruct (Apache 2.0)</span></span>
|
| 409 |
+
<span class="arrow">▾</span>
|
| 410 |
+
</div>
|
| 411 |
+
|
| 412 |
+
<div class="gm-grid4">
|
| 413 |
+
<div><div class="gm-label">Temperature</div><div class="gm-input" style="margin-bottom:0">0.85</div></div>
|
| 414 |
+
<div><div class="gm-label">Top-p</div><div class="gm-input" style="margin-bottom:0">0.90</div></div>
|
| 415 |
+
<div><div class="gm-label">Top-k</div><div class="gm-input" style="margin-bottom:0">40</div></div>
|
| 416 |
+
<div><div class="gm-label">Rep. penalty</div><div class="gm-input" style="margin-bottom:0">1.10</div></div>
|
| 417 |
+
</div>
|
| 418 |
+
|
| 419 |
+
<div class="gm-grid2">
|
| 420 |
+
<div><div class="gm-label">Max new tokens</div><div class="gm-input" style="margin-bottom:0">600</div></div>
|
| 421 |
+
<div><div class="gm-label">Seed</div><div class="gm-input" style="margin-bottom:0; font-family:monospace">42</div></div>
|
| 422 |
+
</div>
|
| 423 |
+
|
| 424 |
+
<div class="gm-toggle"><span class="box"></span> Show system prompt</div>
|
| 425 |
+
<div class="gm-toggle on"><span class="box">✓</span> Enforce structural-tag format <span style="color:#6B6B6B; font-size:9px; margin-left:auto">stop at [end]</span></div>
|
| 426 |
+
</div>
|
| 427 |
+
|
| 428 |
+
<div class="gm-btn">▶ Draft lyrics · est. ~8 s on M5 Max</div>
|
| 429 |
+
</div>
|
| 430 |
+
|
| 431 |
+
<!-- Lyrics output -->
|
| 432 |
+
<div class="gm-output">
|
| 433 |
+
<div class="gm-label" style="margin-bottom:10px">Draft · 1 of 1 · 312 tokens · 6.2 s</div>
|
| 434 |
+
|
| 435 |
+
<div class="gm-lyrics-output">
|
| 436 |
+
<span class="section-tag">[intro]</span>
|
| 437 |
+
<span class="body">the lights start low, the bass starts slow<br>another night, another holy show</span>
|
| 438 |
+
|
| 439 |
+
<span class="section-tag">[verse 1]</span>
|
| 440 |
+
<span class="body">six in the morning, the sun's still pretending<br>kick drum carries what the night was sending<br>shoes off, eyes closed, the city's still bending<br>no one's here for an ending<br>just one more lift, one more descending<br>the room is breathing, the floor is mending</span>
|
| 441 |
+
|
| 442 |
+
<span class="section-tag">[chorus]</span>
|
| 443 |
+
<span class="body">we let go, we let go, we let go<br>oh the morning, oh the morning<br>arms up, head down, no warning<br>we let go, we let go, we let go</span>
|
| 444 |
+
|
| 445 |
+
<span class="section-tag">[verse 2]</span>
|
| 446 |
+
<span class="body">...</span>
|
| 447 |
+
|
| 448 |
+
<span class="section-tag">[bridge]</span>
|
| 449 |
+
<span class="body">...</span>
|
| 450 |
+
|
| 451 |
+
<span class="section-tag">[outro]</span>
|
| 452 |
+
<span class="body">...</span>
|
| 453 |
+
</div>
|
| 454 |
+
|
| 455 |
+
<div class="gm-actions" style="margin-bottom:14px">
|
| 456 |
+
<span class="gm-secondary" style="border-color:#FFF; color:#FFF">→ Use these in Generate</span>
|
| 457 |
+
<span class="gm-secondary">↻ regenerate</span>
|
| 458 |
+
<span class="gm-secondary">↻ continue from cursor</span>
|
| 459 |
+
<span class="gm-secondary">✎ edit inline</span>
|
| 460 |
+
<span class="gm-secondary">↓ .txt</span>
|
| 461 |
+
</div>
|
| 462 |
+
|
| 463 |
+
<div class="gm-label">Quick refinements <span class="hint">click to apply to next regeneration</span></div>
|
| 464 |
+
<div style="margin-bottom:14px;">
|
| 465 |
+
<span class="gm-chip">more cryptic</span>
|
| 466 |
+
<span class="gm-chip">less rhyme</span>
|
| 467 |
+
<span class="gm-chip">more concrete imagery</span>
|
| 468 |
+
<span class="gm-chip">shorter lines</span>
|
| 469 |
+
<span class="gm-chip">add chorus hook</span>
|
| 470 |
+
</div>
|
| 471 |
+
|
| 472 |
+
<div class="gm-label">Variants</div>
|
| 473 |
+
<div class="gm-grid2">
|
| 474 |
+
<div class="gm-secondary" style="text-align:center; border-color:#FFF; color:#FFF">v1 · current</div>
|
| 475 |
+
<div class="gm-secondary" style="text-align:center">+ generate v2</div>
|
| 476 |
+
</div>
|
| 477 |
+
|
| 478 |
+
<div class="gm-label" style="margin-top:14px">Metadata</div>
|
| 479 |
+
<div class="gm-meta-block">
|
| 480 |
+
{<br>
|
| 481 |
+
"mode": "lyrics",<br>
|
| 482 |
+
"model": "Qwen2.5-7B-Instruct",<br>
|
| 483 |
+
"brief_first_line": "A driving psytrance anthem about losing yourself...",<br>
|
| 484 |
+
"structure": ["intro", "verse", "chorus", "verse", "chorus", "bridge", "chorus", "outro"],<br>
|
| 485 |
+
"language": "en",<br>
|
| 486 |
+
"tone": "euphoric, hypnotic, transcendent, not cheesy",<br>
|
| 487 |
+
"verse_lines": 6, "chorus_lines": 4, "bridge_lines": 2,<br>
|
| 488 |
+
"rhyme_preference": "loose",<br>
|
| 489 |
+
"temperature": 0.85, "top_p": 0.9, "top_k": 40,<br>
|
| 490 |
+
"repetition_penalty": 1.1, "max_new_tokens": 600,<br>
|
| 491 |
+
"seed": 42,<br>
|
| 492 |
+
"tokens_generated": 312, "wall_seconds": 6.2,<br>
|
| 493 |
+
"output_sha256": "f1a3...88e2"<br>
|
| 494 |
+
}
|
| 495 |
+
</div>
|
| 496 |
+
</div>
|
| 497 |
+
</div>
|
| 498 |
+
</div>
|
| 499 |
+
</div>
|
| 500 |
+
|
| 501 |
+
|
| 502 |
+
<div class="options" style="margin-top:24px">
|
| 503 |
+
<div class="option" data-choice="approve" onclick="toggleSelect(this)">
|
| 504 |
+
<div class="letter">✓</div>
|
| 505 |
+
<div class="content">
|
| 506 |
+
<h3>Both look right - refresh Generate next, then mobile + error states</h3>
|
| 507 |
+
<p>Edit (with both sub-modes visible) and Lyrics (with LM params + quick-refinement chips) work. Continue.</p>
|
| 508 |
+
</div>
|
| 509 |
+
</div>
|
| 510 |
+
<div class="option" data-choice="revise" onclick="toggleSelect(this)">
|
| 511 |
+
<div class="letter">✎</div>
|
| 512 |
+
<div class="content">
|
| 513 |
+
<h3>Revise - tell me which control or section</h3>
|
| 514 |
+
<p>Reply in terminal with specifics.</p>
|
| 515 |
+
</div>
|
| 516 |
+
</div>
|
| 517 |
+
</div>
|
|
@@ -0,0 +1,50 @@
# ACE Music Studio - UI mockups

Visual source-of-truth for the design spec at `../2026-05-18-ace-music-studio-design.md`. Open the HTML files in a browser to see the rendered Brutalist Mono interface.

| File | Tabs / screens covered | Source |
|---|---|---|
| [`01_generate_mobile_errors.html`](./01_generate_mobile_errors.html) | **Generate** tab fully expanded · 3 phone screens (Generate, Cover, Lyrics) · 6 error / edge-case states · in-progress generation banner | brainstorm session 24743 |
| [`02_cover_extend.html`](./02_cover_extend.html) | **Cover** tab fully expanded · **Extend** tab fully expanded | brainstorm session 24743 |
| [`03_edit_lyrics.html`](./03_edit_lyrics.html) | **Edit** tab fully expanded with both sub-modes (Repaint active, Flow Morph dimmed) · **Lyrics** tab fully expanded with Qwen 2.5 LM params | brainstorm session 24743 |
|
| 10 |
+
|
| 11 |
+
## What every tab shares
|
| 12 |
+
|
| 13 |
+
- Sticky header with brand "ACE Music Studio." and CTA: *Built with β₯. Drop a like Β· Follow @techfreakworm for what's next.*
|
| 14 |
+
- Sidebar with 5 mode pills + session History list (desktop ≥ 1024 px)
|
| 15 |
+
- 2-column body: form on left, output on right
|
| 16 |
+
- LoRA stack section with 4 bundled preset chips + active stack rows (per-row strength slider + ×) + custom upload zone
|
| 17 |
+
- Advanced accordion: BPM, key/scale, time sig, sampler, language, steps, CFG, shift, negative prompt, audio format, loudness, fade in/out, seed + lock
|
| 18 |
+
- LM planner accordion: thinking, constrained decoding, temp / top-k / top-p / LM CFG, CoT toggles (metas / caption / lyrics / language), LM negative prompt, CoT override fields
|
| 19 |
+
- DCW accordion: enabled, mode (single / double), wavelet, scaler, high scaler
|
| 20 |
+
- Output panel: waveform · play/scrub · retake · stems (Demucs htdemucs_ft) · export (mp3 / wav / stems zip / meta JSON / share link) · full metadata JSON
|
| 21 |
+
|
| 22 |
+
## What each tab adds
|
| 23 |
+
|
| 24 |
+
- **Generate**: duration slider, vocals/instrumental pills, CFG-interval start/end, latent shift/rescale
|
| 25 |
+
- **Cover**: reference-audio dropzone, cover-strength slider, cover-noise slider, compare-side-by-side toggle in output
|
| 26 |
+
- **Extend**: seed-audio dropzone with auto-detected BPM/key, extension prompt, extra-duration slider, repaint mode, repaint strength, latent crossfade frames, WAV crossfade seconds, chunk mask mode, seed-boundary marker on output waveform, separate "extension-only" download
|
| 27 |
+
- **Edit**: source audio + source/target lyrics, repaint-vs-flow-morph sub-mode pills, segment-selection bar with start/end timestamps, repaint sub-options (mode / chunk-mask / strength / crossfade), flow-morph sub-options (source caption / n_min / n_max / n_avg), A/B comparison in output
|
| 28 |
+
- **Lyrics**: brief, structure sequence, language, per-section line counts (verse / chorus / bridge), tone descriptors, rhyme preference pills (strict / loose / none), LM params accordion (temp / top-p / top-k / rep penalty / max tokens / seed / show system prompt / enforce-tag-format), quick-refinement chips (more cryptic, less rhyme, etc.), variants
|
| 29 |
+
|
| 30 |
+
## Mobile (phone)
|
| 31 |
+
|
| 32 |
+
- Native `gr.Tabs` horizontal scroll strip at top (icons + first label visible)
|
| 33 |
+
- Sidebar hidden via CSS media query at `< 640 px`
|
| 34 |
+
- Output stacks below form
|
| 35 |
+
- Sliders bounded by parent width (the desktop's pixel-art `β` characters were replaced with proper CSS slider tracks for mobile)
|
| 36 |
+
|
| 37 |
+
## Error / edge states
|
| 38 |
+
|
| 39 |
+
- **LoRAValidationError**: toast with module-mismatch diagnostics + "Remove from stack" / "View header diagnostics" actions
|
| 40 |
+
- **ZeroGPU timeout**: auto-retry once at 2× duration, then warning toast with "Lower steps" / "Reduce duration" hints
|
| 41 |
+
- **MPS op fallback**: info toast naming the op (e.g., `aten::_fft_r2c`), CPU fallback engaged via `PYTORCH_ENABLE_MPS_FALLBACK=1`
|
| 42 |
+
- **Audio format rejected**: clear constraints (wav/mp3/flac, ≤ 60 s for Cover, ≤ 50 MB) + "Auto-convert + trim" action
|
| 43 |
+
- **First-request warm-up**: informational banner ("Loading ACE-Step v1.5 XL SFT into MPS memory ~45 s")
|
| 44 |
+
- **In-progress generation**: `gr.Progress`-driven banner with step / total, ETA, elapsed, cancel link
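
Two of the states above (the upload constraints and the single retry on a ZeroGPU timeout) can be sketched in a few lines of Python. This is a minimal illustration, not the Space's implementation: the function names are hypothetical, "50 MB" is assumed to mean 50 MiB, and the "2× duration" is assumed to refer to the GPU time budget passed to the generation call.

```python
ALLOWED_FORMATS = {"wav", "mp3", "flac"}
MAX_COVER_SECONDS = 60                 # Cover-tab limit from the mockup
MAX_UPLOAD_BYTES = 50 * 1024 * 1024    # assumption: "50 MB" read as 50 MiB


def validate_upload(filename: str, duration_s: float, size_bytes: int) -> list[str]:
    """Return human-readable problems; an empty list means the file is accepted."""
    problems = []
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext not in ALLOWED_FORMATS:
        problems.append(f"unsupported format: {ext or 'none'}")
    if duration_s > MAX_COVER_SECONDS:
        problems.append(f"too long: {duration_s:.0f} s > {MAX_COVER_SECONDS} s")
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append("file larger than 50 MB")
    return problems


def run_with_one_retry(generate, budget_s: int):
    """Call generate(budget_s); on TimeoutError retry exactly once with 2x the budget.

    A second timeout propagates to the caller, which would then show the warning toast.
    """
    try:
        return generate(budget_s)
    except TimeoutError:
        return generate(budget_s * 2)
```

Returning a list of problems rather than raising on the first one lets the toast show every violated constraint at once.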
|
| 45 |
+
|
| 46 |
+
## Note on the "approve / revise" cards
|
| 47 |
+
|
| 48 |
+
Each HTML file has a card-options block at the bottom, vestigial from the visual-companion brainstorm flow. It's harmless when viewed outside the companion (the `toggleSelect` call is a no-op without the companion's helper.js).
|
| 49 |
+
|
| 50 |
+
If they bother you, delete the trailing `<div class="options">…</div>` block from each file. Otherwise leave them; they document which question each mockup answered.
|
|
@@ -0,0 +1,122 @@
| 1 |
+
# Open-Source Song Generation for a Suno-Like Platform: Executive Summary
|
| 2 |
+
|
| 3 |
+
*Research compiled 2026-05-18. Target hardware: Apple M5 Max, 128 GB unified memory, MPS backend. Deployment target: **free non-profit Hugging Face Space.** Commercial license is NOT a constraint.*
|
| 4 |
+
|
| 5 |
+
---
|
| 6 |
+
|
| 7 |
+
## TL;DR
|
| 8 |
+
|
| 9 |
+
**Use ACE-Step 1.5 XL as the default base model.** It is the open-source full-song-with-vocals foundation model in May 2026 that combines:
|
| 10 |
+
|
| 11 |
+
1. **First-class Apple Silicon support** (hybrid MLX + PyTorch MPS, dedicated `clockworksquirrel/ace-step-apple-silicon` fork): best local-dev experience.
|
| 12 |
+
2. **MIT license**: clean for forks, attribution, and weight redistribution on the HF Space.
|
| 13 |
+
3. **State-of-the-art-or-better quality**: 4.4/5 vs Suno v4's 4.1/5 vocal naturalness in a 50-person blind test (folk, classical, jazz; Suno still wins pop/EDM polish).
|
| 14 |
+
4. **Sub-minute generation** on M5 Max (projected ~30–50 s for a 4-min song). Sub-2 s/song on A100, which fits inside HF ZeroGPU's free 60 s budget.
|
| 15 |
+
5. **Cheap LoRA fine-tuning**: 8 songs trainable in ~1 hour on a single 3090; LoRA training works on MPS.
|
| 16 |
+
6. **50+ languages**, vocals + instrumentation natively, **<4 GB VRAM minimum**: runs on free ZeroGPU Spaces.
|
| 17 |
+
7. **Active 10.4 k-star repo**, native ComfyUI integration, AMD vendor-blessed for production.
|
| 18 |
+
|
| 19 |
+
**Now that commercial use is not a constraint** (free non-profit HF Space deployment), **SongGeneration 2 / LeVo 2** comes back into contention as a premium-quality alternative: its Tencent non-commercial license permits academic/research/education use. Vendor benchmarks (unverified) put it ahead of Suno v5 on lyric accuracy. The trade-off is **22–28 GB VRAM** (needs paid Space tier, not free ZeroGPU) and no first-party MPS path (only a buggy community `SongGen-Mac` fork), meaning M5 Max local dev is painful.
|
| 20 |
+
|
| 21 |
+
Pair the primary pick with **HeartMuLa-MLX** as an alternate-quality choice (Apache 2.0, 2.1× faster than ACE-Step on M-series via Apple's MLX) and **YuE on Replicate** as the multilingual fallback.
|
| 22 |
+
|
| 23 |
+
---
|
| 24 |
+
|
| 25 |
+
## Ranking (non-profit HF Space context)
|
| 26 |
+
|
| 27 |
+
| Rank | Model | Params | bf16 weights | License | MPS | Vocal Quality vs Suno | LoRA | Verdict |
|
| 28 |
+
|---|---|---|---|---|---|---|---|---|
|
| 29 |
+
| **1** | **ACE-Step 1.5 XL** | ~8 B (4 B DiT + 4 B planner) | ~16 GB | MIT | First-class | 4.4/5 vs Suno v4 4.1 (blind test) | ✅ 1h on 3090 | **Default base.** Fits free ZeroGPU. |
|
| 30 |
+
| **2** | **SongGeneration 2 / LeVo 2** | 4 B | ~8 GB | Tencent non-commercial (OK for non-profit Space) | Buggy community fork only | Vendor PER 8.55 % vs Suno v5 12.4 % | β | Premium quality. Needs paid Space (22–28 GB VRAM). |
|
| 31 |
+
| **3** | **HeartMuLa** | ~6.8 B (4 B MuLa + 2 B Codec + 0.8 B ASR) | ~13.6 GB | Apache 2.0 | Strong MLX port | Vendor: lowest PER per-language, unverified | β public | Strong A/B alternate. |
|
| 32 |
+
| **4** | **DiffRhythm 2** | ~1.17 B (1 B DiT + 170 M VAE-dec) | ~2.4 GB | Apache 2.0 | Likely OK, untested | Authors admit gap vs Suno v4.5 | ❌ no training code | Speed tier. 210 s ceiling. Cheapest to host. |
|
| 33 |
+
| **5** | **YuE** | ~8 B (7 B + 1 B + upsampler) | ~16 GB | Apache 2.0 | ❌ broken (flash-attn hard dep) | Vocal range matches Suno v4 | ✅ LoRA, CUDA-only | Multilingual specialist; via Replicate only. |
|
| 34 |
+
| β | SongBloom | 2 B | ~4 GB | Custom (likely NC) | Reported OK | unknown | β | Research baseline. |
|
| 35 |
+
| - | InspireMusic / FunMusic | 1.5 B | ~3 GB | Apache 2.0 | ❌ CUDA-only deps | No vocals yet | n/a | Skip until vocal release. |
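
The "bf16 weights" column follows from the parameter counts: bf16 stores each parameter in 2 bytes. A quick sanity check, assuming 1 GB = 10^9 bytes and ignoring optimizer state, activations, and any non-bf16 tensors (the helper name is illustrative):

```python
BYTES_PER_BF16_PARAM = 2


def bf16_weight_gb(params_billions: float) -> float:
    """Approximate bf16 checkpoint size in GB for a given parameter count in billions."""
    return params_billions * BYTES_PER_BF16_PARAM


# Parameter counts as listed in the ranking table above.
models = {
    "ACE-Step 1.5 XL": 8.0,
    "SongGeneration 2 / LeVo 2": 4.0,
    "HeartMuLa": 6.8,
    "DiffRhythm 2": 1.17,
}

for name, params in models.items():
    print(f"{name}: ~{bf16_weight_gb(params):.1f} GB")
```

The weights-only figure is a lower bound on runtime memory, which is why SongGeneration 2 can list ~8 GB of weights yet need 22–28 GB of VRAM in practice.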
|
| 36 |
+
|
| 37 |
+
---
|
| 38 |
+
|
| 39 |
+
## Decision tree (non-profit HF Space deployment)
|
| 40 |
+
|
| 41 |
+
```
HF Space tier?
├── Free ZeroGPU (60s/req on shared A100)
│   ├── ACE-Step 1.5 (turbo workflow generates a song well under 60 s)
│   └── DiffRhythm 2 (smallest, fastest, fits easily)
│
└── Paid GPU Space (A10G / A100 dedicated)
    ├── Default: ACE-Step 1.5 XL (best speed-quality, MPS for local dev)
    ├── Premium tier: SongGeneration 2 v2-large (best vendor benchmarks)
    ├── Multilingual breadth: YuE (50+ via Replicate; local broken)
    └── Alternate: HeartMuLa via heartlib-mlx
```
|
| 53 |
+
|
| 54 |
+
---
|
| 55 |
+
|
| 56 |
+
## What the research surfaced that changes the picture
|
| 57 |
+
|
| 58 |
+
1. **Non-profit HF Space deployment removes the Tencent-license blocker.** SongGeneration 2 / LeVo 2 is back in contention as a premium-quality alternative. Its custom license permits "academic, research, and education purposes", and a free non-profit Space sits comfortably inside that scope. Practical blockers remain (22–28 GB VRAM means paid Space tier, no working MPS) but the license is no longer a no-go.
|
| 59 |
+
|
| 60 |
+
2. **The YuE team migrated to ACE-Step.** The ACE-Step paper (Jun 2025) explicitly critiques YuE for "slow inference and structural artifacts." YuE's repo has been dormant since 2025-06-04. Treat YuE as a frozen capability, not a developing one.
|
| 61 |
+
|
| 62 |
+
3. **Vocal-support contradiction on ACE-Step is resolved: yes, it does vocals.** Several search results said "instrumental only", but that confuses the base model with the `Text2Samples` LoRA. The base model produces vocals + instruments natively, lyric-conditioned, with `[verse] [chorus] [bridge]` structural tags.
|
| 63 |
+
|
| 64 |
+
4. **DiffRhythm 2's biggest fix is structural coherence**, not raw quality. The brutal Hacker News thread on v1 complained "no identifiable chorus in any of the demo songs"; v2's block flow-matching (semi-autoregressive over 2 s blocks) closes that gap. Its **210 s ceiling is a regression** from v1-full's 4m45s.
|
| 65 |
+
|
| 66 |
+
5. **HeartMuLa is the dark-horse 2026 entrant.** Apache 2.0, 4 B params, modular (CLAP + Transcriptor + Codec + MuLa LM), MLX port available. Vendor PER claims are aggressive (0.09 EN / 0.12 ZH) but not in comparable units to LeVo's 8.55 %, so a direct comparison is unreliable until somebody runs a neutral A/B.
|
| 67 |
+
|
| 68 |
+
6. **Every "beats Suno v5" claim is vendor-published.** The only neutral preference study located ([arXiv 2506.19085](https://arxiv.org/html/2506.19085v1)) stops at Suno v3.5. **Plan an in-house blind A/B before betting product positioning on any vendor number.**
|
| 69 |
+
|
| 70 |
+
7. **Apple Silicon is fine for music gen, and much friendlier than LTX-Video 2.3.** No complex64, no SDPA-on-meta-tensor traps, no multimodal-Gemma gotchas. The mundane MPS issues here are: `flash-attn` substitution with SDPA, fp16 conv1d → fp32 in audio decoders, `PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0` for OOM tuning. Three of the five candidate models already ship a working MPS or MLX path.
|
| 71 |
+
|
| 72 |
+
8. **HF Space hardware tier dictates the model choice as much as quality does.** Free ZeroGPU = 60 s budget per request, shared A100, so only ACE-Step or DiffRhythm 2 finish in time. Paid A10G/A100 Spaces unlock SongGeneration 2 v2-large but the user has to pay (or get an HF community grant).
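
The MPS environment tuning mentioned in item 7 can be sketched as follows. This is a hedged illustration rather than any project's actual startup code: `pick_device` is a hypothetical helper, and setting the variables before `import torch` is the usual advice rather than something the research notes state. `PYTORCH_ENABLE_MPS_FALLBACK=1` routes unsupported ops to CPU, and a watermark ratio of `0.0` removes PyTorch's cap on MPS allocations, which can destabilize the system if memory really runs out.

```python
import os

# Set MPS-related variables before torch is imported so the backend sees them.
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")
os.environ.setdefault("PYTORCH_MPS_HIGH_WATERMARK_RATIO", "0.0")


def pick_device() -> str:
    """Return 'mps' on Apple Silicon, 'cuda' on NVIDIA, otherwise 'cpu'."""
    try:
        import torch
    except ImportError:
        # torch not installed: nothing to accelerate with.
        return "cpu"
    if torch.backends.mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


print(pick_device())
```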
|
| 73 |
+
|
| 74 |
+
---
|
| 75 |
+
|
| 76 |
+
## Recommended starting setup for the M5 Max (with HF Space deploy in mind)
|
| 77 |
+
|
| 78 |
+
```bash
# 1. Primary base model: ACE-Step 1.5 XL via the Apple Silicon fork
git clone https://github.com/clockworksquirrel/ace-step-apple-silicon \
  ~/Projects/llm/music-generator/ace-step
cd ~/Projects/llm/music-generator/ace-step
python3.11 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# Hybrid backend: Qwen3 planner → MLX, DiT decoder → PyTorch MPS, bf16 throughout
# ~16 GB bf16 weights for the XL stack; M5 Max 128 GB has massive headroom

# 2. Production UI: ace-step-ui (stem extraction, library, LAN access)
git clone https://github.com/fspecii/ace-step-ui \
  ~/Projects/llm/music-generator/ace-step-ui

# 3. Alternate model: HeartMuLa via MLX port (~13.6 GB bf16)
git clone https://github.com/Acelogic/heartlib-mlx \
  ~/Projects/llm/music-generator/heartlib-mlx

# 4. (Optional) Premium-quality experiment: SongGeneration 2 / LeVo 2
# Mac fork has a pre-chorus bug; only do this if you're OK developing on a rented
# Linux+CUDA box and the M5 Max becomes just your control plane.
git clone https://github.com/tencent-ailab/SongGeneration \
  ~/Projects/llm/music-generator/songgeneration
```
|
| 102 |
+
|
| 103 |
+
For the throughput-sensitive **multilingual fallback (YuE)**, use Replicate's `fofr/yue` endpoint; do *not* attempt local inference on M5 Max until somebody ports Stage-1 to MPS. Treat YuE as remote-only for now.
|
| 104 |
+
|
| 105 |
+
**HF Space deployment notes:**
|
| 106 |
+
- **Free ZeroGPU Space**: only ACE-Step or DiffRhythm 2 will finish a song inside the 60 s shared-A100 budget. Use ACE-Step's turbo workflow.
|
| 107 |
+
- **Paid GPU Space**: A10G (24 GB) handles ACE-Step XL comfortably; A100 (40 GB) opens the door to SongGeneration 2 v2-large.
|
| 108 |
+
- **Apply for a [Community GPU Grant](https://huggingface.co/docs/hub/en/spaces-gpus#community-gpu-grants)** if budget is the deciding factor; HF approves these regularly for non-profit demos.
|
| 109 |
+
|
| 110 |
+
---
|
| 111 |
+
|
| 112 |
+
## Sources
|
| 113 |
+
|
| 114 |
+
All claims are cited inline in the per-model deep-dives:
|
| 115 |
+
|
| 116 |
+
- [01_yue.md](./01_yue.md)
|
| 117 |
+
- [02_diffrhythm.md](./02_diffrhythm.md)
|
| 118 |
+
- [03_acestep.md](./03_acestep.md)
|
| 119 |
+
- [04_newcomers_and_survey.md](./04_newcomers_and_survey.md)
|
| 120 |
+
- [05_apple_silicon_mps_audit.md](./05_apple_silicon_mps_audit.md)
|
| 121 |
+
- [06_comparison_matrix.md](./06_comparison_matrix.md): side-by-side spec table
|
| 122 |
+
- [07_platform_architecture.md](./07_platform_architecture.md): Suno-clone system design with ACE-Step at the core
|
|
@@ -0,0 +1,268 @@
| 1 |
+
# YuE: Open Full-Song Music Generation Foundation Model
|
| 2 |
+
|
| 3 |
+
*Research date: 2026-05-18*
|
| 4 |
+
|
| 5 |
+
---
|
| 6 |
+
|
| 7 |
+
## 1. Overview
|
| 8 |
+
|
| 9 |
+
**YuE** (乐, "yue", Chinese for "music") is an open-source family of long-form, lyrics-to-song foundation models that produce vocals + accompaniment end-to-end, explicitly positioned as the open competitor to Suno.ai and Udio. It was built by the **M-A-P (Multimodal Art Projection) collective**, led by researchers at **HKUST (Hong Kong University of Science and Technology)** with collaborators from multiple academic and industry institutions (58 authors are credited on the paper, with hardware support from Geely and Moonshot AI) ([arXiv 2503.08638](https://arxiv.org/abs/2503.08638), [HF model card](https://huggingface.co/m-a-p/YuE-s1-7B-anneal-en-icl)).
|
| 10 |
+
|
| 11 |
+
**Release timeline:**
|
| 12 |
+
|
| 13 |
+
- **2025-01-26**: Initial YuE-s1-7B series released ([GitHub README](https://github.com/multimodal-art-projection/YuE))
|
| 14 |
+
- **2025-01-30**: Apache 2.0 license adopted; dual-track ICL mode added
|
| 15 |
+
- **2025-02-07**: Windows / Pinokio support
|
| 16 |
+
- **2025-02-17**: Music continuation + Google Colab support
|
| 17 |
+
- **2025-03-11/12**: Anneal checkpoints + technical report on arXiv (v1)
|
| 18 |
+
- **2025-06-04**: LoRA fine-tuning code merged (PR #126)
|
| 19 |
+
- **ICLR 2026**: Paper presented
|
| 20 |
+
|
| 21 |
+
**Current status (May 2026): effectively frozen / community-maintained.** The official `multimodal-art-projection/YuE` repo's last commit is **2025-06-04** (GitHub API, retrieved 2026-05-18), nearly 12 months stale. There is no announced YuE-2 or successor from the M-A-P org. All forward development (quantization, ComfyUI, GUI, MPS attempts, exllama, mp3 extension) now happens in community forks like [YuEGP](https://github.com/deepbeepmeep/YuEGP), [YuE-exllamav2](https://github.com/sgsdxzy/YuE-exllamav2), and [YuE-extend](https://github.com/Mozer/YuE-extend). The team itself has moved on to **ACE-Step** (released January 2026), and the ACE-Step paper explicitly critiques YuE for "slow inference and structural artifacts" ([arXiv 2506.00045](https://arxiv.org/abs/2506.00045)).
|
| 22 |
+
|
| 23 |
+
---
|
| 24 |
+
|
| 25 |
+
## 2. Architecture
|
| 26 |
+
|
| 27 |
+
YuE is a **two-stage autoregressive LLM** pipeline built on the **LLaMA2** decoder-only transformer backbone, *not* a diffusion model ([paper](https://arxiv.org/html/2503.08638v1)).
|
| 28 |
+
|
| 29 |
+
**Stage-1 LM (the headline 7B model):**
|
| 30 |
+
- LLaMA2-style decoder, ~6B–7B parameters (HF metadata reports 6B for the s1 checkpoints).
|
| 31 |
+
- Performs **track-decoupled next-token prediction**: interleaves *vocal* and *instrumental* token streams in a single sequence, so a single AR pass produces both tracks rather than mixing them. This is YuE's central architectural innovation.
|
| 32 |
+
- Conditioned on (genre tags || lyrics) using **structural progressive conditioning**: lyrics are chunked per section (verse/chorus/bridge) and re-injected so attention does not lose alignment over a 5-minute generation.
|
| 33 |
+
- Native context: 8192 tokens (~163 s of mix-track audio, ~81 s of dual-track); extended to **16384** in the anneal phase.
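
To make the track-decoupled idea concrete, here is a toy sketch of interleaving two token streams into one sequence and splitting them back apart. It assumes a simple frame-by-frame alternation (vocal token, then instrumental token); the actual token layout used by YuE is not specified in this summary, and the function names are hypothetical.

```python
from itertools import chain


def interleave_tracks(vocal: list[int], inst: list[int]) -> list[int]:
    """Merge two equal-length token streams as [v0, i0, v1, i1, ...]."""
    if len(vocal) != len(inst):
        raise ValueError("vocal and instrumental streams must have the same length")
    return list(chain.from_iterable(zip(vocal, inst)))


def split_tracks(sequence: list[int]) -> tuple[list[int], list[int]]:
    """Invert interleave_tracks: even positions are vocal, odd are instrumental."""
    return sequence[0::2], sequence[1::2]


mixed = interleave_tracks([1, 2, 3], [10, 20, 30])
print(mixed)
print(split_tracks(mixed))
```

Under this layout a single autoregressive pass sees both tracks in context, at the cost of spending two tokens per audio frame.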
|
| 34 |
+
|
| 35 |
+
**Stage-2 LM:**
|
| 36 |
+
- 1B-parameter LLaMA2 model (HF reports ~2B for `YuE-s2-1B-general`).
|
| 37 |
+
- Predicts the **residual RVQ codebooks (layers 1–7)** conditioned on Stage-1's codebook-0 output, restoring acoustic fidelity that the semantic-rich layer-0 tokens omit.
|
| 38 |
+
- Context length 8192.
|
| 39 |
+
|
| 40 |
+
**Audio tokenizer β X-Codec:**
|
| 41 |
+
- YuE uses **X-Codec** (from the same M-A-P lineage as MERT), a *semantic-acoustic fused* RVQ codec that bolts a HuBERT-based semantic stream onto an RVQ-VAE acoustic stream.
|
| 42 |
+
- 12 RVQ codebooks total; YuE uses the first **8** (codebook size 1024 each).
|
| 43 |
+
- 50 Hz frame rate over 16 kHz audio.
|
| 44 |
+
- A separate **YuE-upsampler** (GAN-based) converts the 16 kHz output up to higher sample rate / better fidelity for delivery ([paper Β§3](https://arxiv.org/html/2503.08638v1), [HF Transformers X-Codec docs](https://huggingface.co/docs/transformers/main/model_doc/xcodec)).
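
The context-length figures quoted for Stage-1 (~163 s mix-track, ~81 s dual-track at 8192 tokens) are consistent with the 50 Hz frame rate. The sketch below assumes one Stage-1 token per frame per track and that the whole context is spent on audio tokens, ignoring the lyric and tag prompt; both are simplifications, and the helper name is illustrative.

```python
FRAME_RATE_HZ = 50  # X-Codec frame rate stated above


def context_seconds(context_tokens: int, tracks: int) -> float:
    """Seconds of audio that fit in a context, at one token per frame per track."""
    return context_tokens / (FRAME_RATE_HZ * tracks)


print(context_seconds(8192, tracks=1))   # mix-track at the native context
print(context_seconds(8192, tracks=2))   # dual-track at the native context
print(context_seconds(16384, tracks=2))  # dual-track after the anneal extension
```

Even at 16384 tokens a dual-track pass covers well under five minutes, which matches the note that full songs are generated in sessions and stitched.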
|
| 45 |
+
|
| 46 |
+
**Track handling:** Dual-track. Vocal and accompaniment are *separately tokenized* via X-Codec, then interleaved in the AR sequence; this is the paper's claimed advantage over single-track-mixture baselines (less information loss, cleaner vocal/inst separation).
|
| 47 |
+
|
| 48 |
+
**Max generation length:** Up to **~5 minutes** per song, generated in chunks/sessions and stitched.
|
| 49 |
+
|
| 50 |
+
**Lyrics conditioning:** Plain text lyrics with section tags ([verse], [chorus], etc.) + a genre tag prompt (a vocabulary from `top_200_tags.json` such as "pop", "female vocal", "energetic", "120 bpm"). The progressive conditioning means each new section re-references the relevant lyric chunk.
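
As an illustration of per-section chunking, the snippet below splits tagged lyrics into (section, text) pairs that could be re-injected one at a time. It is a sketch under stated assumptions: the regex, the function name, and the sample lyrics are all hypothetical, and YuE's own preprocessing may differ.

```python
import re

# A section tag is a line consisting only of a bracketed name, e.g. "[verse]".
SECTION_TAG = re.compile(r"^\[([^\]\n]+)\][ \t]*$", re.MULTILINE)


def split_sections(lyrics: str) -> list[tuple[str, str]]:
    """Split tagged lyrics into (section_name, section_text) pairs, in order."""
    # re.split with one capture group yields [preamble, tag1, body1, tag2, body2, ...].
    parts = SECTION_TAG.split(lyrics)
    return [
        (parts[i].strip().lower(), parts[i + 1].strip())
        for i in range(1, len(parts), 2)
    ]


sample = "[verse]\nfirst line\nsecond line\n\n[chorus]\nhook line\n"
for name, text in split_sections(sample):
    print(name, "->", text.splitlines())
```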
|
| 51 |
+
|
| 52 |
+
**Training scale:** Stage-1 used ~**2T tokens** across phases; data includes ~**650K hours of in-the-wild music** plus ~**70K hours of TTS** for vocal grounding ([paper](https://arxiv.org/html/2503.08638v1)).
|
| 53 |
+
|
| 54 |
+
---
|
| 55 |
+
|
| 56 |
+
## 3. Variants and Sizes
|
| 57 |
+
|
| 58 |
+
From the [M-A-P YuE collection on HuggingFace](https://huggingface.co/collections/m-a-p/yue-6797d55e22990ae89b90a3d6) (downloads accurate as of mid-2026):
|
| 59 |
+
|
| 60 |
+
| Model | Params | Stage | Language | Mode | Downloads (last month) |
|
| 61 |
+
|---|---|---|---|---|---|
|
| 62 |
+
| `YuE-s1-7B-anneal-en-cot` | 6B | 1 | English | Chain-of-Thought (default) | 8.48k |
|
| 63 |
+
| `YuE-s1-7B-anneal-en-icl` | 6B | 1 | English | In-Context Learning (style cloning) | 805 |
|
| 64 |
+
| `YuE-s1-7B-anneal-zh-cot` | 6B | 1 | Mandarin/Cantonese | CoT | 203 |
|
| 65 |
+
| `YuE-s1-7B-anneal-zh-icl` | 6B | 1 | Mandarin/Cantonese | ICL | 89 |
|
| 66 |
+
| `YuE-s1-7B-anneal-jp-kr-cot` | 6B | 1 | Japanese/Korean | CoT | 95 |
|
| 67 |
+
| `YuE-s1-7B-anneal-jp-kr-icl` | 6B | 1 | Japanese/Korean | ICL | 25 |
|
| 68 |
+
| `YuE-s2-1B-general` | 2B | 2 | language-agnostic | residual decoder | 6.01k |
|
| 69 |
+
| `YuE-s1-0.5B` | 0.5B | 1 | research/ablation | partial training | 94 |
|
| 70 |
+
| `YuE-upsampler` | - | post | n/a | GAN upsampler | - |
|
| 71 |
+
| `xcodec_mini_infer` | - | tokenizer | n/a | X-Codec encoder/decoder | - |
|
| 72 |
+
|
| 73 |
+
**Naming key:**
|
| 74 |
+
- `s1` / `s2` = Stage-1 (semantic) / Stage-2 (acoustic residual).
|
| 75 |
+
- `anneal` = checkpoints after the final "annealing" pretraining phase (highest quality public weights).
|
| 76 |
+
- `cot` = chain-of-thought prompting variant; `icl` = in-context learning variant (used for *style/voice cloning* from a reference audio).
|
| 77 |
+
- A community **GGUF quantization** of the Stage-2 model exists at [`multimodalart/YuE-s2-1B-general-Q8_0-GGUF`](https://huggingface.co/multimodalart/YuE-s2-1B-general-Q8_0-GGUF), useful for Mac llama.cpp paths.
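
The naming key can be applied mechanically. The parser below is a small sketch that decodes the checkpoint names from the table above; the `YueCheckpoint` class and `parse_checkpoint` function are hypothetical helpers, not part of the YuE codebase.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class YueCheckpoint:
    stage: int               # 1 = semantic LM, 2 = acoustic residual LM
    size: str                # e.g. "7B", "1B", "0.5B"
    anneal: bool             # True for post-annealing checkpoints
    language: Optional[str]  # "en", "zh", "jp-kr", or None
    mode: Optional[str]      # "cot", "icl", or None


def parse_checkpoint(name: str) -> YueCheckpoint:
    """Decode names such as 'YuE-s1-7B-anneal-jp-kr-icl'."""
    parts = name.split("-")
    if len(parts) < 3 or parts[0] != "YuE" or parts[1] not in ("s1", "s2"):
        raise ValueError(f"not a YuE stage checkpoint name: {name}")
    rest = parts[3:]
    mode = rest[-1] if rest and rest[-1] in ("cot", "icl") else None
    # Whatever is left after removing the known keywords is the language code.
    lang = [p for p in rest if p not in ("anneal", "cot", "icl", "general")]
    return YueCheckpoint(
        stage=int(parts[1][1]),
        size=parts[2],
        anneal="anneal" in rest,
        language="-".join(lang) or None,
        mode=mode,
    )


print(parse_checkpoint("YuE-s1-7B-anneal-jp-kr-icl"))
print(parse_checkpoint("YuE-s2-1B-general"))
```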
|
| 78 |
+
|
| 79 |
+
There is **no official "YuE-2" or major version bump**. The team's successor effort is the separately branded ACE-Step.
|
| 80 |
+
|
| 81 |
+
---
|
| 82 |
+
|
| 83 |
+
## 4. License
|
| 84 |
+
|
| 85 |
+
**Apache License 2.0** for code *and* weights, switched on 2025-01-30 in response to community pressure ([GitHub README news entry](https://github.com/multimodal-art-projection/YuE), [HF model card](https://huggingface.co/m-a-p/YuE-s1-7B-anneal-en-icl)).
|
| 86 |
+
|
| 87 |
+
- **Commercial use:** *Permitted and explicitly encouraged.* The model card says: "Artists and content creators are encouraged to sample and incorporate outputs into their own works, and even monetize them, with attribution to the model's name (\"YuE by HKUST/M-A-P\")."
|
| 88 |
+
- **Attribution:** Required for public / commercial outputs.
|
| 89 |
+
- **Recommended labeling:** outputs should be marked "AI-generated", "YuE-generated", "AI-assisted", or "AI-auxiliated".
|
| 90 |
+
- **No training-data redistribution clause**: Apache 2.0 covers code and the released weights; training data itself was *not* released, so no redistribution permission is granted on data.
|
| 91 |
+
- **Liability:** users bear sole responsibility for any copyright infringement, plagiarism, or misuse. Likely no explicit watermarking or content credentials are baked into output (no direct confirmation in docs).
|
| 92 |
+
|
| 93 |
+
Practical takeaway for the user's Suno-like platform: **YuE is one of the very few music-generation foundation models with a clean, no-strings commercial license**, which is the single most valuable thing about it.
|
| 94 |
+
|
| 95 |
+
---
|
| 96 |
+
|
| 97 |
+
## 5. Languages Supported
|
| 98 |
+
|
| 99 |
+
Five officially: **English, Mandarin Chinese, Cantonese, Japanese, Korean** ([GitHub README](https://github.com/multimodal-art-projection/YuE), [demo page](https://map-yue.github.io/)).
|
| 100 |
+
|
| 101 |
+
- English has the deepest training and the most-downloaded checkpoint.
|
| 102 |
+
- `zh` covers Mandarin and Cantonese (sharing a checkpoint).
|
| 103 |
+
- `jp-kr` shares one checkpoint for Japanese and Korean.
|
| 104 |
+
- The demo site shows code-switching (English ↔ Mandarin within the same song) working.
|
| 105 |
+
- No official support for Spanish, French, German, Hindi, Arabic, etc.; outputs in those languages will likely be poor or accented (no direct user reports confirm, but architecturally the model has never seen them at scale).
|
| 106 |
+
|
| 107 |
+
---
|
| 108 |
+
|
| 109 |
+
## 6. Quality Assessment
|
| 110 |
+
|
| 111 |
+
**Strengths (from paper + demos):**
|
| 112 |
+
- Wide vocal range: the paper reports YuE "closely matching top-performing closed-source systems like Suno V4" on vocal-range metrics ([WhiteFiber summary](https://www.whitefiber.com/blog/yue-ai-music-generator)).
|
| 113 |
+
- Strong **musical structure**: verse/chorus/bridge transitions are coherent over 3–5 min, which most diffusion music models still struggle with.
|
| 114 |
+
- Demos show death-growl metal, scatting jazz, Beijing opera, rap, ballad, country, and soul; the *genre breadth* is genuinely impressive ([map-yue.github.io](https://map-yue.github.io/)).
|
| 115 |
+
- ICL mode can clone the timbre/style of a reference clip, the closest open-source analogue to Suno's "cover" or Udio's style transfer.
|
| 116 |
+
|
| 117 |
+
**Weaknesses (from paper's own discussion + community feedback):**
|
| 118 |
+
- **Acoustic fidelity gap.** Multiple sources, including the paper itself, note "clear deficiencies in vocal and accompaniment acoustic quality, likely due to limitations of its current audio tokenization method"; the authors propose super-resolution / better decoders as future work.
|
| 119 |
+
- **Mono / narrow stereo image.** Third-party reviews call out that output "lacks the production quality needed for commercial music platforms" and is essentially mono ([articlex review](https://www.articlex.com/open-source-ai-music-generation-breakthrough-with-yue-software/)).
|
| 120 |
+
- **Slow inference + structural artifacts.** This is the explicit critique from the ACE-Step authors (ICLR 2026 submission): "LLM-based models like YuE excel at lyrics alignment but suffer from slow inference and structural artifacts" ([ACE-Step paper](https://arxiv.org/abs/2506.00045)).
|
| 121 |
+
- **Mumbling / lyric drift** appears in long sections. No Reddit thread documenting it surfaced, but the paper's "Section 12 Unsuccessful Attempts" and the emphasis on `--repetition-penalty` / decoding temperature in the GitHub Issues suggest users hit it.
|
| 122 |
+
|
| 123 |
+
**Quality verdict vs Suno v4 / v5:**
|
| 124 |
+
- Suno v4 ≈ YuE on *vocal range and genre breadth.*
|
| 125 |
+
- Suno v4/v5 clearly ahead on *mix polish, stereo width, vocal clarity, and emotional nuance.*
|
| 126 |
+
- YuE ahead of Suno only on *openness, controllability via lyrics tags, and structural macro-form for niche genres*.
|
| 127 |
+
|
| 128 |
+
---
|
| 129 |
+
|
| 130 |
+
## 7. Inference Performance
|
| 131 |
+
|
| 132 |
+
From the README's official hardware table:
|
| 133 |
+
|
| 134 |
+
| GPU | Time for 30 s of audio (Stage-1 + Stage-2) |
|
| 135 |
+
|---|---|
|
| 136 |
+
| NVIDIA H800 80GB | **~150 s** |
|
| 137 |
+
| NVIDIA RTX 4090 24GB | **~360 s** |
|
| 138 |
+
| ≤24GB GPU | Max ~2 concurrent sessions; cannot generate a full song in one pass |
|
| 139 |
+
| ≥80GB GPU (H100/A100/H800) | Recommended for a full 4+ session song |
|
| 140 |
+
|
| 141 |
+
Extrapolating to a **3-minute song** (~6× a 30 s clip, plus some overhead for stitching):
|
| 142 |
+
- H800: ~15–18 minutes
|
| 143 |
+
- A100 80GB: ~18–22 minutes (likely; close to H800 throughput)
|
| 144 |
+
- RTX 4090: ~35–45 minutes
|
| 145 |
+
- M5 Max MPS (user's machine): **no official support, no public benchmark.**
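
The extrapolation above is linear scaling of the README's 30-second timings. A small sketch, where the 20 % stitching overhead is an illustrative assumption rather than a measured value and the names are hypothetical:

```python
CLIP_SECONDS = 30
# Seconds of compute per 30 s clip, from the README hardware table above.
README_SECONDS_PER_CLIP = {"H800": 150, "RTX 4090": 360}


def estimate_minutes(gpu: str, song_seconds: float, overhead: float = 0.0) -> float:
    """Linearly scale the 30 s benchmark to a full song, plus fractional overhead."""
    clips = song_seconds / CLIP_SECONDS
    return README_SECONDS_PER_CLIP[gpu] * clips * (1 + overhead) / 60


for gpu in README_SECONDS_PER_CLIP:
    low = estimate_minutes(gpu, 180)
    high = estimate_minutes(gpu, 180, overhead=0.2)
    print(f"{gpu}: {low:.0f} to {high:.0f} minutes for a 3-minute song")
```

With zero overhead the H800 lands at 15 minutes and the RTX 4090 at 36 minutes, the lower ends of the ranges listed above.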
|
| 146 |
+
|
| 147 |
+
**VRAM:** Full-precision FP16 Stage-1 needs ~16–18 GB; Stage-2 + upsampler add ~4–6 GB. Single-pass full-song generation comfortably wants 40–80 GB.
|
| 148 |
+
|
| 149 |
+
**Quantized / community paths:**
|
| 150 |
+
- **YuEGP** ("YuE for the GPU Poor") brings VRAM down to **<10 GB** via 8-bit quantization and sequential offload ([YuEGP repo](https://github.com/deepbeepmeep/YuEGP)).
|
| 151 |
+
- **YuE-exllamav2** claims up to **5× speedup** via ExLlamaV2 + FlashAttention-2 + BF16 ([YuE-exllamav2](https://github.com/sgsdxzy/YuE-exllamav2)); NVIDIA-only.
|
| 152 |
+
- **GGUF Stage-2** exists ([multimodalart/YuE-s2-1B-general-Q8_0-GGUF](https://huggingface.co/multimodalart/YuE-s2-1B-general-Q8_0-GGUF)). Stage-1 7B GGUF is not officially published as of 2026-05.
|
| 153 |
+
|
| 154 |
+
**Apple Silicon / MPS:**
|
| 155 |
+
- **No official MPS support.** GitHub README references `--cuda_idx`, no `mps` or `mac` mentions.
|
| 156 |
+
- No HF Space or fork advertises working MPS inference. The architecture is plain LLaMA2 + standard transformer ops, so an MPS port is *technically feasible* (likely; Stage-1 fits well within the user's 128GB unified memory), but the X-Codec encoder/decoder has Flash-Attention CUDA kernels that would need replacement. Realistic path on M5 Max today: run the Stage-2 GGUF via llama.cpp Metal backend, but Stage-1 has no public Metal/MPS port.
|
| 157 |
+
- A community attempt to MPS-port has *not* surfaced in any search or GitHub issue as of May 2026.
|
| 158 |
+
|
| 159 |
+
---
|
| 160 |
+
|
| 161 |
+
## 8. Repo Health
|
| 162 |
+
|
| 163 |
+
Data from the GitHub API on 2026-05-18 for `multimodal-art-projection/YuE`:
|
| 164 |
+
|
| 165 |
+
- **Stars:** 6,219
|
| 166 |
+
- **Forks:** 741
|
| 167 |
+
- **Open issues:** 86
|
| 168 |
+
- **License:** Apache-2.0
|
| 169 |
+
- **Default branch last push:** `2025-06-04T13:08:48Z`, **~11 months stale**
|
| 170 |
+
- **Most-recent commits:** all README edits and the finetune-merge PRs on the same day (2025-06-04).
|
| 171 |
+
- **Recent issue traffic (sampled 2025-Q4 through 2026-Q2):** install errors (CUDA / `codecmanipulator` missing), ComfyUI integration questions, attention-mask warnings, "how do I generate a full song" basics, a Feb-2026 PR proposing `SDPA as default attention` that received zero engagement. Maintainer responses are essentially absent in 2026.
|
| 172 |
+
- **Fine-tuning support:** present, merged June 2025 via PR #126 (LoRA, no QLoRA, requires CUDA 12.1+, PyTorch 2.4, Megatron-formatted JSONL data).
|
| 173 |
+
- **vLLM / SGLang:** listed in TODO, never implemented.
|
| 174 |
+
- **llama.cpp:** community Stage-2 GGUF exists but no official integration; Stage-1 not converted.
|
| 175 |
+
- **Tensor parallel / Stemgen mode:** TODO, never shipped.
|
| 176 |
+
|
| 177 |
+
**Verdict:** The repo is in **maintenance/abandonment limbo.** Apache 2.0 + open weights mean anyone can fork; community forks are where the energy is.
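The "~11 months stale" figure above can be rechecked from the `pushed_at` timestamp. The sketch below is only an illustration of that arithmetic; the helper name `months_stale` is hypothetical, and the reference date is the 2026-05-18 sampling date stated in this section.

```python
from datetime import datetime, timezone


def months_stale(pushed_at: str, as_of: str) -> int:
    """Whole calendar months between a `pushed_at` timestamp and a reference date."""
    pushed = datetime.strptime(pushed_at, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    ref = datetime.strptime(as_of, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    months = (ref.year - pushed.year) * 12 + (ref.month - pushed.month)
    # Drop the last month if it has not fully elapsed yet.
    if (ref.day, ref.time()) < (pushed.day, pushed.time()):
        months -= 1
    return months


print(months_stale("2025-06-04T13:08:48Z", "2026-05-18"))
```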
|
| 178 |
+
|
| 179 |
+
---
|
| 180 |
+
|
| 181 |
+
## 9. Real-World Adoption
|
| 182 |
+
|
| 183 |
+
- **Replicate:** Hosted at [`fofr/yue`](https://replicate.com/fofr/yue/api) with an official cog wrapper at [`replicate/cog-yue`](https://github.com/replicate/cog-yue): a production-ready pay-per-second API.
|
| 184 |
+
- **HuggingFace Spaces:** at least three live demos: [`fffiloni/YuE`](https://huggingface.co/spaces/fffiloni/YuE), [`innova-ai/YuE-music-generator-demo`](https://huggingface.co/spaces/innova-ai/YuE-music-generator-demo), `Harveyu/YuE-music-generator-demo`.
|
| 185 |
+
- **ComfyUI:** community node [`smthemex/ComfyUI_YuE`](https://github.com/smthemex/ComfyUI_YuE) exposes YuE as a node graph (issue #148 confirms active users in 2026).
|
| 186 |
+
- **Pinokio:** one-click Windows installer ships in the official Pinokio script directory ([pinokio.co](https://pinokio.co/)).
|
| 187 |
+
- **GPU-poor / consumer forks:** `deepbeepmeep/YuEGP` (sub-10 GB VRAM), `sgsdxzy/YuE-exllamav2` (5× speedup), `Mozer/YuE-extend` (mp3 extension + GUI), `Sorrymakershen/YuE-for-windows`.
|
| 188 |
+
- **SiliconFlow:** no public listing found as of 2026-05 (likely absent: search returned no SiliconFlow YuE endpoint).
|
| 189 |
+
- **Forks:** 741 total, dominated by consumer-VRAM optimization rather than research extension.
|
| 190 |
+
|
| 191 |
+
For a Suno-like platform, the **Replicate `fofr/yue` endpoint is the lowest-friction starting point** to test quality before self-hosting.
|
| 192 |
+
|
| 193 |
+
---
|
| 194 |
+
|
| 195 |
+
## 10. Fine-Tuning
|
| 196 |
+
|
| 197 |
+
- **LoRA fine-tuning is documented and supported** since June 2025, in the [`finetune/` directory](https://github.com/multimodal-art-projection/YuE/tree/main/finetune) with `scripts/preprocess_data.sh` and `scripts/run_finetune.sh`.
|
| 198 |
+
- Configurable `LORA_R`, `LORA_ALPHA`, `LORA_DROPOUT`.
|
| 199 |
+
- **Training scripts are open**, built on a Megatron-style data pipeline: data must be converted to JSONL containing X-Codec tokens + lyric/structure/genre metadata, then to Megatron binary.
|
| 200 |
+
- **QLoRA: not documented.** No 4-bit fine-tuning path is described in the official repo (likely: community forks may have hacked it together).
|
| 201 |
+
- Requires CUDA 12.1+, PyTorch 2.4, Python 3.10; GPU memory is not explicitly stated but realistically wants ≥40 GB VRAM for the 7B Stage-1 LoRA.
|
| 202 |
+
- No published guide for full-parameter fine-tuning of Stage-1; implied to need multi-node H100.
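To give a feel for what `LORA_R` and `LORA_ALPHA` control, here is a rough sketch of the standard LoRA bookkeeping: each adapted weight matrix gains `r * (d_in + d_out)` trainable parameters, and the low-rank update is scaled by `alpha / r`. The hidden size (4096), layer count (32), the choice of four attention projections, and the values `r=64`, `alpha=32` are all assumptions based on a generic LLaMA2-7B layout, not values taken from the YuE repo.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one LoRA adapter (A: d_in x r, B: r x d_out)."""
    return rank * (d_in + d_out)


def lora_scale(alpha: float, rank: int) -> float:
    """Multiplier applied to the low-rank update in the standard LoRA formulation."""
    return alpha / rank


# Assumed generic LLaMA2-7B shape; NOT read from the YuE fine-tune scripts.
HIDDEN, LAYERS, ADAPTED_PROJECTIONS = 4096, 32, 4
RANK, ALPHA = 64, 32  # illustrative settings only

total = LAYERS * ADAPTED_PROJECTIONS * lora_params(HIDDEN, HIDDEN, RANK)
print(f"trainable LoRA params: {total:,}, scale: {lora_scale(ALPHA, RANK)}")
```

Under these assumptions the adapter is under 1 % of a 7B base model, which is why LoRA is feasible on a single large GPU while full-parameter tuning is not.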
|
| 203 |
+
|
| 204 |
+
---
|
| 205 |
+
|
| 206 |
+
## 11. Pros and Cons
|
| 207 |
+
|
| 208 |
+
**Pros**
|
| 209 |
+
- True open weights (Apache 2.0), commercial-use-friendly, with strong attribution-only requirements.
|
| 210 |
+
- Genuine dual-track output (vocals + instrumentals as separable streams), not just a mix.
|
| 211 |
+
- Multilingual coverage of EN / ZH / Cantonese / JP / KR with code-switching demos.
|
| 212 |
+
- Strong macro-structure for 3–5 minute songs: verses, choruses, bridges hold together.
|
| 213 |
+
- Healthy ecosystem of quantized / consumer-VRAM forks and a turnkey Replicate endpoint.
|
| 214 |
+
- LoRA fine-tuning code is shipped and merged.
|
| 215 |
+
- Comparable vocal range to Suno v4 on the paper's metrics.
|
| 216 |
+
|
| 217 |
+
**Cons**
|
| 218 |
+
- **Repo has been effectively dormant since June 2025**: no maintainer engagement on 2026 issues/PRs.
|
| 219 |
+
- Acoustic fidelity is noticeably below Suno v4/v5: mono-ish, less polished mix, occasional vocal artifacts/mumbling on long passages.
|
| 220 |
+
- **No MPS / Apple Silicon support**, official or community, which is a real problem for the user's M5 Max workflow.
|
| 221 |
+
- Slow inference even on H800 (~150 s per 30 s clip, i.e. 15+ minutes per full song before quantization).
|
| 222 |
+
- VRAM hungry: full-song single-pass wants 80 GB; consumer GPUs need session-stitching tricks.
|
| 223 |
+
- No QLoRA / no vLLM / no SGLang / no tensor parallel; all in TODO purgatory.
|
| 224 |
+
- Training data not released; fine-tuning needs you to bring your own licensed corpus.
|
| 225 |
+
- Tokenizer (X-Codec) is the bottleneck for fidelity, and YuE inherits this ceiling; no upgrade path is planned in this codebase.
|
| 226 |
+
- An explicit successor effort (ACE-Step) from an adjacent team claims to fix YuE's specific weaknesses.
|
| 227 |
+
|
| 228 |
+
---
|
| 229 |
+
|
| 230 |
+
## 12. Verdict for the User's Suno-like Platform
|
| 231 |
+
|
| 232 |
+
**Best fit for the user's M5 Max / 128 GB platform if:**
|
| 233 |
+
- The product needs **commercial-grade licensing freedom** above all else: YuE is one of the very few open music models you can ship in a paid product without licensing carve-outs.
|
| 234 |
+
- You target **multilingual song generation (EN + Mandarin/Cantonese + JP/KR)** with code-switching; YuE is the strongest open option here.
|
| 235 |
+
- You can offload generation to a **rented H100/H800 (Replicate, Runpod, Lambda)** rather than insisting on local M5 Max inference; *MPS support is the blocker on the user's hardware.*
|
| 236 |
+
- You want a base to **LoRA fine-tune on a proprietary genre/voice corpus**: the official fine-tune scripts work today, and Apache 2.0 lets you keep your LoRA private and commercial.
|
| 237 |
+
|
| 238 |
+
**Where YuE will underperform competitors:**
|
| 239 |
+
- **Acoustic polish:** Suno v4/v5 and Udio will sound noticeably more professional out of the box. If your platform's selling point is "studio-quality vocals", YuE is not there.
|
| 240 |
+
- **Throughput per dollar:** diffusion-based ACE-Step and DiffRhythm-2 are dramatically faster (ACE-Step claims ~15× speedup); for a high-volume product, the AR-LLM architecture is expensive.
|
| 241 |
+
- **Real-time / interactive generation:** not viable; YuE is batch-only.
|
| 242 |
+
- **Local Mac inference:** until somebody ports Stage-1 to MPS or ships a Stage-1 GGUF, the user's M5 Max can at best play around with the Stage-2 model in llama.cpp Metal mode.
|
| 243 |
+
|
| 244 |
+
**Concrete recommendation for the user:** use YuE via Replicate's `fofr/yue` endpoint as the **commercial-license-clean fallback / multilingual specialist** in the platform's model router, and seriously evaluate ACE-Step in parallel for the throughput-sensitive default path. Plan a future LoRA fine-tune on YuE only after the platform has clear vertical (genre, language, or vocal-style) demand that the closed APIs cannot serve.
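The router idea in this recommendation could look roughly like the sketch below. Everything here is hypothetical: the backend labels (`"ace-step"`, `"yue-replicate"`), the `Request` fields, and the language set are illustrative placeholders, not an existing API. The function returns an ordered preference list so the caller can fall back to the second entry if the first backend fails.

```python
from dataclasses import dataclass

# Languages where this report rates YuE as the stronger open option (assumed codes).
YUE_SPECIALIST_LANGS = {"zh", "yue", "ja", "ko"}


@dataclass(frozen=True)
class Request:
    language: str                 # e.g. "en", "zh", "ja" (hypothetical convention)
    latency_sensitive: bool = True


def route(req: Request) -> list[str]:
    """Return backends in order of preference; later entries are fallbacks."""
    if req.language in YUE_SPECIALIST_LANGS and not req.latency_sensitive:
        # Multilingual specialist path: quality over throughput.
        return ["yue-replicate", "ace-step"]
    # Throughput-sensitive default path, with YuE as the fallback.
    return ["ace-step", "yue-replicate"]


print(route(Request(language="ja", latency_sensitive=False)))
print(route(Request(language="en")))
```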
|
| 245 |
+
|
| 246 |
+
---
|
| 247 |
+
|
| 248 |
+
## References
|
| 249 |
+
|
| 250 |
+
- GitHub repo: <https://github.com/multimodal-art-projection/YuE>
|
| 251 |
+
- Paper (arXiv): <https://arxiv.org/abs/2503.08638>
|
| 252 |
+
- Paper (HTML): <https://arxiv.org/html/2503.08638v1>
|
| 253 |
+
- OpenReview: <https://openreview.net/forum?id=hZy6YG2Ij8>
|
| 254 |
+
- Project / demos: <https://map-yue.github.io/>
|
| 255 |
+
- HF collection: <https://huggingface.co/collections/m-a-p/yue-6797d55e22990ae89b90a3d6>
|
| 256 |
+
- HF s1 English ICL card: <https://huggingface.co/m-a-p/YuE-s1-7B-anneal-en-icl>
|
| 257 |
+
- Replicate: <https://replicate.com/fofr/yue/api>
|
| 258 |
+
- Replicate cog: <https://github.com/replicate/cog-yue>
|
| 259 |
+
- YuEGP fork: <https://github.com/deepbeepmeep/YuEGP>
|
| 260 |
+
- YuE-exllamav2 fork: <https://github.com/sgsdxzy/YuE-exllamav2>
|
| 261 |
+
- YuE-extend fork: <https://github.com/Mozer/YuE-extend>
|
| 262 |
+
- ComfyUI node: <https://github.com/smthemex/ComfyUI_YuE>
|
| 263 |
+
- GGUF Stage-2: <https://huggingface.co/multimodalart/YuE-s2-1B-general-Q8_0-GGUF>
|
| 264 |
+
- HF X-Codec docs: <https://huggingface.co/docs/transformers/main/model_doc/xcodec>
|
| 265 |
+
- ACE-Step paper (successor-style critique): <https://arxiv.org/abs/2506.00045>
|
| 266 |
+
- WhiteFiber technical summary: <https://www.whitefiber.com/blog/yue-ai-music-generator>
|
| 267 |
+
- HF Space demo (fffiloni): <https://huggingface.co/spaces/fffiloni/YuE>
|
| 268 |
+
- HF Space demo (innova-ai): <https://huggingface.co/spaces/innova-ai/YuE-music-generator-demo>
|
|
@@ -0,0 +1,138 @@
| 1 |
+
# DiffRhythm and DiffRhythm 2: Deep Technical Review
|
| 2 |
+
|
| 3 |
+
*Compiled 2026-05-18. All claims cited; speculation flagged inline.*
|
| 4 |
+
|
| 5 |
+
## 1. Overview
|
| 6 |
+
|
| 7 |
+
DiffRhythm is the first open-source **latent-diffusion full-song generator** (vocals + accompaniment, end-to-end, from lyrics and a style prompt), built by the **Audio, Speech and Language Processing (ASLP) Lab at Northwestern Polytechnical University (NWPU)** in Xi'an, China, with later contributions from **Xiaomi Research** ([arxiv.org/abs/2503.01183](https://arxiv.org/abs/2503.01183), [github.com/ASLP-lab/DiffRhythm](https://github.com/ASLP-lab/DiffRhythm)). DiffRhythm v1 dropped on **arXiv 3 Mar 2025**; the full 4m45s variant followed on **15 Mar 2025**, and an iterative v1.2 fixed repetition and audio-quality issues mid-2025 ([HF v1.2 commit](https://huggingface.co/spaces/ASLP-lab/DiffRhythm/commit/f5b749d65f62e30bdaad11e6866edc8d3b078b71)). **DiffRhythm 2** appeared on **arXiv 27 Oct 2025** (v3 revised 3 Feb 2026) under [arxiv.org/abs/2510.22950](https://arxiv.org/abs/2510.22950), and was open-sourced at [github.com/ASLP-lab/DiffRhythm2](https://github.com/ASLP-lab/DiffRhythm2) (forked from `xiaomi-research/diffrhythm2`) on **30 Oct 2025**, with HuggingFace weights at [huggingface.co/ASLP-lab/DiffRhythm2](https://huggingface.co/ASLP-lab/DiffRhythm2). The series is the leading **diffusion-side** alternative to the LLM-style approach taken by Suno, YuE, and SongBloom.
|
| 8 |
+
|
| 9 |
+
## 2. Architecture
|
| 10 |
+
|
| 11 |
+
DiffRhythm v1 is a **non-autoregressive (NAR) latent diffusion** model with two pieces: a music **VAE** that compresses raw 44.1 kHz stereo audio into a latent grid, and a **DiT** (Diffusion Transformer) that denoises that grid conditioned on lyrics + style ([nzqian.github.io/DiffRhythm](https://nzqian.github.io/DiffRhythm/)). The DiT uses **16 LLaMA-style decoder layers, 2048 hidden dim, 32 heads × 64 dim, totaling ~1.1B parameters** ([arxiv.org/html/2503.01183](https://arxiv.org/html/2503.01183v1)). Vocals and accompaniment are produced **jointly in a single latent stream**, not dual-track, which is what makes it "embarrassingly simple" vs. cascaded systems. Lyric conditioning is **sentence-level via LRC (timestamped) phonemes**, with the diffusion model expected to align internally; style is conditioned either via a reference audio embedding or a text prompt. Inference uses a **32-step Euler ODE with CFG scale 4** and 20% dropout on both conditions during training to enable CFG ([diffrhythm.us](https://diffrhythm.us/)).
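The "32-step Euler ODE with CFG scale 4" sampler can be illustrated with a toy scalar example. This is a minimal sketch, not DiffRhythm's code: the velocity field below is a hand-written stand-in for the learned DiT, the combination rule `v_uncond + scale * (v_cond - v_uncond)` is the common classifier-free guidance formula (assumed here, not confirmed from the paper), and all function names are hypothetical.

```python
def cfg_velocity(v_cond: float, v_uncond: float, scale: float) -> float:
    """Classifier-free guidance: extrapolate from the unconditional prediction."""
    return v_uncond + scale * (v_cond - v_uncond)


def euler_sample(x0: float, velocity, steps: int = 32) -> float:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity(x, t)
    return x


def toy_velocity(x: float, t: float, target: float) -> float:
    """Stand-in for the model: a field that carries x to `target` at t=1."""
    return (target - x) / (1.0 - t)


def guided(x: float, t: float) -> float:
    v_cond = toy_velocity(x, t, target=1.0)    # "with lyrics/style" prediction
    v_uncond = toy_velocity(x, t, target=0.0)  # "conditions dropped" prediction
    return cfg_velocity(v_cond, v_uncond, scale=4.0)


sample = euler_sample(0.5, guided, steps=32)
print(sample)
```

In this toy setup the guided sample ends beyond the conditional target, which is the point of a scale above 1: guidance pushes the result further in the direction the conditions indicate. The 20% condition dropout during training is what gives a real model a usable unconditional prediction.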
|
| 12 |
+
|
| 13 |
+
**DiffRhythm 2** replaces the pure-NAR DiT with a **semi-autoregressive block flow-matching** transformer: the latent sequence is sliced into **blocks of 10 frames (2s at 5 Hz)**, and "each block is generated with flow matching, while the dependency across blocks is handled autoregressively" ([alphaxiv.org/overview/2510.22950v3](https://www.alphaxiv.org/overview/2510.22950v3); quoted via search snippet). This is the key innovation: it preserves NAR-style fast within-block parallelism while letting the model attend to prior blocks for **structural coherence** (verse → chorus → verse) and **lyric alignment without any external aligner**. The audio codec is a new **music VAE at 5 Hz frame rate** (vs. the much higher rates of EnCodec/DAC) with a **170M-param decoder**, enabling 210s of latent context to fit on a single GPU ([arxiv abs](https://arxiv.org/abs/2510.22950)). The full DiT is **~1B parameters**. Two new training objectives appear: **Stochastic Block Representation Alignment (REPA) loss** to align hidden states of clean vs. noisy blocks (improves musicality/structure), and **Cross-Pair Preference Optimization**, an RLHF variant that groups the four preference dimensions (musicality, style similarity, lyric alignment, audio quality) into pairs to dodge the merging-induced regression that plain DPO causes. **Max song length: 210 s** in v2 vs. **4m45s (~285 s)** in v1-full ([github.com/ASLP-lab/DiffRhythm](https://github.com/ASLP-lab/DiffRhythm)).
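The block arithmetic and the semi-autoregressive control flow can be sketched as follows. This only mirrors the numbers quoted above (5 Hz latents, 10-frame blocks, 210 s maximum); `denoise_block` is a hypothetical stand-in for the per-block flow-matching sampler, not a function from the DiffRhythm 2 repo.

```python
FRAME_RATE_HZ = 5   # latent frames per second, as reported for the v2 VAE
BLOCK_FRAMES = 10   # frames per block, i.e. 2 s of audio


def block_layout(duration_s: float) -> tuple[int, int]:
    """Return (latent frames, number of blocks) for a song of the given length."""
    frames = int(duration_s * FRAME_RATE_HZ)
    blocks = -(-frames // BLOCK_FRAMES)  # ceiling division
    return frames, blocks


def generate_semi_ar(num_blocks: int, denoise_block) -> list:
    """Generate blocks one at a time; each block sees all previously finished blocks."""
    history = []
    for _ in range(num_blocks):
        # Within a block, all frames are produced in parallel by the sampler.
        block = denoise_block(history)
        history.append(block)
    return history


frames, blocks = block_layout(210)
print(frames, blocks)
```

Usage with a placeholder sampler, e.g. `generate_semi_ar(blocks, lambda history: len(history))`, shows the dependency pattern: the outer loop is sequential across blocks, while parallelism lives inside each `denoise_block` call.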
|
| 14 |
+
|
| 15 |
+
## 3. Variants and sizes
|
| 16 |
+
|
| 17 |
+
| Checkpoint | Duration | DiT params | Notes | Source |
|
| 18 |
+
|---|---|---|---|---|
|
| 19 |
+
| `DiffRhythm-base` | 1m35s | ~1.1B | Original Mar 2025 | [HF](https://huggingface.co/ASLP-lab/DiffRhythm-base) |
|
| 20 |
+
| `DiffRhythm-full` | 4m45s | ~1.1B | Released 15 Mar 2025 | [HF](https://huggingface.co/ASLP-lab/DiffRhythm-full) |
|
| 21 |
+
| `DiffRhythm-vae` | n/a | n/a | Shared audio VAE | [HF](https://huggingface.co/ASLP-lab/DiffRhythm-vae) |
|
| 22 |
+
| `DiffRhythm-1_2-base` | 1m35s | ~1.1B | v1.2 quality fix | [GH README](https://github.com/ASLP-lab/DiffRhythm) |
|
| 23 |
+
| `DiffRhythm-1_2-full` | 4m45s | ~1.1B | v1.2, text-style + instrumental | [HF](https://huggingface.co/ASLP-lab/DiffRhythm-1_2-full) |
|
| 24 |
+
| `DiffRhythm+` (paper) | full | ~1.1B | Adds DPO; not headlined as separate checkpoint | [arxiv 2507.12890](https://arxiv.org/html/2507.12890v2) |
|
| 25 |
+
| `DiffRhythm2` | 210 s | ~1B DiT + 170M VAE-dec | Block flow matching | [HF](https://huggingface.co/ASLP-lab/DiffRhythm2) |
|
| 26 |
+
|
| 27 |
+
(Speculation: I did not find an explicit param count posted for v2's DiT; the **~1B figure comes from a paper-extraction snippet** and aligns with v1's ~1.1B body. Treat as approximate.)
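As a sanity check on the ~1.1B figure for v1, a back-of-the-envelope count from the reported shape (16 layers, hidden size 2048) lands in the same range. The sketch assumes a LLaMA-style layer with four attention projections and a gated MLP whose intermediate size is 4× the hidden size; that MLP width is an assumption, and embeddings, norms, and conditioning modules are ignored.

```python
def llama_style_layer_params(hidden: int, mlp_ratio: int = 4) -> int:
    """Approximate weights in one decoder layer (biases and norms ignored)."""
    attention = 4 * hidden * hidden               # Q, K, V, and output projections
    gated_mlp = 3 * hidden * (mlp_ratio * hidden) # gate, up, and down projections
    return attention + gated_mlp


def dit_body_params(layers: int, hidden: int, mlp_ratio: int = 4) -> int:
    return layers * llama_style_layer_params(hidden, mlp_ratio)


estimate = dit_body_params(layers=16, hidden=2048)
print(f"{estimate / 1e9:.2f}B parameters in the transformer body")
```

The gap between this estimate and the quoted ~1.1B would be covered by the parts left out of the count, assuming the MLP-width guess is close.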
|
| 28 |
+
|
| 29 |
+
## 4. License
|
| 30 |
+
|
| 31 |
+
**Apache 2.0** for both code and DiT weights, declared on the v1 GitHub README and reaffirmed on the v2 README ([github.com/ASLP-lab/DiffRhythm](https://github.com/ASLP-lab/DiffRhythm), [github.com/ASLP-lab/DiffRhythm2](https://github.com/ASLP-lab/DiffRhythm2)). **Commercial use is permitted** with attribution. The v2 model card adds a **non-binding ethical disclaimer** asking users to verify originality, disclose AI involvement, and respect stylistic copyright; this is a notice, not an enforceable license restriction ([HF model card](https://huggingface.co/ASLP-lab/DiffRhythm2)).
|
| 32 |
+
|
| 33 |
+
## 5. Languages supported
|
| 34 |
+
|
| 35 |
+
Training is heavily **bilingual (Mandarin + English)**: v2's dataset is reported as **Chinese : English : Instrumental ≈ 4 : 5 : 1** ([alphaXiv extract](https://www.alphaxiv.org/overview/2510.22950v3)). The v1 README and several mirrors claim **cross-lingual capability** for Japanese, Korean, Spanish ([diffrhythm.us](https://diffrhythm.us/), [diffrhythm.ai](https://diffrhythmai.com/)), but these are demo-site marketing claims, **not benchmarked in the paper**. Verdict: production-safe for **EN and ZH**; treat JP/KR/ES as best-effort. Phoneme front-end is **espeak-ng**, which itself supports 100+ languages ([HF model card](https://huggingface.co/ASLP-lab/DiffRhythm2)).
|
| 36 |
+
|
| 37 |
+
## 6. Quality assessment
|
| 38 |
+
|
| 39 |
+
**Objective (v2 paper; lower is better for PER and RTF, higher is better for Mulan-T):**
|
| 40 |
+
|
| 41 |
+
| Metric | DiffRhythm 2 | DiffRhythm+ | ACE-Step | LeVo |
|
| 42 |
+
|---|---|---|---|---|
|
| 43 |
+
| PER (lyric alignment) ↓ | **0.13** | 0.15 | 0.23 | 0.19 |
|
| 44 |
+
| Mulan-T (style match) ↑ | **0.40** | 0.25 | 0.28 | 0.35 |
|
| 45 |
+
| RTF (speed) ↓ | 0.213 | 0.153 | 0.127 | 1.225 |
|
| 46 |
+
|
| 47 |
+
So v2 has **best-in-open-source lyric alignment and style match**, slightly slower than v1+/ACE-Step but ~6× faster than LeVo ([arxiv 2510.22950](https://arxiv.org/abs/2510.22950)).
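The comparisons drawn from the table can be recomputed directly. The sketch below copies the table values and assumes the usual definition of RTF (processing time divided by audio duration); the helper names are illustrative.

```python
TABLE = {
    "DiffRhythm 2": {"per": 0.13, "mulan_t": 0.40, "rtf": 0.213},
    "DiffRhythm+":  {"per": 0.15, "mulan_t": 0.25, "rtf": 0.153},
    "ACE-Step":     {"per": 0.23, "mulan_t": 0.28, "rtf": 0.127},
    "LeVo":         {"per": 0.19, "mulan_t": 0.35, "rtf": 1.225},
}


def wall_clock_s(system: str, audio_s: float) -> float:
    """Estimated generation time, assuming RTF = compute time / audio duration."""
    return TABLE[system]["rtf"] * audio_s


def speedup(fast: str, slow: str) -> float:
    """How many times faster `fast` is than `slow`, by RTF."""
    return TABLE[slow]["rtf"] / TABLE[fast]["rtf"]


def relative_change(metric: str, new: str, old: str) -> float:
    """Signed fractional change of a metric going from `old` to `new`."""
    return (TABLE[new][metric] - TABLE[old][metric]) / TABLE[old][metric]


print(f"210 s song on DiffRhythm 2: {wall_clock_s('DiffRhythm 2', 210):.1f} s")
print(f"speedup vs LeVo: {speedup('DiffRhythm 2', 'LeVo'):.2f}x")
print(f"PER change vs DiffRhythm+: {relative_change('per', 'DiffRhythm 2', 'DiffRhythm+'):+.0%}")
print(f"Mulan-T change vs DiffRhythm+: {relative_change('mulan_t', 'DiffRhythm 2', 'DiffRhythm+'):+.0%}")
```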
|
| 48 |
+
|
| 49 |
+
**Subjective:** v2 is the strongest open model by MOS in the paper's own user study, **but the authors explicitly state "in aspects such as musicality, it still shows a clear gap compared to commercial systems like SUNO V4.5"** ([arxiv 2510.22950](https://arxiv.org/abs/2510.22950)). The **block flow-matching does close the structural-coherence gap** that the original Hacker News thread criticized v1 for: multiple HN commenters complained "there's no identifiable chorus in any of the demo songs" and rhythm was unstable ([news.ycombinator.com/item?id=43255467](https://news.ycombinator.com/item?id=43255467)). v2 demos show real verse/chorus structure ([aslp-lab.github.io/DiffRhythm2.github.io](https://aslp-lab.github.io/DiffRhythm2.github.io/)). Specific Reddit reception threads in r/LocalLLaMA/r/StableDiffusion were not surfaced by search (low signal).
|
| 50 |
+
|
| 51 |
+
## 7. Inference performance
|
| 52 |
+
|
| 53 |
+
- v1-full: **~10 s for a 4m45s song on a single RTX 4090** (claimed in paper abstract, [arxiv 2503.01183](https://arxiv.org/abs/2503.01183)) with 32 ODE steps. Real-world ComfyUI users report **~62 s for 4 min** on consumer GPUs ([comfyui.org](https://comfyui.org/en/generate-music-with-comfyui-diffrhythm)).
|
| 54 |
+
- **VRAM:** DiffRhythm-base needs ≥ **8 GB** with `--chunked`; full needs **24 GB** for headroom ([chutes.ai docs](https://chutes.ai/docs/examples/music-generation)).
|
| 55 |
+
- v2: **RTF 0.213 on RTX 4090**, i.e. ~45 s for a 210 s song ([arxiv 2510.22950](https://arxiv.org/abs/2510.22950)).
|
| 56 |
+
- **Apple Silicon / MPS:** The v1 README claims Apple Silicon is "supported as of March 2025" but the GitHub issues list does not surface dedicated MPS benchmarks, and the Pinokio launcher ([github.com/pinokiofactory/diffrhythm](https://github.com/pinokiofactory/diffrhythm)) does not advertise macOS in its description. **No published M3/M4/M5 numbers exist.** Speculation: on the user's **M5 Max with 128 GB unified memory**, v1-full should run via `PYTORCH_ENABLE_MPS_FALLBACK=1`, likely 3–5× slower than a 4090; this needs hands-on validation. v2 is newer and has not been tested on MPS publicly.
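For the hands-on MPS test, a small device-selection helper is a reasonable starting point. This is a sketch under assumptions: it sets `PYTORCH_ENABLE_MPS_FALLBACK` before importing PyTorch (to be safe about when the variable is read), and falls back to CPU if PyTorch is not installed. `pick_device` and `detect_device` are hypothetical helper names, not part of DiffRhythm.

```python
import os

# Ask PyTorch to run ops that lack an MPS kernel on the CPU instead of failing.
# Set before importing torch so the setting is in place when the library loads.
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")


def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple's Metal backend, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Query PyTorch for available backends; report 'cpu' if torch is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return pick_device(torch.cuda.is_available(), torch.backends.mps.is_available())


print(detect_device())
```

Ops that silently fall back to CPU are a likely source of the slowdown estimated above, so timing a full generation run is more informative than checking that the model merely loads.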
|
| 57 |
+
|
| 58 |
+
## 8. DiffRhythm 2 specifics
|
| 59 |
+
|
| 60 |
+
What changed from v1 → v2 ([arxiv 2510.22950](https://arxiv.org/abs/2510.22950), [alphaxiv overview](https://www.alphaxiv.org/overview/2510.22950v3)):
|
| 61 |
+
|
| 62 |
+
1. **Architecture shift:** pure NAR DiT → **semi-AR block flow-matching** (2 s blocks).
|
| 63 |
+
2. **New 5 Hz music VAE** (vs. v1's higher-rate codec); enables 210 s context within budget.
|
| 64 |
+
3. **Stochastic Block REPA loss:** aligns clean vs. noisy hidden states → better musicality + structure.
|
| 65 |
+
4. **Cross-Pair Preference Optimization:** four-dim RLHF without the model-merging regression that plain DPO causes.
|
| 66 |
+
5. **Dataset scaling:** **~1.4 M songs / ~70,000 hours**, with a **20 k-hour high-quality subset** for SFT and **40 k preference pairs** for DPO, a step-change from v1's undisclosed-but-smaller corpus.
|
| 67 |
+
6. **Lyric alignment without external constraints:** v1 needed LRC timestamps; v2 learns alignment end-to-end via the AR block dependency.
|
| 68 |
+
7. **Quality numbers (paper):** PER **0.15 → 0.13**, Mulan-T **0.25 → 0.40** vs. DiffRhythm+, i.e. **lyric error reduced ~13 % and style match improved by 60 %**.
|
| 69 |
+
|
| 70 |
+
## 9. Repo health
|
| 71 |
+
|
| 72 |
+
- **DiffRhythm v1:** ~**2.2–2.3 k stars**, **268 forks**, active through 2025, last major release Mar 2025 ([github.com/ASLP-lab/DiffRhythm](https://github.com/ASLP-lab/DiffRhythm)).
|
| 73 |
+
- **DiffRhythm 2:** **157 stars / 11 forks / 27 commits** as of late Oct 2025; a young repo, recently pushed ([github.com/ASLP-lab/DiffRhythm2](https://github.com/ASLP-lab/DiffRhythm2)).
|
| 74 |
+
- Training/fine-tuning scripts: **"Coming soon"** is the status on v1; community has filed [Issue #46](https://github.com/ASLP-lab/DiffRhythm/issues/46) asking for fine-tuning docs. v2 ships **inference only** in the public repo as of writing.
|
| 75 |
+
|
| 76 |
+
## 10. Real-world adoption
|
| 77 |
+
|
| 78 |
+
- **ComfyUI:** [billwuhao/ComfyUI_DiffRhythm](https://github.com/billwuhao/ComfyUI_DiffRhythm): 153 stars, supports v1.2 + full, includes bilingual subtitle gen ([runcomfy.com node](https://www.runcomfy.com/comfyui-nodes/ComfyUI_DiffRhythm)).
|
| 79 |
+
- **Pinokio:** [pinokiofactory/diffrhythm](https://github.com/pinokiofactory/diffrhythm): 19 stars, 69 commits, one-click installer.
|
| 80 |
+
- **Chutes.ai:** Public serverless endpoint for DiffRhythm-full ([chutes.ai/docs/examples/music-generation](https://chutes.ai/docs/examples/music-generation)).
|
| 81 |
+
- **Replicate:** No first-party DiffRhythm 2 model found in search; a gap in the ecosystem (speculation).
|
| 82 |
+
- Multiple unofficial web frontends: diffrhythm.com, diffrhythm.us, diffrhythm.ai, diffrhythmai.com; quality and origin unverified, likely wrappers over the HF Space.
|
| 83 |
+
|
| 84 |
+
## 11. Fine-tuning
|
| 85 |
+
|
| 86 |
+
The official answer is **none yet**. The v1 repo's training code is listed as "Coming soon," and v2 only ships inference. There is no LoRA support, no published fine-tuning recipe, and no `transformers`/`diffusers` integration as of May 2026. A community workaround would require reverse-engineering the DiT class, which is non-trivial for a 1 B-param flow-matching model. **For the user's Suno-clone platform, fine-tuning DiffRhythm today means forking + writing your own training loop.** This is the single biggest practical weakness.
|
| 87 |
+
|
| 88 |
+
## 12. Pros and cons
|
| 89 |
+
|
| 90 |
+
**Pros**
|
| 91 |
+
- Permissive **Apache 2.0** for code + weights; a clean commercial path.
|
| 92 |
+
- **Fastest open full-song model** (~10 s for 4 min on a 4090; v2's block-FM is competitive even with AR-like coherence).
|
| 93 |
+
- v2 has **state-of-the-art lyric alignment (PER 0.13)** in open source.
|
| 94 |
+
- Lightweight: 8 GB VRAM possible with chunking; runs on consumer GPUs.
|
| 95 |
+
- Strong ecosystem: ComfyUI nodes, Pinokio installer, Chutes serverless.
|
| 96 |
+
- v2's block flow-matching meaningfully **closes the structural-coherence gap** that doomed v1 demos on HN.
|
| 97 |
+
|
| 98 |
+
**Cons**
|
| 99 |
+
- Still a **clear musicality gap vs. Suno v4.5** (authors admit it; [arxiv 2510.22950](https://arxiv.org/abs/2510.22950)).
|
| 100 |
+
- **No fine-tuning / LoRA path**; training code unreleased.
|
| 101 |
+
- v2's max length is **210 s** (3m30s), *shorter* than v1-full's 4m45s, a regression for radio-length pop.
|
| 102 |
+
- Multilingual claims (JP/KR/ES) are **unbenchmarked**; only EN/ZH have paper-backed quality.
|
| 103 |
+
- **No published MPS benchmarks** for Apple Silicon; v2 untested on Mac.
|
| 104 |
+
- Demo-site proliferation (`diffrhythm.us`, etc.) muddies the brand, which is confusing for product positioning.
|
| 105 |
+
- License disclaimer adds soft ethical obligations re. copyright that legal review may flag.
|
| 106 |
+
|
| 107 |
+
## 13. Verdict for the user's platform
|
| 108 |
+
|
| 109 |
+
For a Suno-style platform on an **M5 Max (128 GB unified, MPS)**, DiffRhythm 2 is the **best diffusion-side open option in May 2026**, *but* it should be paired with an **AR-style backup** (YuE / SongBloom / LeVo) covering its weak points.
|
| 110 |
+
|
| 111 |
+
**Where DiffRhythm 2 wins:**
|
| 112 |
+
- Fast, cheap inference per song; viable for high-throughput web generation.
|
| 113 |
+
- Best-in-open lyric intelligibility; critical for a karaoke / lyrics-first UX.
|
| 114 |
+
- Stereo 44.1 kHz output out of the box.
|
| 115 |
+
- Apache-2.0 + commercial freedom.
|
| 116 |
+
|
| 117 |
+
**Where it underperforms:**
|
| 118 |
+
- **Pop musicality, hook quality, vocal timbre** are still below Suno v4.5; premium-tier output is not there.
|
| 119 |
+
- **No fine-tuning** means you cannot specialize on a target sound or your platform's curated catalog without doing R&D.
|
| 120 |
+
- **210 s ceiling on v2** limits "full album track" formats; you'd fall back to v1-full (4m45s) at a quality cost.
|
| 121 |
+
- **MPS path is unproven**: the user should plan a same-week feasibility test on the M5 Max before committing v2 to the inference layer; CUDA cloud (Chutes / a 4090 server) is the safer near-term backend.
|
| 122 |
+
|
| 123 |
+
**Recommended posture:** ship v2 as the default *fast* generator behind a feature flag, keep v1.2-full for >3.5 min songs, evaluate Suno / YuE / SongBloom as quality-tier alternatives, and track the v2 repo for an eventual training-code release that would unlock fine-tuning on your platform's data.
|
| 124 |
+
|
| 125 |
+
---
|
| 126 |
+
|
| 127 |
+
### Primary sources
|
| 128 |
+
- [DiffRhythm 2 paper (arxiv 2510.22950)](https://arxiv.org/abs/2510.22950)
|
| 129 |
+
- [DiffRhythm v1 paper (arxiv 2503.01183)](https://arxiv.org/abs/2503.01183)
|
| 130 |
+
- [DiffRhythm v1 GitHub](https://github.com/ASLP-lab/DiffRhythm)
|
| 131 |
+
- [DiffRhythm 2 GitHub](https://github.com/ASLP-lab/DiffRhythm2)
|
| 132 |
+
- [DiffRhythm 2 HF model card](https://huggingface.co/ASLP-lab/DiffRhythm2)
|
| 133 |
+
- [alphaXiv overview v3](https://www.alphaxiv.org/overview/2510.22950v3)
|
| 134 |
+
- [HN thread on v1](https://news.ycombinator.com/item?id=43255467)
|
| 135 |
+
- [ComfyUI_DiffRhythm](https://github.com/billwuhao/ComfyUI_DiffRhythm)
|
| 136 |
+
- [Pinokio DiffRhythm](https://github.com/pinokiofactory/diffrhythm)
|
| 137 |
+
- [Chutes serving docs](https://chutes.ai/docs/examples/music-generation)
|
| 138 |
+
- [DiffRhythm+ paper (arxiv 2507.12890)](https://arxiv.org/html/2507.12890v2)
|
|
@@ -0,0 +1,224 @@
| 1 |
+
# ACE-Step: Deep Technical Report
|
| 2 |
+
|
| 3 |
+
*Researched 2026-05-18 for a Suno-like platform build on M5 Max (128 GB unified) / MPS.*
|
| 4 |
+
|
| 5 |
+
---
|
| 6 |
+
|
| 7 |
+
## 1. Overview
|
| 8 |
+
|
| 9 |
+
ACE-Step is a foundation model for music generation jointly built by **ACE Studio** (the consumer music-tech outfit behind ACE Studio's vocal synth) and **StepFun** ("Step-AI"), a Beijing-based foundation-model lab. Core authors: Junmin Gong, Sean Zhao, Sen Wang, Shengyuan Xu, Joe Guo ([ace-step.github.io](https://ace-step.github.io/)).
|
| 10 |
+
|
| 11 |
+
Release timeline:
|
| 12 |
+
- **v1 (3.5B):** open-sourced May 2025; technical report posted on arXiv on 2 Jun 2025 as 2506.00045 ([arxiv.org/abs/2506.00045](https://arxiv.org/abs/2506.00045)).
|
| 13 |
+
- **v1.5:** released **28 Jan 2026** as a separate repo, [`ace-step/ACE-Step-1.5`](https://github.com/ace-step/ACE-Step-1.5). Adds a hybrid Language-Model + Diffusion-Transformer planner.
|
| 14 |
+
- **XL series (4B DiT decoder):** released 2 Apr 2026 as a higher-quality variant inside the v1.5 family.
|
| 15 |
+
- **Latest tag:** v0.1.7 on 24 Apr 2026 ([ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)).
|
| 16 |
+
- **v2:** **no public roadmap or announcement** as of 18 May 2026.
|
| 17 |
+
|
| 18 |
+
Current status: actively maintained, 10.4k stars on the v1.5 repo and 4.5k on the original v1 repo, with a thriving ComfyUI ecosystem and third-party UIs ([ace-step/ACE-Step](https://github.com/ace-step/ACE-Step), [ace-step/ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)).
|
| 19 |
+
|
| 20 |
+
---
|
| 21 |
+
|
| 22 |
+
## 2. Architecture
|
| 23 |
+
|
| 24 |
+
**v1 (3.5B):** a hybrid that fuses three pieces (per the paper, [arxiv.org/abs/2506.00045](https://arxiv.org/abs/2506.00045)):
|
| 25 |
+
1. **Sana Deep Compression AutoEncoder (DCAE):** high-compression audio latent space borrowed from NVIDIA's Sana image work.
|
| 26 |
+
2. **Lightweight linear transformer:** the diffusion backbone, deliberately linear-attention to keep RTF low.
|
| 27 |
+
3. **Diffusion training** with **MERT + m-HuBERT** providing semantic-alignment supervision (REPA-style) during training so latents stay musically coherent.
|
| 28 |
+
|
| 29 |
+
This sits between LLM-token approaches (Suno/YuE, slow but lyric-tight) and pure diffusion (DiffRhythm, fast but structurally weak). The design goal stated in the paper is "a fast, general-purpose, efficient yet flexible architecture"; explicitly a *foundation model*, not just a text-to-song pipeline ([arxiv.org/abs/2506.00045](https://arxiv.org/abs/2506.00045)).
|
| 30 |
+
|
| 31 |
+
**v1.5:** a hybrid **LM-as-planner + Diffusion-Transformer (DiT)**. A small Qwen3-based LM (0.6B / 1.7B / 4B) turns the user prompt into a structured "song blueprint" (sections, key, bpm, lyrics, vocal style) which the DiT (2B standard or 4B XL) decodes into audio. This brings chain-of-thought reasoning to music structure, lifting long-range coherence, Suno's main historic advantage ([ACE-Step-1.5 README](https://github.com/ace-step/ACE-Step-1.5)).
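The planner/decoder split can be pictured as a two-stage pipeline with a structured hand-off in between. The sketch below is purely illustrative: `SongBlueprint`, its field names, and the `plan_then_decode` function are hypothetical and only mirror the blueprint contents listed above; the real ACE-Step 1.5 interfaces may look quite different.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Section:
    name: str          # e.g. "verse", "chorus"
    lyrics: str = ""


@dataclass
class SongBlueprint:
    key: str
    bpm: int
    vocal_style: str
    sections: list[Section] = field(default_factory=list)

    def section_names(self) -> list[str]:
        return [s.name for s in self.sections]


def plan_then_decode(
    prompt: str,
    planner: Callable[[str], SongBlueprint],   # stands in for the LM planner
    decoder: Callable[[SongBlueprint], bytes], # stands in for the DiT decoder
) -> bytes:
    """Stage 1 turns free text into a plan; stage 2 renders the plan to audio."""
    blueprint = planner(prompt)
    return decoder(blueprint)


# Stub stages so the control flow can be exercised without any model.
def stub_planner(prompt: str) -> SongBlueprint:
    return SongBlueprint(key="A minor", bpm=96, vocal_style="soft female",
                         sections=[Section("verse", prompt), Section("chorus")])


def stub_decoder(bp: SongBlueprint) -> bytes:
    return ",".join(bp.section_names()).encode()


print(plan_then_decode("rainy night in the city", stub_planner, stub_decoder))
```

The design point is that the blueprint is an inspectable intermediate: a platform could log, edit, or validate it before spending decoder time.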
|
| 32 |
+
|
| 33 |
+
**Parameter counts:**
|
| 34 |
+
| Variant | DiT | LM planner | Total |
|
| 35 |
+
|---|---|---|---|
|
| 36 |
+
| v1-3.5B | 3.5B (DiT only) | n/a | 3.5B |
|
| 37 |
+
| v1.5 standard | 2B | 0.6B / 1.7B | ~2.6–3.7B |
|
| 38 |
+
| v1.5 XL | 4B | up to 4B | up to 8B |
|
| 39 |
+
|
| 40 |
+
---
|
| 41 |
+
|
| 42 |
+
## 3. Variants and checkpoints
|
| 43 |
+
|
| 44 |
+
All on Hugging Face under the `ACE-Step/` org ([ACE-Step org on HF](https://huggingface.co/ACE-Step)):
|
| 45 |
+
- `ACE-Step-v1-3.5B`: the original generalist model ([HF card](https://huggingface.co/ACE-Step/ACE-Step-v1-3.5B)).
|
| 46 |
+
- `ACE-Step-v1-chinese-rap-LoRA` ("RapMachine"): genre-specific LoRA.
|
| 47 |
+
- **LoRA family** shipped by the team: `RapMachine`, `Lyric2Vocal` (vocal-only stem from lyrics), `Text2Samples` (instrumental loops/samples) ([ace-step.github.io](https://ace-step.github.io/)).
|
| 48 |
+
- **v1.5 DiT checkpoints:** 2B standard and 4B XL.
|
| 49 |
+
- **v1.5 LM planners:** 0.6B, 1.7B, 4B.
|
| 50 |
+
- A public **Space demo** at [huggingface.co/spaces/ACE-Step/ACE-Step](https://huggingface.co/spaces/ACE-Step/ACE-Step).
|
| 51 |
+
|
| 52 |
+
No v2 checkpoint exists yet.
|
| 53 |
+
|
| 54 |
+
---
|
| 55 |
+
|
| 56 |
+
## 4. License
|
| 57 |
+
|
| 58 |
+
**Apache 2.0** for v1 ([ace-step/ACE-Step](https://github.com/ace-step/ACE-Step)) and **MIT** for v1.5 ([ace-step/ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)). Both are unambiguously **commercial-use-permitted, royalty-free**. This is the single biggest licensing advantage over Suno/Udio and even over YuE (which carries non-commercial clauses in parts of its weights chain).

---

## 5. Vocal support: CRITICAL VERIFICATION

**Verdict: YES. ACE-Step generates vocals natively. The "instrumental-only" claim circulating in some reviews is wrong (likely conflating it with the `Text2Samples` LoRA or with DiffRhythm).**

Evidence:
- The **v1 HF model card** describes the model as full-song (vocals + instruments) with the explicit caveat: *"Coarse vocal synthesis lacking nuance"* and *"Rare instruments may not render perfectly"* ([HF card](https://huggingface.co/ACE-Step/ACE-Step-v1-3.5B)).
- The paper claims **lyric alignment across melody/harmony/rhythm metrics**, which is only meaningful for sung vocals ([arxiv.org/abs/2506.00045](https://arxiv.org/abs/2506.00045)).
- The ComfyUI native node `TextEncodeAceStepAudio` accepts lyrics with `[verse] [chorus] [bridge]` structural tags ([comfyui-wiki guide](https://comfyui-wiki.com/en/tutorial/advanced/audio/ace-step/ace-step-v1)).
- `Lyric2Vocal` LoRA exists *because* the base model already does vocals; the LoRA isolates the vocal stem ([ace-step.github.io](https://ace-step.github.io/)).
- A blind-listening review of 50 participants scored ACE-Step v1.5 **4.4/5 on SongEval Vocal vs Suno v4 at 4.1/5** ([fm9.ai/ace-step/vs-suno](https://fm9.ai/ace-step/vs-suno)).

**Quality reality check:** v1 vocals are admitted to be "coarse"; v1.5 markedly improves vocal clarity and now beats Suno v4 in blind tests on naturalness for folk/classical/jazz, while Suno still wins on "radio-ready polish" for pop/EDM ([fm9.ai/ace-step/vs-suno](https://fm9.ai/ace-step/vs-suno)).
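
The `[verse] [chorus] [bridge]` tags mentioned in the evidence list imply a simple line-oriented lyric format. As a rough illustration only, here is a small parser that splits tagged lyrics into sections. The exact grammar ACE-Step or the ComfyUI node accepts is not specified in the sources above, so the one-tag-per-line rule, the `split_sections` helper, and the sample lyrics are all assumptions:

```python
import re

# Assumed format: a tag such as "[verse]" sits alone on its own line.
TAG_RE = re.compile(r"^\[(\w+)\]$")


def split_sections(lyrics: str) -> list[tuple[str, list[str]]]:
    """Group lyric lines under the most recent structural tag."""
    sections: list[tuple[str, list[str]]] = []
    for raw in lyrics.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        match = TAG_RE.match(line)
        if match:
            sections.append((match.group(1).lower(), []))
        elif sections:
            sections[-1][1].append(line)
        else:
            # Lines before any tag are kept under a placeholder label.
            sections.append(("untagged", [line]))
    return sections


EXAMPLE = """[verse]
City lights are fading slow
[chorus]
Sing it back to me
Sing it loud
[bridge]
Hold the line
"""

for tag, lines in split_sections(EXAMPLE):
    print(tag, lines)
```

A pre-check like this can catch lyrics with no structural tags before they are sent to the model.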

---

## 6. Languages supported

- **v1:** 19 languages, with the top 10 (English, Mandarin Chinese, Russian, Spanish, Japanese, German, French, Portuguese, Italian, Korean) performing best ([ace-step/ACE-Step](https://github.com/ace-step/ACE-Step)). Less-represented languages underperform due to training-data imbalance.
- **v1.5:** Expanded to **50+ languages** with lyric control, alongside the planner LM ([ace-step/ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)).

Known weakness from the team itself: Chinese rap was historically weak, motivating the `chinese-rap-LoRA` ([ace-step.github.io](https://ace-step.github.io/)).

---

## 7. Speed claims: verified

The famous claim: *"synthesizes up to 4 minutes of music in just 20 seconds on an A100 GPU, 15× faster than LLM-based baselines"* ([arxiv.org/abs/2506.00045](https://arxiv.org/abs/2506.00045), [ace-step.github.io](https://ace-step.github.io/)). Hardware: **NVIDIA A100 80GB**.

Published RTF table from the v1 HF card ([HF card](https://huggingface.co/ACE-Step/ACE-Step-v1-3.5B)):

| Device | 27 steps RTF | 60 steps RTF |
|---|---|---|
| RTX 4090 | 34.48× | 15.63× |
| A100 | 27.27× | 12.27× |
| RTX 3090 | 12.76× | 6.48× |
| **M2 Max** | **2.27×** | **1.03×** |
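
To turn the table into wall-clock times, the sketch below assumes RTF here means seconds of audio produced per second of compute. That reading is an assumption, but it is consistent with the "4 minutes in 20 seconds on an A100" claim at 60 steps. `wall_clock_seconds` is an illustrative helper:

```python
# (27-step RTF, 60-step RTF) per device, copied from the table above.
RTF_TABLE = {
    "RTX 4090": (34.48, 15.63),
    "A100": (27.27, 12.27),
    "RTX 3090": (12.76, 6.48),
    "M2 Max": (2.27, 1.03),
}


def wall_clock_seconds(audio_seconds: float, rtf: float) -> float:
    """Generation time, assuming RTF = audio seconds per compute second."""
    return audio_seconds / rtf


SONG_SECONDS = 240.0  # a 4-minute song

for device, (rtf_27, rtf_60) in RTF_TABLE.items():
    t27 = wall_clock_seconds(SONG_SECONDS, rtf_27)
    t60 = wall_clock_seconds(SONG_SECONDS, rtf_60)
    print(f"{device:9s} 27 steps: {t27:6.1f} s   60 steps: {t60:6.1f} s")
```

Under this reading the A100 at 60 steps comes out just under 20 s for a 4-minute song, and the M2 Max needs roughly 1.8 to 3.9 minutes depending on step count.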

v1.5 is faster still: *"under 2 seconds per full song on A100 and under 10 seconds on an RTX 3090"* ([ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)).

**Apple-Silicon equivalents** (from the dedicated [clockworksquirrel/ace-step-apple-silicon](https://github.com/clockworksquirrel/ace-step-apple-silicon) port):

| Task | M1 Pro 16 GB | M3 Pro 36 GB | A100 |
|---|---|---|---|
| 30 s turbo | ~45 s | ~25 s | ~2 s |
| 30 s SFT (full) | ~3 min | ~1.5 min | ~8 s |

**M5 Max projection:** The M5 Max's GPU TFLOPS lineage (MPS SGEMM scaled M1 to M4: 1.36 → 2.24 → 2.47 → 2.9 TFLOPS, per [arxiv 2502.05317](https://arxiv.org/html/2502.05317v1)) plus the M5 generation's ~30 % uplift suggests roughly **3.5-4× the throughput of M2 Max**, i.e. an **estimated 8-10× RTF at 27 steps** for v1, and full-song generation in **~30-50 s for a 4-minute song**. No M5-specific public benchmark exists yet.
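
The projection can be checked with a few lines of arithmetic. This is a sketch under the stated assumptions only: the M2 Max RTF figures from the v1 HF card, a 3.5-4× throughput uplift applied linearly, and RTF read as audio seconds per compute second. `project` is an illustrative helper, not a benchmark:

```python
M2_MAX_RTF = {27: 2.27, 60: 1.03}  # steps -> RTF, from the v1 HF card table
UPLIFT_RANGE = (3.5, 4.0)          # assumed M5 Max vs M2 Max throughput ratio
SONG_SECONDS = 240.0               # a 4-minute song


def project(rtf: float,
            uplift_range: tuple[float, float] = UPLIFT_RANGE,
            audio_seconds: float = SONG_SECONDS) -> dict:
    """Scale a measured RTF linearly and convert to wall-clock seconds."""
    low_rtf = rtf * uplift_range[0]
    high_rtf = rtf * uplift_range[1]
    return {
        "rtf": (low_rtf, high_rtf),
        # Higher RTF means less time, so the bounds swap.
        "seconds": (audio_seconds / high_rtf, audio_seconds / low_rtf),
    }


for steps, rtf in M2_MAX_RTF.items():
    p = project(rtf)
    print(f"{steps} steps: RTF {p['rtf'][0]:.1f}-{p['rtf'][1]:.1f}x, "
          f"{p['seconds'][0]:.0f}-{p['seconds'][1]:.0f} s per 4-minute song")
```

Under these assumptions the 27-step case lands at roughly 8-9× RTF and under half a minute per song, while 60 steps takes about a minute. The ~30-50 s figure in the text sits between the two step counts.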

---

## 8. Quality assessment

From the cross-model evaluation summarised in research-aggregator coverage ([researchgate paper page](https://www.researchgate.net/publication/392334894_ACE-Step_A_Step_Towards_Music_Generation_Foundation_Model), [fm9.ai/ace-step/vs-suno](https://fm9.ai/ace-step/vs-suno)):

| Dimension | Leader | Where ACE-Step sits |
|---|---|---|
| Aesthetic quality | Hailuo > DiffRhythm | mid-upper |
| Musicality (coherence) | Suno v3 | competitive, strong on memorability/clarity |
| Style alignment | Udio v1 > Hailuo | 3rd |
| Lyric alignment | Hailuo | strong, beats Suno v3, Udio, YuE |
| **Vocal naturalness (v1.5)** | **ACE-Step 4.4/5** | beats Suno v4 (4.1/5) |
| Speed (RTF) | **ACE-Step 15.63×** | best in class; DiffRhythm 10.03×, YuE 0.083× |

User-facing reception is positive on customisability and speed; the most-cited weakness is "gacha"-style seed sensitivity: re-rolls produce noticeably different outputs ([ace-step.github.io](https://ace-step.github.io/)).

---

## 9. Inference performance & Apple Silicon

- **VRAM (v1):** minimum **8 GB with CPU offload**; comfortable on 12 GB+ ([ace-step/ACE-Step](https://github.com/ace-step/ACE-Step)).
- **VRAM (v1.5):** **<4 GB** for 2B-turbo with offload; **≥12 GB** for XL with offload; **≥20 GB** without offload; **≥24 GB optimal** ([ACE-Step-1.5 README](https://github.com/ace-step/ACE-Step-1.5)).
- **MPS support:** **first-class.** Use `--bf16 false` on M-series to avoid kernel issues ([ace-step/ACE-Step](https://github.com/ace-step/ACE-Step)). The dedicated [clockworksquirrel/ace-step-apple-silicon](https://github.com/clockworksquirrel/ace-step-apple-silicon) fork adds: bfloat16 throughout, MPS-safe pipeline with `torch.mps.empty_cache()` synchronisation, **MLX backend (567 LoC)** that auto-converts the Qwen3 planner LM to MLX with quantisation, and **LoRA training on MPS**.
- **ComfyUI:** **native nodes** ship in upstream ComfyUI (`TextEncodeAceStepAudio` etc.) plus the official [`ace-step/ACE-Step-ComfyUI`](https://github.com/ace-step/ACE-Step-ComfyUI). v1.5 has dedicated workflows (split-LLM and AIO checkpoint variants) on comfy.org ([Purz blog post](https://blog.comfy.org/p/ace-step-15-is-now-available-in-comfyui)).
- **128 GB unified on M5 Max** comfortably fits the full XL stack plus the 4B planner LM with no offload needed; the user's hardware is essentially overkill for ACE-Step.
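
The v1.5 memory thresholds above can be read as a small decision table. The sketch below encodes them as quoted; `suggest_config` and its return strings are hypothetical names for illustration, and treating unified memory on Apple Silicon as equivalent to discrete VRAM is a simplifying assumption:

```python
def suggest_config(memory_gb: float) -> str:
    """Map available GPU (or unified) memory to a v1.5 setup, per the quoted thresholds."""
    if memory_gb >= 24:
        return "XL without offload (optimal)"
    if memory_gb >= 20:
        return "XL without offload"
    if memory_gb >= 12:
        return "XL with offload"
    if memory_gb >= 4:
        return "2B turbo with offload"
    # The README quotes "<4 GB" for 2B turbo, so smaller budgets may still work.
    return "2B turbo with offload (untested margin)"


for budget in (3, 8, 16, 22, 128):
    print(f"{budget:>3} GB -> {suggest_config(budget)}")
```

On the 128 GB M5 Max target this selects the no-offload XL path.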

---

## 10. Repo health

| Repo | Stars | Forks | Last release |
|---|---|---|---|
| `ace-step/ACE-Step` (v1) | 4.5k | 568 | quiet since v1.5 fork |
| `ace-step/ACE-Step-1.5` | **10.4k** | 1.3k | v0.1.7 on 24 Apr 2026 |
| `fspecii/ace-step-ui` (popular community UI) | 3.8k | 561 | active |
| `clockworksquirrel/ace-step-apple-silicon` | n/a (smaller) | n/a | active |

The team also curates [`ace-step/awesome-ace-step`](https://github.com/ace-step/awesome-ace-step). Issue activity, ComfyUI integration cadence, and the LM-planner architectural jump in v1.5 all indicate a project that is healthier and growing faster than YuE or DiffRhythm.

---

## 11. Real-world adoption

- **AMD vendor-backed deployment:** AMD published a blog *"Commercial-grade AI music generation on AMD Ryzen AI processors and Radeon graphics with ACE Step 1.5"* in 2026, explicitly endorsing it for Ryzen AI / Radeon production stacks ([AMD blog](https://www.amd.com/en/blogs/2026/commercial-grade-ai-music-generation-on-amd-ryzen-ai-and-radeon-ace-step-1-5.html)).
- **Third-party SaaS:** `acestep.io` and `ace-step.app` run hosted song-generation services on the open weights ([acestep.io](https://acestep.io/), [ace-step.app](https://ace-step.app/)).
- **Production-grade UI:** `fspecii/ace-step-ui` brands itself as *"the Ultimate Open Source Suno Alternative"* with stem extraction (Demucs), batch generation, library/playlist management, LAN access ([fspecii/ace-step-ui](https://github.com/fspecii/ace-step-ui)).
- HeartMuLa and similar music platforms cite ACE-Step 1.5 in their stack comparisons ([heart-mula.com/ace-step](https://heart-mula.com/ace-step)).

---

## 12. Fine-tuning + LoRA

- **Training code released**; documented in [`TRAIN_INSTRUCTION.md`](https://github.com/ace-step/ACE-Step) and `ZH_RAP_LORA.md` ([ace-step/ACE-Step](https://github.com/ace-step/ACE-Step)).
- **Genre / task LoRAs from the team:** `RapMachine` (published as `ACE-Step-v1-chinese-rap-LoRA`), `Lyric2Vocal`, `Text2Samples` ([HF org](https://huggingface.co/ACE-Step), [ace-step.github.io](https://ace-step.github.io/)).
- v1.5 quotes **"8 songs trainable in ~1 hour on a single RTX 3090"** for LoRA personalisation ([ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)).
- LoRA training is verified working on **MPS** via the Apple-Silicon fork ([clockworksquirrel/ace-step-apple-silicon](https://github.com/clockworksquirrel/ace-step-apple-silicon)).

---

## 13. Pros and cons

**Pros**
- Apache-2.0 / MIT: **fully commercial-friendly**, unique in this tier.
- **Fastest open music model**: 15.63× RTF on a 4090 (v1, 60 steps); sub-2 s/song on A100 (v1.5).
- Vocals **and** instruments natively; v1.5 vocal quality now beats Suno v4 in blind tests.
- 50+ languages with lyric structural tags.
- First-class **MPS + MLX** support and a dedicated Apple-Silicon fork.
- ComfyUI native + thriving UI ecosystem (`ace-step-ui`).
- LoRA training is cheap (~1 hour for 8 songs on 3090), well-documented.
- Hybrid LM-planner (v1.5) closes the long-range structure gap with Suno.

**Cons**
- v1 vocals are admitted "coarse"; even v1.5 trails Suno on pop/EDM polish.
- High **seed sensitivity**: "gacha" outputs; multiple re-rolls needed in production.
- Less-represented languages underperform.
- Memory for XL series can exceed 24 GB without offload.
- No official **v2** announced; the rapid v1 → v1.5 → XL succession hints at API/checkpoint churn.
- Smaller benchmark literature than Suno/YuE; some metrics still self-reported.

---

## 14. Verdict for the user's platform

For a **Suno-like platform on M5 Max with 128 GB unified memory**, ACE-Step is currently the **single strongest open-source choice** and should be the **default base model**:

- **Best for:** full-song generation with vocals in 50+ languages, fast iteration (sub-minute per song expected on M5 Max), genre-specific LoRA fine-tuning, and any deployment where commercial rights matter (Apache/MIT vs Suno's locked-down terms).
- **Recommended stack:** ACE-Step **v1.5 XL (4B DiT) + 1.7B Qwen3 planner**, run via the `clockworksquirrel/ace-step-apple-silicon` MPS/MLX fork, served behind the `fspecii/ace-step-ui` frontend, with ComfyUI workflows for power-user editing.
- **Weaknesses to mitigate:** budget for **n-of-k re-roll selection** in the product UX (the gacha problem); pair with a **Demucs stem-extraction post-process** (already in `ace-step-ui`) so users can mix-down; do not pitch the platform on pop/EDM polish alone, and lean into folk/classical/jazz and rap, where ACE-Step now leads.
- **Where you may still need Suno-style commercial APIs:** clients demanding broadcast-radio pop polish; otherwise, ACE-Step is sufficient.
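
The n-of-k re-roll idea can be sketched as a plain best-of-n loop over seeds. Everything model-specific here is a stand-in: `fake_generate` and `fake_score` are hypothetical placeholders for a real ACE-Step call and a real quality metric (or a human pick), and only the selection logic is the point:

```python
from typing import Any, Callable


def best_of_n(generate: Callable[[int], Any],
              score: Callable[[Any], float],
              seeds: list[int],
              keep: int = 1) -> list[tuple[int, Any]]:
    """Generate one candidate per seed and keep the top `keep` by score."""
    candidates = [(seed, generate(seed)) for seed in seeds]
    ranked = sorted(candidates, key=lambda item: score(item[1]), reverse=True)
    return ranked[:keep]


def fake_generate(seed: int) -> int:
    """Placeholder for a model call: returns a deterministic pseudo-quality value."""
    return (seed * 37) % 11


def fake_score(candidate: int) -> float:
    """Placeholder scorer: here the candidate already is its own score."""
    return float(candidate)


top = best_of_n(fake_generate, fake_score, seeds=[1, 2, 3, 4], keep=2)
print([seed for seed, _ in top])
```

In a product the kept seeds would be stored with each track so that a preferred take can be regenerated or extended later.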

---

### Sources

- [ACE-Step paper, arXiv 2506.00045](https://arxiv.org/abs/2506.00045)
- [ace-step.github.io](https://ace-step.github.io/)
- [ace-step/ACE-Step (v1 repo)](https://github.com/ace-step/ACE-Step)
- [ace-step/ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)
- [ACE-Step v1-3.5B model card](https://huggingface.co/ACE-Step/ACE-Step-v1-3.5B)
- [ACE-Step org on Hugging Face](https://huggingface.co/ACE-Step)
- [clockworksquirrel/ace-step-apple-silicon](https://github.com/clockworksquirrel/ace-step-apple-silicon)
- [fspecii/ace-step-ui](https://github.com/fspecii/ace-step-ui)
- [ace-step/ACE-Step-ComfyUI](https://github.com/ace-step/ACE-Step-ComfyUI)
- [ace-step/awesome-ace-step](https://github.com/ace-step/awesome-ace-step)
- [ComfyUI native ACE-Step tutorial](https://docs.comfy.org/tutorials/audio/ace-step/ace-step-v1)
- [ComfyUI Wiki ACE-Step guide](https://comfyui-wiki.com/en/tutorial/advanced/audio/ace-step/ace-step-v1)
- [Purz blog: ACE-Step 1.5 in ComfyUI](https://blog.comfy.org/p/ace-step-15-is-now-available-in-comfyui)
- [AMD blog: ACE-Step 1.5 on Ryzen AI / Radeon](https://www.amd.com/en/blogs/2026/commercial-grade-ai-music-generation-on-amd-ryzen-ai-and-radeon-ace-step-1-5.html)
- [FM9: ACE-Step vs Suno blind test](https://fm9.ai/ace-step/vs-suno)
- [HeartMuLa: ACE-Step 1.5 review](https://heart-mula.com/ace-step)
- [ResearchGate: ACE-Step paper page](https://www.researchgate.net/publication/392334894_ACE-Step_A_Step_Towards_Music_Generation_Foundation_Model)
- [Apple Silicon HPC benchmark, arXiv 2502.05317](https://arxiv.org/html/2502.05317v1)
- [acestep.io: hosted service](https://acestep.io/)
- [ace-step.app: hosted service](https://ace-step.app/)

@@ -0,0 +1,161 @@
# 2026 Open-Source Music Generation Models: Newcomers and Survey

*Date: 2026-05-18. Target hardware: M5 Max, 128 GB unified memory, MPS backend.*

This report investigates the freshest 2026 open-source song-with-vocals generators relevant to building a Suno-like platform locally. Primary focus: **SongGeneration 2 / LeVo 2** (Tencent, March 2026) and **HeartMuLa** (Jan 2026). Also covered: DiffRhythm 2, ACE-Step 1.5 XL, SongBloom, YuE, FunMusic/InspireMusic, NotaGen. Independent benchmark sources are sparse for releases this fresh; vendor claims are flagged.

---

## 1. SongGeneration 2 / LeVo 2 (Tencent AI Lab)

**Overview.** Builder: Tencent AI Lab. Release: 2026-03-01 (v2-large weights), arXiv paper "LeVo" appeared 2025-06-09 (2506.07520). Status: actively updated, v2 is the headline model on the repo ([GitHub](https://github.com/tencent-ailab/SongGeneration), [HF](https://huggingface.co/tencent/SongGeneration)).

**Architecture.** Hybrid LLM + Diffusion. The **LeLM** language model handles global structure and performance details with a hierarchical scheme that parallel-models *Mixed Tokens* (melody/structure) and *Dual-Track Tokens* (separate vocal vs. accompaniment streams). A downstream diffusion module synthesises the high-fidelity acoustic waveform from those tokens. Multi-preference DPO alignment (~200k positive/negative pairs) is applied offline ([repo README](https://github.com/tencent-ailab/SongGeneration/blob/main/README.md)).

**Variants and sizes.** Five tiers ([HF model card](https://huggingface.co/tencent/SongGeneration/blob/main/README.md)):
- `base` (2:30 max, zh): 10/16 GB VRAM, RTF 0.67
- `base-new` (zh + en): same VRAM
- `base-full` (4:30, zh + en): 12/18 GB VRAM, RTF 0.69
- `large` (zh + en): 22/28 GB VRAM, RTF 0.82
- **`v2-large`: 4 B params, multilingual (zh/en/es/ja/…), 22/28 GB VRAM, RTF 0.82, 4:30 max length**

**License.** Custom Tencent "academic, research and education purposes" license, **commercial use explicitly prohibited** ([LICENSE](https://github.com/tencent-ailab/SongGeneration/blob/main/LICENSE)). This is the headline blocker for a Suno-like SaaS product.

**Languages.** v2-large: Chinese, English, Spanish, Japanese plus others (multilingual lyrics input).

**Vocals.** Yes. Separable dual-track output (vocals + accompaniment, instrumental-only, or a cappella).

**Speed and hardware.** Reference numbers measured on Tencent's H20 (96 GB) GPU: RTF 0.82 for v2-large. There is no first-party MPS code path, but a community fork **[SongGen-Mac](https://github.com/Rdx-ai-art/SongGen-Mac)** runs the older base/large models via PyTorch MPS on M-series Macs. Its author reports **~6 min wall-clock per ~2 min song on M1 Max 64 GB (base), ~12 min for large**, and notes RAM+swap usage hits ~70 GB during inference. The fork is tiny (9 GitHub stars) and does **not** yet wrap v2-large; porting that to MPS on the M5 Max 128 GB is a real engineering task and will likely need careful attention to bf16 casts (LeLM) plus diffusion sampler patches.

**Benchmarks.** Vendor claims ([repo README](https://github.com/tencent-ailab/SongGeneration)): Phoneme Error Rate **8.55 %** vs. Suno v5 12.4 % and Mureka v8 9.96 %. Subjective panel: 20 industry professionals scored Overall Quality, Melody, Arrangement, Sound Quality (instrument and vocal), and Structure on 100 songs/model; Tencent reports v2-large above all open-source baselines and at parity with top commercial systems. **All numbers vendor-reported; no independent re-run located.** The arXiv "Benchmarking Music Generation Models via Human Preference Studies" paper (2506.19085) precedes v2 and tops out at Suno v3.5 / Udio, so it does not cover LeVo ([arXiv](https://arxiv.org/html/2506.19085v1)).

**Repo health.** 1.6 k stars / 191 forks, last meaningful update 2026-03-01. 12 active discussion threads ([repo](https://github.com/tencent-ailab/SongGeneration)).

**Adoption.** Hugging Face Space (free demo), WaveSpeed AI hosted endpoint ([WaveSpeed](https://wavespeed.ai/models/wavespeed-ai/song-generation)), SECourses Patreon GUI wrapper, vllm-omni issue tracking integration ([HF Space](https://huggingface.co/spaces/tencent/SongGeneration)). No production SaaS adoption seen.

**Pros.** State-of-the-art lyric accuracy (vendor); dual-track outputs ready for mixing; multilingual; clear inference budget; 4 B params fits comfortably in 128 GB unified memory in fp16.

**Cons.** **License kills commercial use** for a Suno-clone product. No official MPS path. Community Mac fork lags v2. Inference time on Apple Silicon is multi-minute per song. No independent benchmark verification.

---

## 2. HeartMuLa (HeartMuLa team / academic group)

**Overview.** Builder: HeartMuLa research collective, paper credited to a Jordi Pons-affiliated group ([Substack explainer](https://artintech.substack.com/p/heartmula-explained)). First weights: 2026-01-19 (`HeartMuLa-oss-3B`), latest: 2026-02-13 (`HeartMuLa-oss-3B-happy-new-year`). arXiv 2601.10547 ([abs](https://arxiv.org/abs/2601.10547)).

**Architecture.** Four-stage family ([landing page](https://heartmula.github.io/)): **HeartCLAP** (audio-text alignment / retrieval), **HeartTranscriptor** (Whisper-style lyric ASR), **HeartCodec** (12.5 Hz neural audio codec, low frame rate but high-fidelity), **HeartMuLa** (LLM-based song generator conditioned on lyrics, tags, and reference audio). Section-level fine-grained control (intro/verse/chorus) is a stated feature.

**Variants and sizes.** Six checkpoints, five of them published on [HF](https://huggingface.co/HeartMuLa):
- `HeartMuLa-oss-3B`: 4 B text-to-audio (1.21 k downloads, 255 likes)
- `HeartMuLa-RL-oss-3B-20260123`: 4 B RL-tuned variant
- `HeartMuLa-oss-3B-happy-new-year`: 4 B latest checkpoint
- `HeartCodec-oss-20260123`: 2 B codec
- `HeartTranscriptor-oss`: 0.8 B ASR
- `HeartMuLa-7B`: internal/unreleased

(Note the naming oddity: the HF model card uses a "3B" name but lists a 4 B parameter size; treat as ~4 B.)

**License.** **Apache 2.0**, confirmed via [LICENSE](https://github.com/HeartMuLa/heartlib/blob/main/LICENSE). Commercial use permitted. This is the strongest licensing position of any model in this report.

**Languages.** Multilingual; demo page covers en, zh, ja, ko, es. Paper claims "almost all languages."

**Vocals.** Yes. Lyric-conditioned vocal synthesis is the core capability. The paper claims best-in-class lyric intelligibility.

**Speed and hardware.** RTF ≈ 1.0 (paper). VRAM via the ComfyUI integration ([FL-HeartMuLa](https://github.com/filliptm/ComfyUI_FL-HeartMuLa)): 3 B model needs **12 GB+ VRAM** at full precision, **6 GB with 4-bit bnb quantisation** (CUDA-only). 7 B will need 24 GB / 12 GB quantised. **MPS supported** on M1/M2/M3/M4 (M5 implied), but 4-bit quantisation does not work on MPS, so the M5 Max will run native bf16. 128 GB unified memory is plenty of headroom for the 4 B model and an eventual 7 B release.
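
As a back-of-envelope check on these figures, raw weight size is parameter count times bytes per parameter. The sketch below covers weights only; activations, codec, and framework overhead are not modelled, which is presumably why the quoted VRAM numbers for quantised runs are higher than the raw weights. `weight_gb` and `BYTES_PER_PARAM` are illustrative names:

```python
# Bytes per parameter for common storage formats.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int4": 0.5}


def weight_gb(params_billion: float, dtype: str) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes); weights only."""
    return params_billion * BYTES_PER_PARAM[dtype]


for params, dtype in [(3, "fp32"), (4, "bf16"), (7, "bf16"), (3, "int4")]:
    print(f"{params} B params in {dtype}: ~{weight_gb(params, dtype):.1f} GB of weights")
```

A 3 B model at fp32 matches the 12 GB full-precision figure, and a 4 B or 7 B model in bf16 stays well under the 128 GB unified-memory budget.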

**Benchmarks.** Vendor PER claims: **0.09 (English), 0.12 (Chinese)**, flagged "lowest across every language tested," beating Suno v5 and MiniMax Music 2.0 ([blog](https://huggingface.co/blog/azhan77168/heartmula)). **Note PER unit mismatch with SongGeneration's 8.55 %: these are likely measured on different scales (HeartMuLa percentages may be normalised differently); direct comparison unreliable.** Demo page compares against Suno v4.5, Mureka v7.6, YuE, DiffRhythm 2, ACE-Step ([demos](https://heartmula.github.io/)). The single HN comment ([46691275](https://news.ycombinator.com/item?id=46691275)) said "initial results promising, more so than recent ACE-Step 1.5." Otherwise **no independent A/B tests located**; the HF promo blog is vendor-aligned content.
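
To make the unit concern concrete, the snippet below converts the HeartMuLa English figure under both possible readings and sets it beside SongGeneration's 8.55 %. Which reading is correct is unknown; `per_as_percent` is an illustrative helper, not something either vendor publishes:

```python
def per_as_percent(value: float, unit: str) -> float:
    """Express a phoneme error rate as a percentage, given its assumed unit."""
    if unit == "fraction":
        return value * 100.0
    if unit == "percent":
        return value
    raise ValueError(f"unknown unit: {unit}")


HEARTMULA_EN = 0.09          # vendor-reported, unit not stated
SONGGENERATION_PCT = 8.55    # vendor-reported, stated as a percentage

for unit in ("fraction", "percent"):
    pct = per_as_percent(HEARTMULA_EN, unit)
    print(f"HeartMuLa read as {unit}: {pct:.2f} % vs SongGeneration {SONGGENERATION_PCT:.2f} %")
```

Read as a fraction, the two models are within about half a percentage point of each other. Read as a percentage, HeartMuLa would be roughly two orders of magnitude better, which is why the comparison cannot be trusted without knowing the scale.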

**Repo health.** [github.com/HeartMuLa/heartlib](https://github.com/HeartMuLa/heartlib): 3.6 k stars / 396 forks / 71 open issues. Last release Feb 2026. Larger and more active than SongGeneration's repo.

**Adoption.** WaveSpeed AI hosted endpoint ([blog](https://wavespeed.ai/blog/posts/introducing-wavespeed-ai-heartmula-generate-music-on-wavespeedai/)); ComfyUI node `FL-HeartMuLa`; HeartMuse local app integrating Ollama for lyric writing ([HN](https://news.ycombinator.com/item?id=46871828)).

**Pros.** Apache 2.0, so usable for a commercial product. Modular architecture (codec + ASR + CLAP + gen) is reusable. Strong lyric intelligibility claim. Active repo. Explicit MPS support documented downstream.

**Cons.** Heavy marketing tone in third-party coverage; benchmarks all vendor-published. 7 B not yet released. No standardised MOS or ELO numbers from a neutral evaluator. PER values reported in non-comparable units to peers.

---

## 3. DiffRhythm 2 (ASLP-Lab)

**Overview.** Successor to DiffRhythm v1.2. arXiv 2510.22950, v3 2026-02-03 ([arXiv](https://arxiv.org/abs/2510.22950)). Original repo: [ASLP-lab/DiffRhythm](https://github.com/ASLP-lab/DiffRhythm).

**Architecture.** Music VAE at 5 Hz frame rate + Diffusion Transformer with **block flow matching** for lyric-to-vocal alignment. Adds cross-pair preference optimisation (RLHF) and a stochastic block representation alignment loss for musicality. Semi-autoregressive blockwise generation.

**License.** Apache 2.0 (inherited from v1, confirmed 2025-03-07).

**Languages, vocals, hardware.** Multilingual; full vocals + instrumental; uses 44.1 kHz stereo; up to 4:45 song length. DiffRhythm v1 can generate a full song in ~10 s on a single A100, so v2 should be in the same ballpark. MPS not officially stated but PyTorch DiT models port relatively cleanly. Parameter count not disclosed in v2 abstract.

**Benchmarks.** Vendor claims top-of-class fidelity; no independent verification specific to v2.

**Pros/cons.** Pros: very fast, permissive license, mature codebase. Cons: no public param count, no first-party MPS path, lyric clarity historically the weak spot vs LeVo/HeartMuLa.

---

## 4. ACE-Step 1.5 XL (ACE Studio × StepFun)

**Overview.** [github.com/ace-step/ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5). arXiv 2602.00744. Most user-tested local-first option. 10.4 k stars / 1.3 k forks: **biggest community by far**.

**Architecture.** LM planner (0.6 B / 1.7 B / 4 B selectable) + DiT decoder (2 B or 4 B XL). XL DiT ~9 GB bf16.

**License.** **MIT**. Commercial use allowed.

**Languages.** 50+.

**Speed and hardware.** Under 2 s/song on A100, under 10 s on RTX 3090, **<4 GB VRAM** for DiT-only minimum. **Explicit Mac MPS support** with `start_gradio_ui_macos.sh`; MLX backend optimisation noted. Easiest M5 Max install of any model in this list.

**Benchmarks.** Vendor: SongEval 8.12, AudioBox 7.76, claims to beat Suno v5 and MiniMax 2.5 across 11 dimensions ([project page](https://ace-step.github.io/ace-step-v1.5.github.io/)). A DEV Community write-up positions it "between Suno v4.5 and v5", which is the more honest framing.

**Pros.** Best Mac story, MIT licence, LoRA personalisation in days, tiny VRAM. **Cons.** Vocal naturalness still trails Suno v5 in casual user tests.

---

## 5. SongBloom (Tencent AI Lab)

[github.com/tencent-ailab/SongBloom](https://github.com/tencent-ailab/SongBloom). 778 stars. Interleaved autoregressive sketch + diffusion refinement, 2 B params, MPS supported. Generates songs of up to 150 s from lyrics + 10 s reference audio, with lengths up to 240 s in the Oct 2025 update. Follows the same Tencent academic-only LICENSE pattern (not Apache), so the **same commercial-use prohibition as SongGeneration** likely applies; verify before deploying. Useful as a research baseline.

---

## 6. YuE (M-A-P / HKUST)

[github.com/multimodal-art-projection/YuE](https://github.com/multimodal-art-projection/YuE). LLaMA-2 backbone, lyric-to-song, **Apache 2.0** since 2025-01-30, 5 min max length, dual-track ICL mode, no v2 announced. Strong vocal emotion for ballads/R&B. llama.cpp issue 11467 still tracks GGUF support. Solid permissive fallback if HeartMuLa underperforms.

---

## 7. FunMusic / InspireMusic (Alibaba FunAudioLLM)

[github.com/FunAudioLLM/FunMusic](https://github.com/FunAudioLLM/FunMusic). Qwen2.5 backbone + flow-matching super-res. 1.3 k stars. Apache 2.0. **No MPS support, requires Flash Attention 2.6 + CUDA 11.8+**, so it is effectively NVIDIA-only. Song-with-vocals models announced but not yet released; current releases are music-only/audio.

---

## Survey table: 2026 open-source song generators

| Model | Builder | Release | Params | License | Vocals | Repo |
|---|---|---|---|---|---|---|
| SongGeneration 2 / LeVo 2 | Tencent AI Lab | 2026-03 | 4 B | Custom non-commercial | Yes, dual-track | [link](https://github.com/tencent-ailab/SongGeneration) |
| HeartMuLa-oss-3B | HeartMuLa | 2026-01 | ~4 B + 2 B codec + 0.8 B ASR | Apache 2.0 | Yes, multilingual | [link](https://github.com/HeartMuLa/heartlib) |
| DiffRhythm 2 | ASLP-Lab | 2025-10 → 2026-02 (v3) | undisclosed | Apache 2.0 | Yes | [link](https://github.com/ASLP-lab/DiffRhythm) |
| ACE-Step 1.5 XL | ACE Studio × StepFun | 2026-01 | LM 0.6-4 B + DiT 2-4 B | MIT | Yes | [link](https://github.com/ace-step/ACE-Step-1.5) |
| SongBloom | Tencent AI Lab | 2025-06 → 2025-10 | 2 B | Custom (likely non-commercial) | Yes | [link](https://github.com/tencent-ailab/SongBloom) |
| YuE | M-A-P / HKUST | 2025-01 | up to 7 B | Apache 2.0 | Yes | [link](https://github.com/multimodal-art-projection/YuE) |
| InspireMusic (FunMusic) | Alibaba FunAudioLLM | 2025-01 | 1.5 B | Apache 2.0 | Coming (music only today) | [link](https://github.com/FunAudioLLM/FunMusic) |
| NotaGen / NotaGen-X | Central Conservatory + ElectricAlexis | 2025 | symbolic-only | MIT | n/a (ABC/XML) | [link](https://github.com/ElectricAlexis/NotaGen) |

---
|
| 144 |
+
|
| 145 |
+
## Dark horses / experimental

- **NotaGen-X**: DeepSeek-R1-style RL on symbolic music. Outputs ABC/MusicXML (not audio). Could feed a TTS-vocal model for a hybrid composer → singer pipeline ([repo](https://github.com/ElectricAlexis/NotaGen), [arXiv](https://arxiv.org/abs/2502.18008)).
- **LLaSA / LLaSA+**: Llama-3B-backbone TTS pipeline ([arXiv](https://arxiv.org/html/2508.06262v1)); not music, but its emergent prosody is good enough to consider as the vocal layer behind a NotaGen score.
- **DiffRhythm+**: preference-optimised DiffRhythm variant (arXiv 2507.12890); mid-stage between v1 and v2.
- **AudioX**: anything-to-audio DiT (arXiv 2503.10522); useful for sound design and SFX layering, not full-song generation.
- **MelodyFlow**: text-controllable DiT with flow matching for music editing.
- **HeartMuse**: local Ollama-orchestrated lyric → HeartMuLa song app ([HN](https://news.ycombinator.com/item?id=46871828)); reference for building a thin product wrapper.

---

## Skeptic's bottom line for the M5 Max 128 GB build

1. **For a commercial Suno-clone**: **HeartMuLa** (Apache 2.0, native MPS, 4 B fits easily, Feb-2026 checkpoint, modular components reusable) is the strongest pick. Verify their PER claims yourself before fundraising-style messaging.
2. **For best raw quality, research only**: **SongGeneration 2 v2-large**, but the Tencent licence forbids commercial deployment and the v2 weights don't yet have a maintained MPS port. The community SongGen-Mac fork targets the older base/large.
3. **For fastest iteration / smallest VRAM**: **ACE-Step 1.5 XL** (MIT, native Mac script, <4 GB VRAM). It under-promises vocal naturalness vs HeartMuLa but ships today on Apple Silicon with the cleanest licence story.
4. No reliable independent benchmark for these specific 2026 releases exists yet; the only neutral preference study found ([arXiv 2506.19085](https://arxiv.org/html/2506.19085v1)) stops at Suno v3.5 and does not cover LeVo, HeartMuLa, or ACE-Step. **Run your own blind A/B before betting a product on any vendor PER number.**
@@ -0,0 +1,105 @@
# Apple Silicon / MPS Compatibility Audit: Music Generation Models

Hardware target: **M5 Max, 128 GB unified memory**. Date: 2026-05-18.

Honest read: MPS is a second-class citizen for almost every music-gen repo. CUDA is the assumed default; Mac support, when it exists, is community-driven. Below is the per-model evidence with verdicts.

---

## 1. YuE (multimodal-art-projection/YuE)

- **Official MPS support:** None. The README requires `cuda >= 11.8`, conda-installed `cudatoolkit=11.8`, and **flash-attn 2 is mandatory** to avoid OOM on long sequences ([YuE README](https://github.com/multimodal-art-projection/YuE/blob/main/README.md)).
- **Community reports:** Issue #51 ("Instructions to run on Mac") is open and **unanswered** ([#51](https://github.com/multimodal-art-projection/YuE/issues/51)). No working Mac fork.
- **Backend compatibility:** Hard CUDA dependency through flash-attn; xformers/triton flash paths are CUDA-only ([HF forum thread](https://discuss.huggingface.co/t/best-practices-to-use-models-requiring-flash-attn-on-apple-silicon-macs-or-non-cuda/97562)). Stage 1 (7B LLaMA-2-style) and Stage 2 (1B) are both transformer-based; in principle portable, but no one has shipped it.
- **Memory:** 7B + 1B + upsampler. Author recommends **≥80 GB VRAM** for full song; 24 GB OK for short clips. On 128 GB unified memory this fits, *if* you can swap flash-attn for SDPA.
- **Apple-Silicon timing:** None reported.
- **Verdict:** **Doesn't work out of the box. Likely broken on MPS.** Would need a non-trivial fork: strip flash-attn, replace with `torch.nn.functional.scaled_dot_product_attention`, and audit RoPE/KV-cache for MPS dtype quirks. There is also a "GPU Poor" fork ([deepbeepmeep/YuEGP](https://github.com/deepbeepmeep/YuEGP)) but it targets CUDA/ROCm with 8-bit quant, so there is **no Mac path**.

## 2. DiffRhythm v1 and v2 (ASLP-lab)

- **Official MPS support:** DiffRhythm v1 explicitly states *"DiffRhythm can now run on MacOS!"* with `brew install espeak-ng` ([Readme](https://github.com/ASLP-lab/DiffRhythm/blob/main/Readme.md)). No specific MPS notes, but it works.
- **DiffRhythm 2:** `requirements.txt` is **clean of CUDA-only packages**: no flash-attn, xformers, triton, mamba_ssm, deepspeed, bitsandbytes ([requirements.txt](https://github.com/ASLP-lab/DiffRhythm2/blob/main/requirements.txt)). Just `torch==2.7`, `torchaudio==2.7`, `transformers`, `safetensors`, `muq`, `librosa`. The 3.9 % "CUDA" language stat in the repo is benign: it is auto-detected from a small kernel file, and there are no compiled extensions in the pip install path.
- **Community reports:** No GitHub issues or Reddit threads surface specific MPS bugs for DiffRhythm, implying it either works quietly or no one has tried at scale. The architecture (latent diffusion + DiT with flow matching, very similar to Stable Audio Open / SD3) is the same class that *does* work on MPS via diffusers.
- **Memory:** DiffRhythm-base needs **≥8 GB VRAM**; `--chunked` decoding reduces it further. Trivial on 128 GB.
- **Apple-Silicon timing:** Not benchmarked publicly, but extrapolating from Stable Audio Open MPS (≈3× CPU speedup) the 285-second full-song run should land in the low minutes on M5 Max.
- **Verdict:** **Just works on MPS (likely) / Works with workarounds.** Highest confidence pick.

## 3. ACE-Step 1.5 (ace-step/ACE-Step)

- **Official MPS support:** **First-class.** README explicitly advertises Mac + AMD + Intel + CUDA. macOS scripts auto-set `ACESTEP_LM_BACKEND=mlx --backend mlx`; the language-model side runs on Apple's **MLX**, the DiT side on **PyTorch MPS** ([INSTALL.md](https://github.com/ace-step/ACE-Step-1.5/blob/main/docs/en/INSTALL.md)). bfloat16 supported on MPS since PyTorch 2.4.
- **Community reports:** Real-world M2 Air 16 GB run: 5-10 min per song, hit MPS-OOM, fixed with `PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0` ([bioerrorlog](https://en.bioerrorlog.work/entry/ace-step-15-local-m2-macbook)). A dedicated [clockworksquirrel/ace-step-apple-silicon](https://github.com/clockworksquirrel/ace-step-apple-silicon) fork already centralised MPS detection, swapped CUDA cache calls for `torch.mps.empty_cache()` / `torch.mps.synchronize()`, and tuned VAE conv1d tile sizes for Metal limits.
- **Backend compatibility:** Flash-attn auto-disabled on MPS. `torch.compile` disabled on MPS. nanovllm not on Mac. Otherwise clean.
- **Memory:** 4 GB DiT-only / 6 GB LLM+DiT minimum; ~10 GB total install.
- **Apple-Silicon timing (M1 Pro 16 GB vs M3 Pro 36 GB vs A100, from the AS fork's benchmarks):**

| Task | M1 Pro | M3 Pro | A100 |
| --- | --- | --- | --- |
| 30 s turbo song | ~45 s | ~25 s | ~2 s |
| 30 s SFT song | ~3 min | ~1.5 min | ~8 s |

**Extrapolated M5 Max:** turbo ~10-15 s, SFT ~45-60 s for 30 s output. Best Mac-citizen of the bunch.

- **Verdict:** **Just works on MPS.** Already production-grade on M-series.
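
The MPS-specific settings cited in this section can be gathered into a small startup helper. This is a minimal sketch, not code from either repository: `configure_mps_env`, `pick_device`, and `empty_cache` are hypothetical names, and the CPU fallback when PyTorch is missing is an assumption made so the snippet runs anywhere. Setting the variable before `torch` is imported is the conservative choice, so that the MPS allocator sees it when it initialises.

```python
import os


def configure_mps_env(high_watermark_ratio: str = "0.0") -> None:
    """Apply the OOM workaround cited above (0.0 disables the high-watermark limit).

    setdefault keeps any value the user already exported in their shell.
    """
    os.environ.setdefault("PYTORCH_MPS_HIGH_WATERMARK_RATIO", high_watermark_ratio)


def pick_device() -> str:
    """Return "mps", "cuda" or "cpu" depending on what this machine offers."""
    try:
        import torch  # imported lazily so the env var above is already in place
    except ImportError:
        return "cpu"  # assumption: fall back to CPU when PyTorch is not installed
    if torch.backends.mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


def empty_cache(device: str) -> None:
    """Release cached allocator memory on whichever backend is active."""
    if device == "cpu":
        return
    import torch
    if device == "mps":
        torch.mps.empty_cache()
    elif device == "cuda":
        torch.cuda.empty_cache()


configure_mps_env()
device = pick_device()
print(f"using device: {device}")
empty_cache(device)
```

Disabling the watermark trades a clean PyTorch OOM error for possible system-wide memory pressure, so on a 128 GB machine it may be reasonable to leave the variable unset and only apply it if OOM errors actually appear.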

## 4. SongGeneration 2 / LeVo 2 (Tencent)

- **Official MPS support:** None. Official repo pins `flash-attn 2.7.4.post1` for CUDA 12 + torch 2.6, though a `--not_use_flash_attn` flag exists ([Tencent SongGeneration](https://github.com/tencent-ailab/SongGeneration)).
- **Community reports:** [Rdx-ai-art/SongGen-Mac](https://github.com/Rdx-ai-art/SongGen-Mac) fork: "Runs completely on your Mac's GPU via MPS on PyTorch." Tested on M1 Max 64 GB / macOS 15.7.2. **Pre-chorus block produces gibberish vocals**, a known regression vs CUDA.
- **Backend compatibility:** Hybrid LLM + diffusion architecture. Once flash-attn is stripped, the LLM side uses SDPA fine on MPS.
- **Memory (Mac fork):** Base ≥24 GB RAM, ~70 GB total app RAM including swap during inference. Large ≥32 GB, hits ~80 GB. **On 128 GB M5 Max this fits cleanly without swap.**
- **Apple-Silicon timing (M1 Max 64 GB):** Base ~4-6 min for ~2 min of audio. Large ~10-25 min for ~2:30. M5 Max should be roughly 2-3× faster (better mem bandwidth + more GPU cores).
- **Verdict:** **Works with workarounds (community fork only).** Functional but watch the pre-chorus bug.

## 5. HeartMuLa (HeartMuLa/heartlib)

- **Official MPS support:** Not in the README. CUDA-first design with `--mula_device` / `--codec_device` flags ([heartlib](https://github.com/HeartMuLa/heartlib)). RTF ≈ 1.0 on CUDA.
- **Community reports:** **Strong MLX port exists**: [Acelogic/heartlib-mlx](https://github.com/Acelogic/heartlib-mlx). Claims **2.1× faster than PyTorch MPS** on M2 Max (13.4 s vs 27.9 s end-to-end), 8.7× faster model load, 100 % numerical parity with PyTorch.
- **Backend compatibility:** No flash-attn / mamba / triton in the official deps; clean transformer + neural codec. MLX port supports bfloat16.
- **Memory (MLX port):** 3B model ~6 GB, HeartCodec ~2 GB, KV-cache ~1 GB/min of audio. **Full 1-min song ≈ 11 GB.** 32 GB minimum recommended; M5 Max 128 GB blows past this. 7B variant not yet released as of Feb 2026.
- **Apple-Silicon timing:** M2 Max: 11.6 s to generate 50 frames; M5 Max should comfortably exceed real-time for the 3B model.
- **Verdict:** **Just works on MPS via MLX port.** Second-best Mac story after ACE-Step. The official PyTorch path is untested but should run on MPS once you bypass any CUDA cache calls.
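
The memory figures quoted for the MLX port can be turned into a rough budget. Note that the listed components sum to about 9 GB for a one-minute song, while the quoted total is about 11 GB; the sketch below treats the remaining 2 GB as fixed overhead, which is an assumption, as is the linear growth of the KV-cache with song length. The function name is hypothetical.

```python
def heartmula_mlx_memory_gb(
    audio_minutes: float,
    model_gb: float = 6.0,       # 3B model weights (figure quoted above)
    codec_gb: float = 2.0,       # HeartCodec (figure quoted above)
    kv_gb_per_min: float = 1.0,  # KV-cache growth per minute of audio
    overhead_gb: float = 2.0,    # ASSUMPTION: gap between 9 GB and the quoted 11 GB
) -> float:
    """Rough peak-memory estimate for one generation, in GB."""
    return model_gb + codec_gb + kv_gb_per_min * audio_minutes + overhead_gb


for minutes in (1, 3, 5):
    need = heartmula_mlx_memory_gb(minutes)
    print(f"{minutes} min song: ~{need:.0f} GB ({need / 128:.0%} of 128 GB)")
```

Under these assumptions even a five-minute song stays far below the 128 GB ceiling, which is consistent with the audit's claim that memory is not the limiting factor on this machine.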

## 6. MusicGen (Meta / audiocraft): reference

- **Official MPS support:** None. AudioCraft officially supports CUDA or CPU only ([audiocraft README](https://github.com/facebookresearch/audiocraft)). Issues [#13](https://github.com/facebookresearch/audiocraft/issues/13) and [#31](https://github.com/facebookresearch/audiocraft/issues/31) are open requests, no merged PR. EnCodec decoder ops misbehave on MPS; a common workaround is to **move decoder to CPU** while keeping the LM on MPS.
- **Community / MLX:** Multiple solid ports: [Andrade Olivier's port](https://medium.com/@andradeolivier/i-ported-musicgen-to-apple-silicon-generate-music-from-text-on-your-macbook-9eaf95992053), [Nat Taylor's MusicGen MLX test](https://nattaylor.com/blog/2024/musicgen-via-mlx/). M4 Max: small model 8 s audio in ~6 s (faster than realtime). M1: ~60 s for 9 s of audio at 500 steps. AudioGen (sibling model) [works on MPS](https://blog.peddals.com/en/apple-mps-to-generate-audio-with-meta-audiogen/) by moving decoder ops to CPU.
- **Memory:** 300 M small / 1.5 B medium / 3.3 B large. Trivial on 128 GB.
- **Verdict:** **Partial on raw PyTorch MPS (CPU fallback for decoder); Just works via MLX port.**

## 7. Stable Audio Open (Stability AI): reference

- **Official MPS support:** Diffusers supports `device="mps"` for the SAO pipeline ([Stable Audio docs](https://huggingface.co/docs/diffusers/en/api/pipelines/stable_audio)).
- **Community reports:** [phlo.info](https://phlo.info/posts/using-stable-audio-tools-on-apple-silicon/) reports 51 s CPU → 17 s MPS by swapping `cuda` → `mps` in two files. **fp16 conv1d in the decoder is pathologically slow on MPS**; the fix is `model.pretransform.model_half = False; model.to(torch.float32)` ([HF discussion](https://huggingface.co/stabilityai/stable-audio-open-small/discussions/1)).
- **Memory:** 1.21 B params. Trivial.
- **Apple-Silicon timing:** ~17 s per 3-s sample on M1-class; M5 Max should be a few seconds.
- **Verdict:** **Works with workarounds** (force fp32 in decoder).

---

## Metal / MLX Apple-Native Equivalents

- **ACE-Step**: Native MLX backend in the official repo for the LM side. **Closest thing to a first-party Mac music model.**
- **HeartMuLa**: [heartlib-mlx](https://github.com/Acelogic/heartlib-mlx), 2.1× speedup over PyTorch MPS, full numerical parity.
- **MusicGen**: Multiple MLX ports, faster than real-time on M4 Max small model.
- **Stable Audio Open**: MLX-Audio family ([Blaizzy/mlx-audio](https://github.com/Blaizzy/mlx-audio)) covers TTS/STT; SAO has unofficial MLX ports.
- **YuE / DiffRhythm / SongGeneration**: **No MLX ports** as of May 2026.

There is no umbrella "MLX-music" framework; each project rolls its own port.

---

## Practical Recommendation

**Start with ACE-Step 1.5.** It is the only model with first-party Apple Silicon support, hybrid MLX + MPS execution, published M-series benchmarks, and no CUDA-only dependencies. The user's 128 GB unified memory completely eliminates the OOM workaround other Mac users hit on 16-36 GB machines.

**Second pick: HeartMuLa via the MLX port** ([heartlib-mlx](https://github.com/Acelogic/heartlib-mlx)). Faster than the PyTorch MPS path, bfloat16, well-benchmarked. 3B only for now; 7B unreleased.

**Third pick: DiffRhythm v2.** Clean deps, README claims macOS support, similar architecture class to Stable Audio Open, which is known to work on MPS with the fp32 decoder workaround.

**Avoid on MPS unless you enjoy yak-shaving:**
- **YuE**: flash-attn-mandatory, no Mac fork, no MLX port.
- **SongGeneration / LeVo**: only via [SongGen-Mac](https://github.com/Rdx-ai-art/SongGen-Mac) fork, pre-chorus bug, 70+ GB RAM pressure with swap. Workable on 128 GB but not pleasant.

**Remote-dev path:** For YuE specifically, **train/develop on a rented H100 or A100** (RunPod, Lambda, Modal, Replicate) and pull weights for inference on M5 Max **only if** you fork it to drop flash-attn. Otherwise treat YuE as a remote-only model. For everything else on this list, M5 Max is sufficient as the primary development machine.

**On the user's prior LTX-Video burns:** music models are LM/diffusion stacks without the multi-modal Gemma + complex64 + SDPA-on-meta-tensor traps that bit LTX-2.3. The main MPS gotchas here are mundane: flash-attn substitution, fp16 conv1d in audio decoders, and `PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0` for high-watermark allocator behaviour.
@@ -0,0 +1,93 @@
# Open-Source Song Generation Models: Side-by-Side Comparison

*Compiled 2026-05-18 for M5 Max / 128 GB unified memory target.*

---

## Headline matrix

| Property | **ACE-Step 1.5 XL** | **HeartMuLa 4B** | **DiffRhythm 2** | **YuE 7B** | SongGeneration 2 |
|---|---|---|---|---|---|
| **Builder** | ACE Studio × StepFun | HeartMuLa | NWPU ASLP-lab + Xiaomi | M-A-P / HKUST | Tencent AI Lab |
| **Release** | 2026-01-28 | 2026-01-19 | 2025-10-27 → 2026-02-03 (v3) | 2025-01-26 | 2026-03-01 |
| **License** | **MIT** | **Apache 2.0** | **Apache 2.0** | **Apache 2.0** | **Custom NON-commercial** |
| **Repo stars** | 10.4 k | 3.6 k | ~2.3 k (v1) + 0.16 k (v2) | 6.2 k | 1.6 k |
| **Last major commit** | v0.1.7 (2026-04-24) | 2026-02 | 2026-02 | 2025-06-04 (stale) | 2026-03-01 |
| **Architecture** | LM-planner (Qwen3 0.6/1.7/4 B) + DiT (2/4 B) | CLAP + ASR + 12.5 Hz Codec + 4 B LLM | 5 Hz Music VAE + DiT w/ block flow matching | LLaMA2 7B AR Stage-1 + 1B Stage-2 + X-Codec | LeLM hybrid + diffusion decoder |
| **Params (largest)** | up to 8 B (4 B DiT + 4 B LM) | ~4 B + 2 B codec + 0.8 B ASR | ~1 B DiT + 170 M VAE-dec | 7 B + 1 B + upsampler | 4 B (v2-large) |
| **Audio rate** | 44.1 kHz stereo | 24 kHz neural codec | 44.1 kHz stereo | 16 kHz then upsampled | High-fi via diffusion |
| **Max length** | 4+ min | ≥1 min, scaling | **210 s (regression from v1)** | 5 min | 4:30 |
| **Vocals + Instruments** | ✅ Native | ✅ Native | ✅ Native, single stream | ✅ Native, dual-track AR | ✅ Dual-track |
| **Languages** | 50+ | 5+ (en/zh/ja/ko/es benchmarked) | Bilingual EN/ZH + JP/KR/ES marketing-only | EN, Mandarin, Cantonese, JP, KR | zh/en/es/ja + others |
| **VRAM (minimum)** | **<4 GB** with offload (turbo) | 6 GB 4-bit / 12 GB bf16 | 8 GB v1 with `--chunked` | 24 GB consumer / 80 GB single-pass | 22-28 GB |
| **VRAM (recommended)** | 12 GB+ offload, 24 GB optimal | 24 GB for 7B (unreleased) | 24 GB | 80 GB H100/H800 | 28 GB |
| **MPS / Apple Silicon** | **First-class, MLX + MPS, dedicated fork** | **MLX port, 2.1× PyTorch MPS** | Likely OK; clean deps; untested | ❌ Mandatory flash-attn | Community fork, pre-chorus bug |
| **MPS bench M-series (30 s clip)** | M3 Pro 25 s turbo / 1.5 min SFT | M2 Max 11.6 s for 50 frames | not published | not published | M1 Max 4-6 min for 2 min |
| **MPS bench M5 Max (projected)** | turbo ~10-15 s / SFT ~45-60 s | faster than real-time | low-minute range | n/a | ~2-3× M1 Max |
| **Speed (RTF on A100 / 4090)** | sub-2 s/song on A100 (v1.5) | RTF ≈ 1.0 | v2 RTF 0.213 (4090) → ~45 s for 210 s | 27 steps RTF 27.27× on A100 (v1, ~15 min/song) | RTF 0.82 (H20) |
| **Vocal naturalness vs Suno v4** | **4.4/5 vs 4.1/5** (blind 50-person test) | Vendor only, unverified | Authors admit clear gap vs v4.5 | Comparable vocal range; weaker mix | Vendor claim parity, unverified |
| **Lyric alignment (PER)** | Strong (lyric tags) | Vendor: 0.09 EN / 0.12 ZH (unit mismatch) | **0.13 (open-source SOTA)** | Strong from lyric tags | Vendor: 8.55 % |
| **Fine-tuning support** | ✅ LoRA, 8 songs/1h on 3090, **MPS-validated** | ❌ No public training code | ❌ "Coming soon" since Mar 2025 | ✅ LoRA (Megatron pipeline, CUDA 12.1+) | ❌ |
| **ComfyUI integration** | ✅ Native, official workflows | ✅ FL-HeartMuLa | ✅ billwuhao/ComfyUI_DiffRhythm | ✅ smthemex/ComfyUI_YuE | ✅ |
| **Replicate hosted** | ❌ no first-party | ❌ | ❌ | ✅ fofr/yue | ❌ |
| **Style/audio reference** | LoRA + lyric tags | Reference audio supported | Reference audio supported | ICL mode (style cloning) | Limited |
| **Stem separation** | Built into `fspecii/ace-step-ui` via Demucs | Modular Codec is reusable | ❌ single stream | ✅ AR dual-track is inherently separable | ✅ Dual-track output |
| **Continuation / extension** | Supported in workflows | Limited | Supported | ✅ explicit continuation mode | Supported |
| **Production deployments** | acestep.io, ace-step.app, fspecii/ace-step-ui, AMD-blessed | WaveSpeed AI, HeartMuse local app | Chutes serverless | Replicate fofr/yue, HF Spaces | WaveSpeed AI, HF Space |
| **Watermarking / content credentials** | None baked-in | None baked-in | None baked-in | None baked-in | None baked-in |
| **License gotchas** | None (MIT) | None (Apache 2.0) | Ethical disclaimer (non-binding) | Attribution required ("YuE by HKUST/M-A-P"), label "AI-generated" | **Commercial use prohibited** |
| **Independent benchmarks** | Yes: 50-person blind test, AMD vendor-validated | None located | Internal MOS only | Paper + community | None (Tencent only) |
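
The speed row mixes several ways of reporting throughput, so it may help to make the arithmetic behind one entry explicit. The sketch below assumes the convention in which RTF is compute time divided by audio duration (values below 1 are faster than real time), which is what the DiffRhythm 2 entry implies; the other entries do not all state their convention, and the function names are illustrative.

```python
def generation_seconds(rtf: float, audio_seconds: float) -> float:
    """Wall-clock time when RTF is defined as compute time / audio duration."""
    return rtf * audio_seconds


def speedup_over_realtime(rtf: float) -> float:
    """How many seconds of audio are produced per second of compute."""
    return 1.0 / rtf


# DiffRhythm 2 entry: RTF 0.213 on a 4090 for a 210 s song.
print(round(generation_seconds(0.213, 210), 1))  # 44.7, matching the "~45 s" in the table
print(round(speedup_over_realtime(0.213), 1))    # 4.7
```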

---

## Quality dimensions (qualitative)

| Dimension | Best (open source) | Notes |
|---|---|---|
| **Pop / EDM polish** | (none; Suno v4/v5 still wins) | All open models lag commercial. |
| **Folk / classical / jazz vocal naturalness** | **ACE-Step 1.5 XL** | Wins blind test vs Suno v4 in these genres. |
| **Lyric intelligibility (PER)** | **DiffRhythm 2** (0.13) | HeartMuLa claims lower but unit-incomparable. |
| **Musical macro-structure (verse/chorus/bridge over 3-5 min)** | **YuE** or **ACE-Step 1.5** (planner) | LM-planner models lead diffusion-only here. |
| **Stereo image, mix depth** | **DiffRhythm 2** (44.1 kHz stereo native) | YuE is mono-ish; ACE-Step is stereo but variable. |
| **Genre breadth** | **YuE** | Death-growl metal to Beijing opera to rap. |
| **Multilingual breadth** | **ACE-Step 1.5** | 50+ languages w/ lyric tags; YuE deep on 5 only. |
| **Code-switching (English ↔ Mandarin in one song)** | **YuE** | Explicit demos. |
| **Speed / cost per song** | **ACE-Step 1.5** | Sub-2 s/song on A100; <minute on M5 Max. |
| **Modular reusability of components** | **HeartMuLa** | Codec/ASR/CLAP separately exportable. |

---

## Cost model (rough)

| Path | Per-song cost | Latency | Best for |
|---|---|---|---|
| Self-host ACE-Step 1.5 on M5 Max | $0 marginal (electricity) | ~30-50 s | Dev, beta, low-volume |
| Self-host ACE-Step 1.5 on rented A100 80 GB | <$0.001 (sub-2 s × $1.50/hr ≈ $0.0008) | <2 s | Production, paid SaaS |
| Replicate `fofr/yue` | ~$0.30-1.00 per song (estimated from 4090 cog runtime) | 5-15 min | Multilingual fallback, occasional |
| Self-host DiffRhythm 2 on 4090 | $0 marginal on owned 4090 | ~45 s | Speed tier, instrumentals |
| Replicate / WaveSpeed managed endpoints | varies | varies | Cold-start / spike capacity |
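
The per-song figure for a rented GPU is just compute seconds times the hourly rate divided by 3600. The sketch below works that out for the rented-A100 row; it assumes per-second billing with no idle time, cold starts, or storage costs, and the function names are illustrative.

```python
def cost_per_song(gpu_seconds: float, hourly_rate_usd: float) -> float:
    """GPU rental cost of one generation, billed per second of compute."""
    return gpu_seconds * hourly_rate_usd / 3600.0


def songs_per_dollar(gpu_seconds: float, hourly_rate_usd: float) -> float:
    """Inverse view: how many generations one dollar of GPU time buys."""
    return 1.0 / cost_per_song(gpu_seconds, hourly_rate_usd)


# Rented-A100 row: about 2 s of compute at $1.50/hr.
a100 = cost_per_song(gpu_seconds=2.0, hourly_rate_usd=1.50)
print(f"${a100:.4f} per song, ~{songs_per_dollar(2.0, 1.50):.0f} songs per dollar")
```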

---

## License risk matrix

| License | Commercial SaaS | Output ownership | Risk |
|---|---|---|---|
| MIT (ACE-Step 1.5) | ✅ | User owns | Lowest |
| Apache 2.0 (ACE-Step v1, HeartMuLa, DiffRhythm v1/v2, YuE) | ✅ with attribution | User owns | Low |
| Tencent custom (SongGeneration, SongBloom) | ❌ **prohibited** | n/a | **Blocks SaaS** |
| Suno API (closed-source baseline) | $ paid tier | platform terms | Medium |

---

## Hardware sizing on M5 Max (128 GB unified memory)

| Model | Fits? | Headroom | Notes |
|---|---|---|---|
| ACE-Step 1.5 XL (4 B DiT + 4 B planner) | ✅ huge | ~120 GB free | Overkill; LoRA training viable in-RAM |
| HeartMuLa 4B + 2 B codec + 0.8 B ASR | ✅ huge | ~120 GB free | 7 B variant when released will also fit |
| DiffRhythm 2 (~1 B + 170 M VAE-dec) | ✅ trivial | ~125 GB free | Tiny by 2026 standards |
| YuE 7B Stage-1 + 1B Stage-2 + upsampler | ✅ but blocked | n/a | Memory fine, **flash-attn dep blocks MPS** |
| SongGeneration 2-large (4 B + diffusion) | ✅ comfortable | ~100 GB free | Community fork bug aside, fits |

**Conclusion:** the user's 128 GB unified memory completely eliminates memory pressure for every model in this list. The constraint is software (MPS kernel compat, flash-attn substitution), not hardware.
@@ -0,0 +1,200 @@
# Suno-Clone Platform Architecture: Build Plan

*Compiled 2026-05-18. Target hardware: Apple M5 Max, 128 GB unified memory. Core model decision: ACE-Step 1.5 XL.*

---

## Mental model

Suno (and Udio) are not just a song-generation model. They are a **product stack** with at least five distinct AI components and a few non-AI scaffolds. If we want to replicate the product experience, we have to plan for all of them. The song-gen model is the headline; everything else is what makes it usable.

```
            ┌─────────────────────────────────────┐
            │           Web / mobile UI           │
            │   (text prompt + style + lyrics)    │
            └─────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                      Orchestrator API                       │
│  - prompt routing, queue, billing, history, sharing         │
└─────────────────────────────────────────────────────────────┘
       │               │               │               │
       ▼               ▼               ▼               ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Lyrics LLM  │ │ Style/Tag   │ │ Song-gen    │ │ Voice       │
│ (Llama 3.3  │ │ rewriter    │ │ router      │ │ cloning     │
│ or Qwen)    │ │ (small LM)  │ │             │ │ (RVC)       │
└─────────────┘ └─────────────┘ └──────┬──────┘ └─────────────┘
                                       │
                                       ▼
                      ┌─────────────────────────────────┐
                      │ Model pool (the actual research)│
                      │ - ACE-Step 1.5 XL (default)     │
                      │ - HeartMuLa-MLX (A/B)           │
                      │ - DiffRhythm 2 (speed tier)     │
                      │ - YuE on Replicate (intl.)      │
                      └─────────────────────────────────┘
                                       │
                                       ▼
                      ┌─────────────────────────────────┐
                      │ Post-processing pipeline        │
                      │ - Loudness normalization        │
                      │ - Demucs stem separation        │
                      │ - Watermarking (audible+meta)   │
                      │ - FFmpeg encoding → m4a/mp3     │
                      └─────────────────────────────────┘
                                       │
                                       ▼
                      ┌─────────────────────────────────┐
                      │ Storage + streaming             │
                      │ - S3 / R2 origin                │
                      │ - HLS for in-browser playback   │
                      │ - CDN                           │
                      └─────────────────────────────────┘
```

---

## Component-by-component plan

### 1. Song generation: primary model

- **ACE-Step 1.5 XL** via [`clockworksquirrel/ace-step-apple-silicon`](https://github.com/clockworksquirrel/ace-step-apple-silicon) on M5 Max.
- Hybrid backend: Qwen3 planner on **MLX**, DiT decoder on **PyTorch MPS**, bf16 throughout.
- Why XL over standard 2B: 128 GB of unified memory absorbs the cost, and the 4 B DiT closes meaningful quality gaps for paying users.

**LoRA fine-tuning path (when needed):**
- Document the platform's target genres, then curate ~50-200 song lyric/audio pairs per genre.
- Train a per-genre LoRA on the 3090-class budget (~1 hour per LoRA per [`ace-step-1.5 README`](https://github.com/ace-step/ACE-Step-1.5)).
- Serve via the same inference pipeline with LoRA hot-swap.

**Fallback / A-B candidates:**
- **HeartMuLa-MLX** ([`Acelogic/heartlib-mlx`](https://github.com/Acelogic/heartlib-mlx)): 2.1× faster than PyTorch MPS, full numerical parity, Apache 2.0.
- **DiffRhythm 2** ([`ASLP-lab/DiffRhythm`](https://github.com/ASLP-lab/DiffRhythm)): for the speed/instrumental tier (210 s ceiling acceptable for short-form features like background loops).
- **YuE via Replicate** ([`replicate.com/fofr/yue`](https://replicate.com/fofr/yue/api)): only for EN+Mandarin+Cantonese+JP+KR generations where ACE-Step underperforms; pay-per-second, no local infra cost.

### 2. Lyrics generation: separate LLM

The song-gen model takes **lyrics + style** as input, not raw user prompts. Suno's "song description" flow is actually two stages: prompt → lyrics LLM → lyrics → song model.

- Use any decent open LLM running on the user's M5 Max. Candidates:
  - **Qwen 2.5 Coder 32B / Qwen 3 7B**: good multilingual chops, fast on MPS via Ollama or mlx-lm.
  - **Llama 3.3 70B 4-bit**: premium tier; fits comfortably in 128 GB unified.
  - **GPT-OSS-20B**: Apache 2.0, sturdy English.
- Prompt template should:
  1. Parse user style hint into tags (genre, tempo, mood, instruments).
  2. Output structured lyrics with `[verse]`, `[chorus]`, `[bridge]`, `[outro]` markers; these are **exactly the structural tags ACE-Step's `TextEncodeAceStepAudio` consumes**.
  3. Constrain section count and line count to roughly match the target song duration.
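
The prompt-template requirements above suggest a cheap validation step on the LLM's output before it reaches the song model. The following is a minimal sketch under stated assumptions: `parse_sections` and `fits_duration` are hypothetical helpers, only the four markers named above are accepted, and the figure of roughly 4 s of audio per sung line is a placeholder to be tuned against real generations.

```python
import re

SECTION_TAGS = {"verse", "chorus", "bridge", "outro"}  # markers named above
TAG_LINE = re.compile(r"^\[(\w+)\]$")


def parse_sections(lyrics: str) -> list[tuple[str, list[str]]]:
    """Split tagged lyrics into (tag, lines) pairs; reject unknown tags."""
    sections: list[tuple[str, list[str]]] = []
    for raw in lyrics.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = TAG_LINE.match(line)
        if match:
            tag = match.group(1).lower()
            if tag not in SECTION_TAGS:
                raise ValueError(f"unknown section tag: [{tag}]")
            sections.append((tag, []))
        elif sections:
            sections[-1][1].append(line)
        else:
            raise ValueError("lyrics must start with a section tag")
    return sections


def fits_duration(sections, target_seconds: float, seconds_per_line: float = 4.0) -> bool:
    """ASSUMPTION: about 4 s of audio per sung line; tune against real outputs."""
    total_lines = sum(len(lines) for _, lines in sections)
    return total_lines * seconds_per_line <= target_seconds


sample = """[verse]
City lights are fading out
I keep walking anyway
[chorus]
Hold on, hold on
"""
sections = parse_sections(sample)
print([(tag, len(lines)) for tag, lines in sections])
print(fits_duration(sections, target_seconds=30))
```

If validation fails, the orchestrator could re-prompt the lyrics LLM with the error message rather than passing malformed lyrics downstream.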

**This LLM is independent of the song-gen model and can be swapped freely.**

### 3. Style / tag normalization

A small classifier or 3 B LM that normalizes user free-text into the controlled-vocabulary tag set the song model was trained on (per genre, BPM bucket, vocal gender, mood). For ACE-Step this maps to its lyric-tag schema; for YuE it maps to `top_200_tags.json`.

Implementation: few-shot prompt to the lyrics LLM with examples; cache results.
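
The shape of that normalize-then-cache step can be sketched without committing to a particular LLM. In the example below a keyword lookup stands in for the LLM call, purely so the snippet is runnable; the vocabulary is invented for illustration and is not ACE-Step's or YuE's real tag set, and `normalize_style` is a hypothetical name.

```python
import re
from functools import lru_cache

# ASSUMPTION: illustrative vocabulary only. The real tag set comes from the
# target model (for YuE, the text above points at top_200_tags.json).
CONTROLLED_TAGS = {
    "genre": {"lo-fi": "lofi", "lofi": "lofi", "hip hop": "hip-hop", "folk": "folk"},
    "mood": {"sad": "melancholic", "happy": "uplifting", "chill": "relaxed"},
    "vocal": {"female": "female vocal", "male": "male vocal"},
}


@lru_cache(maxsize=4096)
def normalize_style(free_text: str) -> tuple[str, ...]:
    """Map free text to controlled tags; repeated inputs are served from cache.

    A keyword lookup stands in here for the LLM call described above.
    """
    text = free_text.lower()
    tags: list[str] = []
    for synonyms in CONTROLLED_TAGS.values():
        for phrase, tag in synonyms.items():
            # Word boundaries stop "male" from matching inside "female".
            if re.search(rf"\b{re.escape(phrase)}\b", text) and tag not in tags:
                tags.append(tag)
    return tuple(tags)


print(normalize_style("Chill lo-fi beat with female vocals"))
print(normalize_style("Chill lo-fi beat with female vocals"))  # cache hit
print(normalize_style.cache_info())
```

In a multi-process deployment the in-memory `lru_cache` would likely be replaced by a shared store, since each worker would otherwise pay for the first LLM call separately.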

### 4. Voice cloning / personas (optional but Suno-equivalent)

To match Suno's "Personas" feature:
- **RVC v2** (Retrieval-based Voice Conversion): open source, fast, runs on MPS, well-supported.
- Train on a 5-minute reference clip → 10-15 min on M5 Max → speaker embedding.
- Apply to the generated vocal stem (Demucs-extracted) → remix.

ACE-Step's **ICL mode** (in-context learning from a reference clip) and YuE's ICL variants partly cover this too, but RVC gives explicit per-speaker control.

### 5. Stem separation

For Suno's "download stems" feature:
- **Demucs v4 / HTDemucs**: open source, MIT-licensed, runs on MPS, separates into vocals / drums / bass / other.
- Already bundled in [`fspecii/ace-step-ui`](https://github.com/fspecii/ace-step-ui).

### 6. Mastering / loudness normalization

- **pyloudnorm** for LUFS normalization to streaming spec (-14 LUFS Spotify, -16 for AirPods).
- **ffmpeg-normalize** as a CLI wrapper.
- **Optional: TBProAudio mvMeter / Voxengo Span equivalents** via web-audio for UI metering.
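
Once the integrated loudness of a track has been measured (the measurement itself, per ITU-R BS.1770, is what a library such as pyloudnorm provides), normalization to a target is a single gain change. The sketch below shows only that gain arithmetic in plain Python; the function names are illustrative, and it deliberately ignores true-peak limiting, which a real mastering chain would need when the gain is positive.

```python
def gain_db(measured_lufs: float, target_lufs: float = -14.0) -> float:
    """Gain in dB that moves a track from its measured loudness to the target."""
    return target_lufs - measured_lufs


def linear_gain(measured_lufs: float, target_lufs: float = -14.0) -> float:
    """The same gain as a linear amplitude multiplier."""
    return 10.0 ** (gain_db(measured_lufs, target_lufs) / 20.0)


def apply_gain(samples: list[float], gain: float) -> list[float]:
    """Scale samples; no limiter here, so values may exceed [-1, 1]."""
    return [s * gain for s in samples]


# A quiet render measured at -20 LUFS, normalized to the -14 LUFS target.
print(gain_db(-20.0))
print(round(linear_gain(-20.0), 3))
print(apply_gain([0.1, -0.2, 0.3], linear_gain(-20.0)))
```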

### 7. Watermarking + content credentials

This is a **legal must-have** for any 2026 generative-music product (training-data lawsuits against Suno/Udio set the precedent).

- **Inaudible audio watermark**: AudioSeal (Meta) or SilentCipher (Sony), both open source and designed to survive MP3 transcoding.
- **C2PA metadata**: sign the m4a with model name + version + prompt + timestamp via the C2PA SDK.
- **Visible "AI-generated" tag** in UI per the YuE model card's recommendation (and increasingly per platform policy).
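
To make the metadata fields above concrete, here is a sketch of the record that could be assembled before signing. It is not the C2PA manifest format and does not call the C2PA SDK; `build_provenance_record` and its field names are assumptions chosen for illustration, and only the standard library is used.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_provenance_record(audio_bytes, model_name, model_version, prompt, created_at=None):
    """Collect the fields named above plus a content hash that binds them to the audio."""
    created_at = created_at or datetime.now(timezone.utc)
    return {
        "audio_sha256": hashlib.sha256(audio_bytes).hexdigest(),
        "generator": {"name": model_name, "version": model_version},
        "prompt": prompt,
        "created_at": created_at.isoformat(),
        "ai_generated": True,  # mirrors the visible "AI-generated" tag in the UI
    }


def to_json(record) -> str:
    """Serialize deterministically so the same record always produces the same bytes."""
    return json.dumps(record, sort_keys=True, separators=(",", ":"))
```

Hashing the audio bytes ties the record to one specific file, so the same record cannot be reattached to a different track without detection.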

### 8. Storage and streaming

- **S3-compatible object store** (R2, Backblaze B2, or self-hosted MinIO on the M5 Max if dev-only).
- **HLS encoding pipeline**: ffmpeg → m3u8 + 4 s segments; serve via NGINX or Cloudflare.
- For local dev, plain m4a + range requests are fine.
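
The HLS step could look roughly like the following, which only builds the ffmpeg argument list (running it is left to `subprocess.run`). The function name, output file names, and the 192k AAC bitrate are assumptions; the flags themselves (`-f hls`, `-hls_time`, `-hls_playlist_type`, `-hls_segment_filename`) are standard options of ffmpeg's HLS muxer.

```python
from pathlib import Path


def hls_command(src: str, out_dir: str, segment_seconds: int = 4, bitrate: str = "192k"):
    """Build an ffmpeg argv that packages an audio file as a VOD HLS playlist."""
    out = Path(out_dir)
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-vn",                                  # audio only: drop any cover-art stream
        "-c:a", "aac", "-b:a", bitrate,
        "-f", "hls",
        "-hls_time", str(segment_seconds),      # target segment length in seconds
        "-hls_playlist_type", "vod",            # finished file, not a live stream
        "-hls_segment_filename", str(out / "seg_%03d.ts"),
        str(out / "index.m3u8"),
    ]
```

Returning a list rather than a shell string avoids quoting problems when prompts or titles end up in file names.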

### 9. Orchestrator API

- **FastAPI** for the request-handling layer.
- **Redis Streams** or **Hatchet** for the generation queue (songs are 30 s–2 min jobs on M5 Max, so latency is non-trivial and generation must be async).
- **PostgreSQL** for users, songs, lyrics, LoRAs, billing.
- **Server-Sent Events** for progress streaming back to the UI ("planner stage", "DiT denoising step 14/27", "mastering...").
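
The progress stream is plain text in the Server-Sent Events wire format, so it can be sketched without any web framework. The event names and payload fields below are assumptions; in a FastAPI app a generator like `progress_stream` would typically be returned in a streaming response with media type `text/event-stream`.

```python
import json
from typing import Iterator


def sse_event(event: str, data: dict) -> str:
    """Format one SSE message: an event name, a JSON data line, and a blank line."""
    payload = json.dumps(data, separators=(",", ":"))
    return f"event: {event}\ndata: {payload}\n\n"


def progress_stream(total_steps: int) -> Iterator[str]:
    """Yield the stages a generation job passes through (stage names are illustrative)."""
    yield sse_event("stage", {"name": "planner"})
    for step in range(1, total_steps + 1):
        yield sse_event(
            "progress", {"stage": "dit_denoising", "step": step, "total": total_steps}
        )
    yield sse_event("stage", {"name": "mastering"})
    yield sse_event("done", {})
```

On the browser side, an `EventSource` listener on the `progress` event can render "DiT denoising step 14/27" directly from the `step` and `total` fields.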

### 10. Frontend

- **Next.js 16** + Cache Components for the user dashboard / library.
- **Wavesurfer.js** for waveform display and scrubbing.
- **Tone.js** for any in-browser preview / mixing.
- Auth via Clerk or Auth0; the user's portfolio revamp may already include this.

---

## Build order (incremental milestones)

| Milestone | Scope | Validates |
|---|---|---|
| **M0: Spike** | Get ACE-Step 1.5 XL running locally via clockworksquirrel fork; generate one 30 s song end-to-end | Hardware compatibility, RTF on M5 Max |
| **M1: CLI MVP** | Wrap in a Python CLI: `genmusic --prompt "..." --lyrics "..." --out song.m4a` | Headless generation, mastering chain, file output |
| **M2: Local UI** | Adopt `fspecii/ace-step-ui` as the initial UI (fastest path); add Demucs stem download | Browser flow, multi-song library, LAN access |
| **M3: Lyrics LLM integration** | Plug Qwen 3 / Llama 3.3 as the lyrics generator; produce structured lyrics from a one-line prompt | Suno-equivalent prompt UX |
| **M4: Multi-model router** | Add HeartMuLa-MLX as alternate; add Replicate YuE as multilingual fallback; user can pick or auto-route | A/B capability, breadth |
| **M5: LoRA pipeline** | First custom LoRA on a target genre (e.g., user's preferred style); hot-swap at inference | Differentiation vs Suno |
| **M6: Production wrapper** | FastAPI + Postgres + queue + auth + watermarking + C2PA signing | Real product surface |
| **M7: Deploy** | Move heavy inference behind a rented A100 endpoint for paid users; keep M5 Max for free tier / personal use | Paid-tier economics |
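
As a starting point for M1, the `genmusic` command line from the table can be sketched with `argparse`. Only the three flags shown in the table come from the plan; the defaults, help strings, and the placeholder body of `main` are assumptions, and no generation actually happens here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Argument parser mirroring: genmusic --prompt "..." --lyrics "..." --out song.m4a"""
    parser = argparse.ArgumentParser(
        prog="genmusic", description="Generate a song from a text prompt (sketch)."
    )
    parser.add_argument("--prompt", required=True, help="style / genre description")
    parser.add_argument("--lyrics", default="", help="lyrics text; empty for instrumental")
    parser.add_argument("--out", default="song.m4a", help="output file path")
    return parser


def main(argv=None) -> int:
    """Parse arguments and report what would be generated."""
    args = build_parser().parse_args(argv)
    # Placeholder: the generation call and the mastering chain would go here.
    print(f"would write {args.out!r} for prompt {args.prompt!r}")
    return 0
```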

---

## Open questions for the user before M0

1. **Commercial intent.** Is this a personal portfolio project (research mode, where SongGeneration 2 is fair game) or a real SaaS (must stay Apache/MIT)? The license map changes drastically.
2. **Target audience.** Western pop (where Suno still wins polish) vs world music / experimental genres (where ACE-Step / YuE compete fairly)?
3. **Latency target.** Suno generates in ~30 s; users tolerate up to 90 s. ACE-Step on M5 Max hits this; YuE local does not.
4. **Hosting plan.** Local-only for personal use? Or eventually paid tier on rented GPU?
5. **Vocal cloning.** Is Suno-style "Persona" upload a must-have v1 feature, or v2?
6. **Catalog / training data.** Any in-house licensed song catalog for LoRA fine-tuning, or strictly the open-weights model out of the box?

---

## Risks and mitigations

| Risk | Likelihood | Mitigation |
|---|---|---|
| MPS regression in a future PyTorch release breaks ACE-Step | medium | Pin torch version; keep CPU fallback path. |
| ACE-Step releases v2 with breaking API mid-build | medium | Wrap inference in a thin adapter; abstract model behind a single `Generator.generate()` interface. |
| Vendor PER claims (HeartMuLa, LeVo) overstated, leading to quality disappointment | medium | Run internal blind A/B on 20+ prompts before featuring a model in the UI. |
| Output watermark stripped by transcoding | low | Use AudioSeal, which is designed to survive MP3 transcoding; double-stamp with C2PA metadata. |
| Lyrics LLM hallucinates copyrighted hooks | medium | Run a similarity check against an embeddings index of known songs; flag for human review. |
| Training-data IP suit (Suno-style) | low for derivative usage | Use models with documented public-data training (ACE-Step's paper is reasonably transparent); avoid Tencent's non-commercial weights. |
| MPS OOM on long sequences | low (128 GB) | `PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0`; chunk generation; offload non-active LoRAs. |
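
The "thin adapter" mitigation in the table can be sketched as a small interface plus a router. Apart from the `Generator.generate()` name taken from the table, everything here is hypothetical: `SongRequest`, the stub adapters, and `Router` are illustrative, and the stubs return placeholder bytes instead of calling any real model. The point is that a breaking upstream API change would only touch one adapter class.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Protocol


@dataclass(frozen=True)
class SongRequest:
    prompt: str
    lyrics: str = ""
    duration_s: float = 30.0


class Generator(Protocol):
    """The single interface the rest of the platform depends on."""

    def generate(self, request: SongRequest) -> bytes: ...


class StubAceStepAdapter:
    """Placeholder; a real adapter would call the ACE-Step pipeline here."""

    def generate(self, request: SongRequest) -> bytes:
        return f"ace-step:{request.prompt}".encode()


class StubFallbackAdapter:
    """Placeholder for an alternate backend (for example a hosted model)."""

    def generate(self, request: SongRequest) -> bytes:
        return f"fallback:{request.prompt}".encode()


class Router:
    """Pick a backend by name, falling back to the configured default."""

    def __init__(self, backends: Dict[str, Generator], default: str):
        if default not in backends:
            raise KeyError(f"unknown default backend: {default}")
        self._backends = backends
        self._default = default

    def generate(self, request: SongRequest, backend: Optional[str] = None) -> bytes:
        return self._backends[backend or self._default].generate(request)
```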

---

## Why ACE-Step 1.5 XL is the foundation (not just a model pick)

This is worth saying explicitly. Choosing the base model determines:

1. **Inference budget and unit economics**: ACE-Step is the only model where <2 s/song on A100 makes a paid tier economically obvious.
2. **Mac developer ergonomics**: first-class MPS means the user can iterate on the M5 Max for weeks without renting cloud GPU.
3. **License-clean output ownership**: MIT means users own their songs unambiguously.
4. **Future-proof on multilingual**: 50+ languages out of the box matters if the platform grows beyond an English audience.
5. **LoRA personalization is the differentiator**: fine-tuning support that works on MPS lets the user ship genre-specialist sub-models that Suno can't, because Suno's weights are locked.
6. **Production deployments exist**: AMD vendor-backed, `fspecii/ace-step-ui` running at scale, multiple SaaS already on the open weights. This is not betting on a research artifact.

The compound effect of those six is why ACE-Step is recommended as the platform foundation rather than just "the model to start with."