Spaces: techfreakworm/ACE-Music-Studio

techfreakworm committed 2 days ago

Commit 9071450 (unverified) · 1 parent: c86ad91

docs: track spec + mockups + model research

The docs/ and research/ directories existed on disk from before
git init (they hold the brainstorming + research output) but were
never explicitly added by the early A-series commits. Tracking them
now so the spec, UI mockups, and base-model research are part of
the repo history alongside the implementation plan.

Contents:

- docs/superpowers/specs/2026-05-18-ace-music-studio-design.md
- docs/superpowers/specs/mockups/01_generate_mobile_errors.html
- docs/superpowers/specs/mockups/02_cover_extend.html
- docs/superpowers/specs/mockups/03_edit_lyrics.html
- docs/superpowers/specs/mockups/README.md
- research/00_executive_summary.md
- research/01_yue.md
- research/02_diffrhythm.md
- research/03_acestep.md
- research/04_newcomers_and_survey.md
- research/05_apple_silicon_mps_audit.md
- research/06_comparison_matrix.md
- research/07_platform_architecture.md

Files changed (13)

docs/superpowers/specs/2026-05-18-ace-music-studio-design.md +550 -0
docs/superpowers/specs/mockups/01_generate_mobile_errors.html +604 -0
docs/superpowers/specs/mockups/02_cover_extend.html +572 -0
docs/superpowers/specs/mockups/03_edit_lyrics.html +517 -0
docs/superpowers/specs/mockups/README.md +50 -0
research/00_executive_summary.md +122 -0
research/01_yue.md +268 -0
research/02_diffrhythm.md +138 -0
research/03_acestep.md +224 -0
research/04_newcomers_and_survey.md +161 -0
research/05_apple_silicon_mps_audit.md +105 -0
research/06_comparison_matrix.md +93 -0
research/07_platform_architecture.md +200 -0

docs/superpowers/specs/2026-05-18-ace-music-studio-design.md ADDED

	@@ -0,0 +1,550 @@

+# ACE Music Studio — Design Spec
+**Date:** 2026-05-18
+**Status:** Approved — ready for implementation plan
+**Repo:** `~/Projects/llm/music-generator/` → GitHub `techfreakworm/ace-music-studio` (to be created)
+**HF Space:** `huggingface.co/spaces/techfreakworm/ace-music-studio` (to be created)
+**Companion docs:** `research/00_executive_summary.md` (model selection rationale)
+---
+## 1. Goal
+A single-process Gradio app that wraps **ACE-Step 1.5 XL SFT** for full-song generation with vocals, deployable both to a free non-profit **Hugging Face ZeroGPU Space** and locally on **Apple M5 Max (MPS / MLX)** or **NVIDIA (CUDA)** workstations. Supports the full ACE-Step feature surface — text-to-song, audio-reference cover, song extension, segment-level edit/repaint, plus an in-app lyrics writer powered by a bundled small LM. Users can stack any number of LoRAs from a curated preset library or upload custom `.safetensors` files at runtime.
+Non-goals (v1): commercial-tier SaaS, multi-user accounts, persistent storage across sessions, social features, payment integration.
+---
+## 2. Locked product decisions
+| Decision | Value | Source |
+|---|---|---|
+| Product name | **ACE Music Studio** (slug `ace-music-studio`) | brainstorming Q1 |
+| Base model | ACE-Step 1.5 XL SFT (4 B DiT + 4 B Qwen3 planner) | research bundle `03_acestep.md` |
+| Backend pattern | Direct ACE-Step Python API, single Gradio process | brainstorming Q architecture |
+| UI layout | Sidebar nav + form + output (3 columns on desktop) | brainstorming Q layout = B |
+| Theme | Brutalist Mono (pure black/white, no accent) | brainstorming Q palette = E |
+| Tab set | Generate · Cover · Extend · Edit · Lyrics | brainstorming Q scope = all |
+| LoRA capability | Multi-stack via PEFT + bundled presets + custom upload | brainstorming Q scope |
+| Lyrics LM | Qwen 2.5 7B Instruct (Apache-2.0, ~14 GB bf16) | brainstorming Q lyrics LLM |
+| Hosting | Free ZeroGPU (community grant if needed) | brainstorming Q hosting |
+| License | MIT, public GitHub | brainstorming Q license |
+| Mobile | Horizontal scroll tabs at top, ≤ 640 px | brainstorming Q responsive = A |
+| Authorship rule | Mayank Gupta sole author on every commit | user prior memory `feedback_git_authorship.md` |
+---
+## 3. Architecture
+### 3.1 Top-level shape
+```
+              ┌──────────────────────────────────────────┐
+   browser ─▶ │  app.py — Gradio Blocks                  │
+              │  header · sidebar · 5 tabs · CTA footer  │
+              └─────────────────┬────────────────────────┘
+                                │
+                                ▼
+              ┌──────────────────────────────────────────┐
+              │  backend.py — ACEStepStudioBackend       │
+              │  @spaces.GPU(duration=callable)          │
+              │  lazy singletons; one mode-dispatch fn   │
+              └─────────────────┬────────────────────────┘
+                                │
+   ┌──────────────┬─────────────┴────────┬─────────────────┐
+   ▼              ▼                      ▼                 ▼
+ace_pipeline.py  lora_stack.py     lyrics_lm.py     post_process.py
+ACEStepPipeline  preset registry   Qwen 2.5 7B      Demucs stems
+device/cache     PEFT adapters     MLX or PyTorch    pyloudnorm
+                 sniff + validate  lazy load
+```
+### 3.2 Backend singleton — `ACEStepStudioBackend`
+One per-process instance, constructed lazily on first request. Owns three independently-lazy sub-singletons:
+| Sub-singleton | Loads when | Holds |
+|---|---|---|
+| `ACEStepPipeline` instance | first generation request | DiT, Qwen3 planner, audio codec, VAE |
+| `LyricsLM` instance | first lyrics-tab request | Qwen 2.5 7B weights, tokenizer |
+| `Demucs` instance | first stem-separation request | `htdemucs_ft` weights |
+Boot cost: only `_bootstrap()` (cache mirror + symlinks) — ~1–5 s. First gen request: +30–60 s warm-up. First lyrics request: +20–40 s. First stem request: +10 s. All amortised across the session.
+### 3.3 Device autodetect (`ace_pipeline.py`)
+Priority: **CUDA → MPS → CPU**.
+Apple Silicon path:
+- Set `PYTORCH_ENABLE_MPS_FALLBACK=1` before any torch import (in `app.py` module preamble, before backend imports torch).
+- Use the **Apple Silicon fork of ACE-Step** (`clockworksquirrel/ace-step-apple-silicon`) on Mac, pinned via `requirements-mac.txt`. It runs a hybrid of MLX (LM planner) and PyTorch MPS (DiT decoder).
+- Skip the CUDA-only `torch.cuda.mem_get_info` gate: `vram_limit_for("mps")` returns `None`, so ACE-Step's free-VRAM check short-circuits.
+- bf16 throughout; `--bf16 false` only if a specific kernel falls back.
+CUDA path:
+- Vanilla `ace-step` from git (or PyPI when published).
+- bf16; allow flash-attn if installed.
+- `vram_limit_for("cuda")` returns the safe cap from `torch.cuda.mem_get_info`.
+CPU path (warning only, not blocked):
+- Single warning banner on app load if no GPU detected: "CPU inference: expect ~10× slower."
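
The priority order above could be sketched as follows. This is a minimal illustration, not the project's `ace_pipeline.py`: the `torch_module` parameter and the `headroom` factor are assumptions added so the logic can be exercised without a GPU, and the only name taken from the spec is `vram_limit_for`.

```python
import os

# The spec requires this to be set before torch is first imported.
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")


def detect_device(torch_module=None) -> str:
    """Return "cuda", "mps", or "cpu", checked in that priority order."""
    if torch_module is None:
        try:
            import torch as torch_module  # lazy import so the env var above applies
        except ImportError:
            return "cpu"
    if torch_module.cuda.is_available():
        return "cuda"
    mps = getattr(torch_module.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


def vram_limit_for(device: str, torch_module=None, headroom: float = 0.9):
    """Return a byte cap on CUDA, or None so a free-VRAM check is skipped."""
    if device != "cuda":
        return None
    if torch_module is None:
        import torch as torch_module
    free_bytes, _total_bytes = torch_module.cuda.mem_get_info()
    # `headroom` is an assumed safety margin; the spec only says "safe cap".
    return int(free_bytes * headroom)
```

Passing a stand-in object for `torch_module` keeps the L1 "no GPU, no models" tests able to cover all three branches.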
+### 3.4 HF Spaces bootstrap (`app.py:_bootstrap()`)
+Direct port of z-image-studio's pattern, with model paths swapped:
+1. If `on_spaces()`, mirror the read-only `HF_HOME` (build cache) to `~/hf-cache-rw/`.
+2. Repoint `HF_HOME` and `HF_HUB_CACHE` env vars at the writable copy.
+3. Set `ACESTEP_MODEL_BASE_PATH` (or whatever the fork's env var is) to a project-local `./models/`.
+4. Symlink each cached HF snapshot into `./models/<repo>/` so the pipeline's loader finds them locally.
+This avoids re-downloads on every cold container start and works around HF's read-only build cache layer.
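
A rough sketch of the mirror-and-symlink steps, using only the standard library. It assumes the usual Hugging Face hub cache layout (`hub/models--<org>--<name>/snapshots/<revision>/`); the function names `mirror_cache` and `link_snapshots` are hypothetical and the real `_bootstrap()` may differ.

```python
import os
import shutil
from pathlib import Path


def mirror_cache(read_only_home: Path, writable_home: Path) -> Path:
    """Copy the read-only cache once, then repoint the HF env vars at the copy."""
    if not writable_home.exists():
        shutil.copytree(read_only_home, writable_home, symlinks=True)
    os.environ["HF_HOME"] = str(writable_home)
    os.environ["HF_HUB_CACHE"] = str(writable_home / "hub")
    return writable_home / "hub"


def link_snapshots(hub_cache: Path, models_dir: Path) -> list[Path]:
    """Symlink one snapshot per cached repo into models_dir/<repo-name>."""
    models_dir.mkdir(parents=True, exist_ok=True)
    links = []
    for repo_dir in sorted(hub_cache.glob("models--*")):
        snapshots = sorted((repo_dir / "snapshots").glob("*"))
        if not snapshots:
            continue
        # "models--Org--Name" -> "Name"
        repo_name = repo_dir.name.removeprefix("models--").split("--", 1)[-1]
        link = models_dir / repo_name
        if not (link.is_symlink() or link.exists()):
            # Picks the lexicographically last snapshot; resolving refs/main
            # would be more precise when several revisions are cached.
            link.symlink_to(snapshots[-1], target_is_directory=True)
        links.append(link)
    return links
```

Both functions are idempotent, so calling them on every container start costs little once the mirror exists.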
+### 3.5 ZeroGPU integration
+- `@spaces.GPU(duration=…)` decorates `backend.generate(mode, params)` at module load time. The decorator is a no-op identity off Spaces.
+- `duration` is a callable that estimates per-call timeout from `(mode, params)`, clamped to `[60, 180] s`:
+  - Generate / Cover at default settings → 60 s
+  - Long Generate (>120 s output) or Edit → 90–120 s
+  - Extend with large repaint window → 120–180 s
+  - Lyrics (separate decoration) → 30 s
+- On `"GPU task aborted"` exception, auto-retry once at 2× duration. After second failure, return `gr.Warning` with timing diagnostics.
+- `requirements.txt` **must not pin `spaces`** (HF injects its own version).
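
One way the duration estimate and the one-shot retry might look. The coefficients are invented for illustration and only the `[60, 180]` clamp and the 2× retry come from the spec. The `run(mode, params, timeout_s)` callable is a hypothetical stand-in for the decorated entrypoint; how the doubled timeout actually reaches `@spaces.GPU` is left open here.

```python
def estimate_duration(mode: str, params: dict) -> int:
    """Estimate a per-call GPU timeout in seconds, clamped to [60, 180]."""
    if mode == "extend":
        est = 120 + 0.5 * float(params.get("extra_duration_s", 0))
    elif mode == "edit":
        span = float(params.get("segment_end_s", 0)) - float(params.get("segment_start_s", 0))
        est = min(120.0, 90 + 0.25 * max(0.0, span))
    elif mode == "generate" and float(params.get("duration_s", 0)) > 120:
        est = min(120.0, 90 + 0.25 * (float(params["duration_s"]) - 120))
    else:
        est = 60.0  # Generate / Cover at default settings
    return int(min(180, max(60, est)))


def run_with_retry(run, mode: str, params: dict):
    """Call run(); on a ZeroGPU abort, retry exactly once at twice the timeout."""
    timeout_s = estimate_duration(mode, params)
    try:
        return run(mode, params, timeout_s)
    except Exception as exc:
        if "GPU task aborted" not in str(exc):
            raise
    # A second failure propagates; the caller turns it into a gr.Warning.
    return run(mode, params, 2 * timeout_s)
```

Keeping the estimator a pure function of `(mode, params)` means it can be unit-tested without Spaces installed.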
+---
+## 4. The five modes
+All mode handlers live in `modes.py` as pure functions over `(backend, params) → (audio_path, meta_dict)`. They share the **LoRA stack** and **advanced opts** code paths via shared helpers.
+### 4.1 Generate (text → song)
+**Inputs**: `prompt` (style), `lyrics`, `duration_s` (5–240), `instrumental` (bool), `lora_stack`, `advanced`.
+**ACE-Step params**: `audio_cover_strength=0`, `repaint_mode=None`, `flow_edit_morph=False`, `cot_*` controlled by advanced "LM thinking" toggle.
+**Output**: WAV (44.1 kHz stereo) + metadata JSON.
+### 4.2 Cover (audio reference → song in that style)
+**Inputs**: `prompt` (new style hint, optional), `ref_audio` file (any of mp3/wav/flac, ≤ 60 s), `lyrics` (new lyrics), `duration_s`, `lora_stack`, `advanced`.
+**ACE-Step params**: `audio_cover_strength≈0.93` (configurable in advanced), `cover_noise_strength=0`, `infer_method="ode"`.
+**Output**: WAV.
+### 4.3 Extend (continue an existing song)
+**Inputs**: `seed_audio` (≤ 240 s), `extra_prompt`, `extra_duration_s` (5–120), `lora_stack`, `advanced`.
+**ACE-Step params**: `repaint_mode="balanced"`, `repaint_strength` configurable, `repainting_start` set to the seed-audio end timestamp, `repainting_end` set to seed-end + `extra_duration_s`. Exact param names + sentinels for "append-after-end" must be verified against the current ACE-Step Python API during M3 implementation — see §14 open question.
+**Output**: WAV (seed + extension concatenated).
+### 4.4 Edit (repaint / flow morph a segment)
+**Inputs**: `source_audio`, `source_lyrics`, `target_lyrics`, `segment_start_s`, `segment_end_s`, `mode` ∈ {`repaint`, `flow_edit`}, `lora_stack`, `advanced`.
+**ACE-Step params**:
+- repaint sub-mode: `repaint_mode="balanced"`, `repainting_start=segment_start_s`, `repainting_end=segment_end_s`, `repaint_strength=0.5`.
+- flow_edit sub-mode: `flow_edit_morph=True`, `flow_edit_source_caption`, `flow_edit_source_lyrics`, `flow_edit_n_min=0.0`, `flow_edit_n_max=1.0`, `flow_edit_n_avg=1`.
+**Output**: WAV.
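
The per-mode parameter sets in §4.1 to §4.4 can be collected into a single pure function, sketched below. The kwarg names are copied from this spec, which itself flags them as needing verification against the current ACE-Step API; `seed_len_s` and `source_caption` are hypothetical input keys introduced for the example.

```python
def build_kwargs(mode: str, p: dict) -> dict:
    """Map a UI mode and its params to the ACE-Step kwargs listed in the spec."""
    if mode == "generate":
        return {"audio_cover_strength": 0, "repaint_mode": None, "flow_edit_morph": False}
    if mode == "cover":
        return {
            "audio_cover_strength": p.get("audio_cover_strength", 0.93),
            "cover_noise_strength": 0,
            "infer_method": "ode",
        }
    if mode == "extend":
        start = p["seed_len_s"]  # hypothetical key: seed audio length in seconds
        return {
            "repaint_mode": "balanced",
            "repaint_strength": p["repaint_strength"],
            "repainting_start": start,
            "repainting_end": start + p["extra_duration_s"],
        }
    if mode == "edit" and p.get("mode") == "repaint":
        return {
            "repaint_mode": "balanced",
            "repainting_start": p["segment_start_s"],
            "repainting_end": p["segment_end_s"],
            "repaint_strength": 0.5,
        }
    if mode == "edit" and p.get("mode") == "flow_edit":
        return {
            "flow_edit_morph": True,
            "flow_edit_source_caption": p.get("source_caption", ""),  # hypothetical key
            "flow_edit_source_lyrics": p["source_lyrics"],
            "flow_edit_n_min": 0.0,
            "flow_edit_n_max": 1.0,
            "flow_edit_n_avg": 1,
        }
    raise ValueError(f"unsupported mode: {mode!r} / {p.get('mode')!r}")
```

Because the function touches no model state, the L2 mocked-pipeline tests can assert on its return value directly.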
+### 4.5 Lyrics (Qwen 2.5 → structured lyrics)
+**Inputs**: `brief` (free-text prompt), `target_structure` (e.g., "intro, verse, chorus, verse, chorus, bridge, chorus, outro"), `language`, `tone` (optional).
+**System prompt** (locked):
+```
+You are a songwriter. Output ONLY structured lyrics for an AI music generator. Use these section tags exactly:
+[intro] [verse 1] [verse 2] [chorus] [bridge] [outro] (etc.)
+Each section is on its own line, followed by the lyrics for that section. Keep verses 4-8 lines, choruses 4 lines, bridges 2-4 lines. Match the requested tone and language. Do not include commentary, headers, or markdown.
+```
+**Output**: plain text with structural tags. A "Use these in Generate" button populates the Generate tab's `lyrics` field.
+### 4.6 Retake button
+Every mode's output panel has a "↻ retake" button. It re-runs the same mode handler with a new random seed, all other params unchanged.
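
A retake amounts to copying the previous params and swapping the seed. The sketch below assumes the seed lives under a `"seed"` key and fits in 32 bits; neither detail is stated in the spec.

```python
import secrets


def retake(params: dict) -> dict:
    """Return a copy of params with a fresh random seed; nothing else changes."""
    new_seed = secrets.randbelow(2**32)
    while new_seed == params.get("seed"):  # guarantee the seed actually differs
        new_seed = secrets.randbelow(2**32)
    return {**params, "seed": new_seed}
```

Returning a new dict leaves the original params intact for the history entry of the first take.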
+---
+## 5. LoRA stack (`lora_stack.py`)
+### 5.1 Preset registry
+`presets/manifest.json`:
+```json
+[
+  {"name":"RapMachine","hf_id":"ACE-Step/ACE-Step-v1-RapMachine-LoRA","kind":"genre"},
+  {"name":"Chinese Rap","hf_id":"ACE-Step/ACE-Step-v1-Chinese-Rap-LoRA","kind":"genre"},
+  {"name":"Lyric2Vocal","hf_id":"ACE-Step/ACE-Step-v1-Lyric2Vocal-LoRA","kind":"voice"},
+  {"name":"Text2Samples","hf_id":"ACE-Step/ACE-Step-v1-Text2Samples-LoRA","kind":"instrumental"}
+]
+```
+Presets are downloaded from HF on first preset-click, cached, and registered as PEFT adapters with the preset name. The four preset chips appear in every song-mode tab.
+### 5.2 Custom upload
+User drops a `.safetensors` file into the upload zone:
+1. `sniff(path)` reads the safetensors header (no full load, just metadata).
+2. Verifies key naming matches ACE-Step 1.5 XL DiT (`*.to_q.lora_A.weight`, etc.) and rank ≤ 256, alpha set, file ≤ 500 MB.
+3. On success, registers as a new PEFT adapter under `Path(path).stem` as adapter name; appears in the active stack.
+4. On failure, raises `LoRAValidationError` → `gr.Error` toast: "This LoRA isn't compatible with ACE-Step 1.5 XL SFT. Expected DiT modules: to_q, to_k, to_v, to_out.0, ff.net.0.proj, ff.net.2."
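
The header sniff in steps 1 and 2 can be done with the standard library alone, since a safetensors file starts with an 8-byte little-endian length followed by a JSON header. The sketch below is an assumption-laden illustration: it infers rank from the first dimension of each `lora_A.weight` tensor (the PEFT convention) and skips the "alpha set" check, because where alpha is stored depends on how the LoRA was exported.

```python
import json
import struct
from pathlib import Path

MAX_RANK = 256
MAX_BYTES = 500 * 1024 * 1024
EXPECTED_MODULES = ("to_q", "to_k", "to_v", "to_out.0", "ff.net.0.proj", "ff.net.2")
SUFFIX = ".lora_A.weight"


class LoRAValidationError(ValueError):
    """Raised when an uploaded file does not look like a compatible LoRA."""


def read_header(path: Path) -> dict:
    """Read only the JSON header; tensor data is never loaded."""
    size = path.stat().st_size
    try:
        with open(path, "rb") as fh:
            (header_len,) = struct.unpack("<Q", fh.read(8))
            if header_len > size - 8:
                raise LoRAValidationError("header length exceeds file size")
            header = json.loads(fh.read(header_len))
    except (struct.error, json.JSONDecodeError, UnicodeDecodeError) as exc:
        raise LoRAValidationError("not a safetensors file") from exc
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header


def sniff(path) -> int:
    """Validate size, module names, and rank; return the largest rank found."""
    path = Path(path)
    if path.stat().st_size > MAX_BYTES:
        raise LoRAValidationError("file larger than 500 MB")
    ranks = set()
    for name, info in read_header(path).items():
        if not name.endswith(SUFFIX):
            continue
        if not name[: -len(SUFFIX)].endswith(EXPECTED_MODULES):
            raise LoRAValidationError(f"unexpected target module: {name}")
        ranks.add(info["shape"][0])
    if not ranks:
        raise LoRAValidationError("no lora_A weights found")
    if max(ranks) > MAX_RANK:
        raise LoRAValidationError(f"rank {max(ranks)} exceeds {MAX_RANK}")
    return max(ranks)
```

The handler would catch `LoRAValidationError` and surface it as the `gr.Error` toast described in step 4.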
+### 5.3 Active stack management
+UI shows a list of active LoRAs with per-row strength slider (0.0–1.5) and × button. State held in `gr.State` per tab. On generate:
+```python
+backend.apply_lora_stack(active_adapters)   # pipe.set_adapters(names, weights=scales)
+audio, meta = backend.generate(mode, params)
+meta["loras"] = [{"name":n, "scale":s, "sha256":h} for n,s,h in active_adapters]
+```
+After generation the adapters stay loaded (cheap memory cost) but are deactivated via `pipe.disable_adapters()` if the user clears the stack.
+### 5.4 Sole-LoRA edge cases
+- All chips off + no upload → `pipe.disable_adapters()` (vanilla SFT XL output).
+- One LoRA with scale 0.0 → effectively disabled but still listed (UX: don't surprise the user by silently dropping it).
+- Same LoRA loaded twice (user dragged the same file twice) → dedupe by file sha256; UI flash: "already in stack."
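
The sha256 dedupe rule might be implemented along these lines. The stack entry shape mirrors the `meta["loras"]` records in §5.3; `add_to_stack` is a hypothetical helper name, and clamping the scale to the slider range is an assumption.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large LoRAs are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add_to_stack(stack: list, path, scale: float = 1.0) -> bool:
    """Append a LoRA unless an identical file is already present.

    Returns False for a duplicate so the UI can flash "already in stack."
    """
    sha = file_sha256(path)
    if any(entry["sha256"] == sha for entry in stack):
        return False
    scale = min(1.5, max(0.0, float(scale)))  # slider range from the spec
    stack.append({"name": Path(path).stem, "scale": scale, "sha256": sha})
    return True
```

A scale of 0.0 is kept in the list rather than dropped, matching the edge case above.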
+---
+## 6. Lyrics LM (`lyrics_lm.py`)
+### 6.1 Backend selection
+| Device | Backend | Weights size |
+|---|---|---|
+| `mps` (Mac) | `mlx-lm` with quantised Qwen 2.5 7B 4-bit | ~4 GB |
+| `cuda` | `transformers` with bf16 | ~14 GB |
+| ZeroGPU | `transformers` bf16, sliced into the `@spaces.GPU` lifetime | ~14 GB |
+Quantisation on Mac is the practical choice — 4-bit MLX-quant Qwen 2.5 7B runs ~3× faster than full-precision PyTorch MPS and barely affects lyric quality.
+### 6.2 Generation
+- `max_new_tokens=600`, `temperature=0.85`, `top_p=0.9`, `repetition_penalty=1.1`.
+- Stop sequences: `\n\n[end]`, `</lyrics>`.
+- Post-process: strip leading/trailing whitespace, normalize section tags to lowercase (e.g., `[Verse 1]` → `[verse 1]`).
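
The stop-sequence cut and tag normalisation can be expressed with a single regular expression. This is a sketch of the post-processing only; it assumes section tags always begin a line, as the system prompt in §4.5 requests.

```python
import re

STOP_SEQUENCES = ("\n\n[end]", "</lyrics>")
TAG_RE = re.compile(r"^\[([^\]\n]+)\]", re.MULTILINE)


def postprocess_lyrics(raw: str) -> str:
    """Cut at the first stop sequence, trim, and lowercase section tags."""
    for stop in STOP_SEQUENCES:
        cut = raw.find(stop)
        if cut != -1:
            raw = raw[:cut]
    # Only bracketed tags at the start of a line are rewritten; lyric text is untouched.
    return TAG_RE.sub(lambda m: f"[{m.group(1).strip().lower()}]", raw.strip())
```

For example, `postprocess_lyrics("[Verse 1]\nHello World")` leaves the lyric line's capitalisation alone and rewrites only the tag.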
+### 6.3 Lazy loading
+```python
+class LyricsLM:
+    _instance = None
+    @classmethod
+    def get(cls):
+        if cls._instance is None:
+            cls._instance = cls._load()
+        return cls._instance
+```
+First call cost: ~20–40 s on Mac, ~10 s on CUDA. Surfaced to the user via `gr.Progress` on the Lyrics tab.
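
Since Gradio can serve several requests concurrently, two first-time callers could both see `_instance is None` and load the weights twice. A lock-guarded variant of the same pattern is sketched below; `LazySingleton` is a hypothetical base class, not something the spec defines.

```python
import threading


class LazySingleton:
    """Base class: subclasses implement _load(); get() loads at most once."""

    _instance = None
    _lock = threading.Lock()

    @classmethod
    def get(cls):
        if cls._instance is None:          # fast path, no lock once loaded
            with cls._lock:
                if cls._instance is None:  # re-check while holding the lock
                    cls._instance = cls._load()
        return cls._instance

    @classmethod
    def _load(cls):
        raise NotImplementedError
```

The assignment `cls._instance = ...` lands on the subclass, so each subclass keeps its own instance while sharing the base-class lock.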
+---
+## 7. Post-processing (`post_process.py`)
+### 7.1 Stem separation
+- `demucs.api.Separator(model="htdemucs_ft")` lazy singleton.
+- Output: 4 WAV files (vocals, drums, bass, other).
+- Runs synchronously after generation if the user expands the Stems section, or on-demand via a "Separate stems" button in the output panel.
+- On ZeroGPU, counted in the same `@spaces.GPU` lifetime as the generation that produced the audio.
+### 7.2 Loudness normalization
+- `pyloudnorm` normalises to **-14 LUFS** (streaming spec).
+- Toggled by an `Advanced` checkbox per mode (default ON).
+- Applied to the final WAV before MP3 encoding.
+### 7.3 MP3 export
+- `ffmpeg` via `subprocess` — 320 kbps CBR, 44.1 kHz, stereo.
+- Embeds metadata as ID3 tags (prompt, lora hashes, seed).
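
The export settings translate into an `ffmpeg` argument list roughly as follows. The flags shown are standard ffmpeg options; the tag names passed via `-metadata` are illustrative, and how non-standard keys end up in the ID3 frames should be checked against the ffmpeg build in use.

```python
def mp3_command(wav_path: str, mp3_path: str, tags: dict) -> list[str]:
    """Build an ffmpeg argv for 320 kbps CBR, 44.1 kHz, stereo MP3 with ID3 tags."""
    cmd = [
        "ffmpeg", "-y",
        "-i", str(wav_path),
        "-codec:a", "libmp3lame",
        "-b:a", "320k",         # constant bitrate
        "-ar", "44100",         # sample rate
        "-ac", "2",             # stereo
        "-id3v2_version", "3",  # widely supported ID3 version
    ]
    for key, value in tags.items():
        cmd += ["-metadata", f"{key}={value}"]
    cmd.append(str(mp3_path))
    return cmd


# Usage (requires ffmpeg on PATH):
#   import subprocess
#   subprocess.run(mp3_command("out.wav", "out.mp3", {"comment": "seed=42"}), check=True)
```

Passing a list rather than a shell string avoids quoting problems when prompts contain spaces or quotes.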
+---
+## 8. Frontend (`app.py` + `ui.py` + `theme.py`)
+> **Reference mockups (visual source of truth):**
+>
+> | File | Covers |
+> |---|---|
+> | [`mockups/01_generate_mobile_errors.html`](./mockups/01_generate_mobile_errors.html) | Generate tab (fully expanded), mobile phone screens, error / edge-case states |
+> | [`mockups/02_cover_extend.html`](./mockups/02_cover_extend.html) | Cover tab + Extend tab (both fully expanded) |
+> | [`mockups/03_edit_lyrics.html`](./mockups/03_edit_lyrics.html) | Edit tab (Repaint + Flow Morph sub-modes) + Lyrics tab (Qwen LM params) |
+> | [`mockups/README.md`](./mockups/README.md) | What's shared across tabs + what each tab adds |
+>
+> The mockups define the **layout, spacing, control surface, and disclosure hierarchy.** The prose below defines the **semantics** — what each control does, what the defaults are, what the responsive breakpoints are. If a discrepancy ever shows up, the mockups are the source for layout, and §3–§7 of this spec are the source for behaviour.
+### 8.1 Page chrome
+```
+HEADER (sticky):
+  [brand: "ACE Music Studio." in 15px white, "." in #FFF as period]
+  [status: "ready · MPS · M5 Max" in 10px muted]
+CTA (below header, separator below):
+  Built with ♥.  Drop a like  ·  Follow @techfreakworm  for what's next.
+(Tab content)
+```
+### 8.2 Sidebar (desktop ≥ 1024 px)
+5 mode items + History section below. Active item: white left border + brighter text. Width: 170 px.
+### 8.3 Tablet (640–1024 px)
+Sidebar collapses to 30 px wide **icon rail**. Hover shows tooltip with full label. Same active treatment.
+### 8.4 Mobile (< 640 px)
+Native `gr.Tabs` (horizontal scroll) replaces the sidebar entirely. The swap is a CSS media query: `display: none` on `.ms-sidebar`, `display: flex` on `.ms-mobile-tabs`. No JS.
+### 8.5 Tab body
+Two-column on desktop (form 60% / output 40%), stacks vertically on tablet and mobile.
+Form layer order (top to bottom, always-visible by default):
+1. Style prompt (textarea, ~3 rows)
+2. Lyrics (textarea, ~6 rows) — except Lyrics tab, which replaces with brief + structure inputs
+3. Mode-specific: ref audio (Cover), seed audio (Extend), source + segment (Edit)
+4. Duration slider + vocals/instrumental toggle (Generate only)
+5. LoRA section (collapsed by default; chip row visible if any preset is "on")
+6. Advanced accordion (collapsed by default)
+7. LM-planner accordion (collapsed by default)
+8. Generate button (primary; white-on-black; full-width on mobile)
+### 8.6 Output panel
+- Audio player with built-in waveform (Gradio 5 native)
+- Retake button (↻)
+- Stems grid (Demucs) — only visible after Demucs runs
+- Action row: ↓ mp3 · ↓ wav · `{ }` meta · ↗ share (copies a permalink with prompt+seed in URL params)
+- Metadata JSON viewer (collapsible, default closed)
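
The share permalink could be built with `urllib.parse`, as in this small sketch. The query parameter names `prompt` and `seed` follow the wording above, while the base URL is whatever the app is served from.

```python
from urllib.parse import urlencode


def share_link(base_url: str, prompt: str, seed: int) -> str:
    """Return base_url with the prompt and seed encoded as query parameters."""
    return f"{base_url}?{urlencode({'prompt': prompt, 'seed': seed})}"
```

`urlencode` percent-escapes commas, ampersands, and non-ASCII characters, so arbitrary style prompts survive the round trip.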
+### 8.7 Theme tokens (`theme.py`)
+```python
+BG = "#0A0A0A"
+SURFACE = "#141414"
+SURFACE_STRONG = "#000000"
+BORDER = "#1F1F1F"
+BORDER_STRONG = "#2A2A2A"
+INK = "#E5E5E5"
+INK_MUTED = "#6B6B6B"
+PRIMARY = "#FFFFFF"
+ERROR = "#E5E5E5"  # high-contrast white in Brutalist Mono; gradio error background still red-ish but our text is white
+RADIUS = "6px"
+FONT_STACK = '"Inter", -apple-system, BlinkMacSystemFont, "Segoe UI", system-ui, sans-serif'
+```
+CSS injected via `gr.Blocks(css=…)` covers sidebar layout, responsive media queries, LoRA chip pill, waveform tightening, accordion arrow customization, hide-Gradio-footer.
+---
+## 9. Data flow per generation
+```
+1. User clicks "Generate" button on the Generate tab.
+2. app.py:on_generate(...) handler reads all gr inputs, coerces types.
+3. Handler validates active LoRAs (cheap header sniff) — raises gr.Error on failure.
+4. Handler calls backend.generate_with_retry(mode="generate", params={...}).
+5. backend.generate_with_retry is the @spaces.GPU-decorated entrypoint.
+6. Inside the GPU lifetime:
+   a. _ensure_pipeline()              — lazy load on first call
+   b. _apply_lora_stack(params.loras) — pipe.set_adapters(names, weights)
+   c. _dispatch_mode("generate", params) — calls pipe(...) with mode-specific kwargs
+   d. _post_process(audio, params)     — loudness norm, optionally stems
+   e. _emit_meta(params, audio)        — build metadata JSON, sha256s
+7. Returns (audio_path, meta_dict).
+8. Handler updates UI: audio player, metadata JSON viewer.
+9. History entry appended (in-memory, last 10).
+```
+ZeroGPU abort handling wraps step 5 in a one-shot retry at 2× duration. Beyond that: `gr.Warning` with the suggestion to reduce duration or steps.
+---
+## 10. Error handling matrix
+| Trigger | User-facing | Logs |
+|---|---|---|
+| LoRA file invalid (rank, modules, size) | `gr.Error("This LoRA isn't compatible with ACE-Step 1.5 XL SFT. …")` | full traceback to stderr |
+| Audio input wrong format | `gr.Error("Audio must be wav/mp3/flac, ≤ 240 s.")` | format diagnostics |
+| Cover/Extend/Edit missing required input | `gr.Error("Reference audio is required for Cover mode.")` | param dump |
+| ZeroGPU abort | auto-retry once at 2× duration; if still aborts: `gr.Warning("Generation timed out. Try a shorter duration or fewer steps.")` | timing info |
+| Lyrics LM cold-load fails (OOM) | `gr.Error("Couldn't load lyrics model. Free some memory and retry.")` | full traceback |
+| MPS op not implemented | falls back to CPU via env var; if still crashes: `gr.Error("This ACE-Step op isn't yet supported on Apple Silicon. Generation aborted.")` | op name + diagnostics |
+| Demucs separator fails on weird audio | `gr.Warning("Stem separation failed — audio still saved.")` | traceback |
+| Preset LoRA download fails | `gr.Error("Couldn't download preset 'X'. Check network.")` | network log |
+| Out-of-disk on cache mirror | `gr.Error("Disk full. Free space and reload.")` | mount stats |
+---
+## 11. Testing
+### 11.1 Layers
+- **L1 — no GPU, no models**: module structure, type signatures, theme CSS asserts, LoRA-header sniff unit tests, metadata JSON shape, preset manifest schema. ~30 tests, runs in < 5 s.
+- **L2 — mocked pipeline**: each mode handler calls the backend with the right kwargs; `set_adapters` invoked with correct order/weights; lyrics LM prompt template asserted. ~25 tests, runs in < 30 s.
+- **GPU smoke (`@pytest.mark.gpu`, skipped by default)**: one Generate + one Cover + one Extend + one Lyrics at minimum settings, asserts output exists and is non-zero size. ~4 tests, runs in 5–10 min on M5 Max.
+### 11.2 CI
+- GitHub Actions: Python 3.11, run L1 + L2 with `pytest -m "not gpu"`.
+- ruff format + ruff check both pass.
+- No GPU testing in CI (cost). The user runs `pytest -m gpu` locally on the M5 Max before each release tag.
+### 11.3 Manual verification before merge
+- Each new mode handler: at least one end-to-end on M5 Max with a real prompt + the psytrance LoRA loaded.
+- LoRA upload: at least one bad-file rejection (rank mismatch) + one good-file success.
+- Responsive: open on phone (Safari iOS), verify horizontal tab strip, verify generate end-to-end.
+---
+## 12. Deployment
+### 12.1 HF Spaces
+`README.md` frontmatter:
+```yaml
+---
+title: ACE Music Studio
+emoji: 🎵
+colorFrom: gray
+colorTo: gray
+sdk: gradio
+sdk_version: "5.50.0"
+app_file: app.py
+python_version: "3.11"
+suggested_hardware: zero-a10g
+hf_oauth: false
+preload_from_hub:
+  - ACE-Step/ACE-Step-v1.5-XL-SFT *.safetensors,config.json,scheduler/*,vae/*,tokenizer/*
+  - Qwen/Qwen2.5-7B-Instruct *.safetensors,config.json,tokenizer*
+  - facebook/htdemucs_ft *.th
+  - ACE-Step/ACE-Step-v1-RapMachine-LoRA *.safetensors
+  - ACE-Step/ACE-Step-v1-Chinese-Rap-LoRA *.safetensors
+  - ACE-Step/ACE-Step-v1-Lyric2Vocal-LoRA *.safetensors
+  - ACE-Step/ACE-Step-v1-Text2Samples-LoRA *.safetensors
+---
+```
+Preload size estimate: ACE-Step XL SFT ~16 GB + Qwen 2.5 ~14 GB + htdemucs ~250 MB + 4 LoRAs ~400 MB = **~31 GB**, well under HF's 150 GB cap.
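
The estimate is easy to re-check as the preload list changes. The figures below are the approximate sizes quoted in the sentence above, not measured values.

```python
# Approximate preload sizes in GB, as quoted in the spec.
preload_gb = {
    "ACE-Step XL SFT": 16.0,
    "Qwen 2.5 7B (bf16)": 14.0,
    "htdemucs_ft": 0.25,
    "4 LoRA presets": 0.4,
}

total_gb = sum(preload_gb.values())
print(f"total preload: {total_gb:.2f} GB")  # about 30.65, which the spec rounds to ~31 GB
```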
+### 12.2 GitHub
+- Repo: `techfreakworm/ace-music-studio` (public).
+- License: MIT.
+- HF Space mirror via dedicated git remote (`git push space main`).
+- README badges: HF Space, GitHub stars, MIT license, Python 3.11, backend ACE-Step.
+### 12.3 Local install
+```bash
+git clone https://github.com/techfreakworm/ace-music-studio
+cd ace-music-studio
+bash setup.sh           # creates .venv (Python 3.11), installs requirements
+source .venv/bin/activate
+python app.py           # http://127.0.0.1:7860
+```
+`setup.sh` detects Mac vs CUDA and installs the right ACE-Step branch + Qwen backend (mlx-lm on Mac).
+---
+## 13. Out of scope for v1
+These are deferred to v2+ — do **not** implement without explicit user OK:
+- Multi-prompt batch queue (generate 5 variants in a row)
+- Persistent generation history across sessions (DB-backed)
+- User accounts / auth
+- Telemetry dashboard
+- Voice cloning ("Persona" feature — RVC integration)
+- LoRA training inside the app
+- ControlNet-style conditioning (rhythm tracks, MIDI input)
+- Spectrogram visualization (waveform is enough for v1)
+- Multi-language UI strings (English only; song content can be any language)
+- Watermarking output audio
+- Browser-side audio editing (cut, paste, fade)
+- Multi-tenant rate limiting
+- Export to DAW format (stem zip is enough for v1)
+- Visual regression tests for the Gradio UI
+---
+## 14. Open implementation questions (defer to writing-plans)
+1. **ACE-Step package — git or PyPI?** As of 2026-05-18, the official `ace-step` PyPI package exists for v1.5 but the Apple-Silicon fork is git-only. Decision: `pip install ace-step` on CUDA, `pip install git+https://github.com/clockworksquirrel/ace-step-apple-silicon` on Mac (detected by `setup.sh`).
+2. **Demucs model — `htdemucs` or `htdemucs_ft`?** `htdemucs_ft` is the fine-tuned variant with slightly better separation. Larger weight (~250 MB) but trivial in our budget. Default: `htdemucs_ft`.
+3. **LoRA preset HF IDs** — placeholder paths above (`ACE-Step/ACE-Step-v1-*-LoRA`) may not match the exact HF org/repo naming when this is implemented; the plan should verify each preset's actual canonical HF path before the preload directive is finalised.
+4. **Qwen 2.5 7B vs 3B for ZeroGPU comfort** — 7B is correct per the brainstorming answer. If ZeroGPU's 60 s budget is too tight for cold-load + generate, fall back to **Qwen 2.5 3B Instruct** (~6 GB) without UI changes.
+5. **Edit-mode UX for segment selection** — start with two numeric inputs (start_s, end_s). v1.5 can add a waveform-clickable selector if user feedback demands it.
+6. **History persistence** — v1 is in-memory only. The sidebar history list is `gr.State`-backed and wipes on page reload. Persistent history is v2.
+7. **ACE-Step Extend / Repaint exact API surface** — the psytrance LoRA generation config shows the relevant kwargs (`repainting_start`, `repainting_end`, `repaint_mode`, `repaint_strength`, `chunk_mask_mode`, `repaint_latent_crossfade_frames`, `repaint_wav_crossfade_sec`). Verify the conventions for "append after end of seed audio" (e.g., does `repainting_end > audio_length` extend, or do we need a different sentinel?) before M3 ships.
+8. **MLX-quant Qwen 2.5 7B availability** — confirm `mlx-community/Qwen2.5-7B-Instruct-4bit` exists and produces acceptable lyric quality. If not, use `mlx-community/Qwen2.5-3B-Instruct-4bit` as the Mac path (the backend table in §6.1 then becomes 3B on Mac, 7B on CUDA).
+---
+## 15. Sole-author rule
+Per the user's permanent feedback (memory `feedback_git_authorship.md`):
+- Mayank Gupta is sole author on every commit.
+- **NO** `Co-Authored-By: Claude…` trailer.
+- **NO** `Generated with Claude Code` footer.
+- **NO** `--author=…` flag.
+- This applies to commits made by any AI assistant working on this repo.
+Encoded in `CLAUDE.md`, `AGENTS.md`, and `SKILLS.md` at the top of the repo so every assistant sees it on first read.
+---
+## 16. Implementation milestones (rough)
+(Detailed sequencing belongs in the implementation plan — see `docs/superpowers/plans/`.)
+| Milestone | Deliverable | Validates |
+|---|---|---|
+| M0 — Bootstrap | `app.py:_bootstrap()` + device autodetect + Gradio Blocks skeleton + theme | App boots on M5 Max and on a Space-equivalent CPU env |
+| M1 — Generate mode (no LoRA) | `modes.generate` + `ace_pipeline.py` + audio player output | End-to-end "psytrance, 30 s" generation on M5 Max |
+| M2 — LoRA stack | `lora_stack.py` + preset chips + custom upload + active stack UI | Psytrance v2 + RapMachine stacked at 0.95 / 0.85 produce audibly different output |
+| M3 — Cover, Extend, Edit | Three more handlers + their tab UIs | Each mode produces a non-trivial output |
+| M4 — Lyrics LM | `lyrics_lm.py` + Lyrics tab + "use these" flow | Qwen 2.5 emits valid structural-tag lyrics; round-trip into Generate works |
+| M5 — Post-processing | Demucs + pyloudnorm + mp3 export | Stems download, normalised output, ID3-tagged MP3 |
+| M6 — Responsive + polish | Mobile media queries + tooltips + error UX + history sidebar | Phone Safari renders + generates end-to-end |
+| M7 — Deploy | Preload directive + ZeroGPU decorator + retry logic + Space mirror | Public Space serves requests at parity with local |
+---
+## 17. References
+- ACE-Step 1.5 paper: [arXiv 2506.00045](https://arxiv.org/abs/2506.00045)
+- ACE-Step 1.5 repo: [github.com/ace-step/ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)
+- Apple Silicon fork: [github.com/clockworksquirrel/ace-step-apple-silicon](https://github.com/clockworksquirrel/ace-step-apple-silicon)
+- ACE-Step LoRA family: [ace-step.github.io](https://ace-step.github.io/)
+- Qwen 2.5: [huggingface.co/Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)
+- Demucs: [github.com/facebookresearch/demucs](https://github.com/facebookresearch/demucs)
+- z-image-studio (architectural precedent): `~/Projects/llm/z-image-studio/`
+- Research bundle: `research/00_executive_summary.md` and siblings

docs/superpowers/specs/mockups/01_generate_mobile_errors.html ADDED

	@@ -0,0 +1,604 @@

+<h2>Generate fully expanded · mobile · error states</h2>
+<p class="subtitle">Last batch. Generate tab with every control surfaced. Mobile phone screens for Generate + Cover + Lyrics. Six error/edge-case states.</p>
+<style>
+  .gm { background:#0A0A0A; color:#E5E5E5; border:1px solid #1F1F1F; border-radius:10px; padding:18px; font-size:12px; line-height:1.5; margin-top:14px; }
+  .gm-header { display:flex; justify-content:space-between; align-items:center; padding-bottom:10px; border-bottom:1px solid #1F1F1F; margin-bottom:14px; }
+  .gm-brand { font-size:15px; font-weight:600; }
+  .gm-cta { font-size:11px; color:#6B6B6B; }
+  .gm-cta strong { color:#E5E5E5; }
+  .gm-status { font-size:10px; color:#6B6B6B; letter-spacing:0.08em; text-transform:uppercase; }
+  .gm-row { display:flex; gap:16px; align-items:flex-start; }
+  .gm-sidebar { background:#000; padding:14px 10px; border-radius:6px; min-width:170px; }
+  .gm-side { display:block; padding:8px 10px; border-radius:4px; margin-bottom:3px; font-size:12px; color:#6B6B6B; }
+  .gm-side.active { background:#1A1A1A; color:#FFF; border-left:2px solid #FFF; padding-left:8px; }
+  .gm-side .em { margin-right:6px; }
+  .gm-main { flex:1; display:flex; gap:14px; align-items:flex-start; }
+  .gm-form { flex:1.3; background:#141414; padding:16px; border-radius:6px; }
+  .gm-output { flex:1; background:#141414; padding:16px; border-radius:6px; min-width:260px; }
+  .gm-label { font-size:10px; text-transform:uppercase; letter-spacing:0.08em; color:#6B6B6B; margin-bottom:6px; display:flex; justify-content:space-between; align-items:center; }
+  .gm-label .hint { color:#5A5048; font-size:9px; text-transform:none; letter-spacing:normal; font-weight:400; }
+  .gm-input { background:#000; border:1px solid #2A2A2A; padding:8px 10px; border-radius:4px; color:#E5E5E5; margin-bottom:12px; font-size:11px; }
+  .gm-textarea { min-height:46px; }
+  .gm-grid2 { display:grid; grid-template-columns:1fr 1fr; gap:12px; margin-bottom:12px; }
+  .gm-grid3 { display:grid; grid-template-columns:1fr 1fr 1fr; gap:10px; margin-bottom:12px; }
+  .gm-grid4 { display:grid; grid-template-columns:1fr 1fr 1fr 1fr; gap:8px; margin-bottom:12px; }
+  .gm-slider-row { display:flex; align-items:center; gap:10px; padding:6px 8px; background:#000; border:1px solid #2A2A2A; border-radius:4px; margin-bottom:8px; font-size:11px; }
+  .gm-slider-row .name { color:#6B6B6B; font-size:10px; min-width:130px; }
+  .gm-slider { flex:1; height:3px; background:#2A2A2A; border-radius:2px; position:relative; }
+  .gm-slider::after { content:""; position:absolute; top:-4px; width:10px; height:10px; background:#FFF; border-radius:50%; }
+  .gm-slider.p5::after { left:5%; }
+  .gm-slider.p10::after { left:10%; }
+  .gm-slider.p15::after { left:15%; }
+  .gm-slider.p20::after { left:20%; }
+  .gm-slider.p25::after { left:25%; }
+  .gm-slider.p33::after { left:33%; }
+  .gm-slider.p40::after { left:40%; }
+  .gm-slider.p50::after { left:50%; }
+  .gm-slider.p60::after { left:60%; }
+  .gm-slider.p65::after { left:65%; }
+  .gm-slider.p70::after { left:70%; }
+  .gm-slider.p85::after { left:85%; }
+  .gm-slider.p90::after { left:90%; }
+  .gm-slider.p95::after { left:95%; }
+  .gm-slider-row .val { color:#FFF; font-family:monospace; font-size:11px; min-width:42px; text-align:right; }
+  .gm-toggle { display:flex; align-items:center; gap:8px; padding:6px 10px; background:#000; border:1px solid #2A2A2A; border-radius:4px; margin-bottom:8px; font-size:11px; cursor:pointer; }
+  .gm-toggle .box { width:14px; height:14px; border:1px solid #2A2A2A; border-radius:3px; display:inline-flex; align-items:center; justify-content:center; font-size:9px; }
+  .gm-toggle.on { color:#FFF; border-color:#FFF; }
+  .gm-toggle.on .box { background:#FFF; color:#0A0A0A; border-color:#FFF; }
+  .gm-pills { display:flex; gap:0; background:#000; border:1px solid #2A2A2A; border-radius:4px; padding:2px; margin-bottom:12px; }
+  .gm-pill { flex:1; text-align:center; padding:6px 10px; font-size:11px; color:#6B6B6B; border-radius:3px; cursor:pointer; }
+  .gm-pill.on { background:#FFF; color:#0A0A0A; }
+  .gm-select { background:#000; border:1px solid #2A2A2A; padding:8px 10px; border-radius:4px; color:#E5E5E5; font-size:11px; display:flex; justify-content:space-between; align-items:center; margin-bottom:8px; }
+  .gm-select .arrow { color:#6B6B6B; }
+  .gm-section { border:1px solid #2A2A2A; border-radius:4px; padding:14px; margin-top:14px; background:#0F0F0F; }
+  .gm-section-h { display:flex; justify-content:space-between; align-items:center; margin-bottom:12px; font-size:11px; font-weight:600; }
+  .gm-section-h .arrow { color:#FFF; }
+  .gm-section-h .meta { color:#6B6B6B; font-weight:400; font-size:10px; }
+  .gm-chip { display:inline-block; padding:5px 10px; border-radius:14px; font-size:10px; margin-right:5px; margin-bottom:5px; background:#000; border:1px solid #2A2A2A; color:#6B6B6B; cursor:pointer; }
+  .gm-chip.on { border-color:#FFF; color:#FFF; }
+  .gm-chip.upload { border-style:dashed; color:#FFF; }
+  .gm-lora-row { display:flex; align-items:center; gap:10px; padding:8px 10px; background:#000; border:1px solid #2A2A2A; border-radius:4px; margin-bottom:6px; font-size:11px; }
+  .gm-lora-name { flex:1; }
+  .gm-lora-name small { color:#6B6B6B; font-weight:400; margin-left:4px; }
+  .gm-x { color:#6B6B6B; cursor:pointer; padding:0 4px; }
+  .gm-btn { background:#FFF; color:#0A0A0A; padding:12px 18px; border-radius:4px; font-weight:600; display:block; font-size:13px; text-align:center; cursor:pointer; margin-top:16px; }
+  .gm-waveform { height:60px; background:#000; border:1px solid #2A2A2A; border-radius:4px; display:flex; align-items:center; justify-content:center; gap:2px; padding:8px; margin-bottom:10px; }
+  .gm-bar { width:2px; background:#E5E5E5; }
+  .gm-player-controls { display:flex; align-items:center; gap:10px; color:#6B6B6B; font-size:10px; margin-bottom:14px; }
+  .gm-play { width:28px; height:28px; border-radius:50%; background:#FFF; color:#0A0A0A; display:flex; align-items:center; justify-content:center; font-size:11px; }
+  .gm-meta-block { background:#000; border:1px solid #2A2A2A; border-radius:4px; padding:8px 10px; font-size:9px; color:#6B6B6B; font-family:monospace; line-height:1.6; max-height:160px; overflow:hidden; margin-top:8px; }
+  .gm-actions { display:flex; flex-wrap:wrap; gap:6px; margin-bottom:10px; }
+  .gm-secondary { border:1px solid #2A2A2A; color:#E5E5E5; padding:6px 12px; border-radius:4px; font-size:10px; cursor:pointer; }
+  .gm-stems { display:grid; grid-template-columns:1fr 1fr; gap:6px; margin-bottom:10px; }
+  .gm-stem { background:#000; border:1px solid #2A2A2A; border-radius:4px; padding:6px 10px; font-size:10px; display:flex; justify-content:space-between; align-items:center; }
+  .gm-stem .dl { color:#FFF; cursor:pointer; }
+</style>
+<h3 style="margin-top:14px">🎵 Generate — fully expanded · psytrance preset stacked with custom LoRA</h3>
+<div class="gm">
+  <div class="gm-header">
+    <div>
+      <div class="gm-brand">ACE Music Studio<span style="color:#FFF">.</span></div>
+      <div class="gm-cta" style="margin-top:2px">Built with <span style="color:#FFF">♥</span>. <strong>Drop a like</strong> · Follow <strong>@techfreakworm</strong> for what's next.</div>
+    </div>
+    <div class="gm-status">ready · MPS · M5 Max</div>
+  </div>
+  <div class="gm-row">
+    <div class="gm-sidebar">
+      <div class="gm-side active"><span class="em">🎵</span>Generate</div>
+      <div class="gm-side"><span class="em">🎤</span>Cover</div>
+      <div class="gm-side"><span class="em">⏩</span>Extend</div>
+      <div class="gm-side"><span class="em">✏️</span>Edit</div>
+      <div class="gm-side"><span class="em">✍️</span>Lyrics</div>
+      <div style="border-top:1px solid #1F1F1F; margin:14px 0 10px; padding-top:10px; font-size:9px; color:#6B6B6B; text-transform:uppercase; letter-spacing:0.1em;">History · session</div>
+      <div style="font-size:10px; color:#6B6B6B; padding:4px 8px;">▶ psytrance · just now</div>
+      <div style="font-size:10px; color:#6B6B6B; padding:4px 8px;">▶ ambient_v4 · 2m</div>
+      <div style="font-size:10px; color:#6B6B6B; padding:4px 8px;">▶ chinese_rap · 7m</div>
+      <div style="font-size:10px; color:#6B6B6B; padding:4px 8px;">▶ lofi_vocal · 14m</div>
+    </div>
+    <div class="gm-main">
+      <div class="gm-form">
+        <div class="gm-label">1 · Style prompt <span class="hint">describe the song · genre, instruments, mood</span></div>
+        <div class="gm-input">psytrance, rolling triplet bassline, acid squelch, metallic leads, atmospheric pads, high quality</div>
+        <div class="gm-label">2 · Lyrics <span class="hint">use [verse] [chorus] [bridge] tags · ↗ open Lyrics tab to draft with Qwen 2.5</span></div>
+        <div class="gm-input gm-textarea" style="min-height:64px">[intro - atmospheric pads &amp; ambient synth]<br><br>[verse 1] six in the morning, the sun's still pretending<br>kick drum carries what the night was sending<br>shoes off, eyes closed, the city's still bending<br><br>[chorus] we let go, we let go, we let go</div>
+        <div class="gm-grid2">
+          <div>
+            <div class="gm-label">Duration <span class="hint">5 – 240 s</span></div>
+            <div class="gm-slider-row"><span class="name">seconds</span><span class="gm-slider p15"></span><span class="val">30</span></div>
+          </div>
+          <div>
+            <div class="gm-label">Vocal mode</div>
+            <div class="gm-pills">
+              <div class="gm-pill on">With vocals</div>
+              <div class="gm-pill">Instrumental</div>
+            </div>
+          </div>
+        </div>
+        <!-- LoRA section, expanded -->
+        <div class="gm-section">
+          <div class="gm-section-h">
+            <span>LoRA stack <span class="meta">· 2 active · order matters</span></span>
+            <span class="arrow">▾</span>
+          </div>
+          <div class="gm-label">Bundled presets <span class="hint">click to toggle</span></div>
+          <div style="margin-bottom:12px;">
+            <span class="gm-chip">RapMachine</span>
+            <span class="gm-chip">Chinese Rap</span>
+            <span class="gm-chip on">Lyric2Vocal</span>
+            <span class="gm-chip">Text2Samples</span>
+          </div>
+          <div class="gm-label">Active stack <span class="hint">↑↓ to reorder · × to remove</span></div>
+          <div class="gm-lora-row">
+            <span class="gm-lora-name">Lyric2Vocal <small>· preset · 28 MB</small></span>
+            <span class="gm-slider p65" style="width:100px"></span>
+            <span class="val" style="color:#FFF; font-family:monospace; font-size:11px;">0.65</span>
+            <span class="gm-x">×</span>
+          </div>
+          <div class="gm-lora-row">
+            <span class="gm-lora-name">psytrance_v2 <small>· custom · 47 MB · rank 64 · sha 0c94…</small></span>
+            <span class="gm-slider p95" style="width:100px"></span>
+            <span class="val" style="color:#FFF; font-family:monospace; font-size:11px;">0.95</span>
+            <span class="gm-x">×</span>
+          </div>
+          <div style="margin-top:10px;">
+            <span class="gm-chip upload">↑ drop .safetensors here or click</span>
+          </div>
+        </div>
+        <!-- Advanced section, expanded -->
+        <div class="gm-section">
+          <div class="gm-section-h">
+            <span>Advanced <span class="meta">· generation parameters</span></span>
+            <span class="arrow">▾</span>
+          </div>
+          <div class="gm-grid3">
+            <div><div class="gm-label">BPM</div><div class="gm-input" style="margin-bottom:0">135</div></div>
+            <div><div class="gm-label">Key / scale</div><div class="gm-input" style="margin-bottom:0">auto</div></div>
+            <div><div class="gm-label">Time signature</div><div class="gm-input" style="margin-bottom:0">4 / 4</div></div>
+          </div>
+          <div class="gm-grid2">
+            <div><div class="gm-label">Sampler</div><div class="gm-select">heun <span class="arrow">▾</span></div></div>
+            <div><div class="gm-label">Vocal language</div><div class="gm-select">auto <span class="arrow">▾</span></div></div>
+          </div>
+          <div class="gm-slider-row"><span class="name">Inference steps</span><span class="gm-slider p25"></span><span class="val">50</span></div>
+          <div class="gm-slider-row"><span class="name">CFG scale</span><span class="gm-slider p40"></span><span class="val">5.0</span></div>
+          <div class="gm-slider-row"><span class="name">Shift</span><span class="gm-slider p33"></span><span class="val">3</span></div>
+          <div class="gm-slider-row"><span class="name">CFG interval start</span><span class="gm-slider p5"></span><span class="val">0.0</span></div>
+          <div class="gm-slider-row"><span class="name">CFG interval end</span><span class="gm-slider p95"></span><span class="val">1.0</span></div>
+          <div class="gm-label" style="margin-top:8px">Negative prompt <span class="hint">things to avoid</span></div>
+          <div class="gm-input gm-textarea" style="font-size:10px">bitcrushed, aliasing, quantizing noise, digital clipping, glitchy, mp3 artifacts, jazz, funk, pop, acoustic, lo-fi, orchestral, dubstep, vocal hooks, electric guitar, slow tempo, jazz chords, blues scale</div>
+          <div class="gm-grid2">
+            <div><div class="gm-label">Audio format</div><div class="gm-pills"><div class="gm-pill on">mp3 320</div><div class="gm-pill">wav 44.1</div></div></div>
+            <div><div class="gm-label">Loudness</div><div class="gm-toggle on"><span class="box">✓</span> -14 LUFS</div></div>
+          </div>
+          <div class="gm-grid2">
+            <div><div class="gm-label">Fade in</div><div class="gm-slider-row"><span class="name">seconds</span><span class="gm-slider p5"></span><span class="val">0.0</span></div></div>
+            <div><div class="gm-label">Fade out</div><div class="gm-slider-row"><span class="name">seconds</span><span class="gm-slider p5"></span><span class="val">0.0</span></div></div>
+          </div>
+          <div class="gm-grid2">
+            <div><div class="gm-label">Latent shift</div><div class="gm-input" style="margin-bottom:0">0</div></div>
+            <div><div class="gm-label">Latent rescale</div><div class="gm-input" style="margin-bottom:0">1</div></div>
+          </div>
+          <div class="gm-grid2">
+            <div><div class="gm-label">Seed</div><div class="gm-input" style="margin-bottom:0; font-family:monospace">1297183202</div></div>
+            <div><div class="gm-label">&nbsp;</div><div class="gm-toggle"><span class="box"></span> Lock seed</div></div>
+          </div>
+        </div>
+        <!-- LM planner section, expanded -->
+        <div class="gm-section">
+          <div class="gm-section-h">
+            <span>LM planner · Qwen3 thinking <span class="meta">· chain-of-thought structure</span></span>
+            <span class="arrow">▾</span>
+          </div>
+          <div class="gm-toggle on"><span class="box">✓</span> Thinking enabled <span style="color:#6B6B6B; font-size:9px; margin-left:auto">+ slower but better structure</span></div>
+          <div class="gm-toggle on"><span class="box">✓</span> Constrained decoding</div>
+          <div class="gm-grid4" style="margin-top:8px">
+            <div><div class="gm-label">Temperature</div><div class="gm-input" style="margin-bottom:0">0.85</div></div>
+            <div><div class="gm-label">Top-k</div><div class="gm-input" style="margin-bottom:0">0</div></div>
+            <div><div class="gm-label">Top-p</div><div class="gm-input" style="margin-bottom:0">0.90</div></div>
+            <div><div class="gm-label">LM CFG</div><div class="gm-input" style="margin-bottom:0">2</div></div>
+          </div>
+          <div class="gm-label">CoT pipeline toggles <span class="hint">which fields the LM rewrites pre-generation</span></div>
+          <div class="gm-grid4">
+            <div class="gm-toggle"><span class="box"></span> metas</div>
+            <div class="gm-toggle"><span class="box"></span> caption</div>
+            <div class="gm-toggle"><span class="box"></span> lyrics</div>
+            <div class="gm-toggle"><span class="box"></span> language</div>
+          </div>
+          <div class="gm-label">LM negative prompt</div>
+          <div class="gm-input" style="font-size:10px">happy chords, major scale, uplifting melody</div>
+          <div class="gm-label">CoT override fields <span class="hint">if a CoT toggle is on, the LM rewrites these</span></div>
+          <div class="gm-grid2">
+            <div><div class="gm-label">cot_bpm</div><div class="gm-input" style="margin-bottom:0; opacity:0.5">(blank → use main BPM)</div></div>
+            <div><div class="gm-label">cot_keyscale</div><div class="gm-input" style="margin-bottom:0; opacity:0.5">(blank → use main key)</div></div>
+          </div>
+        </div>
+        <!-- DCW section, expanded -->
+        <div class="gm-section">
+          <div class="gm-section-h">
+            <span>DCW · dynamic CFG warping <span class="meta">· wavelet-based</span></span>
+            <span class="arrow">▾</span>
+          </div>
+          <div class="gm-toggle on"><span class="box">✓</span> DCW enabled</div>
+          <div class="gm-grid3">
+            <div><div class="gm-label">Mode</div><div class="gm-select">double <span class="arrow">▾</span></div></div>
+            <div><div class="gm-label">Wavelet</div><div class="gm-select">haar <span class="arrow">▾</span></div></div>
+            <div><div class="gm-label">&nbsp;</div><div style="font-size:9px; color:#6B6B6B; padding-top:8px;">leave defaults if unsure</div></div>
+          </div>
+          <div class="gm-slider-row"><span class="name">DCW scaler</span><span class="gm-slider p5"></span><span class="val">0.02</span></div>
+          <div class="gm-slider-row"><span class="name">High scaler</span><span class="gm-slider p10"></span><span class="val">0.06</span></div>
+        </div>
+        <div class="gm-btn">▶ Generate · est. ~30 s on M5 Max</div>
+      </div>
+      <!-- Output panel -->
+      <div class="gm-output">
+        <div class="gm-label" style="margin-bottom:10px">Output · psytrance · 30 s · seed 1297183202</div>
+        <div class="gm-waveform">
+          <div class="gm-bar" style="height:18%"></div><div class="gm-bar" style="height:32%"></div><div class="gm-bar" style="height:54%"></div><div class="gm-bar" style="height:72%"></div><div class="gm-bar" style="height:88%"></div><div class="gm-bar" style="height:62%"></div><div class="gm-bar" style="height:42%"></div><div class="gm-bar" style="height:78%"></div><div class="gm-bar" style="height:92%"></div><div class="gm-bar" style="height:66%"></div><div class="gm-bar" style="height:48%"></div><div class="gm-bar" style="height:30%"></div><div class="gm-bar" style="height:58%"></div><div class="gm-bar" style="height:80%"></div><div class="gm-bar" style="height:70%"></div><div class="gm-bar" style="height:44%"></div><div class="gm-bar" style="height:24%"></div><div class="gm-bar" style="height:50%"></div>
+        </div>
+        <div class="gm-player-controls">
+          <span class="gm-play">▶</span>
+          <span>0:00 / 0:30</span>
+          <span style="margin-left:auto; cursor:pointer; color:#FFF">↻ retake · new seed</span>
+        </div>
+        <div class="gm-label">Stems · Demucs htdemucs_ft</div>
+        <div class="gm-stems">
+          <div class="gm-stem"><span>vocals · 1.8 MB</span><span class="dl">↓</span></div>
+          <div class="gm-stem"><span>drums · 1.6 MB</span><span class="dl">↓</span></div>
+          <div class="gm-stem"><span>bass · 1.4 MB</span><span class="dl">↓</span></div>
+          <div class="gm-stem"><span>other · 1.7 MB</span><span class="dl">↓</span></div>
+        </div>
+        <div class="gm-label">Export</div>
+        <div class="gm-actions">
+          <span class="gm-secondary">↓ mp3 · 1.2 MB</span>
+          <span class="gm-secondary">↓ wav · 5.3 MB</span>
+          <span class="gm-secondary">↓ stems zip</span>
+          <span class="gm-secondary">{ } meta</span>
+          <span class="gm-secondary">↗ share</span>
+        </div>
+        <div class="gm-label" style="margin-top:14px">Metadata</div>
+        <div class="gm-meta-block">
+{<br>
+&nbsp;&nbsp;"mode": "generate",<br>
+&nbsp;&nbsp;"prompt": "psytrance, rolling triplet bassline...",<br>
+&nbsp;&nbsp;"lyrics_first_line": "[intro - atmospheric pads...",<br>
+&nbsp;&nbsp;"duration_s": 30, "instrumental": false,<br>
+&nbsp;&nbsp;"bpm": 135, "key": "auto", "time_sig": "4/4",<br>
+&nbsp;&nbsp;"sampler": "heun", "steps": 50, "cfg": 5.0, "shift": 3,<br>
+&nbsp;&nbsp;"cfg_interval": [0.0, 1.0],<br>
+&nbsp;&nbsp;"lm": {"thinking": true, "temp": 0.85, "top_p": 0.9, "cfg": 2,<br>
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"cot": {"metas":false,"caption":false,"lyrics":false,"language":false}},<br>
+&nbsp;&nbsp;"dcw": {"enabled":true,"mode":"double","scaler":0.02,"high_scaler":0.06,"wavelet":"haar"},<br>
+&nbsp;&nbsp;"loras": [<br>
+&nbsp;&nbsp;&nbsp;&nbsp;{"name":"Lyric2Vocal","scale":0.65,"sha256":"7e1f..."},<br>
+&nbsp;&nbsp;&nbsp;&nbsp;{"name":"psytrance_v2","scale":0.95,"sha256":"0c94..."}<br>
+&nbsp;&nbsp;],<br>
+&nbsp;&nbsp;"seed": 1297183202,<br>
+&nbsp;&nbsp;"output_sha256": "f33a..."<br>
+}
+        </div>
+      </div>
+    </div>
+  </div>
+</div>
+<h3 style="margin-top:30px">📱 Mobile — phone screens</h3>
+<p class="subtitle">A horizontally scrolling tab strip at the top replaces the sidebar. Output stacks below the form. Same Brutalist Mono theme.</p>
+<style>
+  .mob-frame { display:flex; gap:24px; flex-wrap:wrap; justify-content:center; align-items:flex-start; }
+  .mob-phone { background:#222; border-radius:18px; padding:8px; }
+  .mob-screen { width:200px; background:#0A0A0A; color:#E5E5E5; border-radius:12px; padding:10px; }
+  .mob-header { display:flex; justify-content:space-between; align-items:center; padding-bottom:6px; border-bottom:1px solid #1F1F1F; margin-bottom:8px; }
+  .mob-brand { font-size:11px; font-weight:600; }
+  .mob-cta { font-size:8px; color:#6B6B6B; }
+  .mob-tabs { display:flex; gap:6px; overflow-x:auto; padding:4px 0; margin-bottom:8px; border-bottom:1px solid #1F1F1F; }
+  .mob-tab { font-size:9px; color:#6B6B6B; white-space:nowrap; padding:4px 6px; }
+  .mob-tab.active { color:#FFF; border-bottom:1px solid #FFF; }
+  .mob-form { background:#141414; padding:10px; border-radius:5px; }
+  .mob-label { font-size:8px; text-transform:uppercase; letter-spacing:0.06em; color:#6B6B6B; margin-bottom:4px; }
+  .mob-input { background:#000; border:1px solid #2A2A2A; padding:5px 8px; border-radius:3px; font-size:9px; margin-bottom:8px; }
+  .mob-textarea { min-height:30px; }
+  .mob-chips { margin-bottom:8px; }
+  .mob-chip { display:inline-block; padding:2px 7px; border-radius:9px; font-size:8px; margin-right:3px; margin-bottom:3px; background:#000; border:1px solid #2A2A2A; color:#6B6B6B; }
+  .mob-chip.on { border-color:#FFF; color:#FFF; }
+  .mob-accordion { background:#000; border:1px solid #2A2A2A; border-radius:3px; padding:5px 8px; margin-bottom:6px; font-size:9px; color:#6B6B6B; display:flex; justify-content:space-between; }
+  .mob-btn { background:#FFF; color:#0A0A0A; padding:6px 10px; border-radius:3px; font-weight:600; font-size:9px; text-align:center; }
+  .mob-output { background:#141414; padding:10px; border-radius:5px; margin-top:8px; }
+  .mob-wave { height:30px; background:#000; border:1px solid #2A2A2A; border-radius:3px; display:flex; align-items:center; gap:1px; padding:4px; margin-bottom:6px; }
+  .mob-wave-bar { width:1px; background:#FFF; }
+  .mob-controls { display:flex; align-items:center; gap:6px; font-size:8px; color:#6B6B6B; margin-bottom:8px; }
+  .mob-play { width:20px; height:20px; border-radius:50%; background:#FFF; color:#0A0A0A; display:flex; align-items:center; justify-content:center; font-size:9px; }
+  .mob-export { display:flex; flex-wrap:wrap; gap:3px; }
+  .mob-secondary { border:1px solid #2A2A2A; padding:3px 8px; border-radius:3px; font-size:8px; color:#E5E5E5; }
+  .mob-dropzone { background:#000; border:2px solid #FFF; border-radius:4px; padding:6px 8px; margin-bottom:8px; font-size:8px; }
+  .mob-caption { text-align:center; color:#6B6B6B; font-size:10px; margin-top:8px; }
+  /* Mobile slider — bounded inside its container, no overflow */
+  .mob-slider { height:3px; background:#2A2A2A; border-radius:2px; position:relative; margin:6px 0 10px; box-sizing:border-box; }
+  .mob-slider::after { content:""; position:absolute; top:-3px; width:9px; height:9px; background:#FFF; border-radius:50%; transform:translateX(-50%); }
+  .mob-slider.p15::after { left:15%; }
+  .mob-slider.p93::after { left:93%; }
+</style>
+<div class="mob-frame">
+  <!-- Phone 1: Generate -->
+  <div>
+    <div class="mob-phone">
+      <div class="mob-screen">
+        <div class="mob-header">
+          <div class="mob-brand">ACE Music.</div>
+          <div class="mob-cta">♥ @tfw</div>
+        </div>
+        <div class="mob-tabs">
+          <div class="mob-tab active">🎵 Generate</div>
+          <div class="mob-tab">🎤 Cover</div>
+          <div class="mob-tab">⏩</div>
+          <div class="mob-tab">✏️</div>
+          <div class="mob-tab">✍️</div>
+        </div>
+        <div class="mob-form">
+          <div class="mob-label">Style</div>
+          <div class="mob-input">psytrance, acid leads</div>
+          <div class="mob-label">Lyrics</div>
+          <div class="mob-input mob-textarea">[verse] six in the morning...</div>
+          <div class="mob-label">Duration · 30 s</div>
+          <div class="mob-slider p15"></div>
+          <div class="mob-chips">
+            <span class="mob-chip on">psytrance_v2</span>
+            <span class="mob-chip">+ upload</span>
+          </div>
+          <div class="mob-accordion">▸ Advanced · BPM 135, sampler heun</div>
+          <div class="mob-accordion">▸ LM planner</div>
+          <div class="mob-accordion">▸ DCW</div>
+          <div class="mob-btn" style="margin-top:6px">▶ Generate</div>
+        </div>
+        <div class="mob-output">
+          <div class="mob-wave">
+            <div class="mob-wave-bar" style="height:30%"></div><div class="mob-wave-bar" style="height:60%"></div><div class="mob-wave-bar" style="height:80%"></div><div class="mob-wave-bar" style="height:50%"></div><div class="mob-wave-bar" style="height:70%"></div><div class="mob-wave-bar" style="height:90%"></div><div class="mob-wave-bar" style="height:40%"></div><div class="mob-wave-bar" style="height:65%"></div><div class="mob-wave-bar" style="height:80%"></div><div class="mob-wave-bar" style="height:55%"></div><div class="mob-wave-bar" style="height:75%"></div><div class="mob-wave-bar" style="height:45%"></div><div class="mob-wave-bar" style="height:35%"></div><div class="mob-wave-bar" style="height:60%"></div><div class="mob-wave-bar" style="height:25%"></div>
+          </div>
+          <div class="mob-controls">
+            <span class="mob-play">▶</span>
+            <span>0:00 / 0:30</span>
+            <span style="margin-left:auto; color:#FFF">↻</span>
+          </div>
+          <div class="mob-export">
+            <span class="mob-secondary">↓ mp3</span>
+            <span class="mob-secondary">↓ wav</span>
+            <span class="mob-secondary">stems</span>
+          </div>
+        </div>
+      </div>
+    </div>
+    <div class="mob-caption">Generate · 360 × 720 mobile</div>
+  </div>
+  <!-- Phone 2: Cover with file picked -->
+  <div>
+    <div class="mob-phone">
+      <div class="mob-screen">
+        <div class="mob-header">
+          <div class="mob-brand">ACE Music.</div>
+          <div class="mob-cta">♥ @tfw</div>
+        </div>
+        <div class="mob-tabs">
+          <div class="mob-tab">🎵</div>
+          <div class="mob-tab active">🎤 Cover</div>
+          <div class="mob-tab">⏩</div>
+          <div class="mob-tab">✏️</div>
+          <div class="mob-tab">✍️</div>
+        </div>
+        <div class="mob-form">
+          <div class="mob-label">1 · Reference</div>
+          <div class="mob-dropzone">
+            <strong style="color:#FFF">↑ ref_psy.wav</strong><br>
+            <span style="color:#6B6B6B">44.1k · 28 s · 2.1 MB</span>
+          </div>
+          <div class="mob-label">2 · New prompt</div>
+          <div class="mob-input">faster, more aggressive</div>
+          <div class="mob-label">3 · New lyrics</div>
+          <div class="mob-input mob-textarea">[verse] new lyrics over ref...</div>
+          <div class="mob-label">Cover strength · 0.93</div>
+          <div class="mob-slider p93"></div>
+          <div class="mob-chips">
+            <span class="mob-chip on">RapMachine</span>
+          </div>
+          <div class="mob-accordion">▸ Advanced</div>
+          <div class="mob-accordion">▸ LM planner</div>
+          <div class="mob-btn" style="margin-top:6px">▶ Cover</div>
+        </div>
+      </div>
+    </div>
+    <div class="mob-caption">Cover · with ref audio loaded</div>
+  </div>
+  <!-- Phone 3: Lyrics output -->
+  <div>
+    <div class="mob-phone">
+      <div class="mob-screen">
+        <div class="mob-header">
+          <div class="mob-brand">ACE Music.</div>
+          <div class="mob-cta">♥ @tfw</div>
+        </div>
+        <div class="mob-tabs">
+          <div class="mob-tab">🎵</div>
+          <div class="mob-tab">🎤</div>
+          <div class="mob-tab">⏩</div>
+          <div class="mob-tab">✏️</div>
+          <div class="mob-tab active">✍️ Lyrics</div>
+        </div>
+        <div class="mob-form">
+          <div class="mob-label">Brief</div>
+          <div class="mob-input mob-textarea">psytrance anthem about sunrise...</div>
+          <div class="mob-label">Structure</div>
+          <div class="mob-input">intro, verse, chorus...</div>
+          <div class="mob-label">Language · en · 0.85 temp</div>
+          <div class="mob-accordion">▸ LM parameters</div>
+          <div class="mob-btn" style="margin-top:6px">▶ Draft</div>
+        </div>
+        <div class="mob-output">
+          <div style="font-size:9px; line-height:1.5;">
+            <strong style="color:#FFF">[intro]</strong><br>
+            <span style="color:#B8B0A4">the lights start low...</span><br>
+            <strong style="color:#FFF">[verse 1]</strong><br>
+            <span style="color:#B8B0A4">six in the morning,<br>the sun's still pretending...</span>
+          </div>
+          <div class="mob-export" style="margin-top:8px">
+            <span class="mob-secondary" style="border-color:#FFF; color:#FFF">↑ Use in Generate</span>
+            <span class="mob-secondary">↻</span>
+          </div>
+        </div>
+      </div>
+    </div>
+    <div class="mob-caption">Lyrics · draft visible</div>
+  </div>
+</div>
+<h3 style="margin-top:30px">⚠️ Error and edge-case states</h3>
+<style>
+  .err { background:#0A0A0A; border:1px solid #1F1F1F; border-radius:8px; padding:14px; margin-bottom:10px; }
+  .err-row { display:flex; align-items:flex-start; gap:14px; }
+  .err-icon { width:28px; height:28px; flex-shrink:0; border-radius:50%; background:#FFF; color:#0A0A0A; display:flex; align-items:center; justify-content:center; font-size:14px; font-weight:600; }
+  .err-icon.warn { background:#0A0A0A; color:#FFF; border:1px solid #FFF; }
+  .err-icon.info { background:transparent; color:#6B6B6B; border:1px solid #6B6B6B; }
+  .err-body { flex:1; }
+  .err-title { font-size:12px; font-weight:600; color:#FFF; margin-bottom:4px; }
+  .err-msg { font-size:11px; color:#B8B0A4; line-height:1.5; margin-bottom:6px; }
+  .err-action { display:inline-block; border:1px solid #FFF; color:#FFF; padding:4px 10px; border-radius:3px; font-size:10px; cursor:pointer; margin-right:4px; }
+  .err-action.secondary { border-color:#2A2A2A; color:#B8B0A4; }
+  .err-tag { display:inline-block; background:#1A1A1A; color:#6B6B6B; padding:2px 6px; border-radius:3px; font-size:9px; font-family:monospace; margin-left:6px; }
+  .progress { background:#0A0A0A; border:1px solid #1F1F1F; border-radius:8px; padding:18px; margin-bottom:10px; }
+  .progress-bar { height:4px; background:#1A1A1A; border-radius:2px; overflow:hidden; margin:10px 0 6px; }
+  .progress-bar .fill { height:100%; background:#FFF; width:42%; }
+  .progress-meta { display:flex; justify-content:space-between; font-size:10px; color:#6B6B6B; font-family:monospace; }
+  .progress-title { font-size:12px; font-weight:600; color:#FFF; margin-bottom:4px; }
+  .progress-sub { font-size:10px; color:#6B6B6B; }
+</style>
+<div class="err">
+  <div class="err-row">
+    <div class="err-icon">!</div>
+    <div class="err-body">
+      <div class="err-title">LoRA not compatible <span class="err-tag">LoRAValidationError</span></div>
+      <div class="err-msg">This LoRA was trained against <code>SDXL</code>, not ACE-Step v1.5 XL SFT. Expected DiT modules: <code>to_q, to_k, to_v, to_out.0, ff.net.0.proj, ff.net.2</code>. Got: <code>unet.down_blocks…</code>.</div>
+      <div class="err-action">Remove from stack</div>
+      <span class="err-action secondary">View header diagnostics</span>
+    </div>
+  </div>
+</div>
+<div class="err">
+  <div class="err-row">
+    <div class="err-icon warn">⚠</div>
+    <div class="err-body">
+      <div class="err-title">ZeroGPU timed out · auto-retried at 2× duration</div>
+      <div class="err-msg">The first attempt aborted at the default 60 s ZeroGPU duration cap. The second attempt, at 120 s, also aborted. Try a shorter duration, fewer steps, or fewer active LoRAs. <span style="color:#6B6B6B">last seen: 70 s wall, step 41/50</span></div>
+      <div class="err-action">Lower steps to 30</div>
+      <span class="err-action secondary">Reduce duration to 20 s</span>
+    </div>
+  </div>
+</div>
+<div class="err">
+  <div class="err-row">
+    <div class="err-icon warn">⚠</div>
+    <div class="err-body">
+      <div class="err-title">MPS op not implemented · falling back to CPU <span class="err-tag">aten::_fft_r2c</span></div>
+      <div class="err-msg">An ACE-Step kernel hit a PyTorch MPS gap. CPU fallback engaged via <code>PYTORCH_ENABLE_MPS_FALLBACK=1</code>. Generation will continue but be ~2–3× slower for the affected segments.</div>
+      <div class="err-action secondary">Continue anyway</div>
+      <span class="err-action secondary">Open issue on GitHub</span>
+    </div>
+  </div>
+</div>
+<div class="err">
+  <div class="err-row">
+    <div class="err-icon">!</div>
+    <div class="err-body">
+      <div class="err-title">Reference audio rejected <span class="err-tag">unsupported format</span></div>
+      <div class="err-msg">Cover mode needs <code>wav</code>, <code>mp3</code>, or <code>flac</code>, ≤ 60 s, ≤ 50 MB. Got <code>m4a</code>, 4:12 long, 87 MB.</div>
+      <div class="err-action">Pick a different file</div>
+      <span class="err-action secondary">Auto-convert + trim to first 60 s</span>
+    </div>
+  </div>
+</div>
+<div class="err">
+  <div class="err-row">
+    <div class="err-icon info">i</div>
+    <div class="err-body">
+      <div class="err-title">First request — warming up the pipeline (~45 s)</div>
+      <div class="err-msg">Loading <code>ACE-Step v1.5 XL SFT</code> weights into MPS memory. Subsequent generations in this session start instantly.</div>
+    </div>
+  </div>
+</div>
+<div class="progress">
+  <div class="progress-title">Generating… <span style="color:#6B6B6B; font-weight:400; font-size:10px;">step 21 / 50 · ETA 14 s</span></div>
+  <div class="progress-sub">heun sampler · CFG 5.0 · 2 LoRAs active · seed 1297183202</div>
+  <div class="progress-bar"><div class="fill"></div></div>
+  <div class="progress-meta">
+    <span>0:08 elapsed</span>
+    <span>↻ cancel</span>
+  </div>
+</div>
+<div class="options" style="margin-top:24px">
+  <div class="option" data-choice="approve" onclick="toggleSelect(this)">
+    <div class="letter">✓</div>
+    <div class="content">
+      <h3>All mockups approved — bake them into the spec</h3>
+      <p>Move every approved mockup into <code>docs/superpowers/specs/mockups/</code> and reference them from §8 of the spec. Then hand off to writing-plans.</p>
+    </div>
+  </div>
+  <div class="option" data-choice="revise" onclick="toggleSelect(this)">
+    <div class="letter">✎</div>
+    <div class="content">
+      <h3>Revise something specific</h3>
+      <p>Tell me which mockup / control / error needs work.</p>
+    </div>
+  </div>
+</div>

docs/superpowers/specs/mockups/02_cover_extend.html ADDED

	@@ -0,0 +1,572 @@

+<h2>Cover and Extend · everything expanded</h2>
+<p class="subtitle">Every accordion is open and every control is visible, to show the full depth of options. In production, "Advanced", "LM planner", and "DCW" stay collapsed by default; this is the full surface, so you can verify nothing is missing.</p>
+<style>
+  /* base */
+  .gm { background:#0A0A0A; color:#E5E5E5; border:1px solid #1F1F1F; border-radius:10px; padding:18px; font-size:12px; line-height:1.5; margin-top:14px; }
+  .gm-header { display:flex; justify-content:space-between; align-items:center; padding-bottom:10px; border-bottom:1px solid #1F1F1F; margin-bottom:14px; }
+  .gm-brand { font-size:15px; font-weight:600; }
+  .gm-cta { font-size:11px; color:#6B6B6B; }
+  .gm-cta strong { color:#E5E5E5; }
+  .gm-status { font-size:10px; color:#6B6B6B; letter-spacing:0.08em; text-transform:uppercase; }
+  .gm-row { display:flex; gap:16px; align-items:flex-start; }
+  .gm-sidebar { background:#000; padding:14px 10px; border-radius:6px; min-width:170px; position:sticky; top:0; }
+  .gm-side { display:block; padding:8px 10px; border-radius:4px; margin-bottom:3px; font-size:12px; color:#6B6B6B; }
+  .gm-side.active { background:#1A1A1A; color:#FFF; border-left:2px solid #FFF; padding-left:8px; }
+  .gm-side .em { margin-right:6px; }
+  .gm-main { flex:1; display:flex; gap:14px; align-items:flex-start; }
+  .gm-form { flex:1.3; background:#141414; padding:16px; border-radius:6px; }
+  .gm-output { flex:1; background:#141414; padding:16px; border-radius:6px; min-width:260px; position:sticky; top:0; }
+  /* form controls */
+  .gm-label { font-size:10px; text-transform:uppercase; letter-spacing:0.08em; color:#6B6B6B; margin-bottom:6px; display:flex; justify-content:space-between; align-items:center; }
+  .gm-label .hint { color:#5A5048; font-size:9px; text-transform:none; letter-spacing:normal; font-weight:400; }
+  .gm-input { background:#000; border:1px solid #2A2A2A; padding:8px 10px; border-radius:4px; color:#E5E5E5; margin-bottom:12px; font-size:11px; }
+  .gm-textarea { min-height:46px; }
+  .gm-grid2 { display:grid; grid-template-columns:1fr 1fr; gap:12px; margin-bottom:12px; }
+  .gm-grid3 { display:grid; grid-template-columns:1fr 1fr 1fr; gap:10px; margin-bottom:12px; }
+  .gm-grid4 { display:grid; grid-template-columns:1fr 1fr 1fr 1fr; gap:8px; margin-bottom:12px; }
+  /* slider */
+  .gm-slider-row { display:flex; align-items:center; gap:10px; padding:6px 8px; background:#000; border:1px solid #2A2A2A; border-radius:4px; margin-bottom:8px; font-size:11px; }
+  .gm-slider-row .name { color:#6B6B6B; font-size:10px; min-width:90px; }
+  .gm-slider { flex:1; height:3px; background:#2A2A2A; border-radius:2px; position:relative; }
+  .gm-slider::after { content:""; position:absolute; top:-4px; width:10px; height:10px; background:#FFF; border-radius:50%; }
+  .gm-slider.p10::after { left:10%; }
+  .gm-slider.p20::after { left:20%; }
+  .gm-slider.p30::after { left:30%; }
+  .gm-slider.p40::after { left:40%; }
+  .gm-slider.p50::after { left:50%; }
+  .gm-slider.p60::after { left:60%; }
+  .gm-slider.p70::after { left:70%; }
+  .gm-slider.p85::after { left:85%; }
+  .gm-slider.p93::after { left:93%; }
+  .gm-slider.p95::after { left:95%; }
+  .gm-slider-row .val { color:#FFF; font-family:monospace; font-size:11px; min-width:42px; text-align:right; }
+  /* toggle */
+  .gm-toggle { display:flex; align-items:center; gap:8px; padding:6px 10px; background:#000; border:1px solid #2A2A2A; border-radius:4px; margin-bottom:8px; font-size:11px; cursor:pointer; }
+  .gm-toggle .box { width:14px; height:14px; border:1px solid #2A2A2A; border-radius:3px; display:inline-flex; align-items:center; justify-content:center; font-size:9px; }
+  .gm-toggle.on { color:#FFF; border-color:#FFF; }
+  .gm-toggle.on .box { background:#FFF; color:#0A0A0A; border-color:#FFF; }
+  /* radio pill */
+  .gm-pills { display:flex; gap:0; background:#000; border:1px solid #2A2A2A; border-radius:4px; padding:2px; margin-bottom:12px; }
+  .gm-pill { flex:1; text-align:center; padding:6px 10px; font-size:11px; color:#6B6B6B; border-radius:3px; cursor:pointer; }
+  .gm-pill.on { background:#FFF; color:#0A0A0A; }
+  /* select */
+  .gm-select { background:#000; border:1px solid #2A2A2A; padding:8px 10px; border-radius:4px; color:#E5E5E5; font-size:11px; display:flex; justify-content:space-between; align-items:center; margin-bottom:8px; }
+  .gm-select .arrow { color:#6B6B6B; }
+  /* section divider */
+  .gm-section { border:1px solid #2A2A2A; border-radius:4px; padding:14px; margin-top:14px; background:#0F0F0F; }
+  .gm-section-h { display:flex; justify-content:space-between; align-items:center; margin-bottom:12px; font-size:11px; font-weight:600; }
+  .gm-section-h .arrow { color:#FFF; }
+  .gm-section-h .meta { color:#6B6B6B; font-weight:400; font-size:10px; }
+  .gm-chip { display:inline-block; padding:5px 10px; border-radius:14px; font-size:10px; margin-right:5px; margin-bottom:5px; background:#000; border:1px solid #2A2A2A; color:#6B6B6B; cursor:pointer; }
+  .gm-chip.on { border-color:#FFF; color:#FFF; }
+  .gm-chip.upload { border-style:dashed; color:#FFF; }
+  .gm-lora-row { display:flex; align-items:center; gap:10px; padding:8px 10px; background:#000; border:1px solid #2A2A2A; border-radius:4px; margin-bottom:6px; font-size:11px; }
+  .gm-lora-name { flex:1; }
+  .gm-lora-name small { color:#6B6B6B; font-weight:400; margin-left:4px; }
+  .gm-x { color:#6B6B6B; cursor:pointer; padding:0 4px; }
+  .gm-btn { background:#FFF; color:#0A0A0A; padding:12px 18px; border-radius:4px; font-weight:600; display:block; font-size:13px; text-align:center; cursor:pointer; margin-top:16px; }
+  /* drop zone */
+  .gm-dropzone { background:#000; border:2px dashed #2A2A2A; border-radius:6px; padding:14px; margin-bottom:12px; text-align:center; font-size:11px; color:#6B6B6B; }
+  .gm-dropzone.has-file { border-style:solid; border-color:#FFF; color:#FFF; text-align:left; padding:10px 12px; }
+  .gm-dropzone .filename { font-weight:600; }
+  .gm-dropzone .meta { color:#6B6B6B; font-size:9px; margin-top:2px; font-weight:400; }
+  .gm-dropzone .miniwave { height:18px; background:repeating-linear-gradient(90deg, currentColor 0 1px, transparent 1px 3px); margin-top:6px; opacity:0.5; }
+  /* output */
+  .gm-waveform { height:60px; background:#000; border:1px solid #2A2A2A; border-radius:4px; display:flex; align-items:center; justify-content:center; gap:2px; padding:8px; margin-bottom:10px; }
+  .gm-bar { width:2px; background:#E5E5E5; }
+  .gm-player-controls { display:flex; align-items:center; gap:10px; color:#6B6B6B; font-size:10px; margin-bottom:14px; }
+  .gm-play { width:28px; height:28px; border-radius:50%; background:#FFF; color:#0A0A0A; display:flex; align-items:center; justify-content:center; font-size:11px; }
+  .gm-stems { display:grid; grid-template-columns:1fr 1fr; gap:6px; margin-bottom:10px; }
+  .gm-stem { background:#000; border:1px solid #2A2A2A; border-radius:4px; padding:6px 10px; font-size:10px; display:flex; justify-content:space-between; align-items:center; }
+  .gm-stem .dl { color:#FFF; cursor:pointer; }
+  .gm-meta-block { background:#000; border:1px solid #2A2A2A; border-radius:4px; padding:8px 10px; font-size:9px; color:#6B6B6B; font-family:monospace; line-height:1.6; max-height:140px; overflow:hidden; margin-top:8px; }
+  .gm-actions { display:flex; flex-wrap:wrap; gap:6px; margin-bottom:10px; }
+  .gm-secondary { border:1px solid #2A2A2A; color:#E5E5E5; padding:6px 12px; border-radius:4px; font-size:10px; cursor:pointer; }
+</style>
+<h3 style="margin-top:14px">🎤 Cover — fully expanded</h3>
+<div class="gm">
+  <div class="gm-header">
+    <div>
+      <div class="gm-brand">ACE Music Studio<span style="color:#FFF">.</span></div>
+      <div class="gm-cta" style="margin-top:2px">Built with <span style="color:#FFF">♥</span>. <strong>Drop a like</strong> · Follow <strong>@techfreakworm</strong> for what's next.</div>
+    </div>
+    <div class="gm-status">ready · MPS · M5 Max</div>
+  </div>
+  <div class="gm-row">
+    <div class="gm-sidebar">
+      <div class="gm-side"><span class="em">🎵</span>Generate</div>
+      <div class="gm-side active"><span class="em">🎤</span>Cover</div>
+      <div class="gm-side"><span class="em">⏩</span>Extend</div>
+      <div class="gm-side"><span class="em">✏️</span>Edit</div>
+      <div class="gm-side"><span class="em">✍️</span>Lyrics</div>
+      <div style="border-top:1px solid #1F1F1F; margin:14px 0 10px; padding-top:10px; font-size:9px; color:#6B6B6B; text-transform:uppercase; letter-spacing:0.1em;">History · session</div>
+      <div style="font-size:10px; color:#6B6B6B; padding:4px 8px;">▶ psy_cover · just now</div>
+      <div style="font-size:10px; color:#6B6B6B; padding:4px 8px;">▶ lofi_remix · 3m</div>
+      <div style="font-size:10px; color:#6B6B6B; padding:4px 8px;">▶ ambient_v2 · 12m</div>
+    </div>
+    <div class="gm-main">
+      <div class="gm-form">
+        <div class="gm-label">1 · Reference audio <span class="hint">wav / mp3 / flac · ≤ 60 s · matters most for first 12 s</span></div>
+        <div class="gm-dropzone has-file">
+          <div class="filename">↑ reference_psy_track.wav</div>
+          <div class="meta">44.1 kHz · stereo · 28.4 s · 5.0 MB</div>
+          <div class="miniwave"></div>
+        </div>
+        <div class="gm-label">2 · New style prompt <span class="hint">leave blank to fully inherit reference style</span></div>
+        <div class="gm-input">faster, more aggressive leads, club-ready</div>
+        <div class="gm-label">3 · New lyrics <span class="hint">use [verse] [chorus] [bridge] tags · open Lyrics tab to draft with AI</span></div>
+        <div class="gm-input gm-textarea">[intro] driving acid bassline<br>[verse] new lyrics over the reference style<br>[chorus] one more time, one more time<br>[outro] ...</div>
+        <div class="gm-grid2">
+          <div>
+            <div class="gm-label">Duration <span class="hint">seconds</span></div>
+            <div class="gm-slider-row"><span class="name">5 – 240 s</span><span class="gm-slider p10"></span><span class="val">30</span></div>
+          </div>
+          <div>
+            <div class="gm-label">Vocal mode</div>
+            <div class="gm-pills">
+              <div class="gm-pill on">With vocals</div>
+              <div class="gm-pill">Instrumental</div>
+            </div>
+          </div>
+        </div>
+        <div class="gm-label">Cover-specific <span class="hint">how the reference influences the output</span></div>
+        <div class="gm-slider-row"><span class="name">Cover strength</span><span class="gm-slider p93"></span><span class="val">0.93</span></div>
+        <div class="gm-slider-row"><span class="name">Cover noise</span><span class="gm-slider"></span><span class="val">0.00</span></div>
+        <!-- LoRA section, expanded -->
+        <div class="gm-section">
+          <div class="gm-section-h">
+            <span>LoRA stack <span class="meta">· 2 active</span></span>
+            <span class="arrow">▾</span>
+          </div>
+          <div class="gm-label">Bundled presets <span class="hint">click to toggle</span></div>
+          <div style="margin-bottom:12px;">
+            <span class="gm-chip on">RapMachine</span>
+            <span class="gm-chip">Chinese Rap</span>
+            <span class="gm-chip">Lyric2Vocal</span>
+            <span class="gm-chip">Text2Samples</span>
+          </div>
+          <div class="gm-label">Active stack <span class="hint">applied in order, top first</span></div>
+          <div class="gm-lora-row">
+            <span class="gm-lora-name">RapMachine <small>· preset</small></span>
+            <span class="gm-slider p85" style="width:100px"></span>
+            <span class="val" style="color:#FFF; font-family:monospace; font-size:11px;">0.85</span>
+            <span class="gm-x">×</span>
+          </div>
+          <div class="gm-lora-row">
+            <span class="gm-lora-name">psytrance_v2 <small>· custom · 47 MB · rank 64</small></span>
+            <span class="gm-slider p95" style="width:100px"></span>
+            <span class="val" style="color:#FFF; font-family:monospace; font-size:11px;">0.95</span>
+            <span class="gm-x">×</span>
+          </div>
+          <div style="margin-top:10px;">
+            <span class="gm-chip upload">↑ drop .safetensors here or click</span>
+          </div>
+        </div>
+        <!-- Advanced section, expanded -->
+        <div class="gm-section">
+          <div class="gm-section-h">
+            <span>Advanced</span>
+            <span class="arrow">▾</span>
+          </div>
+          <div class="gm-grid3">
+            <div><div class="gm-label">BPM</div><div class="gm-input" style="margin-bottom:0">135</div></div>
+            <div><div class="gm-label">Key / scale</div><div class="gm-input" style="margin-bottom:0">auto</div></div>
+            <div><div class="gm-label">Time sig</div><div class="gm-input" style="margin-bottom:0">4 / 4</div></div>
+          </div>
+          <div class="gm-grid2">
+            <div><div class="gm-label">Sampler</div><div class="gm-select">heun <span class="arrow">▾</span></div></div>
+            <div><div class="gm-label">Vocal language</div><div class="gm-select">auto <span class="arrow">▾</span></div></div>
+          </div>
+          <div class="gm-slider-row"><span class="name">Inference steps</span><span class="gm-slider p20"></span><span class="val">50</span></div>
+          <div class="gm-slider-row"><span class="name">CFG scale</span><span class="gm-slider p40"></span><span class="val">5.0</span></div>
+          <div class="gm-slider-row"><span class="name">Shift</span><span class="gm-slider p30"></span><span class="val">3</span></div>
+          <div class="gm-label" style="margin-top:8px">Negative prompt <span class="hint">things to avoid in the output</span></div>
+          <div class="gm-input gm-textarea" style="font-size:10px">bitcrushed, aliasing, jazz, pop, vocal hooks, slow tempo</div>
+          <div class="gm-grid2">
+            <div><div class="gm-label">Audio format</div><div class="gm-pills"><div class="gm-pill on">mp3 320</div><div class="gm-pill">wav 44.1</div></div></div>
+            <div><div class="gm-label">Loudness</div><div class="gm-toggle on"><span class="box">✓</span> Normalize to -14 LUFS</div></div>
+          </div>
+          <div class="gm-grid2">
+            <div><div class="gm-label">Fade in</div><div class="gm-input" style="margin-bottom:0">0.0 s</div></div>
+            <div><div class="gm-label">Fade out</div><div class="gm-input" style="margin-bottom:0">0.0 s</div></div>
+          </div>
+          <div class="gm-grid2">
+            <div><div class="gm-label">Seed</div><div class="gm-input" style="margin-bottom:0; font-family:monospace">1297183202</div></div>
+            <div><div class="gm-label">&nbsp;</div><div class="gm-toggle"><span class="box"></span> Lock seed · re-use across retakes</div></div>
+          </div>
+        </div>
+        <!-- LM planner section, expanded -->
+        <div class="gm-section">
+          <div class="gm-section-h">
+            <span>LM planner · Qwen3 thinking</span>
+            <span class="arrow">▾</span>
+          </div>
+          <div class="gm-toggle on"><span class="box">✓</span> Thinking enabled <span style="color:#6B6B6B; font-size:9px; margin-left:auto">+ slower but better structure</span></div>
+          <div class="gm-toggle on"><span class="box">✓</span> Constrained decoding</div>
+          <div class="gm-grid4" style="margin-top:8px">
+            <div><div class="gm-label">Temp</div><div class="gm-input" style="margin-bottom:0">0.85</div></div>
+            <div><div class="gm-label">Top-k</div><div class="gm-input" style="margin-bottom:0">0</div></div>
+            <div><div class="gm-label">Top-p</div><div class="gm-input" style="margin-bottom:0">0.90</div></div>
+            <div><div class="gm-label">LM CFG</div><div class="gm-input" style="margin-bottom:0">2</div></div>
+          </div>
+          <div class="gm-label">CoT pipeline toggles <span class="hint">which fields the LM rewrites</span></div>
+          <div class="gm-grid4">
+            <div class="gm-toggle"><span class="box"></span> metas</div>
+            <div class="gm-toggle"><span class="box"></span> caption</div>
+            <div class="gm-toggle"><span class="box"></span> lyrics</div>
+            <div class="gm-toggle"><span class="box"></span> language</div>
+          </div>
+          <div class="gm-label">LM negative prompt</div>
+          <div class="gm-input" style="font-size:10px">happy chords, major scale</div>
+        </div>
+        <!-- DCW section, expanded -->
+        <div class="gm-section">
+          <div class="gm-section-h">
+            <span>DCW · differential correction in wavelet domain</span>
+            <span class="arrow">▾</span>
+          </div>
+          <div class="gm-toggle on"><span class="box">✓</span> DCW enabled</div>
+          <div class="gm-grid3">
+            <div><div class="gm-label">Mode</div><div class="gm-select">double <span class="arrow">▾</span></div></div>
+            <div><div class="gm-label">Wavelet</div><div class="gm-select">haar <span class="arrow">▾</span></div></div>
+            <div><div class="gm-label">&nbsp;</div><div style="font-size:9px; color:#6B6B6B; padding-top:8px;">leave defaults if unsure</div></div>
+          </div>
+          <div class="gm-slider-row"><span class="name">DCW scaler</span><span class="gm-slider p10"></span><span class="val">0.02</span></div>
+          <div class="gm-slider-row"><span class="name">High scaler</span><span class="gm-slider p10"></span><span class="val">0.06</span></div>
+        </div>
+        <div class="gm-btn">▶ Generate cover · est. ~35 s on M5 Max</div>
+      </div>
+      <!-- Output panel -->
+      <div class="gm-output">
+        <div class="gm-label" style="margin-bottom:10px">Output · cover · 30 s · seed 1297183202</div>
+        <div class="gm-toggle"><span class="box"></span> Compare side-by-side with reference</div>
+        <div class="gm-waveform">
+          <div class="gm-bar" style="height:22%"></div><div class="gm-bar" style="height:54%"></div><div class="gm-bar" style="height:78%"></div><div class="gm-bar" style="height:42%"></div><div class="gm-bar" style="height:62%"></div><div class="gm-bar" style="height:88%"></div><div class="gm-bar" style="height:32%"></div><div class="gm-bar" style="height:70%"></div><div class="gm-bar" style="height:50%"></div><div class="gm-bar" style="height:84%"></div><div class="gm-bar" style="height:64%"></div><div class="gm-bar" style="height:38%"></div><div class="gm-bar" style="height:74%"></div><div class="gm-bar" style="height:46%"></div><div class="gm-bar" style="height:58%"></div><div class="gm-bar" style="height:80%"></div><div class="gm-bar" style="height:36%"></div><div class="gm-bar" style="height:68%"></div>
+        </div>
+        <div class="gm-player-controls">
+          <span class="gm-play">▶</span>
+          <span>0:00 / 0:30</span>
+          <span style="margin-left:auto; cursor:pointer; color:#FFF">↻ retake · new seed</span>
+        </div>
+        <div class="gm-label">Stems · Demucs htdemucs_ft</div>
+        <div class="gm-stems">
+          <div class="gm-stem"><span>vocals · 1.8 MB</span><span class="dl">↓</span></div>
+          <div class="gm-stem"><span>drums · 1.6 MB</span><span class="dl">↓</span></div>
+          <div class="gm-stem"><span>bass · 1.4 MB</span><span class="dl">↓</span></div>
+          <div class="gm-stem"><span>other · 1.7 MB</span><span class="dl">↓</span></div>
+        </div>
+        <div class="gm-label">Export</div>
+        <div class="gm-actions">
+          <span class="gm-secondary">↓ mp3 · 320k · 1.2 MB</span>
+          <span class="gm-secondary">↓ wav · 44.1k · 5.3 MB</span>
+          <span class="gm-secondary">↓ stems zip</span>
+          <span class="gm-secondary">{ } meta json</span>
+          <span class="gm-secondary">↗ copy share link</span>
+        </div>
+        <div class="gm-label" style="margin-top:14px">Metadata · for reproducibility</div>
+        <div class="gm-meta-block">
+{<br>
+&nbsp;&nbsp;"mode": "cover",<br>
+&nbsp;&nbsp;"prompt": "faster, more aggressive leads, club-ready",<br>
+&nbsp;&nbsp;"lyrics_first_line": "[intro] driving acid bassline...",<br>
+&nbsp;&nbsp;"ref_audio_sha256": "a4f1...d29c",<br>
+&nbsp;&nbsp;"duration_s": 30, "bpm": 135, "key": "auto",<br>
+&nbsp;&nbsp;"sampler": "heun", "steps": 50, "cfg": 5.0, "shift": 3,<br>
+&nbsp;&nbsp;"audio_cover_strength": 0.93, "cover_noise_strength": 0.0,<br>
+&nbsp;&nbsp;"lm": {"thinking": true, "temp": 0.85, "top_p": 0.9, "cfg": 2,<br>
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"cot": {"metas": false, "caption": false, "lyrics": false, "language": false}},<br>
+&nbsp;&nbsp;"dcw": {"enabled": true, "mode": "double", "scaler": 0.02, "high_scaler": 0.06, "wavelet": "haar"},<br>
+&nbsp;&nbsp;"loras": [<br>
+&nbsp;&nbsp;&nbsp;&nbsp;{"name": "RapMachine", "scale": 0.85, "sha256": "b7e2..."},<br>
+&nbsp;&nbsp;&nbsp;&nbsp;{"name": "psytrance_v2", "scale": 0.95, "sha256": "0c94..."}<br>
+&nbsp;&nbsp;],<br>
+&nbsp;&nbsp;"seed": 1297183202,<br>
+&nbsp;&nbsp;"output_sha256": "f33a...19b8"<br>
+}
+        </div>
+      </div>
+    </div>
+  </div>
+</div>
+<h3 style="margin-top:30px">⏩ Extend — fully expanded</h3>
+<div class="gm">
+  <div class="gm-header">
+    <div>
+      <div class="gm-brand">ACE Music Studio<span style="color:#FFF">.</span></div>
+      <div class="gm-cta" style="margin-top:2px">Built with <span style="color:#FFF">♥</span>. <strong>Drop a like</strong> · Follow <strong>@techfreakworm</strong> for what's next.</div>
+    </div>
+    <div class="gm-status">ready · MPS · M5 Max</div>
+  </div>
+  <div class="gm-row">
+    <div class="gm-sidebar">
+      <div class="gm-side"><span class="em">🎵</span>Generate</div>
+      <div class="gm-side"><span class="em">🎤</span>Cover</div>
+      <div class="gm-side active"><span class="em">⏩</span>Extend</div>
+      <div class="gm-side"><span class="em">✏️</span>Edit</div>
+      <div class="gm-side"><span class="em">✍️</span>Lyrics</div>
+    </div>
+    <div class="gm-main">
+      <div class="gm-form">
+        <div class="gm-label">1 · Seed audio <span class="hint">what to continue · wav / mp3 / flac · ≤ 240 s</span></div>
+        <div class="gm-dropzone has-file">
+          <div class="filename">↑ unfinished_track_v3.wav</div>
+          <div class="meta">44.1 kHz · stereo · 1:42 · 18.0 MB · BPM detected 135 · key C minor</div>
+          <div class="miniwave"></div>
+        </div>
+        <div class="gm-label">2 · Extension prompt <span class="hint">style hint for what comes next</span></div>
+        <div class="gm-input">build to climax, layered acid leads, then breakdown</div>
+        <div class="gm-label">3 · Extension lyrics <span class="hint">optional · use [verse] [chorus] tags · blank = instrumental continuation</span></div>
+        <div class="gm-input gm-textarea">[bridge] the drop is coming...<br>[chorus] one more time, one more time</div>
+        <div class="gm-grid2">
+          <div>
+            <div class="gm-label">Extra duration <span class="hint">seconds</span></div>
+            <div class="gm-slider-row"><span class="name">5 – 120 s</span><span class="gm-slider p50"></span><span class="val">60</span></div>
+          </div>
+          <div>
+            <div class="gm-label">Vocal mode</div>
+            <div class="gm-pills"><div class="gm-pill on">With vocals</div><div class="gm-pill">Instrumental</div></div>
+          </div>
+        </div>
+        <div class="gm-label">Extend-specific <span class="hint">how the seam is handled</span></div>
+        <div class="gm-grid2">
+          <div><div class="gm-label">Repaint mode</div><div class="gm-select">balanced <span class="arrow">▾</span></div></div>
+          <div><div class="gm-label">Chunk mask</div><div class="gm-select">auto <span class="arrow">▾</span></div></div>
+        </div>
+        <div class="gm-slider-row"><span class="name">Repaint strength</span><span class="gm-slider p50"></span><span class="val">0.50</span></div>
+        <div class="gm-slider-row"><span class="name">Latent crossfade frames</span><span class="gm-slider p20"></span><span class="val">10</span></div>
+        <div class="gm-slider-row"><span class="name">WAV crossfade seconds</span><span class="gm-slider p10"></span><span class="val">2.0</span></div>
+        <!-- LoRA section, expanded -->
+        <div class="gm-section">
+          <div class="gm-section-h"><span>LoRA stack <span class="meta">· 1 active</span></span><span class="arrow">▾</span></div>
+          <div class="gm-label">Bundled presets</div>
+          <div style="margin-bottom:12px;">
+            <span class="gm-chip">RapMachine</span>
+            <span class="gm-chip">Chinese Rap</span>
+            <span class="gm-chip">Lyric2Vocal</span>
+            <span class="gm-chip">Text2Samples</span>
+          </div>
+          <div class="gm-label">Active stack</div>
+          <div class="gm-lora-row">
+            <span class="gm-lora-name">psytrance_v2 <small>· custom · 47 MB</small></span>
+            <span class="gm-slider p95" style="width:100px"></span>
+            <span class="val" style="color:#FFF; font-family:monospace; font-size:11px;">0.95</span>
+            <span class="gm-x">×</span>
+          </div>
+          <div style="margin-top:10px;">
+            <span class="gm-chip upload">↑ drop .safetensors here</span>
+          </div>
+        </div>
+        <!-- Advanced section, expanded -->
+        <div class="gm-section">
+          <div class="gm-section-h"><span>Advanced</span><span class="arrow">▾</span></div>
+          <div class="gm-grid3">
+            <div><div class="gm-label">BPM <span class="hint">inherits from seed if blank</span></div><div class="gm-input" style="margin-bottom:0">135</div></div>
+            <div><div class="gm-label">Key / scale</div><div class="gm-input" style="margin-bottom:0">C minor</div></div>
+            <div><div class="gm-label">Time sig</div><div class="gm-input" style="margin-bottom:0">4 / 4</div></div>
+          </div>
+          <div class="gm-grid2">
+            <div><div class="gm-label">Sampler</div><div class="gm-select">heun <span class="arrow">▾</span></div></div>
+            <div><div class="gm-label">Vocal language</div><div class="gm-select">en <span class="arrow">▾</span></div></div>
+          </div>
+          <div class="gm-slider-row"><span class="name">Inference steps</span><span class="gm-slider p20"></span><span class="val">50</span></div>
+          <div class="gm-slider-row"><span class="name">CFG scale</span><span class="gm-slider p40"></span><span class="val">5.0</span></div>
+          <div class="gm-slider-row"><span class="name">Shift</span><span class="gm-slider p30"></span><span class="val">3</span></div>
+          <div class="gm-label" style="margin-top:8px">Negative prompt</div>
+          <div class="gm-input gm-textarea" style="font-size:10px">bitcrushed, aliasing, lo-fi hiss</div>
+          <div class="gm-grid2">
+            <div><div class="gm-label">Audio format</div><div class="gm-pills"><div class="gm-pill on">mp3 320</div><div class="gm-pill">wav 44.1</div></div></div>
+            <div><div class="gm-label">Loudness</div><div class="gm-toggle on"><span class="box">✓</span> -14 LUFS</div></div>
+          </div>
+          <div class="gm-grid2">
+            <div><div class="gm-label">Seed</div><div class="gm-input" style="margin-bottom:0; font-family:monospace">9911</div></div>
+            <div><div class="gm-label">&nbsp;</div><div class="gm-toggle"><span class="box"></span> Lock seed</div></div>
+          </div>
+        </div>
+        <!-- LM planner section, expanded -->
+        <div class="gm-section">
+          <div class="gm-section-h"><span>LM planner · Qwen3 thinking</span><span class="arrow">▾</span></div>
+          <div class="gm-toggle on"><span class="box">✓</span> Thinking enabled</div>
+          <div class="gm-toggle on"><span class="box">✓</span> Constrained decoding</div>
+          <div class="gm-grid4" style="margin-top:8px">
+            <div><div class="gm-label">Temp</div><div class="gm-input" style="margin-bottom:0">0.85</div></div>
+            <div><div class="gm-label">Top-k</div><div class="gm-input" style="margin-bottom:0">0</div></div>
+            <div><div class="gm-label">Top-p</div><div class="gm-input" style="margin-bottom:0">0.90</div></div>
+            <div><div class="gm-label">LM CFG</div><div class="gm-input" style="margin-bottom:0">2</div></div>
+          </div>
+          <div class="gm-label">CoT pipeline toggles</div>
+          <div class="gm-grid4">
+            <div class="gm-toggle"><span class="box"></span> metas</div>
+            <div class="gm-toggle"><span class="box"></span> caption</div>
+            <div class="gm-toggle"><span class="box"></span> lyrics</div>
+            <div class="gm-toggle"><span class="box"></span> language</div>
+          </div>
+        </div>
+        <!-- DCW section, expanded -->
+        <div class="gm-section">
+          <div class="gm-section-h"><span>DCW · differential correction in wavelet domain</span><span class="arrow">▾</span></div>
+          <div class="gm-toggle on"><span class="box">✓</span> DCW enabled</div>
+          <div class="gm-grid2">
+            <div><div class="gm-label">Mode</div><div class="gm-select">double <span class="arrow">▾</span></div></div>
+            <div><div class="gm-label">Wavelet</div><div class="gm-select">haar <span class="arrow">▾</span></div></div>
+          </div>
+          <div class="gm-slider-row"><span class="name">DCW scaler</span><span class="gm-slider p10"></span><span class="val">0.02</span></div>
+          <div class="gm-slider-row"><span class="name">High scaler</span><span class="gm-slider p10"></span><span class="val">0.06</span></div>
+        </div>
+        <div class="gm-btn">▶ Extend · est. ~50 s · output 2:42 total</div>
+      </div>
+      <!-- Output panel -->
+      <div class="gm-output">
+        <div class="gm-label" style="margin-bottom:10px">Output · extended · 2:42 · seed 9911</div>
+        <div class="gm-toggle on"><span class="box">✓</span> Show seed boundary marker</div>
+        <div class="gm-waveform" style="position:relative">
+          <div class="gm-bar" style="height:32%"></div><div class="gm-bar" style="height:48%"></div><div class="gm-bar" style="height:64%"></div><div class="gm-bar" style="height:42%"></div><div class="gm-bar" style="height:58%"></div><div class="gm-bar" style="height:38%"></div><div class="gm-bar" style="height:52%"></div><div class="gm-bar" style="height:46%"></div><div class="gm-bar" style="height:34%; opacity:0.5"></div>
+          <div style="border-left:1px dashed #FFF; height:48px;"></div>
+          <div class="gm-bar" style="height:62%"></div><div class="gm-bar" style="height:78%"></div><div class="gm-bar" style="height:92%"></div><div class="gm-bar" style="height:84%"></div><div class="gm-bar" style="height:70%"></div><div class="gm-bar" style="height:58%"></div><div class="gm-bar" style="height:40%"></div>
+          <div style="position:absolute; bottom:-2px; left:50%; transform:translateX(-50%); font-size:8px; color:#FFF; background:#0A0A0A; padding:0 4px;">↑ seed ends · 1:42</div>
+        </div>
+        <div class="gm-player-controls">
+          <span class="gm-play">▶</span>
+          <span>0:00 / 2:42</span>
+          <span style="margin-left:auto; cursor:pointer; color:#FFF">↻ retake</span>
+        </div>
+        <div class="gm-label">Stems · Demucs</div>
+        <div class="gm-stems">
+          <div class="gm-stem"><span>vocals</span><span class="dl">↓</span></div>
+          <div class="gm-stem"><span>drums</span><span class="dl">↓</span></div>
+          <div class="gm-stem"><span>bass</span><span class="dl">↓</span></div>
+          <div class="gm-stem"><span>other</span><span class="dl">↓</span></div>
+        </div>
+        <div class="gm-label">Export</div>
+        <div class="gm-actions">
+          <span class="gm-secondary">↓ full mp3 · 6.3 MB</span>
+          <span class="gm-secondary">↓ extension-only mp3 · 2.4 MB</span>
+          <span class="gm-secondary">↓ full wav</span>
+          <span class="gm-secondary">↓ stems zip</span>
+          <span class="gm-secondary">{ } meta json</span>
+          <span class="gm-secondary">↗ share link</span>
+        </div>
+        <div class="gm-label" style="margin-top:14px">Metadata</div>
+        <div class="gm-meta-block">
+{<br>
+&nbsp;&nbsp;"mode": "extend",<br>
+&nbsp;&nbsp;"seed_audio_sha256": "e5c0...21ed",<br>
+&nbsp;&nbsp;"seed_duration_s": 102,<br>
+&nbsp;&nbsp;"extension_prompt": "build to climax, layered acid leads...",<br>
+&nbsp;&nbsp;"extension_lyrics_first_line": "[bridge] the drop is coming...",<br>
+&nbsp;&nbsp;"extra_duration_s": 60,<br>
+&nbsp;&nbsp;"repaint_mode": "balanced",<br>
+&nbsp;&nbsp;"repaint_strength": 0.5,<br>
+&nbsp;&nbsp;"latent_crossfade_frames": 10,<br>
+&nbsp;&nbsp;"wav_crossfade_s": 2.0,<br>
+&nbsp;&nbsp;"chunk_mask_mode": "auto",<br>
+&nbsp;&nbsp;"bpm": 135, "key": "C minor",<br>
+&nbsp;&nbsp;"sampler": "heun", "steps": 50, "cfg": 5.0, "shift": 3,<br>
+&nbsp;&nbsp;"lm": {"thinking": true, "temp": 0.85, "top_p": 0.9},<br>
+&nbsp;&nbsp;"dcw": {"enabled": true, "mode": "double", "scaler": 0.02},<br>
+&nbsp;&nbsp;"loras": [{"name": "psytrance_v2", "scale": 0.95, "sha256": "0c94..."}],<br>
+&nbsp;&nbsp;"seed": 9911,<br>
+&nbsp;&nbsp;"output_sha256": "9fbc...4071"<br>
+}
+        </div>
+      </div>
+    </div>
+  </div>
+</div>
+<div class="options" style="margin-top:24px">
+  <div class="option" data-choice="approve" onclick="toggleSelect(this)">
+    <div class="letter">✓</div>
+    <div class="content">
+      <h3>Both look right — show Edit + Lyrics + Generate (refreshed) next</h3>
+      <p>Cover and Extend hierarchies + control depth are correct. Continue.</p>
+    </div>
+  </div>
+  <div class="option" data-choice="revise" onclick="toggleSelect(this)">
+    <div class="letter">✎</div>
+    <div class="content">
+      <h3>Revise — tell me which control / section</h3>
+      <p>Reply in terminal with specifics. I'll redo a single section without re-pushing the whole thing.</p>
+    </div>
+  </div>
+</div>
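
Note on the Extend mockup above: its labels and metadata block agree with each other (seed 102 s + extension 60 s = 162 s, shown as "seed ends · 1:42" and "2:42 total"). The mockup does not say how those labels or the abbreviated hashes are produced, so the following is only a minimal Python sketch under assumptions. `format_mmss` and `abbreviate_sha256` are hypothetical helper names, and the `abcd...wxyz` shortening simply mirrors how hashes are displayed in the metadata block.

```python
import hashlib


def format_mmss(seconds: float) -> str:
    """Format a duration in seconds as M:SS, the style used by the mockup labels."""
    total = int(round(seconds))
    return f"{total // 60}:{total % 60:02d}"


def abbreviate_sha256(data: bytes, keep: int = 4) -> str:
    """Hash bytes with SHA-256 and shorten the hex digest for display (assumed format)."""
    digest = hashlib.sha256(data).hexdigest()
    return f"{digest[:keep]}...{digest[-keep:]}"


# Values copied from the Extend mockup's metadata block.
meta = {"mode": "extend", "seed_duration_s": 102, "extra_duration_s": 60}

seed_end = format_mmss(meta["seed_duration_s"])
total = format_mmss(meta["seed_duration_s"] + meta["extra_duration_s"])
print(f"seed ends at {seed_end}, output runs {total}")
# prints: seed ends at 1:42, output runs 2:42
```

Deriving the boundary label and the total from the same two metadata fields would keep the waveform marker, the button text, and the exported JSON from drifting apart.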

docs/superpowers/specs/mockups/03_edit_lyrics.html ADDED

	@@ -0,0 +1,517 @@

+<h2>Edit and Lyrics · everything expanded</h2>
+<p class="subtitle">Edit has two sub-modes: Repaint (segment regeneration) and Flow Morph (caption-to-caption transformation). The Lyrics tab uses Qwen 2.5 7B Instruct to draft structurally tagged lyrics for the song modes.</p>
+<style>
+  .gm { background:#0A0A0A; color:#E5E5E5; border:1px solid #1F1F1F; border-radius:10px; padding:18px; font-size:12px; line-height:1.5; margin-top:14px; }
+  .gm-header { display:flex; justify-content:space-between; align-items:center; padding-bottom:10px; border-bottom:1px solid #1F1F1F; margin-bottom:14px; }
+  .gm-brand { font-size:15px; font-weight:600; }
+  .gm-cta { font-size:11px; color:#6B6B6B; }
+  .gm-cta strong { color:#E5E5E5; }
+  .gm-status { font-size:10px; color:#6B6B6B; letter-spacing:0.08em; text-transform:uppercase; }
+  .gm-row { display:flex; gap:16px; align-items:flex-start; }
+  .gm-sidebar { background:#000; padding:14px 10px; border-radius:6px; min-width:170px; }
+  .gm-side { display:block; padding:8px 10px; border-radius:4px; margin-bottom:3px; font-size:12px; color:#6B6B6B; }
+  .gm-side.active { background:#1A1A1A; color:#FFF; border-left:2px solid #FFF; padding-left:8px; }
+  .gm-side .em { margin-right:6px; }
+  .gm-main { flex:1; display:flex; gap:14px; align-items:flex-start; }
+  .gm-form { flex:1.3; background:#141414; padding:16px; border-radius:6px; }
+  .gm-output { flex:1; background:#141414; padding:16px; border-radius:6px; min-width:260px; }
+  .gm-label { font-size:10px; text-transform:uppercase; letter-spacing:0.08em; color:#6B6B6B; margin-bottom:6px; display:flex; justify-content:space-between; align-items:center; }
+  .gm-label .hint { color:#5A5048; font-size:9px; text-transform:none; letter-spacing:normal; font-weight:400; }
+  .gm-input { background:#000; border:1px solid #2A2A2A; padding:8px 10px; border-radius:4px; color:#E5E5E5; margin-bottom:12px; font-size:11px; }
+  .gm-textarea { min-height:46px; }
+  .gm-grid2 { display:grid; grid-template-columns:1fr 1fr; gap:12px; margin-bottom:12px; }
+  .gm-grid3 { display:grid; grid-template-columns:1fr 1fr 1fr; gap:10px; margin-bottom:12px; }
+  .gm-grid4 { display:grid; grid-template-columns:1fr 1fr 1fr 1fr; gap:8px; margin-bottom:12px; }
+  .gm-slider-row { display:flex; align-items:center; gap:10px; padding:6px 8px; background:#000; border:1px solid #2A2A2A; border-radius:4px; margin-bottom:8px; font-size:11px; }
+  .gm-slider-row .name { color:#6B6B6B; font-size:10px; min-width:130px; }
+  .gm-slider { flex:1; height:3px; background:#2A2A2A; border-radius:2px; position:relative; }
+  .gm-slider::after { content:""; position:absolute; top:-4px; width:10px; height:10px; background:#FFF; border-radius:50%; }
+  .gm-slider.p10::after { left:10%; }
+  .gm-slider.p20::after { left:20%; }
+  .gm-slider.p25::after { left:25%; }
+  .gm-slider.p33::after { left:33%; }
+  .gm-slider.p40::after { left:40%; }
+  .gm-slider.p50::after { left:50%; }
+  .gm-slider.p60::after { left:60%; }
+  .gm-slider.p70::after { left:70%; }
+  .gm-slider.p85::after { left:85%; }
+  .gm-slider.p95::after { left:95%; }
+  .gm-slider-row .val { color:#FFF; font-family:monospace; font-size:11px; min-width:42px; text-align:right; }
+  .gm-toggle { display:flex; align-items:center; gap:8px; padding:6px 10px; background:#000; border:1px solid #2A2A2A; border-radius:4px; margin-bottom:8px; font-size:11px; cursor:pointer; }
+  .gm-toggle .box { width:14px; height:14px; border:1px solid #2A2A2A; border-radius:3px; display:inline-flex; align-items:center; justify-content:center; font-size:9px; }
+  .gm-toggle.on { color:#FFF; border-color:#FFF; }
+  .gm-toggle.on .box { background:#FFF; color:#0A0A0A; border-color:#FFF; }
+  .gm-pills { display:flex; gap:0; background:#000; border:1px solid #2A2A2A; border-radius:4px; padding:2px; margin-bottom:12px; }
+  .gm-pill { flex:1; text-align:center; padding:6px 10px; font-size:11px; color:#6B6B6B; border-radius:3px; cursor:pointer; }
+  .gm-pill.on { background:#FFF; color:#0A0A0A; }
+  .gm-select { background:#000; border:1px solid #2A2A2A; padding:8px 10px; border-radius:4px; color:#E5E5E5; font-size:11px; display:flex; justify-content:space-between; align-items:center; margin-bottom:8px; }
+  .gm-select .arrow { color:#6B6B6B; }
+  .gm-section { border:1px solid #2A2A2A; border-radius:4px; padding:14px; margin-top:14px; background:#0F0F0F; }
+  .gm-section.dim { opacity:0.4; }
+  .gm-section-h { display:flex; justify-content:space-between; align-items:center; margin-bottom:12px; font-size:11px; font-weight:600; }
+  .gm-section-h .arrow { color:#FFF; }
+  .gm-section-h .meta { color:#6B6B6B; font-weight:400; font-size:10px; }
+  .gm-chip { display:inline-block; padding:5px 10px; border-radius:14px; font-size:10px; margin-right:5px; margin-bottom:5px; background:#000; border:1px solid #2A2A2A; color:#6B6B6B; cursor:pointer; }
+  .gm-chip.on { border-color:#FFF; color:#FFF; }
+  .gm-chip.upload { border-style:dashed; color:#FFF; }
+  .gm-lora-row { display:flex; align-items:center; gap:10px; padding:8px 10px; background:#000; border:1px solid #2A2A2A; border-radius:4px; margin-bottom:6px; font-size:11px; }
+  .gm-lora-name { flex:1; }
+  .gm-lora-name small { color:#6B6B6B; font-weight:400; margin-left:4px; }
+  .gm-x { color:#6B6B6B; cursor:pointer; padding:0 4px; }
+  .gm-btn { background:#FFF; color:#0A0A0A; padding:12px 18px; border-radius:4px; font-weight:600; display:block; font-size:13px; text-align:center; cursor:pointer; margin-top:16px; }
+  .gm-dropzone { background:#000; border:2px dashed #2A2A2A; border-radius:6px; padding:14px; margin-bottom:12px; text-align:center; font-size:11px; color:#6B6B6B; }
+  .gm-dropzone.has-file { border-style:solid; border-color:#FFF; color:#FFF; text-align:left; padding:10px 12px; }
+  .gm-dropzone .filename { font-weight:600; }
+  .gm-dropzone .meta { color:#6B6B6B; font-size:9px; margin-top:2px; font-weight:400; }
+  .gm-dropzone .miniwave { height:18px; background:repeating-linear-gradient(90deg, currentColor 0 1px, transparent 1px 3px); margin-top:6px; opacity:0.5; }
+  .gm-waveform { height:60px; background:#000; border:1px solid #2A2A2A; border-radius:4px; display:flex; align-items:center; justify-content:center; gap:2px; padding:8px; margin-bottom:10px; position:relative; }
+  .gm-bar { width:2px; background:#E5E5E5; }
+  .gm-bar.muted { opacity:0.35; }
+  .gm-bar.highlight { background:#FFF; }
+  .gm-player-controls { display:flex; align-items:center; gap:10px; color:#6B6B6B; font-size:10px; margin-bottom:14px; }
+  .gm-play { width:28px; height:28px; border-radius:50%; background:#FFF; color:#0A0A0A; display:flex; align-items:center; justify-content:center; font-size:11px; }
+  .gm-meta-block { background:#000; border:1px solid #2A2A2A; border-radius:4px; padding:8px 10px; font-size:9px; color:#6B6B6B; font-family:monospace; line-height:1.6; max-height:160px; overflow:hidden; margin-top:8px; }
+  .gm-actions { display:flex; flex-wrap:wrap; gap:6px; margin-bottom:10px; }
+  .gm-secondary { border:1px solid #2A2A2A; color:#E5E5E5; padding:6px 12px; border-radius:4px; font-size:10px; cursor:pointer; }
+  .gm-segment-bar { position:relative; height:18px; background:#0F0F0F; border:1px solid #2A2A2A; border-radius:3px; margin:8px 0 12px; }
+  .gm-segment-bar .sel { position:absolute; top:0; bottom:0; background:#FFF; opacity:0.85; }
+  .gm-segment-bar .ticks { position:absolute; top:0; left:0; right:0; bottom:0; display:flex; justify-content:space-between; padding:0 2px; align-items:center; font-size:8px; color:#6B6B6B; font-family:monospace; pointer-events:none; }
+  .gm-segment-bar .label { position:absolute; top:-14px; font-size:8px; color:#FFF; font-family:monospace; }
+  /* Lyrics-specific */
+  .gm-lyrics-output { background:#000; border:1px solid #2A2A2A; border-radius:4px; padding:14px; margin-bottom:10px; font-family:Inter, system-ui, sans-serif; font-size:11px; line-height:1.7; color:#E5E5E5; min-height:240px; }
+  .gm-lyrics-output .section-tag { color:#FFF; font-weight:600; display:block; margin-top:10px; }
+  .gm-lyrics-output .section-tag:first-child { margin-top:0; }
+  .gm-lyrics-output .body { color:#B8B0A4; margin-left:0; }
+</style>
+<h3 style="margin-top:14px">✏️ Edit — fully expanded · Repaint sub-mode active</h3>
+<div class="gm">
+  <div class="gm-header">
+    <div>
+      <div class="gm-brand">ACE Music Studio<span style="color:#FFF">.</span></div>
+      <div class="gm-cta" style="margin-top:2px">Built with <span style="color:#FFF">♥</span>. <strong>Drop a like</strong> · Follow <strong>@techfreakworm</strong></div>
+    </div>
+    <div class="gm-status">ready · MPS · M5 Max</div>
+  </div>
+  <div class="gm-row">
+    <div class="gm-sidebar">
+      <div class="gm-side"><span class="em">🎵</span>Generate</div>
+      <div class="gm-side"><span class="em">🎤</span>Cover</div>
+      <div class="gm-side"><span class="em">⏩</span>Extend</div>
+      <div class="gm-side active"><span class="em">✏️</span>Edit</div>
+      <div class="gm-side"><span class="em">✍️</span>Lyrics</div>
+    </div>
+    <div class="gm-main">
+      <div class="gm-form">
+        <div class="gm-label">1 · Source audio <span class="hint">the song you want to modify · ≤ 240 s</span></div>
+        <div class="gm-dropzone has-file">
+          <div class="filename">↑ my_song_draft.wav</div>
+          <div class="meta">44.1 kHz · stereo · 2:30 · 26.4 MB · BPM 138 · key A minor</div>
+          <div class="miniwave"></div>
+        </div>
+        <div class="gm-label">2 · Edit sub-mode</div>
+        <div class="gm-pills">
+          <div class="gm-pill on">Repaint segment</div>
+          <div class="gm-pill">Flow morph</div>
+        </div>
+        <div class="gm-label">3 · Source lyrics <span class="hint">paste the existing lyrics for context</span></div>
+        <div class="gm-input gm-textarea">[verse 1] original lyric line one<br>[chorus] original chorus<br>[verse 2] original lyric line two<br>[bridge] ...</div>
+        <div class="gm-label">4 · Target lyrics <span class="hint">replace only the segment selected below</span></div>
+        <div class="gm-input gm-textarea">[chorus] new chorus replaces the old<br>more punchy, more melodic</div>
+        <div class="gm-label">5 · Segment selection <span class="hint">drag handles on the waveform · or set timestamps</span></div>
+        <div class="gm-segment-bar">
+          <div class="sel" style="left:33%; width:27%;"></div>
+          <div class="ticks"><span>0:00</span><span>0:30</span><span>1:00</span><span>1:30</span><span>2:00</span><span>2:30</span></div>
+          <div class="label" style="left:33%">0:50</div>
+          <div class="label" style="left:60%">1:30</div>
+        </div>
+        <div class="gm-grid2">
+          <div><div class="gm-label">Segment start</div><div class="gm-input" style="margin-bottom:0; font-family:monospace">50.0 s</div></div>
+          <div><div class="gm-label">Segment end</div><div class="gm-input" style="margin-bottom:0; font-family:monospace">90.0 s</div></div>
+        </div>
+        <!-- Repaint sub-options -->
+        <div class="gm-section">
+          <div class="gm-section-h">
+            <span>Repaint options <span class="meta">· segment regeneration</span></span>
+            <span class="arrow">▾</span>
+          </div>
+          <div class="gm-grid2">
+            <div><div class="gm-label">Repaint mode</div><div class="gm-select">balanced <span class="arrow">▾</span></div></div>
+            <div><div class="gm-label">Chunk mask</div><div class="gm-select">auto <span class="arrow">▾</span></div></div>
+          </div>
+          <div class="gm-slider-row"><span class="name">Repaint strength</span><span class="gm-slider p50"></span><span class="val">0.50</span></div>
+          <div class="gm-slider-row"><span class="name">Latent crossfade frames</span><span class="gm-slider p20"></span><span class="val">10</span></div>
+          <div class="gm-slider-row"><span class="name">WAV crossfade seconds</span><span class="gm-slider p10"></span><span class="val">0.0</span></div>
+          <div class="gm-toggle on"><span class="box">✓</span> Preserve segment boundary phase</div>
+        </div>
+        <!-- Flow Morph sub-options, dimmed since Repaint is active -->
+        <div class="gm-section dim">
+          <div class="gm-section-h">
+            <span>Flow morph options <span class="meta">· caption-to-caption transformation · select "Flow morph" above to use</span></span>
+            <span class="arrow">▾</span>
+          </div>
+          <div class="gm-label">Source caption <span class="hint">describes what the segment currently is</span></div>
+          <div class="gm-input">acoustic ballad, gentle piano</div>
+          <div class="gm-label">Target caption <span class="hint">what to morph it into · prompt above is reused</span></div>
+          <div class="gm-input" style="opacity:0.5">(reuses the style prompt)</div>
+          <div class="gm-grid3">
+            <div><div class="gm-label">n_min</div><div class="gm-input" style="margin-bottom:0">0.0</div></div>
+            <div><div class="gm-label">n_max</div><div class="gm-input" style="margin-bottom:0">1.0</div></div>
+            <div><div class="gm-label">n_avg</div><div class="gm-input" style="margin-bottom:0">1</div></div>
+          </div>
+          <div class="gm-toggle"><span class="box"></span> Enable flow_edit_morph</div>
+        </div>
+        <!-- LoRA section, expanded -->
+        <div class="gm-section">
+          <div class="gm-section-h"><span>LoRA stack <span class="meta">· 1 active</span></span><span class="arrow">▾</span></div>
+          <div class="gm-label">Bundled presets</div>
+          <div style="margin-bottom:12px;">
+            <span class="gm-chip">RapMachine</span>
+            <span class="gm-chip">Chinese Rap</span>
+            <span class="gm-chip on">Lyric2Vocal</span>
+            <span class="gm-chip">Text2Samples</span>
+          </div>
+          <div class="gm-label">Active stack</div>
+          <div class="gm-lora-row">
+            <span class="gm-lora-name">Lyric2Vocal <small>· preset</small></span>
+            <span class="gm-slider p70" style="width:100px"></span>
+            <span class="val" style="color:#FFF; font-family:monospace; font-size:11px;">0.70</span>
+            <span class="gm-x">×</span>
+          </div>
+          <div style="margin-top:10px;">
+            <span class="gm-chip upload">↑ drop .safetensors here</span>
+          </div>
+        </div>
+        <!-- Advanced section, expanded -->
+        <div class="gm-section">
+          <div class="gm-section-h"><span>Advanced</span><span class="arrow">▾</span></div>
+          <div class="gm-grid3">
+            <div><div class="gm-label">BPM <span class="hint">inherits from source</span></div><div class="gm-input" style="margin-bottom:0">138</div></div>
+            <div><div class="gm-label">Key / scale</div><div class="gm-input" style="margin-bottom:0">A minor</div></div>
+            <div><div class="gm-label">Time sig</div><div class="gm-input" style="margin-bottom:0">4 / 4</div></div>
+          </div>
+          <div class="gm-grid2">
+            <div><div class="gm-label">Sampler</div><div class="gm-select">heun <span class="arrow">▾</span></div></div>
+            <div><div class="gm-label">Vocal language</div><div class="gm-select">en <span class="arrow">▾</span></div></div>
+          </div>
+          <div class="gm-slider-row"><span class="name">Inference steps</span><span class="gm-slider p20"></span><span class="val">50</span></div>
+          <div class="gm-slider-row"><span class="name">CFG scale</span><span class="gm-slider p40"></span><span class="val">5.0</span></div>
+          <div class="gm-slider-row"><span class="name">Shift</span><span class="gm-slider p33"></span><span class="val">3</span></div>
+          <div class="gm-label" style="margin-top:8px">Negative prompt</div>
+          <div class="gm-input" style="font-size:10px; margin-bottom:8px">bitcrushed, aliasing, off-key</div>
+          <div class="gm-grid2">
+            <div><div class="gm-label">Audio format</div><div class="gm-pills"><div class="gm-pill on">mp3 320</div><div class="gm-pill">wav 44.1</div></div></div>
+            <div><div class="gm-label">Loudness</div><div class="gm-toggle on"><span class="box">✓</span> -14 LUFS</div></div>
+          </div>
+          <div class="gm-grid2">
+            <div><div class="gm-label">Seed</div><div class="gm-input" style="margin-bottom:0; font-family:monospace">7331</div></div>
+            <div><div class="gm-label">&nbsp;</div><div class="gm-toggle"><span class="box"></span> Lock seed</div></div>
+          </div>
+        </div>
+        <!-- LM planner -->
+        <div class="gm-section">
+          <div class="gm-section-h"><span>LM planner · Qwen3 thinking</span><span class="arrow">▾</span></div>
+          <div class="gm-toggle on"><span class="box">✓</span> Thinking enabled</div>
+          <div class="gm-toggle on"><span class="box">✓</span> Constrained decoding</div>
+          <div class="gm-grid4" style="margin-top:8px">
+            <div><div class="gm-label">Temp</div><div class="gm-input" style="margin-bottom:0">0.85</div></div>
+            <div><div class="gm-label">Top-k</div><div class="gm-input" style="margin-bottom:0">0</div></div>
+            <div><div class="gm-label">Top-p</div><div class="gm-input" style="margin-bottom:0">0.90</div></div>
+            <div><div class="gm-label">LM CFG</div><div class="gm-input" style="margin-bottom:0">2</div></div>
+          </div>
+          <div class="gm-label">CoT pipeline toggles</div>
+          <div class="gm-grid4">
+            <div class="gm-toggle"><span class="box"></span> metas</div>
+            <div class="gm-toggle"><span class="box"></span> caption</div>
+            <div class="gm-toggle on"><span class="box">✓</span> lyrics</div>
+            <div class="gm-toggle"><span class="box"></span> language</div>
+          </div>
+        </div>
+        <!-- DCW -->
+        <div class="gm-section">
+          <div class="gm-section-h"><span>DCW · dynamic CFG warping</span><span class="arrow">▾</span></div>
+          <div class="gm-toggle on"><span class="box">✓</span> DCW enabled</div>
+          <div class="gm-grid2">
+            <div><div class="gm-label">Mode</div><div class="gm-select">double <span class="arrow">▾</span></div></div>
+            <div><div class="gm-label">Wavelet</div><div class="gm-select">haar <span class="arrow">▾</span></div></div>
+          </div>
+          <div class="gm-slider-row"><span class="name">DCW scaler</span><span class="gm-slider p10"></span><span class="val">0.02</span></div>
+          <div class="gm-slider-row"><span class="name">High scaler</span><span class="gm-slider p10"></span><span class="val">0.06</span></div>
+        </div>
+        <div class="gm-btn">▶ Repaint segment 0:50 – 1:30 · est. ~25 s on M5 Max</div>
+      </div>
+      <!-- Output -->
+      <div class="gm-output">
+        <div class="gm-label" style="margin-bottom:10px">Output · edited · 2:30 · seed 7331 · segment 0:50 – 1:30</div>
+        <div class="gm-toggle on"><span class="box">✓</span> Show edited region (highlighted on waveform)</div>
+        <div class="gm-waveform">
+          <div class="gm-bar muted" style="height:32%"></div>
+          <div class="gm-bar muted" style="height:48%"></div>
+          <div class="gm-bar muted" style="height:60%"></div>
+          <div class="gm-bar muted" style="height:42%"></div>
+          <div class="gm-bar muted" style="height:54%"></div>
+          <div class="gm-bar highlight" style="height:78%"></div>
+          <div class="gm-bar highlight" style="height:92%"></div>
+          <div class="gm-bar highlight" style="height:84%"></div>
+          <div class="gm-bar highlight" style="height:70%"></div>
+          <div class="gm-bar highlight" style="height:88%"></div>
+          <div class="gm-bar highlight" style="height:62%"></div>
+          <div class="gm-bar muted" style="height:48%"></div>
+          <div class="gm-bar muted" style="height:36%"></div>
+          <div class="gm-bar muted" style="height:42%"></div>
+          <div class="gm-bar muted" style="height:30%"></div>
+          <div class="gm-bar muted" style="height:38%"></div>
+        </div>
+        <div class="gm-player-controls">
+          <span class="gm-play">▶</span>
+          <span>0:00 / 2:30</span>
+          <span style="margin-left:auto; cursor:pointer; color:#FFF">↻ retake segment</span>
+        </div>
+        <div class="gm-label">A / B comparison</div>
+        <div class="gm-grid2">
+          <div class="gm-secondary" style="text-align:center">▶ original</div>
+          <div class="gm-secondary" style="text-align:center; border-color:#FFF; color:#FFF">▶ edited</div>
+        </div>
+        <div class="gm-label" style="margin-top:10px">Stems · Demucs</div>
+        <div style="display:grid; grid-template-columns:1fr 1fr; gap:6px; margin-bottom:10px;">
+          <div style="background:#000; border:1px solid #2A2A2A; border-radius:4px; padding:6px 10px; font-size:10px; display:flex; justify-content:space-between;"><span>vocals</span><span style="color:#FFF">↓</span></div>
+          <div style="background:#000; border:1px solid #2A2A2A; border-radius:4px; padding:6px 10px; font-size:10px; display:flex; justify-content:space-between;"><span>drums</span><span style="color:#FFF">↓</span></div>
+          <div style="background:#000; border:1px solid #2A2A2A; border-radius:4px; padding:6px 10px; font-size:10px; display:flex; justify-content:space-between;"><span>bass</span><span style="color:#FFF">↓</span></div>
+          <div style="background:#000; border:1px solid #2A2A2A; border-radius:4px; padding:6px 10px; font-size:10px; display:flex; justify-content:space-between;"><span>other</span><span style="color:#FFF">↓</span></div>
+        </div>
+        <div class="gm-label">Export</div>
+        <div class="gm-actions">
+          <span class="gm-secondary">↓ full mp3</span>
+          <span class="gm-secondary">↓ segment-only mp3</span>
+          <span class="gm-secondary">↓ wav</span>
+          <span class="gm-secondary">↓ stems zip</span>
+          <span class="gm-secondary">{ } meta</span>
+        </div>
+        <div class="gm-label" style="margin-top:14px">Metadata</div>
+        <div class="gm-meta-block">
+{<br>
+&nbsp;&nbsp;"mode": "edit", "sub_mode": "repaint",<br>
+&nbsp;&nbsp;"source_audio_sha256": "1a4f...8e7d",<br>
+&nbsp;&nbsp;"segment_start_s": 50.0, "segment_end_s": 90.0,<br>
+&nbsp;&nbsp;"repaint_mode": "balanced", "repaint_strength": 0.5,<br>
+&nbsp;&nbsp;"latent_crossfade_frames": 10, "wav_crossfade_s": 0.0,<br>
+&nbsp;&nbsp;"chunk_mask_mode": "auto",<br>
+&nbsp;&nbsp;"source_lyrics_hash": "3c2e...44ab",<br>
+&nbsp;&nbsp;"target_lyrics_first_line": "[chorus] new chorus replaces the old...",<br>
+&nbsp;&nbsp;"bpm": 138, "key": "A minor", "sampler": "heun", "steps": 50,<br>
+&nbsp;&nbsp;"loras": [{"name": "Lyric2Vocal", "scale": 0.7}],<br>
+&nbsp;&nbsp;"seed": 7331,<br>
+&nbsp;&nbsp;"output_sha256": "b7a2...c019"<br>
+}
+        </div>
+      </div>
+    </div>
+  </div>
+</div>
+<h3 style="margin-top:30px">✍️ Lyrics — fully expanded · Qwen 2.5 7B Instruct</h3>
+<div class="gm">
+  <div class="gm-header">
+    <div>
+      <div class="gm-brand">ACE Music Studio<span style="color:#FFF">.</span></div>
+      <div class="gm-cta" style="margin-top:2px">Built with <span style="color:#FFF">♥</span>. <strong>Drop a like</strong> · Follow <strong>@techfreakworm</strong></div>
+    </div>
+    <div class="gm-status">ready · MPS · M5 Max · Qwen 2.5 7B</div>
+  </div>
+  <div class="gm-row">
+    <div class="gm-sidebar">
+      <div class="gm-side"><span class="em">🎵</span>Generate</div>
+      <div class="gm-side"><span class="em">🎤</span>Cover</div>
+      <div class="gm-side"><span class="em">⏩</span>Extend</div>
+      <div class="gm-side"><span class="em">✏️</span>Edit</div>
+      <div class="gm-side active"><span class="em">✍️</span>Lyrics</div>
+    </div>
+    <div class="gm-main">
+      <div class="gm-form">
+        <div class="gm-label">1 · Brief <span class="hint">describe the song in plain language</span></div>
+        <div class="gm-input gm-textarea" style="min-height:80px">A driving psytrance anthem about losing yourself on the dancefloor at sunrise. First-person, present tense, references to lights, kick drum, transcendence. Avoid clichés like "feel the beat".</div>
+        <div class="gm-grid2">
+          <div>
+            <div class="gm-label">Target structure <span class="hint">section sequence</span></div>
+            <div class="gm-input" style="margin-bottom:0">intro, verse, chorus, verse, chorus, bridge, chorus, outro</div>
+          </div>
+          <div>
+            <div class="gm-label">Language</div>
+            <div class="gm-select" style="margin-bottom:0">English (en) <span class="arrow">▾</span></div>
+          </div>
+        </div>
+        <div class="gm-grid3">
+          <div>
+            <div class="gm-label">Verse lines</div>
+            <div class="gm-input" style="margin-bottom:0">6</div>
+          </div>
+          <div>
+            <div class="gm-label">Chorus lines</div>
+            <div class="gm-input" style="margin-bottom:0">4</div>
+          </div>
+          <div>
+            <div class="gm-label">Bridge lines</div>
+            <div class="gm-input" style="margin-bottom:0">2</div>
+          </div>
+        </div>
+        <div class="gm-label">Tone / mood <span class="hint">optional · comma-separated descriptors</span></div>
+        <div class="gm-input">euphoric, hypnotic, transcendent, not cheesy</div>
+        <div class="gm-label">Rhyme preference</div>
+        <div class="gm-pills">
+          <div class="gm-pill">Strict (AABB)</div>
+          <div class="gm-pill on">Loose (ABAB / free)</div>
+          <div class="gm-pill">None</div>
+        </div>
+        <!-- LM parameters, expanded -->
+        <div class="gm-section">
+          <div class="gm-section-h">
+            <span>LM parameters <span class="meta">· Qwen 2.5 7B Instruct (Apache 2.0)</span></span>
+            <span class="arrow">▾</span>
+          </div>
+          <div class="gm-grid4">
+            <div><div class="gm-label">Temperature</div><div class="gm-input" style="margin-bottom:0">0.85</div></div>
+            <div><div class="gm-label">Top-p</div><div class="gm-input" style="margin-bottom:0">0.90</div></div>
+            <div><div class="gm-label">Top-k</div><div class="gm-input" style="margin-bottom:0">40</div></div>
+            <div><div class="gm-label">Rep. penalty</div><div class="gm-input" style="margin-bottom:0">1.10</div></div>
+          </div>
+          <div class="gm-grid2">
+            <div><div class="gm-label">Max new tokens</div><div class="gm-input" style="margin-bottom:0">600</div></div>
+            <div><div class="gm-label">Seed</div><div class="gm-input" style="margin-bottom:0; font-family:monospace">42</div></div>
+          </div>
+          <div class="gm-toggle"><span class="box"></span> Show system prompt</div>
+          <div class="gm-toggle on"><span class="box">✓</span> Enforce structural-tag format <span style="color:#6B6B6B; font-size:9px; margin-left:auto">stop at [end]</span></div>
+        </div>
+        <div class="gm-btn">▶ Draft lyrics · est. ~8 s on M5 Max</div>
+      </div>
+      <!-- Lyrics output -->
+      <div class="gm-output">
+        <div class="gm-label" style="margin-bottom:10px">Draft · 1 of 1 · 312 tokens · 6.2 s</div>
+        <div class="gm-lyrics-output">
+          <span class="section-tag">[intro]</span>
+          <span class="body">the lights start low, the bass starts slow<br>another night, another holy show</span>
+          <span class="section-tag">[verse 1]</span>
+          <span class="body">six in the morning, the sun's still pretending<br>kick drum carries what the night was sending<br>shoes off, eyes closed, the city's still bending<br>no one's here for an ending<br>just one more lift, one more descending<br>the room is breathing, the floor is mending</span>
+          <span class="section-tag">[chorus]</span>
+          <span class="body">we let go, we let go, we let go<br>oh the morning, oh the morning<br>arms up, head down, no warning<br>we let go, we let go, we let go</span>
+          <span class="section-tag">[verse 2]</span>
+          <span class="body">...</span>
+          <span class="section-tag">[bridge]</span>
+          <span class="body">...</span>
+          <span class="section-tag">[outro]</span>
+          <span class="body">...</span>
+        </div>
+        <div class="gm-actions" style="margin-bottom:14px">
+          <span class="gm-secondary" style="border-color:#FFF; color:#FFF">↑ Use these in Generate</span>
+          <span class="gm-secondary">↻ regenerate</span>
+          <span class="gm-secondary">↻ continue from cursor</span>
+          <span class="gm-secondary">✎ edit inline</span>
+          <span class="gm-secondary">↓ .txt</span>
+        </div>
+        <div class="gm-label">Quick refinements <span class="hint">click to apply to next regeneration</span></div>
+        <div style="margin-bottom:14px;">
+          <span class="gm-chip">more cryptic</span>
+          <span class="gm-chip">less rhyme</span>
+          <span class="gm-chip">more concrete imagery</span>
+          <span class="gm-chip">shorter lines</span>
+          <span class="gm-chip">add chorus hook</span>
+        </div>
+        <div class="gm-label">Variants</div>
+        <div class="gm-grid2">
+          <div class="gm-secondary" style="text-align:center; border-color:#FFF; color:#FFF">v1 · current</div>
+          <div class="gm-secondary" style="text-align:center">+ generate v2</div>
+        </div>
+        <div class="gm-label" style="margin-top:14px">Metadata</div>
+        <div class="gm-meta-block">
+{<br>
+&nbsp;&nbsp;"mode": "lyrics",<br>
+&nbsp;&nbsp;"model": "Qwen2.5-7B-Instruct",<br>
+&nbsp;&nbsp;"brief_first_line": "A driving psytrance anthem about losing yourself...",<br>
+&nbsp;&nbsp;"structure": ["intro", "verse", "chorus", "verse", "chorus", "bridge", "chorus", "outro"],<br>
+&nbsp;&nbsp;"language": "en",<br>
+&nbsp;&nbsp;"tone": "euphoric, hypnotic, transcendent, not cheesy",<br>
+&nbsp;&nbsp;"verse_lines": 6, "chorus_lines": 4, "bridge_lines": 2,<br>
+&nbsp;&nbsp;"rhyme_preference": "loose",<br>
+&nbsp;&nbsp;"temperature": 0.85, "top_p": 0.9, "top_k": 40,<br>
+&nbsp;&nbsp;"repetition_penalty": 1.1, "max_new_tokens": 600,<br>
+&nbsp;&nbsp;"seed": 42,<br>
+&nbsp;&nbsp;"tokens_generated": 312, "wall_seconds": 6.2,<br>
+&nbsp;&nbsp;"output_sha256": "f1a3...88e2"<br>
+}
+        </div>
+      </div>
+    </div>
+  </div>
+</div>
+<div class="options" style="margin-top:24px">
+  <div class="option" data-choice="approve" onclick="toggleSelect(this)">
+    <div class="letter">✓</div>
+    <div class="content">
+      <h3>Both look right — refresh Generate next, then mobile + error states</h3>
+      <p>Edit (with both sub-modes visible) and Lyrics (with LM params + quick-refinement chips) work. Continue.</p>
+    </div>
+  </div>
+  <div class="option" data-choice="revise" onclick="toggleSelect(this)">
+    <div class="letter">✎</div>
+    <div class="content">
+      <h3>Revise — tell me which control or section</h3>
+      <p>Reply in terminal with specifics.</p>
+    </div>
+  </div>
+</div>

docs/superpowers/specs/mockups/README.md ADDED

	@@ -0,0 +1,50 @@

+# ACE Music Studio — UI mockups
+Visual source-of-truth for the design spec at `../2026-05-18-ace-music-studio-design.md`. Open the HTML files in a browser to see the rendered Brutalist Mono interface.
+| File | Tabs / screens covered | Source |
+|---|---|---|
+| [`01_generate_mobile_errors.html`](./01_generate_mobile_errors.html) | **Generate** tab fully expanded · 3 phone screens (Generate, Cover, Lyrics) · 6 error / edge-case states · in-progress generation banner | brainstorm session 24743 |
+| [`02_cover_extend.html`](./02_cover_extend.html) | **Cover** tab fully expanded · **Extend** tab fully expanded | brainstorm session 24743 |
+| [`03_edit_lyrics.html`](./03_edit_lyrics.html) | **Edit** tab fully expanded with both sub-modes (Repaint active, Flow Morph dimmed) · **Lyrics** tab fully expanded with Qwen 2.5 LM params | brainstorm session 24743 |
+## What every tab shares
+- Sticky header with brand "ACE Music Studio." and CTA: *Built with ♥. Drop a like · Follow @techfreakworm for what's next.*
+- Sidebar with 5 mode pills + session History list (desktop ≥ 1024 px)
+- 2-column body: form on left, output on right
+- LoRA stack section with 4 bundled preset chips + active stack rows (per-row strength slider + ×) + custom upload zone
+- Advanced accordion: BPM, key/scale, time sig, sampler, language, steps, CFG, shift, negative prompt, audio format, loudness, fade in/out, seed + lock
+- LM planner accordion: thinking, constrained decoding, temp / top-k / top-p / LM CFG, CoT toggles (metas / caption / lyrics / language), LM negative prompt, CoT override fields
+- DCW accordion: enabled, mode (single / double), wavelet, scaler, high scaler
+- Output panel: waveform · play/scrub · retake · stems (Demucs htdemucs_ft) · export (mp3 / wav / stems zip / meta JSON / share link) · full metadata JSON
+## What each tab adds
+- **Generate** — duration slider, vocals/instrumental pills, CFG-interval start/end, latent shift/rescale
+- **Cover** — reference-audio dropzone, cover-strength slider, cover-noise slider, compare-side-by-side toggle in output
+- **Extend** — seed-audio dropzone with auto-detected BPM/key, extension prompt, extra-duration slider, repaint mode, repaint strength, latent crossfade frames, WAV crossfade seconds, chunk mask mode, seed-boundary marker on output waveform, separate "extension-only" download
+- **Edit** — source audio + source/target lyrics, repaint-vs-flow-morph sub-mode pills, segment-selection bar with start/end timestamps, repaint sub-options (mode / chunk-mask / strength / crossfade), flow-morph sub-options (source caption / n_min / n_max / n_avg), A/B comparison in output
+- **Lyrics** — brief, structure sequence, language, per-section line counts (verse / chorus / bridge), tone descriptors, rhyme preference pills (strict / loose / none), LM params accordion (temp / top-p / top-k / rep penalty / max tokens / seed / show system prompt / enforce-tag-format), quick-refinement chips (more cryptic, less rhyme, etc.), variants
+## Mobile (phone)
+- Native `gr.Tabs` horizontal scroll strip at top (icons + first label visible)
+- Sidebar hidden via CSS media query at `< 640 px`
+- Output stacks below form
+- Sliders bounded by parent width (the desktop's pixel-art `━` characters were replaced with proper CSS slider tracks for mobile)
+## Error / edge states
+- **LoRAValidationError** — toast with module-mismatch diagnostics + "Remove from stack" / "View header diagnostics" actions
+- **ZeroGPU timeout** — auto-retry once at 2× duration, then warning toast with "Lower steps" / "Reduce duration" hints
+- **MPS op fallback** — info toast naming the op (e.g., `aten::_fft_r2c`), CPU fallback engaged via `PYTORCH_ENABLE_MPS_FALLBACK=1`
+- **Audio format rejected** — clear constraints (wav/mp3/flac, ≤ 60 s for Cover, ≤ 50 MB) + "Auto-convert + trim" action
+- **First-request warm-up** — informational banner ("Loading ACE-Step v1.5 XL SFT into MPS memory ~45 s")
+- **In-progress generation** — `gr.Progress`-driven banner with step / total, ETA, elapsed, cancel link
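The ZeroGPU timeout behaviour listed above (one automatic retry, then a warning) can be sketched in plain Python. This is a minimal illustration, not the Space's actual code: `GenerationTimeout` and `run_with_one_retry` are hypothetical names, and reading "2× duration" as "double the time budget" is an assumption.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class GenerationTimeout(Exception):
    """Hypothetical error raised when a job exceeds its GPU time budget."""


def run_with_one_retry(job: Callable[[float], T], budget_s: float) -> T:
    """Run job(budget_s); on timeout, retry exactly once with twice the budget."""
    try:
        return job(budget_s)
    except GenerationTimeout:
        # Second and final attempt. If this also times out, the exception
        # propagates to the caller, which is where the warning toast with
        # the "Lower steps" / "Reduce duration" hints would be raised.
        return job(budget_s * 2)
```

Keeping the retry count fixed at one bounds the worst-case wait at three times the original budget.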
+## Note on the "approve / revise" cards
+Each HTML file has a card-options block at the bottom — vestigial from the visual-companion brainstorm flow. It's harmless when viewed outside the companion (the `toggleSelect` call is a no-op without the companion's helper.js).
+If the blocks bother you, delete the trailing `<div class="options">…</div>` block from each file. Otherwise leave them: they document which question each mockup answered.

research/00_executive_summary.md ADDED

	@@ -0,0 +1,122 @@

+# Open-Source Song Generation for a Suno-Like Platform — Executive Summary
+*Research compiled 2026-05-18. Target hardware: Apple M5 Max, 128 GB unified memory, MPS backend. Deployment target: **free non-profit Hugging Face Space.** Commercial license is NOT a constraint.*
+---
+## TL;DR
+**Use ACE-Step 1.5 XL as the default base model.** It is the open-source full-song-with-vocals foundation model in May 2026 that combines:
+1. **First-class Apple Silicon support** (hybrid MLX + PyTorch MPS, dedicated `clockworksquirrel/ace-step-apple-silicon` fork) — best local-dev experience.
+2. **MIT license** — clean for forks, attribution, and weight redistribution on the HF Space.
+3. **State-of-the-art-or-better quality** — 4.4/5 vs Suno v4's 4.1/5 vocal naturalness in a 50-person blind test (folk, classical, jazz; Suno still wins pop/EDM polish).
+4. **Sub-minute generation** on M5 Max (projected ~30 – 50 s for a 4-min song). Sub-2 s/song on A100 — fits inside HF ZeroGPU's free 60 s budget.
+5. **Cheap LoRA fine-tuning**: 8 songs are trainable in ~1 hour on a single 3090, and LoRA training also works on MPS.
+6. **50+ languages**, vocals + instrumentation natively, **<4 GB VRAM minimum** — runs on free ZeroGPU Spaces.
+7. **Active 10.4 k-star repo**, native ComfyUI integration, AMD vendor-blessed for production.
+**Now that commercial use is not a constraint** (free non-profit HF Space deployment), **SongGeneration 2 / LeVo 2** comes back into contention as a premium-quality alternative — its Tencent non-commercial license permits academic/research/education use. Vendor benchmarks (unverified) put it ahead of Suno v5 on lyric accuracy. The trade-off is **22 – 28 GB VRAM** (needs paid Space tier, not free ZeroGPU) and no first-party MPS path (only a buggy community `SongGen-Mac` fork) — meaning M5 Max local dev is painful.
+Pair the primary pick with **HeartMuLa-MLX** as an alternate-quality choice (Apache 2.0, 2.1× faster than ACE-Step on M-series via Apple's MLX) and **YuE on Replicate** as the multilingual fallback.
+---
+## Ranking (non-profit HF Space context)
+| Rank | Model | Params | bf16 weights | License | MPS | Vocal Quality vs Suno | LoRA | Verdict |
+|---|---|---|---|---|---|---|---|---|
+| **1** | **ACE-Step 1.5 XL** | ~8 B (4 B DiT + 4 B planner) | ~16 GB | MIT | First-class | 4.4/5 vs Suno v4 4.1 (blind test) | ✅ 1h on 3090 | **Default base.** Fits free ZeroGPU. |
+| **2** | **SongGeneration 2 / LeVo 2** | 4 B | ~8 GB | Tencent non-commercial (OK for non-profit Space) | Buggy community fork only | Vendor PER 8.55 % vs Suno v5 12.4 % | ❌ | Premium quality. Needs paid Space (22 – 28 GB VRAM). |
+| **3** | **HeartMuLa** | ~6.8 B (4 B MuLa + 2 B Codec + 0.8 B ASR) | ~13.6 GB | Apache 2.0 | Strong MLX port | Vendor: lowest PER per-language, unverified | ❌ public | Strong A/B alternate. |
+| **4** | **DiffRhythm 2** | ~1.17 B (1 B DiT + 170 M VAE-dec) | ~2.4 GB | Apache 2.0 | Likely OK, untested | Authors admit gap vs Suno v4.5 | ❌ no training code | Speed tier. 210 s ceiling. Cheapest to host. |
+| **5** | **YuE** | ~8 B (7 B + 1 B + upsampler) | ~16 GB | Apache 2.0 | ❌ broken (flash-attn hard dep) | Vocal range matches Suno v4 | ✅ LoRA, CUDA-only | Multilingual specialist; via Replicate only. |
+| — | SongBloom | 2 B | ~4 GB | Custom (likely NC) | Reported OK | unknown | ❌ | Research baseline. |
+| — | InspireMusic / FunMusic | 1.5 B | ~3 GB | Apache 2.0 | ❌ CUDA-only deps | No vocals yet | n/a | Skip until vocal release. |
+---
+## Decision tree (non-profit HF Space deployment)
+```
+HF Space tier?
+  ├── Free ZeroGPU (60s/req on shared A100) ─┐
+  │                                          ├── ACE-Step 1.5 (turbo workflow generates a song well under 60 s)
+  │                                          └── DiffRhythm 2 (smallest, fastest, fits easily)
+  │
+  └── Paid GPU Space (A10G / A100 dedicated) ─┐
+                                              ├── Default: ACE-Step 1.5 XL (best speed-quality, MPS for local dev)
+                                              ├── Premium tier: SongGeneration 2 v2-large (best vendor benchmarks)
+                                              ├── Multilingual breadth: YuE (5 languages via Replicate; local broken)
+                                              └── Alternate: HeartMuLa via heartlib-mlx
+```
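The decision tree above can also be expressed as a small lookup, which is handy if the Space needs to choose a model list at start-up. This is a sketch only: the tier keys `"zerogpu"` and `"paid"` and the function name `shortlist` are hypothetical, and the model labels are copied from the tree.

```python
# Model shortlists copied from the decision tree; the tier keys are hypothetical.
SHORTLIST = {
    "zerogpu": ["ACE-Step 1.5 (turbo workflow)", "DiffRhythm 2"],
    "paid": [
        "ACE-Step 1.5 XL",
        "SongGeneration 2 v2-large",
        "YuE (via Replicate)",
        "HeartMuLa (heartlib-mlx)",
    ],
}


def shortlist(tier: str) -> list[str]:
    """Return candidate models for a Space tier, default choice first."""
    if tier not in SHORTLIST:
        raise ValueError(f"unknown tier {tier!r}; expected one of {sorted(SHORTLIST)}")
    # Return a copy so callers cannot mutate the shared table.
    return list(SHORTLIST[tier])
```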
+---
+## What the research surfaced that changes the picture
+1. **Non-profit HF Space deployment removes the Tencent-license blocker.** SongGeneration 2 / LeVo 2 is back in contention as a premium-quality alternative. Its custom license permits "academic, research, and education purposes" — a free non-profit Space sits comfortably inside that scope. Practical blockers remain (22 – 28 GB VRAM means paid Space tier, no working MPS) but the licence is no longer a no-go.
+2. **The YuE team migrated to ACE-Step.** The ACE-Step paper (Jun 2025) explicitly critiques YuE for "slow inference and structural artifacts." YuE's repo has been dormant since 2025-06-04. Treat YuE as a frozen capability, not a developing one.
+3. **Vocal-support contradiction on ACE-Step is resolved: yes, it does vocals.** Several search results said "instrumental only" — that's confused with the `Text2Samples` LoRA. The base model produces vocals + instruments natively, lyric-conditioned, with `[verse] [chorus] [bridge]` structural tags.
+4. **DiffRhythm 2's biggest fix is structural coherence**, not raw quality. The brutal Hacker News thread on v1 complained of "no identifiable chorus in any of the demo songs"; v2's block flow-matching (semi-autoregressive over 2 s blocks) closes that gap. The **210 s ceiling is a regression** from v1-full's 4m45s (285 s).
+5. **HeartMuLa is the dark-horse 2026 entrant.** Apache 2.0, a 4 B-param MuLa LM (~6.8 B total with codec and ASR), modular (CLAP + Transcriptor + Codec + MuLa LM), MLX port available. Vendor PER claims are aggressive (0.09 EN / 0.12 ZH) but are not in units comparable to LeVo's 8.55 %, so a direct comparison is unreliable until somebody runs a neutral A/B.
+6. **Every "beats Suno v5" claim is vendor-published.** The only neutral preference study located ([arXiv 2506.19085](https://arxiv.org/html/2506.19085v1)) stops at Suno v3.5. **Plan an in-house blind A/B before betting product positioning on any vendor number.**
+7. **Apple Silicon is fine for music gen — much friendlier than LTX-Video 2.3.** No complex64, no SDPA-on-meta-tensor traps, no multimodal-Gemma gotchas. The mundane MPS issues here are: `flash-attn` substitution with SDPA, fp16 conv1d → fp32 in audio decoders, `PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0` for OOM tuning. Three of the five candidate models already ship a working MPS or MLX path.
+8. **HF Space hardware tier dictates the model choice as much as quality does.** Free ZeroGPU = 60 s budget per request, shared A100 — only ACE-Step or DiffRhythm 2 finish in time. Paid A10G/A100 Spaces unlock SongGeneration 2 v2-large but the user has to pay (or get an HF community grant).
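The MPS environment knobs named in item 7 are easiest to apply once, before PyTorch is imported. The sketch below is an assumption about how one might wire that up, not code from any of the surveyed repos: `apply_mps_env` is a hypothetical helper, and `PYTORCH_ENABLE_MPS_FALLBACK=1` is taken from the mockups README rather than from this summary.

```python
import os
from collections.abc import MutableMapping

# Values taken from the text; both are read by PyTorch at import time.
MPS_ENV = {
    "PYTORCH_ENABLE_MPS_FALLBACK": "1",         # run unsupported ops on CPU
    "PYTORCH_MPS_HIGH_WATERMARK_RATIO": "0.0",  # lift the MPS allocation cap
}


def apply_mps_env(env: MutableMapping[str, str] = os.environ) -> dict[str, str]:
    """Set the MPS knobs unless the user already exported their own values."""
    for key, value in MPS_ENV.items():
        env.setdefault(key, value)
    return {key: env[key] for key in MPS_ENV}
```

Using `setdefault` keeps any value the user exported in their shell, which matters when tuning the watermark ratio per machine.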
+---
+## Recommended starting setup for the M5 Max (with HF Space deploy in mind)
+```bash
+# 1. Primary base model — ACE-Step 1.5 XL via the Apple Silicon fork
+git clone https://github.com/clockworksquirrel/ace-step-apple-silicon \
+  ~/Projects/llm/music-generator/ace-step
+cd ~/Projects/llm/music-generator/ace-step
+python3.11 -m venv .venv && source .venv/bin/activate
+pip install -r requirements.txt
+# Hybrid backend: Qwen3 planner → MLX, DiT decoder → PyTorch MPS, bf16 throughout
+# ~16 GB bf16 weights for the XL stack; M5 Max 128 GB has massive headroom
+# 2. Production UI — ace-step-ui (stem extraction, library, LAN access)
+git clone https://github.com/fspecii/ace-step-ui \
+  ~/Projects/llm/music-generator/ace-step-ui
+# 3. Alternate model — HeartMuLa via MLX port (~13.6 GB bf16)
+git clone https://github.com/Acelogic/heartlib-mlx \
+  ~/Projects/llm/music-generator/heartlib-mlx
+# 4. (Optional) Premium-quality experiment — SongGeneration 2 / LeVo 2
+# Mac fork has a pre-chorus bug; only do this if you're OK developing on a rented
+# Linux+CUDA box and the M5 Max becomes just your control plane.
+git clone https://github.com/tencent-ailab/SongGeneration \
+  ~/Projects/llm/music-generator/songgeneration
+```
+For the throughput-sensitive **multilingual fallback (YuE)**, use Replicate's `fofr/yue` endpoint — do *not* attempt local inference on M5 Max until somebody ports Stage-1 to MPS. Treat YuE as remote-only for now.
+**HF Space deployment notes:**
+- **Free ZeroGPU Space** → only ACE-Step or DiffRhythm 2 will finish a song inside the 60 s shared-A100 budget. Use ACE-Step's turbo workflow.
+- **Paid GPU Space** → A10G (24 GB) handles ACE-Step XL comfortably; A100 (40 GB) opens the door to SongGeneration 2 v2-large.
+- **Apply for a [Community GPU Grant](https://huggingface.co/docs/hub/en/spaces-gpus#community-gpu-grants)** if budget is the deciding factor — HF approves these regularly for non-profit demos.
+---
+## Sources
+All claims are cited inline in the per-model deep-dives:
+- [01_yue.md](./01_yue.md)
+- [02_diffrhythm.md](./02_diffrhythm.md)
+- [03_acestep.md](./03_acestep.md)
+- [04_newcomers_and_survey.md](./04_newcomers_and_survey.md)
+- [05_apple_silicon_mps_audit.md](./05_apple_silicon_mps_audit.md)
+- [06_comparison_matrix.md](./06_comparison_matrix.md) — side-by-side spec table
+- [07_platform_architecture.md](./07_platform_architecture.md) — Suno-clone system design with ACE-Step at the core

research/01_yue.md ADDED

	@@ -0,0 +1,268 @@

+# YuE — Open Full-Song Music Generation Foundation Model
+*Research date: 2026-05-18*
+---
+## 1. Overview
+**YuE** (乐, "yue" — Chinese for "music") is an open-source family of long-form, lyrics-to-song foundation models that produce vocals + accompaniment end-to-end, explicitly positioned as the open competitor to Suno.ai and Udio. It was built by the **M-A-P (Multimodal Art Projection) collective**, led by researchers at **HKUST (Hong Kong University of Science and Technology)** with collaborators from multiple academic and industry institutions (58 authors are credited on the paper, with hardware support from Geely and Moonshot AI) ([arXiv 2503.08638](https://arxiv.org/abs/2503.08638), [HF model card](https://huggingface.co/m-a-p/YuE-s1-7B-anneal-en-icl)).
+**Release timeline:**
+- **2025-01-26** — Initial YuE-s1-7B series released ([GitHub README](https://github.com/multimodal-art-projection/YuE))
+- **2025-01-30** — Apache 2.0 license adopted; dual-track ICL mode added
+- **2025-02-07** — Windows / Pinokio support
+- **2025-02-17** — Music continuation + Google Colab support
+- **2025-03-11/12** — Anneal checkpoints + technical report on arXiv (v1)
+- **2025-06-04** — LoRA fine-tuning code merged (PR #126)
+- **ICLR 2026** — Paper presented
+**Current status (May 2026): effectively frozen / community-maintained.** The official `multimodal-art-projection/YuE` repo's last commit is **2025-06-04** (GitHub API, retrieved 2026-05-18), roughly 11 months stale. There is no announced YuE-2 or successor from the M-A-P org. All forward development (quantization, ComfyUI, GUI, MPS attempts, exllama, mp3 extension) now happens in community forks like [YuEGP](https://github.com/deepbeepmeep/YuEGP), [YuE-exllamav2](https://github.com/sgsdxzy/YuE-exllamav2), and [YuE-extend](https://github.com/Mozer/YuE-extend). The space the team itself has moved into is **ACE-Step** (released January 2026), whose paper explicitly critiques YuE for "slow inference and structural artifacts" ([arXiv 2506.00045](https://arxiv.org/abs/2506.00045)).
+---
+## 2. Architecture
+YuE is a **two-stage autoregressive LLM** pipeline built on the **LLaMA2** decoder-only transformer backbone — *not* a diffusion model ([paper](https://arxiv.org/html/2503.08638v1)).
+**Stage-1 LM (the headline 7B model):**
+- LLaMA2-style decoder, ~6B–7B parameters (HF metadata reports 6B for the s1 checkpoints).
+- Performs **track-decoupled next-token prediction**: interleaves *vocal* and *instrumental* token streams in a single sequence, so a single AR pass produces both tracks rather than mixing them. This is YuE's central architectural innovation.
+- Conditioned on (genre tags || lyrics) using **structural progressive conditioning** — lyrics are chunked per section (verse/chorus/bridge) and re-injected so attention does not lose alignment over a 5-minute generation.
+- Native context: 8192 tokens (~163 s of mix-track audio, ~81 s of dual-track); extended to **16384** in the anneal phase.
+**Stage-2 LM:**
+- 1B-parameter LLaMA2 model (HF reports ~2B for `YuE-s2-1B-general`).
+- Predicts the **residual RVQ codebooks (layers 1–7)** conditioned on Stage-1's codebook-0 output, restoring acoustic fidelity that the semantic-rich layer-0 tokens omit.
+- Context length 8192.
+**Audio tokenizer — X-Codec:**
+- YuE uses **X-Codec** (from the same M-A-P lineage as MERT), a *semantic-acoustic fused* RVQ codec that bolts a HuBERT-based semantic stream onto an RVQ-VAE acoustic stream.
+- 12 RVQ codebooks total; YuE uses the first **8** (codebook size 1024 each).
+- 50 Hz frame rate over 16 kHz audio.
+- A separate **YuE-upsampler** (GAN-based) upsamples the 16 kHz output to a higher sample rate and better fidelity for delivery ([paper §3](https://arxiv.org/html/2503.08638v1), [HF Transformers X-Codec docs](https://huggingface.co/docs/transformers/main/model_doc/xcodec)).
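The context-length figures quoted for Stage-1 follow directly from the 50 Hz frame rate. The check below assumes Stage-1 spends one codebook-0 token per frame per track and ignores the tokens taken up by the genre and lyrics prompt; `context_seconds` is a hypothetical helper name.

```python
FRAME_RATE_HZ = 50  # X-Codec frames per second, from the tokenizer notes


def context_seconds(context_tokens: int, tracks: int) -> float:
    """Seconds of audio that fit if every frame costs one token per track."""
    return context_tokens / (FRAME_RATE_HZ * tracks)


print(context_seconds(8192, tracks=1))   # 163.84 (mix-track)
print(context_seconds(8192, tracks=2))   # 81.92 (dual-track)
print(context_seconds(16384, tracks=2))  # 163.84 (dual-track after annealing)
```

Under these assumptions the anneal-phase window of 16384 tokens gives dual-track generation the same span that mix-track had at 8192.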
+**Track handling:** Dual-track. Vocal and accompaniment are *separately tokenized* via X-Codec, then interleaved in the AR sequence — this is the paper's claimed advantage over single-track-mixture baselines (less information loss, cleaner vocal/inst separation).
+**Max generation length:** Up to **~5 minutes** per song, generated in chunks/sessions and stitched.
+**Lyrics conditioning:** Plain text lyrics with section tags ([verse], [chorus], etc.) + a genre tag prompt (a vocabulary from `top_200_tags.json` such as "pop", "female vocal", "energetic", "120 bpm"). The progressive conditioning means each new section re-references the relevant lyric chunk.
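Chunking lyrics per section, as the progressive conditioning requires, can be illustrated with a small parser. This is a sketch under assumptions: the exact input format YuE's scripts expect is not specified here, and `split_sections` is a hypothetical helper rather than part of the YuE codebase.

```python
import re

# A tag line is a single bracketed label on its own line, e.g. "[verse]".
TAG_RE = re.compile(r"^\[([^\]]+)\]\s*$")


def split_sections(lyrics: str) -> list[tuple[str, str]]:
    """Split tagged lyrics into (tag, text) pairs in song order.

    Text that appears before the first tag is dropped.
    """
    sections: list[tuple[str, str]] = []
    tag, lines = None, []
    for raw in lyrics.splitlines():
        match = TAG_RE.match(raw.strip())
        if match:
            if tag is not None:
                sections.append((tag, "\n".join(lines).strip()))
            tag, lines = match.group(1).lower(), []
        elif tag is not None:
            lines.append(raw)
    if tag is not None:
        sections.append((tag, "\n".join(lines).strip()))
    return sections
```

Each returned pair is one chunk that could be re-injected when its section is generated.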
+**Training scale:** Stage-1 used ~**2T tokens** across phases; data includes ~**650K hours of in-the-wild music** plus ~**70K hours of TTS** for vocal grounding ([paper](https://arxiv.org/html/2503.08638v1)).
+---
+## 3. Variants and Sizes
+From the [M-A-P YuE collection on HuggingFace](https://huggingface.co/collections/m-a-p/yue-6797d55e22990ae89b90a3d6) (downloads accurate as of mid-2026):
+| Model | Params | Stage | Language | Mode | Downloads (last month) |
+|---|---|---|---|---|---|
+| `YuE-s1-7B-anneal-en-cot` | 6B | 1 | English | Chain-of-Thought (default) | 8.48k |
+| `YuE-s1-7B-anneal-en-icl` | 6B | 1 | English | In-Context Learning (style cloning) | 805 |
+| `YuE-s1-7B-anneal-zh-cot` | 6B | 1 | Mandarin/Cantonese | CoT | 203 |
+| `YuE-s1-7B-anneal-zh-icl` | 6B | 1 | Mandarin/Cantonese | ICL | 89 |
+| `YuE-s1-7B-anneal-jp-kr-cot` | 6B | 1 | Japanese/Korean | CoT | 95 |
+| `YuE-s1-7B-anneal-jp-kr-icl` | 6B | 1 | Japanese/Korean | ICL | 25 |
+| `YuE-s2-1B-general` | 2B | 2 | language-agnostic | residual decoder | 6.01k |
+| `YuE-s1-0.5B` | 0.5B | 1 | research/ablation | partial training | 94 |
+| `YuE-upsampler` | – | post | n/a | GAN upsampler | – |
+| `xcodec_mini_infer` | – | tokenizer | n/a | X-Codec encoder/decoder | – |
+**Naming key:**
+- `s1` / `s2` = Stage-1 (semantic) / Stage-2 (acoustic residual).
+- `anneal` = checkpoints after the final "annealing" pretraining phase (highest quality public weights).
+- `cot` = chain-of-thought prompting variant; `icl` = in-context learning variant (used for *style/voice cloning* from a reference audio).
+- A community **GGUF quantization** of the Stage-2 model exists at [`multimodalart/YuE-s2-1B-general-Q8_0-GGUF`](https://huggingface.co/multimodalart/YuE-s2-1B-general-Q8_0-GGUF) — useful for Mac llama.cpp paths.
+There is **no official "YuE-2" or major version bump**. The team's successor effort is the separately branded ACE-Step.
+---
+## 4. License
+**Apache License 2.0** for code *and* weights — switched on 2025-01-30 in response to community pressure ([GitHub README news entry](https://github.com/multimodal-art-projection/YuE), [HF model card](https://huggingface.co/m-a-p/YuE-s1-7B-anneal-en-icl)).
+- **Commercial use:** *Permitted and explicitly encouraged.* The model card says: "Artists and content creators are encouraged to sample and incorporate outputs into their own works, and even monetize them, with attribution to the model's name (\"YuE by HKUST/M-A-P\")."
+- **Attribution:** Required for public / commercial outputs.
+- **Recommended labeling:** outputs should be marked "AI-generated", "YuE-generated", "AI-assisted", or "AI-auxiliated".
+- **No training-data redistribution clause** — Apache 2.0 covers code and the released weights; training data itself was *not* released, so no redistribution permission is granted on data.
+- **Liability:** users bear sole responsibility for any copyright infringement, plagiarism, or misuse. Outputs likely carry no built-in watermarking or content credentials (the docs do not confirm this either way).
+Practical takeaway for the user's Suno-like platform: **YuE is one of the very few music-generation foundation models with a clean, no-strings commercial license**, which is the single most valuable thing about it.
+---
+## 5. Languages Supported
+Five officially: **English, Mandarin Chinese, Cantonese, Japanese, Korean** ([GitHub README](https://github.com/multimodal-art-projection/YuE), [demo page](https://map-yue.github.io/)).
+- English has the deepest training and the most-downloaded checkpoint.
+- `zh` covers Mandarin and Cantonese (sharing a checkpoint).
+- `jp-kr` shares one checkpoint for Japanese and Korean.
+- The demo site shows code-switching (English ↔ Mandarin within the same song) working.
+- No official support for Spanish, French, German, Hindi, Arabic, etc. — outputs in those languages will likely be poor or accented (no direct user reports confirm, but architecturally the model has never seen them at scale).
+---
+## 6. Quality Assessment
+**Strengths (from paper + demos):**
+- Wide vocal range — the paper reports YuE "closely matching top-performing closed-source systems like Suno V4" on vocal-range metrics ([WhiteFiber summary](https://www.whitefiber.com/blog/yue-ai-music-generator)).
+- Strong **musical structure** — verse/chorus/bridge transitions are coherent over 3–5 min, which most diffusion music models still struggle with.
+- Demos show death-growl metal, scatting jazz, Beijing opera, rap, ballad, country, and soul — *genre breadth* is genuinely impressive ([map-yue.github.io](https://map-yue.github.io/)).
+- ICL mode can clone the timbre/style of a reference clip — closest open-source analogue to Suno's "cover" or Udio's style transfer.
+**Weaknesses (from paper's own discussion + community feedback):**
+- **Acoustic fidelity gap.** Multiple sources, including the paper itself, note "clear deficiencies in vocal and accompaniment acoustic quality, likely due to limitations of its current audio tokenization method"; the authors propose super-resolution / better decoders as future work.
+- **Mono / narrow stereo image** — third-party reviews call out that output "lacks the production quality needed for commercial music platforms" and is essentially mono ([articlex review](https://www.articlex.com/open-source-ai-music-generation-breakthrough-with-yue-software/)).
+- **Slow inference + structural artifacts** — the explicit critique from the ACE-Step authors (ICLR 2026 submission): "LLM-based models like YuE excel at lyrics alignment but suffer from slow inference and structural artifacts" ([ACE-Step paper](https://arxiv.org/abs/2506.00045)).
+- **Mumbling / lyric drift** appears in long sections. No Reddit thread documenting it surfaced in this research, but the paper's "Section 12 Unsuccessful Attempts" and the emphasis on `--repetition-penalty` and decoding temperature in the GitHub Issues suggest users hit it.
+**Quality verdict vs Suno v4 / v5:**
+- Suno v4 ≈ YuE on *vocal range and genre breadth.*
+- Suno v4/v5 clearly ahead on *mix polish, stereo width, vocal clarity, and emotional nuance.*
+- YuE ahead of Suno only on *openness, controllability via lyrics tags, and structural macro-form for niche genres*.
+---
+## 7. Inference Performance
+From the README's official hardware table:
+| GPU | Time for 30 s of audio (Stage-1 + Stage-2) |
+|---|---|
+| NVIDIA H800 80GB | **~150 s** |
+| NVIDIA RTX 4090 24GB | **~360 s** |
+| ≤24GB GPU | Max ~2 concurrent sessions; cannot generate a full song in one pass |
+| ≥80GB GPU (H100/A100/H800) | Recommended for a full 4+ session song |
+Extrapolating to a **3-minute song** (~6× a 30 s clip, plus some overhead for stitching):
+- H800: ~15–18 minutes
+- A100 80GB: ~18–22 minutes (likely — close to H800 throughput)
+- RTX 4090: ~35–45 minutes
+- M5 Max MPS (user's machine): **no official support, no public benchmark.**
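The lower bounds of the extrapolated ranges above come from simple linear scaling of the README's 30 s timings. The sketch below reproduces that arithmetic only; it assumes generation time scales linearly with length and leaves out the stitching overhead that pushes the estimates toward the upper end of each range.

```python
# Wall-clock seconds to generate 30 s of audio, from the README hardware table.
SECONDS_PER_30S_CLIP = {"H800": 150, "RTX 4090": 360}


def full_song_minutes(gpu: str, song_seconds: int = 180) -> float:
    """Linear estimate of generation time in minutes, without stitching overhead."""
    clips = song_seconds / 30
    return clips * SECONDS_PER_30S_CLIP[gpu] / 60


print(full_song_minutes("H800"))      # 15.0
print(full_song_minutes("RTX 4090"))  # 36.0
```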
+**VRAM:** Unquantized FP16 Stage-1 needs ~16–18 GB; Stage-2 + upsampler add ~4–6 GB. Single-pass full-song generation comfortably wants 40–80 GB.
+**Quantized / community paths:**
+- **YuEGP** ("YuE for the GPU Poor") brings VRAM down to **<10 GB** via 8-bit quantization and sequential offload ([YuEGP repo](https://github.com/deepbeepmeep/YuEGP)).
+- **YuE-exllamav2** claims up to **5× speedup** via ExLlamaV2 + FlashAttention-2 + BF16 ([YuE-exllamav2](https://github.com/sgsdxzy/YuE-exllamav2)) — NVIDIA-only.
+- **GGUF Stage-2** exists ([multimodalart/YuE-s2-1B-general-Q8_0-GGUF](https://huggingface.co/multimodalart/YuE-s2-1B-general-Q8_0-GGUF)). Stage-1 7B GGUF is not officially published as of 2026-05.
+**Apple Silicon / MPS:**
+- **No official MPS support.** GitHub README references `--cuda_idx`, no `mps` or `mac` mentions.
+- No HF Space or fork advertises working MPS inference. The architecture is plain LLaMA2 + standard transformer ops, so an MPS port is *technically feasible* (Stage-1 likely fits well within the user's 128GB unified memory), but the X-Codec encoder/decoder has Flash-Attention CUDA kernels that would need replacement. Realistic path on M5 Max today: run the Stage-2 GGUF via the llama.cpp Metal backend; Stage-1 has no public Metal/MPS port.
+- No community MPS port attempt has surfaced in any search or GitHub issue as of May 2026.
+---
+## 8. Repo Health
+Data from the GitHub API on 2026-05-18 for `multimodal-art-projection/YuE`:
+- **Stars:** 6,219
+- **Forks:** 741
+- **Open issues:** 86
+- **License:** Apache-2.0
+- **Default branch last push:** `2025-06-04T13:08:48Z` — **~11 months stale**
+- **Most-recent commits:** all README edits and the finetune-merge PRs on the same day (2025-06-04).
+- **Recent issue traffic (sampled 2025-Q4 through 2026-Q2):** install errors (CUDA / `codecmanipulator` missing), ComfyUI integration questions, attention-mask warnings, "how do I generate a full song" basics, a Feb-2026 PR proposing `SDPA as default attention` that received zero engagement. Maintainer responses are essentially absent in 2026.
+- **Fine-tuning support:** present, merged June 2025 via PR #126 (LoRA, no QLoRA, requires CUDA 12.1+, PyTorch 2.4, Megatron-formatted JSONL data).
+- **vLLM / SGLang:** listed in TODO, never implemented.
+- **llama.cpp:** community Stage-2 GGUF exists but no official integration; Stage-1 not converted.
+- **Tensor parallel / Stemgen mode:** TODO, never shipped.
+**Verdict:** The repo is in **maintenance/abandonment limbo.** Apache 2.0 + open weights mean anyone can fork; community forks are where the energy is.
+---
+## 9. Real-World Adoption
+- **Replicate:** Hosted at [`fofr/yue`](https://replicate.com/fofr/yue/api) with an official cog wrapper at [`replicate/cog-yue`](https://github.com/replicate/cog-yue) — production-ready pay-per-second API.
+- **HuggingFace Spaces:** at least three live demos — [`fffiloni/YuE`](https://huggingface.co/spaces/fffiloni/YuE), [`innova-ai/YuE-music-generator-demo`](https://huggingface.co/spaces/innova-ai/YuE-music-generator-demo), `Harveyu/YuE-music-generator-demo`.
+- **ComfyUI:** community node [`smthemex/ComfyUI_YuE`](https://github.com/smthemex/ComfyUI_YuE) exposes YuE as a node graph (issue #148 confirms active users in 2026).
+- **Pinokio:** one-click Windows installer ships in the official Pinokio script directory ([pinokio.co](https://pinokio.co/)).
+- **GPU-poor / consumer forks:** `deepbeepmeep/YuEGP` (sub-10 GB VRAM), `sgsdxzy/YuE-exllamav2` (5× speedup), `Mozer/YuE-extend` (mp3 extension + GUI), `Sorrymakershen/YuE-for-windows`.
+- **SiliconFlow:** no public listing found as of 2026-05 (a search returned no SiliconFlow YuE endpoint).
+- **Forks:** 741 total, dominated by consumer-VRAM optimization rather than research extension.
+For a Suno-like platform, the **Replicate `fofr/yue` endpoint is the lowest-friction starting point** to test quality before self-hosting.
+---
+## 10. Fine-Tuning
+- **LoRA fine-tuning is documented and supported** since June 2025, in the [`finetune/` directory](https://github.com/multimodal-art-projection/YuE/tree/main/finetune) with `scripts/preprocess_data.sh` and `scripts/run_finetune.sh`.
+- Configurable `LORA_R`, `LORA_ALPHA`, `LORA_DROPOUT`.
+- **Training scripts are open** — Megatron-style data pipeline; data must be converted to JSONL containing X-Codec tokens + lyric/structure/genre metadata, then to Megatron binary.
+- **QLoRA: not documented.** No 4-bit fine-tuning path is described in the official repo (though community forks may have hacked one together).
+- Requires CUDA 12.1+, PyTorch 2.4, Python 3.10; GPU memory not explicitly stated but realistically wants ≥40 GB VRAM for the 7B Stage-1 LoRA.
+- No published guide for full-parameter fine-tuning of Stage-1 — implied to need multi-node H100.
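+To make the `LORA_R` and `LORA_ALPHA` knobs above concrete, here is a minimal sketch of the standard LoRA arithmetic: an adapter on a `d_in × d_out` weight adds `r · (d_in + d_out)` trainable parameters, and the low-rank update is scaled by `alpha / r`. The 4096 hidden size is an assumption for a LLaMA2-7B-class projection, not a value read from the YuE fine-tune scripts, and the helper names are hypothetical.
```python
# Illustrative arithmetic only. The layer shape is an assumption,
# not a value taken from the YuE fine-tune scripts.


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def lora_scaling(alpha: float, rank: int) -> float:
    """Standard LoRA scaling factor applied to the low-rank update B @ A."""
    return alpha / rank


if __name__ == "__main__":
    hidden = 4096  # assumed hidden size of a LLaMA2-7B-class projection
    frozen = hidden * hidden
    for r in (8, 16, 64):
        added = lora_trainable_params(hidden, hidden, r)
        print(f"r={r:>2}: {added:,} adapter params vs {frozen:,} frozen ({added / frozen:.2%})")
    print("scaling for alpha=32, r=16:", lora_scaling(32, 16))
```
+Because adapter size grows linearly with the rank, raising `LORA_R` is a cheap way to add capacity compared with unfreezing the base weights.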
+---
+## 11. Pros and Cons
+**Pros**
+- True open weights (Apache 2.0), commercial-use-friendly, with attribution as the only notable requirement.
+- Genuine dual-track output (vocals + instrumentals as separable streams), not just a mix.
+- Multilingual coverage of EN / ZH / Cantonese / JP / KR with code-switching demos.
+- Strong macro-structure for 3–5 minute songs — verses, choruses, bridges hold together.
+- Healthy ecosystem of quantized / consumer-VRAM forks and a turnkey Replicate endpoint.
+- LoRA fine-tuning code is shipped and merged.
+- Comparable vocal range to Suno v4 on the paper's metrics.
+**Cons**
+- **Repo is effectively dormant since June 2025** — no maintainer engagement on 2026 issues/PRs.
+- Acoustic fidelity is noticeably below Suno v4/v5 — mono-ish, less polished mix, occasional vocal artifacts/mumbling on long passages.
+- **No MPS / Apple Silicon support**, official or community — a real problem for the user's M5 Max workflow.
+- Slow inference even on H800 (~150 s per 30 s clip, i.e. 15+ minutes per full song before quantization).
+- VRAM hungry: full-song single-pass wants 80 GB; consumer GPUs need session-stitching tricks.
+- No QLoRA / no vLLM / no SGLang / no tensor parallel — all in TODO purgatory.
+- Training data not released → fine-tuning needs you to bring your own licensed corpus.
+- Tokenizer (X-Codec) is the bottleneck for fidelity, and YuE inherits this ceiling — no upgrade path planned in this codebase.
+- An explicit successor effort (ACE-Step) from an adjacent team claims to fix YuE's specific weaknesses.
+---
+## 12. Verdict for the User's Suno-like Platform
+**Best fit for the user's M5 Max / 128 GB platform if:**
+- The product needs **commercial-grade licensing freedom** above all else — YuE is one of the very few open music models you can ship in a paid product without licensing carve-outs.
+- You target **multilingual song generation (EN + Mandarin/Cantonese + JP/KR)** with code-switching — YuE is the strongest open option here.
+- You can offload generation to a **rented H100/H800 (Replicate, Runpod, Lambda)** rather than insisting on local M5 Max inference — *MPS support is the blocker on the user's hardware.*
+- You want a base to **LoRA fine-tune on a proprietary genre/voice corpus** — the official fine-tune scripts work today, and Apache 2.0 lets you keep your LoRA private and commercial.
+**Where YuE will underperform competitors:**
+- **Acoustic polish** — Suno v4/v5 and Udio will sound noticeably more professional out of the box. If your platform's selling point is "studio-quality vocals", YuE is not there.
+- **Throughput per dollar** — diffusion-based ACE-Step and DiffRhythm-2 are dramatically faster (ACE-Step claims ~15× speedup); for a high-volume product, the AR-LLM architecture is expensive.
+- **Real-time / interactive generation** — not viable; YuE is batch-only.
+- **Local Mac inference** — until somebody ports Stage-1 to MPS or ships a Stage-1 GGUF, the user's M5 Max can at best play around with the Stage-2 model in llama.cpp Metal mode.
+**Concrete recommendation for the user:** use YuE via Replicate's `fofr/yue` endpoint as the **commercial-license-clean fallback / multilingual specialist** in the platform's model router, and seriously evaluate ACE-Step in parallel for the throughput-sensitive default path. Plan a future LoRA fine-tune on YuE only after the platform has clear vertical (genre, language, or vocal-style) demand that the closed APIs cannot serve.
+---
+## References
+- GitHub repo: <https://github.com/multimodal-art-projection/YuE>
+- Paper (arXiv): <https://arxiv.org/abs/2503.08638>
+- Paper (HTML): <https://arxiv.org/html/2503.08638v1>
+- OpenReview: <https://openreview.net/forum?id=hZy6YG2Ij8>
+- Project / demos: <https://map-yue.github.io/>
+- HF collection: <https://huggingface.co/collections/m-a-p/yue-6797d55e22990ae89b90a3d6>
+- HF s1 English ICL card: <https://huggingface.co/m-a-p/YuE-s1-7B-anneal-en-icl>
+- Replicate: <https://replicate.com/fofr/yue/api>
+- Replicate cog: <https://github.com/replicate/cog-yue>
+- YuEGP fork: <https://github.com/deepbeepmeep/YuEGP>
+- YuE-exllamav2 fork: <https://github.com/sgsdxzy/YuE-exllamav2>
+- YuE-extend fork: <https://github.com/Mozer/YuE-extend>
+- ComfyUI node: <https://github.com/smthemex/ComfyUI_YuE>
+- GGUF Stage-2: <https://huggingface.co/multimodalart/YuE-s2-1B-general-Q8_0-GGUF>
+- HF X-Codec docs: <https://huggingface.co/docs/transformers/main/model_doc/xcodec>
+- ACE-Step paper (successor-style critique): <https://arxiv.org/abs/2506.00045>
+- WhiteFiber technical summary: <https://www.whitefiber.com/blog/yue-ai-music-generator>
+- HF Space demo (fffiloni): <https://huggingface.co/spaces/fffiloni/YuE>
+- HF Space demo (innova-ai): <https://huggingface.co/spaces/innova-ai/YuE-music-generator-demo>

research/02_diffrhythm.md ADDED

	@@ -0,0 +1,138 @@

+# DiffRhythm and DiffRhythm 2 — Deep Technical Review
+*Compiled 2026-05-18. All claims cited; speculation flagged inline.*
+## 1. Overview
+DiffRhythm is the first open-source **latent-diffusion full-song generator** — vocals + accompaniment, end-to-end, from lyrics and a style prompt — built by the **Audio, Speech and Language Processing (ASLP) Lab at Northwestern Polytechnical University (NWPU)** in Xi'an, China, with later contributions from **Xiaomi Research** ([arxiv.org/abs/2503.01183](https://arxiv.org/abs/2503.01183), [github.com/ASLP-lab/DiffRhythm](https://github.com/ASLP-lab/DiffRhythm)). DiffRhythm v1 dropped on **arXiv 3 Mar 2025**; the full 4m45s variant followed on **15 Mar 2025**, and an iterative v1.2 fixed repetition and audio-quality issues mid-2025 ([HF v1.2 commit](https://huggingface.co/spaces/ASLP-lab/DiffRhythm/commit/f5b749d65f62e30bdaad11e6866edc8d3b078b71)). **DiffRhythm 2** appeared on **arXiv 27 Oct 2025** (v3 revised 3 Feb 2026) under [arxiv.org/abs/2510.22950](https://arxiv.org/abs/2510.22950), and was open-sourced at [github.com/ASLP-lab/DiffRhythm2](https://github.com/ASLP-lab/DiffRhythm2) (forked from `xiaomi-research/diffrhythm2`) on **30 Oct 2025**, with HuggingFace weights at [huggingface.co/ASLP-lab/DiffRhythm2](https://huggingface.co/ASLP-lab/DiffRhythm2). The series is the leading **diffusion-side** alternative to the LLM-style approach taken by Suno, YuE, and SongBloom.
+## 2. Architecture
+DiffRhythm v1 is a **non-autoregressive (NAR) latent diffusion** model with two pieces: a music **VAE** that compresses raw 44.1 kHz stereo audio into a latent grid, and a **DiT** (Diffusion Transformer) that denoises that grid conditioned on lyrics + style ([nzqian.github.io/DiffRhythm](https://nzqian.github.io/DiffRhythm/)). The DiT uses **16 LLaMA-style decoder layers, 2048 hidden dim, 32 heads × 64 dim, totaling ~1.1B parameters** ([arxiv.org/html/2503.01183](https://arxiv.org/html/2503.01183v1)). Vocals and accompaniment are produced **jointly in a single latent stream** — not dual-track — which is what makes it "embarrassingly simple" vs. cascaded systems. Lyric conditioning is **sentence-level via LRC (timestamped) phonemes**, with the diffusion model expected to align internally; style is conditioned either via a reference audio embedding or a text prompt. Inference uses a **32-step Euler ODE with CFG scale 4** and 20% dropout on both conditions during training to enable CFG ([diffrhythm.us](https://diffrhythm.us/)).
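+The sampling recipe described above (32 Euler steps, CFG scale 4) can be sketched in plain Python. This is a toy illustration, not DiffRhythm's implementation: `toy_velocity` is a hypothetical stand-in for the DiT, the guidance rule shown is one common CFG formulation, and integrating from t = 0 to t = 1 is an assumed convention.
```python
from typing import Callable, List, Optional

Vector = List[float]
# velocity(x, t, cond) -> dx/dt; cond=None selects the unconditional branch.
VelocityFn = Callable[[Vector, float, Optional[str]], Vector]


def cfg_velocity(velocity: VelocityFn, x: Vector, t: float, cond: str, scale: float) -> Vector:
    """Classifier-free guidance: v_uncond + scale * (v_cond - v_uncond)."""
    v_cond = velocity(x, t, cond)
    v_uncond = velocity(x, t, None)
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def euler_sample(velocity: VelocityFn, x0: Vector, cond: str,
                 steps: int = 32, cfg_scale: float = 4.0) -> Vector:
    """Integrate dx/dt = guided velocity from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = cfg_velocity(velocity, x, t, cond, cfg_scale)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x: Vector, t: float, cond: Optional[str]) -> Vector:
    """Placeholder for the DiT: constant drift, stronger when conditioned."""
    drift = 1.0 if cond is not None else 0.5
    return [drift for _ in x]


if __name__ == "__main__":
    print(euler_sample(toy_velocity, [0.0, 0.0], cond="pop ballad"))
```
+Each step costs two model evaluations (conditional and unconditional), so a 32-step run calls the model 64 times unless the two branches are batched together.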
+**DiffRhythm 2** replaces the pure-NAR DiT with a **semi-autoregressive block flow-matching** transformer: the latent sequence is sliced into **blocks of 10 frames (2s at 5 Hz)**, and "each block is generated with flow matching, while the dependency across blocks is handled autoregressively" ([alphaxiv.org/overview/2510.22950v3](https://www.alphaxiv.org/overview/2510.22950v3) — quoted via search snippet). This is the key innovation: it preserves NAR-style fast within-block parallelism while letting the model attend to prior blocks for **structural coherence** (verse → chorus → verse) and **lyric alignment without any external aligner**. The audio codec is a new **music VAE at 5 Hz frame rate** (vs. the much higher rates of EnCodec/DAC) with a **170M-param decoder**, enabling 210s of latent context to fit on a single GPU ([arxiv abs](https://arxiv.org/abs/2510.22950)). The full DiT is **~1B parameters**. Two new training objectives appear: **Stochastic Block Representation Alignment (REPA) loss** to align hidden states of clean vs. noisy blocks (improves musicality/structure), and **Cross-Pair Preference Optimization** — an RLHF variant that groups the four preference dimensions (musicality, style similarity, lyric alignment, audio quality) into pairs to dodge the merging-induced regression that plain DPO causes. **Max song length: 210 s** in v2 vs. **4m45s (~285 s)** in v1-full ([github.com/ASLP-lab/DiffRhythm](https://github.com/ASLP-lab/DiffRhythm)).
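+The block geometry quoted above is easy to check by hand. The following sketch uses only the figures in the text (5 Hz latent frames, 10 frames per block, 210 s maximum) and derives the block duration and the number of autoregressive steps; the function names are illustrative.
```python
import math

FRAME_RATE_HZ = 5       # latent frames per second (from the text)
FRAMES_PER_BLOCK = 10   # frames per flow-matching block (from the text)


def block_seconds() -> float:
    """Duration of audio covered by one block."""
    return FRAMES_PER_BLOCK / FRAME_RATE_HZ


def latent_frames(duration_s: float) -> int:
    """Number of latent frames needed to cover the requested duration."""
    return math.ceil(duration_s * FRAME_RATE_HZ)


def num_blocks(duration_s: float) -> int:
    """Number of blocks, i.e. autoregressive steps across the song."""
    return math.ceil(latent_frames(duration_s) / FRAMES_PER_BLOCK)


if __name__ == "__main__":
    print(block_seconds(), latent_frames(210), num_blocks(210))
```
+A maximum-length 210 s song is therefore 1,050 latent frames generated as 105 sequential blocks, with flow matching running in parallel inside each block.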
+## 3. Variants and sizes
+| Checkpoint | Duration | DiT params | Notes | Source |
+|---|---|---|---|---|
+| `DiffRhythm-base` | 1m35s | ~1.1B | Original Mar 2025 | [HF](https://huggingface.co/ASLP-lab/DiffRhythm-base) |
+| `DiffRhythm-full` | 4m45s | ~1.1B | Released 15 Mar 2025 | [HF](https://huggingface.co/ASLP-lab/DiffRhythm-full) |
+| `DiffRhythm-vae` | — | — | Shared audio VAE | [HF](https://huggingface.co/ASLP-lab/DiffRhythm-vae) |
+| `DiffRhythm-1_2-base` | 1m35s | ~1.1B | v1.2 quality fix | [GH README](https://github.com/ASLP-lab/DiffRhythm) |
+| `DiffRhythm-1_2-full` | 4m45s | ~1.1B | v1.2, text-style + instrumental | [HF](https://huggingface.co/ASLP-lab/DiffRhythm-1_2-full) |
+| `DiffRhythm+` (paper) | full | ~1.1B | Adds DPO; not headlined as separate checkpoint | [arxiv 2507.12890](https://arxiv.org/html/2507.12890v2) |
+| `DiffRhythm2` | 210 s | ~1B DiT + 170M VAE-dec | Block flow matching | [HF](https://huggingface.co/ASLP-lab/DiffRhythm2) |
+(Speculation: I did not find an explicit param count posted for v2's DiT; the **~1B figure comes from a paper-extraction snippet** and aligns with v1's ~1.1B body. Treat as approximate.)
+## 4. License
+**Apache 2.0** for both code and DiT weights, declared on the v1 GitHub README and reaffirmed on the v2 README ([github.com/ASLP-lab/DiffRhythm](https://github.com/ASLP-lab/DiffRhythm), [github.com/ASLP-lab/DiffRhythm2](https://github.com/ASLP-lab/DiffRhythm2)). **Commercial use is permitted** with attribution. The v2 model card adds a **non-binding ethical disclaimer** asking users to verify originality, disclose AI involvement, and respect stylistic copyright — this is a notice, not an enforceable license restriction ([HF model card](https://huggingface.co/ASLP-lab/DiffRhythm2)).
+## 5. Languages supported
+Training is heavily **bilingual (Mandarin + English)**: v2's dataset is reported as **Chinese : English : Instrumental ≈ 4 : 5 : 1** ([alphaXiv extract](https://www.alphaxiv.org/overview/2510.22950v3)). The v1 README and several mirrors claim **cross-lingual capability** for Japanese, Korean, and Spanish ([diffrhythm.us](https://diffrhythm.us/), [diffrhythmai.com](https://diffrhythmai.com/)), but these are demo-site marketing claims, **not benchmarked in the paper**. Verdict: production-safe for **EN and ZH**; treat JP/KR/ES as best-effort. The phoneme front-end is **espeak-ng**, which itself supports 100+ languages ([HF model card](https://huggingface.co/ASLP-lab/DiffRhythm2)).
+## 6. Quality assessment
+**Objective (v2 paper, lower=better for PER, higher=better for Mulan-T):**
+| Metric | DiffRhythm 2 | DiffRhythm+ | ACE-Step | LeVo |
+|---|---|---|---|---|
+| PER (lyric alignment) ↓ | **0.13** | 0.15 | 0.23 | 0.19 |
+| Mulan-T (style match) ↑ | **0.40** | 0.25 | 0.28 | 0.35 |
+| RTF (speed) ↓ | 0.213 | 0.153 | 0.127 | 1.225 |
+So v2 has **best-in-open-source lyric alignment and style match**, slightly slower than v1+/ACE-Step but ~6× faster than LeVo ([arxiv 2510.22950](https://arxiv.org/abs/2510.22950)).
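+The speed comparison follows directly from the RTF row, where RTF is generation time divided by audio duration (lower is better). A small sketch using only the table's numbers; the helper names are illustrative:
```python
# RTF values copied from the table above (generation time / audio duration).
RTF = {
    "DiffRhythm 2": 0.213,
    "DiffRhythm+": 0.153,
    "ACE-Step": 0.127,
    "LeVo": 1.225,
}


def generation_seconds(rtf: float, audio_seconds: float) -> float:
    """Wall-clock time implied by an RTF for a clip of the given length."""
    return rtf * audio_seconds


def speed_ratio(slower: str, faster: str) -> float:
    """How many times faster `faster` is than `slower`."""
    return RTF[slower] / RTF[faster]


if __name__ == "__main__":
    for name, rtf in RTF.items():
        print(f"{name:<13} {generation_seconds(rtf, 210):6.1f} s for a 210 s song")
    print(f"LeVo / DiffRhythm 2: {speed_ratio('LeVo', 'DiffRhythm 2'):.2f}x")
```
+On these figures a 210 s song takes roughly 45 s with DiffRhythm 2, and LeVo is about 5.75 times slower.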
+**Subjective:** v2 is the strongest open model by MOS in the paper's own user study, **but the authors explicitly state "in aspects such as musicality, it still shows a clear gap compared to commercial systems like SUNO V4.5"** ([arxiv 2510.22950](https://arxiv.org/abs/2510.22950)). The **block flow-matching does close the structural-coherence gap** that the original Hacker News thread criticized v1 for — multiple HN commenters complained "there's no identifiable chorus in any of the demo songs" and rhythm was unstable ([news.ycombinator.com/item?id=43255467](https://news.ycombinator.com/item?id=43255467)). v2 demos show real verse/chorus structure ([aslp-lab.github.io/DiffRhythm2.github.io](https://aslp-lab.github.io/DiffRhythm2.github.io/)). Specific Reddit reception threads in r/LocalLLaMA/r/StableDiffusion were not surfaced by search (low signal).
+## 7. Inference performance
+- v1-full: **~10 s for a 4m45s song on a single RTX 4090** (claimed in paper abstract, [arxiv 2503.01183](https://arxiv.org/abs/2503.01183)) — 32 ODE steps. Real-world ComfyUI users report **~62 s for 4 min** on consumer GPUs ([comfyui.org](https://comfyui.org/en/generate-music-with-comfyui-diffrhythm)).
+- **VRAM:** DiffRhythm-base needs ≥ **8 GB** with `--chunked`; full needs **24 GB** for headroom ([chutes.ai docs](https://chutes.ai/docs/examples/music-generation)).
+- v2: **RTF 0.213 on RTX 4090** → ~45 s for a 210 s song ([arxiv 2510.22950](https://arxiv.org/abs/2510.22950)).
+- **Apple Silicon / MPS:** The v1 README claims Apple Silicon is "supported as of March 2025" but the GitHub issues list does not surface dedicated MPS benchmarks, and the Pinokio launcher ([github.com/pinokiofactory/diffrhythm](https://github.com/pinokiofactory/diffrhythm)) does not advertise macOS in its description. **No published M3/M4/M5 numbers exist.** Speculation: on the user's **M5 Max with 128 GB unified memory**, v1-full should run via `PYTORCH_ENABLE_MPS_FALLBACK=1`, likely 3–5× slower than 4090 — needs hands-on validation. v2 is newer and has not been tested on MPS publicly.
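+For the hands-on MPS validation suggested above, a device-selection helper along these lines is a reasonable starting point. It is a sketch, not code from the DiffRhythm repo: `pick_device` is a hypothetical name, and it assumes the fallback variable should be set before `torch` is imported so that unsupported MPS ops run on the CPU.
```python
import os

# Set before torch is imported so ops without an MPS kernel fall back to CPU.
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")


def pick_device() -> str:
    """Return 'cuda', 'mps', or 'cpu' depending on what this machine offers."""
    try:
        import torch
    except ImportError:
        return "cpu"  # torch not installed; nothing to accelerate
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # absent on very old torch builds
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    print("selected device:", pick_device())
```
+Ops that fall back to the CPU add host/device copies, so timing a full generation both with and without the fallback is the quickest way to see how much of the model actually runs on the GPU.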
+## 8. DiffRhythm 2 specifics
+What changed from v1 → v2 ([arxiv 2510.22950](https://arxiv.org/abs/2510.22950), [alphaxiv overview](https://www.alphaxiv.org/overview/2510.22950v3)):
+1. **Architecture shift:** pure NAR DiT → **semi-AR block flow-matching** (2 s blocks).
+2. **New 5 Hz music VAE** (vs. v1's higher-rate codec) — enables 210 s context within budget.
+3. **Stochastic Block REPA loss:** aligns clean vs. noisy hidden states → better musicality + structure.
+4. **Cross-Pair Preference Optimization:** four-dim RLHF without the model-merging regression that plain DPO causes.
+5. **Dataset scaling:** **~1.4 M songs / ~70,000 hours**, with a **20 k-hour high-quality subset** for SFT and **40 k preference pairs** for DPO — a step-change from v1's undisclosed-but-smaller corpus.
+6. **Lyric alignment without external constraints:** v1 needed LRC timestamps; v2 learns alignment end-to-end via the AR block dependency.
+7. **Quality numbers (paper):** PER **0.15 → 0.13**, Mulan-T **0.25 → 0.40** vs. DiffRhythm+, i.e. **lyric error reduced by ~13 % and style match improved by 60 %**.
+## 9. Repo health
+- **DiffRhythm v1:** ~**2.2–2.3 k stars**, **268 forks**, active through 2025, last major release Mar 2025 ([github.com/ASLP-lab/DiffRhythm](https://github.com/ASLP-lab/DiffRhythm)).
+- **DiffRhythm 2:** **157 stars / 11 forks / 27 commits** as of late Oct 2025 — young repo, recently pushed ([github.com/ASLP-lab/DiffRhythm2](https://github.com/ASLP-lab/DiffRhythm2)).
+- Training/fine-tuning scripts: **"Coming soon"** is the status on v1; community has filed [Issue #46](https://github.com/ASLP-lab/DiffRhythm/issues/46) asking for fine-tuning docs. v2 ships **inference only** in the public repo as of writing.
+## 10. Real-world adoption
+- **ComfyUI:** [billwuhao/ComfyUI_DiffRhythm](https://github.com/billwuhao/ComfyUI_DiffRhythm) — 153 stars, supports v1.2 + full, includes bilingual subtitle gen ([runcomfy.com node](https://www.runcomfy.com/comfyui-nodes/ComfyUI_DiffRhythm)).
+- **Pinokio:** [pinokiofactory/diffrhythm](https://github.com/pinokiofactory/diffrhythm) — 19 stars, 69 commits, one-click installer.
+- **Chutes.ai:** Public serverless endpoint for DiffRhythm-full ([chutes.ai/docs/examples/music-generation](https://chutes.ai/docs/examples/music-generation)).
+- **Replicate:** No first-party DiffRhythm 2 model found in search — gap in the ecosystem (speculation).
+- Multiple unofficial web frontends: diffrhythm.com, diffrhythm.us, diffrhythm.ai, diffrhythmai.com — quality and origin unverified, likely wrappers over the HF Space.
+## 11. Fine-tuning
+The official answer is **none yet**. The v1 repo's training code is listed as "Coming soon," and v2 only ships inference. There is no LoRA support, no published fine-tuning recipe, and no `transformers`/`diffusers` integration as of May 2026. Community workaround would require reverse-engineering the DiT class — non-trivial for a 1 B-param flow-matching model. **For the user's Suno-clone platform, fine-tuning DiffRhythm today means forking + writing your own training loop.** This is the single biggest practical weakness.
+## 12. Pros and cons
+**Pros**
+- Permissive **Apache 2.0** for code + weights — clean commercial path.
+- **Fastest open full-song model** (~10 s for a 4m45s song on a 4090 per the v1 paper; v2's block flow-matching stays competitive on speed while adding AR-like coherence).
+- v2 has **state-of-the-art lyric alignment (PER 0.13)** in open source.
+- Lightweight: 8 GB VRAM possible with chunking — runs on consumer GPUs.
+- Strong ecosystem: ComfyUI nodes, Pinokio installer, Chutes serverless.
+- v2's block flow-matching meaningfully **closes the structural-coherence gap** that doomed v1 demos on HN.
+**Cons**
+- Still a **clear musicality gap vs. Suno v4.5** (authors admit it; [arxiv 2510.22950](https://arxiv.org/abs/2510.22950)).
+- **No fine-tuning / LoRA path** — training code unreleased.
+- v2's max length is **210 s** (3m30s), *shorter* than v1-full's 4m45s — a regression for radio-length pop.
+- Multilingual claims (JP/KR/ES) are **unbenchmarked**; only EN/ZH have paper-backed quality.
+- **No published MPS benchmarks** for Apple Silicon; v2 untested on Mac.
+- Demo-site proliferation (`diffrhythm.us`, etc.) muddies the brand — confusing for product positioning.
+- License disclaimer adds soft ethical obligations re. copyright that legal review may flag.
+## 13. Verdict for the user's platform
+For a Suno-style platform on an **M5 Max (128 GB unified, MPS)**, DiffRhythm 2 is the **best diffusion-side open option in May 2026**, *but* it should be paired with an **AR-style backup** (YuE / SongBloom / LeVo) covering its weak points.
+**Where DiffRhythm 2 wins:**
+- Fast, cheap inference per song — viable for high-throughput web generation.
+- Best-in-open lyric intelligibility — critical for a karaoke / lyrics-first UX.
+- Stereo 44.1 kHz output out of the box.
+- Apache-2.0 + commercial freedom.
+**Where it underperforms:**
+- **Pop musicality, hook quality, vocal timbre** are still below Suno v4.5 — premium-tier output is not there.
+- **No fine-tuning** means you cannot specialize on a target sound or your platform's curated catalog without doing R&D.
+- **210 s ceiling on v2** limits "full album track" formats — you'd fall back to v1-full (4m45s) at a quality cost.
+- **MPS path is unproven** — the user should plan a same-week feasibility test on the M5 Max before committing v2 to the inference layer; CUDA cloud (Chutes / a 4090 server) is the safer near-term backend.
+**Recommended posture:** ship v2 as the default *fast* generator behind a feature flag, keep v1.2-full for >3.5 min songs, evaluate Suno / YuE / SongBloom as quality-tier alternatives, and track the v2 repo for an eventual training-code release that would unlock fine-tuning on your platform's data.
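+The length-based split in the recommended posture can be expressed as a small selection rule. This is a hypothetical sketch: the checkpoint names and the 210 s / 285 s limits come from the tables above, while `pick_checkpoint` and the `v2_enabled` feature flag are illustrative names.
```python
V2_MAX_SECONDS = 210       # DiffRhythm 2 ceiling (from the text)
V1_FULL_MAX_SECONDS = 285  # 4m45s ceiling of v1.2-full (from the text)


def pick_checkpoint(duration_s: float, v2_enabled: bool = True) -> str:
    """Choose a DiffRhythm checkpoint for the requested song length."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    if duration_s > V1_FULL_MAX_SECONDS:
        raise ValueError(f"no DiffRhythm checkpoint covers {duration_s} s")
    if v2_enabled and duration_s <= V2_MAX_SECONDS:
        return "DiffRhythm2"
    return "DiffRhythm-1_2-full"


if __name__ == "__main__":
    for seconds in (90, 210, 240):
        print(seconds, "->", pick_checkpoint(seconds))
```
+Turning the flag off sends every request to v1.2-full, which gives a simple rollback path if v2 misbehaves in production.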
+---
+### Primary sources
+- [DiffRhythm 2 paper (arxiv 2510.22950)](https://arxiv.org/abs/2510.22950)
+- [DiffRhythm v1 paper (arxiv 2503.01183)](https://arxiv.org/abs/2503.01183)
+- [DiffRhythm v1 GitHub](https://github.com/ASLP-lab/DiffRhythm)
+- [DiffRhythm 2 GitHub](https://github.com/ASLP-lab/DiffRhythm2)
+- [DiffRhythm 2 HF model card](https://huggingface.co/ASLP-lab/DiffRhythm2)
+- [alphaXiv overview v3](https://www.alphaxiv.org/overview/2510.22950v3)
+- [HN thread on v1](https://news.ycombinator.com/item?id=43255467)
+- [ComfyUI_DiffRhythm](https://github.com/billwuhao/ComfyUI_DiffRhythm)
+- [Pinokio DiffRhythm](https://github.com/pinokiofactory/diffrhythm)
+- [Chutes serving docs](https://chutes.ai/docs/examples/music-generation)
+- [DiffRhythm+ paper (arxiv 2507.12890)](https://arxiv.org/html/2507.12890v2)

research/03_acestep.md ADDED

	@@ -0,0 +1,224 @@

+# ACE-Step — Deep Technical Report
+*Researched 2026-05-18 for a Suno-like platform build on M5 Max (128 GB unified) / MPS.*
+---
+## 1. Overview
+ACE-Step is a foundation model for music generation jointly built by **ACE Studio** (the music-tech company behind the ACE Studio vocal synth) and **StepFun** ("Step-AI"), a Beijing-based foundation-model lab. Core authors: Junmin Gong, Sean Zhao, Sen Wang, Shengyuan Xu, Joe Guo ([ace-step.github.io](https://ace-step.github.io/)).
+Release timeline:
+- **v1 (3.5B)** — open-sourced May 2025; technical report posted on arXiv on 2 Jun 2025 as 2506.00045 ([arxiv.org/abs/2506.00045](https://arxiv.org/abs/2506.00045)).
+- **v1.5** — released **28 Jan 2026** as a separate repo, [`ace-step/ACE-Step-1.5`](https://github.com/ace-step/ACE-Step-1.5). Adds a hybrid Language-Model + Diffusion-Transformer planner.
+- **XL series (4B DiT decoder)** — released 2 Apr 2026 as a higher-quality variant inside the v1.5 family.
+- **Latest tag** — v0.1.7 on 24 Apr 2026 ([ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)).
+- **v2** — **no public roadmap or announcement** as of 18 May 2026.
+Current status: actively maintained, 10.4k stars on the v1.5 repo and 4.5k on the original v1 repo, with a thriving ComfyUI ecosystem and third-party UIs ([ace-step/ACE-Step](https://github.com/ace-step/ACE-Step), [ace-step/ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)).
+---
+## 2. Architecture
+**v1 (3.5B):** a hybrid that fuses three pieces (per the paper, [arxiv.org/abs/2506.00045](https://arxiv.org/abs/2506.00045)):
+1. **Sana Deep Compression AutoEncoder (DCAE)** — high-compression audio latent space borrowed from NVIDIA's Sana image work.
+2. **Lightweight linear transformer** — the diffusion backbone, deliberately linear-attention to keep RTF low.
+3. **Diffusion training** with **MERT + m-HuBERT** providing semantic-alignment supervision (REPA-style) during training so latents stay musically coherent.
+This sits between LLM-token approaches (Suno/YuE, slow but lyric-tight) and pure diffusion (DiffRhythm, fast but structurally weak). The design goal stated in the paper is "a fast, general-purpose, efficient yet flexible architecture" — explicitly a *foundation model*, not just a text-to-song pipeline ([arxiv.org/abs/2506.00045](https://arxiv.org/abs/2506.00045)).
+**v1.5:** a hybrid **LM-as-planner + Diffusion-Transformer (DiT)**. A small Qwen3-based LM (0.6B / 1.7B / 4B) turns the user prompt into a structured "song blueprint" (sections, key, bpm, lyrics, vocal style) which the DiT (2B standard or 4B XL) decodes into audio. This brings chain-of-thought reasoning to music structure, lifting long-range coherence — Suno's main historic advantage ([ACE-Step-1.5 README](https://github.com/ace-step/ACE-Step-1.5)).
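+To picture what a "song blueprint" might carry between the planner LM and the DiT, here is a purely hypothetical data structure built from the fields named above (sections, key, bpm, lyrics, vocal style). ACE-Step's real intermediate format is not described in this report, so every class and field name below is an assumption for illustration.
```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Section:
    """One structural unit of the song (hypothetical shape)."""
    name: str                                   # e.g. "verse", "chorus", "bridge"
    lyrics: List[str] = field(default_factory=list)


@dataclass
class SongBlueprint:
    """Hypothetical plan a planner LM could hand to the audio decoder."""
    key: str
    bpm: int
    vocal_style: str
    sections: List[Section] = field(default_factory=list)

    def section_names(self) -> List[str]:
        return [s.name for s in self.sections]


if __name__ == "__main__":
    plan = SongBlueprint(
        key="A minor",
        bpm=92,
        vocal_style="soft female lead",
        sections=[
            Section("verse", ["City lights are fading out"]),
            Section("chorus", ["Hold on to the sound"]),
        ],
    )
    print(plan.section_names())
```
+Keeping the plan as structured data rather than free text would let a product UI expose fields such as tempo or section order for editing before any audio is rendered.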
+**Parameter counts:**
+| Variant | DiT | LM planner | Total |
+|---|---|---|---|
+| v1-3.5B | 3.5B (DiT only) | — | 3.5B |
+| v1.5 standard | 2B | 0.6B / 1.7B | ~2.6 – 3.7B |
+| v1.5 XL | 4B | up to 4B | up to 8B |
+---
+## 3. Variants and checkpoints
+All on Hugging Face under the `ACE-Step/` org ([ACE-Step org on HF](https://huggingface.co/ACE-Step)):
+- `ACE-Step-v1-3.5B` — the original generalist model ([HF card](https://huggingface.co/ACE-Step/ACE-Step-v1-3.5B)).
+- `ACE-Step-v1-chinese-rap-LoRA` ("RapMachine") — genre-specific LoRA.
+- **LoRA family** shipped by the team: `RapMachine`, `Lyric2Vocal` (vocal-only stem from lyrics), `Text2Samples` (instrumental loops/samples) ([ace-step.github.io](https://ace-step.github.io/)).
+- **v1.5 DiT checkpoints:** 2B standard and 4B XL.
+- **v1.5 LM planners:** 0.6B, 1.7B, 4B.
+- A public **Space demo** at [huggingface.co/spaces/ACE-Step/ACE-Step](https://huggingface.co/spaces/ACE-Step/ACE-Step).
+No v2 checkpoint exists yet.
+---
+## 4. License
+**Apache 2.0** for v1 ([ace-step/ACE-Step](https://github.com/ace-step/ACE-Step)) and **MIT** for v1.5 ([ace-step/ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)). Both are unambiguously **commercial-use-permitted, royalty-free**. This is the single biggest licensing advantage over Suno/Udio, and it matches YuE, whose repository is also Apache 2.0.
+---
+## 5. Vocal support — CRITICAL VERIFICATION
+**Verdict: YES — ACE-Step generates vocals natively. The "instrumental-only" claim circulating in some reviews is wrong (likely conflating it with `Text2Samples` LoRA or with DiffRhythm).**
+Evidence:
+- The **v1 HF model card** describes the model as full-song (vocals + instruments) with the explicit caveat: *"Coarse vocal synthesis lacking nuance"* and *"Rare instruments may not render perfectly"* ([HF card](https://huggingface.co/ACE-Step/ACE-Step-v1-3.5B)).
+- The paper claims **lyric alignment across melody/harmony/rhythm metrics** — only meaningful for sung vocals ([arxiv.org/abs/2506.00045](https://arxiv.org/abs/2506.00045)).
+- The ComfyUI native node `TextEncodeAceStepAudio` accepts lyrics with `[verse] [chorus] [bridge]` structural tags ([comfyui-wiki guide](https://comfyui-wiki.com/en/tutorial/advanced/audio/ace-step/ace-step-v1)).
+- `Lyric2Vocal` LoRA exists *because* the base model already does vocals — the LoRA isolates the vocal stem ([ace-step.github.io](https://ace-step.github.io/)).
+- Blind-listening review of 50 participants scored ACE-Step v1.5 **4.4/5 on SongEval Vocal vs Suno v4 at 4.1/5** ([fm9.ai/ace-step/vs-suno](https://fm9.ai/ace-step/vs-suno)).
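+As an illustration of the structural tags mentioned in the evidence above, the following helper assembles lyrics with `[verse]`, `[chorus]`, and `[bridge]` markers. It is a sketch: the tag-on-its-own-line layout and the blank line between sections are assumptions about formatting, and `format_lyrics` is a hypothetical helper, not part of ACE-Step or ComfyUI.
```python
from typing import Iterable, List, Tuple

ALLOWED_TAGS = ("verse", "chorus", "bridge")  # tags named in the text


def format_lyrics(sections: Iterable[Tuple[str, List[str]]]) -> str:
    """Join (tag, lines) pairs into one tagged lyrics string."""
    blocks = []
    for tag, lines in sections:
        if tag not in ALLOWED_TAGS:
            raise ValueError(f"unsupported section tag: {tag!r}")
        blocks.append("\n".join([f"[{tag}]", *lines]))
    return "\n\n".join(blocks)


if __name__ == "__main__":
    print(format_lyrics([
        ("verse", ["Morning breaks on empty streets", "I hum a tune nobody knows"]),
        ("chorus", ["Sing it back to me"]),
    ]))
```
+Validating tags up front catches typos such as `[chorous]` before a generation run is spent on them.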
+**Quality reality check:** v1 vocals are admitted to be "coarse"; v1.5 markedly improves vocal clarity and now beats Suno v4 in blind tests on naturalness for folk/classical/jazz, while Suno still wins on "radio-ready polish" for pop/EDM ([fm9.ai/ace-step/vs-suno](https://fm9.ai/ace-step/vs-suno)).
+---
+## 6. Languages supported
+- **v1:** 19 languages, with the top 10 (English, Mandarin Chinese, Russian, Spanish, Japanese, German, French, Portuguese, Italian, Korean) performing best ([ace-step/ACE-Step](https://github.com/ace-step/ACE-Step)). Less-represented languages underperform due to training-data imbalance.
+- **v1.5:** Expanded to **50+ languages** with lyric control, alongside the planner LM ([ace-step/ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)).
+Known weakness from the team itself: Chinese rap was historically weak, motivating the `chinese-rap-LoRA` ([ace-step.github.io](https://ace-step.github.io/)).
+---
+## 7. Speed claims — verified
+The famous claim: *"synthesizes up to 4 minutes of music in just 20 seconds on an A100 GPU — 15× faster than LLM-based baselines"* ([arxiv.org/abs/2506.00045](https://arxiv.org/abs/2506.00045), [ace-step.github.io](https://ace-step.github.io/)). Hardware: **NVIDIA A100 80GB**.
+Published RTF table from the v1 HF card ([HF card](https://huggingface.co/ACE-Step/ACE-Step-v1-3.5B)):
+| Device | 27 steps RTF | 60 steps RTF |
+|---|---|---|
+| RTX 4090 | 34.48× | 15.63× |
+| A100 | 27.27× | 12.27× |
+| RTX 3090 | 12.76× | 6.48× |
+| **M2 Max** | **2.27×** | **1.03×** |
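+The table's values read as multiples of real time (audio seconds produced per wall-clock second), an interpretation consistent with the 4-minutes-in-20-seconds A100 claim. Under that reading, wall-clock time is simply audio length divided by the factor. A minimal sketch using the table's numbers; the function name is illustrative:
```python
# Speed-up factors from the table above, keyed by device and step count.
SPEEDUP = {
    "RTX 4090": {27: 34.48, 60: 15.63},
    "A100": {27: 27.27, 60: 12.27},
    "RTX 3090": {27: 12.76, 60: 6.48},
    "M2 Max": {27: 2.27, 60: 1.03},
}


def wall_clock_seconds(device: str, steps: int, audio_seconds: float = 240.0) -> float:
    """Time to generate `audio_seconds` of music at the published speed-up."""
    return audio_seconds / SPEEDUP[device][steps]


if __name__ == "__main__":
    for device, by_steps in SPEEDUP.items():
        times = ", ".join(
            f"{steps} steps: {wall_clock_seconds(device, steps):6.1f} s" for steps in by_steps
        )
        print(f"{device:<9} {times}")
```
+For a 4-minute song this gives about 19.6 s on an A100 at 60 steps and about 106 s on an M2 Max at 27 steps.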
+v1.5 is faster still: *"under 2 seconds per full song on A100 and under 10 seconds on an RTX 3090"* ([ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)).
+**Apple-Silicon equivalents** (from the dedicated [clockworksquirrel/ace-step-apple-silicon](https://github.com/clockworksquirrel/ace-step-apple-silicon) port):
+| Task | M1 Pro 16 GB | M3 Pro 36 GB | A100 |
+|---|---|---|---|
+| 30 s turbo | ~45 s | ~25 s | ~2 s |
+| 30 s SFT (full) | ~3 min | ~1.5 min | ~8 s |
+**M5 Max projection:** The M5 Max's GPU TFLOPS lineage (MPS SGEMM scaled M1→M4: 1.36 → 2.24 → 2.47 → 2.9 TFLOPS, per [arxiv 2502.05317](https://arxiv.org/html/2502.05317v1)) plus the M5 generation's ~30 % uplift suggests roughly **3.5–4× the throughput of M2 Max**, i.e. an **estimated 8–10× RTF at 27 steps** for v1, and full-song generation in **~30–50 s for a 4-minute song**. No M5-specific public benchmark exists yet.
+---
+## 8. Quality assessment
+From the cross-model evaluation summarised in research-aggregator coverage ([researchgate paper page](https://www.researchgate.net/publication/392334894_ACE-Step_A_Step_Towards_Music_Generation_Foundation_Model), [fm9.ai/ace-step/vs-suno](https://fm9.ai/ace-step/vs-suno)):
+| Dimension | Leader | Where ACE-Step sits |
+|---|---|---|
+| Aesthetic quality | Hailuo > DiffRhythm | mid-upper |
+| Musicality (coherence) | Suno v3 | competitive, strong on memorability/clarity |
+| Style alignment | Udio v1 > Hailuo | 3rd |
+| Lyric alignment | Hailuo | strong, beats Suno v3, Udio, YuE |
+| **Vocal naturalness (v1.5)** | **ACE-Step 4.4/5** | beats Suno v4 (4.1/5) |
+| Speed (RTF) | **ACE-Step 15.63×** | best in class; DiffRhythm 10.03×, YuE 0.083× |
+User-facing reception is positive on customisability and speed; the most-cited weakness is "gacha"-style seed sensitivity — re-rolls produce noticeably different outputs ([ace-step.github.io](https://ace-step.github.io/)).
+---
+## 9. Inference performance & Apple Silicon
+- **VRAM (v1):** minimum **8 GB with CPU offload**; comfortable on 12 GB+ ([ace-step/ACE-Step](https://github.com/ace-step/ACE-Step)).
+- **VRAM (v1.5):** **<4 GB** for 2B-turbo with offload; **≥12 GB** for XL with offload; **≥20 GB** without offload; **≥24 GB optimal** ([ACE-Step-1.5 README](https://github.com/ace-step/ACE-Step-1.5)).
+- **MPS support:** **first-class.** Use `--bf16 false` on M-series to avoid kernel issues ([ace-step/ACE-Step](https://github.com/ace-step/ACE-Step)). The dedicated [clockworksquirrel/ace-step-apple-silicon](https://github.com/clockworksquirrel/ace-step-apple-silicon) fork adds: bfloat16 throughout, MPS-safe pipeline with `torch.mps.empty_cache()` synchronisation, **MLX backend (567 LoC)** that auto-converts the Qwen3 planner LM to MLX with quantisation, and **LoRA training on MPS**.
+- **ComfyUI:** **native nodes** ship in upstream ComfyUI (`TextEncodeAceStepAudio` etc.) plus the official [`ace-step/ACE-Step-ComfyUI`](https://github.com/ace-step/ACE-Step-ComfyUI). v1.5 has dedicated workflows (split-LLM and AIO checkpoint variants) on comfy.org ([Purz blog post](https://blog.comfy.org/p/ace-step-15-is-now-available-in-comfyui)).
+- **128 GB unified on M5 Max** comfortably fits the full XL stack plus the 4B planner LM with no offload needed; user's hardware is essentially overkill for ACE-Step.
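A minimal sketch of the MPS-safe pattern described above: pick the backend once, then synchronise and release cached memory between generations. The helper names `pick_device` and `release_cached_memory` are hypothetical and are not the fork's actual code; only the `torch.backends.mps`, `torch.mps` and `torch.cuda` calls are real PyTorch APIs.

```python
def pick_device() -> str:
    """Return "mps", "cuda", or "cpu"; falls back to "cpu" if PyTorch is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.backends.mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


def release_cached_memory(device: str) -> None:
    """Wait for queued work, then free cached allocator blocks on the backend."""
    if device == "cpu":
        return  # nothing to release on the CPU path
    import torch
    if device == "mps":
        torch.mps.synchronize()
        torch.mps.empty_cache()
    elif device == "cuda":
        torch.cuda.synchronize()
        torch.cuda.empty_cache()


if __name__ == "__main__":
    device = pick_device()
    print(f"running on {device}")
    release_cached_memory(device)
```

Centralising the check in one helper avoids scattering `cuda`-specific cache calls through the pipeline, which is the kind of cleanup the Apple-Silicon fork is described as doing.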
+---
+## 10. Repo health
+| Repo | Stars | Forks | Last release |
+|---|---|---|---|
+| `ace-step/ACE-Step` (v1) | 4.5k | 568 | quiet since v1.5 fork |
+| `ace-step/ACE-Step-1.5` | **10.4k** | 1.3k | v0.1.7 on 24 Apr 2026 |
+| `fspecii/ace-step-ui` (popular community UI) | 3.8k | 561 | active |
+| `clockworksquirrel/ace-step-apple-silicon` | — (smaller) | — | active |
+The team also curates [`ace-step/awesome-ace-step`](https://github.com/ace-step/awesome-ace-step). Issue activity, ComfyUI integration cadence, and the LM-planner architectural jump in v1.5 all indicate a project that is healthier and growing faster than YuE or DiffRhythm.
+---
+## 11. Real-world adoption
+- **AMD vendor-backed deployment:** AMD published a blog *"Commercial-grade AI music generation on AMD Ryzen AI processors and Radeon graphics with ACE Step 1.5"* in 2026, explicitly endorsing it for Ryzen AI / Radeon production stacks ([AMD blog](https://www.amd.com/en/blogs/2026/commercial-grade-ai-music-generation-on-amd-ryzen-ai-and-radeon-ace-step-1-5.html)).
+- **Third-party SaaS:** `acestep.io` and `ace-step.app` run hosted song-generation services on the open weights ([acestep.io](https://acestep.io/), [ace-step.app](https://ace-step.app/)).
+- **Production-grade UI:** `fspecii/ace-step-ui` brands itself as *"the Ultimate Open Source Suno Alternative"* with stem extraction (Demucs), batch generation, library/playlist management, LAN access ([fspecii/ace-step-ui](https://github.com/fspecii/ace-step-ui)).
+- Heart-MuLa and similar music platforms cite ACE-Step 1.5 in their stack comparisons ([heart-mula.com/ace-step](https://heart-mula.com/ace-step)).
+---
+## 12. Fine-tuning + LoRA
+- **Training code released**; documented in [`TRAIN_INSTRUCTION.md`](https://github.com/ace-step/ACE-Step) and `ZH_RAP_LORA.md` ([ace-step/ACE-Step](https://github.com/ace-step/ACE-Step)).
+- **Genre / task LoRAs from the team:** `RapMachine` (general rap), `Chinese-Rap-LoRA`, `Lyric2Vocal`, `Text2Samples` ([HF org](https://huggingface.co/ACE-Step), [ace-step.github.io](https://ace-step.github.io/)).
+- v1.5 quotes **"8 songs trainable in ~1 hour on a single RTX 3090"** for LoRA personalisation ([ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)).
+- LoRA training is verified working on **MPS** via the Apple-Silicon fork ([clockworksquirrel/ace-step-apple-silicon](https://github.com/clockworksquirrel/ace-step-apple-silicon)).
+---
+## 13. Pros and cons
+**Pros**
+- Apache-2.0 / MIT — **fully commercial-friendly**, unique in this tier.
+- **Fastest open music model**: 15.63× RTF on a 4090; sub-2 s/song on A100 (v1.5).
+- Vocals **and** instruments natively; v1.5 vocal quality now beats Suno v4 in blind tests.
+- 50+ languages with lyric structural tags.
+- First-class **MPS + MLX** support and a dedicated Apple-Silicon fork.
+- ComfyUI native + thriving UI ecosystem (`ace-step-ui`).
+- LoRA training is cheap (~1 hour for 8 songs on 3090), well-documented.
+- Hybrid LM-planner (v1.5) closes the long-range structure gap with Suno.
+**Cons**
+- v1 vocals are admittedly "coarse"; even v1.5 trails Suno on pop/EDM polish.
+- High **seed sensitivity** → "gacha" outputs; multiple re-rolls needed in production.
+- Less-represented languages underperform.
+- Memory for XL series can exceed 24 GB without offload.
+- No official **v2** announced; the rapid v1 → v1.5 → XL fork hints at API/checkpoint churn.
+- Smaller benchmark literature than Suno/YuE; some metrics still self-reported.
+---
+## 14. Verdict for the user's platform
+For a **Suno-like platform on M5 Max with 128 GB unified memory**, ACE-Step is currently the **single strongest open-source choice** and should be the **default base model**:
+- **Best for:** full-song generation with vocals in 50+ languages, fast iteration (sub-minute per song expected on M5 Max), genre-specific LoRA fine-tuning, and any deployment where commercial rights matter (Apache/MIT vs Suno's locked-down terms).
+- **Recommended stack:** ACE-Step **v1.5 XL (4B DiT) + 1.7B Qwen3 planner**, run via the `clockworksquirrel/ace-step-apple-silicon` MPS/MLX fork, served behind the `fspecii/ace-step-ui` frontend, with ComfyUI workflows for power-user editing.
+- **Weaknesses to mitigate:** budget for **n-of-k re-roll selection** in the product UX (the gacha problem); pair with a **Demucs stem-extraction post-process** (already in `ace-step-ui`) so users can mix-down; do not pitch the platform on pop/EDM polish alone — lean into folk/classical/jazz and rap, where ACE-Step now leads.
+- **Where you may still need Suno-style commercial APIs:** clients demanding broadcast-radio pop polish; otherwise, ACE-Step is sufficient.
+---
+### Sources
+- [ACE-Step paper, arXiv 2506.00045](https://arxiv.org/abs/2506.00045)
+- [ace-step.github.io](https://ace-step.github.io/)
+- [ace-step/ACE-Step (v1 repo)](https://github.com/ace-step/ACE-Step)
+- [ace-step/ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)
+- [ACE-Step v1-3.5B model card](https://huggingface.co/ACE-Step/ACE-Step-v1-3.5B)
+- [ACE-Step org on Hugging Face](https://huggingface.co/ACE-Step)
+- [clockworksquirrel/ace-step-apple-silicon](https://github.com/clockworksquirrel/ace-step-apple-silicon)
+- [fspecii/ace-step-ui](https://github.com/fspecii/ace-step-ui)
+- [ace-step/ACE-Step-ComfyUI](https://github.com/ace-step/ACE-Step-ComfyUI)
+- [ace-step/awesome-ace-step](https://github.com/ace-step/awesome-ace-step)
+- [ComfyUI native ACE-Step tutorial](https://docs.comfy.org/tutorials/audio/ace-step/ace-step-v1)
+- [ComfyUI Wiki ACE-Step guide](https://comfyui-wiki.com/en/tutorial/advanced/audio/ace-step/ace-step-v1)
+- [Purz blog – ACE-Step 1.5 in ComfyUI](https://blog.comfy.org/p/ace-step-15-is-now-available-in-comfyui)
+- [AMD blog – ACE-Step 1.5 on Ryzen AI / Radeon](https://www.amd.com/en/blogs/2026/commercial-grade-ai-music-generation-on-amd-ryzen-ai-and-radeon-ace-step-1-5.html)
+- [FM9 – ACE-Step vs Suno blind test](https://fm9.ai/ace-step/vs-suno)
+- [HeartMuLa – ACE-Step 1.5 review](https://heart-mula.com/ace-step)
+- [ResearchGate – ACE-Step paper page](https://www.researchgate.net/publication/392334894_ACE-Step_A_Step_Towards_Music_Generation_Foundation_Model)
+- [Apple Silicon HPC benchmark, arXiv 2502.05317](https://arxiv.org/html/2502.05317v1)
+- [acestep.io – hosted service](https://acestep.io/)
+- [ace-step.app – hosted service](https://ace-step.app/)

research/04_newcomers_and_survey.md ADDED

	@@ -0,0 +1,161 @@

+# 2026 Open-Source Music Generation Models — Newcomers and Survey
+*Date: 2026-05-18. Target hardware: M5 Max, 128 GB unified memory, MPS backend.*
+This report investigates the freshest 2026 open-source song-with-vocals generators relevant to building a Suno-like platform locally. Primary focus: **SongGeneration 2 / LeVo 2** (Tencent, March 2026) and **HeartMuLa** (Jan 2026). Also covered: DiffRhythm 2, ACE-Step 1.5 XL, SongBloom, YuE, FunMusic/InspireMusic, NotaGen. Independent benchmark sources are sparse for releases this fresh; vendor claims are flagged.
+---
+## 1. SongGeneration 2 / LeVo 2 (Tencent AI Lab)
+**Overview.** Builder: Tencent AI Lab. Release: 2026-03-01 (v2-large weights), arXiv paper "LeVo" appeared 2025-06-09 (2506.07520). Status: actively updated, v2 is the headline model on the repo ([GitHub](https://github.com/tencent-ailab/SongGeneration), [HF](https://huggingface.co/tencent/SongGeneration)).
+**Architecture.** Hybrid LLM + Diffusion. The **LeLM** language model handles global structure and performance details with a hierarchical scheme that parallel-models *Mixed Tokens* (melody/structure) and *Dual-Track Tokens* (separate vocal vs. accompaniment streams). A downstream diffusion module synthesises the high-fidelity acoustic waveform from those tokens. Multi-preference DPO alignment (~200k positive/negative pairs) is applied offline ([repo README](https://github.com/tencent-ailab/SongGeneration/blob/main/README.md)).
+**Variants and sizes.** Five tiers ([HF model card](https://huggingface.co/tencent/SongGeneration/blob/main/README.md)):
+- `base` (2:30 max, zh) — 10/16 GB VRAM, RTF 0.67
+- `base-new` (zh + en) — same VRAM
+- `base-full` (4:30, zh + en) — 12/18 GB VRAM, RTF 0.69
+- `large` (zh + en) — 22/28 GB VRAM, RTF 0.82
+- **`v2-large` — 4 B params, multilingual (zh/en/es/ja/…), 22/28 GB VRAM, RTF 0.82, 4:30 max length**
+**License.** Custom Tencent "academic, research and education purposes" license, **commercial use explicitly prohibited** ([LICENSE](https://github.com/tencent-ailab/SongGeneration/blob/main/LICENSE)). This is the headline blocker for a Suno-like SaaS product.
+**Languages.** v2-large: Chinese, English, Spanish, Japanese plus others (multilingual lyrics input).
+**Vocals.** Yes. Separable dual-track output (vocals + accompaniment, instrumental-only, or a cappella).
+**Speed and hardware.** Reference numbers measured on Tencent's H20 (96 GB) GPU: RTF 0.82 for v2-large. There is no first-party MPS code path, but a community fork **[SongGen-Mac](https://github.com/Rdx-ai-art/SongGen-Mac)** runs the older base/large models via PyTorch MPS on M-series Macs. Its author reports **~6 min wall-clock per ~2 min song on M1 Max 64 GB (base), ~12 min for large**, and notes RAM+swap usage hits ~70 GB during inference. The fork is tiny (9 GitHub stars) and does **not** yet wrap v2-large; porting that to MPS on the M5 Max 128 GB is a real engineering task and will likely need careful attention to bf16 casts (LeLM) plus diffusion sampler patches.
+**Benchmarks.** Vendor claims ([repo README](https://github.com/tencent-ailab/SongGeneration)): Phoneme Error Rate **8.55 %** vs. Suno v5 12.4 % and Mureka v8 9.96 %. Subjective panel: 20 industry professionals scored across Overall Quality, Melody, Arrangement, Sound Quality (instrument and vocal), Structure on 100 songs/model — Tencent reports v2-large above all open-source baselines and parity with top commercial. **All numbers vendor-reported; no independent re-run located.** The arXiv "Benchmarking Music Generation Models via Human Preference Studies" paper (2506.19085) precedes v2 and tops out at Suno v3.5 / Udio — does not cover LeVo ([arXiv](https://arxiv.org/html/2506.19085v1)).
+**Repo health.** 1.6 k stars / 191 forks, last meaningful update 2026-03-01. 12 active discussion threads ([repo](https://github.com/tencent-ailab/SongGeneration)).
+**Adoption.** Hugging Face Space (free demo), WaveSpeed AI hosted endpoint ([WaveSpeed](https://wavespeed.ai/models/wavespeed-ai/song-generation)), SECourses Patreon GUI wrapper, vllm-omni issue tracking integration ([HF Space](https://huggingface.co/spaces/tencent/SongGeneration)). No production SaaS adoption seen.
+**Pros.** State-of-the-art lyric accuracy (vendor); dual-track outputs ready for mixing; multilingual; clear inference budget; 4 B params fits comfortably in 128 GB unified memory in fp16.
+**Cons.** **License kills commercial use** for a Suno-clone product. No official MPS path. Community Mac fork lags v2. Inference time on Apple Silicon is multi-minute per song. No independent benchmark verification.
+---
+## 2. HeartMuLa (HeartMuLa team / academic group)
+**Overview.** Builder: HeartMuLa research collective, paper credited to Jordi Pons-affiliated group ([Substack explainer](https://artintech.substack.com/p/heartmula-explained)). First weights: 2026-01-19 (`HeartMuLa-oss-3B`), latest: 2026-02-13 (`HeartMuLa-oss-3B-happy-new-year`). arXiv 2601.10547 ([abs](https://arxiv.org/abs/2601.10547)).
+**Architecture.** Four-stage family ([landing page](https://heartmula.github.io/)): **HeartCLAP** (audio-text alignment / retrieval), **HeartTranscriptor** (Whisper-style lyric ASR), **HeartCodec** (12.5 Hz neural audio codec, low frame rate but high-fi), **HeartMuLa** (LLM-based song generator conditioned on lyrics, tags, and reference audio). Section-level fine-grained control (intro/verse/chorus) is a stated feature.
+**Variants and sizes.** Six published weights on [HF](https://huggingface.co/HeartMuLa):
+- `HeartMuLa-oss-3B` — 4 B text-to-audio (1.21 k downloads, 255 likes)
+- `HeartMuLa-RL-oss-3B-20260123` — 4 B RL-tuned variant
+- `HeartMuLa-oss-3B-happy-new-year` — 4 B latest checkpoint
+- `HeartCodec-oss-20260123` — 2 B codec
+- `HeartTranscriptor-oss` — 0.8 B ASR
+- `HeartMuLa-7B` — internal/unreleased
+(Note the naming oddity: HF model card lists "3B" name but 4 B parameter size; treat as ~4 B.)
+**License.** **Apache 2.0** — confirmed via [LICENSE](https://github.com/HeartMuLa/heartlib/blob/main/LICENSE). Commercial use permitted. This is the strongest licensing position of any model in this report.
+**Languages.** Multilingual; demo page covers en, zh, ja, ko, es. Paper claims "almost all languages."
+**Vocals.** Yes — lyric-conditioned vocal synthesis is the core capability. The paper claims best-in-class lyric intelligibility.
+**Speed and hardware.** RTF ≈ 1.0 (paper). VRAM via the ComfyUI integration ([FL-HeartMuLa](https://github.com/filliptm/ComfyUI_FL-HeartMuLa)): 3 B model needs **12 GB+ VRAM** at full precision, **6 GB with 4-bit bnb quantisation** (CUDA-only). 7 B will need 24 GB / 12 GB quantised. **MPS supported** on M1/M2/M3/M4 (M5 implied), but 4-bit quantisation does not work on MPS, so the M5 Max will run native bf16. 128 GB unified memory is plenty of headroom for the 4 B model and an eventual 7 B release.
+**Benchmarks.** Vendor PER claims: **0.09 (English), 0.12 (Chinese)** — flagged "lowest across every language tested," beating Suno v5 and MiniMax Music 2.0 ([blog](https://huggingface.co/blog/azhan77168/heartmula)). **Note PER unit mismatch with SongGeneration's 8.55 % — these are likely measured on different scales (HeartMuLa percentages may be normalised differently); direct comparison unreliable.** Demo page compares against Suno v4.5, Mureka v7.6, YuE, DiffRhythm 2, ACE-Step ([demos](https://heartmula.github.io/)). The single HN comment ([46691275](https://news.ycombinator.com/item?id=46691275)) said "initial results promising, more so than recent ACE-Step 1.5." Otherwise **no independent A/B tests located**; the HF promo blog is vendor-aligned content.
+**Repo health.** [github.com/HeartMuLa/heartlib](https://github.com/HeartMuLa/heartlib): 3.6 k stars / 396 forks / 71 open issues. Last release Feb 2026. Larger and more active than SongGeneration's repo.
+**Adoption.** WaveSpeed AI hosted endpoint ([blog](https://wavespeed.ai/blog/posts/introducing-wavespeed-ai-heartmula-generate-music-on-wavespeedai/)); ComfyUI node `FL-HeartMuLa`; HeartMuse local app integrating Ollama for lyric writing ([HN](https://news.ycombinator.com/item?id=46871828)).
+**Pros.** Apache 2.0 — usable for a commercial product. Modular architecture (codec + ASR + CLAP + gen) is reusable. Strong lyric intelligibility claim. Active repo. Explicit MPS support documented downstream.
+**Cons.** Heavy marketing tone in third-party coverage; benchmarks all vendor-published. 7 B not yet released. No standardised MOS or ELO numbers from a neutral evaluator. PER values reported in non-comparable units to peers.
+---
+## 3. DiffRhythm 2 (ASLP-Lab)
+**Overview.** Successor to DiffRhythm v1.2. arXiv 2510.22950, v3 2026-02-03 ([arXiv](https://arxiv.org/abs/2510.22950)). Original repo: [ASLP-lab/DiffRhythm](https://github.com/ASLP-lab/DiffRhythm).
+**Architecture.** Music VAE at 5 Hz frame rate + Diffusion Transformer with **block flow matching** for lyric-to-vocal alignment. Adds cross-pair preference optimisation (RLHF) and a stochastic block representation alignment loss for musicality. Semi-autoregressive blockwise generation.
+**License.** Apache 2.0 (inherited from v1, confirmed 2025-03-07).
+**Languages, vocals, hardware.** Multilingual; full vocals + instrumental; uses 44.1 kHz stereo; up to 4:45 song length. DiffRhythm v1 can generate a full song in ~10 s on a single A100 — v2 should be in the same ballpark. MPS not officially stated but PyTorch DiT models port relatively cleanly. Parameter count not disclosed in v2 abstract.
+**Benchmarks.** Vendor claims top-of-class fidelity; no independent verification specific to v2.
+**Pros/cons.** Pros: very fast, permissive license, mature codebase. Cons: no public param count, no first-party MPS path, lyric clarity historically the weak spot vs LeVo/HeartMuLa.
+---
+## 4. ACE-Step 1.5 XL (ACE Studio × StepFun)
+**Overview.** [github.com/ace-step/ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5). arXiv 2602.00744. Most user-tested local-first option. 10.4 k stars / 1.3 k forks — **biggest community by far**.
+**Architecture.** LM planner (0.6 B / 1.7 B / 4 B selectable) + DiT decoder (2 B or 4 B XL). XL DiT ~9 GB bf16.
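As a rough cross-check on the ~9 GB bf16 figure, weight memory can be estimated as parameter count times bytes per parameter. The sketch below is a back-of-envelope estimate under that assumption only; it ignores activations, the VAE, and any checkpoint overhead, and the helper name `weight_gb` is illustrative.

```python
# Bytes per parameter for common dtypes.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1}


def weight_gb(params_billions: float, dtype: str = "bf16") -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes), weights only."""
    # (params_billions * 1e9 params) * bytes / 1e9 bytes-per-GB simplifies to:
    return params_billions * BYTES_PER_PARAM[dtype]


if __name__ == "__main__":
    # Component sizes taken from the Architecture line above.
    components = [
        ("LM 0.6 B", 0.6),
        ("LM 1.7 B", 1.7),
        ("LM 4 B", 4.0),
        ("DiT 2 B", 2.0),
        ("DiT 4 B XL", 4.0),
    ]
    for name, size in components:
        print(f"{name}: ~{weight_gb(size):.1f} GB in bf16")
```

The 4 B XL DiT comes out at about 8 GB of raw bf16 weights, which is plausibly consistent with the quoted ~9 GB once non-weight tensors are counted. Even the largest planner plus the XL DiT stays far below 128 GB.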
+**License.** **MIT**. Commercial use allowed.
+**Languages.** 50+.
+**Speed and hardware.** Under 2 s/song on A100, under 10 s on RTX 3090, **<4 GB VRAM** for DiT-only minimum. **Explicit Mac MPS support** with `start_gradio_ui_macos.sh`; MLX backend optimisation noted. Easiest M5 Max install of any model in this list.
+**Benchmarks.** Vendor: SongEval 8.12, AudioBox 7.76, claims to beat Suno v5 and MiniMax 2.5 across 11 dimensions ([project page](https://ace-step.github.io/ace-step-v1.5.github.io/)). DEV Community write-up positions it "between Suno v4.5 and v5" — more honest framing.
+**Pros.** Best Mac story, MIT licence, LoRA personalisation in days, tiny VRAM. **Cons.** Vocal naturalness still trails Suno v5 in casual user tests.
+---
+## 5. SongBloom (Tencent AI Lab)
+[github.com/tencent-ailab/SongBloom](https://github.com/tencent-ailab/SongBloom). 778 stars. Interleaved autoregressive sketch + diffusion refinement, 2 B params, MPS supported. Generates up to 150 s songs from lyrics + 10 s reference audio, with lengths up to 240 s in the Oct 2025 update. Same Tencent academic-only LICENSE pattern (not Apache). Useful as a research baseline; the **same commercial-use prohibition as SongGeneration** likely applies, so verify before deploying.
+---
+## 6. YuE (M-A-P / HKUST)
+[github.com/multimodal-art-projection/YuE](https://github.com/multimodal-art-projection/YuE). LLaMA-2 backbone, lyric-to-song, **Apache 2.0** since 2025-01-30, 5 min max length, dual-track ICL mode, no v2 announced. Strong vocal emotion for ballads/R&B. Llama.cpp issue 11467 still tracks GGUF support. Solid permissive fallback if HeartMuLa underperforms.
+---
+## 7. FunMusic / InspireMusic (Alibaba FunAudioLLM)
+[github.com/FunAudioLLM/FunMusic](https://github.com/FunAudioLLM/FunMusic). Qwen2.5 backbone + flow-matching super-res. 1.3 k stars. Apache 2.0. **No MPS support, requires Flash Attention 2.6 + CUDA 11.8+** — effectively NVIDIA-only. Song-with-vocals models announced but not yet released; current ships are music-only/audio.
+---
+## Survey table — 2026 open-source song generators
+| Model | Builder | Release | Params | License | Vocals | Repo |
+|---|---|---|---|---|---|---|
+| SongGeneration 2 / LeVo 2 | Tencent AI Lab | 2026-03 | 4 B | Custom non-commercial | Yes, dual-track | [link](https://github.com/tencent-ailab/SongGeneration) |
+| HeartMuLa-oss-3B | HeartMuLa | 2026-01 | ~4 B + 2 B codec + 0.8 B ASR | Apache 2.0 | Yes, multilingual | [link](https://github.com/HeartMuLa/heartlib) |
+| DiffRhythm 2 | ASLP-Lab | 2025-10 → 2026-02 (v3) | undisclosed | Apache 2.0 | Yes | [link](https://github.com/ASLP-lab/DiffRhythm) |
+| ACE-Step 1.5 XL | ACE Studio × StepFun | 2026-01 | LM 0.6–4 B + DiT 2–4 B | MIT | Yes | [link](https://github.com/ace-step/ACE-Step-1.5) |
+| SongBloom | Tencent AI Lab | 2025-06 → 2025-10 | 2 B | Custom (likely non-commercial) | Yes | [link](https://github.com/tencent-ailab/SongBloom) |
+| YuE | M-A-P / HKUST | 2025-01 | up to 7 B | Apache 2.0 | Yes | [link](https://github.com/multimodal-art-projection/YuE) |
+| InspireMusic (FunMusic) | Alibaba FunAudioLLM | 2025-01 | 1.5 B | Apache 2.0 | Coming (music only today) | [link](https://github.com/FunAudioLLM/FunMusic) |
+| NotaGen / NotaGen-X | Central Conservatory + ElectricAlexis | 2025 | symbolic-only | MIT | n/a (ABC/XML) | [link](https://github.com/ElectricAlexis/NotaGen) |
+---
+## Dark horses / experimental
+- **NotaGen-X** — DeepSeek-R1-style RL on symbolic music. Outputs ABC/MusicXML (not audio). Could feed a TTS-vocal model for a hybrid composer → singer pipeline ([repo](https://github.com/ElectricAlexis/NotaGen), [arXiv](https://arxiv.org/abs/2502.18008)).
+- **LLaSA / LLaSA+** — Llama-3B-backbone TTS pipeline ([arXiv](https://arxiv.org/html/2508.06262v1)); not music, but emergent prosody good enough to consider as the vocal layer behind a NotaGen score.
+- **DiffRhythm+** — preference-optimised DiffRhythm variant, arXiv 2507.12890; mid-stage between v1 and v2.
+- **AudioX** — anything-to-audio DiT, 2503.10522; useful for sound design and SFX layering, not full-song.
+- **MelodyFlow** — text-controllable DiT with flow-matching for music editing.
+- **HeartMuse** — local Ollama-orchestrated lyric → HeartMuLa song app ([HN](https://news.ycombinator.com/item?id=46871828)); reference for building a thin product wrapper.
+---
+## Skeptic's bottom line for the M5 Max 128 GB build
+1. **For a commercial Suno-clone**: **HeartMuLa** (Apache 2.0, native MPS, 4 B fits easily, Feb-2026 checkpoint, modular components reusable) is the strongest pick. Verify their PER claims yourself before fundraising-style messaging.
+2. **For best raw quality, research only**: **SongGeneration 2 v2-large** — but the Tencent licence forbids commercial deployment and the v2 weights don't yet have a maintained MPS port. The community SongGen-Mac fork targets the older base/large.
+3. **For fastest iteration / smallest VRAM**: **ACE-Step 1.5 XL** (MIT, native Mac script, <4 GB VRAM). It promises less on vocal naturalness than HeartMuLa but ships today on Apple Silicon with the cleanest licence story.
+4. No reliable independent benchmark for these specific 2026 releases exists yet; the only neutral preference study found ([arXiv 2506.19085](https://arxiv.org/html/2506.19085v1)) stops at Suno v3.5 and does not cover LeVo, HeartMuLa, or ACE-Step. **Run your own blind A/B before betting a product on any vendor PER number.**
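One simple way to score such a blind A/B is an exact sign test on listener preferences. The sketch below assumes each listener hears an anonymised pair and picks one clip, with ties dropped beforehand. The function name `sign_test_p_value` is illustrative, and this is only one of several reasonable test choices.

```python
from math import comb


def sign_test_p_value(wins_a: int, wins_b: int) -> float:
    """Two-sided exact binomial (sign) test against a 50/50 null hypothesis."""
    n = wins_a + wins_b
    if n == 0:
        return 1.0  # no votes, no evidence either way
    k = min(wins_a, wins_b)
    # Probability of a split at least this lopsided in one direction when p = 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double for the two-sided test; cap at 1.0 for near-even splits.
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Hypothetical tallies, not measured data.
    for a, b in [(8, 2), (16, 4), (32, 8)]:
        print(f"A={a} B={b}: p = {sign_test_p_value(a, b):.4f}")
```

With 8 of 10 listeners preferring one model the p-value is about 0.11, so ten ratings are too few to call a winner. Plan for a larger panel before drawing product conclusions.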

research/05_apple_silicon_mps_audit.md ADDED

	@@ -0,0 +1,105 @@

+# Apple Silicon / MPS Compatibility Audit — Music Generation Models
+Hardware target: **M5 Max, 128 GB unified memory**. Date: 2026-05-18.
+Honest read: MPS is a second-class citizen for almost every music-gen repo. CUDA is the assumed default; Mac support, when it exists, is community-driven. Below is the per-model evidence with verdicts.
+---
+## 1. YuE (multimodal-art-projection/YuE)
+- **Official MPS support:** None. The README requires `cuda >= 11.8`, conda-installed `cudatoolkit=11.8`, and **flash-attn 2 is mandatory** to avoid OOM on long sequences ([YuE README](https://github.com/multimodal-art-projection/YuE/blob/main/README.md)).
+- **Community reports:** Issue #51 ("Instructions to run on Mac") is open and **unanswered** ([#51](https://github.com/multimodal-art-projection/YuE/issues/51)). No working Mac fork.
+- **Backend compatibility:** Hard CUDA dependency through flash-attn; xformers/triton flash paths are CUDA-only ([HF forum thread](https://discuss.huggingface.co/t/best-practices-to-use-models-requiring-flash-attn-on-apple-silicon-macs-or-non-cuda/97562)). Stage 1 (7B LLaMA-2-style) and Stage 2 (1B) both transformer-based; in principle portable, but no one has shipped it.
+- **Memory:** 7B + 1B + upsampler. Author recommends **≥80 GB VRAM** for full song; 24 GB OK for short clips. On 128 GB unified memory this fits, *if* you can swap flash-attn for SDPA.
+- **Apple-Silicon timing:** None reported.
+- **Verdict:** **Doesn't work out of the box. Likely broken on MPS.** Would need a non-trivial fork: strip flash-attn, replace with `torch.nn.functional.scaled_dot_product_attention`, and audit RoPE/KV-cache for MPS dtype quirks. There is also a "GPU Poor" fork ([deepbeepmeep/YuEGP](https://github.com/deepbeepmeep/YuEGP)) but it targets CUDA/ROCm with 8-bit quant — **no Mac path**.
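The flash-attn swap mentioned in the verdict could look roughly like the sketch below, which routes attention through PyTorch's built-in `torch.nn.functional.scaled_dot_product_attention`. This is not YuE's code: the wrapper name `attention`, the tensor shapes, and the smoke test are assumptions for illustration. flash-attn's kernels typically take `(batch, seq_len, heads, head_dim)`, whereas SDPA expects `(batch, heads, seq_len, head_dim)`, so a real port would also need a transpose at each call site.

```python
try:
    import torch
    import torch.nn.functional as F
except ImportError:  # PyTorch not installed; the smoke test below becomes a no-op
    torch = None


def attention(q, k, v, causal: bool = True):
    """Stand-in for a flash-attn call using PyTorch's built-in SDPA.

    q, k, v are expected as (batch, heads, seq_len, head_dim).
    """
    return F.scaled_dot_product_attention(q, k, v, is_causal=causal)


def smoke_test():
    """Run one tiny attention call on MPS if available, otherwise on CPU."""
    if torch is None:
        return None
    device = "mps" if torch.backends.mps.is_available() else "cpu"
    q = torch.randn(1, 4, 16, 32, device=device)
    out = attention(q, q, q)
    return tuple(out.shape)


if __name__ == "__main__":
    print(smoke_test())
```

SDPA has no CUDA-only build step, which is what makes it a candidate replacement on MPS. Memory use on long sequences may be higher than with flash-attn, so the OOM concern from the README would need to be re-tested.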
+## 2. DiffRhythm v1 and v2 (ASLP-lab)
+- **Official MPS support:** DiffRhythm v1 explicitly states *"DiffRhythm can now run on MacOS!"* with `brew install espeak-ng` ([Readme](https://github.com/ASLP-lab/DiffRhythm/blob/main/Readme.md)). No specific MPS notes, but it works.
+- **DiffRhythm 2:** `requirements.txt` is **clean of CUDA-only packages**: no flash-attn, xformers, triton, mamba_ssm, deepspeed, bitsandbytes ([requirements.txt](https://github.com/ASLP-lab/DiffRhythm2/blob/main/requirements.txt)). Just `torch==2.7`, `torchaudio==2.7`, `transformers`, `safetensors`, `muq`, `librosa`. The 3.9 % "CUDA" language stat in the repo is benign: it is auto-detected from a small kernel file, and there are no compiled extensions in the pip install path.
+- **Community reports:** No GitHub issues or Reddit threads surface specific MPS bugs for DiffRhythm — implying it either works quietly or no one has tried at scale. The architecture (latent diffusion + DiT with flow matching, very similar to Stable Audio Open / SD3) is the same class that *does* work on MPS via diffusers.
+- **Memory:** DiffRhythm-base needs **≥8 GB VRAM**; `--chunked` decoding reduces it further. Trivial on 128 GB.
+- **Apple-Silicon timing:** Not benchmarked publicly, but extrapolating from Stable Audio Open MPS (≈3× CPU speedup) the 285-second full-song run should land in the low minutes on M5 Max.
+- **Verdict:** **Just works on MPS (likely) / Works with workarounds.** Highest confidence pick.
+## 3. ACE-Step 1.5 (ace-step/ACE-Step-1.5)
+- **Official MPS support:** **First-class.** README explicitly advertises Mac + AMD + Intel + CUDA. macOS scripts auto-set `ACESTEP_LM_BACKEND=mlx --backend mlx` — the language-model side runs on Apple's **MLX**, the DiT side on **PyTorch MPS** ([INSTALL.md](https://github.com/ace-step/ACE-Step-1.5/blob/main/docs/en/INSTALL.md)). bfloat16 supported on MPS since PyTorch 2.4.
+- **Community reports:** Real-world M2 Air 16 GB run: 5–10 min per song, hit MPS-OOM, fixed with `PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0` ([bioerrorlog](https://en.bioerrorlog.work/entry/ace-step-15-local-m2-macbook)). A dedicated [clockworksquirrel/ace-step-apple-silicon](https://github.com/clockworksquirrel/ace-step-apple-silicon) fork already centralised MPS detection, swapped CUDA cache calls for `torch.mps.empty_cache()` / `torch.mps.synchronize()`, and tuned VAE conv1d tile sizes for Metal limits.
+- **Backend compatibility:** Flash-attn auto-disabled on MPS. `torch.compile` disabled on MPS. nanovllm not on Mac. Otherwise clean.
+- **Memory:** 4 GB DiT-only / 6 GB LLM+DiT minimum; ~10 GB total install.
+- **Apple-Silicon timing (M1 Pro 16 GB vs M3 Pro 36 GB vs A100, from the AS fork's benchmarks):**
+  | Task | M1 Pro | M3 Pro | A100 |
+  | --- | --- | --- | --- |
+  | 30 s turbo song | ~45 s | ~25 s | ~2 s |
+  | 30 s SFT song | ~3 min | ~1.5 min | ~8 s |
+  **Extrapolated M5 Max:** turbo ~10–15 s, SFT ~45–60 s for 30 s output. Best Mac-citizen of the bunch.
+- **Verdict:** **Just works on MPS.** Already production-grade on M-series.
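The MPS-OOM workaround from the community report amounts to setting one environment variable before PyTorch initialises the MPS backend. A minimal sketch follows; the helper name `relax_mps_watermark` is hypothetical, while `PYTORCH_MPS_HIGH_WATERMARK_RATIO` is the variable named in the report. Setting it before `import torch` is the simplest way to be sure it is picked up.

```python
import os


def relax_mps_watermark(ratio: str = "0.0") -> str:
    """Set the MPS high-watermark ratio unless the user already exported one.

    Returns the value that is in effect afterwards.
    """
    return os.environ.setdefault("PYTORCH_MPS_HIGH_WATERMARK_RATIO", ratio)


# Call this at the very top of the entry point, before PyTorch touches MPS.
relax_mps_watermark()

if __name__ == "__main__":
    print(os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"])
```

A ratio of `0.0` removes the allocator's upper limit, so a runaway allocation can push the whole system into memory pressure instead of raising a PyTorch error. On a 128 GB machine the default limit may never be reached, in which case the override is unnecessary.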
+## 4. SongGeneration 2 / LeVo 2 (Tencent)
+- **Official MPS support:** None. Official repo pins `flash-attn 2.7.4.post1` for CUDA 12 + torch 2.6, though `--not_use_flash_attn` flag exists ([Tencent SongGeneration](https://github.com/tencent-ailab/SongGeneration)).
+- **Community reports:** [Rdx-ai-art/SongGen-Mac](https://github.com/Rdx-ai-art/SongGen-Mac) fork — "Runs completely on your Mac's GPU via MPS on PyTorch." Tested on M1 Max 64 GB / macOS 15.7.2. **Pre-chorus block produces gibberish vocals** — known regression vs CUDA.
+- **Backend compatibility:** Hybrid LLM + diffusion architecture. Once flash-attn is stripped, the LLM side uses SDPA fine on MPS.
+- **Memory (Mac fork):** Base ≥24 GB RAM, ~70 GB total app RAM including swap during inference. Large ≥32 GB, hits ~80 GB. **On 128 GB M5 Max this fits cleanly without swap.**
+- **Apple-Silicon timing (M1 Max 64 GB):** Base ~4–6 min for ~2 min of audio. Large ~10–25 min for ~2:30. M5 Max should be roughly 2–3× faster (better mem bandwidth + more GPU cores).
+- **Verdict:** **Works with workarounds (community fork only).** Functional but watch the pre-chorus bug.
+## 5. HeartMuLa (HeartMuLa/heartlib)
+- **Official MPS support:** Not in the README. CUDA-first design with `--mula_device` / `--codec_device` flags ([heartlib](https://github.com/HeartMuLa/heartlib)). RTF ≈ 1.0 on CUDA.
+- **Community reports:** **Strong MLX port exists**: [Acelogic/heartlib-mlx](https://github.com/Acelogic/heartlib-mlx). Claims **2.1× faster than PyTorch MPS** on M2 Max (13.4 s vs 27.9 s end-to-end), 8.7× faster model load, 100 % numerical parity with PyTorch.
+- **Backend compatibility:** No flash-attn / mamba / triton in the official deps — clean transformer + neural codec. MLX port supports bfloat16.
+- **Memory (MLX port):** 3B model ~6 GB, HeartCodec ~2 GB, KV-cache ~1 GB/min of audio. **Full 1-min song ≈ 11 GB.** 32 GB minimum recommended; M5 Max 128 GB blows past this. 7B variant not yet released as of Feb 2026.
+- **Apple-Silicon timing:** M2 Max ≈ 11.6 s to generate 50 frames; M5 Max should comfortably exceed real-time for the 3B model.
+- **Verdict:** **Just works on MPS via MLX port.** Second-best Mac story after ACE-Step. The official PyTorch path is untested but should run on MPS once you bypass any CUDA cache calls.
+## 6. MusicGen (Meta / audiocraft) — reference
+- **Official MPS support:** None. AudioCraft officially supports CUDA or CPU only ([audiocraft README](https://github.com/facebookresearch/audiocraft)). Issues [#13](https://github.com/facebookresearch/audiocraft/issues/13) and [#31](https://github.com/facebookresearch/audiocraft/issues/31) are open requests, no merged PR. EnCodec decoder ops misbehave on MPS — common workaround is to **move decoder to CPU** while keeping the LM on MPS.
+- **Community / MLX:** Multiple solid ports — [Andrade Olivier's port](https://medium.com/@andradeolivier/i-ported-musicgen-to-apple-silicon-generate-music-from-text-on-your-macbook-9eaf95992053), [Nat Taylor's MusicGen MLX test](https://nattaylor.com/blog/2024/musicgen-via-mlx/). M4 Max: small model 8 s audio in ~6 s (faster than realtime). M1: ~60 s for 9 s of audio at 500 steps. AudioGen (sibling model) [works on MPS](https://blog.peddals.com/en/apple-mps-to-generate-audio-with-meta-audiogen/) by moving decoder ops to CPU.
+- **Memory:** 300 M small / 1.5 B medium / 3.3 B large. Trivial on 128 GB.
+- **Verdict:** **Partial on raw PyTorch MPS (CPU fallback for decoder); Just works via MLX port.**
+## 7. Stable Audio Open (Stability AI) — reference
+- **Official MPS support:** Diffusers supports `device="mps"` for the SAO pipeline ([Stable Audio docs](https://huggingface.co/docs/diffusers/en/api/pipelines/stable_audio)).
+- **Community reports:** [phlo.info](https://phlo.info/posts/using-stable-audio-tools-on-apple-silicon/) reports 51 s CPU → 17 s MPS by swapping `cuda` → `mps` in two files. **fp16 conv1d in the decoder is pathologically slow on MPS** — fix is `model.pretransform.model_half = False; model.to(torch.float32)` ([HF discussion](https://huggingface.co/stabilityai/stable-audio-open-small/discussions/1)).
+- **Memory:** 1.21 B params. Trivial.
+- **Apple-Silicon timing:** ~17 s per 3-s sample on M1-class; M5 Max should be a few seconds.
+- **Verdict:** **Works with workarounds** (force fp32 in decoder).
+---
+## Metal / MLX Apple-Native Equivalents
+- **ACE-Step**: Native MLX backend in the official repo for the LM side. **Closest thing to a first-party Mac music model.**
+- **HeartMuLa**: [heartlib-mlx](https://github.com/Acelogic/heartlib-mlx) — 2.1× speedup over PyTorch MPS, full numerical parity.
+- **MusicGen**: Multiple MLX ports, faster than real-time on M4 Max small model.
+- **Stable Audio Open**: MLX-Audio family ([Blaizzy/mlx-audio](https://github.com/Blaizzy/mlx-audio)) covers TTS/STT; SAO has unofficial MLX ports.
+- **YuE / DiffRhythm / SongGeneration**: **No MLX ports** as of May 2026.
+There is no umbrella "MLX-music" framework; each project rolls its own port.
+---
+## Practical Recommendation
+**Start with ACE-Step 1.5.** It is the only model with first-party Apple Silicon support, hybrid MLX + MPS execution, published M-series benchmarks, and no CUDA-only dependencies. With 128 GB of unified memory, the user does not need the OOM workarounds that Mac users on 16–36 GB machines rely on.
+**Second pick: HeartMuLa via the MLX port** ([heartlib-mlx](https://github.com/Acelogic/heartlib-mlx)). Faster than the PyTorch MPS path, bfloat16, well-benchmarked. 3B only for now; 7B unreleased.
+**Third pick: DiffRhythm v2** — clean deps, README claims macOS support, similar architecture class to Stable Audio Open which is known to work on MPS with the fp32 decoder workaround.
+**Avoid on MPS unless you enjoy yak-shaving:**
+- **YuE** — flash-attn-mandatory, no Mac fork, no MLX port.
+- **SongGeneration / LeVo** — only via [SongGen-Mac](https://github.com/Rdx-ai-art/SongGen-Mac) fork, pre-chorus bug, 70+ GB RAM pressure with swap. Workable on 128 GB but not pleasant.
+**Remote-dev path:** For YuE specifically, **train/develop on a rented H100 or A100** (RunPod, Lambda, Modal, Replicate) and pull weights for inference on M5 Max **only if** you fork it to drop flash-attn. Otherwise treat YuE as a remote-only model. For everything else on this list, M5 Max is sufficient as the primary development machine.
+**On the user's prior LTX-Video burns:** music models are LM/diffusion stacks without the multi-modal Gemma + complex64 + SDPA-on-meta-tensor traps that bit LTX-2.3. The main MPS gotchas here are mundane: flash-attn substitution, slow fp16 conv1d in audio decoders, and setting `PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0` to disable the MPS allocator's high-watermark limit (which also removes its guard against system-wide OOM).
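The allocator setting mentioned above can be collected into a small startup helper. This is a minimal sketch, not code from any of the projects surveyed: `configure_mps_env` and `pick_device` are hypothetical names, and `PYTORCH_ENABLE_MPS_FALLBACK` (CPU fallback for ops that MPS lacks) is an extra PyTorch variable assumed to be useful here, not something the audit tested. Both variables should be set before `torch` is first imported.

```python
import os


def configure_mps_env(env=None):
    """Set MPS-related variables without overriding values already chosen."""
    env = os.environ if env is None else env
    # 0.0 disables the MPS allocator's high-watermark limit.
    env.setdefault("PYTORCH_MPS_HIGH_WATERMARK_RATIO", "0.0")
    # Assumption: let ops that MPS does not implement fall back to CPU.
    env.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")
    return env


def pick_device():
    """Return "mps", "cuda", or "cpu" depending on what this machine offers."""
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.backends.mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


if __name__ == "__main__":
    configure_mps_env()
    print(pick_device())
```

Using `setdefault` keeps any value exported in the shell, so a developer can still experiment with a non-zero ratio without editing code.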

research/06_comparison_matrix.md ADDED

	@@ -0,0 +1,93 @@

+# Open-Source Song Generation Models — Side-by-Side Comparison
+*Compiled 2026-05-18 for M5 Max / 128 GB unified memory target.*
+---
+## Headline matrix
+| Property | **ACE-Step 1.5 XL** | **HeartMuLa 4B** | **DiffRhythm 2** | **YuE 7B** | SongGeneration 2 |
+|---|---|---|---|---|---|
+| **Builder** | ACE Studio × StepFun | HeartMuLa | NWPU ASLP-lab + Xiaomi | M-A-P / HKUST | Tencent AI Lab |
+| **Release** | 2026-01-28 | 2026-01-19 | 2025-10-27 → 2026-02-03 (v3) | 2025-01-26 | 2026-03-01 |
+| **License** | **MIT** | **Apache 2.0** | **Apache 2.0** | **Apache 2.0** | **Custom NON-commercial** |
+| **Repo stars** | 10.4 k | 3.6 k | ~2.3 k (v1) + 0.16 k (v2) | 6.2 k | 1.6 k |
+| **Last major commit** | v0.1.7 (2026-04-24) | 2026-02 | 2026-02 | 2025-06-04 (stale) | 2026-03-01 |
+| **Architecture** | LM-planner (Qwen3 0.6/1.7/4 B) + DiT (2/4 B) | CLAP + ASR + 12.5 Hz Codec + 4 B LLM | 5 Hz Music VAE + DiT w/ block flow matching | LLaMA2 7B AR Stage-1 + 1B Stage-2 + X-Codec | LeLM hybrid + diffusion decoder |
+| **Params (largest)** | up to 8 B (4 B DiT + 4 B LM) | ~4 B + 2 B codec + 0.8 B ASR | ~1 B DiT + 170 M VAE-dec | 7 B + 1 B + upsampler | 4 B (v2-large) |
+| **Audio rate** | 44.1 kHz stereo | 24 kHz neural codec | 44.1 kHz stereo | 16 kHz then upsampled | High-fi via diffusion |
+| **Max length** | 4+ min | ≥1 min, scaling | **210 s (regression from v1)** | 5 min | 4:30 |
+| **Vocals + Instruments** | ✅ Native | ✅ Native | ✅ Native, single stream | ✅ Native, dual-track AR | ✅ Dual-track |
+| **Languages** | 50+ | 5+ (en/zh/ja/ko/es benchmarked) | Bilingual EN/ZH + JP/KR/ES marketing-only | EN, Mandarin, Cantonese, JP, KR | zh/en/es/ja + others |
+| **VRAM (minimum)** | **<4 GB** with offload (turbo) | 6 GB 4-bit / 12 GB bf16 | 8 GB v1 with `--chunked` | 24 GB consumer / 80 GB single-pass | 22–28 GB |
+| **VRAM (recommended)** | 12 GB+ offload, 24 GB optimal | 24 GB for 7B (unreleased) | 24 GB | 80 GB H100/H800 | 28 GB |
+| **MPS / Apple Silicon** | **First-class, MLX + MPS, dedicated fork** | **MLX port, 2.1× PyTorch MPS** | Likely OK; clean deps; untested | ❌ Mandatory flash-attn | Community fork, pre-chorus bug |
+| **MPS bench M-series (30 s clip)** | M3 Pro 25 s turbo / 1.5 min SFT | M2 Max 11.6 s for 50 frames | not published | not published | M1 Max 4–6 min for 2 min |
+| **MPS bench M5 Max (projected)** | turbo ~10–15 s / SFT ~45–60 s | <real-time | low-minute range | n/a | ~2–3× M1 Max |
+| **Speed (RTF on A100 / 4090)** | sub-2 s/song on A100 (v1.5) | RTF ≈ 1.0 | v2 RTF 0.213 (4090) → ~45 s for 210 s | 27 steps RTF 27.27× on A100 (v1, ~15 min/song) | RTF 0.82 (H20) |
+| **Vocal naturalness vs Suno v4** | **4.4/5 vs 4.1/5** (blind 50-person test) | Vendor only, unverified | Authors admit clear gap vs v4.5 | Comparable vocal range; weaker mix | Vendor claim parity, unverified |
+| **Lyric alignment (PER)** | Strong (lyric tags) | Vendor: 0.09 EN / 0.12 ZH (unit mismatch) | **0.13 (open-source SOTA)** | Strong from lyric tags | Vendor: 8.55 % |
+| **Fine-tuning support** | ✅ LoRA, 8 songs/1h on 3090, **MPS-validated** | ❌ No public training code | ❌ "Coming soon" since Mar 2025 | ✅ LoRA (Megatron pipeline, CUDA 12.1+) | ❌ |
+| **ComfyUI integration** | ✅ Native, official workflows | ✅ FL-HeartMuLa | ✅ billwuhao/ComfyUI_DiffRhythm | ✅ smthemex/ComfyUI_YuE | ✅ |
+| **Replicate hosted** | ❌ no first-party | ❌ | ❌ | ✅ fofr/yue | ❌ |
+| **Style/audio reference** | LoRA + lyric tags | Reference audio supported | Reference audio supported | ICL mode (style cloning) | Limited |
+| **Stem separation** | Built into `fspecii/ace-step-ui` via Demucs | Modular Codec is reusable | ❌ single stream | ✅ AR dual-track is inherently separable | ✅ Dual-track output |
+| **Continuation / extension** | Supported in workflows | Limited | Supported | ✅ explicit continuation mode | Supported |
+| **Production deployments** | acestep.io, ace-step.app, fspecii/ace-step-ui, AMD-blessed | WaveSpeed AI, HeartMuse local app | Chutes serverless | Replicate fofr/yue, HF Spaces | WaveSpeed AI, HF Space |
+| **Watermarking / content credentials** | None baked-in | None baked-in | None baked-in | None baked-in | None baked-in |
+| **License gotchas** | None (MIT) | None (Apache 2.0) | Ethical disclaimer (non-binding) | Attribution required ("YuE by HKUST/M-A-P"), label "AI-generated" | **Commercial use prohibited** |
+| **Independent benchmarks** | Yes — 50-person blind test, AMD vendor-validated | None located | Internal MOS only | Paper + community | None — Tencent only |
+---
+## Quality dimensions (qualitative)
+| Dimension | Best (open source) | Notes |
+|---|---|---|
+| **Pop / EDM polish** | (none — Suno v4/v5 still wins) | All open models lag commercial. |
+| **Folk / classical / jazz vocal naturalness** | **ACE-Step 1.5 XL** | Wins blind test vs Suno v4 in these genres. |
+| **Lyric intelligibility (PER)** | **DiffRhythm 2** (0.13) | HeartMuLa claims lower but unit-incomparable. |
+| **Musical macro-structure (verse/chorus/bridge over 3-5 min)** | **YuE** or **ACE-Step 1.5** (planner) | LM-planner models lead diffusion-only here. |
+| **Stereo image, mix depth** | **DiffRhythm 2** (44.1 kHz stereo native) | YuE is mono-ish; ACE-Step is stereo but variable. |
+| **Genre breadth** | **YuE** | Death-growl metal to Beijing opera to rap. |
+| **Multilingual breadth** | **ACE-Step 1.5** | 50+ languages w/ lyric tags; YuE deep on 5 only. |
+| **Code-switching (English ↔ Mandarin in one song)** | **YuE** | Explicit demos. |
+| **Speed / cost per song** | **ACE-Step 1.5** | Sub-2 s/song on A100; <minute on M5 Max. |
+| **Modular reusability of components** | **HeartMuLa** | Codec/ASR/CLAP separately exportable. |
+---
+## Cost model (rough)
+| Path | Per-song cost | Latency | Best for |
+|---|---|---|---|
+| Self-host ACE-Step 1.5 on M5 Max | $0 marginal (electricity) | ~30-50 s | Dev, beta, low-volume |
+| Self-host ACE-Step 1.5 on rented A100 80 GB | ~$0.0008 (2 s × $1.50/hr ÷ 3600 s/hr) | <2 s | Production, paid SaaS |
+| Replicate `fofr/yue` | ~$0.30-1.00 per song (estimated from 4090 cog runtime) | 5-15 min | Multilingual fallback, occasional |
+| Self-host DiffRhythm 2 on 4090 | $0 marginal on owned 4090 | ~45 s | Speed tier, instrumentals |
+| Replicate / WaveSpeed managed endpoints | varies | varies | Cold-start / spike capacity |
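The rented-GPU figures in a table like this reduce to two small formulas: cost is GPU-seconds times the per-second rental rate, and wall-clock time is RTF times audio duration (taking RTF as compute time divided by audio length). The sketch below is only a calculator for checking such numbers; the function names are illustrative, and the inputs are the rough figures quoted in this document, not measurements.

```python
def cost_per_song(gpu_seconds, hourly_rate_usd):
    """Rental cost of one generation: seconds used times the per-second rate."""
    return gpu_seconds * hourly_rate_usd / 3600.0


def generation_seconds(rtf, audio_seconds):
    """Wall-clock time when RTF is defined as compute time / audio duration."""
    return rtf * audio_seconds


if __name__ == "__main__":
    print(f"A100 at $1.50/hr, 2 s per song: ${cost_per_song(2, 1.50):.5f}")
    print(f"RTF 0.213 on a 210 s track: {generation_seconds(0.213, 210):.1f} s")
```

At $1.50/hr, a 2 s generation comes to a little under a tenth of a cent, and an RTF of 0.213 on a 210 s track works out to about 45 s.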
+---
+## License risk matrix
+| License | Commercial SaaS | Output ownership | Risk |
+|---|---|---|---|
+| MIT (ACE-Step 1.5) | ✅ | User owns | Lowest |
+| Apache 2.0 (ACE-Step v1, HeartMuLa, DiffRhythm v1/v2, YuE) | ✅ with attribution | User owns | Low |
+| Tencent custom (SongGeneration, SongBloom) | ❌ **prohibited** | n/a | **Blocks SaaS** |
+| Suno API (closed-source baseline) | $ paid tier | platform terms | Medium |
+---
+## Hardware sizing on M5 Max (128 GB unified memory)
+| Model | Fits? | Headroom | Notes |
+|---|---|---|---|
+| ACE-Step 1.5 XL (4 B DiT + 4 B planner) | ✅ huge | ~120 GB free | Overkill; LoRA training viable in-RAM |
+| HeartMuLa 4B + 2 B codec + 0.8 B ASR | ✅ huge | ~120 GB free | 7 B variant when released will also fit |
+| DiffRhythm 2 (~1 B + 170 M VAE-dec) | ✅ trivial | ~125 GB free | Tiny by 2026 standards |
+| YuE 7B Stage-1 + 1B Stage-2 + upsampler | ✅ but blocked | n/a | Memory fine, **flash-attn dep blocks MPS** |
+| SongGeneration 2-large (4 B + diffusion) | ✅ comfortable | ~100 GB free | Community fork bug aside, fits |
+**Conclusion:** the user's 128 GB unified memory completely eliminates memory pressure for every model in this list. The constraint is software (MPS kernel compat, flash-attn substitution), not hardware.

research/07_platform_architecture.md ADDED

	@@ -0,0 +1,200 @@

+# Suno-Clone Platform Architecture — Build Plan
+*Compiled 2026-05-18. Target hardware: Apple M5 Max, 128 GB unified memory. Core model decision: ACE-Step 1.5 XL.*
+---
+## Mental model
+Suno (and Udio) are not just a song-generation model. They are a **product stack** with at least five distinct AI components and a few non-AI scaffolds. If we want to replicate the product experience, we have to plan for all of them. The song-gen model is the headline; everything else is what makes it usable.
+```
+                ┌─────────────────────────────────────┐
+                │           Web / mobile UI           │
+                │  (text prompt + style + lyrics)     │
+                └─────────────────────────────────────┘
+                                │
+                                ▼
+┌──────────────────────────────────────────────────────────┐
+│                    Orchestrator API                       │
+│   - prompt routing, queue, billing, history, sharing      │
+└──────────────────────────────────────────────────────────┘
+                  │            │            │            │
+                  ▼            ▼            ▼            ▼
+        ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
+        │  Lyrics LLM │ │  Style/Tag  │ │  Song-gen   │ │  Voice       │
+        │  (Llama 3.3 │ │  rewriter   │ │  router     │ │  cloning     │
+        │   or Qwen)  │ │  (small LM) │ │             │ │  (RVC)       │
+        └─────────────┘ └─────────────┘ └──────┬──────┘ └─────────────┘
+                                               │
+                                               ▼
+                            ┌─────────────────────────────────┐
+                            │  Model pool (the actual research)│
+                            │   - ACE-Step 1.5 XL (default)   │
+                            │   - HeartMuLa-MLX (A/B)         │
+                            │   - DiffRhythm 2 (speed tier)   │
+                            │   - YuE on Replicate (intl.)    │
+                            └─────────────────────────────────┘
+                                               │
+                                               ▼
+                            ┌─────────────────────────────────┐
+                            │   Post-processing pipeline      │
+                            │   - Loudness normalization      │
+                            │   - Demucs stem separation      │
+                            │   - Watermarking (audible+meta) │
+                            │   - FFmpeg encoding → m4a/mp3   │
+                            └─────────────────────────────────┘
+                                               │
+                                               ▼
+                            ┌─────────────────────────────────┐
+                            │   Storage + streaming           │
+                            │   - S3 / R2 origin              │
+                            │   - HLS for in-browser playback │
+                            │   - CDN                         │
+                            └─────────────────────────────────┘
+```
+---
+## Component-by-component plan
+### 1. Song generation — primary model
+- **ACE-Step 1.5 XL** via [`clockworksquirrel/ace-step-apple-silicon`](https://github.com/clockworksquirrel/ace-step-apple-silicon) on M5 Max.
+- Hybrid backend: Qwen3 planner on **MLX**, DiT decoder on **PyTorch MPS**, bf16 throughout.
+- Why XL over standard 2B: 128 GB unified eats the cost, and the 4 B DiT closes meaningful quality gaps for paying users.
+**LoRA fine-tuning path (when needed):**
+- Document the platform's target genres → curate ~50–200 song lyric/audio pairs per genre.
+- Train a per-genre LoRA on the 3090-class budget (~1 hour per LoRA per [`ace-step-1.5 README`](https://github.com/ace-step/ACE-Step-1.5)).
+- Serve via the same inference pipeline with LoRA hot-swap.
+**Fallback / A-B candidates:**
+- **HeartMuLa-MLX** ([`Acelogic/heartlib-mlx`](https://github.com/Acelogic/heartlib-mlx)) — 2.1× faster than PyTorch MPS, full numerical parity, Apache 2.0.
+- **DiffRhythm 2** ([`ASLP-lab/DiffRhythm`](https://github.com/ASLP-lab/DiffRhythm)) — for the speed/instrumental tier (210 s ceiling acceptable for short-form features like background loops).
+- **YuE via Replicate** ([`replicate.com/fofr/yue`](https://replicate.com/fofr/yue/api)) — only for EN+Mandarin+Cantonese+JP+KR generations that ACE-Step underperforms; pay-per-second, no local infra cost.
+### 2. Lyrics generation — separate LLM
+The song-gen model takes **lyrics + style** as input, not raw user prompts. Suno's "song description" flow is actually two stages: prompt → lyrics LLM → lyrics → song model.
+- Use any decent open LLM running on the user's M5 Max. Candidates:
+  - **Qwen 2.5 Coder 32B / Qwen 3 8B** — good multilingual chops, fast on Apple Silicon via Ollama or mlx-lm.
+  - **Llama 3.3 70B 4-bit** — premium tier; fits comfortably in 128 GB unified.
+  - **GPT-OSS-20B** — Apache 2.0, sturdy English.
+- Prompt template should:
+  1. Parse user style hint into tags (genre, tempo, mood, instruments).
+  2. Output structured lyrics with `[verse]`, `[chorus]`, `[bridge]`, `[outro]` markers — these are **exactly the structural tags ACE-Step's `TextEncodeAceStepAudio` consumes**.
+  3. Constrain section count and line count to roughly match the target song duration.
+**This LLM is independent of the song-gen model and can be swapped freely.**
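Because the lyrics LLM is swappable, it helps to validate its output before handing it to the song model. The following is a minimal sketch of such a check, assuming the four section markers listed above appear alone on their own lines; `parse_sections`, `within_budget`, and the 4-seconds-per-line estimate are illustrative placeholders, not part of ACE-Step or any lyrics model.

```python
import re

SECTION_TAGS = ("verse", "chorus", "bridge", "outro")
_TAG_RE = re.compile(r"^\[(\w+)\]$")


def parse_sections(lyrics):
    """Split tagged lyrics into (tag, lines) pairs, rejecting unknown tags."""
    sections = []
    for raw in lyrics.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = _TAG_RE.match(line)
        if match:
            tag = match.group(1).lower()
            if tag not in SECTION_TAGS:
                raise ValueError(f"unknown section tag: [{tag}]")
            sections.append((tag, []))
        elif not sections:
            raise ValueError("lyrics must start with a section tag")
        else:
            sections[-1][1].append(line)
    return sections


def within_budget(sections, duration_s, seconds_per_line=4.0):
    """Rough check that the line count can fit in the target duration."""
    total_lines = sum(len(lines) for _, lines in sections)
    return total_lines * seconds_per_line <= duration_s
```

A failed check can simply trigger a retry of the LLM call with a tighter line limit, which keeps the constraint logic outside the prompt itself.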
+### 3. Style / tag normalization
+A small classifier or 3 B LM that normalizes user free-text into the controlled-vocabulary tag set the song model was trained on (per genre, BPM bucket, vocal gender, mood). For ACE-Step this maps to its lyric-tag schema; for YuE it maps to `top_200_tags.json`.
+Implementation: few-shot prompt to the lyrics LLM with examples; cache results.
+### 4. Voice cloning / personas (optional but Suno-equivalent)
+To match Suno's "Personas" feature:
+- **RVC v2** (Retrieval-based Voice Conversion) — open source, fast, runs on MPS, well-supported.
+- Train on a 5-minute reference clip (10–15 min on M5 Max) to produce the speaker embedding.
+- Apply to the generated vocal stem (Demucs-extracted) → remix.
+ACE-Step's **ICL mode** (in-context learning from a reference clip) and YuE's ICL variants partly cover this too, but RVC gives explicit per-speaker control.
+### 5. Stem separation
+For Suno's "download stems" feature:
+- **Demucs v4 / HTDemucs** — open source (MIT), runs on MPS, separates into vocals / drums / bass / other.
+- Already bundled in [`fspecii/ace-step-ui`](https://github.com/fspecii/ace-step-ui).
+### 6. Mastering / loudness normalization
+- **pyloudnorm** for LUFS normalization to streaming spec (-14 LUFS for Spotify, -16 LUFS for Apple Music).
+- **ffmpeg-normalize** as a CLI wrapper.
+- **Optional: TBProAudio mvMeter / Voxengo Span equivalents** via web-audio for UI metering.
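The normalization step itself is simple arithmetic once the integrated loudness is known. The sketch below assumes the measured LUFS value comes from a meter such as pyloudnorm (not called here) and shows only the gain calculation; `gain_to_target` and `apply_gain` are hypothetical helper names.

```python
def gain_to_target(measured_lufs, target_lufs=-14.0):
    """Return (gain_db, linear_factor) that moves a track to the target loudness."""
    gain_db = target_lufs - measured_lufs
    linear = 10.0 ** (gain_db / 20.0)
    return gain_db, linear


def apply_gain(samples, linear):
    """Scale float samples and report whether any would clip beyond [-1, 1]."""
    scaled = [s * linear for s in samples]
    clipped = any(abs(s) > 1.0 for s in scaled)
    return scaled, clipped
```

The clipping flag matters because a quiet track pushed up to the target can exceed full scale; a real mastering chain would follow this with a limiter or a lower target.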
+### 7. Watermarking + content credentials
+This is a **legal must-have** for any 2026 generative-music product (training-data lawsuits against Suno/Udio set the precedent).
+- **Inaudible audio watermark**: AudioSeal (Meta) or SilentCipher (Sony), both open source and built to survive MP3 transcoding.
+- **C2PA metadata**: sign the m4a with model name + version + prompt + timestamp via the C2PA SDK.
+- **Visible "AI-generated" tag** in UI per the YuE model card's recommendation (and increasingly per platform policy).
+### 8. Storage and streaming
+- **S3-compatible object store** (R2, Backblaze B2, or self-hosted MinIO on the M5 Max if dev-only).
+- **HLS encoding pipeline**: ffmpeg → m3u8 + 4 s segments; serve via NGINX or Cloudflare.
+- For local dev, plain m4a + range requests are fine.
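The HLS step above can be sketched as an ffmpeg argument builder. This is an illustrative sketch under stated assumptions: the function name, the AAC bitrate, and the `seg_%03d.ts` / `index.m3u8` file names are placeholders chosen here, and the command is only constructed, not executed.

```python
from pathlib import Path


def hls_command(src, out_dir, segment_seconds=4, bitrate="192k"):
    """Build an ffmpeg argument list that writes an audio-only VOD playlist."""
    out = Path(out_dir)
    return [
        "ffmpeg", "-y",
        "-i", str(src),
        "-vn",                      # drop any cover-art video stream
        "-c:a", "aac", "-b:a", bitrate,
        "-f", "hls",
        "-hls_time", str(segment_seconds),
        "-hls_playlist_type", "vod",
        "-hls_segment_filename", str(out / "seg_%03d.ts"),
        str(out / "index.m3u8"),
    ]
```

Passing the list to `subprocess.run(cmd, check=True)` avoids shell quoting issues with user-supplied song titles in file paths.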
+### 9. Orchestrator API
+- **FastAPI** for the request-handling layer.
+- **Redis Streams** or **Hatchet** for the generation queue (songs are 30 s–2 min jobs on M5 Max — non-trivial latency, must be async).
+- **PostgreSQL** for users, songs, lyrics, LoRAs, billing.
+- **Server-Sent Events** for progress streaming back to the UI ("planner stage", "DiT denoising step 14/27", "mastering...").
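The progress stream described above comes down to emitting correctly framed SSE messages. The sketch below shows the wire format only; the event names and payload fields are assumptions made for illustration, not a schema defined by ACE-Step or any UI.

```python
import json


def format_sse(event, data):
    """Serialize one Server-Sent Events message (event name plus JSON payload)."""
    payload = json.dumps(data)
    return f"event: {event}\ndata: {payload}\n\n"


def progress_events(total_steps):
    """Yield SSE messages for a planner stage, each denoising step, and mastering."""
    yield format_sse("stage", {"name": "planner"})
    for step in range(1, total_steps + 1):
        yield format_sse("progress", {"name": "dit", "step": step, "of": total_steps})
    yield format_sse("stage", {"name": "mastering"})
    yield format_sse("done", {})
```

In FastAPI, a generator like this could be returned through `StreamingResponse` with `media_type="text/event-stream"`; in a real service the steps would be driven by queue updates, not a loop.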
+### 10. Frontend
+- **Next.js 16** + Cache Components for the user dashboard / library.
+- **Wavesurfer.js** for waveform display and scrubbing.
+- **Tone.js** for any in-browser preview / mixing.
+- Auth via Clerk or Auth0 — the user's portfolio revamp may already include this.
+---
+## Build order (incremental milestones)
+| Milestone | Scope | Validates |
+|---|---|---|
+| **M0 — Spike** | Get ACE-Step 1.5 XL running locally via clockworksquirrel fork; generate one 30 s song end-to-end | Hardware compatibility, RTF on M5 Max |
+| **M1 — CLI MVP** | Wrap in a Python CLI: `genmusic --prompt "..." --lyrics "..." --out song.m4a` | Headless generation, mastering chain, file output |
+| **M2 — Local UI** | Adopt `fspecii/ace-step-ui` as the initial UI (fastest path); add Demucs stem download | Browser flow, multi-song library, LAN access |
+| **M3 — Lyrics LLM integration** | Plug Qwen 3 / Llama 3.3 as the lyrics generator; produce structured lyrics from a one-line prompt | Suno-equivalent prompt UX |
+| **M4 — Multi-model router** | Add HeartMuLa-MLX as alternate; add Replicate YuE as multilingual fallback; user can pick or auto-route | A/B capability, breadth |
+| **M5 — LoRA pipeline** | First custom LoRA on a target genre (e.g., user's preferred style); hot-swap at inference | Differentiation vs Suno |
+| **M6 — Production wrapper** | FastAPI + Postgres + queue + auth + watermarking + C2PA signing | Real product surface |
+| **M7 — Deploy** | Move heavy inference behind a rented A100 endpoint for paid users; keep M5 Max for free tier / personal use | Paid-tier economics |
+---
+## Open questions for the user before M0
+1. **Commercial intent.** Is this a personal portfolio project (research mode → SongGeneration 2 is fair game) or a real SaaS (must stay Apache/MIT)? The license map changes drastically.
+2. **Target audience.** Western pop (where Suno still wins polish) vs world music / experimental genres (where ACE-Step / YuE compete fairly)?
+3. **Latency target.** Suno generates in ~30 s; users tolerate up to 90 s. ACE-Step on M5 Max hits this; YuE local does not.
+4. **Hosting plan.** Local-only for personal use? Or eventually paid tier on rented GPU?
+5. **Vocal cloning.** Is Suno-style "Persona" upload a must-have v1 feature, or v2?
+6. **Catalog / training data.** Any in-house licensed song catalog for LoRA fine-tuning, or strictly the open-weights model out of the box?
+---
+## Risks and mitigations
+| Risk | Likelihood | Mitigation |
+|---|---|---|
+| MPS regression in a future PyTorch release breaks ACE-Step | medium | Pin torch version; keep CPU fallback path. |
+| ACE-Step releases v2 with breaking API mid-build | medium | Wrap inference in a thin adapter; abstract model behind a single `Generator.generate()` interface. |
+| Vendor PER claims (HeartMuLa, LeVo) overstated → quality disappointment | medium | Run internal blind A/B on 20+ prompts before featuring a model in the UI. |
+| Output watermark stripped by transcoding | low | Use AudioSeal, which survives MP3; double-stamp with C2PA metadata. |
+| Lyrics LLM hallucinates copyrighted hooks | medium | Run a similarity check against an embeddings index of known songs; flag for human review. |
+| Training-data IP suit (Suno-style) | low for derivative usage | Use models with documented public-data training (ACE-Step's paper is reasonably transparent); avoid Tencent's non-commercial weights. |
+| MPS OOM on long sequences | low (128 GB) | `PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0`; chunk generation; offload non-active LoRAs. |
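The adapter mitigation in the table (a single `Generator.generate()` interface) can be sketched as a small protocol plus a fallback router. Everything here is a hypothetical design sketch: `SongRequest`, `Router`, and `StubGenerator` are illustrative names, and real adapters would wrap the actual model pipelines, whose APIs are not shown.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass
class SongRequest:
    lyrics: str
    tags: str
    duration_s: float


class Generator(Protocol):
    name: str

    def generate(self, request: SongRequest) -> bytes:
        """Return encoded audio for the request, or raise on failure."""
        ...


class StubGenerator:
    """Stand-in backend for tests; real adapters would wrap a model pipeline."""

    def __init__(self, name, fail=False):
        self.name = name
        self.fail = fail

    def generate(self, request):
        if self.fail:
            raise RuntimeError("unavailable")
        return f"{self.name}:{request.tags}".encode()


class Router:
    """Try each backend in order and return the first successful result."""

    def __init__(self, backends: Sequence[Generator]):
        self.backends = list(backends)

    def generate(self, request: SongRequest):
        errors = []
        for backend in self.backends:
            try:
                return backend.name, backend.generate(request)
            except Exception as exc:  # a real adapter would catch narrower errors
                errors.append(f"{backend.name}: {exc}")
        raise RuntimeError("all backends failed: " + "; ".join(errors))
```

Keeping the interface this narrow means a breaking upstream release only touches one adapter class, and the same router can later serve the multi-model milestone without changes to callers.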
+---
+## Why ACE-Step 1.5 XL is the foundation (not just a model pick)
+This is worth saying explicitly. Choosing the base model determines:
+1. **Inference budget and unit economics** — ACE-Step is the only model where <2 s/song on A100 makes a paid tier economically obvious.
+2. **Mac developer ergonomics** — first-class MPS means the user can iterate on the M5 Max for weeks without renting cloud GPU.
+3. **License-clean output ownership** — MIT means users own their songs unambiguously.
+4. **Future-proof on multilingual** — 50+ languages out of the box matters if the platform grows beyond an English audience.
+5. **LoRA personalization is the differentiator** — fine-tuning support that works on MPS lets the user ship genre-specialist sub-models that Suno can't, because Suno's weights are locked.
+6. **Production deployments exist** — AMD vendor-backed, `fspecii/ace-step-ui` running at scale, multiple SaaS already on the open weights. This is not betting on a research artifact.
+The compound effect of those six is why ACE-Step is recommended as the platform foundation rather than just "the model to start with."