Spaces:
Running on Zero
Running on Zero
File size: 13,949 Bytes
9071450 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 | # Suno-Clone Platform Architecture β Build Plan
*Compiled 2026-05-18. Target hardware: Apple M5 Max, 128 GB unified memory. Core model decision: ACE-Step 1.5 XL.*
---
## Mental model
Suno (and Udio) are not just a song-generation model. They are a **product stack** with at least five distinct AI components and a few non-AI scaffolds. If we want to replicate the product experience, we have to plan for all of them. The song-gen model is the headline; everything else is what makes it usable.
```
βββββββββββββββββββββββββββββββββββββββ
β Web / mobile UI β
β (text prompt + style + lyrics) β
βββββββββββββββββββββββββββββββββββββββ
β
βΌ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Orchestrator API β
β - prompt routing, queue, billing, history, sharing β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β β β β
βΌ βΌ βΌ βΌ
βββββββββββββββ βββββββββββββββ βββββββββββββββ βββββββββββββββ
β Lyrics LLM β β Style/Tag β β Song-gen β β Voice β
β (Llama 3.3 β β rewriter β β router β β cloning β
β or Qwen) β β (small LM) β β β β (RVC) β
βββββββββββββββ βββββββββββββββ ββββββββ¬βββββββ βββββββββββββββ
β
βΌ
βββββββββββββββββββββββββββββββββββ
β Model pool (the actual research)β
β - ACE-Step 1.5 XL (default) β
β - HeartMuLa-MLX (A/B) β
β - DiffRhythm 2 (speed tier) β
β - YuE on Replicate (intl.) β
βββββββββββββββββββββββββββββββββββ
β
βΌ
βββββββββββββββββββββββββββββββββββ
β Post-processing pipeline β
β - Loudness normalization β
β - Demucs stem separation β
β - Watermarking (audible+meta) β
β - FFmpeg encoding β m4a/mp3 β
βββββββββββββββββββββββββββββββββββ
β
βΌ
βββββββββββββββββββββββββββββββββββ
β Storage + streaming β
β - S3 / R2 origin β
β - HLS for in-browser playback β
β - CDN β
βββββββββββββββββββββββββββββββββββ
```
---
## Component-by-component plan
### 1. Song generation β primary model
- **ACE-Step 1.5 XL** via [`clockworksquirrel/ace-step-apple-silicon`](https://github.com/clockworksquirrel/ace-step-apple-silicon) on M5 Max.
- Hybrid backend: Qwen3 planner on **MLX**, DiT decoder on **PyTorch MPS**, bf16 throughout.
- Why XL over standard 2B: 128 GB unified eats the cost, and the 4 B DiT closes meaningful quality gaps for paying users.
**LoRA fine-tuning path (when needed):**
- Document the platform's target genres β curate ~50β200 song lyric/audio pairs per genre.
- Train a per-genre LoRA on the 3090-class budget (~1 hour per LoRA per [`ace-step-1.5 README`](https://github.com/ace-step/ACE-Step-1.5)).
- Serve via the same inference pipeline with LoRA hot-swap.
**Fallback / A-B candidates:**
- **HeartMuLa-MLX** ([`Acelogic/heartlib-mlx`](https://github.com/Acelogic/heartlib-mlx)) β 2.1Γ faster than PyTorch MPS, full numerical parity, Apache 2.0.
- **DiffRhythm 2** ([`ASLP-lab/DiffRhythm`](https://github.com/ASLP-lab/DiffRhythm)) β for the speed/instrumental tier (210 s ceiling acceptable for short-form features like background loops).
- **YuE via Replicate** ([`replicate.com/fofr/yue`](https://replicate.com/fofr/yue/api)) β only for EN+Mandarin+Cantonese+JP+KR generations that ACE-Step underperforms; pay-per-second, no local infra cost.
### 2. Lyrics generation β separate LLM
The song-gen model takes **lyrics + style** as input, not raw user prompts. Suno's "song description" flow is actually two stages: prompt β lyrics LLM β lyrics β song model.
- Use any decent open LLM running on the user's M5 Max. Candidates:
- **Qwen 2.5 Coder 32B / Qwen 3 7B** β good multilingual chops, fast on MPS via Ollama or mlx-lm.
- **Llama 3.3 70B 4-bit** β premium tier; fits comfortably in 128 GB unified.
- **GPT-OSS-20B** β Apache 2.0, sturdy English.
- Prompt template should:
1. Parse user style hint into tags (genre, tempo, mood, instruments).
2. Output structured lyrics with `[verse]`, `[chorus]`, `[bridge]`, `[outro]` markers β these are **exactly the structural tags ACE-Step's `TextEncodeAceStepAudio` consumes**.
3. Constrain section count and line count to roughly match the target song duration.
**This LLM is independent of the song-gen model and can be swapped freely.**
### 3. Style / tag normalization
A small classifier or 3 B LM that normalizes user free-text into the controlled-vocabulary tag set the song model was trained on (per genre, BPM bucket, vocal gender, mood). For ACE-Step this maps to its lyric-tag schema; for YuE it maps to `top_200_tags.json`.
Implementation: 1-shot prompt to the lyrics LLM with examples; cache results.
### 4. Voice cloning / personas (optional but Suno-equivalent)
To match Suno's "Personas" feature:
- **RVC v2** (Retrieval-based Voice Conversion) β open source, fast, runs on MPS, well-supported.
- Train a 5-minute reference clip β 10β15 min on M5 Max β speaker embedding.
- Apply to the generated vocal stem (Demucs-extracted) β remix.
ACE-Step's **ICL mode** (in-context learning from a reference clip) and YuE's ICL variants partly cover this too, but RVC gives explicit per-speaker control.
### 5. Stem separation
For Suno's "download stems" feature:
- **Demucs v4 / HTDemucs** β open source, Apache 2.0, runs on MPS, separates into vocals / drums / bass / other.
- Already bundled in [`fspecii/ace-step-ui`](https://github.com/fspecii/ace-step-ui).
### 6. Mastering / loudness normalization
- **pyloudnorm** for LUFS normalization to streaming spec (-14 LUFS Spotify, -16 for AirPods).
- **ffmpeg-normalize** as a CLI wrapper.
- **Optional: TBProAudio mvMeter / Voxengo Span equivalents** via web-audio for UI metering.
### 7. Watermarking + content credentials
This is a **legal must-have** for any 2026 generative-music product (training-data lawsuits against Suno/Udio set the precedent).
- **Inaudible audio watermark**: AudioSeal or SilentCipher β open-source, Meta-built, survives MP3 transcoding.
- **C2PA metadata**: sign the m4a with model name + version + prompt + timestamp via the C2PA SDK.
- **Visible "AI-generated" tag** in UI per the YuE model card's recommendation (and increasingly per platform policy).
### 8. Storage and streaming
- **S3-compatible object store** (R2, Backblaze B2, or self-hosted MinIO on the M5 Max if dev-only).
- **HLS encoding pipeline**: ffmpeg β m3u8 + 4 s segments; serve via NGINX or Cloudflare.
- For local dev, plain m4a + range requests are fine.
### 9. Orchestrator API
- **FastAPI** for the request-handling layer.
- **Redis Streams** or **Hatchet** for the generation queue (songs are 30 sβ2 min jobs on M5 Max β non-trivial latency, must be async).
- **PostgreSQL** for users, songs, lyrics, LoRAs, billing.
- **Server-Sent Events** for progress streaming back to the UI ("planner stage", "DiT denoising step 14/27", "mastering...").
### 10. Frontend
- **Next.js 16** + Cache Components for the user dashboard / library.
- **Wavesurfer.js** for waveform display and scrubbing.
- **Tone.js** for any in-browser preview / mixing.
- Auth via Clerk or Auth0 β the user's portfolio revamp may already include this.
---
## Build order (incremental milestones)
| Milestone | Scope | Validates |
|---|---|---|
| **M0 β Spike** | Get ACE-Step 1.5 XL running locally via clockworksquirrel fork; generate one 30 s song end-to-end | Hardware compatibility, RTF on M5 Max |
| **M1 β CLI MVP** | Wrap in a Python CLI: `genmusic --prompt "..." --lyrics "..." --out song.m4a` | Headless generation, mastering chain, file output |
| **M2 β Local UI** | Replace UI with `fspecii/ace-step-ui` initially (fastest path); add Demucs stem download | Browser flow, multi-song library, LAN access |
| **M3 β Lyrics LLM integration** | Plug Qwen 3 / Llama 3.3 as the lyrics generator; produce structured lyrics from a one-line prompt | Suno-equivalent prompt UX |
| **M4 β Multi-model router** | Add HeartMuLa-MLX as alternate; add Replicate YuE as multilingual fallback; user can pick or auto-route | A/B capability, breadth |
| **M5 β LoRA pipeline** | First custom LoRA on a target genre (e.g., user's preferred style); hot-swap at inference | Differentiation vs Suno |
| **M6 β Production wrapper** | FastAPI + Postgres + queue + auth + watermarking + C2PA signing | Real product surface |
| **M7 β Deploy** | Move heavy inference behind a rented A100 endpoint for paid users; keep M5 Max for free tier / personal use | Paid-tier economics |
---
## Open questions for the user before M0
1. **Commercial intent.** Is this a personal portfolio project (research mode β SongGeneration 2 is fair game) or a real SaaS (must stay Apache/MIT)? The license map changes drastically.
2. **Target audience.** Western pop (where Suno still wins polish) vs world music / experimental genres (where ACE-Step / YuE compete fairly)?
3. **Latency target.** Suno generates in ~30 s; users tolerate up to 90 s. ACE-Step on M5 Max hits this; YuE local does not.
4. **Hosting plan.** Local-only for personal use? Or eventually paid tier on rented GPU?
5. **Vocal cloning.** Is Suno-style "Persona" upload a must-have v1 feature, or v2?
6. **Catalog / training data.** Any in-house licensed song catalog for LoRA fine-tuning, or strictly the public-domain model out of the box?
---
## Risks and mitigations
| Risk | Likelihood | Mitigation |
|---|---|---|
| MPS regression in a future PyTorch release breaks ACE-Step | medium | Pin torch version; keep CPU fallback path. |
| ACE-Step releases v2 with breaking API mid-build | medium | Wrap inference in a thin adapter; abstract model behind a single `Generator.generate()` interface. |
| Vendor PER claims (HeartMuLa, LeVo) overstated β quality disappointment | medium | Run internal blind A/B on 20+ prompts before featuring a model in the UI. |
| Output watermark stripped by transcoding | low | Use AudioSeal which survives MP3; double-stamp with C2PA metadata. |
| Lyrics LLM hallucinates copyrighted hooks | medium | Run a similarity check against an embeddings index of known songs; flag for human review. |
| Training-data IP suit (Suno-style) | low for derivative usage | Use models with documented public-data training (ACE-Step's paper is reasonably transparent); avoid Tencent's non-commercial weights. |
| MPS OOM on long sequences | low (128 GB) | `PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0`; chunk generation; offload non-active LoRAs. |
---
## Why ACE-Step 1.5 XL is the foundation (not just a model pick)
This is worth saying explicitly. Choosing the base model determines:
1. **Inference budget and unit economics** β ACE-Step is the only model where <2 s/song on A100 makes a paid tier economically obvious.
2. **Mac developer ergonomics** β first-class MPS means the user can iterate on the M5 Max for weeks without renting cloud GPU.
3. **License-clean output ownership** β MIT means users own their songs unambiguously.
4. **Future-proof on multilingual** β 50+ languages out of the box matters if the platform grows beyond an English audience.
5. **LoRA personalization is the differentiator** β fine-tuning support that works on MPS lets the user ship genre-specialist sub-models that Suno can't, because Suno's weights are locked.
6. **Production deployments exist** β AMD vendor-backed, `fspecii/ace-step-ui` running at scale, multiple SaaS already on the open weights. This is not betting on a research artifact.
The compound effect of those six is why ACE-Step is recommended as the platform foundation rather than just "the model to start with."
|