| # Suno-Clone Platform Architecture: Build Plan | |
| *Compiled 2026-05-18. Target hardware: Apple M5 Max, 128 GB unified memory. Core model decision: ACE-Step 1.5 XL.* | |
| --- | |
| ## Mental model | |
| Suno (and Udio) are not just a song-generation model. They are a **product stack** with at least five distinct AI components and a few non-AI scaffolds. If we want to replicate the product experience, we have to plan for all of them. The song-gen model is the headline; everything else is what makes it usable. | |
| ``` | |
|             ┌─────────────────────────────────────┐ | |
|             │ Web / mobile UI                     │ | |
|             │ (text prompt + style + lyrics)      │ | |
|             └─────────────────────────────────────┘ | |
|                                │ | |
|                                ▼ | |
| ┌─────────────────────────────────────────────────────────────┐ | |
| │ Orchestrator API                                            │ | |
| │ - prompt routing, queue, billing, history, sharing          │ | |
| └─────────────────────────────────────────────────────────────┘ | |
|        │               │               │               │ | |
|        ▼               ▼               ▼               ▼ | |
| ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ | |
| │ Lyrics LLM  │ │ Style/Tag   │ │ Song-gen    │ │ Voice       │ | |
| │ (Llama 3.3  │ │ rewriter    │ │ router      │ │ cloning     │ | |
| │ or Qwen)    │ │ (small LM)  │ │             │ │ (RVC)       │ | |
| └─────────────┘ └─────────────┘ └──────┬──────┘ └─────────────┘ | |
|                                        │ | |
|                                        ▼ | |
|                      ┌───────────────────────────────────┐ | |
|                      │ Model pool (the actual research)  │ | |
|                      │ - ACE-Step 1.5 XL (default)       │ | |
|                      │ - HeartMuLa-MLX (A/B)             │ | |
|                      │ - DiffRhythm 2 (speed tier)       │ | |
|                      │ - YuE on Replicate (intl.)        │ | |
|                      └───────────────────────────────────┘ | |
|                                        │ | |
|                                        ▼ | |
|                      ┌───────────────────────────────────┐ | |
|                      │ Post-processing pipeline          │ | |
|                      │ - Loudness normalization          │ | |
|                      │ - Demucs stem separation          │ | |
|                      │ - Watermarking (audible+meta)     │ | |
|                      │ - FFmpeg encoding → m4a/mp3       │ | |
|                      └───────────────────────────────────┘ | |
|                                        │ | |
|                                        ▼ | |
|                      ┌───────────────────────────────────┐ | |
|                      │ Storage + streaming               │ | |
|                      │ - S3 / R2 origin                  │ | |
|                      │ - HLS for in-browser playback     │ | |
|                      │ - CDN                             │ | |
|                      └───────────────────────────────────┘ | |
| ``` | |
| --- | |
| ## Component-by-component plan | |
| ### 1. Song generation: primary model | |
| - **ACE-Step 1.5 XL** via [`clockworksquirrel/ace-step-apple-silicon`](https://github.com/clockworksquirrel/ace-step-apple-silicon) on M5 Max. | |
| - Hybrid backend: Qwen3 planner on **MLX**, DiT decoder on **PyTorch MPS**, bf16 throughout. | |
| - Why XL over the standard 2B: 128 GB of unified memory absorbs the larger footprint, and the 4B DiT closes meaningful quality gaps for paying users. | |
| **LoRA fine-tuning path (when needed):** | |
| - Document the platform's target genres, then curate roughly 50 to 200 lyric/audio pairs per genre. | |
| - Train a per-genre LoRA on a 3090-class GPU budget (~1 hour per LoRA, per the [`ace-step-1.5 README`](https://github.com/ace-step/ACE-Step-1.5)). | |
| - Serve via the same inference pipeline with LoRA hot-swap. | |
| **Fallback / A-B candidates:** | |
| - **HeartMuLa-MLX** ([`Acelogic/heartlib-mlx`](https://github.com/Acelogic/heartlib-mlx)): reported 2.1× faster than PyTorch MPS with full numerical parity; Apache 2.0. | |
| - **DiffRhythm 2** ([`ASLP-lab/DiffRhythm`](https://github.com/ASLP-lab/DiffRhythm)): for the speed/instrumental tier (the 210 s ceiling is acceptable for short-form features like background loops). | |
| - **YuE via Replicate** ([`replicate.com/fofr/yue`](https://replicate.com/fofr/yue/api)): only for English, Mandarin, Cantonese, Japanese, and Korean generations where ACE-Step underperforms; pay-per-second, no local infra cost. | |
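The primary/fallback split above could be sketched as a small router behind one generator interface. This is a minimal illustration, not the plan's actual code: `SongRequest`, `StubGenerator`, `pick_backend`, the pool keys, and the routing rules (instrumental and short goes to DiffRhythm 2, selected languages go to YuE) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class SongRequest:
    """Hypothetical request shape; field names are illustrative."""
    lyrics: str
    style_tags: tuple[str, ...]
    language: str = "en"
    duration_s: int = 120
    instrumental: bool = False


class Generator(Protocol):
    """Single interface every backend adapter is assumed to implement."""
    name: str

    def generate(self, request: SongRequest) -> bytes: ...


class StubGenerator:
    """Placeholder backend; a real adapter would call the model here."""

    def __init__(self, name: str) -> None:
        self.name = name

    def generate(self, request: SongRequest) -> bytes:
        return f"{self.name}:{request.duration_s}s".encode()


# Assumed routing inputs: language codes sent to YuE, DiffRhythm's 210 s ceiling.
YUE_LANGUAGES = {"zh", "yue", "ja", "ko"}
DIFFRHYTHM_MAX_S = 210


def pick_backend(request: SongRequest, pool: dict[str, Generator],
                 default: str = "ace-step-1.5-xl") -> Generator:
    """Choose a backend; fall through to the default model."""
    if (request.instrumental and request.duration_s <= DIFFRHYTHM_MAX_S
            and "diffrhythm-2" in pool):
        return pool["diffrhythm-2"]
    if request.language in YUE_LANGUAGES and "yue-replicate" in pool:
        return pool["yue-replicate"]
    return pool[default]


pool = {n: StubGenerator(n)
        for n in ("ace-step-1.5-xl", "diffrhythm-2", "yue-replicate")}
chosen = pick_backend(SongRequest("la la", ("pop",), language="ja"), pool)
print(chosen.name)
```

Keeping the routing rule in one pure function makes it easy to A/B a new backend by adding a pool entry rather than touching call sites.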
| ### 2. Lyrics generation: separate LLM | |
| The song-gen model takes **lyrics + style** as input, not raw user prompts. Suno's "song description" flow is actually two stages: prompt → lyrics LLM → lyrics → song model. | |
| - Use any decent open LLM running on the user's M5 Max. Candidates: | |
| - **Qwen 2.5 Coder 32B / Qwen 3 8B**: good multilingual chops, fast on Apple Silicon via Ollama or mlx-lm. | |
| - **Llama 3.3 70B 4-bit**: premium tier; fits comfortably in 128 GB unified. | |
| - **GPT-OSS-20B**: Apache 2.0, sturdy English. | |
| - Prompt template should: | |
| 1. Parse user style hint into tags (genre, tempo, mood, instruments). | |
| 2. Output structured lyrics with `[verse]`, `[chorus]`, `[bridge]`, `[outro]` markers; these are **exactly the structural tags ACE-Step's `TextEncodeAceStepAudio` consumes**. | |
| 3. Constrain section count and line count to roughly match the target song duration. | |
| **This LLM is independent of the song-gen model and can be swapped freely.** | |
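Whichever LLM is used, its output can be checked before it reaches the song model. The sketch below parses the section markers listed above and applies a rough duration check; `parse_lyrics`, `fits_duration`, the 4 seconds-per-line estimate, and the 25% tolerance are illustrative assumptions, not values from ACE-Step.

```python
import re

# Section markers named in the prompt-template requirements above.
ALLOWED_SECTIONS = {"verse", "chorus", "bridge", "outro"}
TAG_RE = re.compile(r"^\[(\w+)\]\s*$")


def parse_lyrics(text: str) -> list[tuple[str, list[str]]]:
    """Split LLM output into (section, lines) pairs, rejecting unknown tags."""
    sections: list[tuple[str, list[str]]] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = TAG_RE.match(line)
        if match:
            tag = match.group(1).lower()
            if tag not in ALLOWED_SECTIONS:
                raise ValueError(f"unknown section marker: [{tag}]")
            sections.append((tag, []))
        elif not sections:
            raise ValueError("lyrics must start with a section marker")
        else:
            sections[-1][1].append(line)
    return sections


def fits_duration(sections: list[tuple[str, list[str]]], duration_s: float,
                  seconds_per_line: float = 4.0, tolerance: float = 0.25) -> bool:
    """Rough check: assumed seconds per sung line versus the target duration."""
    estimated = sum(len(lines) for _, lines in sections) * seconds_per_line
    return abs(estimated - duration_s) <= tolerance * duration_s


demo = "[verse]\nline one\nline two\n\n[chorus]\nhook\nhook again\n"
parsed = parse_lyrics(demo)
print([tag for tag, _ in parsed], fits_duration(parsed, 16))
```

A failed check is a natural trigger for re-prompting the lyrics LLM with a tighter line budget.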
| ### 3. Style / tag normalization | |
| A small classifier or 3 B LM that normalizes user free-text into the controlled-vocabulary tag set the song model was trained on (per genre, BPM bucket, vocal gender, mood). For ACE-Step this maps to its lyric-tag schema; for YuE it maps to `top_200_tags.json`. | |
| Implementation: few-shot prompt to the lyrics LLM with worked examples; cache results keyed on the normalized input text. | |
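The deterministic part around that LLM call (filtering against the controlled vocabulary and caching) might look like the following. `TagNormalizer`, `fake_rewrite`, and the tiny vocabulary are hypothetical stand-ins; in practice `rewrite` would wrap the prompt to the lyrics LLM.

```python
from typing import Callable


class TagNormalizer:
    """Cache + vocabulary filter around an LLM rewrite step (sketch)."""

    def __init__(self, vocabulary: set[str],
                 rewrite: Callable[[str], list[str]]) -> None:
        self.vocabulary = vocabulary
        self.rewrite = rewrite          # stand-in for the LLM call
        self._cache: dict[str, tuple[str, ...]] = {}
        self.calls = 0                  # how many times rewrite was invoked

    def normalize(self, free_text: str) -> tuple[str, ...]:
        # Collapse case and whitespace so equivalent inputs share a cache key.
        key = " ".join(free_text.lower().split())
        if key not in self._cache:
            self.calls += 1
            candidates = self.rewrite(key)
            # Keep only in-vocabulary tags, de-duplicated in original order.
            kept = dict.fromkeys(t for t in candidates if t in self.vocabulary)
            self._cache[key] = tuple(kept)
        return self._cache[key]


def fake_rewrite(text: str) -> list[str]:
    """Toy replacement for the LLM: comma split plus a synonym table."""
    synonyms = {"hiphop": "hip-hop", "chill": "relaxed"}
    return [synonyms.get(p.strip(), p.strip()) for p in text.split(",")]


normalizer = TagNormalizer({"hip-hop", "relaxed", "female-vocal"}, fake_rewrite)
print(normalizer.normalize("HipHop,  chill, banjo"))
```

Dropping out-of-vocabulary tags silently is one possible policy; logging them would help decide when the vocabulary needs to grow.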
| ### 4. Voice cloning / personas (optional but Suno-equivalent) | |
| To match Suno's "Personas" feature: | |
| - **RVC v2** (Retrieval-based Voice Conversion): open source, fast, runs on MPS, well-supported. | |
| - Train on a 5-minute reference clip → 10-15 min on M5 Max → speaker embedding. | |
| - Apply to the generated vocal stem (Demucs-extracted) → remix. | |
| ACE-Step's **ICL mode** (in-context learning from a reference clip) and YuE's ICL variants partly cover this too, but RVC gives explicit per-speaker control. | |
| ### 5. Stem separation | |
| For Suno's "download stems" feature: | |
| - **Demucs v4 / HTDemucs**: open source, MIT-licensed, runs on MPS, separates into vocals / drums / bass / other. | |
| - Already bundled in [`fspecii/ace-step-ui`](https://github.com/fspecii/ace-step-ui). | |
| ### 6. Mastering / loudness normalization | |
| - **pyloudnorm** for LUFS normalization to streaming targets (-14 LUFS for Spotify, -16 LUFS for Apple Music). | |
| - **ffmpeg-normalize** as a CLI wrapper. | |
| - **Optional: TBProAudio mvMeter / Voxengo Span equivalents** via web-audio for UI metering. | |
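The arithmetic behind loudness normalization is small enough to show directly. This sketch assumes the integrated loudness has already been measured (for example with pyloudnorm's meter) and only computes and applies the gain; the function names are illustrative, and the hard clip is a crude stand-in for a real limiter.

```python
def gain_to_target_db(measured_lufs: float, target_lufs: float = -14.0) -> float:
    """Gain in dB that moves a track from its measured loudness to the target."""
    return target_lufs - measured_lufs


def db_to_linear(gain_db: float) -> float:
    """Convert a dB gain to a linear amplitude multiplier."""
    return 10 ** (gain_db / 20)


def apply_gain(samples: list[float], gain_db: float,
               ceiling: float = 1.0) -> list[float]:
    """Scale samples and hard-clip at the ceiling (placeholder for a limiter)."""
    factor = db_to_linear(gain_db)
    return [max(-ceiling, min(ceiling, s * factor)) for s in samples]


# A track measured at -20 LUFS needs +6 dB to reach the -14 LUFS target.
gain = gain_to_target_db(-20.0)
print(gain, round(db_to_linear(gain), 3))
```

Because a positive gain can push peaks past full scale, a production chain would typically follow this step with true-peak limiting rather than clipping.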
| ### 7. Watermarking + content credentials | |
| This is a **legal must-have** for any 2026 generative-music product (training-data lawsuits against Suno/Udio set the precedent). | |
| - **Inaudible audio watermark**: AudioSeal (Meta) or SilentCipher (Sony); both are open source and designed to survive MP3 transcoding. | |
| - **C2PA metadata**: sign the m4a with model name + version + prompt + timestamp via the C2PA SDK. | |
| - **Visible "AI-generated" tag** in UI per the YuE model card's recommendation (and increasingly per platform policy). | |
| ### 8. Storage and streaming | |
| - **S3-compatible object store** (R2, Backblaze B2, or self-hosted MinIO on the M5 Max if dev-only). | |
| - **HLS encoding pipeline**: ffmpeg → m3u8 + 4 s segments; serve via NGINX or Cloudflare. | |
| - For local dev, plain m4a + range requests are fine. | |
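For the HLS path, the ffmpeg invocation can be assembled as an argument list and handed to `subprocess.run`. The sketch below only builds the command; the AAC bitrate, the `seg_%03d.ts` naming, and the `index.m3u8` playlist name are assumptions chosen for illustration.

```python
from pathlib import Path


def hls_command(src: str, out_dir: str, segment_s: int = 4) -> list[str]:
    """Build an ffmpeg argv that packages one audio file as a VOD HLS stream."""
    out = Path(out_dir)
    return [
        "ffmpeg", "-y",
        "-i", str(src),
        "-vn",                              # audio only
        "-c:a", "aac", "-b:a", "192k",      # assumed codec settings
        "-f", "hls",
        "-hls_time", str(segment_s),        # target segment length in seconds
        "-hls_playlist_type", "vod",
        "-hls_segment_filename", str(out / "seg_%03d.ts"),
        str(out / "index.m3u8"),
    ]


cmd = hls_command("song.m4a", "out/song-123")
print(" ".join(cmd))
```

Building the list separately from executing it keeps the packaging step easy to unit-test without ffmpeg installed.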
| ### 9. Orchestrator API | |
| - **FastAPI** for the request-handling layer. | |
| - **Redis Streams** or **Hatchet** for the generation queue (songs are 30 s to 2 min jobs on M5 Max, so latency is non-trivial and generation must be async). | |
| - **PostgreSQL** for users, songs, lyrics, LoRAs, billing. | |
| - **Server-Sent Events** for progress streaming back to the UI ("planner stage", "DiT denoising step 14/27", "mastering..."). | |
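Progress streaming over Server-Sent Events comes down to emitting correctly framed text chunks. A minimal sketch of that framing follows; the event names and payload fields are assumptions, and in FastAPI the generator could be returned through a `StreamingResponse` with `media_type="text/event-stream"`.

```python
import json
from typing import Iterable, Iterator, Optional


def sse_event(event: str, data: dict, event_id: Optional[int] = None) -> str:
    """Format one SSE frame: optional id, event name, JSON data, blank line."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    lines.append(f"event: {event}")
    lines.append(f"data: {json.dumps(data, separators=(',', ':'))}")
    return "\n".join(lines) + "\n\n"


def progress_stream(stages: Iterable[tuple[str, str]]) -> Iterator[str]:
    """Yield one frame per pipeline stage, then a final 'done' frame."""
    for i, (stage, detail) in enumerate(stages, start=1):
        yield sse_event("progress", {"stage": stage, "detail": detail}, event_id=i)
    yield sse_event("done", {})


for frame in progress_stream([("planner", "started"), ("dit", "step 14/27")]):
    print(frame, end="")
```

Including an `id` on each frame lets a reconnecting browser resume from the last stage it saw via the `Last-Event-ID` header.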
| ### 10. Frontend | |
| - **Next.js 16** + Cache Components for the user dashboard / library. | |
| - **Wavesurfer.js** for waveform display and scrubbing. | |
| - **Tone.js** for any in-browser preview / mixing. | |
| - Auth via Clerk or Auth0; the user's portfolio revamp may already include this. | |
| --- | |
| ## Build order (incremental milestones) | |
| | Milestone | Scope | Validates | | |
| |---|---|---| | |
| | **M0: Spike** | Get ACE-Step 1.5 XL running locally via the clockworksquirrel fork; generate one 30 s song end-to-end | Hardware compatibility, RTF on M5 Max | |
| | **M1: CLI MVP** | Wrap in a Python CLI: `genmusic --prompt "..." --lyrics "..." --out song.m4a` | Headless generation, mastering chain, file output | |
| | **M2: Local UI** | Adopt `fspecii/ace-step-ui` as the initial UI (fastest path); add Demucs stem download | Browser flow, multi-song library, LAN access | |
| | **M3: Lyrics LLM integration** | Plug in Qwen 3 / Llama 3.3 as the lyrics generator; produce structured lyrics from a one-line prompt | Suno-equivalent prompt UX | |
| | **M4: Multi-model router** | Add HeartMuLa-MLX as an alternate; add Replicate YuE as the multilingual fallback; user can pick or auto-route | A/B capability, breadth | |
| | **M5: LoRA pipeline** | First custom LoRA on a target genre (e.g., user's preferred style); hot-swap at inference | Differentiation vs Suno | |
| | **M6: Production wrapper** | FastAPI + Postgres + queue + auth + watermarking + C2PA signing | Real product surface | |
| | **M7: Deploy** | Move heavy inference behind a rented A100 endpoint for paid users; keep M5 Max for free tier / personal use | Paid-tier economics | |
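The M1 command line could be sketched with `argparse`, using only the three flags shown in the milestone table. The defaults and help strings are assumptions, and the actual generation call is left out.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Argument parser for the hypothetical `genmusic` CLI from milestone M1."""
    parser = argparse.ArgumentParser(
        prog="genmusic",
        description="Generate a song from a prompt and optional lyrics.",
    )
    parser.add_argument("--prompt", required=True,
                        help="style description, e.g. 'lo-fi hip-hop, 80 bpm'")
    parser.add_argument("--lyrics", default=None,
                        help="structured lyrics; omit to let the lyrics LLM write them")
    parser.add_argument("--out", default="song.m4a",
                        help="output file path (assumed default)")
    return parser


args = build_parser().parse_args(["--prompt", "lo-fi hip-hop", "--out", "demo.m4a"])
print(args.prompt, args.lyrics, args.out)
```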
| --- | |
| ## Open questions for the user before M0 | |
| 1. **Commercial intent.** Is this a personal portfolio project (research mode, where SongGeneration 2 is fair game) or a real SaaS (must stay Apache/MIT)? The license map changes drastically. | |
| 2. **Target audience.** Western pop (where Suno still wins polish) vs world music / experimental genres (where ACE-Step / YuE compete fairly)? | |
| 3. **Latency target.** Suno generates in ~30 s; users tolerate up to 90 s. ACE-Step on M5 Max hits this; YuE local does not. | |
| 4. **Hosting plan.** Local-only for personal use? Or eventually paid tier on rented GPU? | |
| 5. **Vocal cloning.** Is Suno-style "Persona" upload a must-have v1 feature, or v2? | |
| 6. **Catalog / training data.** Is there an in-house licensed song catalog for LoRA fine-tuning, or will the platform rely strictly on the open-weights model out of the box? | |
| --- | |
| ## Risks and mitigations | |
| | Risk | Likelihood | Mitigation | | |
| |---|---|---| | |
| | MPS regression in a future PyTorch release breaks ACE-Step | medium | Pin torch version; keep CPU fallback path. | | |
| | ACE-Step releases v2 with breaking API mid-build | medium | Wrap inference in a thin adapter; abstract model behind a single `Generator.generate()` interface. | | |
| | Vendor PER claims (HeartMuLa, LeVo) are overstated, leading to quality disappointment | medium | Run internal blind A/B on 20+ prompts before featuring a model in the UI. | |
| | Output watermark stripped by transcoding | low | Use AudioSeal which survives MP3; double-stamp with C2PA metadata. | | |
| | Lyrics LLM hallucinates copyrighted hooks | medium | Run a similarity check against an embeddings index of known songs; flag for human review. | | |
| | Training-data IP suit (Suno-style) | low for derivative usage | Use models with documented public-data training (ACE-Step's paper is reasonably transparent); avoid Tencent's non-commercial weights. | | |
| | MPS OOM on long sequences | low (128 GB) | `PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0`; chunk generation; offload non-active LoRAs. | | |
| --- | |
| ## Why ACE-Step 1.5 XL is the foundation (not just a model pick) | |
| This is worth saying explicitly. Choosing the base model determines: | |
| 1. **Inference budget and unit economics**: ACE-Step is the only model where <2 s/song on A100 makes a paid tier economically obvious. | |
| 2. **Mac developer ergonomics**: first-class MPS means the user can iterate on the M5 Max for weeks without renting cloud GPU. | |
| 3. **License-clean output ownership**: MIT means users own their songs unambiguously. | |
| 4. **Future-proof on multilingual**: 50+ languages out of the box matters if the platform grows beyond an English audience. | |
| 5. **LoRA personalization is the differentiator**: fine-tuning support that works on MPS lets the user ship genre-specialist sub-models that Suno can't, because Suno's weights are locked. | |
| 6. **Production deployments exist**: AMD vendor-backed, `fspecii/ace-step-ui` running at scale, multiple SaaS already on the open weights. This is not betting on a research artifact. | |
| The compound effect of those six is why ACE-Step is recommended as the platform foundation rather than just "the model to start with." | |