# ACE-Step: Deep Technical Report

*Researched 2026-05-18 for a Suno-like platform build on M5 Max (128 GB unified) / MPS.*

---

## 1. Overview

ACE-Step is a foundation model for music generation built jointly by **ACE Studio** (the music-tech company behind the ACE Studio vocal synthesizer) and **StepFun** ("Step-AI"), a Shanghai-based foundation-model lab. Core authors: Junmin Gong, Sean Zhao, Sen Wang, Shengyuan Xu, Joe Guo ([ace-step.github.io](https://ace-step.github.io/)).

Release timeline:

- **v1 (3.5B):** open-sourced May 2025; technical report posted on arXiv on 2 Jun 2025 as 2506.00045 ([arxiv.org/abs/2506.00045](https://arxiv.org/abs/2506.00045)).
- **v1.5:** released **28 Jan 2026** as a separate repo, [`ace-step/ACE-Step-1.5`](https://github.com/ace-step/ACE-Step-1.5). Adds a hybrid architecture in which a language-model planner feeds a Diffusion-Transformer decoder.
- **XL series (4B DiT decoder):** released 2 Apr 2026 as a higher-quality variant inside the v1.5 family.
- **Latest tag:** v0.1.7 on 24 Apr 2026 ([ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)).
- **v2:** **no public roadmap or announcement** as of 18 May 2026.

Current status: actively maintained, with 10.4k stars on the v1.5 repo and 4.5k on the original v1 repo, plus a thriving ComfyUI ecosystem and third-party UIs ([ace-step/ACE-Step](https://github.com/ace-step/ACE-Step), [ace-step/ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)).

---

## 2. Architecture

**v1 (3.5B):** a hybrid that fuses three pieces (per the paper, [arxiv.org/abs/2506.00045](https://arxiv.org/abs/2506.00045)):

1. **Sana Deep Compression AutoEncoder (DCAE):** a high-compression audio latent space borrowed from NVIDIA's Sana image work.
2. **Lightweight linear transformer:** the diffusion backbone, which deliberately uses linear attention to keep generation fast.
3. **Diffusion training** with **MERT + m-HuBERT** providing semantic-alignment supervision (REPA-style) so that latents stay musically coherent.

This design sits between LLM-token approaches (Suno/YuE, slow but lyric-tight) and pure diffusion (DiffRhythm, fast but structurally weak). The design goal stated in the paper is "a fast, general-purpose, efficient yet flexible architecture": explicitly a *foundation model*, not just a text-to-song pipeline ([arxiv.org/abs/2506.00045](https://arxiv.org/abs/2506.00045)).

**v1.5:** a hybrid **LM-as-planner + Diffusion-Transformer (DiT)**. A small Qwen3-based LM (0.6B / 1.7B / 4B) turns the user prompt into a structured "song blueprint" (sections, key, bpm, lyrics, vocal style), which the DiT (2B standard or 4B XL) decodes into audio. This brings chain-of-thought reasoning to music structure and lifts long-range coherence, historically Suno's main advantage ([ACE-Step-1.5 README](https://github.com/ace-step/ACE-Step-1.5)).

**Parameter counts:**

| Variant | DiT | LM planner | Total |
|---|---|---|---|
| v1-3.5B | 3.5B (DiT only) | none | 3.5B |
| v1.5 standard | 2B | 0.6B / 1.7B | ~2.6 – 3.7B |
| v1.5 XL | 4B | up to 4B | up to 8B |

---

## 3. Variants and checkpoints

All checkpoints are on Hugging Face under the `ACE-Step/` org ([ACE-Step org on HF](https://huggingface.co/ACE-Step)):

- `ACE-Step-v1-3.5B`: the original generalist model ([HF card](https://huggingface.co/ACE-Step/ACE-Step-v1-3.5B)).
- `ACE-Step-v1-chinese-rap-LoRA` ("RapMachine"): genre-specific LoRA.
- **LoRA family** shipped by the team: `RapMachine`, `Lyric2Vocal` (vocal-only stem from lyrics), `Text2Samples` (instrumental loops/samples) ([ace-step.github.io](https://ace-step.github.io/)).
- **v1.5 DiT checkpoints:** 2B standard and 4B XL.
- **v1.5 LM planners:** 0.6B, 1.7B, 4B.
- A public **Space demo** at [huggingface.co/spaces/ACE-Step/ACE-Step](https://huggingface.co/spaces/ACE-Step/ACE-Step).

No v2 checkpoint exists yet.

---

## 4. License

**Apache 2.0** for v1 ([ace-step/ACE-Step](https://github.com/ace-step/ACE-Step)) and **MIT** for v1.5 ([ace-step/ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)).
Both licenses unambiguously permit **royalty-free commercial use**. This is the single biggest licensing advantage over Suno/Udio and even over YuE (which carries non-commercial clauses in parts of its weights chain).

---

## 5. Vocal support: CRITICAL VERIFICATION

**Verdict: YES. ACE-Step generates vocals natively. The "instrumental-only" claim circulating in some reviews is wrong (likely conflating it with the `Text2Samples` LoRA or with DiffRhythm).**

Evidence:

- The **v1 HF model card** describes the model as full-song (vocals + instruments) with the explicit caveats *"Coarse vocal synthesis lacking nuance"* and *"Rare instruments may not render perfectly"* ([HF card](https://huggingface.co/ACE-Step/ACE-Step-v1-3.5B)).
- The paper claims **lyric alignment across melody/harmony/rhythm metrics**, which is only meaningful for sung vocals ([arxiv.org/abs/2506.00045](https://arxiv.org/abs/2506.00045)).
- The ComfyUI native node `TextEncodeAceStepAudio` accepts lyrics with `[verse] [chorus] [bridge]` structural tags ([comfyui-wiki guide](https://comfyui-wiki.com/en/tutorial/advanced/audio/ace-step/ace-step-v1)).
- The `Lyric2Vocal` LoRA exists *because* the base model already does vocals; the LoRA isolates the vocal stem ([ace-step.github.io](https://ace-step.github.io/)).
- A blind-listening review with 50 participants scored ACE-Step v1.5 **4.4/5 on SongEval Vocal vs Suno v4 at 4.1/5** ([fm9.ai/ace-step/vs-suno](https://fm9.ai/ace-step/vs-suno)).

**Quality reality check:** v1 vocals are admitted to be "coarse"; v1.5 markedly improves vocal clarity and now beats Suno v4 in blind tests on naturalness for folk/classical/jazz, while Suno still wins on "radio-ready polish" for pop/EDM ([fm9.ai/ace-step/vs-suno](https://fm9.ai/ace-step/vs-suno)).

---

## 6. Languages supported

- **v1:** 19 languages, with the top 10 (English, Mandarin Chinese, Russian, Spanish, Japanese, German, French, Portuguese, Italian, Korean) performing best ([ace-step/ACE-Step](https://github.com/ace-step/ACE-Step)).
  Less-represented languages underperform due to training-data imbalance.
- **v1.5:** Expanded to **50+ languages** with lyric control, alongside the planner LM ([ace-step/ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)).

Known weakness from the team itself: Chinese rap was historically weak, motivating the `chinese-rap-LoRA` ([ace-step.github.io](https://ace-step.github.io/)).

---

## 7. Speed claims: verified

The famous claim is that ACE-Step *"synthesizes up to 4 minutes of music in just 20 seconds on an A100 GPU"*, which the authors put at 15× faster than LLM-based baselines ([arxiv.org/abs/2506.00045](https://arxiv.org/abs/2506.00045), [ace-step.github.io](https://ace-step.github.io/)). Hardware: **NVIDIA A100 80GB**. That is 240 s of audio in 20 s, a real-time factor (RTF, seconds of audio per second of compute) of about 12×, in line with the A100 60-step figure below.

Published RTF table from the v1 HF card ([HF card](https://huggingface.co/ACE-Step/ACE-Step-v1-3.5B)); higher is faster:

| Device | 27 steps RTF | 60 steps RTF |
|---|---|---|
| RTX 4090 | 34.48× | 15.63× |
| A100 | 27.27× | 12.27× |
| RTX 3090 | 12.76× | 6.48× |
| **M2 Max** | **2.27×** | **1.03×** |

v1.5 is faster still: *"under 2 seconds per full song on A100 and under 10 seconds on an RTX 3090"* ([ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)).

**Apple-Silicon equivalents** (from the dedicated [clockworksquirrel/ace-step-apple-silicon](https://github.com/clockworksquirrel/ace-step-apple-silicon) port):

| Task | M1 Pro 16 GB | M3 Pro 36 GB | A100 |
|---|---|---|---|
| 30 s turbo | ~45 s | ~25 s | ~2 s |
| 30 s SFT (full) | ~3 min | ~1.5 min | ~8 s |

**M5 Max projection:** The M5 Max's GPU TFLOPS lineage (MPS SGEMM scaled M1→M4: 1.36 → 2.24 → 2.47 → 2.9 TFLOPS, per [arxiv 2502.05317](https://arxiv.org/html/2502.05317v1)) plus the M5 generation's ~30 % uplift suggests, as a rough extrapolation, **3.5–4× the throughput of M2 Max**, i.e. an **estimated 8–10× RTF at 27 steps** for v1. At that RTF a 4-minute (240 s) song needs about 24–30 s of pure generation time, so **~30–50 s** per 4-minute song is a conservative end-to-end estimate. No M5-specific public benchmark exists yet.

---

## 8. Quality assessment

From the cross-model evaluation summarised in research-aggregator coverage ([researchgate paper page](https://www.researchgate.net/publication/392334894_ACE-Step_A_Step_Towards_Music_Generation_Foundation_Model), [fm9.ai/ace-step/vs-suno](https://fm9.ai/ace-step/vs-suno)):

| Dimension | Leader | Where ACE-Step sits |
|---|---|---|
| Aesthetic quality | Hailuo > DiffRhythm | mid-upper |
| Musicality (coherence) | Suno v3 | competitive, strong on memorability/clarity |
| Style alignment | Udio v1 > Hailuo | 3rd |
| Lyric alignment | Hailuo | strong, beats Suno v3, Udio, YuE |
| **Vocal naturalness (v1.5)** | **ACE-Step 4.4/5** | beats Suno v4 (4.1/5) |
| Speed (RTF) | **ACE-Step 15.63×** | best in class; DiffRhythm 10.03×, YuE 0.083× |

User-facing reception is positive on customisability and speed; the most-cited weakness is "gacha"-style seed sensitivity: re-rolls produce noticeably different outputs ([ace-step.github.io](https://ace-step.github.io/)).

---

## 9. Inference performance & Apple Silicon

- **VRAM (v1):** minimum **8 GB with CPU offload**; comfortable on 12 GB+ ([ace-step/ACE-Step](https://github.com/ace-step/ACE-Step)).
- **VRAM (v1.5):** **<4 GB** for 2B-turbo with offload; **≥12 GB** for XL with offload; **≥20 GB** without offload; **≥24 GB optimal** ([ACE-Step-1.5 README](https://github.com/ace-step/ACE-Step-1.5)).
- **MPS support:** **first-class.** Use `--bf16 false` on M-series to avoid kernel issues ([ace-step/ACE-Step](https://github.com/ace-step/ACE-Step)). The dedicated [clockworksquirrel/ace-step-apple-silicon](https://github.com/clockworksquirrel/ace-step-apple-silicon) fork adds: bfloat16 throughout, an MPS-safe pipeline with `torch.mps.empty_cache()` synchronisation, an **MLX backend (567 LoC)** that auto-converts the Qwen3 planner LM to MLX with quantisation, and **LoRA training on MPS**.
- **ComfyUI:** **native nodes** ship in upstream ComfyUI (`TextEncodeAceStepAudio` etc.)
  plus the official [`ace-step/ACE-Step-ComfyUI`](https://github.com/ace-step/ACE-Step-ComfyUI). v1.5 has dedicated workflows (split-LLM and AIO checkpoint variants) on comfy.org ([Purz blog post](https://blog.comfy.org/p/ace-step-15-is-now-available-in-comfyui)).
- **128 GB unified on M5 Max** comfortably fits the full XL stack plus the 4B planner LM with no offload needed; the user's hardware is more than ACE-Step requires.

---

## 10. Repo health

| Repo | Stars | Forks | Last release |
|---|---|---|---|
| `ace-step/ACE-Step` (v1) | 4.5k | 568 | quiet since v1.5 fork |
| `ace-step/ACE-Step-1.5` | **10.4k** | 1.3k | v0.1.7 on 24 Apr 2026 |
| `fspecii/ace-step-ui` (popular community UI) | 3.8k | 561 | active |
| `clockworksquirrel/ace-step-apple-silicon` | n/a (smaller) | n/a | active |

The team also curates [`ace-step/awesome-ace-step`](https://github.com/ace-step/awesome-ace-step). Issue activity, ComfyUI integration cadence, and the LM-planner architectural jump in v1.5 all indicate a project that is healthier and growing faster than YuE or DiffRhythm.

---

## 11. Real-world adoption

- **AMD vendor-backed deployment:** AMD published a blog post, *"Commercial-grade AI music generation on AMD Ryzen AI processors and Radeon graphics with ACE Step 1.5"*, in 2026, explicitly endorsing it for Ryzen AI / Radeon production stacks ([AMD blog](https://www.amd.com/en/blogs/2026/commercial-grade-ai-music-generation-on-amd-ryzen-ai-and-radeon-ace-step-1-5.html)).
- **Third-party SaaS:** `acestep.io` and `ace-step.app` run hosted song-generation services on the open weights ([acestep.io](https://acestep.io/), [ace-step.app](https://ace-step.app/)).
- **Production-grade UI:** `fspecii/ace-step-ui` brands itself as *"the Ultimate Open Source Suno Alternative"*, with stem extraction (Demucs), batch generation, library/playlist management, and LAN access ([fspecii/ace-step-ui](https://github.com/fspecii/ace-step-ui)).
- HeartMuLa and similar music platforms cite ACE-Step 1.5 in their stack comparisons ([heart-mula.com/ace-step](https://heart-mula.com/ace-step)).

---

## 12. Fine-tuning + LoRA

- **Training code released**; documented in [`TRAIN_INSTRUCTION.md`](https://github.com/ace-step/ACE-Step) and `ZH_RAP_LORA.md` ([ace-step/ACE-Step](https://github.com/ace-step/ACE-Step)).
- **Genre / task LoRAs from the team:** `RapMachine` (published as `ACE-Step-v1-chinese-rap-LoRA`), `Lyric2Vocal`, `Text2Samples` ([HF org](https://huggingface.co/ACE-Step), [ace-step.github.io](https://ace-step.github.io/)).
- v1.5 quotes **"8 songs trainable in ~1 hour on a single RTX 3090"** for LoRA personalisation ([ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)).
- LoRA training is verified working on **MPS** via the Apple-Silicon fork ([clockworksquirrel/ace-step-apple-silicon](https://github.com/clockworksquirrel/ace-step-apple-silicon)).

---

## 13. Pros and cons

**Pros**

- Apache-2.0 / MIT: **fully commercial-friendly**, unique in this tier.
- **Fastest open music model**: 15.63× RTF on a 4090 at 60 steps; sub-2 s/song on A100 (v1.5).
- Vocals **and** instruments natively; v1.5 vocal quality now beats Suno v4 in blind tests.
- 50+ languages with lyric structural tags.
- First-class **MPS + MLX** support and a dedicated Apple-Silicon fork.
- ComfyUI native + thriving UI ecosystem (`ace-step-ui`).
- LoRA training is cheap (~1 hour for 8 songs on a 3090) and well-documented.
- Hybrid LM-planner (v1.5) closes the long-range structure gap with Suno.

**Cons**

- v1 vocals are admitted "coarse"; even v1.5 trails Suno on pop/EDM polish.
- High **seed sensitivity** → "gacha" outputs; multiple re-rolls needed in production.
- Less-represented languages underperform.
- Memory for the XL series can exceed 24 GB without offload.
- No official **v2** announced; the rapid v1 → v1.5 → XL succession hints at API/checkpoint churn.
- Smaller benchmark literature than Suno/YuE; some metrics still self-reported.

---

## 14. Verdict for the user's platform

For a **Suno-like platform on M5 Max with 128 GB unified memory**, ACE-Step is currently the **single strongest open-source choice** and should be the **default base model**:

- **Best for:** full-song generation with vocals in 50+ languages, fast iteration (sub-minute per song expected on M5 Max), genre-specific LoRA fine-tuning, and any deployment where commercial rights matter (Apache/MIT vs Suno's locked-down terms).
- **Recommended stack:** ACE-Step **v1.5 XL (4B DiT) + 1.7B Qwen3 planner**, run via the `clockworksquirrel/ace-step-apple-silicon` MPS/MLX fork, served behind the `fspecii/ace-step-ui` frontend, with ComfyUI workflows for power-user editing.
- **Weaknesses to mitigate:** budget for **n-of-k re-roll selection** in the product UX (the gacha problem); pair with a **Demucs stem-extraction post-process** (already in `ace-step-ui`) so users can mix down; do not pitch the platform on pop/EDM polish alone. Lean into folk/classical/jazz and rap, where ACE-Step now leads.
- **Where you may still need Suno-style commercial APIs:** clients demanding broadcast-radio pop polish; otherwise, ACE-Step is sufficient.
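
The latency cost of the re-roll recommendation above can be estimated with simple arithmetic. The sketch below is illustrative only: `generation_seconds`, `reroll_budget`, and `RerollBudget` are hypothetical helper names, not part of ACE-Step. It assumes RTF means seconds of audio per second of compute, that candidates are generated one after another, and it ignores model loading and decoding overhead. The 2.27× value is the published M2 Max figure at 27 steps; 8× and 10× are this report's unmeasured M5 Max projection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RerollBudget:
    seconds_per_take: float  # compute seconds to generate one candidate
    total_seconds: float     # compute seconds for all k candidates, sequentially


def generation_seconds(audio_seconds: float, rtf: float) -> float:
    """Compute time for one take, with RTF = audio seconds / compute seconds."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return audio_seconds / rtf


def reroll_budget(audio_seconds: float, rtf: float, k: int) -> RerollBudget:
    """Compute time for k candidates generated sequentially (no batching assumed)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    per_take = generation_seconds(audio_seconds, rtf)
    return RerollBudget(per_take, per_take * k)


if __name__ == "__main__":
    song_seconds = 4 * 60  # a 4-minute song
    scenarios = [
        ("M2 Max, published, 27 steps", 2.27),
        ("M5 Max, projected low", 8.0),    # assumption: not a measured value
        ("M5 Max, projected high", 10.0),  # assumption: not a measured value
    ]
    for label, rtf in scenarios:
        budget = reroll_budget(song_seconds, rtf, k=4)
        print(
            f"{label}: {budget.seconds_per_take:.0f} s per take, "
            f"{budget.total_seconds:.0f} s for 4 takes"
        )
```

With k = 4 takes, the projected 8–10× RTF works out to roughly 96–120 s of compute per request for a 4-minute song, which is worth weighing when choosing the default number of candidates in the product UX.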
---

### Sources

- [ACE-Step paper, arXiv 2506.00045](https://arxiv.org/abs/2506.00045)
- [ace-step.github.io](https://ace-step.github.io/)
- [ace-step/ACE-Step (v1 repo)](https://github.com/ace-step/ACE-Step)
- [ace-step/ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)
- [ACE-Step v1-3.5B model card](https://huggingface.co/ACE-Step/ACE-Step-v1-3.5B)
- [ACE-Step org on Hugging Face](https://huggingface.co/ACE-Step)
- [clockworksquirrel/ace-step-apple-silicon](https://github.com/clockworksquirrel/ace-step-apple-silicon)
- [fspecii/ace-step-ui](https://github.com/fspecii/ace-step-ui)
- [ace-step/ACE-Step-ComfyUI](https://github.com/ace-step/ACE-Step-ComfyUI)
- [ace-step/awesome-ace-step](https://github.com/ace-step/awesome-ace-step)
- [ComfyUI native ACE-Step tutorial](https://docs.comfy.org/tutorials/audio/ace-step/ace-step-v1)
- [ComfyUI Wiki ACE-Step guide](https://comfyui-wiki.com/en/tutorial/advanced/audio/ace-step/ace-step-v1)
- [Purz blog – ACE-Step 1.5 in ComfyUI](https://blog.comfy.org/p/ace-step-15-is-now-available-in-comfyui)
- [AMD blog – ACE-Step 1.5 on Ryzen AI / Radeon](https://www.amd.com/en/blogs/2026/commercial-grade-ai-music-generation-on-amd-ryzen-ai-and-radeon-ace-step-1-5.html)
- [FM9 – ACE-Step vs Suno blind test](https://fm9.ai/ace-step/vs-suno)
- [HeartMuLa – ACE-Step 1.5 review](https://heart-mula.com/ace-step)
- [ResearchGate – ACE-Step paper page](https://www.researchgate.net/publication/392334894_ACE-Step_A_Step_Towards_Music_Generation_Foundation_Model)
- [Apple Silicon HPC benchmark, arXiv 2502.05317](https://arxiv.org/html/2502.05317v1)
- [acestep.io – hosted service](https://acestep.io/)
- [ace-step.app – hosted service](https://ace-step.app/)