Spaces:
Running on Zero
Running on Zero
| # ACE-Step β Deep Technical Report | |
| *Researched 2026-05-18 for a Suno-like platform build on M5 Max (128 GB unified) / MPS.* | |
| --- | |
| ## 1. Overview | |
| ACE-Step is a foundation model for music generation jointly built by **ACE Studio** (the consumer music-tech outfit behind ACE Studio's vocal synth) and **StepFun** ("Step-AI"), a Beijing-based foundation-model lab. Core authors: Junmin Gong, Sean Zhao, Sen Wang, Shengyuan Xu, Joe Guo ([ace-step.github.io](https://ace-step.github.io/)). | |
| Release timeline: | |
| - **v1 (3.5B)** β open-sourced May 2025; technical report posted on arXiv on 2 Jun 2025 as 2506.00045 ([arxiv.org/abs/2506.00045](https://arxiv.org/abs/2506.00045)). | |
| - **v1.5** β released **28 Jan 2026** as a separate repo, [`ace-step/ACE-Step-1.5`](https://github.com/ace-step/ACE-Step-1.5). Adds a hybrid Language-Model + Diffusion-Transformer planner. | |
| - **XL series (4B DiT decoder)** β released 2 Apr 2026 as a higher-quality variant inside the v1.5 family. | |
| - **Latest tag** β v0.1.7 on 24 Apr 2026 ([ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)). | |
| - **v2** β **no public roadmap or announcement** as of 18 May 2026. | |
| Current status: actively maintained, 10.4k stars on the v1.5 repo and 4.5k on the original v1 repo, with a thriving ComfyUI ecosystem and third-party UIs ([ace-step/ACE-Step](https://github.com/ace-step/ACE-Step), [ace-step/ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)). | |
| --- | |
| ## 2. Architecture | |
| **v1 (3.5B):** a hybrid that fuses three pieces (per the paper, [arxiv.org/abs/2506.00045](https://arxiv.org/abs/2506.00045)): | |
| 1. **Sana Deep Compression AutoEncoder (DCAE)** β high-compression audio latent space borrowed from NVIDIA's Sana image work. | |
| 2. **Lightweight linear transformer** β the diffusion backbone, deliberately linear-attention to keep RTF low. | |
| 3. **Diffusion training** with **MERT + m-HuBERT** providing semantic-alignment supervision (REPA-style) during training so latents stay musically coherent. | |
| This sits between LLM-token approaches (Suno/YuE, slow but lyric-tight) and pure diffusion (DiffRhythm, fast but structurally weak). The design goal stated in the paper is "a fast, general-purpose, efficient yet flexible architecture" β explicitly a *foundation model*, not just a text-to-song pipeline ([arxiv.org/abs/2506.00045](https://arxiv.org/abs/2506.00045)). | |
| **v1.5:** a hybrid **LM-as-planner + Diffusion-Transformer (DiT)**. A small Qwen3-based LM (0.6B / 1.7B / 4B) turns the user prompt into a structured "song blueprint" (sections, key, bpm, lyrics, vocal style) which the DiT (2B standard or 4B XL) decodes into audio. This brings chain-of-thought reasoning to music structure, lifting long-range coherence β Suno's main historic advantage ([ACE-Step-1.5 README](https://github.com/ace-step/ACE-Step-1.5)). | |
| **Parameter counts:** | |
| | Variant | DiT | LM planner | Total | | |
| |---|---|---|---| | |
| | v1-3.5B | 3.5B (DiT only) | β | 3.5B | | |
| | v1.5 standard | 2B | 0.6B / 1.7B | ~2.6 β 3.7B | | |
| | v1.5 XL | 4B | up to 4B | up to 8B | | |
| --- | |
| ## 3. Variants and checkpoints | |
| All on Hugging Face under the `ACE-Step/` org ([ACE-Step org on HF](https://huggingface.co/ACE-Step)): | |
| - `ACE-Step-v1-3.5B` β the original generalist model ([HF card](https://huggingface.co/ACE-Step/ACE-Step-v1-3.5B)). | |
| - `ACE-Step-v1-chinese-rap-LoRA` ("RapMachine") β genre-specific LoRA. | |
| - **LoRA family** shipped by the team: `RapMachine`, `Lyric2Vocal` (vocal-only stem from lyrics), `Text2Samples` (instrumental loops/samples) ([ace-step.github.io](https://ace-step.github.io/)). | |
| - **v1.5 DiT checkpoints:** 2B standard and 4B XL. | |
| - **v1.5 LM planners:** 0.6B, 1.7B, 4B. | |
| - A public **Space demo** at [huggingface.co/spaces/ACE-Step/ACE-Step](https://huggingface.co/spaces/ACE-Step/ACE-Step). | |
| No v2 checkpoint exists yet. | |
| --- | |
| ## 4. License | |
| **Apache 2.0** for v1 ([ace-step/ACE-Step](https://github.com/ace-step/ACE-Step)) and **MIT** for v1.5 ([ace-step/ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)). Both are unambiguously **commercial-use-permitted, royalty-free**. This is the single biggest licensing advantage over Suno/Udio and even over YuE (which carries non-commercial clauses in parts of its weights chain). | |
| --- | |
| ## 5. Vocal support β CRITICAL VERIFICATION | |
| **Verdict: YES β ACE-Step generates vocals natively. The "instrumental-only" claim circulating in some reviews is wrong (likely conflating it with `Text2Samples` LoRA or with DiffRhythm).** | |
| Evidence: | |
| - The **v1 HF model card** describes the model as full-song (vocals + instruments) with the explicit caveat: *"Coarse vocal synthesis lacking nuance"* and *"Rare instruments may not render perfectly"* ([HF card](https://huggingface.co/ACE-Step/ACE-Step-v1-3.5B)). | |
| - The paper claims **lyric alignment across melody/harmony/rhythm metrics** β only meaningful for sung vocals ([arxiv.org/abs/2506.00045](https://arxiv.org/abs/2506.00045)). | |
| - The ComfyUI native node `TextEncodeAceStepAudio` accepts lyrics with `[verse] [chorus] [bridge]` structural tags ([comfyui-wiki guide](https://comfyui-wiki.com/en/tutorial/advanced/audio/ace-step/ace-step-v1)). | |
| - `Lyric2Vocal` LoRA exists *because* the base model already does vocals β the LoRA isolates the vocal stem ([ace-step.github.io](https://ace-step.github.io/)). | |
| - Blind-listening review of 50 participants scored ACE-Step v1.5 **4.4/5 on SongEval Vocal vs Suno v4 at 4.1/5** ([fm9.ai/ace-step/vs-suno](https://fm9.ai/ace-step/vs-suno)). | |
| **Quality reality check:** v1 vocals are admitted to be "coarse"; v1.5 markedly improves vocal clarity and now beats Suno v4 in blind tests on naturalness for folk/classical/jazz, while Suno still wins on "radio-ready polish" for pop/EDM ([fm9.ai/ace-step/vs-suno](https://fm9.ai/ace-step/vs-suno)). | |
| --- | |
| ## 6. Languages supported | |
| - **v1:** 19 languages, with the top 10 (English, Mandarin Chinese, Russian, Spanish, Japanese, German, French, Portuguese, Italian, Korean) performing best ([ace-step/ACE-Step](https://github.com/ace-step/ACE-Step)). Less-represented languages underperform due to training-data imbalance. | |
| - **v1.5:** Expanded to **50+ languages** with lyric control, alongside the planner LM ([ace-step/ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)). | |
| Known weakness from the team itself: Chinese rap was historically weak, motivating the `chinese-rap-LoRA` ([ace-step.github.io](https://ace-step.github.io/)). | |
| --- | |
| ## 7. Speed claims β verified | |
| The famous claim: *"synthesizes up to 4 minutes of music in just 20 seconds on an A100 GPU β 15Γ faster than LLM-based baselines"* ([arxiv.org/abs/2506.00045](https://arxiv.org/abs/2506.00045), [ace-step.github.io](https://ace-step.github.io/)). Hardware: **NVIDIA A100 80GB**. | |
| Published RTF table from the v1 HF card ([HF card](https://huggingface.co/ACE-Step/ACE-Step-v1-3.5B)): | |
| | Device | 27 steps RTF | 60 steps RTF | | |
| |---|---|---| | |
| | RTX 4090 | 34.48Γ | 15.63Γ | | |
| | A100 | 27.27Γ | 12.27Γ | | |
| | RTX 3090 | 12.76Γ | 6.48Γ | | |
| | **M2 Max** | **2.27Γ** | **1.03Γ** | | |
| v1.5 is faster still: *"under 2 seconds per full song on A100 and under 10 seconds on an RTX 3090"* ([ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)). | |
| **Apple-Silicon equivalents** (from the dedicated [clockworksquirrel/ace-step-apple-silicon](https://github.com/clockworksquirrel/ace-step-apple-silicon) port): | |
| | Task | M1 Pro 16 GB | M3 Pro 36 GB | A100 | | |
| |---|---|---|---| | |
| | 30 s turbo | ~45 s | ~25 s | ~2 s | | |
| | 30 s SFT (full) | ~3 min | ~1.5 min | ~8 s | | |
| **M5 Max projection:** The M5 Max's GPU TFLOPS lineage (MPS SGEMM scaled M1βM4: 1.36 β 2.24 β 2.47 β 2.9 TFLOPS, per [arxiv 2502.05317](https://arxiv.org/html/2502.05317v1)) plus the M5 generation's ~30 % uplift suggests roughly **3.5β4Γ the throughput of M2 Max**, i.e. an **estimated 8β10Γ RTF at 27 steps** for v1, and full-song generation in **~30β50 s for a 4-minute song**. No M5-specific public benchmark exists yet. | |
| --- | |
| ## 8. Quality assessment | |
| From the cross-model evaluation summarised in research-aggregator coverage ([researchgate paper page](https://www.researchgate.net/publication/392334894_ACE-Step_A_Step_Towards_Music_Generation_Foundation_Model), [fm9.ai/ace-step/vs-suno](https://fm9.ai/ace-step/vs-suno)): | |
| | Dimension | Leader | Where ACE-Step sits | | |
| |---|---|---| | |
| | Aesthetic quality | Hailuo > DiffRhythm | mid-upper | | |
| | Musicality (coherence) | Suno v3 | competitive, strong on memorability/clarity | | |
| | Style alignment | Udio v1 > Hailuo | 3rd | | |
| | Lyric alignment | Hailuo | strong, beats Suno v3, Udio, YuE | | |
| | **Vocal naturalness (v1.5)** | **ACE-Step 4.4/5** | beats Suno v4 (4.1/5) | | |
| | Speed (RTF) | **ACE-Step 15.63Γ** | best in class; DiffRhythm 10.03Γ, YuE 0.083Γ | | |
| User-facing reception is positive on customisability and speed; the most-cited weakness is "gacha"-style seed sensitivity β re-rolls produce noticeably different outputs ([ace-step.github.io](https://ace-step.github.io/)). | |
| --- | |
| ## 9. Inference performance & Apple Silicon | |
| - **VRAM (v1):** minimum **8 GB with CPU offload**; comfortable on 12 GB+ ([ace-step/ACE-Step](https://github.com/ace-step/ACE-Step)). | |
| - **VRAM (v1.5):** **<4 GB** for 2B-turbo with offload; **β₯12 GB** for XL with offload; **β₯20 GB** without offload; **β₯24 GB optimal** ([ACE-Step-1.5 README](https://github.com/ace-step/ACE-Step-1.5)). | |
| - **MPS support:** **first-class.** Use `--bf16 false` on M-series to avoid kernel issues ([ace-step/ACE-Step](https://github.com/ace-step/ACE-Step)). The dedicated [clockworksquirrel/ace-step-apple-silicon](https://github.com/clockworksquirrel/ace-step-apple-silicon) fork adds: bfloat16 throughout, MPS-safe pipeline with `torch.mps.empty_cache()` synchronisation, **MLX backend (567 LoC)** that auto-converts the Qwen3 planner LM to MLX with quantisation, and **LoRA training on MPS**. | |
| - **ComfyUI:** **native nodes** ship in upstream ComfyUI (`TextEncodeAceStepAudio` etc.) plus the official [`ace-step/ACE-Step-ComfyUI`](https://github.com/ace-step/ACE-Step-ComfyUI). v1.5 has dedicated workflows (split-LLM and AIO checkpoint variants) on comfy.org ([Purz blog post](https://blog.comfy.org/p/ace-step-15-is-now-available-in-comfyui)). | |
| - **128 GB unified on M5 Max** comfortably fits the full XL stack plus the 4B planner LM with no offload needed; user's hardware is essentially overkill for ACE-Step. | |
| --- | |
| ## 10. Repo health | |
| | Repo | Stars | Forks | Last release | | |
| |---|---|---|---| | |
| | `ace-step/ACE-Step` (v1) | 4.5k | 568 | quiet since v1.5 fork | | |
| | `ace-step/ACE-Step-1.5` | **10.4k** | 1.3k | v0.1.7 on 24 Apr 2026 | | |
| | `fspecii/ace-step-ui` (popular community UI) | 3.8k | 561 | active | | |
| | `clockworksquirrel/ace-step-apple-silicon` | β (smaller) | β | active | | |
| The team also curates [`ace-step/awesome-ace-step`](https://github.com/ace-step/awesome-ace-step). Issue activity, ComfyUI integration cadence, and the LM-planner architectural jump in v1.5 all indicate a project that is healthier and growing faster than YuE or DiffRhythm. | |
| --- | |
| ## 11. Real-world adoption | |
| - **AMD vendor-backed deployment:** AMD published a blog *"Commercial-grade AI music generation on AMD Ryzen AI processors and Radeon graphics with ACE Step 1.5"* in 2026, explicitly endorsing it for Ryzen AI / Radeon production stacks ([AMD blog](https://www.amd.com/en/blogs/2026/commercial-grade-ai-music-generation-on-amd-ryzen-ai-and-radeon-ace-step-1-5.html)). | |
| - **Third-party SaaS:** `acestep.io` and `ace-step.app` run hosted song-generation services on the open weights ([acestep.io](https://acestep.io/), [ace-step.app](https://ace-step.app/)). | |
| - **Production-grade UI:** `fspecii/ace-step-ui` brands itself as *"the Ultimate Open Source Suno Alternative"* with stem extraction (Demucs), batch generation, library/playlist management, LAN access ([fspecii/ace-step-ui](https://github.com/fspecii/ace-step-ui)). | |
| - Heart-MuLa and similar music platforms cite ACE-Step 1.5 in their stack comparisons ([heart-mula.com/ace-step](https://heart-mula.com/ace-step)). | |
| --- | |
| ## 12. Fine-tuning + LoRA | |
| - **Training code released**; documented in [`TRAIN_INSTRUCTION.md`](https://github.com/ace-step/ACE-Step) and `ZH_RAP_LORA.md` ([ace-step/ACE-Step](https://github.com/ace-step/ACE-Step)). | |
| - **Genre / task LoRAs from the team:** `RapMachine` (general rap), `Chinese-Rap-LoRA`, `Lyric2Vocal`, `Text2Samples` ([HF org](https://huggingface.co/ACE-Step), [ace-step.github.io](https://ace-step.github.io/)). | |
| - v1.5 quotes **"8 songs trainable in ~1 hour on a single RTX 3090"** for LoRA personalisation ([ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)). | |
| - LoRA training is verified working on **MPS** via the Apple-Silicon fork ([clockworksquirrel/ace-step-apple-silicon](https://github.com/clockworksquirrel/ace-step-apple-silicon)). | |
| --- | |
| ## 13. Pros and cons | |
| **Pros** | |
| - Apache-2.0 / MIT β **fully commercial-friendly**, unique in this tier. | |
| - **Fastest open music model**: 15.63Γ RTF on a 4090; sub-2 s/song on A100 (v1.5). | |
| - Vocals **and** instruments natively; v1.5 vocal quality now beats Suno v4 in blind tests. | |
| - 50+ languages with lyric structural tags. | |
| - First-class **MPS + MLX** support and a dedicated Apple-Silicon fork. | |
| - ComfyUI native + thriving UI ecosystem (`ace-step-ui`). | |
| - LoRA training is cheap (~1 hour for 8 songs on 3090), well-documented. | |
| - Hybrid LM-planner (v1.5) closes the long-range structure gap with Suno. | |
| **Cons** | |
| - v1 vocals are admitted "coarse"; even v1.5 trails Suno on pop/EDM polish. | |
| - High **seed sensitivity** β "gacha" outputs; multiple re-rolls needed in production. | |
| - Less-represented languages underperform. | |
| - Memory for XL series can exceed 24 GB without offload. | |
| - No official **v2** announced; the rapid v1 β v1.5 β XL fork hints at API/checkpoint churn. | |
| - Smaller benchmark literature than Suno/YuE; some metrics still self-reported. | |
| --- | |
| ## 14. Verdict for the user's platform | |
| For a **Suno-like platform on M5 Max with 128 GB unified memory**, ACE-Step is currently the **single strongest open-source choice** and should be the **default base model**: | |
| - **Best for:** full-song generation with vocals in 50+ languages, fast iteration (sub-minute per song expected on M5 Max), genre-specific LoRA fine-tuning, and any deployment where commercial rights matter (Apache/MIT vs Suno's locked-down terms). | |
| - **Recommended stack:** ACE-Step **v1.5 XL (4B DiT) + 1.7B Qwen3 planner**, run via the `clockworksquirrel/ace-step-apple-silicon` MPS/MLX fork, served behind the `fspecii/ace-step-ui` frontend, with ComfyUI workflows for power-user editing. | |
| - **Weaknesses to mitigate:** budget for **n-of-k re-roll selection** in the product UX (the gacha problem); pair with a **Demucs stem-extraction post-process** (already in `ace-step-ui`) so users can mix-down; do not pitch the platform on pop/EDM polish alone β lean into folk/classical/jazz and rap, where ACE-Step now leads. | |
| - **Where you may still need Suno-style commercial APIs:** clients demanding broadcast-radio pop polish; otherwise, ACE-Step is sufficient. | |
| --- | |
| ### Sources | |
| - [ACE-Step paper, arXiv 2506.00045](https://arxiv.org/abs/2506.00045) | |
| - [ace-step.github.io](https://ace-step.github.io/) | |
| - [ace-step/ACE-Step (v1 repo)](https://github.com/ace-step/ACE-Step) | |
| - [ace-step/ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5) | |
| - [ACE-Step v1-3.5B model card](https://huggingface.co/ACE-Step/ACE-Step-v1-3.5B) | |
| - [ACE-Step org on Hugging Face](https://huggingface.co/ACE-Step) | |
| - [clockworksquirrel/ace-step-apple-silicon](https://github.com/clockworksquirrel/ace-step-apple-silicon) | |
| - [fspecii/ace-step-ui](https://github.com/fspecii/ace-step-ui) | |
| - [ace-step/ACE-Step-ComfyUI](https://github.com/ace-step/ACE-Step-ComfyUI) | |
| - [ace-step/awesome-ace-step](https://github.com/ace-step/awesome-ace-step) | |
| - [ComfyUI native ACE-Step tutorial](https://docs.comfy.org/tutorials/audio/ace-step/ace-step-v1) | |
| - [ComfyUI Wiki ACE-Step guide](https://comfyui-wiki.com/en/tutorial/advanced/audio/ace-step/ace-step-v1) | |
| - [Purz blog β ACE-Step 1.5 in ComfyUI](https://blog.comfy.org/p/ace-step-15-is-now-available-in-comfyui) | |
| - [AMD blog β ACE-Step 1.5 on Ryzen AI / Radeon](https://www.amd.com/en/blogs/2026/commercial-grade-ai-music-generation-on-amd-ryzen-ai-and-radeon-ace-step-1-5.html) | |
| - [FM9 β ACE-Step vs Suno blind test](https://fm9.ai/ace-step/vs-suno) | |
| - [HeartMuLa β ACE-Step 1.5 review](https://heart-mula.com/ace-step) | |
| - [ResearchGate β ACE-Step paper page](https://www.researchgate.net/publication/392334894_ACE-Step_A_Step_Towards_Music_Generation_Foundation_Model) | |
| - [Apple Silicon HPC benchmark, arXiv 2502.05317](https://arxiv.org/html/2502.05317v1) | |
| - [acestep.io β hosted service](https://acestep.io/) | |
| - [ace-step.app β hosted service](https://ace-step.app/) | |