ACE-Step: Deep Technical Report
Researched 2026-05-18 for a Suno-like platform build on M5 Max (128 GB unified) / MPS.
1. Overview
ACE-Step is a foundation model for music generation built jointly by ACE Studio (the music-tech company behind the ACE Studio vocal synthesizer) and StepFun ("Step-AI"), a Shanghai-based foundation-model lab. Core authors: Junmin Gong, Sean Zhao, Sen Wang, Shengyuan Xu, Joe Guo (ace-step.github.io).
Release timeline:
- v1 (3.5B): open-sourced May 2025; technical report posted on arXiv on 2 Jun 2025 as 2506.00045 (arxiv.org/abs/2506.00045).
- v1.5: released 28 Jan 2026 as a separate repo, `ace-step/ACE-Step-1.5`. Adds a hybrid Language-Model + Diffusion-Transformer planner.
- XL series (4B DiT decoder): released 2 Apr 2026 as a higher-quality variant inside the v1.5 family.
- Latest tag: v0.1.7 on 24 Apr 2026 (ACE-Step-1.5).
- v2: no public roadmap or announcement as of 18 May 2026.
Current status: actively maintained, 10.4k stars on the v1.5 repo and 4.5k on the original v1 repo, with a thriving ComfyUI ecosystem and third-party UIs (ace-step/ACE-Step, ace-step/ACE-Step-1.5).
2. Architecture
v1 (3.5B): a hybrid that fuses three pieces (per the paper, arxiv.org/abs/2506.00045):
- Sana Deep Compression AutoEncoder (DCAE): high-compression audio latent space borrowed from NVIDIA's Sana image work.
- Lightweight linear transformer: the diffusion backbone, deliberately linear-attention to keep RTF low.
- Diffusion training with MERT + m-HuBERT providing semantic-alignment supervision (REPA-style) during training so latents stay musically coherent.
This sits between LLM-token approaches (Suno/YuE, slow but lyric-tight) and pure diffusion (DiffRhythm, fast but structurally weak). The design goal stated in the paper is "a fast, general-purpose, efficient yet flexible architecture", explicitly a foundation model, not just a text-to-song pipeline (arxiv.org/abs/2506.00045).
v1.5: a hybrid LM-as-planner + Diffusion-Transformer (DiT). A small Qwen3-based LM (0.6B / 1.7B / 4B) turns the user prompt into a structured "song blueprint" (sections, key, bpm, lyrics, vocal style) which the DiT (2B standard or 4B XL) decodes into audio. This brings chain-of-thought reasoning to music structure, lifting long-range coherence, which has been Suno's main historic advantage (ACE-Step-1.5 README).
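To make the planner/decoder split concrete, here is a minimal sketch of what such a blueprint could look like as a data structure. The fields mirror the items listed above (sections, key, bpm, lyrics, vocal style), but the `Section` and `SongBlueprint` classes and the `to_prompt` method are hypothetical illustrations, not the actual ACE-Step 1.5 schema.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    """One structural section of a song (hypothetical layout)."""
    tag: str          # e.g. "verse", "chorus", "bridge"
    lyrics: str = ""  # left empty for instrumental sections


@dataclass
class SongBlueprint:
    """Illustrative planner output; not the real ACE-Step 1.5 schema."""
    key: str
    bpm: int
    vocal_style: str
    sections: list[Section] = field(default_factory=list)

    def to_prompt(self) -> str:
        """Flatten the blueprint into text a decoder could be conditioned on."""
        header = f"key={self.key}; bpm={self.bpm}; vocals={self.vocal_style}"
        body = "\n".join(f"[{s.tag}]\n{s.lyrics}".rstrip() for s in self.sections)
        return f"{header}\n{body}"


blueprint = SongBlueprint(
    key="A minor",
    bpm=92,
    vocal_style="soft female lead",
    sections=[Section("verse", "City lights are fading out"),
              Section("chorus", "Hold on, hold on")],
)
print(blueprint.to_prompt())
```

Keeping the plan as an explicit object, rather than free text, is one way a product layer could let users edit key, tempo or section order before any audio is rendered.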
Parameter counts:
| Variant | DiT | LM planner | Total |
|---|---|---|---|
| v1-3.5B | 3.5B (DiT only) | none | 3.5B |
| v1.5 standard | 2B | 0.6B / 1.7B | ~2.6–3.7B |
| v1.5 XL | 4B | up to 4B | up to 8B |
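The totals in the table are simple sums of decoder and planner sizes. The sketch below reproduces them and adds a rough weights-only memory estimate, assuming bfloat16 storage at 2 bytes per parameter. That estimate is an assumption of this example: it ignores activations, the audio autoencoder, text encoders and any caches, so real usage will be higher.

```python
# Parameter counts in billions, taken from the table above.
PAIRINGS = {
    "v1.5 standard": (2.0, [0.6, 1.7]),
    "v1.5 XL": (4.0, [0.6, 1.7, 4.0]),
}

BYTES_PER_PARAM_BF16 = 2  # bfloat16 stores each weight in 2 bytes


def weights_gb(params_billion: float, bytes_per_param: int = BYTES_PER_PARAM_BF16) -> float:
    """Weights-only footprint in GB (10^9 bytes); ignores activations and caches."""
    return params_billion * bytes_per_param


for name, (dit, planners) in PAIRINGS.items():
    for lm in planners:
        total = dit + lm
        print(f"{name} + {lm}B planner: {total:.1f}B params, "
              f"~{weights_gb(total):.1f} GB of weights in bf16")
```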
3. Variants and checkpoints
All on Hugging Face under the ACE-Step/ org (ACE-Step org on HF):
- `ACE-Step-v1-3.5B`: the original generalist model (HF card).
- `ACE-Step-v1-chinese-rap-LoRA` ("RapMachine"): genre-specific LoRA.
- LoRA family shipped by the team: `RapMachine`, `Lyric2Vocal` (vocal-only stem from lyrics), `Text2Samples` (instrumental loops/samples) (ace-step.github.io).
- v1.5 DiT checkpoints: 2B standard and 4B XL.
- v1.5 LM planners: 0.6B, 1.7B, 4B.
- A public Space demo at huggingface.co/spaces/ACE-Step/ACE-Step.
No v2 checkpoint exists yet.
4. License
Apache 2.0 for v1 (ace-step/ACE-Step) and MIT for v1.5 (ace-step/ACE-Step-1.5). Both licenses permit royalty-free commercial use. This is the single biggest licensing advantage over Suno/Udio and even over YuE (which carries non-commercial clauses in parts of its weights chain).
5. Vocal support: CRITICAL VERIFICATION
Verdict: YES. ACE-Step generates vocals natively. The "instrumental-only" claim circulating in some reviews is wrong (likely conflating it with Text2Samples LoRA or with DiffRhythm).
Evidence:
- The v1 HF model card describes the model as full-song (vocals + instruments) with the explicit caveat: "Coarse vocal synthesis lacking nuance" and "Rare instruments may not render perfectly" (HF card).
- The paper claims lyric alignment across melody/harmony/rhythm metrics, which is only meaningful for sung vocals (arxiv.org/abs/2506.00045).
- The ComfyUI native node `TextEncodeAceStepAudio` accepts lyrics with `[verse]` `[chorus]` `[bridge]` structural tags (comfyui-wiki guide).
- The `Lyric2Vocal` LoRA exists because the base model already does vocals; the LoRA isolates the vocal stem (ace-step.github.io).
- Blind-listening review of 50 participants scored ACE-Step v1.5 4.4/5 on SongEval Vocal vs Suno v4 at 4.1/5 (fm9.ai/ace-step/vs-suno).
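As an illustration of the structural tags mentioned above, the sketch below assembles a tagged lyrics string from section/text pairs. The `build_lyrics` helper is hypothetical, and the exact whitespace and tag set the node expects are assumptions here; only the three tags named in the text are allowed.

```python
ALLOWED_TAGS = ("verse", "chorus", "bridge")  # tags named in the text above


def build_lyrics(sections: list[tuple[str, str]]) -> str:
    """Join (tag, text) pairs into a single tagged lyrics string."""
    blocks = []
    for tag, text in sections:
        if tag not in ALLOWED_TAGS:
            raise ValueError(f"unsupported section tag: {tag!r}")
        blocks.append(f"[{tag}]\n{text.strip()}")
    return "\n\n".join(blocks)


lyrics = build_lyrics([
    ("verse", "Morning rain on the window pane"),
    ("chorus", "We keep on running, running"),
    ("bridge", "Quiet now, the storm has passed"),
])
print(lyrics)
```

The resulting string is what a user would paste into the lyrics field; validating tags up front avoids silently sending a typo such as `[chrous]` to the model.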
Quality reality check: v1 vocals are admitted to be "coarse"; v1.5 markedly improves vocal clarity and now beats Suno v4 in blind tests on naturalness for folk/classical/jazz, while Suno still wins on "radio-ready polish" for pop/EDM (fm9.ai/ace-step/vs-suno).
6. Languages supported
- v1: 19 languages, with the top 10 (English, Mandarin Chinese, Russian, Spanish, Japanese, German, French, Portuguese, Italian, Korean) performing best (ace-step/ACE-Step). Less-represented languages underperform due to training-data imbalance.
- v1.5: Expanded to 50+ languages with lyric control, alongside the planner LM (ace-step/ACE-Step-1.5).
Known weakness from the team itself: Chinese rap was historically weak, motivating the chinese-rap-LoRA (ace-step.github.io).
7. Speed claims: verified
The famous claim: "synthesizes up to 4 minutes of music in just 20 seconds on an A100 GPU – 15× faster than LLM-based baselines" (arxiv.org/abs/2506.00045, ace-step.github.io). Hardware: NVIDIA A100 80GB.
Published RTF table from the v1 HF card (HF card):
| Device | 27 steps RTF | 60 steps RTF |
|---|---|---|
| RTX 4090 | 34.48× | 15.63× |
| A100 | 27.27× | 12.27× |
| RTX 3090 | 12.76× | 6.48× |
| M2 Max | 2.27× | 1.03× |
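The RTF figures above can be turned into wall-clock estimates by dividing audio duration by RTF, treating RTF as seconds of audio produced per second of compute (higher is faster). The sketch below does that for a 4-minute song; the `generation_seconds` helper is illustrative and ignores model loading and other fixed overheads.

```python
# RTF values (audio seconds produced per wall-clock second) from the table above.
RTF = {
    "RTX 4090": {27: 34.48, 60: 15.63},
    "A100":     {27: 27.27, 60: 12.27},
    "RTX 3090": {27: 12.76, 60: 6.48},
    "M2 Max":   {27: 2.27,  60: 1.03},
}


def generation_seconds(audio_seconds: float, rtf: float) -> float:
    """Wall-clock time to render `audio_seconds` of audio at a given RTF."""
    if rtf <= 0:
        raise ValueError("RTF must be positive")
    return audio_seconds / rtf


SONG = 240.0  # a 4-minute song
for device, by_steps in RTF.items():
    times = ", ".join(f"{steps} steps: {generation_seconds(SONG, r):.1f} s"
                      for steps, r in by_steps.items())
    print(f"{device:9s} {times}")
```

At 60 steps the A100 row works out to just under 20 s for 240 s of audio, which lines up with the headline "4 minutes in 20 seconds" claim.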
v1.5 is faster still: "under 2 seconds per full song on A100 and under 10 seconds on an RTX 3090" (ACE-Step-1.5).
Apple-Silicon equivalents (from the dedicated clockworksquirrel/ace-step-apple-silicon port):
| Task | M1 Pro 16 GB | M3 Pro 36 GB | A100 |
|---|---|---|---|
| 30 s turbo | ~45 s | ~25 s | ~2 s |
| 30 s SFT (full) | ~3 min | ~1.5 min | ~8 s |
M5 Max projection: The M5 Max's GPU TFLOPS lineage (MPS SGEMM scaled M1–M4: 1.36 → 2.24 → 2.47 → 2.9 TFLOPS, per arxiv 2502.05317) plus the M5 generation's 30% uplift suggests roughly 3.5–4× the throughput of M2 Max, i.e. an estimated 8–10× RTF at 27 steps for v1, and full-song generation in **30–50 s for a 4-minute song**. No M5-specific public benchmark exists yet.
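The projection arithmetic can be reproduced in a few lines. This is a sketch under the report's own assumptions (the M2 Max 27-step RTF from the v1 table and a 3.5–4× throughput ratio) plus one further assumption of this example: that RTF scales linearly with throughput. It is an extrapolation, not a measurement.

```python
M2_MAX_RTF_27_STEPS = 2.27   # from the v1 RTF table
SPEEDUP_RANGE = (3.5, 4.0)   # the report's assumed M5 Max vs M2 Max throughput ratio
SONG_SECONDS = 240.0         # a 4-minute song


def project(rtf_base: float, speedup: float, audio_seconds: float) -> tuple[float, float]:
    """Return (projected RTF, projected wall-clock seconds) under linear scaling."""
    rtf = rtf_base * speedup
    return rtf, audio_seconds / rtf


for s in SPEEDUP_RANGE:
    rtf, secs = project(M2_MAX_RTF_27_STEPS, s, SONG_SECONDS)
    print(f"speedup {s}x -> RTF {rtf:.2f}x, {secs:.1f} s per 4-minute song")
```

Pure linear scaling lands at roughly 26–30 s, the optimistic end of the 30–50 s range quoted above; the sketch does not model loading, decoding or other non-GPU overhead, which would push real timings higher.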
8. Quality assessment
From the cross-model evaluation summarised in research-aggregator coverage (researchgate paper page, fm9.ai/ace-step/vs-suno):
| Dimension | Leader | Where ACE-Step sits |
|---|---|---|
| Aesthetic quality | Hailuo > DiffRhythm | mid-upper |
| Musicality (coherence) | Suno v3 | competitive, strong on memorability/clarity |
| Style alignment | Udio v1 > Hailuo | 3rd |
| Lyric alignment | Hailuo | strong, beats Suno v3, Udio, YuE |
| Vocal naturalness (v1.5) | ACE-Step 4.4/5 | beats Suno v4 (4.1/5) |
| Speed (RTF) | ACE-Step 15.63× | best in class; DiffRhythm 10.03×, YuE 0.083× |
User-facing reception is positive on customisability and speed; the most-cited weakness is "gacha"-style seed sensitivity: re-rolls produce noticeably different outputs (ace-step.github.io).
9. Inference performance & Apple Silicon
- VRAM (v1): minimum 8 GB with CPU offload; comfortable on 12 GB+ (ace-step/ACE-Step).
- VRAM (v1.5): <4 GB for 2B-turbo with offload; ≥12 GB for XL with offload; ≥20 GB without offload; ≥24 GB optimal (ACE-Step-1.5 README).
- MPS support: first-class. Use `--bf16 false` on M-series to avoid kernel issues (ace-step/ACE-Step). The dedicated clockworksquirrel/ace-step-apple-silicon fork adds: bfloat16 throughout, MPS-safe pipeline with `torch.mps.empty_cache()` synchronisation, MLX backend (567 LoC) that auto-converts the Qwen3 planner LM to MLX with quantisation, and LoRA training on MPS.
- ComfyUI: native nodes ship in upstream ComfyUI (`TextEncodeAceStepAudio` etc.) plus the official `ace-step/ACE-Step-ComfyUI`. v1.5 has dedicated workflows (split-LLM and AIO checkpoint variants) on comfy.org (Purz blog post).
- 128 GB unified on M5 Max comfortably fits the full XL stack plus the 4B planner LM with no offload needed; user's hardware is essentially overkill for ACE-Step.
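The v1.5 memory thresholds above lend themselves to a small lookup. The sketch below maps available accelerator memory to a suggested setup; the `pick_v15_config` function, its return format, the choice to keep offload on for the 2B-turbo tier, and the hard failure below 4 GB are all assumptions of this example rather than behaviour of the ACE-Step tooling.

```python
def pick_v15_config(memory_gb: float) -> dict:
    """Map available accelerator memory (GB) to a v1.5 setup, per the figures above."""
    if memory_gb >= 24:
        return {"dit": "XL (4B)", "cpu_offload": False, "note": "optimal"}
    if memory_gb >= 20:
        return {"dit": "XL (4B)", "cpu_offload": False, "note": "fits without offload"}
    if memory_gb >= 12:
        return {"dit": "XL (4B)", "cpu_offload": True, "note": "offload required"}
    if memory_gb >= 4:
        # Conservative assumption: keep offload on for the small tier.
        return {"dit": "2B turbo", "cpu_offload": True, "note": "offload assumed"}
    # Assumption: treat anything under the quoted ~4 GB figure as unsupported.
    raise ValueError("below the ~4 GB figure quoted for 2B turbo with offload")


for gb in (8, 16, 22, 128):
    print(gb, "GB ->", pick_v15_config(gb))
```

On Apple Silicon the unified memory pool is shared with the OS and other processes, so the figure passed in should be the memory actually available to the GPU rather than the headline capacity.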
10. Repo health
| Repo | Stars | Forks | Last release |
|---|---|---|---|
| `ace-step/ACE-Step` (v1) | 4.5k | 568 | quiet since v1.5 fork |
| `ace-step/ACE-Step-1.5` | 10.4k | 1.3k | v0.1.7 on 24 Apr 2026 |
| `fspecii/ace-step-ui` (popular community UI) | 3.8k | 561 | active |
| `clockworksquirrel/ace-step-apple-silicon` | n/a (smaller) | n/a | active |
The team also curates ace-step/awesome-ace-step. Issue activity, ComfyUI integration cadence, and the LM-planner architectural jump in v1.5 all indicate a project that is healthier and growing faster than YuE or DiffRhythm.
11. Real-world adoption
- AMD vendor-backed deployment: AMD published a blog "Commercial-grade AI music generation on AMD Ryzen AI processors and Radeon graphics with ACE Step 1.5" in 2026, explicitly endorsing it for Ryzen AI / Radeon production stacks (AMD blog).
- Third-party SaaS: `acestep.io` and `ace-step.app` run hosted song-generation services on the open weights (acestep.io, ace-step.app).
- Production-grade UI: `fspecii/ace-step-ui` brands itself as "the Ultimate Open Source Suno Alternative" with stem extraction (Demucs), batch generation, library/playlist management, LAN access (fspecii/ace-step-ui).
- Heart-MuLa and similar music platforms cite ACE-Step 1.5 in their stack comparisons (heart-mula.com/ace-step).
12. Fine-tuning + LoRA
- Training code released; documented in `TRAIN_INSTRUCTION.md` and `ZH_RAP_LORA.md` (ace-step/ACE-Step).
- Genre / task LoRAs from the team: `RapMachine` (published as `ACE-Step-v1-chinese-rap-LoRA`), `Lyric2Vocal`, `Text2Samples` (HF org, ace-step.github.io).
- v1.5 quotes "8 songs trainable in ~1 hour on a single RTX 3090" for LoRA personalisation (ACE-Step-1.5).
- LoRA training is verified working on MPS via the Apple-Silicon fork (clockworksquirrel/ace-step-apple-silicon).
13. Pros and cons
Pros
- Apache-2.0 / MIT: fully commercial-friendly, unique in this tier.
- Fastest open music model: 15.63× RTF on a 4090; sub-2 s/song on A100 (v1.5).
- Vocals and instruments natively; v1.5 vocal quality now beats Suno v4 in blind tests.
- 50+ languages with lyric structural tags.
- First-class MPS + MLX support and a dedicated Apple-Silicon fork.
- ComfyUI native + thriving UI ecosystem (`ace-step-ui`).
- LoRA training is cheap (~1 hour for 8 songs on 3090), well-documented.
- Hybrid LM-planner (v1.5) closes the long-range structure gap with Suno.
Cons
- v1 vocals are admitted "coarse"; even v1.5 trails Suno on pop/EDM polish.
- High seed sensitivity ("gacha" outputs); multiple re-rolls needed in production.
- Less-represented languages underperform.
- Memory for the XL series is heavy: ≥20 GB without offload, ≥24 GB for optimal performance.
- No official v2 announced; the rapid v1 → v1.5 → XL fork hints at API/checkpoint churn.
- Smaller benchmark literature than Suno/YuE; some metrics still self-reported.
14. Verdict for the user's platform
For a Suno-like platform on M5 Max with 128 GB unified memory, ACE-Step is currently the single strongest open-source choice and should be the default base model:
- Best for: full-song generation with vocals in 50+ languages, fast iteration (sub-minute per song expected on M5 Max), genre-specific LoRA fine-tuning, and any deployment where commercial rights matter (Apache/MIT vs Suno's locked-down terms).
- Recommended stack: ACE-Step v1.5 XL (4B DiT) + 1.7B Qwen3 planner, run via the `clockworksquirrel/ace-step-apple-silicon` MPS/MLX fork, served behind the `fspecii/ace-step-ui` frontend, with ComfyUI workflows for power-user editing.
- Weaknesses to mitigate: budget for n-of-k re-roll selection in the product UX (the gacha problem); pair with a Demucs stem-extraction post-process (already in `ace-step-ui`) so users can mix-down; do not pitch the platform on pop/EDM polish alone, and lean into folk/classical/jazz and rap, where ACE-Step now leads.
- Where you may still need Suno-style commercial APIs: clients demanding broadcast-radio pop polish; otherwise, ACE-Step is sufficient.
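The n-of-k re-roll mitigation mentioned above can be sketched as a best-of-k loop over seeds. Everything model-specific here is a stand-in: `generate` and `score` are hypothetical callables (a real system would call the music model and some quality metric or user vote), and the fake versions below exist only so the sketch runs without a model.

```python
import random
from typing import Callable


def best_of_k(generate: Callable[[int], object],
              score: Callable[[object], float],
              k: int,
              base_seed: int = 0) -> tuple[int, object, float]:
    """Generate k candidates with distinct seeds; return (seed, candidate, score) of the best."""
    if k < 1:
        raise ValueError("k must be at least 1")
    best = None
    for seed in range(base_seed, base_seed + k):
        candidate = generate(seed)
        s = score(candidate)
        if best is None or s > best[2]:
            best = (seed, candidate, s)
    return best


# Stand-ins so the sketch runs without a model: a fake "clip" is just a float.
def fake_generate(seed: int) -> float:
    return random.Random(seed).random()


def fake_score(clip: float) -> float:
    return clip


seed, clip, quality = best_of_k(fake_generate, fake_score, k=4)
print(f"kept seed {seed} with score {quality:.3f}")
```

Returning the winning seed alongside the clip lets the product re-render or extend the same take later, which matters when outputs vary as much between seeds as described above.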
Sources
- ACE-Step paper, arXiv 2506.00045
- ace-step.github.io
- ace-step/ACE-Step (v1 repo)
- ace-step/ACE-Step-1.5
- ACE-Step v1-3.5B model card
- ACE-Step org on Hugging Face
- clockworksquirrel/ace-step-apple-silicon
- fspecii/ace-step-ui
- ace-step/ACE-Step-ComfyUI
- ace-step/awesome-ace-step
- ComfyUI native ACE-Step tutorial
- ComfyUI Wiki ACE-Step guide
- Purz blog: ACE-Step 1.5 in ComfyUI
- AMD blog: ACE-Step 1.5 on Ryzen AI / Radeon
- FM9: ACE-Step vs Suno blind test
- HeartMuLa: ACE-Step 1.5 review
- ResearchGate: ACE-Step paper page
- Apple Silicon HPC benchmark, arXiv 2502.05317
- acestep.io: hosted service
- ace-step.app: hosted service