Spaces:

techfreakworm
/

ACE-Music-Studio

Running on Zero

App Files Files Community

ACE-Music-Studio / research /03_acestep.md

techfreakworm

docs: track spec + mockups + model research

9071450 unverified 4 days ago

preview code

raw

history blame contribute delete

16.7 kB

	# ACE-Step — Deep Technical Report

	Researched 2026-05-18 for a Suno-like platform build on M5 Max (128 GB unified) / MPS.

	---

	## 1. Overview

	ACE-Step is a foundation model for music generation jointly built by ACE Studio (the consumer music-tech outfit behind ACE Studio's vocal synth) and StepFun ("Step-AI"), a Beijing-based foundation-model lab. Core authors: Junmin Gong, Sean Zhao, Sen Wang, Shengyuan Xu, Joe Guo ([ace-step.github.io](https://ace-step.github.io/)).

	Release timeline:
	- v1 (3.5B) — open-sourced May 2025; technical report posted on arXiv on 2 Jun 2025 as 2506.00045 ([arxiv.org/abs/2506.00045](https://arxiv.org/abs/2506.00045)).
	- v1.5 — released 28 Jan 2026 as a separate repo, [`ace-step/ACE-Step-1.5`](https://github.com/ace-step/ACE-Step-1.5). Adds a hybrid Language-Model + Diffusion-Transformer planner.
	- XL series (4B DiT decoder) — released 2 Apr 2026 as a higher-quality variant inside the v1.5 family.
	- Latest tag — v0.1.7 on 24 Apr 2026 ([ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)).
	- v2 — no public roadmap or announcement as of 18 May 2026.

	Current status: actively maintained, 10.4k stars on the v1.5 repo and 4.5k on the original v1 repo, with a thriving ComfyUI ecosystem and third-party UIs ([ace-step/ACE-Step](https://github.com/ace-step/ACE-Step), [ace-step/ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)).

	---

	## 2. Architecture

	v1 (3.5B): a hybrid that fuses three pieces (per the paper, [arxiv.org/abs/2506.00045](https://arxiv.org/abs/2506.00045)):
	1. Sana Deep Compression AutoEncoder (DCAE) — high-compression audio latent space borrowed from NVIDIA's Sana image work.
	2. Lightweight linear transformer — the diffusion backbone, deliberately linear-attention to keep RTF low.
	3. Diffusion training with MERT + m-HuBERT providing semantic-alignment supervision (REPA-style) during training so latents stay musically coherent.

	This sits between LLM-token approaches (Suno/YuE, slow but lyric-tight) and pure diffusion (DiffRhythm, fast but structurally weak). The design goal stated in the paper is "a fast, general-purpose, efficient yet flexible architecture" — explicitly a foundation model, not just a text-to-song pipeline ([arxiv.org/abs/2506.00045](https://arxiv.org/abs/2506.00045)).

	v1.5: a hybrid LM-as-planner + Diffusion-Transformer (DiT). A small Qwen3-based LM (0.6B / 1.7B / 4B) turns the user prompt into a structured "song blueprint" (sections, key, bpm, lyrics, vocal style) which the DiT (2B standard or 4B XL) decodes into audio. This brings chain-of-thought reasoning to music structure, lifting long-range coherence — Suno's main historic advantage ([ACE-Step-1.5 README](https://github.com/ace-step/ACE-Step-1.5)).

	Parameter counts:
	\| Variant \| DiT \| LM planner \| Total \|
	\|---\|---\|---\|---\|
	\| v1-3.5B \| 3.5B (DiT only) \| — \| 3.5B \|
	\| v1.5 standard \| 2B \| 0.6B / 1.7B \| ~2.6 – 3.7B \|
	\| v1.5 XL \| 4B \| up to 4B \| up to 8B \|

	---

	## 3. Variants and checkpoints

	All on Hugging Face under the `ACE-Step/` org ([ACE-Step org on HF](https://huggingface.co/ACE-Step)):
	- `ACE-Step-v1-3.5B` — the original generalist model ([HF card](https://huggingface.co/ACE-Step/ACE-Step-v1-3.5B)).
	- `ACE-Step-v1-chinese-rap-LoRA` ("RapMachine") — genre-specific LoRA.
	- LoRA family shipped by the team: `RapMachine`, `Lyric2Vocal` (vocal-only stem from lyrics), `Text2Samples` (instrumental loops/samples) ([ace-step.github.io](https://ace-step.github.io/)).
	- v1.5 DiT checkpoints: 2B standard and 4B XL.
	- v1.5 LM planners: 0.6B, 1.7B, 4B.
	- A public Space demo at [huggingface.co/spaces/ACE-Step/ACE-Step](https://huggingface.co/spaces/ACE-Step/ACE-Step).

	No v2 checkpoint exists yet.

	---

	## 4. License

	Apache 2.0 for v1 ([ace-step/ACE-Step](https://github.com/ace-step/ACE-Step)) and MIT for v1.5 ([ace-step/ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)). Both are unambiguously commercial-use-permitted, royalty-free. This is the single biggest licensing advantage over Suno/Udio and even over YuE (which carries non-commercial clauses in parts of its weights chain).

	---

	## 5. Vocal support — CRITICAL VERIFICATION

	Verdict: YES — ACE-Step generates vocals natively. The "instrumental-only" claim circulating in some reviews is wrong (likely conflating it with `Text2Samples` LoRA or with DiffRhythm).

	Evidence:
	- The v1 HF model card describes the model as full-song (vocals + instruments) with the explicit caveat: "Coarse vocal synthesis lacking nuance" and "Rare instruments may not render perfectly" ([HF card](https://huggingface.co/ACE-Step/ACE-Step-v1-3.5B)).
	- The paper claims lyric alignment across melody/harmony/rhythm metrics — only meaningful for sung vocals ([arxiv.org/abs/2506.00045](https://arxiv.org/abs/2506.00045)).
	- The ComfyUI native node `TextEncodeAceStepAudio` accepts lyrics with `[verse] [chorus] [bridge]` structural tags ([comfyui-wiki guide](https://comfyui-wiki.com/en/tutorial/advanced/audio/ace-step/ace-step-v1)).
	- `Lyric2Vocal` LoRA exists because the base model already does vocals — the LoRA isolates the vocal stem ([ace-step.github.io](https://ace-step.github.io/)).
	- Blind-listening review of 50 participants scored ACE-Step v1.5 4.4/5 on SongEval Vocal vs Suno v4 at 4.1/5 ([fm9.ai/ace-step/vs-suno](https://fm9.ai/ace-step/vs-suno)).

	Quality reality check: v1 vocals are admitted to be "coarse"; v1.5 markedly improves vocal clarity and now beats Suno v4 in blind tests on naturalness for folk/classical/jazz, while Suno still wins on "radio-ready polish" for pop/EDM ([fm9.ai/ace-step/vs-suno](https://fm9.ai/ace-step/vs-suno)).

	---

	## 6. Languages supported

	- v1: 19 languages, with the top 10 (English, Mandarin Chinese, Russian, Spanish, Japanese, German, French, Portuguese, Italian, Korean) performing best ([ace-step/ACE-Step](https://github.com/ace-step/ACE-Step)). Less-represented languages underperform due to training-data imbalance.
	- v1.5: Expanded to 50+ languages with lyric control, alongside the planner LM ([ace-step/ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)).

	Known weakness from the team itself: Chinese rap was historically weak, motivating the `chinese-rap-LoRA` ([ace-step.github.io](https://ace-step.github.io/)).

	---

	## 7. Speed claims — verified

	The famous claim: "synthesizes up to 4 minutes of music in just 20 seconds on an A100 GPU — 15× faster than LLM-based baselines" ([arxiv.org/abs/2506.00045](https://arxiv.org/abs/2506.00045), [ace-step.github.io](https://ace-step.github.io/)). Hardware: NVIDIA A100 80GB.

	Published RTF table from the v1 HF card ([HF card](https://huggingface.co/ACE-Step/ACE-Step-v1-3.5B)):

	\| Device \| 27 steps RTF \| 60 steps RTF \|
	\|---\|---\|---\|
	\| RTX 4090 \| 34.48× \| 15.63× \|
	\| A100 \| 27.27× \| 12.27× \|
	\| RTX 3090 \| 12.76× \| 6.48× \|
	\| M2 Max \| 2.27× \| 1.03× \|

	v1.5 is faster still: "under 2 seconds per full song on A100 and under 10 seconds on an RTX 3090" ([ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)).

	Apple-Silicon equivalents (from the dedicated [clockworksquirrel/ace-step-apple-silicon](https://github.com/clockworksquirrel/ace-step-apple-silicon) port):

	\| Task \| M1 Pro 16 GB \| M3 Pro 36 GB \| A100 \|
	\|---\|---\|---\|---\|
	\| 30 s turbo \| ~45 s \| ~25 s \| ~2 s \|
	\| 30 s SFT (full) \| ~3 min \| ~1.5 min \| ~8 s \|

	M5 Max projection: The M5 Max's GPU TFLOPS lineage (MPS SGEMM scaled M1→M4: 1.36 → 2.24 → 2.47 → 2.9 TFLOPS, per [arxiv 2502.05317](https://arxiv.org/html/2502.05317v1)) plus the M5 generation's ~30 % uplift suggests roughly 3.5–4× the throughput of M2 Max, i.e. an estimated 8–10× RTF at 27 steps for v1, and full-song generation in ~30–50 s for a 4-minute song. No M5-specific public benchmark exists yet.

	---

	## 8. Quality assessment

	From the cross-model evaluation summarised in research-aggregator coverage ([researchgate paper page](https://www.researchgate.net/publication/392334894_ACE-Step_A_Step_Towards_Music_Generation_Foundation_Model), [fm9.ai/ace-step/vs-suno](https://fm9.ai/ace-step/vs-suno)):

	\| Dimension \| Leader \| Where ACE-Step sits \|
	\|---\|---\|---\|
	\| Aesthetic quality \| Hailuo > DiffRhythm \| mid-upper \|
	\| Musicality (coherence) \| Suno v3 \| competitive, strong on memorability/clarity \|
	\| Style alignment \| Udio v1 > Hailuo \| 3rd \|
	\| Lyric alignment \| Hailuo \| strong, beats Suno v3, Udio, YuE \|
	\| Vocal naturalness (v1.5) \| ACE-Step 4.4/5 \| beats Suno v4 (4.1/5) \|
	\| Speed (RTF) \| ACE-Step 15.63× \| best in class; DiffRhythm 10.03×, YuE 0.083× \|

	User-facing reception is positive on customisability and speed; the most-cited weakness is "gacha"-style seed sensitivity — re-rolls produce noticeably different outputs ([ace-step.github.io](https://ace-step.github.io/)).

	---

	## 9. Inference performance & Apple Silicon

	- VRAM (v1): minimum 8 GB with CPU offload; comfortable on 12 GB+ ([ace-step/ACE-Step](https://github.com/ace-step/ACE-Step)).
	- VRAM (v1.5): <4 GB for 2B-turbo with offload; ≥12 GB for XL with offload; ≥20 GB without offload; ≥24 GB optimal ([ACE-Step-1.5 README](https://github.com/ace-step/ACE-Step-1.5)).
	- MPS support: first-class. Use `--bf16 false` on M-series to avoid kernel issues ([ace-step/ACE-Step](https://github.com/ace-step/ACE-Step)). The dedicated [clockworksquirrel/ace-step-apple-silicon](https://github.com/clockworksquirrel/ace-step-apple-silicon) fork adds: bfloat16 throughout, MPS-safe pipeline with `torch.mps.empty_cache()` synchronisation, MLX backend (567 LoC) that auto-converts the Qwen3 planner LM to MLX with quantisation, and LoRA training on MPS.
	- ComfyUI: native nodes ship in upstream ComfyUI (`TextEncodeAceStepAudio` etc.) plus the official [`ace-step/ACE-Step-ComfyUI`](https://github.com/ace-step/ACE-Step-ComfyUI). v1.5 has dedicated workflows (split-LLM and AIO checkpoint variants) on comfy.org ([Purz blog post](https://blog.comfy.org/p/ace-step-15-is-now-available-in-comfyui)).
	- 128 GB unified on M5 Max comfortably fits the full XL stack plus the 4B planner LM with no offload needed; user's hardware is essentially overkill for ACE-Step.

	---

	## 10. Repo health

	\| Repo \| Stars \| Forks \| Last release \|
	\|---\|---\|---\|---\|
	\| `ace-step/ACE-Step` (v1) \| 4.5k \| 568 \| quiet since v1.5 fork \|
	\| `ace-step/ACE-Step-1.5` \| 10.4k \| 1.3k \| v0.1.7 on 24 Apr 2026 \|
	\| `fspecii/ace-step-ui` (popular community UI) \| 3.8k \| 561 \| active \|
	\| `clockworksquirrel/ace-step-apple-silicon` \| — (smaller) \| — \| active \|

	The team also curates [`ace-step/awesome-ace-step`](https://github.com/ace-step/awesome-ace-step). Issue activity, ComfyUI integration cadence, and the LM-planner architectural jump in v1.5 all indicate a project that is healthier and growing faster than YuE or DiffRhythm.

	---

	## 11. Real-world adoption

	- AMD vendor-backed deployment: AMD published a blog "Commercial-grade AI music generation on AMD Ryzen AI processors and Radeon graphics with ACE Step 1.5" in 2026, explicitly endorsing it for Ryzen AI / Radeon production stacks ([AMD blog](https://www.amd.com/en/blogs/2026/commercial-grade-ai-music-generation-on-amd-ryzen-ai-and-radeon-ace-step-1-5.html)).
	- Third-party SaaS: `acestep.io` and `ace-step.app` run hosted song-generation services on the open weights ([acestep.io](https://acestep.io/), [ace-step.app](https://ace-step.app/)).
	- Production-grade UI: `fspecii/ace-step-ui` brands itself as "the Ultimate Open Source Suno Alternative" with stem extraction (Demucs), batch generation, library/playlist management, LAN access ([fspecii/ace-step-ui](https://github.com/fspecii/ace-step-ui)).
	- Heart-MuLa and similar music platforms cite ACE-Step 1.5 in their stack comparisons ([heart-mula.com/ace-step](https://heart-mula.com/ace-step)).

	---

	## 12. Fine-tuning + LoRA

	- Training code released; documented in [`TRAIN_INSTRUCTION.md`](https://github.com/ace-step/ACE-Step) and `ZH_RAP_LORA.md` ([ace-step/ACE-Step](https://github.com/ace-step/ACE-Step)).
	- Genre / task LoRAs from the team: `RapMachine` (general rap), `Chinese-Rap-LoRA`, `Lyric2Vocal`, `Text2Samples` ([HF org](https://huggingface.co/ACE-Step), [ace-step.github.io](https://ace-step.github.io/)).
	- v1.5 quotes "8 songs trainable in ~1 hour on a single RTX 3090" for LoRA personalisation ([ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)).
	- LoRA training is verified working on MPS via the Apple-Silicon fork ([clockworksquirrel/ace-step-apple-silicon](https://github.com/clockworksquirrel/ace-step-apple-silicon)).

	---

	## 13. Pros and cons

	Pros
	- Apache-2.0 / MIT — fully commercial-friendly, unique in this tier.
	- Fastest open music model: 15.63× RTF on a 4090; sub-2 s/song on A100 (v1.5).
	- Vocals and instruments natively; v1.5 vocal quality now beats Suno v4 in blind tests.
	- 50+ languages with lyric structural tags.
	- First-class MPS + MLX support and a dedicated Apple-Silicon fork.
	- ComfyUI native + thriving UI ecosystem (`ace-step-ui`).
	- LoRA training is cheap (~1 hour for 8 songs on 3090), well-documented.
	- Hybrid LM-planner (v1.5) closes the long-range structure gap with Suno.

	Cons
	- v1 vocals are admitted "coarse"; even v1.5 trails Suno on pop/EDM polish.
	- High seed sensitivity → "gacha" outputs; multiple re-rolls needed in production.
	- Less-represented languages underperform.
	- Memory for XL series can exceed 24 GB without offload.
	- No official v2 announced; the rapid v1 → v1.5 → XL fork hints at API/checkpoint churn.
	- Smaller benchmark literature than Suno/YuE; some metrics still self-reported.

	---

	## 14. Verdict for the user's platform

	For a Suno-like platform on M5 Max with 128 GB unified memory, ACE-Step is currently the single strongest open-source choice and should be the default base model:

	- Best for: full-song generation with vocals in 50+ languages, fast iteration (sub-minute per song expected on M5 Max), genre-specific LoRA fine-tuning, and any deployment where commercial rights matter (Apache/MIT vs Suno's locked-down terms).
	- Recommended stack: ACE-Step v1.5 XL (4B DiT) + 1.7B Qwen3 planner, run via the `clockworksquirrel/ace-step-apple-silicon` MPS/MLX fork, served behind the `fspecii/ace-step-ui` frontend, with ComfyUI workflows for power-user editing.
	- Weaknesses to mitigate: budget for n-of-k re-roll selection in the product UX (the gacha problem); pair with a Demucs stem-extraction post-process (already in `ace-step-ui`) so users can mix-down; do not pitch the platform on pop/EDM polish alone — lean into folk/classical/jazz and rap, where ACE-Step now leads.
	- Where you may still need Suno-style commercial APIs: clients demanding broadcast-radio pop polish; otherwise, ACE-Step is sufficient.

	---

	### Sources

	- [ACE-Step paper, arXiv 2506.00045](https://arxiv.org/abs/2506.00045)
	- [ace-step.github.io](https://ace-step.github.io/)
	- [ace-step/ACE-Step (v1 repo)](https://github.com/ace-step/ACE-Step)
	- [ace-step/ACE-Step-1.5](https://github.com/ace-step/ACE-Step-1.5)
	- [ACE-Step v1-3.5B model card](https://huggingface.co/ACE-Step/ACE-Step-v1-3.5B)
	- [ACE-Step org on Hugging Face](https://huggingface.co/ACE-Step)
	- [clockworksquirrel/ace-step-apple-silicon](https://github.com/clockworksquirrel/ace-step-apple-silicon)
	- [fspecii/ace-step-ui](https://github.com/fspecii/ace-step-ui)
	- [ace-step/ACE-Step-ComfyUI](https://github.com/ace-step/ACE-Step-ComfyUI)
	- [ace-step/awesome-ace-step](https://github.com/ace-step/awesome-ace-step)
	- [ComfyUI native ACE-Step tutorial](https://docs.comfy.org/tutorials/audio/ace-step/ace-step-v1)
	- [ComfyUI Wiki ACE-Step guide](https://comfyui-wiki.com/en/tutorial/advanced/audio/ace-step/ace-step-v1)
	- [Purz blog – ACE-Step 1.5 in ComfyUI](https://blog.comfy.org/p/ace-step-15-is-now-available-in-comfyui)
	- [AMD blog – ACE-Step 1.5 on Ryzen AI / Radeon](https://www.amd.com/en/blogs/2026/commercial-grade-ai-music-generation-on-amd-ryzen-ai-and-radeon-ace-step-1-5.html)
	- [FM9 – ACE-Step vs Suno blind test](https://fm9.ai/ace-step/vs-suno)
	- [HeartMuLa – ACE-Step 1.5 review](https://heart-mula.com/ace-step)
	- [ResearchGate – ACE-Step paper page](https://www.researchgate.net/publication/392334894_ACE-Step_A_Step_Towards_Music_Generation_Foundation_Model)
	- [Apple Silicon HPC benchmark, arXiv 2502.05317](https://arxiv.org/html/2502.05317v1)
	- [acestep.io – hosted service](https://acestep.io/)
	- [ace-step.app – hosted service](https://ace-step.app/)