Open-Source Song Generation Models — Side-by-Side Comparison

Compiled 2026-05-18 for M5 Max / 128 GB unified memory target.

Headline matrix

Property	ACE-Step 1.5 XL	HeartMuLa 4B	DiffRhythm 2	YuE 7B	SongGeneration 2
Builder	ACE Studio × StepFun	HeartMuLa	NWPU ASLP-lab + Xiaomi	M-A-P / HKUST	Tencent AI Lab
Release	2026-01-28	2026-01-19	2025-10-27 → 2026-02-03 (v3)	2025-01-26	2026-03-01
License	MIT	Apache 2.0	Apache 2.0	Apache 2.0	Custom NON-commercial
Repo stars	10.4 k	3.6 k	~2.3 k (v1) + 0.16 k (v2)	6.2 k	1.6 k
Last major commit	v0.1.7 (2026-04-24)	2026-02	2026-02	2025-06-04 (stale)	2026-03-01
Architecture	LM-planner (Qwen3 0.6/1.7/4 B) + DiT (2/4 B)	CLAP + ASR + 12.5 Hz Codec + 4 B LLM	5 Hz Music VAE + DiT w/ block flow matching	LLaMA2 7B AR Stage-1 + 1B Stage-2 + X-Codec	LeLM hybrid + diffusion decoder
Params (largest)	up to 8 B (4 B DiT + 4 B LM)	~4 B + 2 B codec + 0.8 B ASR	~1 B DiT + 170 M VAE-dec	7 B + 1 B + upsampler	4 B (v2-large)
Audio rate	44.1 kHz stereo	24 kHz neural codec	44.1 kHz stereo	16 kHz then upsampled	High-fi via diffusion
Max length	4+ min	≥1 min, scaling	210 s (regression from v1)	5 min	4:30
Vocals + Instruments	✅ Native	✅ Native	✅ Native, single stream	✅ Native, dual-track AR	✅ Dual-track
Languages	50+	5+ (en/zh/ja/ko/es benchmarked)	Bilingual EN/ZH + JP/KR/ES marketing-only	EN, Mandarin, Cantonese, JP, KR	zh/en/es/ja + others
VRAM (minimum)	<4 GB with offload (turbo)	6 GB 4-bit / 12 GB bf16	8 GB v1 with `--chunked`	24 GB consumer / 80 GB single-pass	22–28 GB
VRAM (recommended)	12 GB+ offload, 24 GB optimal	24 GB for 7B (unreleased)	24 GB	80 GB H100/H800	28 GB
MPS / Apple Silicon	First-class, MLX + MPS, dedicated fork	MLX port, 2.1× PyTorch MPS	Likely OK; clean deps; untested	❌ Mandatory flash-attn	Community fork, pre-chorus bug
MPS bench M-series (30 s clip)	M3 Pro 25 s turbo / 1.5 min SFT	M2 Max 11.6 s for 50 frames	not published	not published	M1 Max 4–6 min for 2 min
MPS bench M5 Max (projected)	turbo ~10–15 s / SFT ~45–60 s	faster than real time	low-minute range	n/a	~2–3× faster than M1 Max
Speed (RTF on A100 / 4090)	sub-2 s/song on A100 (v1.5)	RTF ≈ 1.0	v2 RTF 0.213 (4090) → ~45 s for 210 s	27 steps RTF 27.27× on A100 (v1, ~15 min/song)	RTF 0.82 (H20)
Vocal naturalness vs Suno v4	4.4/5 vs 4.1/5 (blind 50-person test)	Vendor only, unverified	Authors admit clear gap vs v4.5	Comparable vocal range; weaker mix	Vendor claim parity, unverified
Lyric alignment (PER)	Strong (lyric tags)	Vendor: 0.09 EN / 0.12 ZH (unit mismatch)	0.13 (open-source SOTA)	Strong from lyric tags	Vendor: 8.55 %
Fine-tuning support	✅ LoRA, 8 songs/1h on 3090, MPS-validated	❌ No public training code	❌ "Coming soon" since Mar 2025	✅ LoRA (Megatron pipeline, CUDA 12.1+)	❌
ComfyUI integration	✅ Native, official workflows	✅ FL-HeartMuLa	✅ billwuhao/ComfyUI_DiffRhythm	✅ smthemex/ComfyUI_YuE	✅
Replicate hosted	❌ no first-party	❌	❌	✅ fofr/yue	❌
Style/audio reference	LoRA + lyric tags	Reference audio supported	Reference audio supported	ICL mode (style cloning)	Limited
Stem separation	Built into `fspecii/ace-step-ui` via Demucs	Modular Codec is reusable	❌ single stream	✅ AR dual-track is inherently separable	✅ Dual-track output
Continuation / extension	Supported in workflows	Limited	Supported	✅ explicit continuation mode	Supported
Production deployments	acestep.io, ace-step.app, fspecii/ace-step-ui, AMD-blessed	WaveSpeed AI, HeartMuse local app	Chutes serverless	Replicate fofr/yue, HF Spaces	WaveSpeed AI, HF Space
Watermarking / content credentials	None baked-in	None baked-in	None baked-in	None baked-in	None baked-in
License gotchas	None (MIT)	None (Apache 2.0)	Ethical disclaimer (non-binding)	Attribution required ("YuE by HKUST/M-A-P"), label "AI-generated"	Commercial use prohibited
Independent benchmarks	Yes — 50-person blind test, AMD vendor-validated	None located	Internal MOS only	Paper + community	None — Tencent only
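
The speed row mixes real-time factors (RTF) with absolute times, so a small conversion helps when comparing cells. The sketch below assumes RTF means generation time divided by audio duration (lower is faster), which is consistent with the DiffRhythm 2 cell; the function name is illustrative and not taken from any of these projects.

```python
def generation_seconds(rtf: float, audio_seconds: float) -> float:
    """Wall-clock generation time implied by a real-time factor.

    Assumption: RTF = generation time / audio duration, so RTF < 1
    means faster than real time.
    """
    if rtf < 0 or audio_seconds < 0:
        raise ValueError("rtf and audio_seconds must be non-negative")
    return rtf * audio_seconds


# DiffRhythm 2 on a 4090: RTF 0.213 for a 210 s track.
diffrhythm_s = generation_seconds(0.213, 210)
print(f"DiffRhythm 2: {diffrhythm_s:.1f} s")  # about 45 s, as in the table

# SongGeneration 2 on an H20: RTF 0.82 for a 4:30 (270 s) track.
songgen_s = generation_seconds(0.82, 270)
print(f"SongGeneration 2: {songgen_s:.1f} s")
```

Under this reading, SongGeneration 2 takes a little under four minutes for a full-length track, which is a more useful comparison point than the bare RTF.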

Quality dimensions (qualitative)

Dimension	Best (open source)	Notes
Pop / EDM polish	(none — Suno v4/v5 still wins)	All open models lag commercial.
Folk / classical / jazz vocal naturalness	ACE-Step 1.5 XL	Wins blind test vs Suno v4 in these genres.
Lyric intelligibility (PER)	DiffRhythm 2 (0.13)	HeartMuLa claims lower but unit-incomparable.
Musical macro-structure (verse/chorus/bridge over 3-5 min)	YuE or ACE-Step 1.5 (planner)	LM-planner models lead diffusion-only here.
Stereo image, mix depth	DiffRhythm 2 (44.1 kHz stereo native)	YuE is mono-ish; ACE-Step is stereo but variable.
Genre breadth	YuE	Death-growl metal to Beijing opera to rap.
Multilingual breadth	ACE-Step 1.5	50+ languages w/ lyric tags; YuE deep on 5 only.
Code-switching (English ↔ Mandarin in one song)	YuE	Explicit demos.
Speed / cost per song	ACE-Step 1.5	Sub-2 s/song on A100; <minute on M5 Max.
Modular reusability of components	HeartMuLa	Codec/ASR/CLAP separately exportable.

Cost model (rough)

Path	Per-song cost	Latency	Best for
Self-host ACE-Step 1.5 on M5 Max	~$0 marginal (electricity only)	~30-50 s	Dev, beta, low-volume
Self-host ACE-Step 1.5 on rented A100 80 GB	~$0.0008 (sub-2 s × $1.50/hr)	<2 s	Production, paid SaaS
Replicate `fofr/yue`	~$0.30-1.00 per song (estimated from 4090 cog runtime)	5-15 min	Multilingual fallback, occasional
Self-host DiffRhythm 2 on 4090	$0 marginal on owned 4090	~45 s	Speed tier, instrumentals
Replicate / WaveSpeed managed endpoints	varies	varies	Cold-start / spike capacity
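
The rented-GPU figure is plain arithmetic: GPU seconds per song times the hourly rate, divided by 3600. A minimal sketch, assuming the GPU is fully utilised and ignoring idle time, cold starts, storage and egress (all of which raise the real cost); the function name is illustrative.

```python
def cost_per_song(gpu_seconds: float, hourly_rate_usd: float) -> float:
    """Marginal GPU cost of one generation, in USD.

    Assumption: billing is proportional to busy time only.
    """
    return gpu_seconds * hourly_rate_usd / 3600.0


# A100 row: 2 s upper bound per song at $1.50/hr.
a100_cost = cost_per_song(2.0, 1.50)
print(f"${a100_cost:.4f} per song")  # $0.0008

# Same rate applied to a 45 s generation, for comparison.
slow_cost = cost_per_song(45.0, 1.50)
print(f"${slow_cost:.4f} per song")
```

At these inputs a dollar of A100 time buys on the order of a thousand songs, so hosting overhead rather than compute is likely to dominate the bill.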

License risk matrix

License	Commercial SaaS	Output ownership	Risk
MIT (ACE-Step 1.5)	✅	User owns	Lowest
Apache 2.0 (ACE-Step v1, HeartMuLa, DiffRhythm v1/v2, YuE)	✅ with attribution	User owns	Low
Tencent custom (SongGeneration, SongBloom)	❌ prohibited	n/a	Blocks SaaS
Suno API (closed-source baseline)	$ paid tier	platform terms	Medium
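
For a commercial SaaS shortlist, the matrix reduces to a filter on license. The sketch below encodes only the licenses listed in the headline matrix; the set of licenses treated as commercially usable is an assumption of this example, and attribution or labelling obligations (such as YuE's) still apply.

```python
# Licenses as listed in the headline matrix.
LICENSES = {
    "ACE-Step 1.5 XL": "MIT",
    "HeartMuLa 4B": "Apache-2.0",
    "DiffRhythm 2": "Apache-2.0",
    "YuE 7B": "Apache-2.0",
    "SongGeneration 2": "Tencent custom (non-commercial)",
}

# Assumption for this sketch: only these permit commercial SaaS use.
COMMERCIAL_OK = {"MIT", "Apache-2.0"}


def saas_candidates(licenses: dict) -> list:
    """Return model names whose license permits commercial use, sorted."""
    return sorted(name for name, lic in licenses.items() if lic in COMMERCIAL_OK)


print(saas_candidates(LICENSES))
```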

Hardware sizing on M5 Max (128 GB unified memory)

Model	Fits?	Headroom	Notes
ACE-Step 1.5 XL (4 B DiT + 4 B planner)	✅ huge	~120 GB free	Overkill; LoRA training viable in-RAM
HeartMuLa 4B + 2 B codec + 0.8 B ASR	✅ huge	~120 GB free	7 B variant when released will also fit
DiffRhythm 2 (~1 B + 170 M VAE-dec)	✅ trivial	~125 GB free	Tiny by 2026 standards
YuE 7B Stage-1 + 1B Stage-2 + upsampler	✅ but blocked	n/a	Memory fine, flash-attn dep blocks MPS
SongGeneration 2-large (4 B + diffusion)	✅ comfortable	~100 GB free	Community fork bug aside, fits
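
A quick way to sanity-check these rows is to estimate the weight footprint from parameter count and precision. The sketch below is weights-only: it ignores activations, caches, the codec or VAE working set and memory reserved by macOS, so actual headroom will be lower, and the result shifts considerably with the chosen precision. Parameter counts come from the headline matrix; the YuE upsampler is excluded because its size is not given.

```python
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}
TOTAL_GB = 128.0  # M5 Max unified memory target


def weights_gb(params_billions: float, dtype: str = "bf16") -> float:
    """Approximate weight size in decimal GB (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * BYTES_PER_PARAM[dtype] / 1e9


# Total parameters in billions, summed from the headline matrix.
MODELS = {
    "ACE-Step 1.5 XL": 4.0 + 4.0,
    "HeartMuLa 4B": 4.0 + 2.0 + 0.8,
    "DiffRhythm 2": 1.0 + 0.17,
    "YuE 7B": 7.0 + 1.0,  # upsampler size unknown, not counted
    "SongGeneration 2-large": 4.0,
}

for name, params in MODELS.items():
    for dtype in ("bf16", "int8"):
        used = weights_gb(params, dtype)
        print(f"{name:24s} {dtype}: {used:5.1f} GB weights, "
              f"{TOTAL_GB - used:6.1f} GB left")
```

Even at bf16 the largest entry stays under 20 GB of weights, which supports the "fits" column for every model.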

Conclusion: 128 GB of unified memory eliminates memory pressure for every model in this list. The binding constraint is software (MPS kernel compatibility, replacing the flash-attn dependency), not hardware.
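
The flash-attn substitution can be sketched as a backend choice made at load time. The strings below are values that Hugging Face Transformers accepts for `attn_implementation`; whether a given model's inference code honours that argument, or hard-codes flash-attn as YuE is reported to do above, is an assumption that has to be checked per repository. The helper name and its parameters are hypothetical.

```python
import importlib.util
from typing import Optional


def pick_attention(device: str, flash_attn_installed: Optional[bool] = None) -> str:
    """Choose an attention backend name for the given device.

    flash-attn only ships CUDA kernels, so on "mps" or "cpu" this
    falls back to PyTorch's scaled-dot-product attention ("sdpa").
    """
    if flash_attn_installed is None:
        # Detect the package without importing it.
        flash_attn_installed = importlib.util.find_spec("flash_attn") is not None
    if device == "cuda" and flash_attn_installed:
        return "flash_attention_2"
    return "sdpa"


print(pick_attention("mps", flash_attn_installed=False))  # sdpa
print(pick_attention("cuda", flash_attn_installed=True))  # flash_attention_2
```

The returned string would typically be passed as `attn_implementation=...` when loading a Transformers model; output quality and speed after such a swap are untested here.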