# Open-Source Song Generation Models: Side-by-Side Comparison

*Compiled 2026-05-18 for M5 Max / 128 GB unified memory target.*

---

## Headline matrix

| Property | **ACE-Step 1.5 XL** | **HeartMuLa 4B** | **DiffRhythm 2** | **YuE 7B** | SongGeneration 2 |
|---|---|---|---|---|---|
| **Builder** | ACE Studio × StepFun | HeartMuLa | NWPU ASLP-lab + Xiaomi | M-A-P / HKUST | Tencent AI Lab |
| **Release** | 2026-01-28 | 2026-01-19 | 2025-10-27 → 2026-02-03 (v3) | 2025-01-26 | 2026-03-01 |
| **License** | **MIT** | **Apache 2.0** | **Apache 2.0** | **Apache 2.0** | **Custom NON-commercial** |
| **Repo stars** | 10.4 k | 3.6 k | ~2.3 k (v1) + 0.16 k (v2) | 6.2 k | 1.6 k |
| **Last major commit** | v0.1.7 (2026-04-24) | 2026-02 | 2026-02 | 2025-06-04 (stale) | 2026-03-01 |
| **Architecture** | LM-planner (Qwen3 0.6/1.7/4 B) + DiT (2/4 B) | CLAP + ASR + 12.5 Hz Codec + 4 B LLM | 5 Hz Music VAE + DiT w/ block flow matching | LLaMA2 7B AR Stage-1 + 1B Stage-2 + X-Codec | LeLM hybrid + diffusion decoder |
| **Params (largest)** | up to 8 B (4 B DiT + 4 B LM) | ~4 B + 2 B codec + 0.8 B ASR | ~1 B DiT + 170 M VAE-dec | 7 B + 1 B + upsampler | 4 B (v2-large) |
| **Audio rate** | 44.1 kHz stereo | 24 kHz neural codec | 44.1 kHz stereo | 16 kHz then upsampled | High-fi via diffusion |
| **Max length** | 4+ min | ≥1 min, scaling | **210 s (regression from v1)** | 5 min | 4:30 |
| **Vocals + Instruments** | ✅ Native | ✅ Native | ✅ Native, single stream | ✅ Native, dual-track AR | ✅ Dual-track |
| **Languages** | 50+ | 5+ (en/zh/ja/ko/es benchmarked) | Bilingual EN/ZH + JP/KR/ES marketing-only | EN, Mandarin, Cantonese, JP, KR | zh/en/es/ja + others |
| **VRAM (minimum)** | **<4 GB** with offload (turbo) | 6 GB 4-bit / 12 GB bf16 | 8 GB v1 with `--chunked` | 24 GB consumer / 80 GB single-pass | 22–28 GB |
| **VRAM (recommended)** | 12 GB+ offload, 24 GB optimal | 24 GB for 7B (unreleased) | 24 GB | 80 GB H100/H800 | 28 GB |
| **MPS / Apple Silicon** | **First-class, MLX + MPS, dedicated fork** | **MLX port, 2.1× PyTorch MPS** | Likely OK; clean deps; untested | ❌ Mandatory flash-attn | Community fork, pre-chorus bug |
| **MPS bench M-series (30 s clip)** | M3 Pro 25 s turbo / 1.5 min SFT | M2 Max 11.6 s for 50 frames | not published | not published | M1 Max 4–6 min for 2 min |
| **MPS bench M5 Max (projected)** | turbo ~10–15 s / SFT ~45–60 s | <real-time | low-minute range | n/a | ~2–3× M1 Max |
| **Speed (RTF on A100 / 4090)** | sub-2 s/song on A100 (v1.5) | RTF ≈ 1.0 | v2 RTF 0.213 (4090) → ~45 s for 210 s | 27 steps RTF 27.27× on A100 (v1, ~15 min/song) | RTF 0.82 (H20) |
| **Vocal naturalness vs Suno v4** | **4.4/5 vs 4.1/5** (blind 50-person test) | Vendor only, unverified | Authors admit clear gap vs v4.5 | Comparable vocal range; weaker mix | Vendor claim parity, unverified |
| **Lyric alignment (PER)** | Strong (lyric tags) | Vendor: 0.09 EN / 0.12 ZH (unit mismatch) | **0.13 (open-source SOTA)** | Strong from lyric tags | Vendor: 8.55 % |
| **Fine-tuning support** | ✅ LoRA, 8 songs/1h on 3090, **MPS-validated** | ❌ No public training code | ❌ "Coming soon" since Mar 2025 | ✅ LoRA (Megatron pipeline, CUDA 12.1+) | ❌ |
| **ComfyUI integration** | ✅ Native, official workflows | ✅ FL-HeartMuLa | ✅ billwuhao/ComfyUI_DiffRhythm | ✅ smthemex/ComfyUI_YuE | ✅ |
| **Replicate hosted** | ❌ no first-party | ❌ | ❌ | ✅ fofr/yue | ❌ |
| **Style/audio reference** | LoRA + lyric tags | Reference audio supported | Reference audio supported | ICL mode (style cloning) | Limited |
| **Stem separation** | Built into `fspecii/ace-step-ui` via Demucs | Modular Codec is reusable | ❌ single stream | ✅ AR dual-track is inherently separable | ✅ Dual-track output |
| **Continuation / extension** | Supported in workflows | Limited | Supported | ✅ explicit continuation mode | Supported |
| **Production deployments** | acestep.io, ace-step.app, fspecii/ace-step-ui, AMD-blessed | WaveSpeed AI, HeartMuse local app | Chutes serverless | Replicate fofr/yue, HF Spaces | WaveSpeed AI, HF Space |
| **Watermarking / content credentials** | None baked-in | None baked-in | None baked-in | None baked-in | None baked-in |
| **License gotchas** | None (MIT) | None (Apache 2.0) | Ethical disclaimer (non-binding) | Attribution required ("YuE by HKUST/M-A-P"), label "AI-generated" | **Commercial use prohibited** |
| **Independent benchmarks** | Yes: 50-person blind test, AMD vendor-validated | None located | Internal MOS only | Paper + community | None (Tencent only) |
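
The speed row is easier to compare once each real-time factor (RTF) is turned into wall-clock seconds. The sketch below assumes RTF means generation time divided by audio duration, which is the convention the DiffRhythm 2 figures imply; other rows (for example YuE's "27.27×") may use a different convention. `generation_seconds` is an illustrative helper name, not part of any of these projects.

```python
def generation_seconds(rtf: float, audio_seconds: float) -> float:
    """Wall-clock seconds needed to generate `audio_seconds` of audio.

    Assumes RTF = generation time / audio duration (lower is faster).
    """
    return rtf * audio_seconds


if __name__ == "__main__":
    # DiffRhythm 2 on a 4090: RTF 0.213 for a 210 s song (figures from the table).
    print(f"{generation_seconds(0.213, 210):.1f} s")  # 44.7 s, i.e. roughly 45 s
```

Under the same assumption, an RTF of 1.0 means a song takes as long to generate as it does to play.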

---

## Quality dimensions (qualitative)

| Dimension | Best (open source) | Notes |
|---|---|---|
| **Pop / EDM polish** | (none; Suno v4/v5 still wins) | All open models lag commercial offerings. |
| **Folk / classical / jazz vocal naturalness** | **ACE-Step 1.5 XL** | Wins blind test vs Suno v4 in these genres. |
| **Lyric intelligibility (PER)** | **DiffRhythm 2** (0.13) | HeartMuLa claims lower but unit-incomparable. |
| **Musical macro-structure (verse/chorus/bridge over 3-5 min)** | **YuE** or **ACE-Step 1.5** (planner) | LM-planner models lead diffusion-only here. |
| **Stereo image, mix depth** | **DiffRhythm 2** (44.1 kHz stereo native) | YuE is mono-ish; ACE-Step is stereo but variable. |
| **Genre breadth** | **YuE** | Death-growl metal to Beijing opera to rap. |
| **Multilingual breadth** | **ACE-Step 1.5** | 50+ languages w/ lyric tags; YuE deep on 5 only. |
| **Code-switching (English ↔ Mandarin in one song)** | **YuE** | Explicit demos. |
| **Speed / cost per song** | **ACE-Step 1.5** | Sub-2 s/song on A100; <minute on M5 Max. |
| **Modular reusability of components** | **HeartMuLa** | Codec/ASR/CLAP separately exportable. |

---

## Cost model (rough)

| Path | Per-song cost | Latency | Best for |
|---|---|---|---|
| Self-host ACE-Step 1.5 on M5 Max | $0 marginal (electricity) | ~30-50 s | Dev, beta, low-volume |
| Self-host ACE-Step 1.5 on rented A100 80 GB | ~$0.0008 (sub-2 s × $1.50/hr) | <2 s | Production, paid SaaS |
| Replicate `fofr/yue` | ~$0.30-1.00 per song (estimated from 4090 cog runtime) | 5-15 min | Multilingual fallback, occasional |
| Self-host DiffRhythm 2 on 4090 | $0 marginal on owned 4090 | ~45 s | Speed tier, instrumentals |
| Replicate / WaveSpeed managed endpoints | varies | varies | Cold-start / spike capacity |
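
The rented-GPU figure is plain arithmetic: seconds per song times the hourly rate, divided by 3600. A minimal sketch, assuming the GPU is billed only for the seconds it spends generating (no idle time, cold starts, or minimum billing increments); `cost_per_song` is an illustrative name.

```python
def cost_per_song(seconds_per_song: float, hourly_rate_usd: float) -> float:
    """GPU rental cost in USD for one generation, ignoring idle time."""
    return seconds_per_song * hourly_rate_usd / 3600.0


if __name__ == "__main__":
    # 2 s per song on an A100 rented at $1.50/hr (figures from the table).
    print(f"${cost_per_song(2.0, 1.50):.5f} per song")  # $0.00083 per song
```

Real costs will be higher whenever the instance sits idle between requests, so this is best read as a lower bound.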

---

## License risk matrix

| License | Commercial SaaS | Output ownership | Risk |
|---|---|---|---|
| MIT (ACE-Step 1.5) | ✅ | User owns | Lowest |
| Apache 2.0 (ACE-Step v1, HeartMuLa, DiffRhythm v1/v2, YuE) | ✅ with attribution | User owns | Low |
| Tencent custom (SongGeneration, SongBloom) | ❌ **prohibited** | n/a | **Blocks SaaS** |
| Suno API (closed-source baseline) | $ paid tier | platform terms | Medium |
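
For a pipeline that routes requests across several models, the matrix can be encoded as a small lookup so that non-commercial models are filtered out before deployment. This is a sketch only: the `LicenseEntry` type and `saas_safe` helper are hypothetical names, and the entries simply restate the table rather than the license texts themselves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LicenseEntry:
    license: str
    commercial_saas: bool  # True if the table marks commercial SaaS as allowed


# Entries restate the license risk matrix; verify against the actual licenses.
LICENSES = {
    "ACE-Step 1.5": LicenseEntry("MIT", True),
    "HeartMuLa": LicenseEntry("Apache 2.0", True),
    "DiffRhythm 2": LicenseEntry("Apache 2.0", True),
    "YuE": LicenseEntry("Apache 2.0", True),
    "SongGeneration 2": LicenseEntry("Tencent custom (non-commercial)", False),
}


def saas_safe(models):
    """Return the subset of `models` whose license allows commercial SaaS."""
    return [m for m in models if LICENSES[m].commercial_saas]


if __name__ == "__main__":
    print(saas_safe(list(LICENSES)))
```

An unknown model name raises `KeyError`, which is deliberate here: a model with no recorded license should not pass the filter silently.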

---

## Hardware sizing on M5 Max (128 GB unified memory)

| Model | Fits? | Headroom | Notes |
|---|---|---|---|
| ACE-Step 1.5 XL (4 B DiT + 4 B planner) | ✅ huge | ~120 GB free | Overkill; LoRA training viable in-RAM |
| HeartMuLa 4B + 2 B codec + 0.8 B ASR | ✅ huge | ~120 GB free | 7 B variant when released will also fit |
| DiffRhythm 2 (~1 B + 170 M VAE-dec) | ✅ trivial | ~125 GB free | Tiny by 2026 standards |
| YuE 7B Stage-1 + 1B Stage-2 + upsampler | ✅ but blocked | n/a | Memory fine, **flash-attn dep blocks MPS** |
| SongGeneration 2-large (4 B + diffusion) | ✅ comfortable | ~100 GB free | Community fork bug aside, fits |

**Conclusion:** the user's 128 GB unified memory completely eliminates memory pressure for every model in this list. The constraint is software (MPS kernel compat, flash-attn substitution), not hardware.
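
A rough way to sanity-check the "fits" column is to estimate the weight footprint from parameter counts. The sketch below assumes bf16 weights (2 bytes per parameter) and counts weights only; activations, caches, and the OS take additional memory, so real headroom is smaller than the budget minus these numbers. Parameter counts are the ones listed in this document; `weights_gb` is an illustrative helper.

```python
def weights_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes).

    Defaults to 2 bytes per parameter, i.e. bf16/fp16 (an assumption).
    """
    return params_billion * bytes_per_param


# Parameter counts in billions, taken from the tables in this document.
MODELS = {
    "ACE-Step 1.5 XL": 4.0 + 4.0,      # DiT + planner
    "HeartMuLa 4B": 4.0 + 2.0 + 0.8,   # LLM + codec + ASR
    "DiffRhythm 2": 1.0 + 0.17,        # DiT + VAE decoder
    "YuE 7B": 7.0 + 1.0,               # upsampler size not stated, so omitted
    "SongGeneration 2-large": 4.0,     # diffusion decoder size not stated
}

BUDGET_GB = 128.0

if __name__ == "__main__":
    for name, billions in MODELS.items():
        gb = weights_gb(billions)
        print(f"{name}: ~{gb:.1f} GB of weights, fits={gb < BUDGET_GB}")
```

Passing `bytes_per_param=0.5` approximates 4-bit quantization, and `4.0` approximates fp32.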