# Open-Source Song Generation Models: Side-by-Side Comparison
*Compiled 2026-05-18 for M5 Max / 128 GB unified memory target.*
---
## Headline matrix
| Property | **ACE-Step 1.5 XL** | **HeartMuLa 4B** | **DiffRhythm 2** | **YuE 7B** | SongGeneration 2 |
|---|---|---|---|---|---|
| **Builder** | ACE Studio × StepFun | HeartMuLa | NWPU ASLP-lab + Xiaomi | M-A-P / HKUST | Tencent AI Lab |
| **Release** | 2026-01-28 | 2026-01-19 | 2025-10-27 → 2026-02-03 (v3) | 2025-01-26 | 2026-03-01 |
| **License** | **MIT** | **Apache 2.0** | **Apache 2.0** | **Apache 2.0** | **Custom NON-commercial** |
| **Repo stars** | 10.4 k | 3.6 k | ~2.3 k (v1) + 0.16 k (v2) | 6.2 k | 1.6 k |
| **Last major commit** | v0.1.7 (2026-04-24) | 2026-02 | 2026-02 | 2025-06-04 (stale) | 2026-03-01 |
| **Architecture** | LM-planner (Qwen3 0.6/1.7/4 B) + DiT (2/4 B) | CLAP + ASR + 12.5 Hz Codec + 4 B LLM | 5 Hz Music VAE + DiT w/ block flow matching | LLaMA2 7B AR Stage-1 + 1B Stage-2 + X-Codec | LeLM hybrid + diffusion decoder |
| **Params (largest)** | up to 8 B (4 B DiT + 4 B LM) | ~4 B + 2 B codec + 0.8 B ASR | ~1 B DiT + 170 M VAE-dec | 7 B + 1 B + upsampler | 4 B (v2-large) |
| **Audio rate** | 44.1 kHz stereo | 24 kHz neural codec | 44.1 kHz stereo | 16 kHz then upsampled | High-fi via diffusion |
| **Max length** | 4+ min | ≥1 min, scaling | **210 s (regression from v1)** | 5 min | 4:30 |
| **Vocals + Instruments** | ✅ Native | ✅ Native | ✅ Native, single stream | ✅ Native, dual-track AR | ✅ Dual-track |
| **Languages** | 50+ | 5+ (en/zh/ja/ko/es benchmarked) | Bilingual EN/ZH + JP/KR/ES marketing-only | EN, Mandarin, Cantonese, JP, KR | zh/en/es/ja + others |
| **VRAM (minimum)** | **<4 GB** with offload (turbo) | 6 GB 4-bit / 12 GB bf16 | 8 GB v1 with `--chunked` | 24 GB consumer / 80 GB single-pass | 22–28 GB |
| **VRAM (recommended)** | 12 GB+ offload, 24 GB optimal | 24 GB for 7B (unreleased) | 24 GB | 80 GB H100/H800 | 28 GB |
| **MPS / Apple Silicon** | **First-class, MLX + MPS, dedicated fork** | **MLX port, 2.1× PyTorch MPS** | Likely OK; clean deps; untested | ❌ Mandatory flash-attn | Community fork, pre-chorus bug |
| **MPS bench M-series (30 s clip)** | M3 Pro 25 s turbo / 1.5 min SFT | M2 Max 11.6 s for 50 frames | not published | not published | M1 Max 4–6 min for 2 min |
| **MPS bench M5 Max (projected)** | turbo ~10–15 s / SFT ~45–60 s | <real-time | low-minute range | n/a | ~2–3× M1 Max |
| **Speed (RTF on A100 / 4090)** | sub-2 s/song on A100 (v1.5) | RTF ≈ 1.0 | v2 RTF 0.213 (4090) → ~45 s for 210 s | 27 steps RTF 27.27× on A100 (v1, ~15 min/song) | RTF 0.82 (H20) |
| **Vocal naturalness vs Suno v4** | **4.4/5 vs 4.1/5** (blind 50-person test) | Vendor only, unverified | Authors admit clear gap vs v4.5 | Comparable vocal range; weaker mix | Vendor claim parity, unverified |
| **Lyric alignment (PER)** | Strong (lyric tags) | Vendor: 0.09 EN / 0.12 ZH (unit mismatch) | **0.13 (open-source SOTA)** | Strong from lyric tags | Vendor: 8.55 % |
| **Fine-tuning support** | ✅ LoRA, 8 songs/1h on 3090, **MPS-validated** | ❌ No public training code | ❌ "Coming soon" since Mar 2025 | ✅ LoRA (Megatron pipeline, CUDA 12.1+) | ❌ |
| **ComfyUI integration** | ✅ Native, official workflows | ✅ FL-HeartMuLa | ✅ billwuhao/ComfyUI_DiffRhythm | ✅ smthemex/ComfyUI_YuE | ✅ |
| **Replicate hosted** | ❌ no first-party | ❌ | ❌ | ✅ fofr/yue | ❌ |
| **Style/audio reference** | LoRA + lyric tags | Reference audio supported | Reference audio supported | ICL mode (style cloning) | Limited |
| **Stem separation** | Built into `fspecii/ace-step-ui` via Demucs | Modular Codec is reusable | ❌ single stream | ✅ AR dual-track is inherently separable | ✅ Dual-track output |
| **Continuation / extension** | Supported in workflows | Limited | Supported | ✅ explicit continuation mode | Supported |
| **Production deployments** | acestep.io, ace-step.app, fspecii/ace-step-ui, AMD-blessed | WaveSpeed AI, HeartMuse local app | Chutes serverless | Replicate fofr/yue, HF Spaces | WaveSpeed AI, HF Space |
| **Watermarking / content credentials** | None baked-in | None baked-in | None baked-in | None baked-in | None baked-in |
| **License gotchas** | None (MIT) | None (Apache 2.0) | Ethical disclaimer (non-binding) | Attribution required ("YuE by HKUST/M-A-P"), label "AI-generated" | **Commercial use prohibited** |
| **Independent benchmarks** | Yes: 50-person blind test, AMD vendor-validated | None located | Internal MOS only | Paper + community | None (Tencent only) |
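
The speed row mixes absolute times and real-time factors. As a minimal sketch, assuming RTF means generation time divided by audio duration (the convention that reproduces the table's "RTF 0.213 → ~45 s for 210 s" figure), the conversion is a single multiplication. The function name `generation_seconds` is illustrative and not taken from any of the projects.

```python
def generation_seconds(rtf: float, audio_seconds: float) -> float:
    """Wall-clock seconds needed to generate a clip.

    Assumes RTF = generation time / audio duration, so RTF < 1
    means faster than real time.
    """
    if rtf < 0 or audio_seconds < 0:
        raise ValueError("rtf and audio_seconds must be non-negative")
    return rtf * audio_seconds


if __name__ == "__main__":
    # DiffRhythm 2 on a 4090: RTF 0.213 for a full 210 s song.
    print(f"DiffRhythm 2: {generation_seconds(0.213, 210):.1f} s")
    # HeartMuLa at RTF of about 1.0 for a 60 s clip.
    print(f"HeartMuLa:    {generation_seconds(1.0, 60):.1f} s")
```

Vendors do not always state which convention they use, so any RTF quoted without a matching wall-clock time should be checked against the source before it is compared across columns.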
---
## Quality dimensions (qualitative)
| Dimension | Best (open source) | Notes |
|---|---|---|
| **Pop / EDM polish** | (none; Suno v4/v5 still wins) | All open models lag commercial. |
| **Folk / classical / jazz vocal naturalness** | **ACE-Step 1.5 XL** | Wins blind test vs Suno v4 in these genres. |
| **Lyric intelligibility (PER)** | **DiffRhythm 2** (0.13) | HeartMuLa claims lower but unit-incomparable. |
| **Musical macro-structure (verse/chorus/bridge over 3-5 min)** | **YuE** or **ACE-Step 1.5** (planner) | LM-planner models lead diffusion-only here. |
| **Stereo image, mix depth** | **DiffRhythm 2** (44.1 kHz stereo native) | YuE is mono-ish; ACE-Step is stereo but variable. |
| **Genre breadth** | **YuE** | Death-growl metal to Beijing opera to rap. |
| **Multilingual breadth** | **ACE-Step 1.5** | 50+ languages w/ lyric tags; YuE deep on 5 only. |
| **Code-switching (English ↔ Mandarin in one song)** | **YuE** | Explicit demos. |
| **Speed / cost per song** | **ACE-Step 1.5** | Sub-2 s/song on A100; <minute on M5 Max. |
| **Modular reusability of components** | **HeartMuLa** | Codec/ASR/CLAP separately exportable. |
---
## Cost model (rough)
| Path | Per-song cost | Latency | Best for |
|---|---|---|---|
| Self-host ACE-Step 1.5 on M5 Max | $0 marginal (electricity) | ~30-50 s | Dev, beta, low-volume |
| Self-host ACE-Step 1.5 on rented A100 80 GB | ~$0.0008 (sub-2 s × $1.50/hr) | <2 s | Production, paid SaaS |
| Replicate `fofr/yue` | ~$0.30-1.00 per song (estimated from 4090 cog runtime) | 5-15 min | Multilingual fallback, occasional |
| Self-host DiffRhythm 2 on 4090 | $0 marginal on owned 4090 | ~45 s | Speed tier, instrumentals |
| Replicate / WaveSpeed managed endpoints | varies | varies | Cold-start / spike capacity |
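
The rented-GPU row is plain arithmetic: generation time multiplied by the hourly rate, converted to seconds. The sketch below uses the table's own inputs (2 s per song, $1.50/hr) and comes to a little under a tenth of a cent per song. It assumes the GPU is billed only while generating; idle time, cold starts, and minimum billing increments are ignored, and the function name `cost_per_song_usd` is hypothetical.

```python
SECONDS_PER_HOUR = 3600.0


def cost_per_song_usd(gen_seconds: float, hourly_rate_usd: float) -> float:
    """Marginal GPU cost of one song, assuming per-second billing
    and no idle time (both are simplifying assumptions)."""
    return gen_seconds * hourly_rate_usd / SECONDS_PER_HOUR


def songs_per_dollar(gen_seconds: float, hourly_rate_usd: float) -> float:
    """Inverse view: how many songs one dollar of GPU time buys."""
    return 1.0 / cost_per_song_usd(gen_seconds, hourly_rate_usd)


if __name__ == "__main__":
    cost = cost_per_song_usd(gen_seconds=2.0, hourly_rate_usd=1.50)
    print(f"cost per song:    ${cost:.6f}")
    print(f"songs per dollar: {songs_per_dollar(2.0, 1.50):.0f}")
```

In practice utilization dominates: a GPU that sits idle most of the hour costs the same $1.50, so the effective per-song figure scales inversely with how busy the instance is kept.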
---
## License risk matrix
| License | Commercial SaaS | Output ownership | Risk |
|---|---|---|---|
| MIT (ACE-Step 1.5) | ✅ | User owns | Lowest |
| Apache 2.0 (ACE-Step v1, HeartMuLa, DiffRhythm v1/v2, YuE) | ✅ with attribution | User owns | Low |
| Tencent custom (SongGeneration, SongBloom) | ❌ **prohibited** | n/a | **Blocks SaaS** |
| Suno API (closed-source baseline) | $ paid tier | platform terms | Medium |
---
## Hardware sizing on M5 Max (128 GB unified memory)
| Model | Fits? | Headroom | Notes |
|---|---|---|---|
| ACE-Step 1.5 XL (4 B DiT + 4 B planner) | ✅ huge | ~120 GB free | Overkill; LoRA training viable in-RAM |
| HeartMuLa 4B + 2 B codec + 0.8 B ASR | ✅ huge | ~120 GB free | 7 B variant when released will also fit |
| DiffRhythm 2 (~1 B + 170 M VAE-dec) | ✅ trivial | ~125 GB free | Tiny by 2026 standards |
| YuE 7B Stage-1 + 1B Stage-2 + upsampler | ✅ but blocked | n/a | Memory fine, **flash-attn dep blocks MPS** |
| SongGeneration 2-large (4 B + diffusion) | ✅ comfortable | ~100 GB free | Community fork bug aside, fits |

**Conclusion:** 128 GB of unified memory eliminates memory pressure for every model in this list. The binding constraint is software (MPS kernel compatibility, flash-attn substitution), not hardware.
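
A back-of-envelope way to sanity-check the "Fits?" column is to multiply parameter count by bytes per parameter. The sketch below is a weights-only estimate under stated assumptions: parameter totals are summed from the headline matrix, bf16 is assumed as the default precision, and activations, KV cache, and OS reservations are ignored. Components whose size the source does not give (the YuE upsampler, the SongGeneration diffusion decoder) are left out, so the result need not match the table's rounded headroom column, whose assumptions the source does not state.

```python
# Bytes per parameter for common weight precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

# Parameter totals in billions, summed from the headline matrix.
# Components with no stated size are excluded (an assumption).
PARAMS_BILLION = {
    "ACE-Step 1.5 XL": 4.0 + 4.0,        # DiT + LM planner
    "HeartMuLa 4B": 4.0 + 2.0 + 0.8,     # LLM + codec + ASR
    "DiffRhythm 2": 1.0 + 0.17,          # DiT + VAE decoder
    "YuE 7B": 7.0 + 1.0,                 # Stage-1 + Stage-2
    "SongGeneration 2-large": 4.0,       # decoder size not stated
}


def weights_gb(params_billion: float, dtype: str = "bf16") -> float:
    """Weights-only footprint in GB (1e9 params * bytes / 1e9)."""
    return params_billion * BYTES_PER_PARAM[dtype]


def headroom_gb(params_billion: float, total_gb: float = 128.0,
                dtype: str = "bf16") -> float:
    """Unified memory left after loading weights; ignores activations."""
    return total_gb - weights_gb(params_billion, dtype)


if __name__ == "__main__":
    for name, params in PARAMS_BILLION.items():
        print(f"{name:24s} weights {weights_gb(params):5.1f} GB, "
              f"headroom {headroom_gb(params):6.1f} GB")
```

Even at fp32 the largest entry stays far below 128 GB, which is consistent with the conclusion that memory is not the limiting factor on this machine.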