# Open-Source Song Generation Models: Side-by-Side Comparison

*Compiled 2026-05-18 for M5 Max / 128 GB unified memory target.*

---

## Headline matrix

| Property | **ACE-Step 1.5 XL** | **HeartMuLa 4B** | **DiffRhythm 2** | **YuE 7B** | **SongGeneration 2** |
|---|---|---|---|---|---|
| **Builder** | ACE Studio × StepFun | HeartMuLa | NWPU ASLP-lab + Xiaomi | M-A-P / HKUST | Tencent AI Lab |
| **Release** | 2026-01-28 | 2026-01-19 | 2025-10-27 → 2026-02-03 (v3) | 2025-01-26 | 2026-03-01 |
| **License** | **MIT** | **Apache 2.0** | **Apache 2.0** | **Apache 2.0** | **Custom NON-commercial** |
| **Repo stars** | 10.4 k | 3.6 k | ~2.3 k (v1) + 0.16 k (v2) | 6.2 k | 1.6 k |
| **Last major commit** | v0.1.7 (2026-04-24) | 2026-02 | 2026-02 | 2025-06-04 (stale) | 2026-03-01 |
| **Architecture** | LM-planner (Qwen3 0.6/1.7/4 B) + DiT (2/4 B) | CLAP + ASR + 12.5 Hz Codec + 4 B LLM | 5 Hz Music VAE + DiT w/ block flow matching | LLaMA2 7B AR Stage-1 + 1B Stage-2 + X-Codec | LeLM hybrid + diffusion decoder |
| **Params (largest)** | up to 8 B (4 B DiT + 4 B LM) | ~4 B + 2 B codec + 0.8 B ASR | ~1 B DiT + 170 M VAE-dec | 7 B + 1 B + upsampler | 4 B (v2-large) |
| **Audio rate** | 44.1 kHz stereo | 24 kHz neural codec | 44.1 kHz stereo | 16 kHz then upsampled | High-fi via diffusion |
| **Max length** | 4+ min | ≥1 min, scaling | **210 s (regression from v1)** | 5 min | 4:30 |
| **Vocals + Instruments** | ✅ Native | ✅ Native | ✅ Native, single stream | ✅ Native, dual-track AR | ✅ Dual-track |
| **Languages** | 50+ | 5+ (en/zh/ja/ko/es benchmarked) | Bilingual EN/ZH + JP/KR/ES marketing-only | EN, Mandarin, Cantonese, JP, KR | zh/en/es/ja + others |
| **VRAM (minimum)** | **<4 GB** with offload (turbo) | 6 GB 4-bit / 12 GB bf16 | 8 GB v1 with `--chunked` | 24 GB consumer / 80 GB single-pass | 22–28 GB |
| **VRAM (recommended)** | 12 GB+ offload, 24 GB optimal | 24 GB for 7B (unreleased) | 24 GB | 80 GB H100/H800 | 28 GB |
| **MPS / Apple Silicon** | **First-class, MLX + MPS, dedicated fork** | **MLX port, 2.1× PyTorch MPS** | Likely OK; clean deps; untested | ❌ Mandatory flash-attn | Community fork, pre-chorus bug |
| **MPS bench M-series (30 s clip)** | M3 Pro 25 s turbo / 1.5 min SFT | M2 Max 11.6 s for 50 frames | not published | not published | M1 Max 4–6 min for 2 min |
| **MPS bench M5 Max (projected)** | turbo ~10–15 s / SFT ~45–60 s |