| Builder | ACE Studio × StepFun | HeartMuLa | NWPU ASLP-lab + Xiaomi | M-A-P / HKUST | Tencent AI Lab |
| Release | 2026-01-28 | 2026-01-19 | 2025-10-27 → 2026-02-03 (v3) | 2025-01-26 | 2026-03-01 |
| License | MIT | Apache 2.0 | Apache 2.0 | Apache 2.0 | Custom non-commercial |
| Repo stars | 10.4 k | 3.6 k | ~2.3 k (v1) + 0.16 k (v2) | 6.2 k | 1.6 k |
| Last major commit | v0.1.7 (2026-04-24) | 2026-02 | 2026-02 | 2025-06-04 (stale) | 2026-03-01 |
| Architecture | LM-planner (Qwen3 0.6/1.7/4 B) + DiT (2/4 B) | CLAP + ASR + 12.5 Hz Codec + 4 B LLM | 5 Hz Music VAE + DiT w/ block flow matching | LLaMA2 7B AR Stage-1 + 1B Stage-2 + X-Codec | LeLM hybrid + diffusion decoder |
| Params (largest) | up to 8 B (4 B DiT + 4 B LM) | ~4 B + 2 B codec + 0.8 B ASR | ~1 B DiT + 170 M VAE-dec | 7 B + 1 B + upsampler | 4 B (v2-large) |
| Audio rate | 44.1 kHz stereo | 24 kHz neural codec | 44.1 kHz stereo | 16 kHz then upsampled | High-fi via diffusion |
| Max length | 4+ min | ≥1 min, scaling | 210 s (regression from v1) | 5 min | 4:30 |
| Vocals + Instruments | ✅ Native | ✅ Native | ✅ Native, single stream | ✅ Native, dual-track AR | ✅ Dual-track |
| Languages | 50+ | 5+ (en/zh/ja/ko/es benchmarked) | Bilingual EN/ZH + JP/KR/ES marketing-only | EN, Mandarin, Cantonese, JP, KR | zh/en/es/ja + others |
| VRAM (minimum) | <4 GB with offload (turbo) | 6 GB 4-bit / 12 GB bf16 | 8 GB v1 with --chunked | 24 GB consumer / 80 GB single-pass | 22–28 GB |
| VRAM (recommended) | 12 GB+ offload, 24 GB optimal | 24 GB for 7B (unreleased) | 24 GB | 80 GB H100/H800 | 28 GB |
| MPS / Apple Silicon | First-class, MLX + MPS, dedicated fork | MLX port, 2.1× PyTorch MPS | Likely OK; clean deps; untested | ❌ Mandatory flash-attn | Community fork, pre-chorus bug |
| MPS bench M-series (30 s clip) | M3 Pro 25 s turbo / 1.5 min SFT | M2 Max 11.6 s for 50 frames | not published | not published | M1 Max 4–6 min for 2 min |
| MPS bench M5 Max (projected) | turbo ~10–15 s / SFT ~45–60 s | <real-time | low-minute range | n/a | ~2–3× M1 Max |
| Speed (RTF on A100 / 4090) | sub-2 s/song on A100 (v1.5) | RTF ≈ 1.0 | v2 RTF 0.213 (4090) → ~45 s for 210 s | 27 steps RTF 27.27× on A100 (v1, ~15 min/song) | RTF 0.82 (H20) |
| Vocal naturalness vs Suno v4 | 4.4/5 vs 4.1/5 (blind 50-person test) | Vendor only, unverified | Authors admit clear gap vs v4.5 | Comparable vocal range; weaker mix | Vendor claims parity, unverified |
| Lyric alignment (PER) | Strong (lyric tags) | Vendor: 0.09 EN / 0.12 ZH (unit mismatch) | 0.13 (open-source SOTA) | Strong from lyric tags | Vendor: 8.55 % |
| Fine-tuning support | ✅ LoRA, 8 songs/1h on 3090, MPS-validated | ❌ No public training code | ❌ "Coming soon" since Mar 2025 | ✅ LoRA (Megatron pipeline, CUDA 12.1+) | ❌ |
| ComfyUI integration | ✅ Native, official workflows | ✅ FL-HeartMuLa | ✅ billwuhao/ComfyUI_DiffRhythm | ✅ smthemex/ComfyUI_YuE | ✅ |
| Replicate hosted | ❌ no first-party | ❌ | ❌ | ✅ fofr/yue | ❌ |
| Style/audio reference | LoRA + lyric tags | Reference audio supported | Reference audio supported | ICL mode (style cloning) | Limited |
| Stem separation | Built into fspecii/ace-step-ui via Demucs | Modular Codec is reusable | ❌ single stream | ✅ AR dual-track is inherently separable | ✅ Dual-track output |
| Continuation / extension | Supported in workflows | Limited | Supported | ✅ explicit continuation mode | Supported |
| Production deployments | acestep.io, ace-step.app, fspecii/ace-step-ui, AMD-blessed | WaveSpeed AI, HeartMuse local app | Chutes serverless | Replicate fofr/yue, HF Spaces | WaveSpeed AI, HF Space |
| Watermarking / content credentials | None baked-in | None baked-in | None baked-in | None baked-in | None baked-in |
| License gotchas | None (MIT) | None (Apache 2.0) | Ethical disclaimer (non-binding) | Attribution required ("YuE by HKUST/M-A-P"), label "AI-generated" | Commercial use prohibited |
| Independent benchmarks | Yes: 50-person blind test, AMD vendor-validated | None located | Internal MOS only | Paper + community | None (Tencent only) |
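
The Speed row mixes absolute times and real-time factors, so a small conversion helps when comparing cells. The sketch below assumes the common convention that RTF is generation time divided by audio duration (values below 1 are faster than real time); the function name `generation_seconds` is illustrative and not taken from any of the listed projects. It reproduces the "~45 s for 210 s" figure from the RTF 0.213 cell.

```python
def generation_seconds(rtf: float, audio_seconds: float) -> float:
    """Estimate wall-clock generation time from a real-time factor.

    Assumes RTF = generation time / audio duration, so RTF < 1 means
    the model renders faster than playback.
    """
    if rtf < 0 or audio_seconds < 0:
        raise ValueError("rtf and audio_seconds must be non-negative")
    return rtf * audio_seconds


# Figures taken from the Speed row: RTF 0.213 on a 4090 for a 210 s song.
estimate = generation_seconds(0.213, 210)
print(f"{estimate:.1f} s")  # 44.7 s, i.e. the table's "~45 s"
```

Cells that report a multiplier such as "27.27×" may follow a different convention, so check each project's definition before converting.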