Spaces:

techfreakworm
/

ACE-Music-Studio

Running on Zero

App Files Files Community

ACE-Music-Studio / research /02_diffrhythm.md

techfreakworm

docs: track spec + mockups + model research

9071450 unverified 2 days ago

preview code

raw

history blame contribute delete

15.7 kB

DiffRhythm and DiffRhythm 2 — Deep Technical Review

Compiled 2026-05-18. All claims cited; speculation flagged inline.

1. Overview

DiffRhythm is the first open-source latent-diffusion full-song generator — vocals + accompaniment, end-to-end, from lyrics and a style prompt — built by the Audio, Speech and Language Processing (ASLP) Lab at Northwestern Polytechnical University (NWPU) in Xi'an, China, with later contributions from Xiaomi Research (arxiv.org/abs/2503.01183, github.com/ASLP-lab/DiffRhythm). DiffRhythm v1 dropped on arXiv 3 Mar 2025; the full 4m45s variant followed on 15 Mar 2025, and an iterative v1.2 fixed repetition and audio-quality issues mid-2025 (HF v1.2 commit). DiffRhythm 2 appeared on arXiv 27 Oct 2025 (v3 revised 3 Feb 2026) under arxiv.org/abs/2510.22950, and was open-sourced at github.com/ASLP-lab/DiffRhythm2 (forked from xiaomi-research/diffrhythm2) on 30 Oct 2025, with HuggingFace weights at huggingface.co/ASLP-lab/DiffRhythm2. The series is the leading diffusion-side alternative to the LLM-style approach taken by Suno, YuE, and SongBloom.

2. Architecture

DiffRhythm v1 is a non-autoregressive (NAR) latent diffusion model with two pieces: a music VAE that compresses raw 44.1 kHz stereo audio into a latent grid, and a DiT (Diffusion Transformer) that denoises that grid conditioned on lyrics + style (nzqian.github.io/DiffRhythm). The DiT uses 16 LLaMA-style decoder layers, 2048 hidden dim, 32 heads × 64 dim, totaling ~1.1B parameters (arxiv.org/html/2503.01183). Vocals and accompaniment are produced jointly in a single latent stream — not dual-track — which is what makes it "embarrassingly simple" vs. cascaded systems. Lyric conditioning is sentence-level via LRC (timestamped) phonemes, with the diffusion model expected to align internally; style is conditioned either via a reference audio embedding or a text prompt. Inference uses a 32-step Euler ODE with CFG scale 4 and 20% dropout on both conditions during training to enable CFG (diffrhythm.us).

DiffRhythm 2 replaces the pure-NAR DiT with a semi-autoregressive block flow-matching transformer: the latent sequence is sliced into blocks of 10 frames (2s at 5 Hz), and "each block is generated with flow matching, while the dependency across blocks is handled autoregressively" (alphaxiv.org/overview/2510.22950v3 — quoted via search snippet). This is the key innovation: it preserves NAR-style fast within-block parallelism while letting the model attend to prior blocks for structural coherence (verse → chorus → verse) and lyric alignment without any external aligner. The audio codec is a new music VAE at 5 Hz frame rate (vs. the much higher rates of EnCodec/DAC) with a 170M-param decoder, enabling 210s of latent context to fit on a single GPU (arxiv abs). The full DiT is ~1B parameters. Two new training objectives appear: Stochastic Block Representation Alignment (REPA) loss to align hidden states of clean vs. noisy blocks (improves musicality/structure), and Cross-Pair Preference Optimization — an RLHF variant that groups the four preference dimensions (musicality, style similarity, lyric alignment, audio quality) into pairs to dodge the merging-induced regression that plain DPO causes. Max song length: 210 s in v2 vs. 4m45s (~285 s) in v1-full (github.com/ASLP-lab/DiffRhythm).

3. Variants and sizes

Checkpoint	Duration	DiT params	Notes	Source
`DiffRhythm-base`	1m35s	~1.1B	Original Mar 2025	HF
`DiffRhythm-full`	4m45s	~1.1B	Released 15 Mar 2025	HF
`DiffRhythm-vae`	—	—	Shared audio VAE	HF
`DiffRhythm-1_2-base`	1m35s	~1.1B	v1.2 quality fix	GH README
`DiffRhythm-1_2-full`	4m45s	~1.1B	v1.2, text-style + instrumental	HF
`DiffRhythm+` (paper)	full	~1.1B	Adds DPO; not headlined as separate checkpoint	arxiv 2507.12890
`DiffRhythm2`	210 s	~1B DiT + 170M VAE-dec	Block flow matching	HF

(Speculation: I did not find an explicit param count posted for v2's DiT; the ~1B figure comes from a paper-extraction snippet and aligns with v1's ~1.1B body. Treat as approximate.)

4. License

Apache 2.0 for both code and DiT weights, declared on the v1 GitHub README and reaffirmed on the v2 README (github.com/ASLP-lab/DiffRhythm, github.com/ASLP-lab/DiffRhythm2). Commercial use is permitted with attribution. The v2 model card adds a non-binding ethical disclaimer asking users to verify originality, disclose AI involvement, and respect stylistic copyright — this is a notice, not an enforceable license restriction (HF model card).

5. Languages supported

Training is heavily bilingual (Mandarin + English) — v2's dataset is reported as Chinese : English : Instrumental ≈ 4 : 5 : 1 (alphaXiv extract). The v1 README and several mirrors claim cross-lingual capability for Japanese, Korean, Spanish (diffrhythm.us, diffrhythm.ai) — but these are demo-site marketing claims, not benchmarked in the paper. Verdict: production-safe for EN and ZH; treat JP/KR/ES as best-effort. Phoneme front-end is espeak-ng, which itself supports 100+ languages (HF model card).

6. Quality assessment

Objective (v2 paper, lower=better for PER, higher=better for Mulan-T):

Metric	DiffRhythm 2	DiffRhythm+	ACE-Step	LeVo
PER (lyric alignment) ↓	0.13	0.15	0.23	0.19
Mulan-T (style match) ↑	0.40	0.25	0.28	0.35
RTF (speed) ↓	0.213	0.153	0.127	1.225

So v2 has best-in-open-source lyric alignment and style match, slightly slower than v1+/ACE-Step but ~6× faster than LeVo (arxiv 2510.22950).

Subjective: v2 is the strongest open model by MOS in the paper's own user study, but the authors explicitly state "in aspects such as musicality, it still shows a clear gap compared to commercial systems like SUNO V4.5" (arxiv 2510.22950). The block flow-matching does close the structural-coherence gap that the original Hacker News thread criticized v1 for — multiple HN commenters complained "there's no identifiable chorus in any of the demo songs" and rhythm was unstable (news.ycombinator.com/item?id=43255467). v2 demos show real verse/chorus structure (aslp-lab.github.io/DiffRhythm2.github.io). Specific Reddit reception threads in r/LocalLLaMA/r/StableDiffusion were not surfaced by search (low signal).

7. Inference performance

v1-full: ~10 s for a 4m45s song on a single RTX 4090 (claimed in paper abstract, arxiv 2503.01183) — 32 ODE steps. Real-world ComfyUI users report ~62 s for 4 min on consumer GPUs (comfyui.org).
VRAM: DiffRhythm-base needs ≥ 8 GB with --chunked; full needs 24 GB for headroom (chutes.ai docs).
v2: RTF 0.213 on RTX 4090 → ~45 s for a 210 s song (arxiv 2510.22950).
Apple Silicon / MPS: The v1 README claims Apple Silicon is "supported as of March 2025" but the GitHub issues list does not surface dedicated MPS benchmarks, and the Pinokio launcher (github.com/pinokiofactory/diffrhythm) does not advertise macOS in its description. No published M3/M4/M5 numbers exist. Speculation: on the user's M5 Max with 128 GB unified memory, v1-full should run via PYTORCH_ENABLE_MPS_FALLBACK=1, likely 3–5× slower than 4090 — needs hands-on validation. v2 is newer and has not been tested on MPS publicly.

8. DiffRhythm 2 specifics

What changed from v1 → v2 (arxiv 2510.22950, alphaxiv overview):

Architecture shift: pure NAR DiT → semi-AR block flow-matching (2 s blocks).
New 5 Hz music VAE (vs. v1's higher-rate codec) — enables 210 s context within budget.
Stochastic Block REPA loss: aligns clean vs. noisy hidden states → better musicality + structure.
Cross-Pair Preference Optimization: four-dim RLHF without the model-merging regression that plain DPO causes.
Dataset scaling: ~1.4 M songs / ~70,000 hours, with a 20 k-hour high-quality subset for SFT and 40 k preference pairs for DPO — a step-change from v1's undisclosed-but-smaller corpus.
Lyric alignment without external constraints: v1 needed LRC timestamps; v2 learns alignment end-to-end via the AR block dependency.
Quality numbers (paper): PER 0.15 → 0.13, Mulan-T 0.25 → 0.40 vs. DiffRhythm+ — i.e. lyric-error reduced ~13 % and style-match nearly doubled.

9. Repo health

DiffRhythm v1: ~2.2–2.3 k stars, 268 forks, active through 2025, last major release Mar 2025 (github.com/ASLP-lab/DiffRhythm).
DiffRhythm 2: 157 stars / 11 forks / 27 commits as of late Oct 2025 — young repo, recently pushed (github.com/ASLP-lab/DiffRhythm2).
Training/fine-tuning scripts: "Coming soon" is the status on v1; community has filed Issue #46 asking for fine-tuning docs. v2 ships inference only in the public repo as of writing.

10. Real-world adoption

ComfyUI: billwuhao/ComfyUI_DiffRhythm — 153 stars, supports v1.2 + full, includes bilingual subtitle gen (runcomfy.com node).
Pinokio: pinokiofactory/diffrhythm — 19 stars, 69 commits, one-click installer.
Chutes.ai: Public serverless endpoint for DiffRhythm-full (chutes.ai/docs/examples/music-generation).
Replicate: No first-party DiffRhythm 2 model found in search — gap in the ecosystem (speculation).
Multiple unofficial web frontends: diffrhythm.com, diffrhythm.us, diffrhythm.ai, diffrhythmai.com — quality and origin unverified, likely wrappers over the HF Space.

11. Fine-tuning

The official answer is none yet. The v1 repo's training code is listed as "Coming soon," and v2 only ships inference. There is no LoRA support, no published fine-tuning recipe, and no transformers/diffusers integration as of May 2026. Community workaround would require reverse-engineering the DiT class — non-trivial for a 1 B-param flow-matching model. For the user's Suno-clone platform, fine-tuning DiffRhythm today means forking + writing your own training loop. This is the single biggest practical weakness.

12. Pros and cons

Pros

Permissive Apache 2.0 for code + weights — clean commercial path.
Fastest open full-song model (~10 s for 4 min on a 4090; v2's block-FM is competitive even with AR-like coherence).
v2 has state-of-the-art lyric alignment (PER 0.13) in open source.
Lightweight: 8 GB VRAM possible with chunking — runs on consumer GPUs.
Strong ecosystem: ComfyUI nodes, Pinokio installer, Chutes serverless.
v2's block flow-matching meaningfully closes the structural-coherence gap that doomed v1 demos on HN.

Cons

Still a clear musicality gap vs. Suno v4.5 (authors admit it; arxiv 2510.22950).
No fine-tuning / LoRA path — training code unreleased.
v2's max length is 210 s (3m30s), shorter than v1-full's 4m45s — a regression for radio-length pop.
Multilingual claims (JP/KR/ES) are unbenchmarked; only EN/ZH have paper-backed quality.
No published MPS benchmarks for Apple Silicon; v2 untested on Mac.
Demo-site proliferation (diffrhythm.us, etc.) muddies the brand — confusing for product positioning.
License disclaimer adds soft ethical obligations re. copyright that legal review may flag.

13. Verdict for the user's platform

For a Suno-style platform on an M5 Max (128 GB unified, MPS), DiffRhythm 2 is the best diffusion-side open option in May 2026, but it should be paired with an AR-style backup (YuE / SongBloom / LeVo) covering its weak points.

Where DiffRhythm 2 wins:

Fast, cheap inference per song — viable for high-throughput web generation.
Best-in-open lyric intelligibility — critical for a karaoke / lyrics-first UX.
Stereo 44.1 kHz output out of the box.
Apache-2.0 + commercial freedom.

Where it underperforms:

Pop musicality, hook quality, vocal timbre are still below Suno v4.5 — premium-tier output is not there.
No fine-tuning means you cannot specialize on a target sound or your platform's curated catalog without doing R&D.
210 s ceiling on v2 limits "full album track" formats — you'd fall back to v1-full (4m45s) at a quality cost.
MPS path is unproven — the user should plan a same-week feasibility test on the M5 Max before committing v2 to the inference layer; CUDA cloud (Chutes / a 4090 server) is the safer near-term backend.

Recommended posture: ship v2 as the default fast generator behind a feature flag, keep v1.2-full for >3.5 min songs, evaluate Suno / YuE / SongBloom as quality-tier alternatives, and track the v2 repo for an eventual training-code release that would unlock fine-tuning on your platform's data.