
DiffRhythm and DiffRhythm 2: Deep Technical Review

Compiled 2026-05-18. All claims cited; speculation flagged inline.

1. Overview

DiffRhythm is the first open-source latent-diffusion full-song generator (vocals + accompaniment, end-to-end, from lyrics and a style prompt), built by the Audio, Speech and Language Processing (ASLP) Lab at Northwestern Polytechnical University (NWPU) in Xi'an, China, with later contributions from Xiaomi Research (arxiv.org/abs/2503.01183, github.com/ASLP-lab/DiffRhythm). DiffRhythm v1 appeared on arXiv on 3 Mar 2025; the full 4m45s variant followed on 15 Mar 2025, and an iterative v1.2 fixed repetition and audio-quality issues mid-2025 (HF v1.2 commit). DiffRhythm 2 appeared on arXiv on 27 Oct 2025 (v3 revised 3 Feb 2026) under arxiv.org/abs/2510.22950, and was open-sourced at github.com/ASLP-lab/DiffRhythm2 (forked from xiaomi-research/diffrhythm2) on 30 Oct 2025, with HuggingFace weights at huggingface.co/ASLP-lab/DiffRhythm2. The series is the leading diffusion-side alternative to the LLM-style approach taken by Suno, YuE, and SongBloom.

2. Architecture

DiffRhythm v1 is a non-autoregressive (NAR) latent diffusion model with two pieces: a music VAE that compresses raw 44.1 kHz stereo audio into a latent grid, and a DiT (Diffusion Transformer) that denoises that grid conditioned on lyrics + style (nzqian.github.io/DiffRhythm). The DiT uses 16 LLaMA-style decoder layers, 2048 hidden dim, 32 heads × 64 dim, totaling ~1.1B parameters (arxiv.org/html/2503.01183). Vocals and accompaniment are produced jointly in a single latent stream, not dual-track, which is what makes it "embarrassingly simple" vs. cascaded systems. Lyric conditioning is sentence-level via LRC (timestamped) phonemes, with the diffusion model expected to align internally; style is conditioned either via a reference audio embedding or a text prompt. Inference uses a 32-step Euler ODE solver with CFG scale 4; to enable CFG, both conditions are dropped with 20% probability during training (diffrhythm.us).
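
The sampling recipe above (32 Euler steps, CFG scale 4) can be sketched in plain Python. This is a toy under stated assumptions, not DiffRhythm's code: `toy_velocity` is a hypothetical stand-in for the DiT, the guidance rule `v_uncond + scale * (v_cond - v_uncond)` is the common CFG form and is assumed here rather than taken from the paper, and time is assumed to run from 0 (noise) to 1 (data).

```python
# Toy sketch of classifier-free guidance (CFG) with a fixed-step Euler ODE solver.
# Assumptions (not from the DiffRhythm source): toy_velocity replaces the DiT,
# and guidance uses the common form v_uncond + scale * (v_cond - v_uncond).

def toy_velocity(x, t, target):
    """Hypothetical velocity field that pulls each latent value toward target."""
    return [(target - xi) / (1.0 - t) for xi in x]


def cfg_euler_sample(x0, cond_target, uncond_target, steps=32, cfg_scale=4.0):
    """Integrate dx/dt = v(x, t) from t=0 toward t=1 with `steps` Euler updates."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # largest t is (steps - 1) / steps, so 1 - t never hits zero
        v_cond = toy_velocity(x, t, cond_target)
        v_uncond = toy_velocity(x, t, uncond_target)
        # Guided velocity: extrapolate from the unconditional toward the conditional.
        v = [vu + cfg_scale * (vc - vu) for vc, vu in zip(v_cond, v_uncond)]
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    latent = cfg_euler_sample([0.3, -1.2, 0.7], cond_target=1.0, uncond_target=0.0)
    print(latent)
```

With these toy fields every value ends near `uncond + 4 × (cond − uncond)`, which shows that CFG extrapolates past the conditional prediction rather than interpolating between the two.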

DiffRhythm 2 replaces the pure-NAR DiT with a semi-autoregressive block flow-matching transformer: the latent sequence is sliced into blocks of 10 frames (2 s at 5 Hz), and "each block is generated with flow matching, while the dependency across blocks is handled autoregressively" (alphaxiv.org/overview/2510.22950v3, quoted via search snippet). This is the key innovation: it preserves NAR-style fast within-block parallelism while letting the model attend to prior blocks for structural coherence (verse → chorus → verse) and lyric alignment without any external aligner. The audio codec is a new music VAE at 5 Hz frame rate (vs. the much higher rates of EnCodec/DAC) with a 170M-param decoder, enabling 210 s of latent context to fit on a single GPU (arxiv abs). The full DiT is ~1B parameters. Two new training objectives appear: Stochastic Block Representation Alignment (REPA) loss, which aligns hidden states of clean vs. noisy blocks (improves musicality/structure), and Cross-Pair Preference Optimization, an RLHF variant that groups the four preference dimensions (musicality, style similarity, lyric alignment, audio quality) into pairs to dodge the merging-induced regression that plain DPO causes. Max song length: 210 s in v2 vs. 4m45s (285 s) in v1-full (github.com/ASLP-lab/DiffRhythm).
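
The block layout can be made concrete with a short sketch. It only illustrates the control flow described above (parallel within a block, sequential across blocks); `generate_block` is a hypothetical placeholder for the flow-matching sampler and none of these names come from the DiffRhythm 2 repository.

```python
# Sketch of semi-autoregressive block generation: blocks are produced one after
# another, each with access to all earlier blocks. generate_block is a
# hypothetical placeholder, not DiffRhythm 2 code.

FRAME_RATE_HZ = 5   # latent frames per second (from the text above)
BLOCK_FRAMES = 10   # frames per block, i.e. 2 s of audio


def num_blocks(duration_s, frame_rate_hz=FRAME_RATE_HZ, block_frames=BLOCK_FRAMES):
    """Blocks needed to cover duration_s whole seconds (last block may be partial)."""
    total_frames = duration_s * frame_rate_hz
    return -(-total_frames // block_frames)  # ceiling division on integers


def generate_block(index, history):
    """Placeholder: a real model would run flow matching conditioned on history."""
    return [float(index)] * BLOCK_FRAMES


def generate_song(duration_s):
    history = []  # previously generated blocks, visible to later blocks
    for i in range(num_blocks(duration_s)):
        block = generate_block(i, history)  # all frames of a block at once
        history.append(block)               # blocks are emitted in order
    return history


if __name__ == "__main__":
    print(num_blocks(210), "blocks for a 210 s song")
```

For the 210 s maximum this works out to 1,050 latent frames in 105 blocks.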

3. Variants and sizes

| Checkpoint | Duration | DiT params | Notes | Source |
|---|---|---|---|---|
| DiffRhythm-base | 1m35s | ~1.1B | Original Mar 2025 | HF |
| DiffRhythm-full | 4m45s | ~1.1B | Released 15 Mar 2025 | HF |
| DiffRhythm-vae | n/a | n/a | Shared audio VAE | HF |
| DiffRhythm-1_2-base | 1m35s | ~1.1B | v1.2 quality fix | GH README |
| DiffRhythm-1_2-full | 4m45s | ~1.1B | v1.2, text-style + instrumental | HF |
| DiffRhythm+ (paper) | full | ~1.1B | Adds DPO; not headlined as separate checkpoint | arxiv 2507.12890 |
| DiffRhythm2 | 210 s | ~1B DiT + 170M VAE-dec | Block flow matching | HF |

(Speculation: I did not find an explicit param count posted for v2's DiT; the ~1B figure comes from a paper-extraction snippet and aligns with v1's ~1.1B body. Treat as approximate.)

4. License

Apache 2.0 for both code and DiT weights, declared on the v1 GitHub README and reaffirmed on the v2 README (github.com/ASLP-lab/DiffRhythm, github.com/ASLP-lab/DiffRhythm2). Commercial use is permitted with attribution. The v2 model card adds a non-binding ethical disclaimer asking users to verify originality, disclose AI involvement, and respect stylistic copyright; this is a notice, not an enforceable license restriction (HF model card).

5. Languages supported

Training is heavily bilingual (Mandarin + English): v2's dataset is reported as Chinese : English : Instrumental ≈ 4 : 5 : 1 (alphaXiv extract). The v1 README and several mirrors claim cross-lingual capability for Japanese, Korean, Spanish (diffrhythm.us, diffrhythm.ai), but these are demo-site marketing claims, not benchmarked in the paper. Verdict: production-safe for EN and ZH; treat JP/KR/ES as best-effort. Phoneme front-end is espeak-ng, which itself supports 100+ languages (HF model card).

6. Quality assessment

Objective (v2 paper, lower=better for PER, higher=better for Mulan-T):

| Metric | DiffRhythm 2 | DiffRhythm+ | ACE-Step | LeVo |
|---|---|---|---|---|
| PER (lyric alignment) ↓ | 0.13 | 0.15 | 0.23 | 0.19 |
| Mulan-T (style match) ↑ | 0.40 | 0.25 | 0.28 | 0.35 |
| RTF (speed) ↓ | 0.213 | 0.153 | 0.127 | 1.225 |

So v2 has the best lyric alignment and style match among the open models compared. It is slower than DiffRhythm+ and ACE-Step (about 1.4× and 1.7× their RTF) but ~6× faster than LeVo (arxiv 2510.22950).
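
The speed comparison can be reproduced from the RTF row. The sketch below assumes the usual definition of real-time factor (generation time divided by audio duration); the helper names are illustrative only.

```python
# Real-time factor (RTF), assumed here to mean generation time / audio duration,
# so lower is faster. Values are the RTF figures reported in the table above.
RTF = {"DiffRhythm 2": 0.213, "DiffRhythm+": 0.153, "ACE-Step": 0.127, "LeVo": 1.225}


def generation_seconds(rtf, audio_seconds):
    """Wall-clock seconds needed to generate audio_seconds of audio."""
    return rtf * audio_seconds


def speedup(slower, faster):
    """How many times faster `faster` is than `slower`, by RTF ratio."""
    return RTF[slower] / RTF[faster]


if __name__ == "__main__":
    for name, rtf in RTF.items():
        print(f"{name}: {generation_seconds(rtf, 210):.1f} s for a 210 s song")
    print(f"DiffRhythm 2 vs LeVo: {speedup('LeVo', 'DiffRhythm 2'):.2f}x faster")
```

The DiffRhythm 2 entry gives roughly 45 s for a 210 s song, and the LeVo ratio comes out a little under 6.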

Subjective: v2 is the strongest open model by MOS in the paper's own user study, but the authors explicitly state "in aspects such as musicality, it still shows a clear gap compared to commercial systems like SUNO V4.5" (arxiv 2510.22950). The block flow-matching does close the structural-coherence gap that the original Hacker News thread criticized v1 for: multiple HN commenters complained "there's no identifiable chorus in any of the demo songs" and that rhythm was unstable (news.ycombinator.com/item?id=43255467). v2 demos show real verse/chorus structure (aslp-lab.github.io/DiffRhythm2.github.io). Specific Reddit reception threads in r/LocalLLaMA or r/StableDiffusion were not surfaced by search (low signal).

7. Inference performance

  • v1-full: ~10 s for a 4m45s song on a single RTX 4090 (claimed in paper abstract, arxiv 2503.01183), using 32 ODE steps. Real-world ComfyUI users report ~62 s for 4 min on consumer GPUs (comfyui.org).
  • VRAM: DiffRhythm-base needs ≥ 8 GB with --chunked; full needs 24 GB for headroom (chutes.ai docs).
  • v2: RTF 0.213 on RTX 4090 → ~45 s for a 210 s song (arxiv 2510.22950).
  • Apple Silicon / MPS: The v1 README claims Apple Silicon is "supported as of March 2025" but the GitHub issues list does not surface dedicated MPS benchmarks, and the Pinokio launcher (github.com/pinokiofactory/diffrhythm) does not advertise macOS in its description. No published M3/M4/M5 numbers exist. Speculation: on the user's M5 Max with 128 GB unified memory, v1-full should run via PYTORCH_ENABLE_MPS_FALLBACK=1, likely 3–5× slower than 4090; this needs hands-on validation. v2 is newer and has not been tested on MPS publicly.
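
For the MPS feasibility test, device selection could look like the sketch below. It is a generic PyTorch pattern, not code from either DiffRhythm repository, and `pick_device` is a hypothetical helper. Setting the fallback variable before `torch` is imported is the conservative choice assumed here.

```python
import os

# Set before torch is imported so ops without an MPS kernel can fall back to CPU.
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")


def pick_device():
    """Return 'cuda', 'mps', or 'cpu' depending on what PyTorch can see."""
    try:
        import torch
    except ImportError:
        return "cpu"  # torch is not installed, so there is nothing to accelerate
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    print("selected device:", pick_device())
```

Whether the DiffRhythm inference scripts accept a device chosen this way is untested; the sketch only shows how to detect the backend.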

8. DiffRhythm 2 specifics

What changed from v1 → v2 (arxiv 2510.22950, alphaxiv overview):

  1. Architecture shift: pure NAR DiT → semi-AR block flow-matching (2 s blocks).
  2. New 5 Hz music VAE (vs. v1's higher-rate codec), which enables 210 s context within budget.
  3. Stochastic Block REPA loss: aligns clean vs. noisy hidden states → better musicality + structure.
  4. Cross-Pair Preference Optimization: four-dim RLHF without the model-merging regression that plain DPO causes.
  5. Dataset scaling: ~1.4 M songs / ~70,000 hours, with a 20 k-hour high-quality subset for SFT and 40 k preference pairs for DPO, a step-change from v1's undisclosed-but-smaller corpus.
  6. Lyric alignment without external constraints: v1 needed LRC timestamps; v2 learns alignment end-to-end via the AR block dependency.
  7. Quality numbers (paper): PER 0.15 → 0.13, Mulan-T 0.25 → 0.40 vs. DiffRhythm+, i.e. lyric error reduced ~13 % and style match up 60 %.
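
The arithmetic behind items 5 and 7 can be checked in a few lines. The inputs are the figures quoted in the list; `relative_change` is an illustrative helper.

```python
def relative_change(old, new):
    """Signed change from old to new, expressed as a fraction of old."""
    return (new - old) / old


per_change = relative_change(0.15, 0.13)     # PER, lower is better
mulan_change = relative_change(0.25, 0.40)   # Mulan-T, higher is better
avg_minutes = 70_000 * 60 / 1_400_000        # mean song length in the corpus

if __name__ == "__main__":
    print(f"PER change: {per_change:+.1%}")
    print(f"Mulan-T change: {mulan_change:+.1%}")
    print(f"Average song length: {avg_minutes:.1f} min")
```

PER falls by about 13 %, Mulan-T rises by 60 % (a 1.6× ratio), and the corpus averages about 3 minutes per song.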

9. Repo health

  • DiffRhythm v1: ~2.2–2.3 k stars, 268 forks, active through 2025, last major release Mar 2025 (github.com/ASLP-lab/DiffRhythm).
  • DiffRhythm 2: 157 stars / 11 forks / 27 commits as of late Oct 2025. A young repo, recently pushed (github.com/ASLP-lab/DiffRhythm2).
  • Training/fine-tuning scripts: "Coming soon" is the status on v1; community has filed Issue #46 asking for fine-tuning docs. v2 ships inference only in the public repo as of writing.

10. Real-world adoption

  • ComfyUI: billwuhao/ComfyUI_DiffRhythm, 153 stars, supports v1.2 + full, includes bilingual subtitle gen (runcomfy.com node).
  • Pinokio: pinokiofactory/diffrhythm, 19 stars, 69 commits, one-click installer.
  • Chutes.ai: Public serverless endpoint for DiffRhythm-full (chutes.ai/docs/examples/music-generation).
  • Replicate: No first-party DiffRhythm 2 model found in search; likely a gap in the ecosystem (speculation).
  • Multiple unofficial web frontends: diffrhythm.com, diffrhythm.us, diffrhythm.ai, diffrhythmai.com. Quality and origin unverified, likely wrappers over the HF Space.

11. Fine-tuning

There is no official fine-tuning path yet. The v1 repo's training code is listed as "Coming soon," and v2 only ships inference. There is no LoRA support, no published fine-tuning recipe, and no transformers/diffusers integration as of May 2026. A community workaround would require reverse-engineering the DiT class, which is non-trivial for a 1 B-param flow-matching model. For the user's Suno-clone platform, fine-tuning DiffRhythm today means forking + writing your own training loop. This is the single biggest practical weakness.

12. Pros and cons

Pros

  • Permissive Apache 2.0 for code + weights: a clean commercial path.
  • Among the fastest open full-song models (v1 claims ~10 s for a 4m45s song on a 4090; v2's block-FM stays competitive at RTF 0.213 while adding AR-like coherence).
  • v2 has state-of-the-art lyric alignment (PER 0.13) in open source.
  • Lightweight: 8 GB VRAM possible with chunking, so it runs on consumer GPUs.
  • Strong ecosystem: ComfyUI nodes, Pinokio installer, Chutes serverless.
  • v2's block flow-matching meaningfully closes the structural-coherence gap that doomed v1 demos on HN.

Cons

  • Still a clear musicality gap vs. Suno v4.5 (authors admit it; arxiv 2510.22950).
  • No fine-tuning / LoRA path: training code unreleased.
  • v2's max length is 210 s (3m30s), shorter than v1-full's 4m45s, which is a regression for radio-length pop.
  • Multilingual claims (JP/KR/ES) are unbenchmarked; only EN/ZH have paper-backed quality.
  • No published MPS benchmarks for Apple Silicon; v2 untested on Mac.
  • Demo-site proliferation (diffrhythm.us, etc.) muddies the brand, which is confusing for product positioning.
  • License disclaimer adds soft ethical obligations re. copyright that legal review may flag.

13. Verdict for the user's platform

For a Suno-style platform on an M5 Max (128 GB unified, MPS), DiffRhythm 2 is the best diffusion-side open option in May 2026, but it should be paired with an AR-style backup (YuE / SongBloom / LeVo) covering its weak points.

Where DiffRhythm 2 wins:

  • Fast, cheap inference per song: viable for high-throughput web generation.
  • Best-in-open lyric intelligibility: critical for a karaoke / lyrics-first UX.
  • Stereo 44.1 kHz output out of the box.
  • Apache-2.0 + commercial freedom.

Where it underperforms:

  • Pop musicality, hook quality, vocal timbre are still below Suno v4.5; premium-tier output is not there.
  • No fine-tuning means you cannot specialize on a target sound or your platform's curated catalog without doing R&D.
  • 210 s ceiling on v2 limits "full album track" formats; you'd fall back to v1-full (4m45s) at a quality cost.
  • MPS path is unproven: the user should plan a same-week feasibility test on the M5 Max before committing v2 to the inference layer; CUDA cloud (Chutes / a 4090 server) is the safer near-term backend.

Recommended posture: ship v2 as the default fast generator behind a feature flag, keep v1.2-full for >3.5 min songs, evaluate Suno / YuE / SongBloom as quality-tier alternatives, and track the v2 repo for an eventual training-code release that would unlock fine-tuning on your platform's data.
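
A minimal routing sketch for this posture follows. The model identifiers and the `v2_enabled` flag are hypothetical names chosen for illustration; only the 210 s and 285 s ceilings come from the review.

```python
# Hypothetical routing sketch: model identifiers and the flag name are
# illustrative, not real checkpoint or API names.
V2_MAX_SECONDS = 210        # DiffRhythm 2 ceiling
V1_FULL_MAX_SECONDS = 285   # DiffRhythm v1.2-full ceiling (4m45s)


def choose_model(duration_s, v2_enabled=True):
    """Pick a generator for the requested song length in seconds."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    if duration_s > V1_FULL_MAX_SECONDS:
        raise ValueError("requested duration exceeds every available model")
    if v2_enabled and duration_s <= V2_MAX_SECONDS:
        return "diffrhythm2"
    return "diffrhythm-1_2-full"  # longer songs, or v2 switched off


if __name__ == "__main__":
    for seconds in (180, 240):
        print(seconds, "->", choose_model(seconds))
```

Keeping the flag as a plain argument makes it easy to turn v2 off per request if the MPS feasibility test fails.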


Primary sources