File size: 15,731 Bytes
9071450
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
# DiffRhythm and DiffRhythm 2 β€” Deep Technical Review

*Compiled 2026-05-18. All claims cited; speculation flagged inline.*

## 1. Overview

DiffRhythm is the first open-source **latent-diffusion full-song generator** β€” vocals + accompaniment, end-to-end, from lyrics and a style prompt β€” built by the **Audio, Speech and Language Processing (ASLP) Lab at Northwestern Polytechnical University (NWPU)** in Xi'an, China, with later contributions from **Xiaomi Research** ([arxiv.org/abs/2503.01183](https://arxiv.org/abs/2503.01183), [github.com/ASLP-lab/DiffRhythm](https://github.com/ASLP-lab/DiffRhythm)). DiffRhythm v1 dropped on **arXiv 3 Mar 2025**; the full 4m45s variant followed on **15 Mar 2025**, and an iterative v1.2 fixed repetition and audio-quality issues mid-2025 ([HF v1.2 commit](https://huggingface.co/spaces/ASLP-lab/DiffRhythm/commit/f5b749d65f62e30bdaad11e6866edc8d3b078b71)). **DiffRhythm 2** appeared on **arXiv 27 Oct 2025** (v3 revised 3 Feb 2026) under [arxiv.org/abs/2510.22950](https://arxiv.org/abs/2510.22950), and was open-sourced at [github.com/ASLP-lab/DiffRhythm2](https://github.com/ASLP-lab/DiffRhythm2) (forked from `xiaomi-research/diffrhythm2`) on **30 Oct 2025**, with HuggingFace weights at [huggingface.co/ASLP-lab/DiffRhythm2](https://huggingface.co/ASLP-lab/DiffRhythm2). The series is the leading **diffusion-side** alternative to the LLM-style approach taken by Suno, YuE, and SongBloom.

## 2. Architecture

DiffRhythm v1 is a **non-autoregressive (NAR) latent diffusion** model with two pieces: a music **VAE** that compresses raw 44.1 kHz stereo audio into a latent grid, and a **DiT** (Diffusion Transformer) that denoises that grid conditioned on lyrics + style ([nzqian.github.io/DiffRhythm](https://nzqian.github.io/DiffRhythm/)). The DiT uses **16 LLaMA-style decoder layers, 2048 hidden dim, 32 heads Γ— 64 dim, totaling ~1.1B parameters** ([arxiv.org/html/2503.01183](https://arxiv.org/html/2503.01183v1)). Vocals and accompaniment are produced **jointly in a single latent stream** β€” not dual-track β€” which is what makes it "embarrassingly simple" vs. cascaded systems. Lyric conditioning is **sentence-level via LRC (timestamped) phonemes**, with the diffusion model expected to align internally; style is conditioned either via a reference audio embedding or a text prompt. Inference uses a **32-step Euler ODE with CFG scale 4** and 20% dropout on both conditions during training to enable CFG ([diffrhythm.us](https://diffrhythm.us/)).

**DiffRhythm 2** replaces the pure-NAR DiT with a **semi-autoregressive block flow-matching** transformer: the latent sequence is sliced into **blocks of 10 frames (2s at 5 Hz)**, and "each block is generated with flow matching, while the dependency across blocks is handled autoregressively" ([alphaxiv.org/overview/2510.22950v3](https://www.alphaxiv.org/overview/2510.22950v3) β€” quoted via search snippet). This is the key innovation: it preserves NAR-style fast within-block parallelism while letting the model attend to prior blocks for **structural coherence** (verse β†’ chorus β†’ verse) and **lyric alignment without any external aligner**. The audio codec is a new **music VAE at 5 Hz frame rate** (vs. the much higher rates of EnCodec/DAC) with a **170M-param decoder**, enabling 210s of latent context to fit on a single GPU ([arxiv abs](https://arxiv.org/abs/2510.22950)). The full DiT is **~1B parameters**. Two new training objectives appear: **Stochastic Block Representation Alignment (REPA) loss** to align hidden states of clean vs. noisy blocks (improves musicality/structure), and **Cross-Pair Preference Optimization** β€” an RLHF variant that groups the four preference dimensions (musicality, style similarity, lyric alignment, audio quality) into pairs to dodge the merging-induced regression that plain DPO causes. **Max song length: 210 s** in v2 vs. **4m45s (~285 s)** in v1-full ([github.com/ASLP-lab/DiffRhythm](https://github.com/ASLP-lab/DiffRhythm)).

## 3. Variants and sizes

| Checkpoint | Duration | DiT params | Notes | Source |
|---|---|---|---|---|
| `DiffRhythm-base` | 1m35s | ~1.1B | Original Mar 2025 | [HF](https://huggingface.co/ASLP-lab/DiffRhythm-base) |
| `DiffRhythm-full` | 4m45s | ~1.1B | Released 15 Mar 2025 | [HF](https://huggingface.co/ASLP-lab/DiffRhythm-full) |
| `DiffRhythm-vae` | β€” | β€” | Shared audio VAE | [HF](https://huggingface.co/ASLP-lab/DiffRhythm-vae) |
| `DiffRhythm-1_2-base` | 1m35s | ~1.1B | v1.2 quality fix | [GH README](https://github.com/ASLP-lab/DiffRhythm) |
| `DiffRhythm-1_2-full` | 4m45s | ~1.1B | v1.2, text-style + instrumental | [HF](https://huggingface.co/ASLP-lab/DiffRhythm-1_2-full) |
| `DiffRhythm+` (paper) | full | ~1.1B | Adds DPO; not headlined as separate checkpoint | [arxiv 2507.12890](https://arxiv.org/html/2507.12890v2) |
| `DiffRhythm2` | 210 s | ~1B DiT + 170M VAE-dec | Block flow matching | [HF](https://huggingface.co/ASLP-lab/DiffRhythm2) |

(Speculation: I did not find an explicit param count posted for v2's DiT; the **~1B figure comes from a paper-extraction snippet** and aligns with v1's ~1.1B body. Treat as approximate.)

## 4. License

**Apache 2.0** for both code and DiT weights, declared on the v1 GitHub README and reaffirmed on the v2 README ([github.com/ASLP-lab/DiffRhythm](https://github.com/ASLP-lab/DiffRhythm), [github.com/ASLP-lab/DiffRhythm2](https://github.com/ASLP-lab/DiffRhythm2)). **Commercial use is permitted** with attribution. The v2 model card adds a **non-binding ethical disclaimer** asking users to verify originality, disclose AI involvement, and respect stylistic copyright β€” this is a notice, not an enforceable license restriction ([HF model card](https://huggingface.co/ASLP-lab/DiffRhythm2)).

## 5. Languages supported

Training is heavily **bilingual (Mandarin + English)** β€” v2's dataset is reported as **Chinese : English : Instrumental β‰ˆ 4 : 5 : 1** ([alphaXiv extract](https://www.alphaxiv.org/overview/2510.22950v3)). The v1 README and several mirrors claim **cross-lingual capability** for Japanese, Korean, Spanish ([diffrhythm.us](https://diffrhythm.us/), [diffrhythm.ai](https://diffrhythmai.com/)) β€” but these are demo-site marketing claims, **not benchmarked in the paper**. Verdict: production-safe for **EN and ZH**; treat JP/KR/ES as best-effort. Phoneme front-end is **espeak-ng**, which itself supports 100+ languages ([HF model card](https://huggingface.co/ASLP-lab/DiffRhythm2)).

## 6. Quality assessment

**Objective (v2 paper, lower=better for PER, higher=better for Mulan-T):**

| Metric | DiffRhythm 2 | DiffRhythm+ | ACE-Step | LeVo |
|---|---|---|---|---|
| PER (lyric alignment) ↓ | **0.13** | 0.15 | 0.23 | 0.19 |
| Mulan-T (style match) ↑ | **0.40** | 0.25 | 0.28 | 0.35 |
| RTF (speed) ↓ | 0.213 | 0.153 | 0.127 | 1.225 |

So v2 has **best-in-open-source lyric alignment and style match**, slightly slower than v1+/ACE-Step but ~6Γ— faster than LeVo ([arxiv 2510.22950](https://arxiv.org/abs/2510.22950)).

**Subjective:** v2 is the strongest open model by MOS in the paper's own user study, **but the authors explicitly state "in aspects such as musicality, it still shows a clear gap compared to commercial systems like SUNO V4.5"** ([arxiv 2510.22950](https://arxiv.org/abs/2510.22950)). The **block flow-matching does close the structural-coherence gap** that the original Hacker News thread criticized v1 for β€” multiple HN commenters complained "there's no identifiable chorus in any of the demo songs" and rhythm was unstable ([news.ycombinator.com/item?id=43255467](https://news.ycombinator.com/item?id=43255467)). v2 demos show real verse/chorus structure ([aslp-lab.github.io/DiffRhythm2.github.io](https://aslp-lab.github.io/DiffRhythm2.github.io/)). Specific Reddit reception threads in r/LocalLLaMA/r/StableDiffusion were not surfaced by search (low signal).

## 7. Inference performance

- v1-full: **~10 s for a 4m45s song on a single RTX 4090** (claimed in paper abstract, [arxiv 2503.01183](https://arxiv.org/abs/2503.01183)) β€” 32 ODE steps. Real-world ComfyUI users report **~62 s for 4 min** on consumer GPUs ([comfyui.org](https://comfyui.org/en/generate-music-with-comfyui-diffrhythm)).
- **VRAM:** DiffRhythm-base needs β‰₯ **8 GB** with `--chunked`; full needs **24 GB** for headroom ([chutes.ai docs](https://chutes.ai/docs/examples/music-generation)).
- v2: **RTF 0.213 on RTX 4090** β†’ ~45 s for a 210 s song ([arxiv 2510.22950](https://arxiv.org/abs/2510.22950)).
- **Apple Silicon / MPS:** The v1 README claims Apple Silicon is "supported as of March 2025" but the GitHub issues list does not surface dedicated MPS benchmarks, and the Pinokio launcher ([github.com/pinokiofactory/diffrhythm](https://github.com/pinokiofactory/diffrhythm)) does not advertise macOS in its description. **No published M3/M4/M5 numbers exist.** Speculation: on the user's **M5 Max with 128 GB unified memory**, v1-full should run via `PYTORCH_ENABLE_MPS_FALLBACK=1`, likely 3–5Γ— slower than 4090 β€” needs hands-on validation. v2 is newer and has not been tested on MPS publicly.

## 8. DiffRhythm 2 specifics

What changed from v1 β†’ v2 ([arxiv 2510.22950](https://arxiv.org/abs/2510.22950), [alphaxiv overview](https://www.alphaxiv.org/overview/2510.22950v3)):

1. **Architecture shift:** pure NAR DiT β†’ **semi-AR block flow-matching** (2 s blocks).
2. **New 5 Hz music VAE** (vs. v1's higher-rate codec) β€” enables 210 s context within budget.
3. **Stochastic Block REPA loss:** aligns clean vs. noisy hidden states β†’ better musicality + structure.
4. **Cross-Pair Preference Optimization:** four-dim RLHF without the model-merging regression that plain DPO causes.
5. **Dataset scaling:** **~1.4 M songs / ~70,000 hours**, with a **20 k-hour high-quality subset** for SFT and **40 k preference pairs** for DPO β€” a step-change from v1's undisclosed-but-smaller corpus.
6. **Lyric alignment without external constraints:** v1 needed LRC timestamps; v2 learns alignment end-to-end via the AR block dependency.
7. **Quality numbers (paper):** PER **0.15 β†’ 0.13**, Mulan-T **0.25 β†’ 0.40** vs. DiffRhythm+ β€” i.e. **lyric-error reduced ~13 % and style-match nearly doubled**.

## 9. Repo health

- **DiffRhythm v1:** ~**2.2–2.3 k stars**, **268 forks**, active through 2025, last major release Mar 2025 ([github.com/ASLP-lab/DiffRhythm](https://github.com/ASLP-lab/DiffRhythm)).
- **DiffRhythm 2:** **157 stars / 11 forks / 27 commits** as of late Oct 2025 β€” young repo, recently pushed ([github.com/ASLP-lab/DiffRhythm2](https://github.com/ASLP-lab/DiffRhythm2)).
- Training/fine-tuning scripts: **"Coming soon"** is the status on v1; community has filed [Issue #46](https://github.com/ASLP-lab/DiffRhythm/issues/46) asking for fine-tuning docs. v2 ships **inference only** in the public repo as of writing.

## 10. Real-world adoption

- **ComfyUI:** [billwuhao/ComfyUI_DiffRhythm](https://github.com/billwuhao/ComfyUI_DiffRhythm) β€” 153 stars, supports v1.2 + full, includes bilingual subtitle gen ([runcomfy.com node](https://www.runcomfy.com/comfyui-nodes/ComfyUI_DiffRhythm)).
- **Pinokio:** [pinokiofactory/diffrhythm](https://github.com/pinokiofactory/diffrhythm) β€” 19 stars, 69 commits, one-click installer.
- **Chutes.ai:** Public serverless endpoint for DiffRhythm-full ([chutes.ai/docs/examples/music-generation](https://chutes.ai/docs/examples/music-generation)).
- **Replicate:** No first-party DiffRhythm 2 model found in search β€” gap in the ecosystem (speculation).
- Multiple unofficial web frontends: diffrhythm.com, diffrhythm.us, diffrhythm.ai, diffrhythmai.com β€” quality and origin unverified, likely wrappers over the HF Space.

## 11. Fine-tuning

The official answer is **none yet**. The v1 repo's training code is listed as "Coming soon," and v2 only ships inference. There is no LoRA support, no published fine-tuning recipe, and no `transformers`/`diffusers` integration as of May 2026. Community workaround would require reverse-engineering the DiT class β€” non-trivial for a 1 B-param flow-matching model. **For the user's Suno-clone platform, fine-tuning DiffRhythm today means forking + writing your own training loop.** This is the single biggest practical weakness.

## 12. Pros and cons

**Pros**
- Permissive **Apache 2.0** for code + weights β€” clean commercial path.
- **Fastest open full-song model** (~10 s for 4 min on a 4090; v2's block-FM is competitive even with AR-like coherence).
- v2 has **state-of-the-art lyric alignment (PER 0.13)** in open source.
- Lightweight: 8 GB VRAM possible with chunking β€” runs on consumer GPUs.
- Strong ecosystem: ComfyUI nodes, Pinokio installer, Chutes serverless.
- v2's block flow-matching meaningfully **closes the structural-coherence gap** that doomed v1 demos on HN.

**Cons**
- Still a **clear musicality gap vs. Suno v4.5** (authors admit it; [arxiv 2510.22950](https://arxiv.org/abs/2510.22950)).
- **No fine-tuning / LoRA path** β€” training code unreleased.
- v2's max length is **210 s** (3m30s), *shorter* than v1-full's 4m45s β€” a regression for radio-length pop.
- Multilingual claims (JP/KR/ES) are **unbenchmarked**; only EN/ZH have paper-backed quality.
- **No published MPS benchmarks** for Apple Silicon; v2 untested on Mac.
- Demo-site proliferation (`diffrhythm.us`, etc.) muddies the brand β€” confusing for product positioning.
- License disclaimer adds soft ethical obligations re. copyright that legal review may flag.

## 13. Verdict for the user's platform

For a Suno-style platform on an **M5 Max (128 GB unified, MPS)**, DiffRhythm 2 is the **best diffusion-side open option in May 2026**, *but* it should be paired with an **AR-style backup** (YuE / SongBloom / LeVo) covering its weak points.

**Where DiffRhythm 2 wins:**
- Fast, cheap inference per song β€” viable for high-throughput web generation.
- Best-in-open lyric intelligibility β€” critical for a karaoke / lyrics-first UX.
- Stereo 44.1 kHz output out of the box.
- Apache-2.0 + commercial freedom.

**Where it underperforms:**
- **Pop musicality, hook quality, vocal timbre** are still below Suno v4.5 β€” premium-tier output is not there.
- **No fine-tuning** means you cannot specialize on a target sound or your platform's curated catalog without doing R&D.
- **210 s ceiling on v2** limits "full album track" formats β€” you'd fall back to v1-full (4m45s) at a quality cost.
- **MPS path is unproven** β€” the user should plan a same-week feasibility test on the M5 Max before committing v2 to the inference layer; CUDA cloud (Chutes / a 4090 server) is the safer near-term backend.

**Recommended posture:** ship v2 as the default *fast* generator behind a feature flag, keep v1.2-full for >3.5 min songs, evaluate Suno / YuE / SongBloom as quality-tier alternatives, and track the v2 repo for an eventual training-code release that would unlock fine-tuning on your platform's data.

---

### Primary sources
- [DiffRhythm 2 paper (arxiv 2510.22950)](https://arxiv.org/abs/2510.22950)
- [DiffRhythm v1 paper (arxiv 2503.01183)](https://arxiv.org/abs/2503.01183)
- [DiffRhythm v1 GitHub](https://github.com/ASLP-lab/DiffRhythm)
- [DiffRhythm 2 GitHub](https://github.com/ASLP-lab/DiffRhythm2)
- [DiffRhythm 2 HF model card](https://huggingface.co/ASLP-lab/DiffRhythm2)
- [alphaXiv overview v3](https://www.alphaxiv.org/overview/2510.22950v3)
- [HN thread on v1](https://news.ycombinator.com/item?id=43255467)
- [ComfyUI_DiffRhythm](https://github.com/billwuhao/ComfyUI_DiffRhythm)
- [Pinokio DiffRhythm](https://github.com/pinokiofactory/diffrhythm)
- [Chutes serving docs](https://chutes.ai/docs/examples/music-generation)
- [DiffRhythm+ paper (arxiv 2507.12890)](https://arxiv.org/html/2507.12890v2)