DiffRhythm and DiffRhythm 2: Deep Technical Review
Compiled 2026-05-18. All claims cited; speculation flagged inline.
1. Overview
DiffRhythm is the first open-source latent-diffusion full-song generator (vocals + accompaniment, end-to-end, from lyrics and a style prompt), built by the Audio, Speech and Language Processing (ASLP) Lab at Northwestern Polytechnical University (NWPU) in Xi'an, China, with later contributions from Xiaomi Research (arxiv.org/abs/2503.01183, github.com/ASLP-lab/DiffRhythm). DiffRhythm v1 dropped on arXiv on 3 Mar 2025; the full 4m45s variant followed on 15 Mar 2025, and an iterative v1.2 fixed repetition and audio-quality issues in mid-2025 (HF v1.2 commit). DiffRhythm 2 appeared on arXiv on 27 Oct 2025 (v3 revised 3 Feb 2026) as arxiv.org/abs/2510.22950, and was open-sourced at github.com/ASLP-lab/DiffRhythm2 (forked from xiaomi-research/diffrhythm2) on 30 Oct 2025, with HuggingFace weights at huggingface.co/ASLP-lab/DiffRhythm2. The series is the leading diffusion-side alternative to the LLM-style approach taken by Suno, YuE, and SongBloom.
2. Architecture
DiffRhythm v1 is a non-autoregressive (NAR) latent diffusion model with two pieces: a music VAE that compresses raw 44.1 kHz stereo audio into a latent grid, and a DiT (Diffusion Transformer) that denoises that grid conditioned on lyrics + style (nzqian.github.io/DiffRhythm). The DiT uses 16 LLaMA-style decoder layers, a hidden dimension of 2048, and 32 heads × 64 dims each, totaling ~1.1B parameters (arxiv.org/html/2503.01183). Vocals and accompaniment are produced jointly in a single latent stream, not as dual tracks, which is what makes it "embarrassingly simple" vs. cascaded systems. Lyric conditioning is sentence-level via LRC (timestamped) phonemes, with the diffusion model expected to align internally; style is conditioned either via a reference audio embedding or a text prompt. Inference uses a 32-step Euler ODE solver with CFG scale 4; to enable CFG, both conditions are dropped 20% of the time during training (diffrhythm.us).
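The sampling recipe described above (32 Euler steps, CFG scale 4) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's implementation: `cfg_combine`, `euler_sample`, and `toy_velocity` are hypothetical names, the time convention (noise at t = 0, data at t = 1) is assumed, and a closed-form toy velocity field stands in for the DiT.

```python
from typing import Callable, List

Vector = List[float]


def cfg_combine(v_cond: Vector, v_uncond: Vector, scale: float) -> Vector:
    """Classifier-free guidance: start from the unconditional prediction
    and extrapolate toward the conditional one by `scale`."""
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def euler_sample(
    x0: Vector,
    velocity_fn: Callable[[Vector, float, bool], Vector],
    steps: int = 32,        # step count quoted in the text
    cfg_scale: float = 4.0,  # CFG scale quoted in the text
) -> Vector:
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v_cond = velocity_fn(x, t, True)     # with lyrics/style conditioning
        v_uncond = velocity_fn(x, t, False)  # conditioning dropped
        v = cfg_combine(v_cond, v_uncond, cfg_scale)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x: Vector, t: float, conditioned: bool) -> Vector:
    """Hypothetical stand-in for the DiT: a straight-line flow toward 1.0
    when conditioned and toward 0.0 when unconditioned."""
    target = 1.0 if conditioned else 0.0
    return [(target - xi) / (1.0 - t) for xi in x]


result = euler_sample([0.3, -1.2], toy_velocity)
```

In this toy setup the sample ends at about 4.0 rather than at the conditional target 1.0, which illustrates that CFG extrapolates past the conditional prediction instead of interpolating toward it.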
DiffRhythm 2 replaces the pure-NAR DiT with a semi-autoregressive block flow-matching transformer: the latent sequence is sliced into blocks of 10 frames (2 s at 5 Hz), and "each block is generated with flow matching, while the dependency across blocks is handled autoregressively" (alphaxiv.org/overview/2510.22950v3, quoted via search snippet). This is the key innovation: it preserves NAR-style fast within-block parallelism while letting the model attend to prior blocks for structural coherence (verse → chorus → verse) and lyric alignment without any external aligner. The audio codec is a new music VAE at a 5 Hz frame rate (vs. the much higher rates of EnCodec/DAC) with a 170M-param decoder, enabling 210 s of latent context to fit on a single GPU (arxiv abs). The full DiT is ~1B parameters. Two new training objectives appear: a Stochastic Block Representation Alignment (REPA) loss that aligns hidden states of clean vs. noisy blocks (improves musicality/structure), and Cross-Pair Preference Optimization, an RLHF variant that groups the four preference dimensions (musicality, style similarity, lyric alignment, audio quality) into pairs to dodge the merging-induced regression that plain DPO causes. Max song length: 210 s in v2 vs. 4m45s (285 s) in v1-full (github.com/ASLP-lab/DiffRhythm).
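The block arithmetic and the semi-autoregressive control flow can be sketched as follows. The frame rate and block size come from the text; `block_layout`, `generate_semi_ar`, and the `generate_block` callback are hypothetical names for illustration, and the callback stands in for a within-block flow-matching sampler that the sketch does not implement.

```python
import math
from typing import Callable, List

FRAME_RATE_HZ = 5   # latent frames per second (from the text)
BLOCK_FRAMES = 10   # frames per block, i.e. 2 s of audio (from the text)


def block_layout(duration_s: float) -> tuple:
    """Return (latent frames, blocks) needed to cover `duration_s` seconds."""
    frames = math.ceil(duration_s * FRAME_RATE_HZ)
    blocks = math.ceil(frames / BLOCK_FRAMES)
    return frames, blocks


def generate_semi_ar(
    duration_s: float,
    generate_block: Callable[[List[float]], List[float]],
) -> List[float]:
    """Semi-autoregressive loop: each block is produced in one parallel
    call, and every call sees all previously generated frames."""
    _, n_blocks = block_layout(duration_s)
    history: List[float] = []
    for _ in range(n_blocks):
        block = generate_block(history)  # parallel within the block
        history.extend(block)            # autoregressive across blocks
    return history


frames, blocks = block_layout(210)
# Dummy block generator: returns silence regardless of history.
latents = generate_semi_ar(210, lambda history: [0.0] * BLOCK_FRAMES)
```

A 210 s song therefore corresponds to 1050 latent frames in 105 blocks, so the autoregressive loop runs 105 times rather than once per frame.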
3. Variants and sizes
| Checkpoint | Duration | DiT params | Notes | Source |
|---|---|---|---|---|
| DiffRhythm-base | 1m35s | ~1.1B | Original Mar 2025 | HF |
| DiffRhythm-full | 4m45s | ~1.1B | Released 15 Mar 2025 | HF |
| DiffRhythm-vae | n/a | n/a | Shared audio VAE | HF |
| DiffRhythm-1_2-base | 1m35s | ~1.1B | v1.2 quality fix | GH README |
| DiffRhythm-1_2-full | 4m45s | ~1.1B | v1.2, text-style + instrumental | HF |
| DiffRhythm+ (paper) | full | ~1.1B | Adds DPO; not headlined as separate checkpoint | arxiv 2507.12890 |
| DiffRhythm2 | 210 s | ~1B DiT + 170M VAE-dec | Block flow matching | HF |
(Speculation: I did not find an explicit param count posted for v2's DiT; the ~1B figure comes from a paper-extraction snippet and aligns with v1's ~1.1B body. Treat as approximate.)
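To put the parameter counts in context, the weights-only memory footprint can be estimated from the figures in the table. This is a rough sketch: `weights_gib` is a hypothetical helper, the 2-byte and 4-byte precisions are assumptions about how the checkpoints might be loaded, and activations, the VAE encoder, and any text or style encoders are excluded.

```python
def weights_gib(params: float, bytes_per_param: int) -> float:
    """Memory for the raw weights only, in GiB (2**30 bytes)."""
    return params * bytes_per_param / 2**30


# Parameter counts from the table; precisions are assumptions.
v1_dit_fp16 = weights_gib(1.1e9, 2)   # 16-bit weights
v1_dit_fp32 = weights_gib(1.1e9, 4)   # 32-bit weights
v2_vae_dec_fp16 = weights_gib(170e6, 2)

for name, gib in [
    ("v1 DiT, 16-bit", v1_dit_fp16),
    ("v1 DiT, 32-bit", v1_dit_fp32),
    ("v2 VAE decoder, 16-bit", v2_vae_dec_fp16),
]:
    print(f"{name}: {gib:.2f} GiB")
```

Under these assumptions the DiT weights alone take roughly 2 GiB at 16-bit precision and 4 GiB at 32-bit precision; actual VRAM use during inference will be higher.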
4. License
Apache 2.0 for both code and DiT weights, declared on the v1 GitHub README and reaffirmed on the v2 README (github.com/ASLP-lab/DiffRhythm, github.com/ASLP-lab/DiffRhythm2). Commercial use is permitted with attribution. The v2 model card adds a non-binding ethical disclaimer asking users to verify originality, disclose AI involvement, and respect stylistic copyright; this is a notice, not an enforceable license restriction (HF model card).
5. Languages supported
Training is heavily bilingual (Mandarin + English): v2's dataset is reported as Chinese : English : Instrumental ≈ 4 : 5 : 1 (alphaXiv extract). The v1 README and several mirrors claim cross-lingual capability for Japanese, Korean, and Spanish (diffrhythm.us, diffrhythm.ai), but these are demo-site marketing claims, not benchmarked in the paper. Verdict: production-safe for EN and ZH; treat JP/KR/ES as best-effort. The phoneme front-end is espeak-ng, which itself supports 100+ languages (HF model card).
6. Quality assessment
Objective (v2 paper; lower = better for PER and RTF, higher = better for Mulan-T):
| Metric | DiffRhythm 2 | DiffRhythm+ | ACE-Step | LeVo |
|---|---|---|---|---|
| PER (lyric alignment) ↓ | 0.13 | 0.15 | 0.23 | 0.19 |
| Mulan-T (style match) ↑ | 0.40 | 0.25 | 0.28 | 0.35 |
| RTF (speed) ↓ | 0.213 | 0.153 | 0.127 | 1.225 |
So v2 has the best lyric alignment and style match among the open-source models compared; it is slightly slower than DiffRhythm+ and ACE-Step but ~6× faster than LeVo (arxiv 2510.22950).
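The speed comparison follows directly from the RTF row. The sketch below recomputes the ratios and converts RTF into wall-clock time, assuming the usual definition of RTF as generation time divided by audio duration; `speed_ratio` and `wall_clock_s` are hypothetical helper names.

```python
# RTF values copied from the table above.
RTF = {
    "DiffRhythm 2": 0.213,
    "DiffRhythm+": 0.153,
    "ACE-Step": 0.127,
    "LeVo": 1.225,
}


def speed_ratio(slower: str, faster: str) -> float:
    """How many times longer `slower` takes than `faster` per second of audio."""
    return RTF[slower] / RTF[faster]


def wall_clock_s(model: str, audio_s: float) -> float:
    """Estimated generation time, assuming RTF = generation time / audio time."""
    return RTF[model] * audio_s


levo_vs_v2 = speed_ratio("LeVo", "DiffRhythm 2")
v2_vs_plus = speed_ratio("DiffRhythm 2", "DiffRhythm+")
v2_vs_ace = speed_ratio("DiffRhythm 2", "ACE-Step")
v2_210s = wall_clock_s("DiffRhythm 2", 210)

print(f"LeVo / v2: {levo_vs_v2:.2f}x")
print(f"v2 / DiffRhythm+: {v2_vs_plus:.2f}x")
print(f"v2 / ACE-Step: {v2_vs_ace:.2f}x")
print(f"v2, 210 s song: {v2_210s:.1f} s")
```

On these figures LeVo takes about 5.75 times as long as v2, v2 takes about 1.4 times as long as DiffRhythm+ and 1.7 times as long as ACE-Step, and a 210 s song takes roughly 45 s on the hardware the paper measured.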
Subjective: v2 is the strongest open model by MOS in the paper's own user study, but the authors explicitly state "in aspects such as musicality, it still shows a clear gap compared to commercial systems like SUNO V4.5" (arxiv 2510.22950). The block flow-matching does close the structural-coherence gap that the original Hacker News thread criticized v1 for: multiple HN commenters complained "there's no identifiable chorus in any of the demo songs" and that rhythm was unstable (news.ycombinator.com/item?id=43255467). v2 demos show real verse/chorus structure (aslp-lab.github.io/DiffRhythm2.github.io). Specific Reddit reception threads in r/LocalLLaMA/r/StableDiffusion were not surfaced by search (low signal).
7. Inference performance
- v1-full: ~10 s for a 4m45s song on a single RTX 4090 (claimed in paper abstract, arxiv 2503.01183), using 32 ODE steps. Real-world ComfyUI users report ~62 s for 4 min on consumer GPUs (comfyui.org).
- VRAM: DiffRhythm-base needs ≥ 8 GB with `--chunked`; full needs 24 GB for headroom (chutes.ai docs).
- v2: RTF 0.213 on RTX 4090, i.e. ~45 s for a 210 s song (arxiv 2510.22950).
- Apple Silicon / MPS: The v1 README claims Apple Silicon is "supported as of March 2025" but the GitHub issues list does not surface dedicated MPS benchmarks, and the Pinokio launcher (github.com/pinokiofactory/diffrhythm) does not advertise macOS in its description. No published M3/M4/M5 numbers exist. Speculation: on the user's M5 Max with 128 GB unified memory, v1-full should run via `PYTORCH_ENABLE_MPS_FALLBACK=1`, likely 3 to 5× slower than a 4090; this needs hands-on validation. v2 is newer and has not been tested on MPS publicly.
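For the MPS feasibility test, a device-selection helper along these lines may be a useful starting point. It is a sketch, not part of either DiffRhythm repo: `pick_device` is a hypothetical name, and setting the fallback variable before importing `torch` follows the commonly recommended order rather than anything the DiffRhythm READMEs specify.

```python
import os

# Ask PyTorch to fall back to CPU for operators that MPS does not implement.
# Set before importing torch; setdefault keeps any value the user already chose.
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")


def pick_device() -> str:
    """Return 'cuda', 'mps', or 'cpu' depending on what this machine offers."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed; nothing accelerated is available.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # absent on old PyTorch builds
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(f"Selected device: {device}")
```

Operators that fall back to CPU run correctly but slowly, so a successful run on `mps` does not by itself say much about throughput; timing a full-length generation is still needed.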
8. DiffRhythm 2 specifics
What changed from v1 → v2 (arxiv 2510.22950, alphaxiv overview):
- Architecture shift: pure NAR DiT → semi-AR block flow-matching (2 s blocks).
- New 5 Hz music VAE (vs. v1's higher-rate codec), which enables 210 s of context within budget.
- Stochastic Block REPA loss: aligns clean vs. noisy hidden states → better musicality + structure.
- Cross-Pair Preference Optimization: four-dim RLHF without the model-merging regression that plain DPO causes.
- Dataset scaling: ~1.4 M songs / ~70,000 hours, with a 20 k-hour high-quality subset for SFT and 40 k preference pairs for DPO, a step-change from v1's undisclosed-but-smaller corpus.
- Lyric alignment without external constraints: v1 needed LRC timestamps; v2 learns alignment end-to-end via the AR block dependency.
- Quality numbers (paper): PER 0.15 → 0.13, Mulan-T 0.25 → 0.40 vs. DiffRhythm+, i.e. lyric error reduced ~13 % and style match up 60 % (1.6×).
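The relative changes for PER and Mulan-T can be recomputed from the paper values quoted above; `relative_change` is a hypothetical helper used only for this check.

```python
def relative_change(old: float, new: float) -> float:
    """Signed fractional change from `old` to `new`."""
    return (new - old) / old


# DiffRhythm+ -> DiffRhythm 2, values as quoted from the v2 paper.
per_change = relative_change(0.15, 0.13)      # lower is better
mulan_change = relative_change(0.25, 0.40)    # higher is better

print(f"PER change: {per_change:+.1%}")
print(f"Mulan-T change: {mulan_change:+.1%}")
```

This gives a PER reduction of about 13 % and a Mulan-T increase of 60 %, i.e. a 1.6× style-match score.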
9. Repo health
- DiffRhythm v1: ~2.2 to 2.3 k stars, 268 forks, active through 2025, last major release Mar 2025 (github.com/ASLP-lab/DiffRhythm).
- DiffRhythm 2: 157 stars / 11 forks / 27 commits as of late Oct 2025; a young repo, recently pushed (github.com/ASLP-lab/DiffRhythm2).
- Training/fine-tuning scripts: "Coming soon" is the status on v1; community has filed Issue #46 asking for fine-tuning docs. v2 ships inference only in the public repo as of writing.
10. Real-world adoption
- ComfyUI: billwuhao/ComfyUI_DiffRhythm, 153 stars, supports v1.2 + full, includes bilingual subtitle gen (runcomfy.com node).
- Pinokio: pinokiofactory/diffrhythm, 19 stars, 69 commits, one-click installer.
- Chutes.ai: Public serverless endpoint for DiffRhythm-full (chutes.ai/docs/examples/music-generation).
- Replicate: No first-party DiffRhythm 2 model found in search; a gap in the ecosystem (speculation).
- Multiple unofficial web frontends: diffrhythm.com, diffrhythm.us, diffrhythm.ai, diffrhythmai.com. Quality and origin are unverified; they are likely wrappers over the HF Space.
11. Fine-tuning
The official answer is none yet. The v1 repo's training code is listed as "Coming soon," and v2 only ships inference. There is no LoRA support, no published fine-tuning recipe, and no transformers/diffusers integration as of May 2026. A community workaround would require reverse-engineering the DiT class, which is non-trivial for a 1 B-param flow-matching model. For the user's Suno-clone platform, fine-tuning DiffRhythm today means forking + writing your own training loop. This is the single biggest practical weakness.
12. Pros and cons
Pros
- Permissive Apache 2.0 for code + weights: a clean commercial path.
- Fastest open full-song model (v1-full: ~10 s for a 4m45s song on a 4090; v2's block-FM is competitive even with AR-like coherence).
- v2 has state-of-the-art lyric alignment (PER 0.13) in open source.
- Lightweight: 8 GB VRAM possible with chunking, so it runs on consumer GPUs.
- Strong ecosystem: ComfyUI nodes, Pinokio installer, Chutes serverless.
- v2's block flow-matching meaningfully closes the structural-coherence gap that doomed v1 demos on HN.
Cons
- Still a clear musicality gap vs. Suno v4.5 (authors admit it; arxiv 2510.22950).
- No fine-tuning / LoRA path; training code unreleased.
- v2's max length is 210 s (3m30s), shorter than v1-full's 4m45s: a regression for radio-length pop.
- Multilingual claims (JP/KR/ES) are unbenchmarked; only EN/ZH have paper-backed quality.
- No published MPS benchmarks for Apple Silicon; v2 untested on Mac.
- Demo-site proliferation (`diffrhythm.us`, etc.) muddies the brand, which is confusing for product positioning.
- License disclaimer adds soft ethical obligations re. copyright that legal review may flag.
13. Verdict for the user's platform
For a Suno-style platform on an M5 Max (128 GB unified, MPS), DiffRhythm 2 is the best diffusion-side open option in May 2026, but it should be paired with an AR-style backup (YuE / SongBloom / LeVo) covering its weak points.
Where DiffRhythm 2 wins:
- Fast, cheap inference per song: viable for high-throughput web generation.
- Best-in-open lyric intelligibility: critical for a karaoke / lyrics-first UX.
- Stereo 44.1 kHz output out of the box.
- Apache-2.0 + commercial freedom.
Where it underperforms:
- Pop musicality, hook quality, and vocal timbre are still below Suno v4.5; premium-tier output is not there.
- No fine-tuning means you cannot specialize on a target sound or your platform's curated catalog without doing R&D.
- The 210 s ceiling on v2 limits "full album track" formats; you'd fall back to v1-full (4m45s) at a quality cost.
- MPS path is unproven: the user should plan a same-week feasibility test on the M5 Max before committing v2 to the inference layer; CUDA cloud (Chutes / a 4090 server) is the safer near-term backend.
Recommended posture: ship v2 as the default fast generator behind a feature flag, keep v1.2-full for >3.5 min songs, evaluate Suno / YuE / SongBloom as quality-tier alternatives, and track the v2 repo for an eventual training-code release that would unlock fine-tuning on your platform's data.
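The routing part of this posture could be sketched as below. The length ceilings and checkpoint names come from the sections above; `choose_checkpoint` and the `v2_enabled` flag are hypothetical and stand in for whatever feature-flag mechanism the platform actually uses.

```python
V2_MAX_S = 210        # DiffRhythm 2 ceiling (3m30s)
V1_FULL_MAX_S = 285   # DiffRhythm v1 full ceiling (4m45s)


def choose_checkpoint(duration_s: float, v2_enabled: bool = True) -> str:
    """Pick a checkpoint name for the requested song length."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    if v2_enabled and duration_s <= V2_MAX_S:
        return "DiffRhythm2"
    if duration_s <= V1_FULL_MAX_S:
        # Longer songs, or v2 switched off by the feature flag.
        return "DiffRhythm-1_2-full"
    raise ValueError(f"no checkpoint covers {duration_s} s")


short_song = choose_checkpoint(180)
long_song = choose_checkpoint(240)
flag_off = choose_checkpoint(180, v2_enabled=False)
```

Requests above 285 s raise an error here; a production router would more likely truncate the request, reject it at the API layer, or hand it to one of the quality-tier alternatives.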