
YuE: Open Full-Song Music Generation Foundation Model

Research date: 2026-05-18


1. Overview

YuE (乐, "yue", Chinese for "music") is an open-source family of long-form, lyrics-to-song foundation models that produce vocals and accompaniment end-to-end, explicitly positioned as the open competitor to Suno.ai and Udio. It was built by the M-A-P (Multimodal Art Projection) collective, led by researchers at HKUST (Hong Kong University of Science and Technology) with collaborators from multiple academic and industry institutions. The paper credits 58 authors and acknowledges hardware support from Geely and Moonshot AI (arXiv 2503.08638, HF model card).

Release timeline:

  • 2025-01-26: Initial YuE-s1-7B series released (GitHub README)
  • 2025-01-30: Apache 2.0 license adopted; dual-track ICL mode added
  • 2025-02-07: Windows / Pinokio support
  • 2025-02-17: Music continuation + Google Colab support
  • 2025-03-11/12: Anneal checkpoints + technical report on arXiv (v1)
  • 2025-06-04: LoRA fine-tuning code merged (PR #126)
  • ICLR 2026: Paper presented

Current status (May 2026): effectively frozen / community-maintained. The official multimodal-art-projection/YuE repo's last commit is dated 2025-06-04 (GitHub API, retrieved 2026-05-18), roughly 11 months stale. There is no announced YuE-2 or successor from the M-A-P org. All forward development (quantization, ComfyUI, GUI, exllama, mp3 extension) now happens in community forks like YuEGP, YuE-exllamav2, and YuE-extend. Momentum in this space has shifted to ACE-Step (released January 2026), whose paper explicitly criticizes LLM-based models like YuE for "slow inference and structural artifacts" (arXiv 2506.00045).


2. Architecture

YuE is a two-stage autoregressive LLM pipeline built on the LLaMA2 decoder-only transformer backbone, not a diffusion model (paper).

Stage-1 LM (the headline 7B model):

  • LLaMA2-style decoder, ~6B–7B parameters (HF metadata reports 6B for the s1 checkpoints).
  • Performs track-decoupled next-token prediction: interleaves vocal and instrumental token streams in a single sequence, so a single AR pass produces both tracks rather than mixing them. This is YuE's central architectural innovation.
  • Conditioned on (genre tags || lyrics) using structural progressive conditioning: lyrics are chunked per section (verse/chorus/bridge) and re-injected so attention does not lose alignment over a 5-minute generation.
  • Native context: 8192 tokens (~163 s of mix-track audio, ~81 s of dual-track); extended to 16384 in the anneal phase.
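
The context-length figures above follow from simple arithmetic. The sketch below assumes the 50 Hz X-Codec frame rate cited in this report and one Stage-1 token per frame per track; it ignores the tokens spent on the genre and lyrics prompt, so real budgets are somewhat smaller. The function name is illustrative only.

```python
# Rough token-budget arithmetic. Assumes one Stage-1 token per codec frame
# per track and ignores tokens consumed by the genre/lyrics prompt.
FRAME_RATE_HZ = 50  # X-Codec frame rate cited in this report


def audio_seconds(context_tokens: int, tracks: int) -> float:
    """Seconds of audio that fit in a context window for a given track count."""
    return context_tokens / (FRAME_RATE_HZ * tracks)


for ctx in (8192, 16384):
    mix = audio_seconds(ctx, tracks=1)
    dual = audio_seconds(ctx, tracks=2)
    print(f"{ctx} tokens: {mix:.1f} s mix-track, {dual:.1f} s dual-track")
```

Under these assumptions the 8192-token window gives about 163.8 s of mix-track or 81.9 s of dual-track audio, matching the figures quoted above; the 16384-token anneal window doubles both.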

Stage-2 LM:

  • 1B-parameter LLaMA2 model (HF reports ~2B for YuE-s2-1B-general).
  • Predicts the residual RVQ codebooks (layers 1–7) conditioned on Stage-1's codebook-0 output, restoring acoustic fidelity that the semantic-rich layer-0 tokens omit.
  • Context length 8192.

Audio tokenizer (X-Codec):

  • YuE uses X-Codec (from the same M-A-P lineage as MERT), a semantic-acoustic fused RVQ codec that bolts a HuBERT-based semantic stream onto an RVQ-VAE acoustic stream.
  • 12 RVQ codebooks total; YuE uses the first 8 (codebook size 1024 each).
  • 50 Hz frame rate over 16 kHz audio.
  • A separate YuE-upsampler (GAN-based) converts the 16 kHz output to a higher sample rate and better fidelity for delivery (paper §3, HF Transformers X-Codec docs).
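
To put the codec numbers in perspective, the nominal bitrate of the token stream can be worked out from the figures above (8 codebooks used, 1024 entries each, 50 Hz). This is a back-of-the-envelope upper bound that treats every code as carrying the full log2(1024) bits; it is not a measurement of X-Codec itself.

```python
import math

FRAME_RATE_HZ = 50     # frames per second, from this report
CODEBOOKS_USED = 8     # YuE uses the first 8 of 12 RVQ codebooks
CODEBOOK_SIZE = 1024   # entries per codebook


def codec_bitrate_bps(frame_rate_hz: int, codebooks: int, codebook_size: int) -> float:
    """Nominal bits per second if every code index carries log2(size) bits."""
    bits_per_code = math.log2(codebook_size)
    return frame_rate_hz * codebooks * bits_per_code


bitrate = codec_bitrate_bps(FRAME_RATE_HZ, CODEBOOKS_USED, CODEBOOK_SIZE)
tokens_per_second = FRAME_RATE_HZ * CODEBOOKS_USED
print(f"{tokens_per_second} codes/s per track, about {bitrate / 1000:.1f} kbps")
```

A nominal rate of roughly 4 kbps per track is far below typical delivery codecs, which is consistent with the tokenizer being named as the fidelity bottleneck elsewhere in this report.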

Track handling: Dual-track. Vocal and accompaniment are tokenized separately via X-Codec, then interleaved in the AR sequence. This is the paper's claimed advantage over single-track-mixture baselines (less information loss, cleaner vocal/inst separation).
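
A minimal sketch of what interleaving two token streams looks like in practice. It assumes frame-level alternation (one vocal token, then one accompaniment token); the exact ordering YuE uses and the token IDs shown are assumptions for illustration, not taken from the repo.

```python
from typing import List, Tuple


def interleave_tracks(vocal: List[int], accomp: List[int]) -> List[int]:
    """Merge two equal-length token streams as v0, a0, v1, a1, ..."""
    if len(vocal) != len(accomp):
        raise ValueError("both tracks must cover the same number of frames")
    sequence: List[int] = []
    for v, a in zip(vocal, accomp):
        sequence.extend((v, a))
    return sequence


def split_tracks(sequence: List[int]) -> Tuple[List[int], List[int]]:
    """Inverse of interleave_tracks: recover (vocal, accompaniment)."""
    if len(sequence) % 2 != 0:
        raise ValueError("interleaved sequence must have an even length")
    return sequence[0::2], sequence[1::2]


vocal_tokens = [101, 102, 103]    # made-up token IDs
accomp_tokens = [201, 202, 203]   # made-up token IDs
mixed = interleave_tracks(vocal_tokens, accomp_tokens)
print(mixed)
print(split_tracks(mixed))
```

Because each frame costs two tokens, the dual-track layout halves the audio duration that fits in a fixed context window, which is the trade-off behind the mix-track versus dual-track durations quoted for the Stage-1 context.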

Max generation length: Up to ~5 minutes per song, generated in chunks/sessions and stitched.

Lyrics conditioning: Plain text lyrics with section tags ([verse], [chorus], etc.) + a genre tag prompt (a vocabulary from top_200_tags.json such as "pop", "female vocal", "energetic", "120 bpm"). The progressive conditioning means each new section re-references the relevant lyric chunk.
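
The per-section chunking described above can be illustrated with a small parser. This is a sketch under the assumption that each section tag sits alone on its own line; the function name and the sample lyrics are invented for illustration, and the code does not reproduce YuE's actual preprocessing.

```python
import re
from typing import List, Tuple

# Matches a line that contains only a tag such as [verse] or [chorus].
SECTION_TAG = re.compile(r"^\[(\w+)\]\s*$")


def split_sections(lyrics: str) -> List[Tuple[str, str]]:
    """Split tagged lyrics into (section_name, text) chunks, in order.

    Lines that appear before the first tag are ignored.
    """
    sections: List[Tuple[str, str]] = []
    current_tag = None
    current_lines: List[str] = []
    for line in lyrics.splitlines():
        match = SECTION_TAG.match(line.strip())
        if match:
            if current_tag is not None:
                sections.append((current_tag, "\n".join(current_lines).strip()))
            current_tag = match.group(1).lower()
            current_lines = []
        elif current_tag is not None:
            current_lines.append(line)
    if current_tag is not None:
        sections.append((current_tag, "\n".join(current_lines).strip()))
    return sections


sample_lyrics = """[verse]
City lights are fading slow
I keep walking, nowhere to go

[chorus]
Hold on, hold on tonight
"""

for name, text in split_sections(sample_lyrics):
    print(name, "->", repr(text))
```

Each chunk can then be paired with the audio segment generated for that section, which is the idea behind re-referencing the relevant lyric chunk as generation progresses.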

Training scale: Stage-1 used ~2T tokens across phases; data includes ~650K hours of in-the-wild music plus ~70K hours of TTS for vocal grounding (paper).


3. Variants and Sizes

From the M-A-P YuE collection on HuggingFace (downloads accurate as of mid-2026):

| Model | Params | Stage | Language | Mode | Downloads (last month) |
| --- | --- | --- | --- | --- | --- |
| YuE-s1-7B-anneal-en-cot | 6B | 1 | English | Chain-of-Thought (default) | 8.48k |
| YuE-s1-7B-anneal-en-icl | 6B | 1 | English | In-Context Learning (style cloning) | 805 |
| YuE-s1-7B-anneal-zh-cot | 6B | 1 | Mandarin/Cantonese | CoT | 203 |
| YuE-s1-7B-anneal-zh-icl | 6B | 1 | Mandarin/Cantonese | ICL | 89 |
| YuE-s1-7B-anneal-jp-kr-cot | 6B | 1 | Japanese/Korean | CoT | 95 |
| YuE-s1-7B-anneal-jp-kr-icl | 6B | 1 | Japanese/Korean | ICL | 25 |
| YuE-s2-1B-general | 2B | 2 | language-agnostic | residual decoder | 6.01k |
| YuE-s1-0.5B | 0.5B | 1 | research/ablation | partial training | 94 |
| YuE-upsampler | – | post | n/a | GAN upsampler | – |
| xcodec_mini_infer | – | tokenizer | n/a | X-Codec encoder/decoder | – |

Naming key:

  • s1 / s2 = Stage-1 (semantic) / Stage-2 (acoustic residual).
  • anneal = checkpoints after the final "annealing" pretraining phase (highest quality public weights).
  • cot = chain-of-thought prompting variant; icl = in-context learning variant (used for style/voice cloning from a reference audio).
  • A community GGUF quantization of the Stage-2 model exists at multimodalart/YuE-s2-1B-general-Q8_0-GGUF, useful for llama.cpp paths on a Mac.
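
The naming key can be applied mechanically when routing between checkpoints. The parser below is a hypothetical helper written for this report, assuming names follow the pattern seen in the table above (YuE, stage, size, optional "anneal", optional variant, optional mode); it is not part of the YuE codebase.

```python
from typing import Dict, Optional, Union

FieldValue = Union[int, str, bool, None]


def parse_checkpoint_name(name: str) -> Dict[str, FieldValue]:
    """Split a name like 'YuE-s1-7B-anneal-en-cot' into its components."""
    parts = name.split("-")
    if len(parts) < 3 or parts[0] != "YuE" or parts[1] not in ("s1", "s2"):
        raise ValueError(f"unrecognized checkpoint name: {name!r}")
    rest = parts[3:]
    anneal = bool(rest) and rest[0] == "anneal"
    if anneal:
        rest = rest[1:]
    mode: Optional[str] = None
    if rest and rest[-1] in ("cot", "icl"):
        mode = rest[-1]
        rest = rest[:-1]
    return {
        "stage": int(parts[1][1:]),                 # 1 = semantic, 2 = acoustic residual
        "size": parts[2],                           # nominal size, e.g. '7B'
        "anneal": anneal,                           # True for annealed checkpoints
        "variant": "-".join(rest) if rest else None,  # 'en', 'zh', 'jp-kr', 'general'
        "mode": mode,                               # 'cot', 'icl', or None
    }


for checkpoint in ("YuE-s1-7B-anneal-jp-kr-icl", "YuE-s2-1B-general", "YuE-s1-0.5B"):
    print(checkpoint, "->", parse_checkpoint_name(checkpoint))
```

Note that the size field is the nominal size from the name; as the table shows, the parameter counts reported in HF metadata differ from it.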

There is no official "YuE-2" or major version bump. The closest thing to a successor is the separately branded ACE-Step, from an adjacent team.


4. License

Apache License 2.0 for code and weights, adopted on 2025-01-30 in response to community pressure (GitHub README news entry, HF model card).

  • Commercial use: Permitted and explicitly encouraged. The model card says: "Artists and content creators are encouraged to sample and incorporate outputs into their own works, and even monetize them, with attribution to the model's name ("YuE by HKUST/M-A-P")."
  • Attribution: Required for public / commercial outputs.
  • Recommended labeling: outputs should be marked "AI-generated", "YuE-generated", "AI-assisted", or "AI-auxiliated".
  • No training-data redistribution clause: Apache 2.0 covers the code and the released weights; the training data itself was not released, so no redistribution permission is granted on data.
  • Liability: users bear sole responsibility for any copyright infringement, plagiarism, or misuse. Outputs likely carry no built-in watermark or content credentials (the docs do not confirm this either way).

Practical takeaway for the user's Suno-like platform: YuE is one of the very few music-generation foundation models with a clean, no-strings commercial license, which is the single most valuable thing about it.


5. Languages Supported

Five officially: English, Mandarin Chinese, Cantonese, Japanese, Korean (GitHub README, demo page).

  • English has the deepest training and the most-downloaded checkpoint.
  • zh covers Mandarin and Cantonese (sharing a checkpoint).
  • jp-kr shares one checkpoint for Japanese and Korean.
  • The demo site shows code-switching (English ↔ Mandarin within the same song) working.
  • No official support for Spanish, French, German, Hindi, Arabic, etc. Outputs in those languages will likely be poor or accented (no direct user reports confirm this, but the model was not trained on them at scale).

6. Quality Assessment

Strengths (from paper + demos):

  • Wide vocal range: the paper reports YuE "closely matching top-performing closed-source systems like Suno V4" on vocal-range metrics (WhiteFiber summary).
  • Strong musical structure: verse/chorus/bridge transitions are coherent over 3–5 min, which most diffusion music models still struggle with.
  • Demos show death-growl metal, scatting jazz, Beijing opera, rap, ballad, country, and soul; the genre breadth is genuinely impressive (map-yue.github.io).
  • ICL mode can clone the timbre/style of a reference clip β€” closest open-source analogue to Suno's "cover" or Udio's style transfer.

Weaknesses (from paper's own discussion + community feedback):

  • Acoustic fidelity gap. Multiple sources, including the paper itself, note "clear deficiencies in vocal and accompaniment acoustic quality, likely due to limitations of its current audio tokenization method"; the authors propose super-resolution / better decoders as future work.
  • Mono / narrow stereo image: third-party reviews call out that output "lacks the production quality needed for commercial music platforms" and is essentially mono (articlex review).
  • Slow inference + structural artifacts: the explicit critique from the ACE-Step authors (ICLR 2026 submission) is that "LLM-based models like YuE excel at lyrics alignment but suffer from slow inference and structural artifacts" (ACE-Step paper).
  • Mumbling / lyric drift in long sections: no Reddit thread documenting this turned up, but the paper's "Section 12 Unsuccessful Attempts" and the emphasis on --repetition-penalty and decoding temperature in the GitHub Issues suggest users hit it.
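
For readers unfamiliar with the decoding knob mentioned above, the sketch below shows the generic repetition-penalty rule used by many sampling libraries: scores of already-generated tokens are pushed down before the next token is sampled. It is a generic illustration with made-up values, not YuE's implementation, and the exact semantics of the repo's --repetition-penalty flag are not verified here.

```python
from typing import Dict, Iterable


def apply_repetition_penalty(
    logits: Dict[int, float],
    generated: Iterable[int],
    penalty: float = 1.1,
) -> Dict[int, float]:
    """Return a copy of logits with previously generated tokens penalized.

    Positive scores are divided by the penalty and non-positive scores are
    multiplied by it, so a penalty above 1.0 always lowers the score.
    """
    if penalty <= 0:
        raise ValueError("penalty must be positive")
    adjusted = dict(logits)
    for token_id in set(generated):
        if token_id in adjusted:
            score = adjusted[token_id]
            adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


example_logits = {1: 2.0, 2: -1.0, 3: 0.5}   # made-up scores
history = [1, 2, 2]                          # tokens already emitted
print(apply_repetition_penalty(example_logits, history, penalty=2.0))
```

Raising the penalty discourages loops of repeated tokens, at the risk of suppressing legitimate repetition such as a recurring chorus.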

Quality verdict vs Suno v4 / v5:

  • Suno v4 ≈ YuE on vocal range and genre breadth.
  • Suno v4/v5 clearly ahead on mix polish, stereo width, vocal clarity, and emotional nuance.
  • YuE ahead of Suno only on openness, controllability via lyrics tags, and structural macro-form for niche genres.

7. Inference Performance

From the README's official hardware table:

| GPU | Time for 30 s of audio (Stage-1 + Stage-2) |
| --- | --- |
| NVIDIA H800 80GB | ~150 s |
| NVIDIA RTX 4090 24GB | ~360 s |
| ≤24GB GPU | Max ~2 concurrent sessions; cannot generate a full song in one pass |
| ≥80GB GPU (H100/A100/H800) | Recommended for a full 4+ session song |

Extrapolating to a 3-minute song (~6× a 30 s clip, plus some overhead for stitching):

  • H800: ~15–18 minutes
  • A100 80GB: ~18–22 minutes (likely, assuming throughput close to the H800's)
  • RTX 4090: ~35–45 minutes
  • M5 Max MPS (user's machine): no official support, no public benchmark.
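
The extrapolation above is linear scaling of the README's 30-second timings. The sketch below reproduces it; the 20% stitching overhead used for the upper bound is an assumption chosen for illustration, not a measured value.

```python
CLIP_SECONDS = 30

# Seconds of compute per 30 s clip, from the README hardware table.
SECONDS_PER_CLIP = {
    "NVIDIA H800 80GB": 150,
    "NVIDIA RTX 4090 24GB": 360,
}


def estimate_minutes(song_seconds: float, seconds_per_clip: float, overhead: float = 0.0) -> float:
    """Linear extrapolation of generation time, with a fractional overhead."""
    clips = song_seconds / CLIP_SECONDS
    return clips * seconds_per_clip * (1.0 + overhead) / 60.0


for gpu, cost in SECONDS_PER_CLIP.items():
    low = estimate_minutes(180, cost)
    high = estimate_minutes(180, cost, overhead=0.2)  # assumed 20% overhead
    slowdown = cost / CLIP_SECONDS
    print(f"{gpu}: {low:.0f}-{high:.0f} min for 3 min of audio ({slowdown:.0f}x slower than real time)")
```

With these inputs the H800 lands at about 15 to 18 minutes and the RTX 4090 at about 36 to 43 minutes, in line with the ranges listed above.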

VRAM: Unquantized FP16 Stage-1 needs ~16–18 GB; Stage-2 + upsampler add ~4–6 GB. Single-pass full-song generation comfortably wants 40–80 GB.
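
A quick sanity check on these VRAM figures is to compute the memory taken by the weights alone at different precisions. The sketch below counts parameters times bytes per parameter and nothing else; KV cache, activations, and framework overhead come on top and are not modeled here.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billion: float, dtype: str) -> float:
    """Weights-only memory in GB (1e9 params x bytes per param / 1e9 bytes per GB)."""
    return params_billion * BYTES_PER_PARAM[dtype]


models = (
    ("Stage-1 (7B nominal)", 7.0),
    ("Stage-1 (6B per HF metadata)", 6.0),
    ("Stage-2 (1B nominal)", 1.0),
)
for label, params in models:
    row = ", ".join(
        f"{dtype}: {weight_memory_gb(params, dtype):.1f} GB" for dtype in BYTES_PER_PARAM
    )
    print(f"{label}: {row}")
```

FP16 weights for a 7B model come to about 14 GB, so the ~16–18 GB figure plausibly reflects cache and activation overhead on top of that; at 8 bits the same weights take about 7 GB, which is consistent with community forks reporting sub-10 GB operation.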

Quantized / community paths:

  • YuEGP ("YuE for the GPU Poor") brings VRAM down to <10 GB via 8-bit quantization and sequential offload (YuEGP repo).
  • YuE-exllamav2 claims up to 5× speedup via ExLlamaV2 + FlashAttention-2 + BF16 (YuE-exllamav2); NVIDIA-only.
  • GGUF Stage-2 exists (multimodalart/YuE-s2-1B-general-Q8_0-GGUF). Stage-1 7B GGUF is not officially published as of 2026-05.

Apple Silicon / MPS:

  • No official MPS support. The GitHub README references --cuda_idx and makes no mention of mps or Mac.
  • No HF Space or fork advertises working MPS inference. The architecture is plain LLaMA2 + standard transformer ops, so an MPS port is technically feasible (Stage-1 likely fits well within the user's 128GB unified memory), but the X-Codec encoder/decoder has Flash-Attention CUDA kernels that would need replacement. Realistic path on M5 Max today: run the Stage-2 GGUF via the llama.cpp Metal backend; Stage-1 has no public Metal/MPS port.
  • No community MPS port has surfaced in any search or GitHub issue as of May 2026.

8. Repo Health

Data from the GitHub API on 2026-05-18 for multimodal-art-projection/YuE:

  • Stars: 6,219
  • Forks: 741
  • Open issues: 86
  • License: Apache-2.0
  • Default branch last push: 2025-06-04T13:08:48Z (~11 months stale)
  • Most-recent commits: all README edits and the finetune-merge PRs on the same day (2025-06-04).
  • Recent issue traffic (sampled 2025-Q4 through 2026-Q2): install errors (CUDA / codecmanipulator missing), ComfyUI integration questions, attention-mask warnings, "how do I generate a full song" basics, a Feb-2026 PR proposing SDPA as default attention that received zero engagement. Maintainer responses are essentially absent in 2026.
  • Fine-tuning support: present, merged June 2025 via PR #126 (LoRA, no QLoRA, requires CUDA 12.1+, PyTorch 2.4, Megatron-formatted JSONL data).
  • vLLM / SGLang: listed in TODO, never implemented.
  • llama.cpp: community Stage-2 GGUF exists but no official integration; Stage-1 not converted.
  • Tensor parallel / Stemgen mode: TODO, never shipped.

Verdict: The repo is in maintenance/abandonment limbo. Apache 2.0 + open weights mean anyone can fork; community forks are where the energy is.


9. Real-World Adoption

  • Replicate: Hosted at fofr/yue with an official cog wrapper at replicate/cog-yue, giving a production-ready pay-per-second API.
  • HuggingFace Spaces: at least three live demos (fffiloni/YuE, innova-ai/YuE-music-generator-demo, Harveyu/YuE-music-generator-demo).
  • ComfyUI: community node smthemex/ComfyUI_YuE exposes YuE as a node graph (issue #148 confirms active users in 2026).
  • Pinokio: one-click Windows installer ships in the official Pinokio script directory (pinokio.co).
  • GPU-poor / consumer forks: deepbeepmeep/YuEGP (sub-10 GB VRAM), sgsdxzy/YuE-exllamav2 (5× speedup), Mozer/YuE-extend (mp3 extension + GUI), Sorrymakershen/YuE-for-windows.
  • SiliconFlow: no public listing found as of 2026-05 (a search returned no SiliconFlow YuE endpoint).
  • Forks: 741 total, dominated by consumer-VRAM optimization rather than research extension.

For a Suno-like platform, the Replicate fofr/yue endpoint is the lowest-friction starting point to test quality before self-hosting.


10. Fine-Tuning

  • LoRA fine-tuning is documented and supported since June 2025, in the finetune/ directory with scripts/preprocess_data.sh and scripts/run_finetune.sh.
  • Configurable LORA_R, LORA_ALPHA, LORA_DROPOUT.
  • Training scripts are open and use a Megatron-style data pipeline; data must be converted to JSONL containing X-Codec tokens + lyric/structure/genre metadata, then to Megatron binary.
  • QLoRA: not documented. No 4-bit fine-tuning path is described in the official repo (community forks may have hacked one together).
  • Requires CUDA 12.1+, PyTorch 2.4, Python 3.10; GPU memory is not explicitly stated, but a 7B Stage-1 LoRA realistically wants ≥40 GB VRAM.
  • No published guide for full-parameter fine-tuning of Stage-1; it is implied to need multi-node H100.
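
To give a feel for what LORA_R and LORA_ALPHA control, the sketch below computes how many trainable parameters a LoRA adapter adds and the scaling factor applied to its update. The layer dimensions and the choice of two adapted projections per layer are hypothetical, LLaMA2-7B-like values picked for illustration; which modules YuE's scripts actually adapt is not stated in this report.

```python
# Hypothetical model shape (LLaMA2-7B-like); not read from YuE's config.
HIDDEN_SIZE = 4096
NUM_LAYERS = 32
ADAPTED_MATRICES_PER_LAYER = 2  # assumption: two square projections per layer


def lora_extra_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def lora_scaling(alpha: float, rank: int) -> float:
    """Factor applied to the low-rank update B @ A in standard LoRA."""
    return alpha / rank


LORA_R = 64      # example value
LORA_ALPHA = 32  # example value

per_matrix = lora_extra_params(HIDDEN_SIZE, HIDDEN_SIZE, LORA_R)
total = per_matrix * ADAPTED_MATRICES_PER_LAYER * NUM_LAYERS
print(f"per matrix: {per_matrix:,} params, total: {total:,} params")
print(f"update scaling: {lora_scaling(LORA_ALPHA, LORA_R)}")
```

Under these assumptions the adapter trains about 33.6 million parameters, well under 1% of a 7B base model, which is why LoRA fits on a single large GPU where full-parameter fine-tuning would not.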

11. Pros and Cons

Pros

  • True open weights (Apache 2.0), commercial-use-friendly, with attribution as the only stated requirement.
  • Genuine dual-track output (vocals + instrumentals as separable streams), not just a mix.
  • Multilingual coverage of EN / ZH / Cantonese / JP / KR with code-switching demos.
  • Strong macro-structure for 3–5 minute songs: verses, choruses, bridges hold together.
  • Healthy ecosystem of quantized / consumer-VRAM forks and a turnkey Replicate endpoint.
  • LoRA fine-tuning code is shipped and merged.
  • Comparable vocal range to Suno v4 on the paper's metrics.

Cons

  • Repo is effectively dormant since June 2025, with no maintainer engagement on 2026 issues/PRs.
  • Acoustic fidelity is noticeably below Suno v4/v5: mono-ish, less polished mix, occasional vocal artifacts/mumbling on long passages.
  • No MPS / Apple Silicon support, official or community, which is a real problem for the user's M5 Max workflow.
  • Slow inference even on H800 (~150 s per 30 s clip, i.e. 15+ minutes per full song before quantization).
  • VRAM hungry: full-song single-pass wants 80 GB; consumer GPUs need session-stitching tricks.
  • No QLoRA / no vLLM / no SGLang / no tensor parallel: all in TODO purgatory.
  • Training data not released, so fine-tuning requires bringing your own licensed corpus.
  • Tokenizer (X-Codec) is the bottleneck for fidelity, and YuE inherits this ceiling; no upgrade path is planned in this codebase.
  • An explicit successor effort (ACE-Step) from an adjacent team claims to fix YuE's specific weaknesses.

12. Verdict for the User's Suno-like Platform

Best fit for the user's M5 Max / 128 GB platform if:

  • The product needs commercial-grade licensing freedom above all else. YuE is one of the very few open music models you can ship in a paid product without licensing carve-outs.
  • You target multilingual song generation (EN + Mandarin/Cantonese + JP/KR) with code-switching; YuE is the strongest open option here.
  • You can offload generation to a rented H100/H800 (Replicate, Runpod, Lambda) rather than insisting on local M5 Max inference; the lack of MPS support is the blocker on the user's hardware.
  • You want a base to LoRA fine-tune on a proprietary genre/voice corpus: the official fine-tune scripts work today, and Apache 2.0 lets you keep your LoRA private and commercial.

Where YuE will underperform competitors:

  • Acoustic polish: Suno v4/v5 and Udio will sound noticeably more professional out of the box. If your platform's selling point is "studio-quality vocals", YuE is not there.
  • Throughput per dollar: diffusion-based ACE-Step and DiffRhythm-2 are dramatically faster (ACE-Step claims ~15× speedup); for a high-volume product, the AR-LLM architecture is expensive.
  • Real-time / interactive generation: not viable; YuE is batch-only.
  • Local Mac inference: until somebody ports Stage-1 to MPS or ships a Stage-1 GGUF, the user's M5 Max can at best play around with the Stage-2 model in llama.cpp Metal mode.

Concrete recommendation for the user: use YuE via Replicate's fofr/yue endpoint as the commercial-license-clean fallback / multilingual specialist in the platform's model router, and seriously evaluate ACE-Step in parallel for the throughput-sensitive default path. Plan a future LoRA fine-tune on YuE only after the platform has clear vertical (genre, language, or vocal-style) demand that the closed APIs cannot serve.


References