# TheArtist Music Transformer: Phase 0 (Pop Baseline)

A pop-pretrained chord-generation model, and the starting point from which all five jazz fine-tunes resume. Released as the no-jazz reference for measuring catastrophic forgetting in the companion paper.

This checkpoint is one of six released alongside the paper *Empirical Study of Pop and Jazz Mix Ratios for Genre-Adaptive Chord Generation* (Lee, 2026). This repository, PearlLeeStudio/TheArtist-MusicTransformer-pop-baseline, serves as the collection landing page; the other checkpoints live in the per-checkpoint repositories listed in the paper. See the paper for the experimental design that motivates this set of checkpoints.

## Model summary

| Field | Value |
|---|---|
| Architecture | Music Transformer with relative positional attention |
| Parameters | 25,661,440 |
| Vocabulary size | 351 tokens |
| Max sequence length | 256 |
| d_model / heads / FFN / layers | 512 / 8 / 2048 / 8 |
| Training framework | PyTorch 2.5+ (CUDA 12.1) |
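
The reported parameter count is roughly consistent with the configuration above. The back-of-envelope check below is an approximation only (it ignores layer norms, biases, and the relative-position embedding tables, and assumes untied input/output embeddings), so it lands near, not exactly on, 25,661,440:

```python
# Approximate parameter count from the table above (assumption: untied
# input/output embeddings; layer norms, biases, and rel-pos tables omitted).
vocab, d_model, d_ff, n_layers = 351, 512, 2048, 8

embed = vocab * d_model       # input token embedding
attn = 4 * d_model * d_model  # Q, K, V, and output projections
ffn = 2 * d_model * d_ff      # two feed-forward matrices
head = d_model * vocab        # output projection to vocab logits

total = embed + n_layers * (attn + ffn) + head
print(f"{total:,}")  # 25,525,248 -- the remaining ~136K parameters would be
                     # rel-pos tables, norms, and biases (assumption)
```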

## Training data

Trained from scratch on the pop training split: Chordonomicon (679K songs) and McGill Billboard (890 songs), deduplicated and twelve-key-augmented. Three epochs at peak learning rate 3 × 10⁻⁴ with one-epoch warmup and cosine decay. Wall-clock time ≈27 hours on a single NVIDIA RTX 4070 Mobile.
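
For readers reproducing the schedule, here is a minimal sketch: linear warmup over the first epoch, then cosine decay over the remaining two. The optimizer choice (AdamW) and the `steps_per_epoch` value are assumptions, not from the paper; only the peak LR (3e-4) and the epoch split come from the text above.

```python
import math
import torch

steps_per_epoch = 10_000                 # hypothetical value
warmup_steps = steps_per_epoch           # one-epoch warmup
total_steps = 3 * steps_per_epoch        # three epochs overall

params = [torch.nn.Parameter(torch.zeros(1))]  # stand-in for model.parameters()
optimizer = torch.optim.AdamW(params, lr=3e-4)

def lr_lambda(step: int) -> float:
    if step < warmup_steps:
        return step / warmup_steps                      # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay to 0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```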

## Evaluation (held-out per-genre test sets)

| Metric | Pop test | Jazz test |
|---|---|---|
| Top-1 accuracy | 84.24% | 72.86% |
| Top-5 accuracy | 97.10% | 86.51% |
| Perplexity | 1.73 | 4.01 |

The 72.86% jazz top-1 from a pop-only model reflects the substantial token overlap between the two genres. Pop and jazz share most of their chord tokens; jazz-specific gains in our fine-tuned checkpoints come from learning the transition statistics over those tokens, not from learning new tokens.
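
The metrics above follow standard definitions. A minimal sketch of how they can be computed, assuming logits of shape (batch, seq, vocab) and pad-masked integer targets (the paper's evaluation code may differ in detail):

```python
import torch
import torch.nn.functional as F

def topk_accuracy(logits, targets, k, pad_id):
    # Fraction of non-pad positions where the target is in the top-k logits.
    mask = targets != pad_id
    topk = logits.topk(k, dim=-1).indices              # (batch, seq, k)
    hit = (topk == targets.unsqueeze(-1)).any(dim=-1)  # (batch, seq)
    return hit[mask].float().mean().item()

def perplexity(logits, targets, pad_id):
    # Perplexity = exp(mean cross-entropy over non-pad positions).
    loss = F.cross_entropy(
        logits.flatten(0, 1), targets.flatten(), ignore_index=pad_id
    )
    return loss.exp().item()
```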

## Intended use and limitations

This checkpoint is the unmodified pop baseline. It is the most pop-fluent model in the collection and the most jazz-naive. Use it when pop output is the only target. For balanced pop and jazz output, use F3 (ft-pop50). For pop-leaning output that still includes occasional jazz coloration, use F1 (ft-pop80). For jazz-leaning output, use F4 (ft-pop29).

Out of scope:

- Melody or audio generation (this is a symbolic chord-only model).
- Genres outside pop, rock, and jazz (out-of-distribution behavior has not been characterized).
- Real-time low-latency settings (use batched inference instead).

## Usage

```python
import torch
from huggingface_hub import hf_hub_download
from model import MusicTransformer
from tokenizer import ChordTokenizer

# Download the checkpoint from the Hub.
ckpt_path = hf_hub_download(
    repo_id="PearlLeeStudio/TheArtist-MusicTransformer-pop-baseline",
    filename="best.pt",
)
tokenizer = ChordTokenizer()
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)

# Instantiate the architecture with the configuration from the table above,
# then load the trained weights.
model = MusicTransformer(
    vocab_size=tokenizer.vocab_size,
    d_model=512, n_heads=8, d_ff=2048, n_layers=8,
    max_seq_len=256, dropout=0.0, pad_id=tokenizer.pad_id,
)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()
```

`model.py` and `tokenizer.py` are included in this repository alongside `best.pt`.
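
Once loaded, the model can be sampled autoregressively. The loop below is a hypothetical sketch: it assumes the forward pass returns logits of shape (batch, seq, vocab) and that `ChordTokenizer` exposes `bos_id` and `decode()`; check `model.py` and `tokenizer.py` for the actual interfaces.

```python
import torch

# Hypothetical sampling loop (forward signature and tokenizer attributes
# are assumptions -- verify against model.py and tokenizer.py).
seq = torch.tensor([[tokenizer.bos_id]])
with torch.no_grad():
    for _ in range(32):                   # generate 32 chord tokens
        logits = model(seq)               # (1, len, vocab)
        probs = torch.softmax(logits[0, -1] / 0.9, dim=-1)  # temperature 0.9
        next_id = torch.multinomial(probs, 1)
        seq = torch.cat([seq, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(seq[0].tolist()))
```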

## Training-data licenses

| Dataset | License |
|---|---|
| Chordonomicon | Public (user-generated) |
| McGill Billboard | CC0 |

## Citation

Preprint: arXiv:2605.04998.

```bibtex
@misc{lee2026chordmix,
  title         = {Empirical Study of Pop and Jazz Mix Ratios for Genre-Adaptive Chord Generation},
  author        = {Lee, Jinju},
  year          = {2026},
  eprint        = {2605.04998},
  archivePrefix = {arXiv}
}
```