# TheArtist Music Transformer: Phase 0 (Pop Baseline)

A pop-pretrained chord-generation model, and the starting point from which all five jazz fine-tunes resume. Released as the no-jazz reference for measuring catastrophic forgetting in the companion paper.

This checkpoint is one of six released alongside the paper *Empirical Study of Pop and Jazz Mix Ratios for Genre-Adaptive Chord Generation* (Lee, 2026). This repository, PearlLeeStudio/TheArtist-MusicTransformer-pop-baseline, serves as the collection landing page; the other checkpoints live in the per-checkpoint repositories listed in the paper. See the paper for the experimental design that motivates this set of checkpoints.

## Model summary

| Field | Value |
|---|---|
| Architecture | Music Transformer with relative positional attention |
| Parameters | 25,661,440 |
| Vocabulary size | 351 tokens |
| Max sequence length | 256 |
| d_model / heads / FFN / layers | 512 / 8 / 2048 / 8 |
| Training framework | PyTorch 2.5+ (CUDA 12.1) |
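
The reported parameter count is roughly consistent with the configuration above. The back-of-envelope check below is an approximation only (it ignores layer norms, biases, and the relative-position embedding tables, and assumes untied input/output embeddings), so it lands near, not exactly on, 25,661,440:

```python
# Approximate parameter count from the table above (assumption: untied
# input/output embeddings; layer norms, biases, and rel-pos tables omitted).
vocab, d_model, d_ff, n_layers = 351, 512, 2048, 8

embed = vocab * d_model       # input token embedding
attn = 4 * d_model * d_model  # Q, K, V, and output projections
ffn = 2 * d_model * d_ff      # two feed-forward matrices
head = d_model * vocab        # output projection to vocab logits

total = embed + n_layers * (attn + ffn) + head
print(f"{total:,}")  # 25,525,248 -- the remaining ~136K parameters would be
                     # rel-pos tables, norms, and biases (assumption)
```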

## Training data

Trained from scratch on the pop training split: Chordonomicon (679K songs) and McGill Billboard (890 songs), deduplicated and twelve-key-augmented. Three epochs at peak learning rate 3 × 10⁻⁴ with one-epoch warmup and cosine decay. Wall-clock time ≈27 hours on a single NVIDIA RTX 4070 Mobile.
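
For readers reproducing the schedule, here is a minimal sketch: linear warmup over the first epoch, then cosine decay over the remaining two. The optimizer choice (AdamW) and the `steps_per_epoch` value are assumptions, not from the paper; only the peak LR (3e-4) and the epoch split come from the text above.

```python
import math
import torch

steps_per_epoch = 10_000                 # hypothetical value
warmup_steps = steps_per_epoch           # one-epoch warmup
total_steps = 3 * steps_per_epoch        # three epochs overall

params = [torch.nn.Parameter(torch.zeros(1))]  # stand-in for model.parameters()
optimizer = torch.optim.AdamW(params, lr=3e-4)

def lr_lambda(step: int) -> float:
    if step < warmup_steps:
        return step / warmup_steps                      # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay to 0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```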

## Evaluation (held-out per-genre test sets)

| Metric | Pop test | Jazz test |
|---|---|---|
| Top-1 accuracy | 84.24% | 72.86% |
| Top-5 accuracy | 97.10% | 86.51% |
| Perplexity | 1.73 | 4.01 |

The 72.86% jazz top-1 from a pop-only model reflects the substantial token overlap between the two genres. Pop and jazz share most of their chord tokens; jazz-specific gains in our fine-tuned checkpoints come from learning the transition statistics over those tokens, not from learning new tokens.
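
The metrics above follow standard definitions. A minimal sketch of how they can be computed, assuming logits of shape (batch, seq, vocab) and pad-masked integer targets (the paper's evaluation code may differ in detail):

```python
import torch
import torch.nn.functional as F

def topk_accuracy(logits, targets, k, pad_id):
    # Fraction of non-pad positions where the target is in the top-k logits.
    mask = targets != pad_id
    topk = logits.topk(k, dim=-1).indices              # (batch, seq, k)
    hit = (topk == targets.unsqueeze(-1)).any(dim=-1)  # (batch, seq)
    return hit[mask].float().mean().item()

def perplexity(logits, targets, pad_id):
    # Perplexity = exp(mean cross-entropy over non-pad positions).
    loss = F.cross_entropy(
        logits.flatten(0, 1), targets.flatten(), ignore_index=pad_id
    )
    return loss.exp().item()
```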

## Intended use and limitations

This checkpoint is the unmodified pop baseline. It is the most pop-fluent model in the collection and the most jazz-naive. Use it when pop output is the only target. For balanced pop and jazz output, use F3 (ft-pop50). For pop-leaning output that still includes occasional jazz coloration, use F1 (ft-pop80). For jazz-leaning output, use F4 (ft-pop29).

Out of scope:

- Melody or audio generation (this is a symbolic chord-only model).
- Genres outside pop, rock, and jazz (out-of-distribution behavior has not been characterized).
- Real-time low-latency settings (use batched inference instead).

## Usage

```python
import torch
from huggingface_hub import hf_hub_download
from model import MusicTransformer
from tokenizer import ChordTokenizer

# Download the checkpoint from the Hub.
ckpt_path = hf_hub_download(
    repo_id="PearlLeeStudio/TheArtist-MusicTransformer-pop-baseline",
    filename="best.pt",
)
tokenizer = ChordTokenizer()
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)

# Instantiate the architecture with the configuration from the table above,
# then load the trained weights.
model = MusicTransformer(
    vocab_size=tokenizer.vocab_size,
    d_model=512, n_heads=8, d_ff=2048, n_layers=8,
    max_seq_len=256, dropout=0.0, pad_id=tokenizer.pad_id,
)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()
```

`model.py` and `tokenizer.py` are included in this repository alongside `best.pt`.
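
Once loaded, the model can be sampled autoregressively. The loop below is a hypothetical sketch: it assumes the forward pass returns logits of shape (batch, seq, vocab) and that `ChordTokenizer` exposes `bos_id` and `decode()`; check `model.py` and `tokenizer.py` for the actual interfaces.

```python
import torch

# Hypothetical sampling loop (forward signature and tokenizer attributes
# are assumptions -- verify against model.py and tokenizer.py).
seq = torch.tensor([[tokenizer.bos_id]])
with torch.no_grad():
    for _ in range(32):                   # generate 32 chord tokens
        logits = model(seq)               # (1, len, vocab)
        probs = torch.softmax(logits[0, -1] / 0.9, dim=-1)  # temperature 0.9
        next_id = torch.multinomial(probs, 1)
        seq = torch.cat([seq, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(seq[0].tolist()))
```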

## Training-data licenses

| Dataset | License |
|---|---|
| Chordonomicon | Public (user-generated) |
| McGill Billboard | CC0 |

## Citation

Preprint: arXiv:2605.04998.

```bibtex
@misc{lee2026chordmix,
  title         = {Empirical Study of Pop and Jazz Mix Ratios for Genre-Adaptive Chord Generation},
  author        = {Lee, Jinju},
  year          = {2026},
  eprint        = {2605.04998},
  archivePrefix = {arXiv}
}
```