TheArtist Music Transformer — Phase 0 (Pop Baseline)
Pop-pretrained chord generation model. The starting point that all five jazz fine-tunes resume from. Released as the no-jazz reference for measuring catastrophic forgetting in the companion paper.
This checkpoint is one of six released alongside the paper Empirical Study of Pop and Jazz Mix Ratios for Genre-Adaptive Chord Generation (Lee, 2026). This repository, PearlLeeStudio/TheArtist-MusicTransformer-pop-baseline, serves as the collection landing page; the remaining checkpoints live in the per-checkpoint repositories listed in the paper. See the paper for the experimental design that motivates this set of checkpoints.
Model summary
| Field | Value |
|---|---|
| Architecture | Music Transformer with relative positional attention |
| Parameters | 25,661,440 |
| Vocabulary size | 351 tokens |
| Max sequence length | 256 |
| d_model / heads / FFN / layers | 512 / 8 / 2048 / 8 |
| Training framework | PyTorch 2.5+ (CUDA 12.1) |
Training data
Trained from scratch on the pop training split: Chordonomicon (679K songs) and McGill Billboard (890 songs), deduplicated and twelve-key-augmented. Three epochs at peak learning rate 3 × 10⁻⁴ with one-epoch warmup and cosine decay. Wall-clock time ≈27 hours on a single NVIDIA RTX 4070 Mobile.
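The schedule above (linear warmup for one epoch, then cosine decay) can be sketched as a step-indexed function. This is a minimal illustration of the stated hyperparameters, not the paper's exact implementation; the decay floor and step granularity are assumptions.

```python
import math

def lr_at(step: int, total_steps: int, warmup_steps: int, peak_lr: float = 3e-4) -> float:
    """Linear warmup to peak_lr over warmup_steps, then cosine decay to zero.
    Sketch of the schedule described in the card; the paper's exact
    implementation (e.g. a nonzero floor) may differ."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With one-epoch warmup over three epochs, `warmup_steps` would be one third of `total_steps`; the learning rate peaks at the warmup boundary and reaches zero at the final step.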
Evaluation (held-out per-genre test sets)
| Metric | Pop test | Jazz test |
|---|---|---|
| Top-1 accuracy | 84.24% | 72.86% |
| Top-5 accuracy | 97.10% | 86.51% |
| Perplexity | 1.73 | 4.01 |
The 72.86% jazz top-1 from a pop-only model reflects the substantial token overlap between the two genres. Pop and jazz share most of their chord tokens; jazz-specific gains in our fine-tuned checkpoints come from learning the transition statistics over those tokens, not from learning new tokens.
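For interpreting the perplexity rows above: perplexity is conventionally the exponential of the mean per-token cross-entropy (in nats). Assuming the paper uses this standard definition, the pop and jazz figures can be converted back to loss values:

```python
import math

def perplexity(mean_ce: float) -> float:
    """Perplexity as exp of mean per-token cross-entropy in nats
    (the conventional definition; assumed to match the paper's metric)."""
    return math.exp(mean_ce)

# Pop-test perplexity 1.73 corresponds to a mean cross-entropy of
# ln(1.73) ≈ 0.55 nats per token; jazz-test 4.01 to ln(4.01) ≈ 1.39.
pop_ce = math.log(1.73)
jazz_ce = math.log(4.01)
```

The roughly 2.3× perplexity gap thus reflects about 0.84 extra nats of uncertainty per chord token on jazz input.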
Intended use and limitations
This checkpoint is the unmodified pop baseline. It is the most pop-fluent model in the collection and the most jazz-naive. Use it when pop output is the only target. For balanced pop and jazz output, use F3 (ft-pop50). For pop-leaning output that still includes occasional jazz coloration, use F1 (ft-pop80). For jazz-leaning output, use F4 (ft-pop29).
Out of scope: melody or audio generation (symbolic chord-only model); genres outside pop, rock, and jazz (out-of-distribution behavior not characterized); real-time low-latency settings (use batched inference).
Usage
```python
import torch
from huggingface_hub import hf_hub_download

from model import MusicTransformer
from tokenizer import ChordTokenizer

# Download the checkpoint from the Hub.
ckpt_path = hf_hub_download(
    repo_id="PearlLeeStudio/TheArtist-MusicTransformer-pop-baseline",
    filename="best.pt",
)

tokenizer = ChordTokenizer()
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)

# Hyperparameters match the model summary table above.
model = MusicTransformer(
    vocab_size=tokenizer.vocab_size,
    d_model=512, n_heads=8, d_ff=2048, n_layers=8,
    max_seq_len=256, dropout=0.0, pad_id=tokenizer.pad_id,
)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()
```
`model.py` and `tokenizer.py` are included in this repository alongside `best.pt`.
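Once loaded, the model can generate continuations autoregressively. The sketch below shows top-k sampling; `sample_chords` is a hypothetical helper, and it assumes `model(ids)` returns logits of shape `(batch, seq, vocab)` — check the forward signature in the repository's `model.py`, which may differ.

```python
import torch

@torch.no_grad()
def sample_chords(model, prompt_ids, max_new_tokens=32, top_k=8,
                  temperature=1.0, max_seq_len=256):
    """Top-k autoregressive sampling sketch.

    Assumes `model(ids)` returns logits of shape (batch, seq, vocab);
    the repo's MusicTransformer forward signature may differ.
    """
    ids = prompt_ids.clone()
    for _ in range(max_new_tokens):
        # Crop to the model's 256-token context window.
        logits = model(ids[:, -max_seq_len:])
        step_logits = logits[:, -1, :] / temperature
        # Restrict sampling to the top-k most likely next tokens.
        vals, idx = torch.topk(step_logits, top_k, dim=-1)
        probs = torch.softmax(vals, dim=-1)
        next_id = idx.gather(-1, torch.multinomial(probs, 1))
        ids = torch.cat([ids, next_id], dim=-1)
    return ids
```

A prompt encoded with the repository's `ChordTokenizer` would be passed as `prompt_ids` and the result decoded back to chord symbols; the encode/decode method names are repo-specific.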
Training-data licenses
| Dataset | License |
|---|---|
| Chordonomicon | Public (user-generated) |
| McGill Billboard | CC0 |
Citation
Preprint: arXiv:2605.04998.
```bibtex
@misc{lee2026chordmix,
  title         = {Empirical Study of Pop and Jazz Mix Ratios for Genre-Adaptive Chord Generation},
  author        = {Lee, Jinju},
  year          = {2026},
  eprint        = {2605.04998},
  archivePrefix = {arXiv}
}
```