	---
	license: cc-by-nc-4.0
	library_name: pytorch
	tags:
	- music
	- music-generation
	- chord-generation
	- symbolic-music
	- music-transformer
	- jazz
	- pop
	language:
	- en
	---

	# TheArtist Music Transformer — F2 (Pop 5K Mix)

	Jazz-adapted chord model with a 5,000-sequence pop rehearsal buffer. Calibration point that the paper finds is dominated by F3 on every axis.

	One of six checkpoints released alongside the paper *Empirical Study of Pop and Jazz Mix Ratios for Genre-Adaptive Chord Generation* (Lee, 2026). See the collection overview at `PearlLeeStudio/TheArtist-MusicTransformer-pop-baseline`.

	## Demo

	[Watch TheArtist in action on YouTube](https://youtu.be/iSDj4VK9nKc) — interactive staff editor, MIDI input, AI generation with live progress, and per-genre LoRA playback across the 13-genre vocabulary.

	## Model summary

	| Field | Value |
	|---|---|
	| Architecture | Music Transformer with relative positional attention |
	| Parameters | 25,661,440 |
	| Vocabulary size | 351 tokens |
	| Max sequence length | 256 |
	| d_model / heads / FFN / layers | 512 / 8 / 2048 / 8 |
	| Fine-tune resumed from | Phase 0 pop baseline |
	| Best epoch | 4 |

	## Training data

	All 1,513 jazz training sequences plus 5,000 pop rehearsal sequences (seed 42). Pop:jazz ≈ 3.3:1.
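
The mix described above can be reproduced in outline as follows. This is a minimal sketch, not the project's actual data pipeline: the integer IDs stand in for real chord sequences, and the pop pool size is an assumed placeholder. Only the two counts and the seed come from this card.

```python
import random

N_JAZZ = 1_513   # all jazz training sequences (from this card)
N_POP = 5_000    # pop rehearsal buffer size (from this card)
SEED = 42        # seed quoted in this card

# Hypothetical stand-ins for the real corpora: integer IDs instead of sequences.
jazz_train = list(range(N_JAZZ))
pop_pool = list(range(100_000, 120_000))  # assumed pool size, illustration only

rng = random.Random(SEED)
pop_rehearsal = rng.sample(pop_pool, N_POP)  # draw without replacement
mixed = jazz_train + pop_rehearsal
rng.shuffle(mixed)

ratio = len(pop_rehearsal) / len(jazz_train)
print(f"pop:jazz = {ratio:.1f}:1 over {len(mixed)} sequences")
```

Seeding a dedicated `random.Random` instance, rather than the global generator, keeps the draw repeatable regardless of other random calls in the same process.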

	## Evaluation (held-out per-genre test sets)

	| Metric | Pop test | Jazz test |
	|---|---:|---:|
	| Top-1 accuracy | 84.07% | 79.90% |
	| Top-5 accuracy | 97.04% | 92.14% |
	| Perplexity | 1.75 | 2.33 |
	| Δ vs. Phase 0 baseline | −0.17 | +7.04 |

	F2 is dominated by F3 on every axis. It is released for reproducibility of the saturation curve described in the paper (see paper §6.1, §7.3) but is not the recommended choice for any operating point. Prefer F3 for the balanced setting, F1 for pop-leaning, or F4 for jazz-leaning.
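
To relate the perplexity row to a loss value, the sketch below assumes the conventional definition, perplexity = exp(mean token-level cross-entropy in nats). The card does not state which definition the paper uses, so treat the converted numbers as approximate.

```python
import math

# Perplexity values copied from the evaluation table.
perplexity = {"pop": 1.75, "jazz": 2.33}

# Assumption: perplexity = exp(mean cross-entropy in nats), so CE = ln(perplexity).
mean_ce_nats = {split: math.log(ppl) for split, ppl in perplexity.items()}

for split, ce in mean_ce_nats.items():
    print(f"{split}: perplexity {perplexity[split]:.2f} ~ {ce:.3f} nats/token")
```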

	## Intended use

	Reference checkpoint for replication and saturation-curve analysis. Not recommended as a default for chord-composition workflows.

	## Usage

	The repo bundles the project's `model.py` and `tokenizer.py` at the repo
	root, so external users can load the checkpoint end-to-end without
	cloning anything from GitHub. `snapshot_download` materializes the full
	repo on disk; `sys.path` makes the bundled `model.py` / `tokenizer.py`
	importable.

	Required dependencies: `torch`, `huggingface_hub`.

	```python
	import sys
	import torch
	from huggingface_hub import snapshot_download

	# Download the full repo (model.py, tokenizer.py, best.pt, config.json).
	ckpt_dir = snapshot_download(repo_id="PearlLeeStudio/TheArtist-MusicTransformer-ft-pop67")
	sys.path.insert(0, ckpt_dir) # so the next two imports resolve

	from model import MusicTransformer
	from tokenizer import ChordTokenizer

	tokenizer = ChordTokenizer()
	ckpt = torch.load(f"{ckpt_dir}/best.pt", map_location="cpu", weights_only=False)
	model = MusicTransformer(
	    vocab_size=tokenizer.vocab_size,
	    d_model=512, n_heads=8, d_ff=2048, n_layers=8,
	    max_seq_len=256, dropout=0.0, pad_id=tokenizer.pad_id,
	)
	model.load_state_dict(ckpt["model_state_dict"])
	model.eval()

	# Prompt = ii-V-I in C major; ask for a pop-flavoured continuation.
	song = {
	    "key": "Cmaj", "time_signature": "4/4", "genre": "pop",
	    "bars": [["Dm7", "G7"], ["Cmaj7"]],
	}
	prompt_ids = tokenizer.encode_sequence(song)[:-1]  # drop the final token so generation can continue
	ids = torch.tensor([prompt_ids])
	with torch.no_grad():
	    for _ in range(32):  # generate at most 32 new tokens
	        logits = model(ids)
	        # Sample the next token from the last position at temperature 0.8.
	        next_id = torch.multinomial(
	            torch.softmax(logits[:, -1, :] / 0.8, dim=-1), 1,
	        )
	        ids = torch.cat([ids, next_id], dim=-1)
	        if next_id.item() == tokenizer.eos_id:
	            break
	print(tokenizer.decode(ids[0].tolist()))
	```
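
The snippet above divides the logits by 0.8 before the softmax. As a rough illustration of what that temperature does, here is a plain-Python sketch that needs no PyTorch. The logits are made-up toy values and `softmax_with_temperature` is a hypothetical helper, not part of the bundled `model.py`.

```python
import math
import random

def softmax_with_temperature(logits, temperature=0.8):
    """Return softmax(logits / temperature) as a list of probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four tokens

plain = softmax_with_temperature(toy_logits, temperature=1.0)
sharp = softmax_with_temperature(toy_logits, temperature=0.8)

# Draw one token index from the sharpened distribution (seeded for repeatability).
rng = random.Random(0)
next_index = rng.choices(range(len(toy_logits)), weights=sharp, k=1)[0]

print([round(p, 3) for p in plain])
print([round(p, 3) for p in sharp])
```

A temperature below 1 concentrates probability on the highest-scoring tokens, which tends to give more conservative continuations. Values above 1 flatten the distribution.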

	For per-genre adaptation beyond pop and jazz, see the 11 LoRA adapter
	repos at [PearlLeeStudio](https://huggingface.co/PearlLeeStudio) — they
	chain on top of this base.

	<!-- real-song-eval:begin -->
	## Per-genre real-song eval (held-out 130-song set, 2026-05)

	First per-genre evaluation of ft-pop67 beyond the pop/jazz split that the original paper reports.

	### Eval results

	| Genre | n_songs | Top-1 (%) | Top-5 (%) | val_loss |
	|---|---:|---:|---:|---:|
	| pop | 10 | 86.23 | 95.70 | 0.5810 |
	| rock | 10 | 87.04 | 96.90 | 0.4660 |
	| jazz | 10 | 69.43 | 85.73 | 1.3464 |
	| blues | 10 | 83.11 | 93.35 | 0.7935 |
	| bossa | 10 | 82.44 | 95.04 | 0.7304 |
	| classical | 10 | 49.84 | 83.17 | 2.0839 |
	| country | 10 | 85.80 | 97.25 | 0.5200 |
	| electronic | 10 | 86.81 | 97.49 | 0.5097 |
	| folk | 10 | 84.60 | 98.06 | 0.5295 |
	| funk | 10 | 83.76 | 95.61 | 0.6983 |
	| gospel | 10 | 80.47 | 95.83 | 0.7463 |
	| hip_hop | 10 | 90.49 | 97.62 | 0.3990 |
	| rnb_soul | 10 | 84.66 | 96.40 | 0.5954 |

	On this eval set, F2 peaks on hip_hop (90.49% top-1) and struggles most on classical (49.84% top-1).
	Treat these numbers as an auxiliary signal: the 11 per-genre LoRAs ([sister `lora-*` repos](https://huggingface.co/PearlLeeStudio)) are the recommended path for production use on the 9 non-pop, non-jazz genres. F-series rows for those genres show what the base model produces under `[GENRE:none]` conditioning, because the F-series vocabulary (351 tokens) has no `[GENRE:X]` token for the 9 new genres.
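
For a quick summary of the table, the following sketch recomputes the best and worst genres and an unweighted (macro) average of top-1 accuracy. The values are copied from the table above. The macro average is an added convenience statistic, not a figure reported by the paper.

```python
# Top-1 accuracy (%) per genre, copied from the eval results table.
top1 = {
    "pop": 86.23, "rock": 87.04, "jazz": 69.43, "blues": 83.11,
    "bossa": 82.44, "classical": 49.84, "country": 85.80,
    "electronic": 86.81, "folk": 84.60, "funk": 83.76,
    "gospel": 80.47, "hip_hop": 90.49, "rnb_soul": 84.66,
}

best = max(top1, key=top1.get)
worst = min(top1, key=top1.get)
macro_avg = sum(top1.values()) / len(top1)  # every genre weighted equally
spread = top1[best] - top1[worst]           # gap in percentage points

print(f"best={best}, worst={worst}, macro avg={macro_avg:.2f}%, spread={spread:.2f} pp")
```

Because every genre contributes exactly 10 songs, the macro average weights genres equally by song count, though not necessarily by token count.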

	### Eval dataset composition

	130 songs total, 10 per genre × 13 genres. Drawn from the same `splits/val.jsonl` + `splits/test.jsonl` partitions every F-series model was held out from during training — no train-set leakage. Built by `ai/training/build_eval_real_songs.py --seed 42 --per-genre 10` (deterministic).

	| Genre | n | Source(s) | Bar range | Avg duration · named |
	|---|---:|---|---|---|
	| pop | 10 | billboard | 58–116 | 189s · 10/10 named |
	| rock | 10 | chordonomicon_rock | 52–87 | 127s · 0/10 named |
	| jazz | 10 | choco:jazz-corpus, choco:real-book, jazzstandards, jht | 16–89 | 72s · 10/10 named |
	| blues | 10 | chordonomicon_blues | 24–46 | 93s · 0/10 named |
	| bossa | 10 | chordonomicon_bossa | 24–78 | 88s · 0/10 named |
	| classical | 10 | chordonomicon_classical | 11–40 | 60s · 10/10 named |
	| country | 10 | chordonomicon_country | 30–81 | 110s · 0/10 named |
	| electronic | 10 | chordonomicon_electronic | 25–84 | 89s · 0/10 named |
	| folk | 10 | chordonomicon_folk | 33–82 | 114s · 0/10 named |
	| funk | 10 | chordonomicon_funk | 30–60 | 92s · 0/10 named |
	| gospel | 10 | chordonomicon_gospel | 24–85 | 98s · 0/10 named |
	| hip_hop | 10 | chordonomicon_hip_hop | 24–81 | 136s · 0/10 named |
	| rnb_soul | 10 | chordonomicon_rnb_soul | 34–82 | 128s · 0/10 named |
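
A deterministic draw of 10 songs per genre with a fixed seed, as described above, might look like the sketch below. This is not the code in `build_eval_real_songs.py`: the function name, the record fields (`id`, `genre`), and the toy pool are all assumptions for illustration.

```python
import random
from collections import defaultdict

def sample_per_genre(songs, per_genre=10, seed=42):
    """Pick `per_genre` songs from each genre, reproducibly."""
    by_genre = defaultdict(list)
    for song in songs:
        by_genre[song["genre"]].append(song)
    rng = random.Random(seed)
    picked = []
    for genre in sorted(by_genre):  # fixed genre order keeps the draw deterministic
        pool = sorted(by_genre[genre], key=lambda s: s["id"])
        picked.extend(rng.sample(pool, per_genre))
    return picked

GENRES = [
    "pop", "rock", "jazz", "blues", "bossa", "classical", "country",
    "electronic", "folk", "funk", "gospel", "hip_hop", "rnb_soul",
]
# Hypothetical held-out pool: 13 genres x 25 songs with made-up IDs.
pool = [{"id": f"{g}-{i:03d}", "genre": g} for g in GENRES for i in range(25)]

subset = sample_per_genre(pool)
print(len(subset))
```

Sorting both the genres and each genre's pool before sampling means the result depends only on the seed and the pool contents, not on input order.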

	Source license summary: McGill Billboard (CC0, named pop songs), Jazz Harmony Treebank / JazzStandards / WJazzD (Public / community-redistributed, named jazz standards), Bach chorales via music21 (public domain, named pieces), Chordonomicon per-genre subsets (CC BY-NC 4.0; titles are Spotify track IDs by upstream dataset policy — progressions are real songs). See [docs/EVAL.md](https://github.com/JinJuLee/PearlLeeStudio_TheArtist/blob/main/docs/EVAL.md) for full breakdown.

	### Methodology

	Teacher-forced next-token cross-entropy / top-1 / top-5 over each song's token sequence (BOS + key + time_sig + genre + bars + EOS, truncated to `max_seq_len=256`). Same `evaluate()` call as `ai/results/f1_per_genre_baseline.csv`, just narrowed to the curated 130-song subset. Token-level metrics; not a generation-quality eval (free-generation comparison with R1 Sethares + R2 theory RAG rerank is documented separately in `ai/results/eval_report.md`).
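
The three token-level metrics named above can be sketched in plain Python as follows. This is an illustrative stand-in, not the project's `evaluate()` function: `teacher_forced_metrics` is a hypothetical name, and the logits are toy values rather than model outputs.

```python
import math

def teacher_forced_metrics(logits_per_step, targets, k=5):
    """Mean cross-entropy (nats), top-1 accuracy and top-k accuracy.

    logits_per_step[t] holds vocabulary scores for predicting targets[t],
    computed from the ground-truth tokens before it (teacher forcing).
    """
    total_ce, top1_hits, topk_hits = 0.0, 0, 0
    for logits, target in zip(logits_per_step, targets):
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total_ce += log_z - logits[target]  # equals -log softmax(logits)[target]
        ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        top1_hits += ranked[0] == target
        topk_hits += target in ranked[:k]
    n = len(targets)
    return total_ce / n, top1_hits / n, topk_hits / n

# Toy example: 3 prediction steps over a 6-token vocabulary (made-up scores).
toy_logits = [
    [4.0, 1.0, 0.0, 0.0, 0.0, 0.0],   # target 0: top-1 hit
    [0.5, 3.0, 2.0, 1.0, 0.8, 0.1],   # target 2: top-1 miss, top-5 hit
    [3.0, 2.5, 2.0, 1.5, 1.0, -2.0],  # target 5: outside the top 5
]
toy_targets = [0, 2, 5]

loss, top1, top5 = teacher_forced_metrics(toy_logits, toy_targets)
print(f"loss={loss:.3f} top1={top1:.2f} top5={top5:.2f}")
```

Because the model always sees the true prefix, these metrics measure next-token prediction quality, not the quality of freely generated progressions.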

	Caveats:
	- The `classical` val partition is intrinsically small (37 sequences in the full eval), so the 10-song subset here has even wider uncertainty. The directional finding (LoRA helps a lot on Bach harmony) is robust, but exact percentage-point deltas are noisy.
	- F-series numbers on the 9 LoRA-only genres are conditioned without a genre tag (vocab=351 has no `[GENRE:country]` token, etc.). This is the realistic "F-series alone" condition, not a controlled ablation.
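
The genre-tag fallback in the second caveat can be pictured with the small sketch below. It is hypothetical: `genre_token` is not a function of the bundled `tokenizer.py`, and the set of genres that have their own token is an assumption made for illustration.

```python
def genre_token(genre, known_genres):
    """Return the conditioning token, falling back to [GENRE:none] for unknown genres."""
    return f"[GENRE:{genre}]" if genre in known_genres else "[GENRE:none]"

# Assumption: only these genres have a dedicated token in the 351-token vocabulary.
KNOWN = {"pop", "jazz"}

for g in ("pop", "jazz", "country", "classical"):
    print(g, "->", genre_token(g, KNOWN))
```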

	Source CSV: `ai/results/real_song_eval.csv` (17 models × 130 songs, long format).
	<!-- real-song-eval:end -->

	## Training-data licenses

	| Dataset | License |
	|---|---|
	| Chordonomicon | Public (user-generated) |
	| McGill Billboard | CC0 |
	| Jazz Harmony Treebank | Public |
	| JazzStandards (iReal Pro) | Community redistribution |
	| Weimar Jazz Database | ODbL |
	| JAAH | Research-use public |

	## Citation

	Cite the original mix-ratio paper. The companion per-genre LoRA paper
	(chord-symbol time-series adaptation) is in preparation; its arXiv ID
	will be added here once posted.

	```bibtex
	@misc{lee2026chordmix,
	title = {Empirical Study of Pop and Jazz Mix Ratios for Genre-Adaptive Chord Generation},
	author = {Lee, Jinju},
	year = {2026},
	eprint = {2605.04998},
	archivePrefix = {arXiv}
	}

	@misc{lee2026chordtimeseries,
	title = {How Far Can Chord-Symbol Time-Series Adaptation Carry Genre Identity?},
	author = {Lee, Jinju},
	year = {2026},
	note = {arXiv preprint, ID TBD},
	}
	```