new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Apr 22

Theme Transformer: Symbolic Music Generation with Theme-Conditioned Transformer

Attention-based Transformer models have been increasingly employed for automatic music generation. To condition the generation process of such a model with a user-specified sequence, a popular approach is to take that conditioning sequence as a priming sequence and ask a Transformer decoder to generate a continuation. However, this prompt-based conditioning cannot guarantee that the conditioning sequence would develop or even simply repeat itself in the generated continuation. In this paper, we propose an alternative conditioning approach, called theme-based conditioning, that explicitly trains the Transformer to treat the conditioning sequence as a thematic material that has to manifest itself multiple times in its generation result. This is achieved with two main technical contributions. First, we propose a deep learning-based approach that uses contrastive representation learning and clustering to automatically retrieve thematic materials from music pieces in the training data. Second, we propose a novel gated parallel attention module to be used in a sequence-to-sequence (seq2seq) encoder/decoder architecture to more effectively account for a given conditioning thematic material in the generation process of the Transformer decoder. We report on objective and subjective evaluations of variants of the proposed Theme Transformer and the conventional prompt-based baseline, showing that our best model can generate, to some extent, polyphonic pop piano music with repetition and plausible variations of a given condition.

  • 5 authors
·
Nov 7, 2021

PBSCR: The Piano Bootleg Score Composer Recognition Dataset

This article motivates, describes, and presents the PBSCR dataset for studying composer recognition of classical piano music. Our goal was to design a dataset that facilitates large-scale research on composer recognition that is suitable for modern architectures and training practices. To achieve this goal, we utilize the abundance of sheet music images and rich metadata on IMSLP, use a previously proposed feature representation called a bootleg score to encode the location of noteheads relative to staff lines, and present the data in an extremely simple format (2D binary images) to encourage rapid exploration and iteration. The dataset itself contains 40,000 62x64 bootleg score images for a 9-class recognition task, 100,000 62x64 bootleg score images for a 100-class recognition task, and 29,310 unlabeled variable-length bootleg score images for pretraining. The labeled data is presented in a form that mirrors MNIST images, in order to make it extremely easy to visualize, manipulate, and train models in an efficient manner. We include relevant information to connect each bootleg score image with its underlying raw sheet music image, and we scrape, organize, and compile metadata from IMSLP on all piano works to facilitate multimodal research and allow for convenient linking to other datasets. We release baseline results in a supervised and low-shot setting for future works to compare against, and we discuss open research questions that the PBSCR data is especially well suited to facilitate research on.

  • 3 authors
·
Jan 30, 2024

GiantMIDI-Piano: A large-scale MIDI dataset for classical piano music

Symbolic music datasets are important for music information retrieval and musical analysis. However, there is a lack of large-scale symbolic datasets for classical piano music. In this article, we create a GiantMIDI-Piano (GP) dataset containing 38,700,838 transcribed notes and 10,855 unique solo piano works composed by 2,786 composers. We extract the names of music works and the names of composers from the International Music Score Library Project (IMSLP). We search and download their corresponding audio recordings from the internet. We further create a curated subset containing 7,236 works composed by 1,787 composers by constraining the titles of downloaded audio recordings containing the surnames of composers. We apply a convolutional neural network to detect solo piano works. Then, we transcribe those solo piano recordings into Musical Instrument Digital Interface (MIDI) files using a high-resolution piano transcription system. Each transcribed MIDI file contains the onset, offset, pitch, and velocity attributes of piano notes and pedals. GiantMIDI-Piano includes 90% live performance MIDI files and 10\% sequence input MIDI files. We analyse the statistics of GiantMIDI-Piano and show pitch class, interval, trichord, and tetrachord frequencies of six composers from different eras to show that GiantMIDI-Piano can be used for musical analysis. We evaluate the quality of GiantMIDI-Piano in terms of solo piano detection F1 scores, metadata accuracy, and transcription error rates. We release the source code for acquiring the GiantMIDI-Piano dataset at https://github.com/bytedance/GiantMIDI-Piano

  • 4 authors
·
Oct 10, 2020

From Context to Concept: Exploring Semantic Relationships in Music with Word2Vec

We explore the potential of a popular distributional semantics vector space model, word2vec, for capturing meaningful relationships in ecological (complex polyphonic) music. More precisely, the skip-gram version of word2vec is used to model slices of music from a large corpus spanning eight musical genres. In this newly learned vector space, a metric based on cosine distance is able to distinguish between functional chord relationships, as well as harmonic associations in the music. Evidence, based on cosine distance between chord-pair vectors, suggests that an implicit circle-of-fifths exists in the vector space. In addition, a comparison between pieces in different keys reveals that key relationships are represented in word2vec space. These results suggest that the newly learned embedded vector representation does in fact capture tonal and harmonic characteristics of music, without receiving explicit information about the musical content of the constituent slices. In order to investigate whether proximity in the discovered space of embeddings is indicative of `semantically-related' slices, we explore a music generation task, by automatically replacing existing slices from a given piece of music with new slices. We propose an algorithm to find substitute slices based on spatial proximity and the pitch class distribution inferred in the chosen subspace. The results indicate that the size of the subspace used has a significant effect on whether slices belonging to the same key are selected. In sum, the proposed word2vec model is able to learn music-vector embeddings that capture meaningful tonal and harmonic relationships in music, thereby providing a useful tool for exploring musical properties and comparisons across pieces, as a potential input representation for deep learning models, and as a music generation device.

  • 3 authors
·
Nov 29, 2018

MusicScore: A Dataset for Music Score Modeling and Generation

Music scores are written representations of music and contain rich information about musical components. The visual information on music scores includes notes, rests, staff lines, clefs, dynamics, and articulations. This visual information in music scores contains more semantic information than audio and symbolic representations of music. Previous music score datasets have limited sizes and are mainly designed for optical music recognition (OMR). There is a lack of research on creating a large-scale benchmark dataset for music modeling and generation. In this work, we propose MusicScore, a large-scale music score dataset collected and processed from the International Music Score Library Project (IMSLP). MusicScore consists of image-text pairs, where the image is a page of a music score and the text is the metadata of the music. The metadata of MusicScore is extracted from the general information section of the IMSLP pages. The metadata includes rich information about the composer, instrument, piece style, and genre of the music pieces. MusicScore is curated into small, medium, and large scales of 400, 14k, and 200k image-text pairs with varying diversity, respectively. We build a score generation system based on a UNet diffusion model to generate visually readable music scores conditioned on text descriptions to benchmark the MusicScore dataset for music score generation. MusicScore is released to the public at https://huggingface.co/datasets/ZheqiDAI/MusicScore.

  • 3 authors
·
Jun 17, 2024

DExter: Learning and Controlling Performance Expression with Diffusion Models

In the pursuit of developing expressive music performance models using artificial intelligence, this paper introduces DExter, a new approach leveraging diffusion probabilistic models to render Western classical piano performances. In this approach, performance parameters are represented in a continuous expression space and a diffusion model is trained to predict these continuous parameters while being conditioned on the musical score. Furthermore, DExter also enables the generation of interpretations (expressive variations of a performance) guided by perceptually meaningful features by conditioning jointly on score and perceptual feature representations. Consequently, we find that our model is useful for learning expressive performance, generating perceptually steered performances, and transferring performance styles. We assess the model through quantitative and qualitative analyses, focusing on specific performance metrics regarding dimensions like asynchrony and articulation, as well as through listening tests comparing generated performances with different human interpretations. Results show that DExter is able to capture the time-varying correlation of the expressive parameters, and compares well to existing rendering models in subjectively evaluated ratings. The perceptual-feature-conditioned generation and transferring capabilities of DExter are verified by a proxy model predicting perceptual characteristics of differently steered performances.

  • 6 authors
·
Jun 20, 2024

MIDI-VALLE: Improving Expressive Piano Performance Synthesis Through Neural Codec Language Modelling

Generating expressive audio performances from music scores requires models to capture both instrument acoustics and human interpretation. Traditional music performance synthesis pipelines follow a two-stage approach, first generating expressive performance MIDI from a score, then synthesising the MIDI into audio. However, the synthesis models often struggle to generalise across diverse MIDI sources, musical styles, and recording environments. To address these challenges, we propose MIDI-VALLE, a neural codec language model adapted from the VALLE framework, which was originally designed for zero-shot personalised text-to-speech (TTS) synthesis. For performance MIDI-to-audio synthesis, we improve the architecture to condition on a reference audio performance and its corresponding MIDI. Unlike previous TTS-based systems that rely on piano rolls, MIDI-VALLE encodes both MIDI and audio as discrete tokens, facilitating a more consistent and robust modelling of piano performances. Furthermore, the model's generalisation ability is enhanced by training on an extensive and diverse piano performance dataset. Evaluation results show that MIDI-VALLE significantly outperforms a state-of-the-art baseline, achieving over 75% lower Frechet Audio Distance on the ATEPP and Maestro datasets. In the listening test, MIDI-VALLE received 202 votes compared to 58 for the baseline, demonstrating improved synthesis quality and generalisation across diverse performance MIDI inputs.

  • 6 authors
·
Jul 11, 2025

Multi-Track MusicLDM: Towards Versatile Music Generation with Latent Diffusion Model

Diffusion models have shown promising results in cross-modal generation tasks involving audio and music, such as text-to-sound and text-to-music generation. These text-controlled music generation models typically focus on generating music by capturing global musical attributes like genre and mood. However, music composition is a complex, multilayered task that often involves musical arrangement as an integral part of the process. This process involves composing each instrument to align with existing ones in terms of beat, dynamics, harmony, and melody, requiring greater precision and control over tracks than text prompts usually provide. In this work, we address these challenges by extending the MusicLDM, a latent diffusion model for music, into a multi-track generative model. By learning the joint probability of tracks sharing a context, our model is capable of generating music across several tracks that correspond well to each other, either conditionally or unconditionally. Additionally, our model is capable of arrangement generation, where the model can generate any subset of tracks given the others (e.g., generating a piano track complementing given bass and drum tracks). We compared our model with an existing multi-track generative model and demonstrated that our model achieves considerable improvements across objective metrics for both total and arrangement generation tasks.

  • 5 authors
·
Sep 4, 2024

Mustango: Toward Controllable Text-to-Music Generation

With recent advancements in text-to-audio and text-to-music based on latent diffusion models, the quality of generated content has been reaching new heights. The controllability of musical aspects, however, has not been explicitly explored in text-to-music systems yet. In this paper, we present Mustango, a music-domain-knowledge-inspired text-to-music system based on diffusion, that expands the Tango text-to-audio model. Mustango aims to control the generated music, not only with general text captions, but from more rich captions that could include specific instructions related to chords, beats, tempo, and key. As part of Mustango, we propose MuNet, a Music-Domain-Knowledge-Informed UNet sub-module to integrate these music-specific features, which we predict from the text prompt, as well as the general text embedding, into the diffusion denoising process. To overcome the limited availability of open datasets of music with text captions, we propose a novel data augmentation method that includes altering the harmonic, rhythmic, and dynamic aspects of music audio and using state-of-the-art Music Information Retrieval methods to extract the music features which will then be appended to the existing descriptions in text format. We release the resulting MusicBench dataset which contains over 52K instances and includes music-theory-based descriptions in the caption text. Through extensive experiments, we show that the quality of the music generated by Mustango is state-of-the-art, and the controllability through music-specific text prompts greatly outperforms other models in terms of desired chords, beat, key, and tempo, on multiple datasets.

  • 6 authors
·
Nov 14, 2023 1