---
license: mit
tags:
- audio
- music-generation
---

# BandTok

BandTok is a 2D audio tokenizer that represents music as a time-frequency image. This repository provides the BandTok tokenizer and a language model trained on BandTok tokens for generating 10-second music clips at 44.1 kHz from text prompts.

## Links

- [👀Paper](https://arxiv.org/abs/2605.15831)
- [👂Demo](https://xiaolubuhuizhuzhou.github.io/bandtok_demo/)
- [👨‍💻Code](https://github.com/xiaolubuhuizhuzhou/Bandtok)

## Install

```bash
pip install -r requirements.txt
```

The BandTok decoder uses NVIDIA BigVGAN. You can also install it explicitly:

```bash
cd /bandtok
git clone https://github.com/NVIDIA/BigVGAN
```

The package uses Hugging Face Hub for `config.yaml`, `bandtoklm.safetensors`, and tokenizer-only `bandtok.safetensors`.

## One-Command Music Generation

```bash
python examples/infer.py --repo_id xlbhzz/bandtok --prompt "A happy Latin song" --output output.wav
```

For a local pre-upload smoke test from this repository directory:

```bash
python examples/local_test_infer.py --prompt "A happy Latin song" --output local_test_output.wav
```

## Tokenizer Reconstruction Inference

Use the tokenizer-only checkpoint to encode an audio file into BandTok tokens and decode it back to waveform audio:

```bash
python examples/tokenizer_infer.py --repo_id . --input input.wav --output reconstructed.wav
```

For a directory, the script mirrors the input folder structure under the output directory:

```bash
python examples/tokenizer_infer.py --repo_id . --input /path/to/audios --output tokenizer_reconstructions
```

You can also save the encoded tokens:

```bash
python examples/tokenizer_infer.py --repo_id . --input input.wav --output reconstructed.wav --save-tokens input_tokens.pt
```

## Python API

```python
from bandtok import BandTokPipeline

pipe = BandTokPipeline.from_pretrained("xlbhzz/bandtok", device="cuda")
audio = pipe.generate("A happy Latin song", duration=10.0)
pipe.save(audio, "output.wav")
```

Tokenizer-only usage:

```python
from bandtok import BandTokTokenizer

tokenizer = BandTokTokenizer.from_pretrained("xlbhzz/bandtok", device="cuda")
tokens = tokenizer.encode("input.wav")
audio = tokenizer.decode(tokens)
```

## Troubleshooting

- BigVGAN import error: run `git clone https://github.com/NVIDIA/BigVGAN` under `/bandtok`.
- T5 download errors: the prompt encoder uses `t5-base`; make sure Hugging Face downloads are available or pre-cache the model.

## Citation

If you find this work useful, please cite:

```bibtex
@inproceedings{cheng2026modeling,
  title     = {Modeling Music as a Time-Frequency Image: A 2D Tokenizer for Music Generation},
  author    = {Cheng, Yuqing and Ma, Xingyu and Yu, Guochen and Gu, Xiaotao},
  booktitle = {IEEE ICME 2026 Challenge Papers},
  year      = {2026}
}
```
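
Note on directory mode: the Tokenizer Reconstruction Inference section says the script mirrors the input folder structure under the output directory. The sketch below shows one way such a mapping could be computed with `pathlib`, which may help when planning batch runs. It is an illustration only, not the code in `examples/tokenizer_infer.py`; the function names `mirrored_output_path` and `plan_reconstructions` and the extension set `AUDIO_EXTENSIONS` are hypothetical.

```python
from pathlib import Path

# Hypothetical extension set; the real script may accept other formats.
AUDIO_EXTENSIONS = {".wav", ".flac", ".mp3"}


def mirrored_output_path(input_root, audio_file, output_root):
    """Map input_root/a/b/x.wav to output_root/a/b/x.wav."""
    # Path of the file relative to the input root, e.g. a/b/x.wav.
    relative = Path(audio_file).relative_to(input_root)
    return Path(output_root) / relative


def plan_reconstructions(input_root, output_root):
    """Return (source, destination) pairs for audio files under input_root."""
    input_root = Path(input_root)
    pairs = []
    for source in sorted(input_root.rglob("*")):
        if source.is_file() and source.suffix.lower() in AUDIO_EXTENSIONS:
            destination = mirrored_output_path(input_root, source, output_root)
            pairs.append((source, destination))
    return pairs
```

Under these assumptions, `plan_reconstructions("/path/to/audios", "tokenizer_reconstructions")` would list each audio file together with the path where its reconstruction would be written, without creating any files.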