| --- |
| license: mit |
| tags: |
| - audio |
| - music-generation |
| --- |
| |
| # BandTok |
|
|
| BandTok is a 2D audio tokenizer that represents music as a time-frequency image. This repository provides the BandTok tokenizer and a language model trained on BandTok tokens for generating 10-second music clips at 44.1 kHz from text prompts. |
|
|
| ## Links |
|
|
| - [👀 Paper](https://arxiv.org/abs/2605.15831) |
| - [👂 Demo](https://xiaolubuhuizhuzhou.github.io/bandtok_demo/) |
| - [👨‍💻 Code](https://github.com/xiaolubuhuizhuzhou/Bandtok) |
|
|
| ## Install |
|
|
| ```bash |
| pip install -r requirements.txt |
| ``` |
|
|
| The BandTok decoder uses NVIDIA BigVGAN. You can also install it explicitly by cloning its repository: |
|
|
| ```bash |
| cd /bandtok |
| git clone https://github.com/NVIDIA/BigVGAN |
| ``` |
|
|
| The package downloads `config.yaml`, `bandtoklm.safetensors`, and the tokenizer-only `bandtok.safetensors` from the Hugging Face Hub. |
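
If you want to fetch these files ahead of time (for example, before moving to a machine with restricted network access), the following is a minimal sketch using `hf_hub_download` from `huggingface_hub`. The file names come from the sentence above and the repo id from the examples in this README; the helper name `fetch_checkpoint_files` and its `download` parameter are hypothetical and only exist to make the sketch easy to try offline. The package itself may resolve these files differently.

```python
# Repo id and file names as mentioned in this README.
REPO_ID = "xlbhzz/bandtok"
FILES = ["config.yaml", "bandtoklm.safetensors", "bandtok.safetensors"]


def fetch_checkpoint_files(repo_id=REPO_ID, filenames=FILES, download=None):
    """Return {filename: local_path} for each requested file.

    `download` is a hypothetical hook for swapping in another downloader;
    by default the real `hf_hub_download` is used, which stores files in
    the local Hugging Face cache and returns their paths.
    """
    if download is None:
        from huggingface_hub import hf_hub_download  # imported lazily
        download = hf_hub_download
    return {name: download(repo_id=repo_id, filename=name) for name in filenames}
```

Calling `fetch_checkpoint_files()` once should leave the files in the local cache, so later runs can reuse them.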
|
|
| ## One-Command Music Generation |
|
|
| ```bash |
| python examples/infer.py --repo_id xlbhzz/bandtok --prompt "A happy Latin song" --output output.wav |
| ``` |
|
|
| To run a local smoke test from this repository directory before uploading: |
|
|
| ```bash |
| python examples/local_test_infer.py --prompt "A happy Latin song" --output local_test_output.wav |
| ``` |
|
|
| ## Tokenizer Reconstruction Inference |
|
|
| Use the tokenizer-only checkpoint to encode an audio file into BandTok tokens and decode it back to waveform audio: |
|
|
| ```bash |
| python examples/tokenizer_infer.py --repo_id . --input input.wav --output reconstructed.wav |
| ``` |
|
|
| For a directory, the script mirrors the input folder structure under the output directory: |
|
|
| ```bash |
| python examples/tokenizer_infer.py --repo_id . --input /path/to/audios --output tokenizer_reconstructions |
| ``` |
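
To make the mirroring behavior concrete, here is a minimal sketch of how input files could map to output paths. It is not the code in `examples/tokenizer_infer.py`: the helper name `mirrored_outputs`, the set of audio extensions, and the assumption that output files keep their original names are all illustrative.

```python
from pathlib import Path

# Assumed set of audio extensions; the actual script may accept others.
AUDIO_EXTS = {".wav", ".flac", ".mp3"}


def mirrored_outputs(input_dir, output_dir):
    """Yield (source, destination) pairs that keep each file's relative path."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    for src in sorted(input_dir.rglob("*")):
        if src.is_file() and src.suffix.lower() in AUDIO_EXTS:
            # e.g. audios/rock/a.wav -> tokenizer_reconstructions/rock/a.wav
            yield src, output_dir / src.relative_to(input_dir)
```

Under these assumptions, a file at `/path/to/audios/rock/a.wav` would be reconstructed to `tokenizer_reconstructions/rock/a.wav`.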
|
|
| You can also save the encoded tokens: |
|
|
| ```bash |
| python examples/tokenizer_infer.py --repo_id . --input input.wav --output reconstructed.wav --save-tokens input_tokens.pt |
| ``` |
|
|
| ## Python API |
|
|
| ```python |
| from bandtok import BandTokPipeline |
| |
| pipe = BandTokPipeline.from_pretrained("xlbhzz/bandtok", device="cuda") |
| audio = pipe.generate("A happy Latin song", duration=10.0) |
| pipe.save(audio, "output.wav") |
| ``` |
|
|
| Tokenizer-only usage: |
|
|
| ```python |
| from bandtok import BandTokTokenizer |
| |
| tokenizer = BandTokTokenizer.from_pretrained("xlbhzz/bandtok", device="cuda") |
| tokens = tokenizer.encode("input.wav") |
| audio = tokenizer.decode(tokens) |
| ``` |
|
|
| ## Troubleshooting |
|
|
| - BigVGAN import error: run `git clone https://github.com/NVIDIA/BigVGAN` under `/bandtok`. |
| - T5 download errors: the prompt encoder uses `t5-base`; make sure Hugging Face downloads are available or pre-cache the model. |
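
One possible way to pre-cache the prompt encoder is sketched below with `snapshot_download` from `huggingface_hub`. It assumes the pipeline loads `t5-base` through the standard Hugging Face cache, which this README does not state explicitly; the helper name `precache_prompt_encoder` and its `download` parameter are hypothetical.

```python
def precache_prompt_encoder(repo_id="t5-base", download=None):
    """Download a full model snapshot into the local Hugging Face cache.

    Returns the local snapshot directory. `download` is a hypothetical
    hook for substituting another downloader.
    """
    if download is None:
        from huggingface_hub import snapshot_download  # imported lazily
        download = snapshot_download
    return download(repo_id=repo_id)
```

After running this once on a machine with network access, setting the environment variable `HF_HUB_OFFLINE=1` tells the Hugging Face libraries to read from the cache instead of contacting the Hub.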
|
|
| ## Citation |
|
|
| If you find this work useful, please cite: |
|
|
| ```bibtex |
| @inproceedings{cheng2026modeling, |
| title = {Modeling Music as a Time-Frequency Image: A 2D Tokenizer for Music Generation}, |
| author = {Cheng, Yuqing and Ma, Xingyu and Yu, Guochen and Gu, Xiaotao}, |
| booktitle = {IEEE ICME 2026 Challenge Papers}, |
| year = {2026} |
| } |
| ``` |
|
|