---
license: mit
tags:
- audio
- music-generation
---
# BandTok
BandTok is a 2D audio tokenizer that represents music as a time-frequency image. This repository provides the BandTok tokenizer and a language model trained on BandTok tokens for generating 10-second music clips at 44.1 kHz from text prompts.
## Links
- [👀Paper](https://arxiv.org/abs/2605.15831)
- [👂Demo](https://xiaolubuhuizhuzhou.github.io/bandtok_demo/)
- [👨💻Code](https://github.com/xiaolubuhuizhuzhou/Bandtok)
## Install
```bash
pip install -r requirements.txt
```
The BandTok decoder uses NVIDIA BigVGAN. You can also install it explicitly:
```bash
cd /bandtok
git clone https://github.com/NVIDIA/BigVGAN
```
The package fetches three files from the Hugging Face Hub: `config.yaml`, the language model checkpoint `bandtoklm.safetensors`, and the tokenizer-only checkpoint `bandtok.safetensors`.
## One-Command Music Generation
```bash
python examples/infer.py --repo_id xlbhzz/bandtok --prompt "A happy Latin song" --output output.wav
```
For a local pre-upload smoke test from this repository directory:
```bash
python examples/local_test_infer.py --prompt "A happy Latin song" --output local_test_output.wav
```
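To sanity-check a generated file, the sketch below reads the WAV header with Python's standard `wave` module and reports the sample rate, channel count, and duration. The helper name `describe_wav` is hypothetical and not part of BandTok. It also assumes the output is integer PCM; `wave` raises an error on floating-point WAV files, in which case a library such as `soundfile` would be needed instead.

```python
import wave
from pathlib import Path


def describe_wav(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        sample_rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / sample_rate
    return sample_rate, channels, duration


# Only inspect the file if a generation run has already produced it.
if Path("output.wav").exists():
    rate, channels, seconds = describe_wav("output.wav")
    print(f"{rate} Hz, {channels} channel(s), {seconds:.2f} s")
```

For a clip produced with the settings described above, the expected values are a 44100 Hz sample rate and a duration of about 10 seconds.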
## Tokenizer Reconstruction Inference
Use the tokenizer-only checkpoint to encode an audio file into BandTok tokens and decode it back to waveform audio:
```bash
python examples/tokenizer_infer.py --repo_id . --input input.wav --output reconstructed.wav
```
For a directory, the script mirrors the input folder structure under the output directory:
```bash
python examples/tokenizer_infer.py --repo_id . --input /path/to/audios --output tokenizer_reconstructions
```
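To illustrate what "mirroring" means here, the sketch below maps each audio file under an input directory to the same relative path under an output directory. It is not the code from `examples/tokenizer_infer.py`; the function name, the set of file suffixes, and the choice to keep the original extension are all assumptions made for illustration.

```python
from pathlib import Path

# Assumed set of audio extensions; the real script may accept others.
AUDIO_SUFFIXES = {".wav", ".flac", ".mp3"}


def mirrored_outputs(input_dir, output_dir):
    """Pair every audio file under input_dir with the same relative path under output_dir."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    pairs = []
    for src in sorted(input_dir.rglob("*")):
        if src.is_file() and src.suffix.lower() in AUDIO_SUFFIXES:
            # relative_to() strips the input root so the subfolders carry over.
            pairs.append((src, output_dir / src.relative_to(input_dir)))
    return pairs
```

With this mapping, `/path/to/audios/rock/track.wav` would be reconstructed to `tokenizer_reconstructions/rock/track.wav`.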
You can also save the encoded tokens:
```bash
python examples/tokenizer_infer.py --repo_id . --input input.wav --output reconstructed.wav --save-tokens input_tokens.pt
```
## Python API
```python
from bandtok import BandTokPipeline
pipe = BandTokPipeline.from_pretrained("xlbhzz/bandtok", device="cuda")
audio = pipe.generate("A happy Latin song", duration=10.0)
pipe.save(audio, "output.wav")
```
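To generate several clips in one run, the pipeline calls above can be wrapped in a loop. This is a minimal sketch that relies only on `generate` and `save` as they appear in this README; the helpers `prompt_to_filename` and `generate_batch` are hypothetical and not part of the `bandtok` package.

```python
import re
from pathlib import Path


def prompt_to_filename(prompt, index):
    """Build a filesystem-safe WAV file name from a text prompt."""
    slug = re.sub(r"[^a-z0-9]+", "_", prompt.lower()).strip("_")
    return f"{index:02d}_{slug[:40]}.wav"


def generate_batch(pipe, prompts, out_dir, duration=10.0):
    """Generate one clip per prompt and return the list of written paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, prompt in enumerate(prompts):
        audio = pipe.generate(prompt, duration=duration)
        path = out_dir / prompt_to_filename(prompt, index)
        pipe.save(audio, str(path))
        paths.append(path)
    return paths


# Usage, assuming the pipeline loads as shown above:
# pipe = BandTokPipeline.from_pretrained("xlbhzz/bandtok", device="cuda")
# generate_batch(pipe, ["A happy Latin song", "A slow piano ballad"], "outputs")
```

Passing the pipeline in as an argument keeps the model loaded once for the whole batch rather than reloading it per prompt.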
Tokenizer-only usage:
```python
from bandtok import BandTokTokenizer
tokenizer = BandTokTokenizer.from_pretrained("xlbhzz/bandtok", device="cuda")
tokens = tokenizer.encode("input.wav")
audio = tokenizer.decode(tokens)
```
## Troubleshooting
- BigVGAN import error: run `git clone https://github.com/NVIDIA/BigVGAN` under `/bandtok`.
- T5 download errors: the prompt encoder uses `t5-base`; make sure the machine can reach the Hugging Face Hub, or pre-cache the model.
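For the BigVGAN case, a quick way to check whether the clone is where it is expected is sketched below. The helper `has_bigvgan_clone` is hypothetical, and the check assumes the clone directory is named `BigVGAN` and contains a top-level `bigvgan.py`, as in the upstream repository at the time of writing. It only reports whether the files are present; how `bandtok` itself locates BigVGAN is not covered here.

```python
from pathlib import Path


def has_bigvgan_clone(parent_dir):
    """Return True if parent_dir/BigVGAN looks like a BigVGAN checkout."""
    clone_dir = Path(parent_dir) / "BigVGAN"
    # Assumption: the upstream repository ships bigvgan.py at its root.
    return (clone_dir / "bigvgan.py").is_file()


if not has_bigvgan_clone("/bandtok"):
    print("BigVGAN clone not found under /bandtok; run the git clone command above.")
```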
## Citation
If you find this work useful, please cite:
```bibtex
@inproceedings{cheng2026modeling,
title = {Modeling Music as a Time-Frequency Image: A 2D Tokenizer for Music Generation},
author = {Cheng, Yuqing and Ma, Xingyu and Yu, Guochen and Gu, Xiaotao},
booktitle = {IEEE ICME 2026 Challenge Papers},
year = {2026}
}
```