	---
	license: mit
	tags:
	- audio
	- music-generation
	---

	# BandTok

	BandTok is a 2D audio tokenizer that represents music as a time-frequency image. This repository provides the BandTok tokenizer and a language model trained on BandTok tokens for generating 10-second music clips at 44.1 kHz from text prompts.

	## Links

	- [👀Paper](https://arxiv.org/abs/2605.15831)
	- [👂Demo](https://xiaolubuhuizhuzhou.github.io/bandtok_demo/)
	- [👨‍💻Code](https://github.com/xiaolubuhuizhuzhou/Bandtok)

	## Install

	```bash
	pip install -r requirements.txt
	```

	The BandTok decoder depends on NVIDIA BigVGAN. If it is not picked up automatically, you can fetch it explicitly by cloning the repository:

	```bash
	cd /bandtok
	git clone https://github.com/NVIDIA/BigVGAN
	```

	The package downloads `config.yaml`, `bandtoklm.safetensors`, and the tokenizer-only checkpoint `bandtok.safetensors` from the Hugging Face Hub.
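
To fetch these files ahead of time, for example before moving to a machine with restricted network access, a minimal sketch built on `huggingface_hub.hf_hub_download` might look like the following. The helper name `fetch_bandtok_files` and its injectable `download` argument are illustrative assumptions and not part of the BandTok package. The repo id is copied from the commands in this README, and the file names are the three listed above.

```python
from typing import Callable, Dict, Iterable, Optional

# File names as listed in this README; adjust if the repository layout differs.
BANDTOK_FILES = ("config.yaml", "bandtoklm.safetensors", "bandtok.safetensors")


def fetch_bandtok_files(
    repo_id: str = "xlbhzz/bandtok",
    filenames: Iterable[str] = BANDTOK_FILES,
    download: Optional[Callable[..., str]] = None,
) -> Dict[str, str]:
    """Download each file into the local Hugging Face cache and return its path.

    Hypothetical helper for illustration; it is not a BandTok API.
    """
    if download is None:
        # Imported lazily so the helper can be exercised with a stub downloader.
        from huggingface_hub import hf_hub_download

        download = hf_hub_download
    # hf_hub_download returns the local path of the cached file.
    return {name: download(repo_id=repo_id, filename=name) for name in filenames}
```

Calling `fetch_bandtok_files()` with no arguments should leave the three files in the default Hugging Face cache. Whether the BandTok loaders read from that same cache is an assumption worth checking against the code.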

	## One-Command Music Generation

	```bash
	python examples/infer.py --repo_id xlbhzz/bandtok --prompt "A happy Latin song" --output output.wav
	```

	For a local pre-upload smoke test from this repository directory:

	```bash
	python examples/local_test_infer.py --prompt "A happy Latin song" --output local_test_output.wav
	```
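
After either command finishes, a quick sanity check is to confirm that the output matches the 10-second, 44.1 kHz figures quoted at the top of this README. The sketch below uses only the standard-library `wave` module, which reads uncompressed integer PCM WAV files. If the scripts write floating-point WAV, `wave.open` will raise `wave.Error` and a different reader is needed. The function names and the 0.5 s tolerance are assumptions made for illustration.

```python
import wave


def wav_info(path: str) -> dict:
    """Return sample rate, channel count, and duration of a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        frames = wf.getnframes()
        return {
            "sample_rate": rate,
            "channels": wf.getnchannels(),
            "duration_s": frames / rate,
        }


def looks_like_bandtok_clip(
    path: str, rate: int = 44100, seconds: float = 10.0, tol: float = 0.5
) -> bool:
    """Compare a file against the 10 s / 44.1 kHz figures from this README."""
    info = wav_info(path)
    return info["sample_rate"] == rate and abs(info["duration_s"] - seconds) <= tol
```

For example, `looks_like_bandtok_clip("local_test_output.wav")` would be expected to return `True` for a successful smoke test, assuming the script writes integer PCM.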

	## Tokenizer Reconstruction Inference

	Use the tokenizer-only checkpoint to encode an audio file into BandTok tokens and decode it back to waveform audio:

	```bash
	python examples/tokenizer_infer.py --repo_id . --input input.wav --output reconstructed.wav
	```

	For a directory, the script mirrors the input folder structure under the output directory:

	```bash
	python examples/tokenizer_infer.py --repo_id . --input /path/to/audios --output tokenizer_reconstructions
	```
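
To make "mirrors the input folder structure" concrete, the sketch below shows one way such a mapping can be computed with `pathlib`. It is not the code in `examples/tokenizer_infer.py`. The function name `mirrored_outputs`, the set of audio suffixes, and the choice to keep the original file extension are all assumptions.

```python
from pathlib import Path
from typing import Iterator, Tuple

# Assumed filter; the script's actual list of accepted formats may differ.
AUDIO_SUFFIXES = {".wav", ".flac", ".mp3"}


def mirrored_outputs(input_dir: str, output_dir: str) -> Iterator[Tuple[Path, Path]]:
    """Yield (source, destination) pairs that preserve each file's relative path."""
    src_root, dst_root = Path(input_dir), Path(output_dir)
    for src in sorted(src_root.rglob("*")):
        if src.is_file() and src.suffix.lower() in AUDIO_SUFFIXES:
            # Re-root the path: <input_dir>/a/b.wav -> <output_dir>/a/b.wav
            yield src, dst_root / src.relative_to(src_root)
```

Under these assumptions, an input file at `/path/to/audios/rock/track1.wav` would map to `tokenizer_reconstructions/rock/track1.wav`.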

	You can also save the encoded tokens:

	```bash
	python examples/tokenizer_infer.py --repo_id . --input input.wav --output reconstructed.wav --save-tokens input_tokens.pt
	```

	## Python API

	```python
	from bandtok import BandTokPipeline

	pipe = BandTokPipeline.from_pretrained("xlbhzz/bandtok", device="cuda")
	audio = pipe.generate("A happy Latin song", duration=10.0)
	pipe.save(audio, "output.wav")
	```

	Tokenizer-only usage:

	```python
	from bandtok import BandTokTokenizer

	tokenizer = BandTokTokenizer.from_pretrained("xlbhzz/bandtok", device="cuda")
	tokens = tokenizer.encode("input.wav")
	audio = tokenizer.decode(tokens)
	```

	## Troubleshooting

	- BigVGAN import error: run `git clone https://github.com/NVIDIA/BigVGAN` under `/bandtok`.
	- T5 download errors: the prompt encoder uses `t5-base`; make sure Hugging Face downloads are available or pre-cache the model.
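
For the T5 case, one possible way to pre-cache the model is `huggingface_hub.snapshot_download`, sketched below. This assumes that BandTok loads `t5-base` through the standard Hugging Face cache, which this README does not state. The helper name `precache_prompt_encoder` and its `download` argument are hypothetical. Note that a full snapshot fetches every file in the model repository, including weight formats that may not be needed.

```python
from typing import Callable, Optional


def precache_prompt_encoder(
    model_id: str = "t5-base",
    download: Optional[Callable[..., str]] = None,
) -> str:
    """Download all files of `model_id` into the local Hugging Face cache.

    Hypothetical helper for illustration; returns the local snapshot directory.
    """
    if download is None:
        # Imported lazily so the helper can be exercised with a stub downloader.
        from huggingface_hub import snapshot_download

        download = snapshot_download
    return download(repo_id=model_id)
```

Run `precache_prompt_encoder()` once on a machine with network access. On the offline machine, setting the environment variable `HF_HUB_OFFLINE=1` before starting Python tells Hugging Face libraries to read only from the local cache.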

	## Citation

	If you find this work useful, please cite:

	```bibtex
	@inproceedings{cheng2026modeling,
	title = {Modeling Music as a Time-Frequency Image: A 2D Tokenizer for Music Generation},
	author = {Cheng, Yuqing and Ma, Xingyu and Yu, Guochen and Gu, Xiaotao},
	booktitle = {IEEE ICME 2026 Challenge Papers},
	year = {2026}
	}
	```