	---
	title: GibbsTTS Demo
	emoji: 🎙️
	colorFrom: indigo
	colorTo: gray
	sdk: gradio
	sdk_version: "5.49.1"
	python_version: "3.10"
	app_file: app.py
	pinned: false
	license: mit
	short_description: Zero-shot voice cloning TTS (EN/ZH) — GibbsTTS demo
	models:
	- ydqmkkx/GibbsTTS
	tags:
	- tts
	- text-to-speech
	- voice-cloning
	- zero-shot
	- english
	- chinese
	- flow-matching
	---

	# GibbsTTS — Zero-Shot Voice Cloning TTS

	This Hugging Face Space is a demo of GibbsTTS, a zero-shot text-to-speech model
	based on metric-induced discrete flow matching. The model adds two components
	proposed in the paper: a kinetic-optimal scheduler and a finite-step CTMC
	(continuous-time Markov chain) moment correction.

	- 📄 Paper: <https://arxiv.org/abs/2605.09386>
	- 💻 Code: <https://github.com/ydqmkkx/GibbsTTS>
	- 🎛️ Weights: <https://huggingface.co/ydqmkkx/GibbsTTS>

	## How to use

	1. Reference audio — upload (or record) a short clip of the voice you want
	to clone. A few seconds is enough.
	2. Reference transcript — type exactly what the reference clip says.
	3. Target text — the sentence you want the model to speak in that voice.
	4. Language — choose `English`, `Chinese (Mandarin)`, or `Mixed EN/ZH`.
	5. Click Synthesize.

	The model was trained on the English and Chinese subsets of
	[Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset), so it
	supports English and Mandarin. The `Mixed EN/ZH` mode is experimental and
	provided for fun.
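The steps above can also be driven from Python with `gradio_client`. The following is a minimal sketch, not the Space's documented API: the endpoint name `/synthesize` and the positional order of the inputs are assumptions, so check the Space's "Use via API" panel (or `Client.view_api()`) for the real values. The helper names `build_inputs` and `synthesize` are hypothetical.

```python
from pathlib import Path

SPACE_ID = "ydqmkkx/GibbsTTS"  # Space name taken from this README
API_NAME = "/synthesize"       # ASSUMPTION: the real endpoint name may differ
LANGUAGES = ("English", "Chinese (Mandarin)", "Mixed EN/ZH")


def build_inputs(ref_audio, ref_text, target_text, language="English"):
    """Validate the four inputs from the steps above and return them in order."""
    if language not in LANGUAGES:
        raise ValueError(f"language must be one of {LANGUAGES}, got {language!r}")
    ref_text = ref_text.strip()
    target_text = target_text.strip()
    if not ref_text:
        raise ValueError("reference transcript is empty")
    if not target_text:
        raise ValueError("target text is empty")
    return str(Path(ref_audio)), ref_text, target_text, language


def synthesize(ref_audio, ref_text, target_text, language="English"):
    """Send one request to the Space and return whatever the endpoint returns."""
    # Imported lazily so build_inputs works without gradio_client installed.
    from gradio_client import Client, handle_file

    audio, ref_text, target_text, language = build_inputs(
        ref_audio, ref_text, target_text, language
    )
    client = Client(SPACE_ID)
    # ASSUMPTION: argument order mirrors the UI (audio, transcript, text, language).
    return client.predict(
        handle_file(audio), ref_text, target_text, language, api_name=API_NAME
    )
```

Calling `synthesize("ref.wav", "what the clip says", "text to speak")` needs network access and a running Space. `build_inputs` can be used on its own to catch an empty transcript or an unsupported language before a request is sent.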

	## Hardware

	Inference takes a couple of seconds per sentence on a single H100 GPU. The model
	is about 1.6 GB, plus the MaskGCT codec, so choose at least a small GPU runtime.
	Weights are downloaded automatically from
	[`ydqmkkx/GibbsTTS`](https://huggingface.co/ydqmkkx/GibbsTTS) on the first run.
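To avoid the download delay on the first request, the weights can be fetched ahead of time. This is a rough sketch using `huggingface_hub.snapshot_download`; whether `app.py` uses the same call is an assumption, and the helper names `has_free_space` and `prefetch_weights` are hypothetical. The size constant is the approximate figure quoted above and does not include the MaskGCT codec.

```python
import shutil

REPO_ID = "ydqmkkx/GibbsTTS"  # weights repository named in this README
MODEL_SIZE_GB = 1.6           # approximate model size; the codec is extra


def has_free_space(path=".", needed_gb=MODEL_SIZE_GB):
    """Return True if the filesystem holding `path` has at least `needed_gb` free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= needed_gb


def prefetch_weights(cache_dir=None):
    """Download the weights repository into the local cache and return its path."""
    # Imported lazily so the disk check works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    if not has_free_space(cache_dir or "."):
        raise RuntimeError(f"less than {MODEL_SIZE_GB} GB free for {REPO_ID}")
    return snapshot_download(repo_id=REPO_ID, cache_dir=cache_dir)
```

Running `prefetch_weights()` once, for example in a build step, fills the local Hugging Face cache so that later loads of the same repository do not need to download it again.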

	## Citation

	```bibtex
	@article{GibbsTTS,
	author = {Dong Yang and Yiyi Cai and Haoyu Zhang and Yuki Saito and Hiroshi Saruwatari},
	title = {Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech},
	year = {2026},
	journal = {arXiv preprint arXiv:2605.09386},
	}

	@inproceedings{MaskGCT,
	author = {Yuancheng Wang and Haoyue Zhan and Liwei Liu and Ruihong Zeng and Haotian Guo and Jiachen Zheng and Qiang Zhang and Xueyao Zhang and Shunsi Zhang and Zhizheng Wu},
	title = {{MaskGCT}: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer},
	year = {2025},
	booktitle = {International Conference on Learning Representations (ICLR)},
	}
	```