---
title: GibbsTTS Demo
emoji: 🎙️
colorFrom: indigo
colorTo: gray
sdk: gradio
sdk_version: "5.49.1"
python_version: "3.10"
app_file: app.py
pinned: false
license: mit
short_description: Zero-shot voice cloning TTS (EN/ZH) - GibbsTTS demo
models:
  - ydqmkkx/GibbsTTS
tags:
  - tts
  - text-to-speech
  - voice-cloning
  - zero-shot
  - english
  - chinese
  - flow-matching
---

# GibbsTTS: Zero-Shot Voice Cloning TTS

A Hugging Face Space for **GibbsTTS**, a zero-shot text-to-speech model
based on metric-induced discrete flow matching with the proposed
kinetic-optimal scheduler and finite-step CTMC moment correction.

- Paper: <https://arxiv.org/abs/2605.09386>
- Code: <https://github.com/ydqmkkx/GibbsTTS>
- Weights: <https://huggingface.co/ydqmkkx/GibbsTTS>

## How to use

1. **Reference audio**: upload (or record) a short clip of the voice you want
   to clone. A few seconds is enough.
2. **Reference transcript**: type exactly what the reference clip says.
3. **Target text**: the sentence you want the model to speak in that voice.
4. **Language**: choose `English`, `Chinese (Mandarin)`, or `Mixed EN/ZH`.
5. Click **Synthesize**.

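
The same inputs can also be prepared and sent from a script. The sketch below is only an illustration: it assumes the Space exposes its **Synthesize** action as a Gradio API endpoint, and the endpoint name `/synthesize`, the argument order, and the helper names `build_request` and `synthesize_remote` are assumptions rather than anything taken from the Space's `app.py`. The Space id is left as a parameter.

```python
from dataclasses import dataclass

# Language choices as listed in step 4 above.
LANGUAGES = ("English", "Chinese (Mandarin)", "Mixed EN/ZH")


@dataclass(frozen=True)
class SynthesisRequest:
    ref_audio_path: str
    ref_transcript: str
    target_text: str
    language: str


def build_request(ref_audio_path, ref_transcript, target_text, language="English"):
    """Validate and normalize the inputs before sending them anywhere."""
    if language not in LANGUAGES:
        raise ValueError(f"language must be one of {LANGUAGES}, got {language!r}")
    ref_transcript = ref_transcript.strip()
    target_text = target_text.strip()
    if not ref_transcript:
        raise ValueError("reference transcript must not be empty")
    if not target_text:
        raise ValueError("target text must not be empty")
    return SynthesisRequest(ref_audio_path, ref_transcript, target_text, language)


def synthesize_remote(space_id, request, api_name="/synthesize"):
    """Hypothetical remote call; api_name and argument order are assumptions."""
    # Imported lazily so the validation above works without the client installed.
    from gradio_client import Client, handle_file  # pip install gradio_client

    client = Client(space_id)
    return client.predict(
        handle_file(request.ref_audio_path),
        request.ref_transcript,
        request.target_text,
        request.language,
        api_name=api_name,
    )
```

Before relying on the assumed endpoint, `Client(space_id).view_api()` prints the endpoint names and parameters the Space actually exposes.
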
The model was trained on
[Emilia-en/zh](https://huggingface.co/datasets/amphion/Emilia-Dataset), so it
supports English and Mandarin. The mixed mode is experimental and provided
for fun.
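
When scripting around the demo, the **Language** choice can be guessed from the target text. The function below is a rough heuristic, not the model's own language handling: it treats any character in the CJK Unified Ideographs block (U+4E00 to U+9FFF) as Chinese and any ASCII letter as English. The name `guess_language` is hypothetical.

```python
def guess_language(text: str) -> str:
    """Return one of the demo's three language labels for `text`."""
    has_cjk = any("\u4e00" <= ch <= "\u9fff" for ch in text)
    has_latin = any(ch.isascii() and ch.isalpha() for ch in text)
    if has_cjk and has_latin:
        return "Mixed EN/ZH"
    if has_cjk:
        return "Chinese (Mandarin)"
    # Falls back to English, including for text with no letters at all.
    return "English"
```

A heuristic like this only picks a starting value; the final choice in the interface still belongs to the user.
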

## Hardware

Inference is fast on a single GPU (a couple of seconds per sentence on an
H100). The model is ~1.6 GB plus the MaskGCT codec, so choose at least a small
GPU runtime. Weights are downloaded automatically from
[`ydqmkkx/GibbsTTS`](https://huggingface.co/ydqmkkx/GibbsTTS) on the first run.
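
The download-on-first-run behaviour can be sketched as follows. This is an illustration, not the Space's actual loading code: `ensure_weights` is a hypothetical helper, and it assumes the `huggingface_hub` package, whose `snapshot_download` function fetches a repository into the local cache and returns its path. The downloader is injectable so the caching logic can be exercised offline.

```python
from typing import Callable, Dict, Optional

REPO_ID = "ydqmkkx/GibbsTTS"

# In-process cache: repo id -> local directory holding the downloaded files.
_weights_dirs: Dict[str, str] = {}


def _hub_download(repo_id: str) -> str:
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download  # pip install huggingface_hub

    return snapshot_download(repo_id=repo_id)


def ensure_weights(
    repo_id: str = REPO_ID,
    downloader: Optional[Callable[[str], str]] = None,
) -> str:
    """Download the weights on the first call, then reuse the stored path."""
    if repo_id not in _weights_dirs:
        fetch = downloader or _hub_download
        _weights_dirs[repo_id] = fetch(repo_id)
    return _weights_dirs[repo_id]
```

`snapshot_download` also reuses files already present in its on-disk cache, so the in-process dictionary mainly avoids repeated lookups within one run.
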

## Citation

```bibtex
@article{GibbsTTS,
  author  = {Dong Yang and Yiyi Cai and Haoyu Zhang and Yuki Saito and Hiroshi Saruwatari},
  title   = {Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech},
  year    = {2026},
  journal = {arXiv preprint arXiv:2605.09386},
}

@inproceedings{MaskGCT,
  author    = {Yuancheng Wang and Haoyue Zhan and Liwei Liu and Ruihong Zeng and Haotian Guo and Jiachen Zheng and Qiang Zhang and Xueyao Zhang and Shunsi Zhang and Zhizheng Wu},
  title     = {{MaskGCT}: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer},
  year      = {2025},
  booktitle = {International Conference on Learning Representations (ICLR)},
}
```