metadata
title: GibbsTTS Demo
emoji: 🎙️
colorFrom: indigo
colorTo: gray
sdk: gradio
sdk_version: 5.49.1
python_version: '3.10'
app_file: app.py
pinned: false
license: mit
short_description: Zero-shot voice cloning TTS (EN/ZH) - GibbsTTS demo
models:
- ydqmkkx/GibbsTTS
tags:
- tts
- text-to-speech
- voice-cloning
- zero-shot
- english
- chinese
- flow-matching
GibbsTTS: Zero-Shot Voice Cloning TTS
A Hugging Face Space for GibbsTTS, a zero-shot text-to-speech model based on metric-induced discrete flow matching, using the kinetic-optimal scheduler and finite-step CTMC (continuous-time Markov chain) moment correction proposed in the paper.
- Paper: https://arxiv.org/abs/2605.09386
- Code: https://github.com/ydqmkkx/GibbsTTS
- Weights: https://huggingface.co/ydqmkkx/GibbsTTS
How to use
- Reference audio: upload (or record) a short clip of the voice you want to clone. A few seconds is enough.
- Reference transcript: type exactly what the reference clip says.
- Target text: the sentence you want the model to speak in that voice.
- Language: choose English, Chinese (Mandarin), or Mixed EN/ZH.
- Click Synthesize.
The model was trained on Emilia-en/zh, so it supports English and Mandarin. The mixed mode is experimental and provided for fun.
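The same four inputs can also be sent programmatically. The sketch below is only an illustration: `CloneRequest` and `synthesize_via_space` are hypothetical helpers, not part of GibbsTTS, and the Space ID, the endpoint name, and the argument order are assumptions that should be checked with `Client(space_id).view_api()` before use. The language labels are copied from the list above; whether the endpoint accepts them as plain strings is also an assumption.

```python
from dataclasses import dataclass

# Language labels as listed in the steps above.
LANGUAGES = ("English", "Chinese (Mandarin)", "Mixed EN/ZH")


@dataclass(frozen=True)
class CloneRequest:
    """The four inputs the demo asks for (hypothetical container)."""

    ref_audio_path: str
    ref_text: str
    target_text: str
    language: str = "English"

    def __post_init__(self):
        # Fail early on inputs the demo would not be able to use.
        if self.language not in LANGUAGES:
            raise ValueError(
                f"language must be one of {LANGUAGES}, got {self.language!r}"
            )
        if not self.ref_text.strip():
            raise ValueError("reference transcript must not be empty")
        if not self.target_text.strip():
            raise ValueError("target text must not be empty")


def synthesize_via_space(space_id: str, request: CloneRequest, api_name: str):
    """Send a request to a running Space through gradio_client.

    Assumptions: `space_id` and `api_name` are supplied by the caller, and the
    endpoint takes its inputs in the same order as the UI. Neither is confirmed
    by this README.
    """
    # Imported lazily so the validation above works without the dependency.
    from gradio_client import Client, handle_file  # pip install gradio_client

    client = Client(space_id)
    return client.predict(
        handle_file(request.ref_audio_path),
        request.ref_text,
        request.target_text,
        request.language,
        api_name=api_name,
    )
```

Validating the request locally before calling the Space avoids spending a GPU slot on a request that would be rejected anyway, such as an unsupported language label or an empty transcript.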
Hardware
Inference is fast on a single GPU (a couple of seconds per sentence on an
H100). The model is about 1.6 GB, plus the MaskGCT codec, so choose at least
a small GPU runtime. Weights are downloaded automatically from
ydqmkkx/GibbsTTS on the first run.
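To avoid the download delay on the first request, the weights can be fetched ahead of time. This is a minimal sketch that assumes the app loads its files through the standard Hugging Face cache, which this README does not confirm. `prefetch_weights` and `dir_size_gb` are hypothetical helper names; `snapshot_download` is the real `huggingface_hub` function. Whether the MaskGCT codec comes from the same repository is not stated here, so the sketch covers only ydqmkkx/GibbsTTS.

```python
import os

REPO_ID = "ydqmkkx/GibbsTTS"  # weights repository named in this README


def prefetch_weights(repo_id: str = REPO_ID) -> str:
    """Download the repository snapshot into the local cache and return its path."""
    # Imported lazily so dir_size_gb can be used without huggingface_hub installed.
    from huggingface_hub import snapshot_download  # pip install huggingface_hub

    return snapshot_download(repo_id=repo_id)


def dir_size_gb(path: str) -> float:
    """Total size of all files under `path`, in gigabytes (1 GB = 1e9 bytes)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            # getsize follows symlinks, so cache entries linked to blobs are counted.
            total += os.path.getsize(os.path.join(root, name))
    return total / 1e9


if __name__ == "__main__":
    local_path = prefetch_weights()
    print(f"Snapshot at {local_path}: {dir_size_gb(local_path):.2f} GB")
```

Comparing the reported size with the figure quoted above is a quick way to confirm that the download completed before choosing a runtime.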
Citation
@article{GibbsTTS,
author = {Dong Yang and Yiyi Cai and Haoyu Zhang and Yuki Saito and Hiroshi Saruwatari},
title = {Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech},
year = {2026},
journal = {arXiv preprint arXiv:2605.09386},
}
@inproceedings{MaskGCT,
author = {Yuancheng Wang and Haoyue Zhan and Liwei Liu and Ruihong Zeng and Haotian Guo and Jiachen Zheng and Qiang Zhang and Xueyao Zhang and Shunsi Zhang and Zhizheng Wu},
title = {{MaskGCT}: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer},
year = {2025},
booktitle = {International Conference on Learning Representations (ICLR)},
}