---
title: GibbsTTS Demo
emoji: 🎙️
colorFrom: indigo
colorTo: gray
sdk: gradio
sdk_version: "5.49.1"
python_version: "3.10"
app_file: app.py
pinned: false
license: mit
short_description: Zero-shot voice cloning TTS (EN/ZH) — GibbsTTS demo
models:
  - ydqmkkx/GibbsTTS
tags:
  - tts
  - text-to-speech
  - voice-cloning
  - zero-shot
  - english
  - chinese
  - flow-matching
---

# GibbsTTS — Zero-Shot Voice Cloning TTS

A Hugging Face Space for **GibbsTTS**, a zero-shot text-to-speech model based on metric-induced discrete flow matching with the proposed kinetic-optimal scheduler and finite-step CTMC moment correction.

- 📄 Paper:
- 💻 Code:
- 🎛️ Weights:

## How to use

1. **Reference audio** — upload (or record) a short clip of the voice you want to clone. A few seconds is enough.
2. **Reference transcript** — type exactly what the reference clip says.
3. **Target text** — the sentence you want the model to speak in that voice.
4. **Language** — choose `English`, `Chinese (Mandarin)`, or `Mixed EN/ZH`.
5. Click **Synthesize**.

The model was trained on [Emilia-en/zh](https://huggingface.co/datasets/amphion/Emilia-Dataset), so it supports English and Mandarin. The mixed mode is experimental and provided for fun.

## Hardware

Inference is fast on a single GPU (a couple of seconds per sentence on an H100). The model is ~1.6 GB plus the MaskGCT codec — choose at least a small GPU runtime.

Weights are downloaded automatically from [`ydqmkkx/GibbsTTS`](https://huggingface.co/ydqmkkx/GibbsTTS) on the first run.
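
If you would rather fetch the weights ahead of time than wait for the first run, a minimal sketch using `huggingface_hub.snapshot_download` might look like the following. The helper name `fetch_weights` is hypothetical and not part of this Space. Whether `app.py` reads from the same cache location is an assumption, and the MaskGCT codec weights are not covered here.

```python
from pathlib import Path
from typing import Optional

# Model repo named in this README.
REPO_ID = "ydqmkkx/GibbsTTS"


def fetch_weights(repo_id: str = REPO_ID, cache_dir: Optional[str] = None) -> Path:
    """Download the model repo (or reuse the cached copy) and return its local path.

    Hypothetical helper for illustration; the Space's own loading code may differ.
    """
    # Imported inside the function so the helper can be defined even when
    # huggingface_hub is not installed yet.
    from huggingface_hub import snapshot_download

    # snapshot_download returns the local directory holding the repo files.
    # With cache_dir=None the default Hugging Face cache is used.
    local_dir = snapshot_download(repo_id=repo_id, cache_dir=cache_dir)
    return Path(local_dir)
```

Calling `fetch_weights()` once (for example from a setup script) should warm the local cache, so a later download of the same repo into the same cache is skipped.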

## Citation

```bibtex
@article{GibbsTTS,
  author  = {Dong Yang and Yiyi Cai and Haoyu Zhang and Yuki Saito and Hiroshi Saruwatari},
  title   = {Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech},
  year    = {2026},
  journal = {arXiv preprint arXiv:2605.09386},
}

@inproceedings{MaskGCT,
  author    = {Yuancheng Wang and Haoyue Zhan and Liwei Liu and Ruihong Zeng and Haotian Guo and Jiachen Zheng and Qiang Zhang and Xueyao Zhang and Shunsi Zhang and Zhizheng Wu},
  title     = {{MaskGCT}: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer},
  year      = {2025},
  booktitle = {International Conference on Learning Representations (ICLR)},
}
```