title: GibbsTTS Demo
emoji: 🎙️
colorFrom: indigo
colorTo: gray
sdk: gradio
sdk_version: 5.49.1
python_version: '3.10'
app_file: app.py
pinned: false
license: mit
short_description: Zero-shot voice cloning TTS (EN/ZH) - GibbsTTS demo
models:
  - ydqmkkx/GibbsTTS
tags:
  - tts
  - text-to-speech
  - voice-cloning
  - zero-shot
  - english
  - chinese
  - flow-matching

GibbsTTS: Zero-Shot Voice Cloning TTS

A Hugging Face Space for GibbsTTS, a zero-shot text-to-speech model based on metric-induced discrete flow matching, combined with the kinetic-optimal scheduler and finite-step CTMC (continuous-time Markov chain) moment correction proposed in the GibbsTTS paper.

How to use

  1. Reference audio: upload (or record) a short clip of the voice you want to clone. A few seconds is enough.
  2. Reference transcript: type exactly what the reference clip says.
  3. Target text: the sentence you want the model to speak in that voice.
  4. Language: choose English, Chinese (Mandarin), or Mixed EN/ZH.
  5. Click Synthesize.
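
The same four inputs can in principle be sent to the Space programmatically. The sketch below is a minimal illustration, not the Space's documented API: `SynthesisRequest`, `synthesize`, the `space_id` and `api_name` arguments, and the positional argument order are all assumptions. The language labels are copied from the list above and may not match the exact strings used by the UI. Only `Client`, `Client.predict`, and `handle_file` are actual `gradio_client` calls.

```python
from dataclasses import dataclass

# Language labels as written in the steps above; the exact UI strings are an assumption.
LANGUAGES = ("English", "Chinese (Mandarin)", "Mixed EN/ZH")


@dataclass(frozen=True)
class SynthesisRequest:
    """Hypothetical container for the four inputs the demo asks for."""

    reference_audio: str       # path to a short clip of the voice to clone
    reference_transcript: str  # exactly what the reference clip says
    target_text: str           # sentence to speak in the cloned voice
    language: str = "English"

    def __post_init__(self) -> None:
        # Reject empty fields early, before any network call is made.
        for name in ("reference_audio", "reference_transcript", "target_text"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} must not be empty")
        if self.language not in LANGUAGES:
            raise ValueError(f"language must be one of {LANGUAGES}")


def synthesize(request: SynthesisRequest, space_id: str, api_name: str):
    """Send the request to a Gradio Space.

    space_id, api_name, and the argument order are assumptions; check the
    Space's real endpoint signature before relying on this.
    """
    # Imported lazily so the validation above works without gradio_client installed.
    from gradio_client import Client, handle_file

    client = Client(space_id)
    return client.predict(
        handle_file(request.reference_audio),
        request.reference_transcript,
        request.target_text,
        request.language,
        api_name=api_name,
    )
```

Calling `Client(space_id).view_api()` prints the endpoints a Space actually exposes, which is the reliable way to confirm the endpoint name and parameter order before using `synthesize`.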

The model was trained on Emilia-en/zh, so it supports English and Mandarin. The mixed mode is experimental and provided for fun.

Hardware

Inference is fast on a single GPU (a couple of seconds per sentence on an H100). The model is ~1.6 GB plus the MaskGCT codec, so choose at least a small GPU runtime. Weights are downloaded automatically from ydqmkkx/GibbsTTS on the first run.
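
To avoid paying the download cost on the first request, the weights can be fetched ahead of time. The following is a minimal sketch assuming the `huggingface_hub` package is installed; `prefetch_weights` and its `download_fn` parameter are hypothetical helpers introduced here, not part of the Space's `app.py`. It covers only the ydqmkkx/GibbsTTS repository, not the MaskGCT codec.

```python
from typing import Callable, Optional

MODEL_REPO = "ydqmkkx/GibbsTTS"  # model repository named in this README


def prefetch_weights(
    repo_id: str = MODEL_REPO,
    download_fn: Optional[Callable[..., str]] = None,
) -> str:
    """Download a model repo (or reuse the cached copy) and return its local path.

    download_fn exists only so the helper can be exercised without network
    access; by default it is huggingface_hub.snapshot_download.
    """
    if download_fn is None:
        # Imported lazily so this module loads even without huggingface_hub installed.
        from huggingface_hub import snapshot_download

        download_fn = snapshot_download
    return download_fn(repo_id=repo_id)


if __name__ == "__main__":
    # Requires network access on the first run; later runs reuse the local cache.
    print(prefetch_weights())
```

`snapshot_download` stores files in the local Hugging Face cache, so a later run that requests the same repository should find them there instead of downloading again.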

Citation

@article{GibbsTTS,
 author    = {Dong Yang and Yiyi Cai and Haoyu Zhang and Yuki Saito and Hiroshi Saruwatari},
 title     = {Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech},
 year      = {2026},
 journal   = {arXiv preprint arXiv:2605.09386},
}

@inproceedings{MaskGCT,
 author    = {Yuancheng Wang and Haoyue Zhan and Liwei Liu and Ruihong Zeng and Haotian Guo and Jiachen Zheng and Qiang Zhang and Xueyao Zhang and Shunsi Zhang and Zhizheng Wu},
 title     = {{MaskGCT}: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer},
 year      = {2025},
 booktitle = {International Conference on Learning Representations (ICLR)},
}