---
title: GibbsTTS Demo
emoji: 🎙️
colorFrom: indigo
colorTo: gray
sdk: gradio
sdk_version: "5.49.1"
python_version: "3.10"
app_file: app.py
pinned: false
license: mit
short_description: Zero-shot voice cloning TTS (EN/ZH) - GibbsTTS demo
models:
  - ydqmkkx/GibbsTTS
tags:
  - tts
  - text-to-speech
  - voice-cloning
  - zero-shot
  - english
  - chinese
  - flow-matching
---

# GibbsTTS: Zero-Shot Voice Cloning TTS

A Hugging Face Space for **GibbsTTS**, a zero-shot text-to-speech model
based on metric-induced discrete flow matching, with a kinetic-optimal
scheduler and finite-step CTMC (continuous-time Markov chain) moment
correction.

- 📄 Paper: <https://arxiv.org/abs/2605.09386>
- 💻 Code: <https://github.com/ydqmkkx/GibbsTTS>
- 🎛️ Weights: <https://huggingface.co/ydqmkkx/GibbsTTS>

## How to use

1. **Reference audio**: upload (or record) a short clip of the voice you want
   to clone. A few seconds is enough.
2. **Reference transcript**: type exactly what the reference clip says.
3. **Target text**: the sentence you want the model to speak in that voice.
4. **Language**: choose `English`, `Chinese (Mandarin)`, or `Mixed EN/ZH`.
5. Click **Synthesize**.

The model was trained on
[Emilia-en/zh](https://huggingface.co/datasets/amphion/Emilia-Dataset), so it
supports English and Mandarin. The mixed mode is experimental and provided
for fun.
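
The steps in this section can also be driven from Python. The following is a minimal sketch, not the Space's documented API: `SPACE_ID`, `API_NAME`, the helper names, and the argument order are all hypothetical placeholders, and the real values should be taken from the Space's "Use via API" panel. Only `build_inputs` runs offline. `synthesize_remote` assumes the `gradio_client` package is installed and that network access is available.

```python
from typing import Tuple

# Language labels as listed in step 4 of this README.
LANGUAGES = ("English", "Chinese (Mandarin)", "Mixed EN/ZH")

# Hypothetical placeholders: replace with the real Space ID and endpoint name.
SPACE_ID = "<user>/<space-name>"
API_NAME = "/synthesize"


def build_inputs(ref_audio: str, ref_text: str, target_text: str,
                 language: str = "English") -> Tuple[str, str, str, str]:
    """Validate the four inputs and return them in the order of the steps."""
    if language not in LANGUAGES:
        raise ValueError(f"language must be one of {LANGUAGES}, got {language!r}")
    ref_text, target_text = ref_text.strip(), target_text.strip()
    if not ref_audio:
        raise ValueError("a reference audio path is required")
    if not ref_text:
        raise ValueError("the reference transcript must not be empty")
    if not target_text:
        raise ValueError("the target text must not be empty")
    return ref_audio, ref_text, target_text, language


def synthesize_remote(ref_audio: str, ref_text: str, target_text: str,
                      language: str = "English"):
    """Send one request to the Space (needs network access and gradio_client)."""
    # Imported lazily so build_inputs stays usable without gradio_client.
    from gradio_client import Client, handle_file

    audio, ref, target, lang = build_inputs(ref_audio, ref_text, target_text, language)
    client = Client(SPACE_ID)
    # The argument order below is an assumption, not taken from app.py.
    return client.predict(handle_file(audio), ref, target, lang, api_name=API_NAME)
```

Validating the inputs locally before the request catches an empty transcript or an unsupported language label without spending GPU time on the Space.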

## Hardware

Inference is fast on a single GPU (a couple of seconds per sentence on an
H100). The model is ~1.6 GB plus the MaskGCT codec, so choose at least a small
GPU runtime. Weights are downloaded automatically from
[`ydqmkkx/GibbsTTS`](https://huggingface.co/ydqmkkx/GibbsTTS) on the first run.
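
To fetch the weights ahead of the first run (for example while building a container image), a minimal sketch using `huggingface_hub.snapshot_download` might look like the following. The helper name `fetch_weights` is hypothetical, and the demo's own `app.py` may load the files differently. The sketch covers only the `ydqmkkx/GibbsTTS` repository, not the MaskGCT codec.

```python
from pathlib import Path
from typing import Optional

REPO_ID = "ydqmkkx/GibbsTTS"  # weights repository named in this README


def fetch_weights(repo_id: str = REPO_ID, cache_dir: Optional[str] = None) -> Path:
    """Download the model repository, or reuse the locally cached copy."""
    # Imported lazily so this module can be imported without huggingface_hub.
    from huggingface_hub import snapshot_download

    # snapshot_download returns the local folder that holds the repository files.
    local_dir = snapshot_download(repo_id=repo_id, cache_dir=cache_dir)
    return Path(local_dir)
```

Calling `fetch_weights()` once populates the Hub cache, so later runs that request the same repository can reuse the files instead of downloading them again.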

## Citation

```bibtex
@article{GibbsTTS,
 author    = {Dong Yang and Yiyi Cai and Haoyu Zhang and Yuki Saito and Hiroshi Saruwatari},
 title     = {Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech},
 year      = {2026},
 journal   = {arXiv preprint arXiv:2605.09386},
}

@inproceedings{MaskGCT,
 author    = {Yuancheng Wang and Haoyue Zhan and Liwei Liu and Ruihong Zeng and Haotian Guo and Jiachen Zheng and Qiang Zhang and Xueyao Zhang and Shunsi Zhang and Zhizheng Wu},
 title     = {{MaskGCT}: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer},
 year      = {2025},
 booktitle = {International Conference on Learning Representations (ICLR)},
}
```