---
license: apache-2.0
language:
- th
- en
base_model:
- SWivid/F5-TTS
pipeline_tag: text-to-speech
---
# JaiTTS-F5TTS: Thai Voice Cloning Model Research Prototype 

[![Paper](https://img.shields.io/badge/Paper-link-green.svg)](https://arxiv.org/pdf/2604.27607)
[![Website](https://img.shields.io/badge/Website-JAI-orange.svg)](https://jts.co.th/jai/)
[![GitHub](https://img.shields.io/badge/GitHub-repository-black.svg)](https://github.com/JTS-AI-Team/JaiTTS)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

<img src="JaiTTS-logo.jpg" width="313"/>

**JaiTTS-F5TTS** is a non-autoregressive voice cloning model in the JaiTTS family, built on [F5-TTS](https://huggingface.co/SWivid/F5-TTS). It targets zero-shot voice cloning for Thai.

> **Research prototype:** JaiTTS-F5TTS is one of our experimental variants within the JaiTTS project. It is released for research and benchmarking only.

## Highlights

- F5-TTS-based non-autoregressive voice cloning for Thai
- Duration predictor for improved pacing and intelligibility
- Fast synthesis with Real-Time Factor (RTF) below `0.2`

## Duration Modeling

The original F5-TTS duration estimate uses a UTF-8 byte-ratio formula. This is brittle for Thai and mixed-script input because Thai characters, English words, Arabic numerals, and punctuation do not have a consistent byte-to-pronunciation relationship. In practice, the mismatch can produce rushed, compressed, or unstable speech.
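To see why byte counts are a poor proxy for spoken length, compare the UTF-8 sizes of Thai and English text of comparable spoken duration. The sketch below uses a simplified byte-ratio heuristic for illustration; the exact formula in F5-TTS may differ in details.

```python
# Simplified byte-ratio duration heuristic (illustrative only; the
# exact F5-TTS formula may differ). Thai characters occupy 3 bytes
# each in UTF-8 while ASCII characters occupy 1, so byte counts do
# not track pronunciation length across scripts.

def byte_ratio_duration(ref_dur_sec, ref_text, gen_text, speed=1.0):
    """Estimate generated-speech duration from UTF-8 byte lengths."""
    ref_bytes = len(ref_text.encode("utf-8"))
    gen_bytes = len(gen_text.encode("utf-8"))
    return ref_dur_sec * gen_bytes / (ref_bytes * speed)

thai = "สวัสดี"       # 6 Thai code points -> 18 UTF-8 bytes
english = "hello"     # 5 ASCII characters -> 5 UTF-8 bytes

print(len(thai.encode("utf-8")))     # 18
print(len(english.encode("utf-8")))  # 5

# Despite comparable spoken lengths, the heuristic predicts the Thai
# text to take 3.6x as long as the English reference:
print(byte_ratio_duration(3.0, english, thai))  # 10.8
```

The same distortion appears with Arabic numerals ("100" is 3 bytes but three or more syllables when read aloud) and punctuation, which contributes bytes but no speech.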

We address this with an XLM-R-based neural duration predictor that estimates target duration from text more robustly than the UTF-8 byte-ratio baseline.

The data used to train and evaluate the duration predictor is sampled from the JaiTTS-v1.0 training set.

### Duration Predictor Architecture

The duration predictor uses [XLM-R base](https://huggingface.co/FacebookAI/xlm-roberta-base) as the text encoder. Text representations are aggregated with masked mean pooling, then passed to a regression head composed of linear layers with GELU activation and dropout. The predictor also uses log-transformed syllable counts as an auxiliary feature, which provides a more pronunciation-aware signal than byte length for Thai and mixed-script text.
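The components above can be sketched as follows. This is an illustrative reconstruction with assumed dimensions (XLM-R base hidden size 768, an assumed head width of 256) and random placeholder weights standing in for the trained encoder and head; it is not the released implementation.

```python
import numpy as np
from math import erf, sqrt

def gelu(x):
    # Exact GELU: 0.5 * x * (1 + erf(x / sqrt(2)))
    return 0.5 * x * (1.0 + np.vectorize(erf)(x / sqrt(2.0)))

def masked_mean_pool(hidden, mask):
    """Average token vectors, ignoring padding positions.

    hidden: (seq_len, dim) token representations
    mask:   (seq_len,) 1 for real tokens, 0 for padding
    """
    m = mask[:, None].astype(hidden.dtype)
    return (hidden * m).sum(axis=0) / m.sum()

rng = np.random.default_rng(0)
dim, head = 768, 256                   # XLM-R base is 768-d; head width assumed
hidden = rng.normal(size=(10, dim))    # stand-in for XLM-R encoder outputs
mask = np.array([1] * 7 + [0] * 3)     # 7 real tokens, 3 padding

pooled = masked_mean_pool(hidden, mask)
log_syllables = np.log1p(12.0)         # auxiliary feature, e.g. log(1 + syllable count)
features = np.concatenate([pooled, [log_syllables]])  # shape (769,)

# Regression head: Linear -> GELU -> (dropout at train time) -> Linear
W1 = rng.normal(scale=0.02, size=(head, dim + 1)); b1 = np.zeros(head)
W2 = rng.normal(scale=0.02, size=(1, head));       b2 = np.zeros(1)
duration = W2 @ gelu(W1 @ features + b1) + b2      # predicted duration (seconds)
```

Masked mean pooling ensures padding tokens do not dilute the sentence representation, and the syllable-count feature gives the head a pronunciation-aware signal independent of byte length.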

### Duration Prediction Metrics

Errors are reported in seconds. Lower is better.

- `MAE`: Mean absolute error across all samples.
- `p50 Error`: The 50th-percentile absolute error.
- `p90 Error`: The 90th-percentile absolute error.
- `p95 Error`: The 95th-percentile absolute error.

| Model | MAE ↓ | p50 Error ↓ | p90 Error ↓ | p95 Error ↓ |
| :-- | --: | --: | --: | --: |
| F5-TTS UTF-8 baseline | 1.7064 | 1.0987 | 4.0461 | 5.3914 |
| **XLM-R predictor** | **1.0924** | **0.7118** | **2.6319** | **3.4425** |
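For reference, the metrics in the table can be computed from per-sample absolute errors (in seconds) as follows:

```python
import numpy as np

def duration_error_metrics(pred_sec, true_sec):
    """MAE and percentile absolute errors, in seconds. Lower is better."""
    err = np.abs(np.asarray(pred_sec, dtype=float) - np.asarray(true_sec, dtype=float))
    return {
        "MAE": err.mean(),
        "p50": np.percentile(err, 50),
        "p90": np.percentile(err, 90),
        "p95": np.percentile(err, 95),
    }

# Toy example with absolute errors of 1..5 seconds:
m = duration_error_metrics([1, 2, 3, 4, 5], [0, 0, 0, 0, 0])
print(m)  # MAE=3.0, p50=3.0, p90=4.6, p95=4.8
```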

## Benchmark Results

### Objective Evaluation

Objective evaluation is measured on the same benchmark used in the paper: [JaiTTS: A Thai Voice Cloning Model](https://arxiv.org/pdf/2604.27607). Results can be reproduced using the benchmark instructions in the [GitHub repository](https://github.com/JTS-AI-Team/JaiTTS).

| Model | Short CER (%) ↓ | Short SIM ↑ | Long CER (%) ↓ | Long SIM ↑ |
| :-- | --: | --: | --: | --: |
| ThonburianTTS | 6.26 | 0.48 | -- | -- |
| JaiTTS-F5TTS | 4.78 | 0.60 | 12.63 | **0.80** |
| JaiTTS-F5TTS + Duration Predictor | 4.26 | 0.58 | 11.57 | **0.80** |
| [JaiTTS-v1.0](https://arxiv.org/pdf/2604.27607) | **1.94** | **0.62** | **2.55** | 0.76 |

### Inference Speed

| Model | RTF ↓ |
| :-- | --: |
| ThonburianTTS | 0.1150 |
| JaiTTS-F5TTS | 0.1138 |
| JaiTTS-F5TTS + Duration Predictor | 0.1652 |
| [JaiTTS-v1.0](https://arxiv.org/pdf/2604.27607) | **0.1136** |
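RTF is synthesis wall-clock time divided by the duration of the generated audio, so values below 1.0 mean faster-than-real-time synthesis. A minimal computation, for clarity about how the table's numbers are read:

```python
def real_time_factor(synth_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of audio produced."""
    return synth_seconds / audio_seconds

# e.g. 2.3 s of compute producing 20 s of speech:
print(real_time_factor(2.3, 20.0))  # 0.115
```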

## Installation

The inference code and pipeline structure are adapted from the [ThonburianTTS](https://huggingface.co/biodatlab/ThonburianTTS) project by biodatlab.

### 1. Install Dependencies

```bash
pip install torch cached-path librosa transformers f5-tts
sudo apt install ffmpeg
```

### 2. Clone the Inference Codebase

This model uses the `flowtts` pipeline adapted from ThonburianTTS:

```bash
git clone https://github.com/biodatlab/thonburian-tts.git
cd thonburian-tts
```

## Quick Usage

Use the following snippet to run inference with the JaiTTS-F5TTS checkpoint. Ensure you are inside the `thonburian-tts` directory or have the `flowtts` module in your Python path.

```python
import torch
import soundfile as sf
from flowtts.inference import FlowTTSPipeline, ModelConfig, AudioConfig

model_config = ModelConfig(
    language="th",
    model_type="F5",
    checkpoint="hf://JTS-AI/JaiTTS-F5TTS/model.pt",
    vocab_file="hf://JTS-AI/JaiTTS-F5TTS/vocab.txt",
    vocoder="vocos",
    device="cuda" if torch.cuda.is_available() else "cpu"
)

audio_config = AudioConfig(
    silence_threshold=-45,
    cfg_strength=2.5,
    nfe_step=32,
    speed=1.0
)

pipeline = FlowTTSPipeline(model_config, audio_config)

audio, sr = pipeline.generate(
    reference_audio="path/to/reference.wav",
    reference_text="Transcription of the reference audio.",
    gen_text="สวัสดีครับ ยินดีที่ได้รู้จัก ผมคือ AI ที่สร้างโดย JTS"
)

sf.write("output.wav", audio, sr)
```

## Citation

If you find this work useful, please cite our paper:

```bibtex
@misc{karnjanaekarin2026jaittsthaivoicecloning,
      title={JaiTTS: A Thai Voice Cloning Model}, 
      author={Jullajak Karnjanaekarin and Pontakorn Trakuekul and Narongkorn Panitsrisit and Sumana Sumanakul and Vichayuth Nitayasomboon and Nithid Guntasin and Thanavin Denkavin and Attapol T. Rutherford},
      year={2026},
      eprint={2604.27607},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.27607}, 
}
```

## Acknowledgements

- Codebase adapted from [ThonburianTTS](https://github.com/biodatlab/thonburian-tts).