---
license: apache-2.0
language:
- th
- en
base_model:
- SWivid/F5-TTS
pipeline_tag: text-to-speech
---
# JaiTTS-F5TTS: Thai Voice Cloning Model Research Prototype
[Paper](https://arxiv.org/pdf/2604.27607) | [Jai by JTS](https://jts.co.th/jai/) | [GitHub](https://github.com/JTS-AI-Team/JaiTTS) | [License: Apache-2.0](https://opensource.org/licenses/Apache-2.0)
<img src="JaiTTS-logo.jpg" width="313"/>
**JaiTTS-F5TTS** is a non-autoregressive voice cloning model in the JaiTTS family, built on [F5-TTS](https://huggingface.co/SWivid/F5-TTS). It targets zero-shot voice cloning for Thai.
> **Research prototype:** JaiTTS-F5TTS is one of our experimental variants within the JaiTTS project. It is released for research and benchmarking only.
## Highlights
- F5-TTS-based non-autoregressive voice cloning for Thai
- Duration predictor for improved pacing and intelligibility
- Fast synthesis with Real-Time Factor (RTF) below `0.2`
## Duration Modeling
The original F5-TTS duration estimate uses a UTF-8 byte-ratio formula. This is brittle for Thai and mixed-script input because Thai characters, English words, Arabic numerals, and punctuation do not have a consistent byte-to-pronunciation relationship. In practice, the mismatch can produce rushed, compressed, or unstable speech.
We address this with an XLM-R-based neural duration predictor that estimates target duration from text more robustly than the UTF-8 byte-ratio baseline.
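To illustrate why the baseline is brittle, here is a simplified sketch of a byte-ratio heuristic of this kind (illustrative only, not the exact F5-TTS implementation): Thai characters occupy 3 bytes each in UTF-8 while ASCII characters occupy 1, so the byte ratio diverges sharply from the spoken-length ratio on mixed-script input.

```python
def estimate_duration_bytes(ref_audio_sec: float, ref_text: str, gen_text: str) -> float:
    """Byte-ratio duration heuristic (simplified sketch, not the exact F5-TTS code).

    Scales the reference audio duration by the ratio of generated-text
    UTF-8 bytes to reference-text UTF-8 bytes.
    """
    ref_bytes = len(ref_text.encode("utf-8"))
    gen_bytes = len(gen_text.encode("utf-8"))
    return ref_audio_sec * gen_bytes / max(ref_bytes, 1)

# "hello world" is 11 bytes; Thai "สวัสดี" is 6 characters but 18 bytes,
# so the heuristic's estimate is skewed relative to the spoken lengths.
estimate = estimate_duration_bytes(2.0, "hello world", "สวัสดี")
```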
The data used to train and evaluate the duration predictor is sampled from the JaiTTS-v1.0 training set.
### Duration Predictor Architecture
The duration predictor uses [XLM-R base](https://huggingface.co/FacebookAI/xlm-roberta-base) as the text encoder. Text representations are aggregated with masked mean pooling, then passed to a regression head composed of linear layers with GELU activation and dropout. The predictor also uses log-transformed syllable counts as an auxiliary feature, which provides a more pronunciation-aware signal than byte length for Thai and mixed-script text.
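The head described above can be sketched as follows in PyTorch. This is an illustrative sketch on top of the encoder outputs: the hidden sizes, layer count, and dropout rate are assumptions, not the released configuration, and in the full predictor the token states would come from XLM-R base.

```python
import torch
import torch.nn as nn

class DurationHead(nn.Module):
    """Regression head over pooled text features (illustrative sketch;
    dimensions and layer count are assumptions, not the released config)."""

    def __init__(self, hidden_dim: int = 768, mid_dim: int = 256, dropout: float = 0.1):
        super().__init__()
        # +1 input feature for the log-transformed syllable count
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim + 1, mid_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(mid_dim, 1),
        )

    def forward(self, token_states, attention_mask, syllable_counts):
        # Masked mean pooling: average only over non-padding tokens
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (token_states * mask).sum(1) / mask.sum(1).clamp(min=1.0)
        # Log-transformed syllable count as a pronunciation-aware auxiliary feature
        log_syl = torch.log1p(syllable_counts.float()).unsqueeze(-1)
        return self.mlp(torch.cat([pooled, log_syl], dim=-1)).squeeze(-1)

# Dummy encoder outputs stand in for XLM-R base hidden states
head = DurationHead()
states = torch.randn(2, 16, 768)            # (batch, tokens, hidden)
mask = torch.ones(2, 16, dtype=torch.long)  # no padding in this toy batch
out = head(states, mask, torch.tensor([12, 7]))
print(out.shape)  # torch.Size([2]): one duration estimate per sample
```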
### Duration Prediction Metrics
Errors are reported in seconds. Lower is better.
- `MAE`: Mean absolute error across all samples.
- `p50 Error`: The 50th-percentile absolute error.
- `p90 Error`: The 90th-percentile absolute error.
- `p95 Error`: The 95th-percentile absolute error.
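The metrics above can be computed directly from per-sample absolute errors; a minimal sketch with NumPy (function name and toy values are illustrative):

```python
import numpy as np

def duration_error_metrics(pred_sec, true_sec):
    """MAE and percentile absolute errors, in seconds (lower is better)."""
    err = np.abs(np.asarray(pred_sec, dtype=float) - np.asarray(true_sec, dtype=float))
    return {
        "MAE": float(err.mean()),
        "p50": float(np.percentile(err, 50)),
        "p90": float(np.percentile(err, 90)),
        "p95": float(np.percentile(err, 95)),
    }

# Toy example: three predictions vs. ground-truth durations
metrics = duration_error_metrics([1.0, 2.5, 4.0], [1.2, 2.0, 5.0])
```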
| Model | MAE ↓ | p50 Error ↓ | p90 Error ↓ | p95 Error ↓ |
| :-- | --: | --: | --: | --: |
| F5-TTS UTF-8 baseline | 1.7064 | 1.0987 | 4.0461 | 5.3914 |
| **XLM-R predictor** | **1.0924** | **0.7118** | **2.6319** | **3.4425** |
## Benchmark Results
### Objective Evaluation
Objective evaluation is measured on the same benchmark used in the paper: [JaiTTS: A Thai Voice Cloning Model](https://arxiv.org/pdf/2604.27607). Results can be reproduced using the benchmark instructions in the [GitHub repository](https://github.com/JTS-AI-Team/JaiTTS).
| Model | Short CER (%) ↓ | Short SIM ↑ | Long CER (%) ↓ | Long SIM ↑ |
| :-- | --: | --: | --: | --: |
| ThonburianTTS | 6.26 | 0.48 | -- | -- |
| JaiTTS-F5TTS | 4.78 | 0.60 | 12.63 | **0.80** |
| JaiTTS-F5TTS + Duration Predictor | 4.26 | 0.58 | 11.57 | **0.80** |
| [JaiTTS-v1.0](https://arxiv.org/pdf/2604.27607) | **1.94** | **0.62** | **2.55** | 0.76 |
### Inference Speed
| Model | RTF ↓ |
| :-- | --: |
| ThonburianTTS | 0.1150 |
| JaiTTS-F5TTS | 0.1138 |
| JaiTTS-F5TTS + Duration Predictor | 0.1652 |
| [JaiTTS-v1.0](https://arxiv.org/pdf/2604.27607) | **0.1136** |
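Real-Time Factor is the wall-clock time spent synthesizing divided by the duration of the generated audio; values below 1.0 mean faster-than-real-time synthesis. A minimal sketch:

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock synthesis time / duration of generated audio.
    Below 1.0 means the model synthesizes faster than real time."""
    return synthesis_seconds / audio_seconds

# e.g. spending about 1.14 s to synthesize 10 s of audio gives RTF ~ 0.114,
# in the range reported in the table above
rtf = real_time_factor(1.14, 10.0)
```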
## Installation
The inference code and pipeline structure are adapted from the [ThonburianTTS](https://huggingface.co/biodatlab/ThonburianTTS) project by biodatlab.
### 1. Install Dependencies
```bash
pip install torch cached-path librosa soundfile transformers f5-tts
sudo apt install ffmpeg
```
### 2. Clone the Inference Codebase
This model uses the `flowtts` pipeline adapted from ThonburianTTS:
```bash
git clone https://github.com/biodatlab/thonburian-tts.git
cd thonburian-tts
```
## Quick Usage
Use the following snippet to run inference with the JaiTTS-F5TTS checkpoint. Ensure you are inside the `thonburian-tts` directory or have the `flowtts` module in your Python path.
```python
import torch
import soundfile as sf

from flowtts.inference import FlowTTSPipeline, ModelConfig, AudioConfig

model_config = ModelConfig(
    language="th",
    model_type="F5",
    checkpoint="hf://JTS-AI/JaiTTS-F5TTS/model.pt",
    vocab_file="hf://JTS-AI/JaiTTS-F5TTS/vocab.txt",
    vocoder="vocos",
    device="cuda" if torch.cuda.is_available() else "cpu",
)

audio_config = AudioConfig(
    silence_threshold=-45,  # dB threshold for trimming silence
    cfg_strength=2.5,       # classifier-free guidance strength
    nfe_step=32,            # number of function evaluations for the ODE solver
    speed=1.0,              # speed multiplier applied to the target duration
)

pipeline = FlowTTSPipeline(model_config, audio_config)

# Zero-shot cloning: the output mimics the voice in `reference_audio`
audio, sr = pipeline.generate(
    reference_audio="path/to/reference.wav",
    reference_text="Transcription of the reference audio.",
    # Thai: "Hello, nice to meet you. I am an AI built by JTS."
    gen_text="สวัสดีครับ ยินดีที่ได้รู้จัก ผมคือ AI ที่สร้างโดย JTS",
)

sf.write("output.wav", audio, sr)
```
## Citation
If you find this work useful, please cite our paper:
```bibtex
@misc{karnjanaekarin2026jaittsthaivoicecloning,
  title={JaiTTS: A Thai Voice Cloning Model},
  author={Jullajak Karnjanaekarin and Pontakorn Trakuekul and Narongkorn Panitsrisit and Sumana Sumanakul and Vichayuth Nitayasomboon and Nithid Guntasin and Thanavin Denkavin and Attapol T. Rutherford},
  year={2026},
  eprint={2604.27607},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.27607},
}
```
## Acknowledgements
- Codebase adapted from [ThonburianTTS](https://github.com/biodatlab/thonburian-tts).