JaiTTS-F5TTS: Thai Voice Cloning Model Research Prototype
JaiTTS-F5TTS is a non-autoregressive voice cloning model in the JaiTTS family, based on F5-TTS. It targets Thai zero-shot voice cloning.
Research prototype: JaiTTS-F5TTS is one of our experimental variants within the JaiTTS project. It is released for research and benchmarking only.
Highlights
- F5-TTS-based non-autoregressive voice cloning for Thai
- Duration predictor for improved pacing and intelligibility
- Fast synthesis with a Real-Time Factor (RTF) below 0.2
Duration Modeling
The original F5-TTS duration estimate uses a UTF-8 byte-ratio formula. This is brittle for Thai and mixed-script input because Thai characters, English words, Arabic numerals, and punctuation do not have a consistent byte-to-pronunciation relationship. In practice, the mismatch can produce rushed, compressed, or unstable speech.
We address this with an XLM-R-based neural duration predictor that estimates target duration from text more robustly than the UTF-8 byte-ratio baseline.
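To illustrate the failure mode, here is a minimal sketch of a byte-ratio duration estimate in the style of the F5-TTS baseline (simplified for illustration; the exact formula in F5-TTS differs in its details):

```python
def byte_ratio_duration(ref_audio_sec: float, ref_text: str, gen_text: str) -> float:
    """Scale the reference duration by the ratio of UTF-8 byte lengths."""
    ref_bytes = len(ref_text.encode("utf-8"))
    gen_bytes = len(gen_text.encode("utf-8"))
    return ref_audio_sec * gen_bytes / ref_bytes

# Thai characters occupy 3 UTF-8 bytes each while ASCII occupies 1,
# so byte counts are a poor proxy for spoken length in mixed-script text:
thai = "สวัสดีครับ"      # 10 characters -> 30 bytes
english = "hello there"   # 11 characters -> 11 bytes
```

A Thai phrase and an English phrase of similar spoken length can thus differ in byte count by roughly a factor of three, which skews the estimated target duration.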
The data used to train and evaluate the duration predictor is sampled from the JaiTTS-v1.0 training set.
Duration Predictor Architecture
The duration predictor uses XLM-R base as the text encoder. Text representations are aggregated with masked mean pooling, then passed to a regression head composed of linear layers with GELU activation and dropout. The predictor also uses log-transformed syllable counts as an auxiliary feature, which provides a more pronunciation-aware signal than byte length for Thai and mixed-script text.
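The architecture described above can be sketched roughly as follows (layer sizes, vocabulary size, and the stand-in embedding encoder are assumptions for a self-contained example; the actual model uses XLM-R base as the encoder):

```python
import torch
import torch.nn as nn

class DurationPredictor(nn.Module):
    """Sketch: masked-mean-pooled text encoding + log-syllable auxiliary feature."""
    def __init__(self, hidden: int = 768, head_dim: int = 256, dropout: float = 0.1):
        super().__init__()
        # Stand-in for XLM-R base so the sketch runs without downloads.
        self.encoder = nn.Embedding(1000, hidden)
        # +1 input dim for the log-transformed syllable count.
        self.head = nn.Sequential(
            nn.Linear(hidden + 1, head_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(head_dim, 1),
        )

    def forward(self, token_ids, attention_mask, syllable_counts):
        h = self.encoder(token_ids)                    # (B, T, H)
        mask = attention_mask.unsqueeze(-1).float()    # (B, T, 1)
        # Masked mean pooling over non-padding tokens.
        pooled = (h * mask).sum(1) / mask.sum(1).clamp(min=1)
        # Log-transformed syllable count as an auxiliary feature.
        aux = torch.log1p(syllable_counts.float()).unsqueeze(-1)
        return self.head(torch.cat([pooled, aux], dim=-1)).squeeze(-1)
```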
Duration Prediction Metrics
Errors are reported in seconds. Lower is better.
- MAE: Mean absolute error across all samples.
- p50 Error: The 50th-percentile absolute error.
- p90 Error: The 90th-percentile absolute error.
- p95 Error: The 95th-percentile absolute error.
| Model | MAE ↓ | p50 Error ↓ | p90 Error ↓ | p95 Error ↓ |
|---|---|---|---|---|
| F5-TTS UTF-8 baseline | 1.7064 | 1.0987 | 4.0461 | 5.3914 |
| XLM-R predictor | 1.0924 | 0.7118 | 2.6319 | 3.4425 |
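These metrics are computed directly from per-sample absolute errors; a small sketch with hypothetical values (not data from the evaluation above):

```python
import numpy as np

# Hypothetical per-sample predicted vs. ground-truth durations (seconds).
pred = np.array([2.1, 3.8, 5.5, 1.2])
true = np.array([2.0, 4.0, 5.0, 1.5])

err = np.abs(pred - true)                         # per-sample absolute error
mae = err.mean()                                  # MAE
p50, p90, p95 = np.percentile(err, [50, 90, 95])  # percentile errors
```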
Benchmark Results
Objective Evaluation
Objective evaluation is measured on the same benchmark used in the paper: JaiTTS: A Thai Voice Cloning Model. Results can be reproduced using the benchmark instructions in the GitHub repository.
| Model | Short CER (%) ↓ | Short SIM ↑ | Long CER (%) ↓ | Long SIM ↑ |
|---|---|---|---|---|
| ThonburianTTS | 6.26 | 0.48 | -- | -- |
| JaiTTS-F5TTS | 4.78 | 0.60 | 12.63 | 0.80 |
| JaiTTS-F5TTS + Duration Predictor | 4.26 | 0.58 | 11.57 | 0.80 |
| JaiTTS-v1.0 | 1.94 | 0.62 | 2.55 | 0.76 |
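CER is the character-level edit distance between the ASR transcript and the reference text, divided by the reference length. A minimal reference implementation (not the benchmark's exact scoring code, which may apply text normalization first):

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    # Single-row dynamic-programming edit distance.
    d = list(range(len(hyp) + 1))
    for i, rc in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, hc in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,            # deletion
                                   d[j - 1] + 1,        # insertion
                                   prev + (rc != hc))   # substitution
    return d[len(hyp)] / len(ref)
```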
Inference Speed
| Model | RTF ↓ |
|---|---|
| ThonburianTTS | 0.1150 |
| JaiTTS-F5TTS | 0.1138 |
| JaiTTS-F5TTS + Duration Predictor | 0.1652 |
| JaiTTS-v1.0 | 0.1136 |
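RTF here is wall-clock synthesis time divided by the duration of the generated audio; values below 1.0 mean faster-than-real-time synthesis:

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock synthesis time / duration of generated audio."""
    return synthesis_seconds / audio_seconds

# e.g. 1.138 s of compute to synthesize 10 s of speech.
rtf = real_time_factor(1.138, 10.0)
```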
Installation
This inference code and pipeline structure are adapted from the ThonburianTTS project by biodatlab.
1. Install Dependencies
```bash
pip install torch cached-path librosa transformers f5-tts
sudo apt install ffmpeg
```
2. Clone the Inference Codebase
This model uses the flowtts pipeline adapted from ThonburianTTS:
```bash
git clone https://github.com/biodatlab/thonburian-tts.git
cd thonburian-tts
```
Quick Usage
Use the following snippet to run inference with the JaiTTS-F5TTS checkpoint. Ensure you are inside the thonburian-tts directory or have the flowtts module in your Python path.
```python
import torch
import soundfile as sf
from flowtts.inference import FlowTTSPipeline, ModelConfig, AudioConfig

# Model configuration: checkpoint and vocab are loaded from the Hugging Face Hub.
model_config = ModelConfig(
    language="th",
    model_type="F5",
    checkpoint="hf://JTS-AI/JaiTTS-F5TTS/model.pt",
    vocab_file="hf://JTS-AI/JaiTTS-F5TTS/vocab.txt",
    vocoder="vocos",
    device="cuda" if torch.cuda.is_available() else "cpu",
)

# Audio/synthesis configuration.
audio_config = AudioConfig(
    silence_threshold=-45,
    cfg_strength=2.5,
    nfe_step=32,
    speed=1.0,
)

pipeline = FlowTTSPipeline(model_config, audio_config)

audio, sr = pipeline.generate(
    reference_audio="path/to/reference.wav",
    reference_text="Transcription of the reference audio.",
    gen_text="สวัสดีครับ ยินดีที่ได้รู้จัก ผมคือ AI ที่สร้างโดย JTS",
)

sf.write("output.wav", audio, sr)
```
Citation
If you find this work useful, please cite our paper:
```bibtex
@misc{karnjanaekarin2026jaittsthaivoicecloning,
      title={JaiTTS: A Thai Voice Cloning Model},
      author={Jullajak Karnjanaekarin and Pontakorn Trakuekul and Narongkorn Panitsrisit and Sumana Sumanakul and Vichayuth Nitayasomboon and Nithid Guntasin and Thanavin Denkavin and Attapol T. Rutherford},
      year={2026},
      eprint={2604.27607},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.27607},
}
```
Acknowledgements
- Codebase adapted from ThonburianTTS.
Base Model
- SWivid/F5-TTS