---
license: apache-2.0
language:
- th
- en
base_model:
- SWivid/F5-TTS
pipeline_tag: text-to-speech
---
# JaiTTS-F5TTS: Thai Voice Cloning Model Research Prototype

[Paper](https://arxiv.org/pdf/2604.27607)
[Project Page](https://jts.co.th/jai/)
[GitHub](https://github.com/JTS-AI-Team/JaiTTS)
[License: Apache-2.0](https://opensource.org/licenses/Apache-2.0)

<img src="JaiTTS-logo.jpg" width="313"/>

**JaiTTS-F5TTS** is a non-autoregressive voice cloning model based on [F5-TTS](https://huggingface.co/SWivid/F5-TTS). It targets zero-shot voice cloning for Thai.

> **Research prototype:** JaiTTS-F5TTS is one of our experimental variants within the JaiTTS project. It is released for research and benchmarking only.

## Highlights

- Non-autoregressive, F5-TTS-based voice cloning for Thai
- Neural duration predictor for improved pacing and intelligibility
- Fast synthesis, with a Real-Time Factor (RTF) below `0.2`

## Duration Modeling

The original F5-TTS duration estimate uses a UTF-8 byte-ratio formula. This is brittle for Thai and mixed-script input because Thai characters, English words, Arabic numerals, and punctuation do not have a consistent byte-to-pronunciation relationship. In practice, the mismatch can produce rushed, compressed, or unstable speech.
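
For reference, the byte-ratio baseline can be sketched as below. This is a paraphrase of the F5-TTS inference heuristic with illustrative names, not the exact released code:

```python
def estimate_duration_bytes(ref_audio_len: int, ref_text: str,
                            gen_text: str, speed: float = 1.0) -> int:
    """UTF-8 byte-ratio duration estimate in the style of F5-TTS:
    scale the reference length by the byte-length ratio of the texts."""
    ref_bytes = len(ref_text.encode("utf-8"))
    gen_bytes = len(gen_text.encode("utf-8"))
    return ref_audio_len + int(ref_audio_len / ref_bytes * gen_bytes / speed)

# A Thai character occupies 3 UTF-8 bytes, an ASCII character only 1,
# so equal character counts can yield very different byte counts.
thai = "สวัสดีครับ"   # 10 Thai characters -> 30 bytes
latin = "hello sir!"   # 10 ASCII characters -> 10 bytes
print(len(thai.encode("utf-8")), len(latin.encode("utf-8")))  # 30 10
```

Because the byte-to-pronunciation ratio differs per script, the estimate over- or under-shoots whenever the reference and target texts mix scripts differently.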

We address this with an XLM-R-based neural duration predictor that estimates target duration from text more robustly than the UTF-8 byte-ratio baseline.

The data used to train and evaluate the duration predictor is sampled from the JaiTTS-v1.0 training set.

### Duration Predictor Architecture

The duration predictor uses [XLM-R base](https://huggingface.co/FacebookAI/xlm-roberta-base) as the text encoder. Text representations are aggregated with masked mean pooling, then passed to a regression head composed of linear layers with GELU activation and dropout. The predictor also uses log-transformed syllable counts as an auxiliary feature, which provides a more pronunciation-aware signal than byte length for Thai and mixed-script text.
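
A minimal PyTorch sketch of this head, assuming the XLM-R encoder has already produced token hidden states; the layer widths and dropout rate here are illustrative, not the released hyperparameters:

```python
import torch
import torch.nn as nn

class DurationHead(nn.Module):
    """Regression head over encoder hidden states: masked mean pooling,
    then linear layers with GELU and dropout. A log-transformed syllable
    count is concatenated as an auxiliary feature. Sizes are illustrative."""

    def __init__(self, hidden_size: int = 768, head_dim: int = 256, p: float = 0.1):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(hidden_size + 1, head_dim),  # +1 for log syllable count
            nn.GELU(),
            nn.Dropout(p),
            nn.Linear(head_dim, 1),
        )

    def forward(self, hidden, attention_mask, log_syllables):
        # Masked mean pooling: average token embeddings over real tokens only.
        mask = attention_mask.unsqueeze(-1).float()              # (B, T, 1)
        pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-6)
        feats = torch.cat([pooled, log_syllables.unsqueeze(-1)], dim=-1)
        return self.head(feats).squeeze(-1)                      # (B,) seconds

# Dummy hidden states stand in for the XLM-R base encoder output here.
B, T, H = 2, 8, 768
out = DurationHead()(
    torch.randn(B, T, H),
    torch.ones(B, T),
    torch.log1p(torch.tensor([12.0, 5.0])),  # log-transformed syllable counts
)
print(out.shape)  # torch.Size([2])
```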

### Duration Prediction Metrics

Errors are reported in seconds. Lower is better.

- `MAE`: Mean absolute error across all samples.
- `p50 Error`: The 50th-percentile absolute error.
- `p90 Error`: The 90th-percentile absolute error.
- `p95 Error`: The 95th-percentile absolute error.

| Model | MAE ↓ | p50 Error ↓ | p90 Error ↓ | p95 Error ↓ |
| :-- | --: | --: | --: | --: |
| F5-TTS UTF-8 baseline | 1.7064 | 1.0987 | 4.0461 | 5.3914 |
| **XLM-R predictor** | **1.0924** | **0.7118** | **2.6319** | **3.4425** |
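
These metrics can be reproduced from per-sample predictions with a few lines of NumPy (function name and data are illustrative):

```python
import numpy as np

def duration_error_stats(pred_sec, true_sec):
    """MAE and percentile absolute errors in seconds, as in the table above."""
    err = np.abs(np.asarray(pred_sec) - np.asarray(true_sec))
    return {
        "MAE": float(err.mean()),
        "p50": float(np.percentile(err, 50)),
        "p90": float(np.percentile(err, 90)),
        "p95": float(np.percentile(err, 95)),
    }

# Toy example with four samples; absolute errors are 0.1, 0.5, 0.2, 1.9 s.
stats = duration_error_stats([3.1, 5.0, 2.2, 7.9], [3.0, 5.5, 2.0, 6.0])
print(stats)
```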

## Benchmark Results

### Objective Evaluation

Objective evaluation is measured on the same benchmark used in the paper: [JaiTTS: A Thai Voice Cloning Model](https://arxiv.org/pdf/2604.27607). Results can be reproduced using the benchmark instructions in the [GitHub repository](https://github.com/JTS-AI-Team/JaiTTS).

| Model | Short CER (%) ↓ | Short SIM ↑ | Long CER (%) ↓ | Long SIM ↑ |
| :-- | --: | --: | --: | --: |
| ThonburianTTS | 6.26 | 0.48 | -- | -- |
| JaiTTS-F5TTS | 4.78 | 0.60 | 12.63 | **0.80** |
| JaiTTS-F5TTS + Duration Predictor | 4.26 | 0.58 | 11.57 | **0.80** |
| [JaiTTS-v1.0](https://arxiv.org/pdf/2604.27607) | **1.94** | **0.62** | **2.55** | 0.76 |

### Inference Speed

| Model | RTF ↓ |
| :-- | --: |
| ThonburianTTS | 0.1150 |
| JaiTTS-F5TTS | 0.1138 |
| JaiTTS-F5TTS + Duration Predictor | 0.1652 |
| [JaiTTS-v1.0](https://arxiv.org/pdf/2604.27607) | **0.1136** |
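
RTF follows its standard definition, wall-clock synthesis time divided by the duration of the generated audio, so values below 1 are faster than real time:

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of audio produced."""
    return synthesis_seconds / audio_seconds

# Roughly 1.14 s of compute per 10 s of speech, as in the JaiTTS-F5TTS row.
print(real_time_factor(1.14, 10.0))  # 0.114
```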

## Installation

The inference code and pipeline structure are adapted from the [ThonburianTTS](https://huggingface.co/biodatlab/ThonburianTTS) project by biodatlab.

### 1. Install Dependencies

```bash
pip install torch soundfile cached-path librosa transformers f5-tts
sudo apt install ffmpeg
```

### 2. Clone the Inference Codebase

This model uses the `flowtts` pipeline adapted from ThonburianTTS:

```bash
git clone https://github.com/biodatlab/thonburian-tts.git
cd thonburian-tts
```

## Quick Usage

Use the following snippet to run inference with the JaiTTS-F5TTS checkpoint. Ensure you are inside the `thonburian-tts` directory or have the `flowtts` module on your Python path.

```python
import torch
import soundfile as sf
from flowtts.inference import FlowTTSPipeline, ModelConfig, AudioConfig

# Model checkpoint and vocabulary are pulled from the Hugging Face Hub.
model_config = ModelConfig(
    language="th",
    model_type="F5",
    checkpoint="hf://JTS-AI/JaiTTS-F5TTS/model.pt",
    vocab_file="hf://JTS-AI/JaiTTS-F5TTS/vocab.txt",
    vocoder="vocos",
    device="cuda" if torch.cuda.is_available() else "cpu",
)

# Synthesis settings: classifier-free guidance strength, number of
# flow-matching steps (nfe_step), and playback speed.
audio_config = AudioConfig(
    silence_threshold=-45,
    cfg_strength=2.5,
    nfe_step=32,
    speed=1.0,
)

pipeline = FlowTTSPipeline(model_config, audio_config)

# Zero-shot cloning: condition on a reference clip and its transcript,
# then synthesize new Thai text in the reference speaker's voice.
audio, sr = pipeline.generate(
    reference_audio="path/to/reference.wav",
    reference_text="Transcription of the reference audio.",
    gen_text="สวัสดีครับ ยินดีที่ได้รู้จัก ผมคือ AI ที่สร้างโดย JTS",
)

sf.write("output.wav", audio, sr)
```

## Citation

If you find this work useful, please cite our paper:

```bibtex
@misc{karnjanaekarin2026jaittsthaivoicecloning,
      title={JaiTTS: A Thai Voice Cloning Model},
      author={Jullajak Karnjanaekarin and Pontakorn Trakuekul and Narongkorn Panitsrisit and Sumana Sumanakul and Vichayuth Nitayasomboon and Nithid Guntasin and Thanavin Denkavin and Attapol T. Rutherford},
      year={2026},
      eprint={2604.27607},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.27607},
}
```

## Acknowledgements

- Codebase adapted from [ThonburianTTS](https://github.com/biodatlab/thonburian-tts).