---
license: apache-2.0
language:
- th
- en
base_model:
- SWivid/F5-TTS
pipeline_tag: text-to-speech
---
# JaiTTS-F5TTS: Thai Voice Cloning Model Research Prototype

[Paper](https://arxiv.org/pdf/2604.27607)
[Project Page](https://jts.co.th/jai/)
[GitHub](https://github.com/JTS-AI-Team/JaiTTS)
[License: Apache-2.0](https://opensource.org/licenses/Apache-2.0)

<img src="JaiTTS-logo.jpg" width="313"/>

**JaiTTS-F5TTS** is a non-autoregressive voice cloning model based on [F5-TTS](https://huggingface.co/SWivid/F5-TTS). It targets zero-shot voice cloning for Thai.

> **Research prototype:** JaiTTS-F5TTS is one of our experimental variants within the JaiTTS project. It is released for research and benchmarking only.

## Highlights

- Non-autoregressive, F5-TTS-based voice cloning for Thai
- Neural duration predictor for improved pacing and intelligibility
- Fast synthesis, with a Real-Time Factor (RTF) below `0.2`

## Duration Modeling

The original F5-TTS duration estimate uses a UTF-8 byte-ratio formula. This is brittle for Thai and mixed-script input because Thai characters, English words, Arabic numerals, and punctuation do not have a consistent byte-to-pronunciation relationship. In practice, the mismatch can produce rushed, compressed, or unstable speech.
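
For reference, the byte-ratio baseline can be sketched as below. This is a paraphrase of the F5-TTS inference heuristic with illustrative names, not the exact released code:

```python
def estimate_duration_bytes(ref_audio_len: int, ref_text: str,
                            gen_text: str, speed: float = 1.0) -> int:
    """UTF-8 byte-ratio duration estimate in the style of F5-TTS:
    scale the reference length by the byte-length ratio of the texts."""
    ref_bytes = len(ref_text.encode("utf-8"))
    gen_bytes = len(gen_text.encode("utf-8"))
    return ref_audio_len + int(ref_audio_len / ref_bytes * gen_bytes / speed)

# A Thai character occupies 3 UTF-8 bytes, an ASCII character only 1,
# so equal character counts can yield very different byte counts.
thai = "สวัสดีครับ"   # 10 Thai characters -> 30 bytes
latin = "hello sir!"   # 10 ASCII characters -> 10 bytes
print(len(thai.encode("utf-8")), len(latin.encode("utf-8")))  # 30 10
```

Because the byte-to-pronunciation ratio differs per script, the estimate over- or under-shoots whenever the reference and target texts mix scripts differently.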

We address this with an XLM-R-based neural duration predictor that estimates target duration from text more robustly than the UTF-8 byte-ratio baseline.

The data used to train and evaluate the duration predictor is sampled from the JaiTTS-v1.0 training set.

### Duration Predictor Architecture

The duration predictor uses [XLM-R base](https://huggingface.co/FacebookAI/xlm-roberta-base) as the text encoder. Text representations are aggregated with masked mean pooling, then passed to a regression head composed of linear layers with GELU activation and dropout. The predictor also uses log-transformed syllable counts as an auxiliary feature, which provides a more pronunciation-aware signal than byte length for Thai and mixed-script text.
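
A minimal PyTorch sketch of this head, assuming the XLM-R encoder has already produced token hidden states; the layer widths and dropout rate here are illustrative, not the released hyperparameters:

```python
import torch
import torch.nn as nn

class DurationHead(nn.Module):
    """Regression head over encoder hidden states: masked mean pooling,
    then linear layers with GELU and dropout. A log-transformed syllable
    count is concatenated as an auxiliary feature. Sizes are illustrative."""

    def __init__(self, hidden_size: int = 768, head_dim: int = 256, p: float = 0.1):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(hidden_size + 1, head_dim),  # +1 for log syllable count
            nn.GELU(),
            nn.Dropout(p),
            nn.Linear(head_dim, 1),
        )

    def forward(self, hidden, attention_mask, log_syllables):
        # Masked mean pooling: average token embeddings over real tokens only.
        mask = attention_mask.unsqueeze(-1).float()              # (B, T, 1)
        pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-6)
        feats = torch.cat([pooled, log_syllables.unsqueeze(-1)], dim=-1)
        return self.head(feats).squeeze(-1)                      # (B,) seconds

# Dummy hidden states stand in for the XLM-R base encoder output here.
B, T, H = 2, 8, 768
out = DurationHead()(
    torch.randn(B, T, H),
    torch.ones(B, T),
    torch.log1p(torch.tensor([12.0, 5.0])),  # log-transformed syllable counts
)
print(out.shape)  # torch.Size([2])
```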

### Duration Prediction Metrics

Errors are reported in seconds. Lower is better.

- `MAE`: Mean absolute error across all samples.
- `p50 Error`: The 50th-percentile absolute error.
- `p90 Error`: The 90th-percentile absolute error.
- `p95 Error`: The 95th-percentile absolute error.

| Model | MAE ↓ | p50 Error ↓ | p90 Error ↓ | p95 Error ↓ |
| :-- | --: | --: | --: | --: |
| F5-TTS UTF-8 baseline | 1.7064 | 1.0987 | 4.0461 | 5.3914 |
| **XLM-R predictor** | **1.0924** | **0.7118** | **2.6319** | **3.4425** |
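
These metrics can be reproduced from per-sample predictions with a few lines of NumPy (function name and data are illustrative):

```python
import numpy as np

def duration_error_stats(pred_sec, true_sec):
    """MAE and percentile absolute errors in seconds, as in the table above."""
    err = np.abs(np.asarray(pred_sec) - np.asarray(true_sec))
    return {
        "MAE": float(err.mean()),
        "p50": float(np.percentile(err, 50)),
        "p90": float(np.percentile(err, 90)),
        "p95": float(np.percentile(err, 95)),
    }

# Toy example with four samples; absolute errors are 0.1, 0.5, 0.2, 1.9 s.
stats = duration_error_stats([3.1, 5.0, 2.2, 7.9], [3.0, 5.5, 2.0, 6.0])
print(stats)
```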

## Benchmark Results

### Objective Evaluation

Objective evaluation is measured on the same benchmark used in the paper: [JaiTTS: A Thai Voice Cloning Model](https://arxiv.org/pdf/2604.27607). Results can be reproduced using the benchmark instructions in the [GitHub repository](https://github.com/JTS-AI-Team/JaiTTS).

| Model | Short CER (%) ↓ | Short SIM ↑ | Long CER (%) ↓ | Long SIM ↑ |
| :-- | --: | --: | --: | --: |
| ThonburianTTS | 6.26 | 0.48 | -- | -- |
| JaiTTS-F5TTS | 4.78 | 0.60 | 12.63 | **0.80** |
| JaiTTS-F5TTS + Duration Predictor | 4.26 | 0.58 | 11.57 | **0.80** |
| [JaiTTS-v1.0](https://arxiv.org/pdf/2604.27607) | **1.94** | **0.62** | **2.55** | 0.76 |

### Inference Speed

| Model | RTF ↓ |
| :-- | --: |
| ThonburianTTS | 0.1150 |
| JaiTTS-F5TTS | 0.1138 |
| JaiTTS-F5TTS + Duration Predictor | 0.1652 |
| [JaiTTS-v1.0](https://arxiv.org/pdf/2604.27607) | **0.1136** |
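
RTF follows its standard definition, wall-clock synthesis time divided by the duration of the generated audio, so values below 1 are faster than real time:

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of audio produced."""
    return synthesis_seconds / audio_seconds

# Roughly 1.14 s of compute per 10 s of speech, as in the JaiTTS-F5TTS row.
print(real_time_factor(1.14, 10.0))  # 0.114
```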

## Installation

The inference code and pipeline structure are adapted from the [ThonburianTTS](https://huggingface.co/biodatlab/ThonburianTTS) project by biodatlab.

### 1. Install Dependencies

```bash
pip install torch soundfile cached-path librosa transformers f5-tts
sudo apt install ffmpeg
```

### 2. Clone the Inference Codebase

This model uses the `flowtts` pipeline adapted from ThonburianTTS:

```bash
git clone https://github.com/biodatlab/thonburian-tts.git
cd thonburian-tts
```

## Quick Usage

Use the following snippet to run inference with the JaiTTS-F5TTS checkpoint. Ensure you are inside the `thonburian-tts` directory or have the `flowtts` module on your Python path.

```python
import torch
import soundfile as sf
from flowtts.inference import FlowTTSPipeline, ModelConfig, AudioConfig

# Model checkpoint and vocabulary are pulled from the Hugging Face Hub.
model_config = ModelConfig(
    language="th",
    model_type="F5",
    checkpoint="hf://JTS-AI/JaiTTS-F5TTS/model.pt",
    vocab_file="hf://JTS-AI/JaiTTS-F5TTS/vocab.txt",
    vocoder="vocos",
    device="cuda" if torch.cuda.is_available() else "cpu",
)

# Synthesis settings: classifier-free guidance strength, number of
# flow-matching steps (nfe_step), and playback speed.
audio_config = AudioConfig(
    silence_threshold=-45,
    cfg_strength=2.5,
    nfe_step=32,
    speed=1.0,
)

pipeline = FlowTTSPipeline(model_config, audio_config)

# Zero-shot cloning: condition on a reference clip and its transcript,
# then synthesize new Thai text in the reference speaker's voice.
audio, sr = pipeline.generate(
    reference_audio="path/to/reference.wav",
    reference_text="Transcription of the reference audio.",
    gen_text="สวัสดีครับ ยินดีที่ได้รู้จัก ผมคือ AI ที่สร้างโดย JTS",
)

sf.write("output.wav", audio, sr)
```

## Citation

If you find this work useful, please cite our paper:

```bibtex
@misc{karnjanaekarin2026jaittsthaivoicecloning,
      title={JaiTTS: A Thai Voice Cloning Model},
      author={Jullajak Karnjanaekarin and Pontakorn Trakuekul and Narongkorn Panitsrisit and Sumana Sumanakul and Vichayuth Nitayasomboon and Nithid Guntasin and Thanavin Denkavin and Attapol T. Rutherford},
      year={2026},
      eprint={2604.27607},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.27607},
}
```

## Acknowledgements

- Codebase adapted from [ThonburianTTS](https://github.com/biodatlab/thonburian-tts).