---
base_model:
- SWivid/F5-TTS
pipeline_tag: text-to-speech
---

# JaiTTS-F5TTS: Thai Voice Cloning Model Research Prototype

[**Paper**](https://arxiv.org/pdf/2604.27607) | [**Project Page**](https://jts.co.th/jai/) | [**Code Repository**](https://github.com/JTS-AI-Team/JaiTTS) | [**License: Apache 2.0**](https://opensource.org/licenses/Apache-2.0)

<img src="JaiTTS-logo.jpg" width="313"/>

**JaiTTS-F5TTS** is a non-autoregressive JaiTTS voice cloning model based on [F5-TTS](https://huggingface.co/SWivid/F5-TTS). It targets Thai zero-shot voice cloning.

> **Research prototype:** JaiTTS-F5TTS is one of our experimental variants within the JaiTTS project. It is released for research and benchmarking only.

## Highlights

- F5-TTS-based non-autoregressive voice cloning for Thai
- Duration predictor for improved pacing and intelligibility
- Fast synthesis with a Real-Time Factor (RTF) below `0.2`

## Duration Modeling

The original F5-TTS duration estimate uses a UTF-8 byte-ratio formula. This is brittle for Thai and mixed-script input because Thai characters, English words, Arabic numerals, and punctuation do not have a consistent byte-to-pronunciation relationship. In practice, the mismatch can produce rushed, compressed, or unstable speech.
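As a quick, self-contained illustration of the mismatch (the example strings are ours, not from the paper): Thai codepoints encode to three bytes each in UTF-8 while ASCII letters take one, so byte counts track the script rather than the spoken length.

```python
thai = "สวัสดีครับ"  # a polite Thai "hello", 10 codepoints
english = "hello"

# Thai codepoints are 3 bytes each in UTF-8; ASCII letters are 1 byte.
print(len(thai.encode("utf-8")))     # 30
print(len(english.encode("utf-8")))  # 5
```

Both greetings are short utterances, yet a byte-ratio estimate would allot the Thai one six times the duration budget of the English one.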
We address this with an XLM-R-based neural duration predictor that estimates the target duration from text more robustly than the UTF-8 byte-ratio baseline.

The data used to train and evaluate the duration predictor is sampled from the JaiTTS-v1.0 training set.

### Duration Predictor Architecture

The duration predictor uses [XLM-R base](https://huggingface.co/FacebookAI/xlm-roberta-base) as the text encoder. Text representations are aggregated with masked mean pooling, then passed to a regression head composed of linear layers with GELU activation and dropout. The predictor also uses log-transformed syllable counts as an auxiliary feature, which provides a more pronunciation-aware signal than byte length for Thai and mixed-script text.
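A minimal sketch of the pooling and regression head described above, with hypothetical layer sizes. In the real model the hidden states come from XLM-R base; here a random tensor stands in for them so the sketch is self-contained.

```python
import torch
import torch.nn as nn

class DurationHead(nn.Module):
    """Sketch: masked mean pooling + regression head (hypothetical sizes)."""

    def __init__(self, hidden_size=768, proj_size=256, dropout=0.1):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size + 1, proj_size),  # +1 for log-syllable feature
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(proj_size, 1),
        )

    def forward(self, hidden_states, attention_mask, log_syllables):
        # Masked mean pooling over token positions.
        mask = attention_mask.unsqueeze(-1).float()               # (B, T, 1)
        pooled = (hidden_states * mask).sum(1) / mask.sum(1).clamp(min=1e-6)
        # Append the log-transformed syllable count as an auxiliary scalar.
        feats = torch.cat([pooled, log_syllables.unsqueeze(-1)], dim=-1)
        return self.mlp(feats).squeeze(-1)                        # seconds, shape (B,)

# In the real predictor: hidden_states = xlmr(input_ids, attention_mask).last_hidden_state
head = DurationHead()
h = torch.randn(2, 12, 768)                   # stand-in for XLM-R hidden states
m = torch.ones(2, 12, dtype=torch.long)       # attention mask
syl = torch.log1p(torch.tensor([4.0, 7.0]))   # log-transformed syllable counts
print(head(h, m, syl).shape)  # torch.Size([2])
```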
### Duration Prediction Metrics

Errors are reported in seconds. Lower is better.

- `MAE`: Mean absolute error across all samples.
- `p50 Error`: The 50th-percentile absolute error.
- `p90 Error`: The 90th-percentile absolute error.
- `p95 Error`: The 95th-percentile absolute error.

| Model | MAE ↓ | p50 Error ↓ | p90 Error ↓ | p95 Error ↓ |
| :-- | --: | --: | --: | --: |
| F5-TTS UTF-8 baseline | 1.7064 | 1.0987 | 4.0461 | 5.3914 |
| **XLM-R predictor** | **1.0924** | **0.7118** | **2.6319** | **3.4425** |
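For reference, all four metrics can be computed from paired predicted and ground-truth durations in a few lines of NumPy (a sketch; the function name and sample values are ours):

```python
import numpy as np

def duration_error_metrics(pred_sec, true_sec):
    # Absolute errors in seconds, then MAE and upper-percentile summaries.
    err = np.abs(np.asarray(pred_sec, dtype=float) - np.asarray(true_sec, dtype=float))
    return {
        "MAE": err.mean(),
        "p50": np.percentile(err, 50),
        "p90": np.percentile(err, 90),
        "p95": np.percentile(err, 95),
    }

metrics = duration_error_metrics([3.2, 5.0, 10.1], [3.0, 5.5, 9.0])
# absolute errors are 0.2, 0.5, 1.1 -> MAE = 0.6, p50 = 0.5
```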
## Benchmark Results

### Objective Evaluation

Objective evaluation is measured on the same benchmark used in the paper: [JaiTTS: A Thai Voice Cloning Model](https://arxiv.org/pdf/2604.27607). Results can be reproduced using the benchmark instructions in the [GitHub repository](https://github.com/JTS-AI-Team/JaiTTS).

| Model | Short CER (%) ↓ | Short SIM ↑ | Long CER (%) ↓ | Long SIM ↑ |
| :-- | --: | --: | --: | --: |
| ThonburianTTS | 6.26 | 0.48 | -- | -- |
| JaiTTS-F5TTS | 4.78 | 0.60 | 12.63 | **0.80** |
| JaiTTS-F5TTS + Duration Predictor | 4.26 | 0.58 | 11.57 | **0.80** |
| [JaiTTS-v1.0](https://arxiv.org/pdf/2604.27607) | **1.94** | **0.62** | **2.55** | 0.76 |

### Inference Speed

| Model | RTF ↓ |
| :-- | --: |
| ThonburianTTS | 0.1150 |
| JaiTTS-F5TTS | 0.1138 |
| JaiTTS-F5TTS + Duration Predictor | 0.1652 |
| [JaiTTS-v1.0](https://arxiv.org/pdf/2604.27607) | **0.1136** |
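RTF here follows the usual definition: wall-clock synthesis time divided by the duration of the audio produced, so values below 1.0 mean faster than real time. A hypothetical helper (the function name and example numbers are ours):

```python
def real_time_factor(synthesis_seconds, num_samples, sample_rate):
    # Seconds of compute per second of generated audio.
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds

# e.g. 1.2 s of compute to generate 10 s of audio at 24 kHz
rtf = real_time_factor(1.2, 10 * 24000, 24000)
print(round(rtf, 4))  # 0.12
```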
## Installation

The inference code and pipeline structure are adapted from the [ThonburianTTS](https://huggingface.co/biodatlab/ThonburianTTS) project by biodatlab.

### 1. Install Dependencies

```bash
sudo apt install ffmpeg
```

### 2. Clone the Inference Codebase

This model uses the `flowtts` pipeline adapted from ThonburianTTS:

```bash
git clone https://github.com/biodatlab/thonburian-tts.git
cd thonburian-tts
```

## Quick Usage

Use the following snippet to run inference with the JaiTTS-F5TTS checkpoint. Ensure you are inside the `thonburian-tts` directory or have the `flowtts` module on your Python path.

```python
import torch
import soundfile as sf
from flowtts.inference import FlowTTSPipeline, ModelConfig, AudioConfig

# Configure JaiTTS-F5TTS model
model_config = ModelConfig(
    language="th",
    model_type="F5",
    # ...
    device="cuda" if torch.cuda.is_available() else "cpu"
)

# Basic audio settings
audio_config = AudioConfig(
    silence_threshold=-45,
    cfg_strength=2.5,
    # ...
    speed=1.0
)

# Initialize pipeline
pipeline = FlowTTSPipeline(model_config, audio_config)

# Inference
audio, sr = pipeline.generate(
    reference_audio="path/to/reference.wav",
    reference_text="Transcription of the reference audio.",
    gen_text="สวัสดีครับ ยินดีที่ได้รู้จัก ผมคือ AI ที่สร้างโดย JTS"
)

# Save result
sf.write("output.wav", audio, sr)
```
## Citation

If you find this work useful, please cite our paper:

```bibtex
@misc{karnjanaekarin2026jaittsthaivoicecloning,
      title={JaiTTS: A Thai Voice Cloning Model},
      author={Jullajak Karnjanaekarin and Pontakorn Trakuekul and Narongkorn Panitsrisit and Sumana Sumanakul and Vichayuth Nitayasomboon and Nithid Guntasin and Thanavin Denkavin and Attapol T. Rutherford},
      year={2026},
      eprint={2604.27607},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.27607},
}
```

## Acknowledgements

- Codebase adapted from [ThonburianTTS](https://github.com/biodatlab/thonburian-tts).