jts-ai-team committed (verified) · Commit caad885 · 1 Parent(s): 219c0c1

Update README.md

Files changed (1)
  1. README.md +75 -16
README.md CHANGED
@@ -7,20 +7,77 @@ base_model:
7
  - SWivid/F5-TTS
8
  pipeline_tag: text-to-speech
9
  ---
10
- # JaiTTS-F5TTS: Thai Voice Cloning Model
11
12
  [![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
13
 
14
- [**Paper**](https://github.com/JTS-AI-Team/JaiTTS) | [**Code Repository**](https://github.com/JTS-AI-Team/JaiTTS)
15
-
16
  <img src="JaiTTS-logo.jpg" width="313"/>
17
 
18
- **JaiTTS-F5TTS** is a state-of-the-art Thai voice cloning Text-to-Speech system developed by **Jasmine Technology Solution (JTS)**. It is based on a continually trained F5-TTS architecture, optimized for the Thai language using a large-scale proprietary Thai and English dataset.
19
 
20
- This repository provides the F5-TTS variant of the model described in our paper. **The inference code and pipeline structure are adapted from the [ThonburianTTS](https://huggingface.co/biodatlab/ThonburianTTS) project by biodatlab.**
21
 
22
  ## Installation
23
 
24
  ### 1. Install Dependencies
25
 
26
  ```bash
@@ -30,7 +87,7 @@ sudo apt install ffmpeg
30
 
31
  ### 2. Clone the Inference Codebase
32
 
33
- As this model utilizes the `flowtts` pipeline adapted from ThonburianTTS, please clone the following repository:
34
 
35
  ```bash
36
  git clone https://github.com/biodatlab/thonburian-tts.git
@@ -39,14 +96,13 @@ cd thonburian-tts
39
 
40
  ## Quick Usage
41
 
42
- You can use the following snippet to run inference with the JaiTTS-F5TTS model. Ensure you are inside the `thonburian-tts` directory or have the `flowtts` module in your Python path.
43
 
44
  ```python
45
  import torch
46
  import soundfile as sf
47
  from flowtts.inference import FlowTTSPipeline, ModelConfig, AudioConfig
48
 
49
- # Configure JaiTTS-F5TTS model
50
  model_config = ModelConfig(
51
  language="th",
52
  model_type="F5",
@@ -56,7 +112,6 @@ model_config = ModelConfig(
56
  device="cuda" if torch.cuda.is_available() else "cpu"
57
  )
58
 
59
- # Basic audio settings
60
  audio_config = AudioConfig(
61
  silence_threshold=-45,
62
  cfg_strength=2.5,
@@ -64,29 +119,33 @@ audio_config = AudioConfig(
64
  speed=1.0
65
  )
66
 
67
- # Initialize pipeline
68
  pipeline = FlowTTSPipeline(model_config, audio_config)
69
 
70
- # Inference
71
  audio, sr = pipeline.generate(
72
  reference_audio="path/to/reference.wav",
73
  reference_text="Transcription of the reference audio.",
74
- gen_text="สวัสดีครับ ยินดีที่ได้รู้จัก ผมคือระบบังเคราะห์เสียภาษาไทย"
75
  )
76
 
77
- # Save result
78
  sf.write("output.wav", audio, sr)
79
  ```
80
 
81
-
82
  ## Citation
83
 
84
  If you find this work useful, please cite our paper:
85
 
86
  ```bibtex
87
-
88
  ```
89
 
90
  ## Acknowledgements
91
 
92
- - Codebase adapted from [ThonburianTTS](https://github.com/biodatlab/thonburian-tts).
 
7
  - SWivid/F5-TTS
8
  pipeline_tag: text-to-speech
9
  ---
10
+ # JaiTTS-F5TTS: Thai Voice Cloning Model (Research Prototype)
11
 
12
+ [![Paper](https://img.shields.io/badge/Paper-link-green.svg)](https://arxiv.org/pdf/2604.27607)
13
+ [![Website](https://img.shields.io/badge/Website-JAI-orange.svg)](https://jts.co.th/jai/)
14
+ [![GitHub](https://img.shields.io/badge/GitHub-repository-black.svg)](https://github.com/JTS-AI-Team/JaiTTS)
15
  [![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
16
 
17
  <img src="JaiTTS-logo.jpg" width="313"/>
18
 
19
+ **JaiTTS-F5TTS** is a non-autoregressive JaiTTS voice cloning model based on [F5-TTS](https://huggingface.co/SWivid/F5-TTS). It targets Thai zero-shot voice cloning.
20
+
21
+ > **Research prototype:** JaiTTS-F5TTS is one of our experimental variants within the JaiTTS project. It is released for research and benchmarking only.
22
+
23
+ ## Highlights
24
+
25
+ - F5-TTS-based non-autoregressive voice cloning for Thai
26
+ - Duration predictor for improved pacing and intelligibility
27
+ - Fast synthesis with Real-Time Factor (RTF) below `0.2`
28
+
29
+ ## Duration Modeling
30
+
31
+ The original F5-TTS duration estimate uses a UTF-8 byte-ratio formula. This is brittle for Thai and mixed-script input because Thai characters, English words, Arabic numerals, and punctuation do not have a consistent byte-to-pronunciation relationship. In practice, the mismatch can produce rushed, compressed, or unstable speech.
32
+
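The byte-ratio heuristic described above can be sketched in a few lines. This is an illustrative simplification (function name and exact normalization are ours, not the F5-TTS source): the generated duration is the reference duration scaled by the UTF-8 byte-length ratio of the two texts.

```python
# Simplified sketch of a UTF-8 byte-ratio duration estimate.
# Illustrative only; not the exact F5-TTS implementation.
def estimate_duration_bytes(ref_audio_sec, ref_text, gen_text):
    """Scale the reference duration by the UTF-8 byte-length ratio."""
    ref_bytes = len(ref_text.encode("utf-8"))
    gen_bytes = len(gen_text.encode("utf-8"))
    return ref_audio_sec * gen_bytes / max(ref_bytes, 1)

# Thai characters take 3 UTF-8 bytes while ASCII takes 1, so mixed-script
# text skews the ratio away from actual speaking time.
print(len("สวัสดี".encode("utf-8")))  # 18 bytes for 6 Thai characters
print(len("hello".encode("utf-8")))   # 5 bytes for 5 ASCII characters
```

The byte counts at the end show why the heuristic is brittle: identical spoken durations in Thai and English can differ threefold in byte length.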
33
+ We address this with an XLM-R-based neural duration predictor that estimates target duration from text more robustly than the UTF-8 byte-ratio baseline.
34
+
35
+ The data used to train and evaluate the duration predictor is sampled from the JaiTTS-v1.0 training set.
36
+
37
+ ### Duration Predictor Architecture
38
+
39
+ The duration predictor uses [XLM-R base](https://huggingface.co/FacebookAI/xlm-roberta-base) as the text encoder. Text representations are aggregated with masked mean pooling, then passed to a regression head composed of linear layers with GELU activation and dropout. The predictor also uses log-transformed syllable counts as an auxiliary feature, which provides a more pronunciation-aware signal than byte length for Thai and mixed-script text.
40
+
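The pooling-and-regression part of the architecture can be sketched as follows. This is a minimal PyTorch illustration under stated assumptions: the XLM-R encoder is stood in by random hidden states, and the layer sizes (`proj=256`, dropout rate) are our guesses, not the released model's configuration.

```python
import torch
import torch.nn as nn

class DurationHead(nn.Module):
    """Sketch of a duration regression head: masked mean pooling over
    encoder states, concatenated with a log syllable count, then
    Linear -> GELU -> Dropout -> Linear. Layer sizes are assumptions."""
    def __init__(self, hidden=768, proj=256, dropout=0.1):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden + 1, proj),  # +1 for the log-syllable feature
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(proj, 1),
        )

    def forward(self, states, mask, log_syllables):
        # states: (B, T, H) from a text encoder such as XLM-R base
        # mask:   (B, T) with 1 for real tokens, 0 for padding
        m = mask.unsqueeze(-1).float()
        pooled = (states * m).sum(1) / m.sum(1).clamp(min=1.0)  # masked mean
        feats = torch.cat([pooled, log_syllables.unsqueeze(-1)], dim=-1)
        return self.mlp(feats).squeeze(-1)  # predicted duration in seconds

# Usage with random states standing in for XLM-R base output (H = 768):
B, T, H = 2, 16, 768
head = DurationHead()
states = torch.randn(B, T, H)
mask = torch.ones(B, T)
dur = head(states, mask, torch.log1p(torch.tensor([12.0, 30.0])))
print(dur.shape)  # torch.Size([2])
```

Masked mean pooling keeps padding tokens from diluting the sentence representation, and the log transform keeps the syllable-count feature on a scale comparable to the pooled activations.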
41
+ ### Duration Prediction Metrics
42
 
43
+ Errors are reported in seconds. Lower is better.
44
+
45
+ - `MAE`: Mean absolute error across all samples.
46
+ - `p50 Error`: The 50th-percentile absolute error.
47
+ - `p90 Error`: The 90th-percentile absolute error.
48
+ - `p95 Error`: The 95th-percentile absolute error.
49
+
50
+ | Model | MAE ↓ | p50 Error ↓ | p90 Error ↓ | p95 Error ↓ |
51
+ | :-- | --: | --: | --: | --: |
52
+ | F5-TTS UTF-8 baseline | 1.7064 | 1.0987 | 4.0461 | 5.3914 |
53
+ | **XLM-R predictor** | **1.0924** | **0.7118** | **2.6319** | **3.4425** |
54
+
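The four metrics in the table can be reproduced with a short NumPy helper (a generic sketch; the paper's evaluation script may differ in interpolation details):

```python
import numpy as np

def duration_error_metrics(pred_sec, true_sec):
    """MAE and percentile absolute errors in seconds, lower is better."""
    err = np.abs(np.asarray(pred_sec) - np.asarray(true_sec))
    return {
        "MAE": float(err.mean()),
        "p50": float(np.percentile(err, 50)),
        "p90": float(np.percentile(err, 90)),
        "p95": float(np.percentile(err, 95)),
    }

print(duration_error_metrics([1.0, 2.0, 4.0], [1.5, 2.0, 3.0]))
```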
55
+ ## Benchmark Results
56
+
57
+ ### Objective Evaluation
58
+
59
+ Objective evaluation is measured on the same benchmark used in the paper: [JaiTTS: A Thai Voice Cloning Model](https://arxiv.org/pdf/2604.27607). Results can be reproduced using the benchmark instructions in the [GitHub repository](https://github.com/JTS-AI-Team/JaiTTS).
60
+
61
+ | Model | Short CER (%) ↓ | Short SIM ↑ | Long CER (%) ↓ | Long SIM ↑ |
62
+ | :-- | --: | --: | --: | --: |
63
+ | ThonburianTTS | 6.26 | 0.48 | -- | -- |
64
+ | JaiTTS-F5TTS | 4.78 | 0.60 | 12.63 | **0.80** |
65
+ | JaiTTS-F5TTS + Duration Predictor | 4.26 | 0.58 | 11.57 | **0.80** |
66
+ | [JaiTTS-v1.0](https://arxiv.org/pdf/2604.27607) | **1.94** | **0.62** | **2.55** | 0.76 |
67
+
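For reference, CER in the table is character error rate: edit distance between the ASR transcript and the reference text, divided by the reference length. A minimal dynamic-programming sketch (the paper's exact text normalization may differ):

```python
def cer(ref, hyp):
    """Character error rate: Levenshtein distance / reference length.
    Rolling one-row DP; returns a fraction (multiply by 100 for %)."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n] / max(m, 1)

print(round(cer("sitting", "kitten") * 100, 2))  # edit distance 3 over 7 chars
```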
68
+ ### Inference Speed
69
+
70
+ | Model | RTF ↓ |
71
+ | :-- | --: |
72
+ | ThonburianTTS | 0.1150 |
73
+ | JaiTTS-F5TTS | 0.1138 |
74
+ | JaiTTS-F5TTS + Duration Predictor | 0.1652 |
75
+ | [JaiTTS-v1.0](https://arxiv.org/pdf/2604.27607) | **0.1136** |
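RTF here is wall-clock synthesis time divided by the duration of the generated audio, so values below 1 mean faster than real time. A small sketch of the measurement (the `synthesize` callable is a hypothetical stand-in, not the JaiTTS API):

```python
import time

def real_time_factor(synthesize, text):
    """RTF = synthesis wall-clock time / generated audio duration.
    `synthesize` is assumed to return (audio_samples, sample_rate)."""
    start = time.perf_counter()
    audio, sr = synthesize(text)
    elapsed = time.perf_counter() - start
    return elapsed / (len(audio) / sr)

# Stand-in synthesizer: 1 second of silence at 24 kHz, near-zero compute,
# so the measured RTF is far below the 0.2 figure quoted above.
rtf = real_time_factor(lambda t: ([0.0] * 24000, 24000), "สวัสดี")
print(rtf < 0.2)
```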
76
 
77
  ## Installation
78
 
79
+ This inference code and pipeline structure are adapted from the [ThonburianTTS](https://huggingface.co/biodatlab/ThonburianTTS) project by biodatlab.
80
+
81
  ### 1. Install Dependencies
82
 
83
  ```bash
 
87
 
88
  ### 2. Clone the Inference Codebase
89
 
90
+ This model uses the `flowtts` pipeline adapted from ThonburianTTS:
91
 
92
  ```bash
93
  git clone https://github.com/biodatlab/thonburian-tts.git
 
96
 
97
  ## Quick Usage
98
 
99
+ Use the following snippet to run inference with the JaiTTS-F5TTS checkpoint. Ensure you are inside the `thonburian-tts` directory or have the `flowtts` module in your Python path.
100
 
101
  ```python
102
  import torch
103
  import soundfile as sf
104
  from flowtts.inference import FlowTTSPipeline, ModelConfig, AudioConfig
105
 
 
106
  model_config = ModelConfig(
107
  language="th",
108
  model_type="F5",
 
112
  device="cuda" if torch.cuda.is_available() else "cpu"
113
  )
114
 
 
115
  audio_config = AudioConfig(
116
  silence_threshold=-45,
117
  cfg_strength=2.5,
 
119
  speed=1.0
120
  )
121
 
 
122
  pipeline = FlowTTSPipeline(model_config, audio_config)
123
 
 
124
  audio, sr = pipeline.generate(
125
  reference_audio="path/to/reference.wav",
126
  reference_text="Transcription of the reference audio.",
127
+ gen_text="สวัสดีครับ ยินดีที่ได้รู้จัก ผมคือ AI ที่สร้างโดย JTS"
128
  )
129
 
 
130
  sf.write("output.wav", audio, sr)
131
  ```
132
 
 
133
  ## Citation
134
 
135
  If you find this work useful, please cite our paper:
136
 
137
  ```bibtex
138
+ @misc{karnjanaekarin2026jaittsthaivoicecloning,
139
+ title={JaiTTS: A Thai Voice Cloning Model},
140
+ author={Jullajak Karnjanaekarin and Pontakorn Trakuekul and Narongkorn Panitsrisit and Sumana Sumanakul and Vichayuth Nitayasomboon and Nithid Guntasin and Thanavin Denkavin and Attapol T. Rutherford},
141
+ year={2026},
142
+ eprint={2604.27607},
143
+ archivePrefix={arXiv},
144
+ primaryClass={cs.CL},
145
+ url={https://arxiv.org/abs/2604.27607},
146
+ }
147
  ```
148
 
149
  ## Acknowledgements
150
 
151
+ - Codebase adapted from [ThonburianTTS](https://github.com/biodatlab/thonburian-tts).