---
base_model:
- SWivid/F5-TTS
pipeline_tag: text-to-speech
---

# JaiTTS-F5TTS: Thai Voice Cloning Model Research Prototype

[**Paper**](https://arxiv.org/pdf/2604.27607) | [**Project Page**](https://jts.co.th/jai/) | [**Code Repository**](https://github.com/JTS-AI-Team/JaiTTS) | [**License: Apache 2.0**](https://opensource.org/licenses/Apache-2.0)

<img src="JaiTTS-logo.jpg" width="313"/>

**JaiTTS-F5TTS** is a non-autoregressive JaiTTS voice cloning model based on [F5-TTS](https://huggingface.co/SWivid/F5-TTS). It targets Thai zero-shot voice cloning.

> **Research prototype:** JaiTTS-F5TTS is one of our experimental variants within the JaiTTS project. It is released for research and benchmarking only.

## Highlights

- F5-TTS-based non-autoregressive voice cloning for Thai
- Duration predictor for improved pacing and intelligibility
- Fast synthesis with a Real-Time Factor (RTF) below `0.2`

## Duration Modeling

The original F5-TTS duration estimate uses a UTF-8 byte-ratio formula. This is brittle for Thai and mixed-script input because Thai characters, English words, Arabic numerals, and punctuation do not have a consistent byte-to-pronunciation relationship. In practice, the mismatch can produce rushed, compressed, or unstable speech.
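As a quick, self-contained illustration of the mismatch (the example strings are ours, not from the paper): Thai codepoints encode to three bytes each in UTF-8 while ASCII letters take one, so byte counts track the script rather than the spoken length.

```python
thai = "สวัสดีครับ"  # a polite Thai "hello", 10 codepoints
english = "hello"

# Thai codepoints are 3 bytes each in UTF-8; ASCII letters are 1 byte.
print(len(thai.encode("utf-8")))     # 30
print(len(english.encode("utf-8")))  # 5
```

Both greetings are short utterances, yet a byte-ratio estimate would allot the Thai one six times the duration budget of the English one.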
We address this with an XLM-R-based neural duration predictor that estimates the target duration from text more robustly than the UTF-8 byte-ratio baseline.

The data used to train and evaluate the duration predictor is sampled from the JaiTTS-v1.0 training set.

### Duration Predictor Architecture

The duration predictor uses [XLM-R base](https://huggingface.co/FacebookAI/xlm-roberta-base) as the text encoder. Text representations are aggregated with masked mean pooling, then passed to a regression head composed of linear layers with GELU activation and dropout. The predictor also uses log-transformed syllable counts as an auxiliary feature, which provides a more pronunciation-aware signal than byte length for Thai and mixed-script text.
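A minimal sketch of the pooling and regression head described above, with hypothetical layer sizes. In the real model the hidden states come from XLM-R base; here a random tensor stands in for them so the sketch is self-contained.

```python
import torch
import torch.nn as nn

class DurationHead(nn.Module):
    """Sketch: masked mean pooling + regression head (hypothetical sizes)."""

    def __init__(self, hidden_size=768, proj_size=256, dropout=0.1):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size + 1, proj_size),  # +1 for log-syllable feature
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(proj_size, 1),
        )

    def forward(self, hidden_states, attention_mask, log_syllables):
        # Masked mean pooling over token positions.
        mask = attention_mask.unsqueeze(-1).float()               # (B, T, 1)
        pooled = (hidden_states * mask).sum(1) / mask.sum(1).clamp(min=1e-6)
        # Append the log-transformed syllable count as an auxiliary scalar.
        feats = torch.cat([pooled, log_syllables.unsqueeze(-1)], dim=-1)
        return self.mlp(feats).squeeze(-1)                        # seconds, shape (B,)

# In the real predictor: hidden_states = xlmr(input_ids, attention_mask).last_hidden_state
head = DurationHead()
h = torch.randn(2, 12, 768)                   # stand-in for XLM-R hidden states
m = torch.ones(2, 12, dtype=torch.long)       # attention mask
syl = torch.log1p(torch.tensor([4.0, 7.0]))   # log-transformed syllable counts
print(head(h, m, syl).shape)  # torch.Size([2])
```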
### Duration Prediction Metrics

Errors are reported in seconds. Lower is better.

- `MAE`: Mean absolute error across all samples.
- `p50 Error`: The 50th-percentile absolute error.
- `p90 Error`: The 90th-percentile absolute error.
- `p95 Error`: The 95th-percentile absolute error.

| Model | MAE ↓ | p50 Error ↓ | p90 Error ↓ | p95 Error ↓ |
| :-- | --: | --: | --: | --: |
| F5-TTS UTF-8 baseline | 1.7064 | 1.0987 | 4.0461 | 5.3914 |
| **XLM-R predictor** | **1.0924** | **0.7118** | **2.6319** | **3.4425** |
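For reference, all four metrics can be computed from paired predicted and ground-truth durations in a few lines of NumPy (a sketch; the function name and sample values are ours):

```python
import numpy as np

def duration_error_metrics(pred_sec, true_sec):
    # Absolute errors in seconds, then MAE and upper-percentile summaries.
    err = np.abs(np.asarray(pred_sec, dtype=float) - np.asarray(true_sec, dtype=float))
    return {
        "MAE": err.mean(),
        "p50": np.percentile(err, 50),
        "p90": np.percentile(err, 90),
        "p95": np.percentile(err, 95),
    }

metrics = duration_error_metrics([3.2, 5.0, 10.1], [3.0, 5.5, 9.0])
# absolute errors are 0.2, 0.5, 1.1 -> MAE = 0.6, p50 = 0.5
```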
## Benchmark Results

### Objective Evaluation

Objective evaluation is measured on the same benchmark used in the paper: [JaiTTS: A Thai Voice Cloning Model](https://arxiv.org/pdf/2604.27607). Results can be reproduced using the benchmark instructions in the [GitHub repository](https://github.com/JTS-AI-Team/JaiTTS).

| Model | Short CER (%) ↓ | Short SIM ↑ | Long CER (%) ↓ | Long SIM ↑ |
| :-- | --: | --: | --: | --: |
| ThonburianTTS | 6.26 | 0.48 | -- | -- |
| JaiTTS-F5TTS | 4.78 | 0.60 | 12.63 | **0.80** |
| JaiTTS-F5TTS + Duration Predictor | 4.26 | 0.58 | 11.57 | **0.80** |
| [JaiTTS-v1.0](https://arxiv.org/pdf/2604.27607) | **1.94** | **0.62** | **2.55** | 0.76 |

### Inference Speed

| Model | RTF ↓ |
| :-- | --: |
| ThonburianTTS | 0.1150 |
| JaiTTS-F5TTS | 0.1138 |
| JaiTTS-F5TTS + Duration Predictor | 0.1652 |
| [JaiTTS-v1.0](https://arxiv.org/pdf/2604.27607) | **0.1136** |
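RTF here follows the usual definition: wall-clock synthesis time divided by the duration of the audio produced, so values below 1.0 mean faster than real time. A hypothetical helper (the function name and example numbers are ours):

```python
def real_time_factor(synthesis_seconds, num_samples, sample_rate):
    # Seconds of compute per second of generated audio.
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds

# e.g. 1.2 s of compute to generate 10 s of audio at 24 kHz
rtf = real_time_factor(1.2, 10 * 24000, 24000)
print(round(rtf, 4))  # 0.12
```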
## Installation

The inference code and pipeline structure are adapted from the [ThonburianTTS](https://huggingface.co/biodatlab/ThonburianTTS) project by biodatlab.

### 1. Install Dependencies

```bash
sudo apt install ffmpeg
```

### 2. Clone the Inference Codebase

This model uses the `flowtts` pipeline adapted from ThonburianTTS:

```bash
git clone https://github.com/biodatlab/thonburian-tts.git
cd thonburian-tts
```

## Quick Usage

Use the following snippet to run inference with the JaiTTS-F5TTS checkpoint. Ensure you are inside the `thonburian-tts` directory or have the `flowtts` module on your Python path.

```python
import torch
import soundfile as sf
from flowtts.inference import FlowTTSPipeline, ModelConfig, AudioConfig

# Configure JaiTTS-F5TTS model
model_config = ModelConfig(
    language="th",
    model_type="F5",
    # ...
    device="cuda" if torch.cuda.is_available() else "cpu"
)

# Basic audio settings
audio_config = AudioConfig(
    silence_threshold=-45,
    cfg_strength=2.5,
    # ...
    speed=1.0
)

# Initialize pipeline
pipeline = FlowTTSPipeline(model_config, audio_config)

# Inference
audio, sr = pipeline.generate(
    reference_audio="path/to/reference.wav",
    reference_text="Transcription of the reference audio.",
    gen_text="สวัสดีครับ ยินดีที่ได้รู้จัก ผมคือ AI ที่สร้างโดย JTS"
)

# Save result
sf.write("output.wav", audio, sr)
```
## Citation

If you find this work useful, please cite our paper:

```bibtex
@misc{karnjanaekarin2026jaittsthaivoicecloning,
      title={JaiTTS: A Thai Voice Cloning Model},
      author={Jullajak Karnjanaekarin and Pontakorn Trakuekul and Narongkorn Panitsrisit and Sumana Sumanakul and Vichayuth Nitayasomboon and Nithid Guntasin and Thanavin Denkavin and Attapol T. Rutherford},
      year={2026},
      eprint={2604.27607},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.27607},
}
```

## Acknowledgements

- Codebase adapted from [ThonburianTTS](https://github.com/biodatlab/thonburian-tts).