---
license: apache-2.0
language:
  - th
  - en
base_model:
  - SWivid/F5-TTS
pipeline_tag: text-to-speech
---

# JaiTTS-F5TTS: Thai Voice Cloning Model (Research Prototype)


JaiTTS-F5TTS is a non-autoregressive voice cloning model in the JaiTTS family, based on F5-TTS. It targets zero-shot voice cloning for Thai.

**Research prototype:** JaiTTS-F5TTS is one of our experimental variants within the JaiTTS project. It is released for research and benchmarking only.

## Highlights

- F5-TTS-based non-autoregressive voice cloning for Thai
- Duration predictor for improved pacing and intelligibility
- Fast synthesis with Real-Time Factor (RTF) below 0.2

## Duration Modeling

The original F5-TTS duration estimate uses a UTF-8 byte-ratio formula. This is brittle for Thai and mixed-script input because Thai characters, English words, Arabic numerals, and punctuation do not have a consistent byte-to-pronunciation relationship. In practice, the mismatch can produce rushed, compressed, or unstable speech.
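For context, the baseline scales the reference audio duration by the ratio of UTF-8 byte lengths of the generated and reference texts (the exact F5-TTS formula also accounts for a speed factor; this is a simplified sketch, and the function name is ours). Because Thai characters occupy 3 UTF-8 bytes while Latin letters occupy 1, the byte count is a poor proxy for pronunciation length:

```python
def byte_ratio_duration(ref_audio_sec: float, ref_text: str, gen_text: str) -> float:
    """Estimate target duration by scaling the reference duration with the
    ratio of UTF-8 byte lengths (simplified F5-TTS-style baseline)."""
    ref_bytes = len(ref_text.encode("utf-8"))
    gen_bytes = len(gen_text.encode("utf-8"))
    return ref_audio_sec * gen_bytes / ref_bytes

# Thai code points are 3 bytes each in UTF-8; Latin letters are 1 byte.
# Mixed-script text therefore skews the ratio away from spoken length.
print(len("สวัสดี".encode("utf-8")))  # 18 bytes for 6 code points
print(len("hello".encode("utf-8")))   # 5 bytes for 5 letters
```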

We address this with an XLM-R-based neural duration predictor that estimates target duration from text more robustly than the UTF-8 byte-ratio baseline.

The data used to train and evaluate the duration predictor is sampled from the JaiTTS-v1.0 training set.

### Duration Predictor Architecture

The duration predictor uses XLM-R base as the text encoder. Text representations are aggregated with masked mean pooling, then passed to a regression head composed of linear layers with GELU activation and dropout. The predictor also uses log-transformed syllable counts as an auxiliary feature, which provides a more pronunciation-aware signal than byte length for Thai and mixed-script text.
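The pooling and regression head described above can be sketched in PyTorch as follows. This is a minimal illustration, not the released implementation: layer sizes, the dropout rate, and how the log-syllable feature is concatenated are our assumptions, and the XLM-R base encoder is applied upstream, represented here only by its token states.

```python
import torch
import torch.nn as nn

class DurationHead(nn.Module):
    """Masked mean pooling over encoder token states, followed by a
    regression head (Linear -> GELU -> Dropout -> Linear) that also takes
    a log-transformed syllable count as an auxiliary feature.
    Sizes are illustrative; the encoder (XLM-R base) runs upstream."""

    def __init__(self, enc_dim: int = 768, hidden: int = 256, p_drop: float = 0.1):
        super().__init__()
        # +1 input dimension for the auxiliary syllable-count feature.
        self.head = nn.Sequential(
            nn.Linear(enc_dim + 1, hidden),
            nn.GELU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden, 1),
        )

    def forward(self, token_states, attention_mask, syllable_counts):
        # token_states: (B, T, enc_dim) from XLM-R; attention_mask: (B, T)
        mask = attention_mask.unsqueeze(-1).float()
        # Masked mean pooling over non-padding tokens.
        pooled = (token_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-6)
        # Log-transformed syllable count: a pronunciation-aware signal.
        syl = torch.log1p(syllable_counts.float()).unsqueeze(-1)
        return self.head(torch.cat([pooled, syl], dim=-1)).squeeze(-1)  # seconds

# Smoke test with random stand-ins for encoder outputs:
head = DurationHead()
states = torch.randn(2, 10, 768)
mask = torch.ones(2, 10, dtype=torch.long)
syl = torch.tensor([12, 7])
print(head(states, mask, syl).shape)  # torch.Size([2])
```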

### Duration Prediction Metrics

Errors are reported in seconds. Lower is better.

- **MAE:** Mean absolute error across all samples.
- **p50 Error:** The 50th-percentile absolute error.
- **p90 Error:** The 90th-percentile absolute error.
- **p95 Error:** The 95th-percentile absolute error.

| Model | MAE ↓ | p50 Error ↓ | p90 Error ↓ | p95 Error ↓ |
|---|---|---|---|---|
| F5-TTS UTF-8 baseline | 1.7064 | 1.0987 | 4.0461 | 5.3914 |
| XLM-R predictor | 1.0924 | 0.7118 | 2.6319 | 3.4425 |
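These metrics follow directly from per-sample absolute errors in seconds; a minimal sketch of the computation (function and dictionary key names are ours):

```python
import numpy as np

def duration_error_metrics(pred_sec, true_sec):
    """MAE and percentile absolute errors (in seconds) for duration predictions."""
    errors = np.abs(np.asarray(pred_sec, dtype=float) - np.asarray(true_sec, dtype=float))
    return {
        "MAE": errors.mean(),
        "p50": np.percentile(errors, 50),
        "p90": np.percentile(errors, 90),
        "p95": np.percentile(errors, 95),
    }

# Toy example with three predicted/reference duration pairs:
print(duration_error_metrics([1.0, 2.0, 3.5], [1.2, 1.5, 3.0]))
```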

## Benchmark Results

### Objective Evaluation

Objective evaluation uses the same benchmark as the paper *JaiTTS: A Thai Voice Cloning Model*. Results can be reproduced with the benchmark instructions in the GitHub repository.

| Model | Short CER (%) ↓ | Short SIM ↑ | Long CER (%) ↓ | Long SIM ↑ |
|---|---|---|---|---|
| ThonburianTTS | 6.26 | 0.48 | -- | -- |
| JaiTTS-F5TTS | 4.78 | 0.60 | 12.63 | 0.80 |
| JaiTTS-F5TTS + Duration Predictor | 4.26 | 0.58 | 11.57 | 0.80 |
| JaiTTS-v1.0 | 1.94 | 0.62 | 2.55 | 0.76 |
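CER here is the character error rate: the character-level edit distance between the recognized transcript and the reference, normalized by reference length. A self-contained sketch of that computation (not the exact benchmark script):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    m, n = len(reference), len(hypothesis)
    # Dynamic-programming edit distance over characters, one row at a time.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / max(m, 1)

print(cer("สวัสดีครับ", "สวัสดีคับ"))  # one character deleted from the reference
```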

### Inference Speed

| Model | RTF ↓ |
|---|---|
| ThonburianTTS | 0.1150 |
| JaiTTS-F5TTS | 0.1138 |
| JaiTTS-F5TTS + Duration Predictor | 0.1652 |
| JaiTTS-v1.0 | 0.1136 |
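RTF is wall-clock synthesis time divided by the duration of the audio produced, so values below 1 mean faster-than-real-time synthesis. A minimal sketch of the measurement (the commented `pipeline.generate` call is illustrative, mirroring the Quick Usage snippet below):

```python
import time

def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the generated audio."""
    return synthesis_seconds / audio_seconds

# Measuring around a synthesis call would look like:
# t0 = time.perf_counter()
# audio, sr = pipeline.generate(...)
# rtf = real_time_factor(time.perf_counter() - t0, len(audio) / sr)

print(real_time_factor(0.57, 5.0))  # e.g. 0.57 s spent to synthesize 5 s of audio
```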

## Installation

This inference code and pipeline structure are adapted from the ThonburianTTS project by biodatlab.

### 1. Install Dependencies

```bash
pip install torch soundfile cached-path librosa transformers f5-tts
sudo apt install ffmpeg
```

### 2. Clone the Inference Codebase

This model uses the `flowtts` pipeline adapted from ThonburianTTS:

```bash
git clone https://github.com/biodatlab/thonburian-tts.git
cd thonburian-tts
```

## Quick Usage

Use the following snippet to run inference with the JaiTTS-F5TTS checkpoint. Ensure you are inside the `thonburian-tts` directory or have the `flowtts` module on your Python path.

```python
import torch
import soundfile as sf
from flowtts.inference import FlowTTSPipeline, ModelConfig, AudioConfig

model_config = ModelConfig(
    language="th",
    model_type="F5",
    checkpoint="hf://JTS-AI/JaiTTS-F5TTS/model.pt",
    vocab_file="hf://JTS-AI/JaiTTS-F5TTS/vocab.txt",
    vocoder="vocos",
    device="cuda" if torch.cuda.is_available() else "cpu",
)

audio_config = AudioConfig(
    silence_threshold=-45,
    cfg_strength=2.5,
    nfe_step=32,
    speed=1.0,
)

pipeline = FlowTTSPipeline(model_config, audio_config)

audio, sr = pipeline.generate(
    reference_audio="path/to/reference.wav",
    reference_text="Transcription of the reference audio.",
    gen_text="สวัสดีครับ ยินดีที่ได้รู้จัก ผมคือ AI ที่สร้างโดย JTS",
)

sf.write("output.wav", audio, sr)
```

## Citation

If you find this work useful, please cite our paper:

```bibtex
@misc{karnjanaekarin2026jaittsthaivoicecloning,
      title={JaiTTS: A Thai Voice Cloning Model},
      author={Jullajak Karnjanaekarin and Pontakorn Trakuekul and Narongkorn Panitsrisit and Sumana Sumanakul and Vichayuth Nitayasomboon and Nithid Guntasin and Thanavin Denkavin and Attapol T. Rutherford},
      year={2026},
      eprint={2604.27607},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.27607},
}
```

## Acknowledgements