F5-TTS Vietnamese (PhoAudiobook Experiments)

This repository contains experimental checkpoints for F5-TTS adapted for the Vietnamese language.

The models were trained using the PhoAudiobook dataset to evaluate the performance differences between extensive pre-training and standard fine-tuning.

Model Variants

There are two distinct model versions included in this repository:

1. Pre-trained Vietnamese Base (1000 Hours)

  • Filename: model_pretrain_1000h.pt (example name; use the actual checkpoint filename from the repository)
  • Training Strategy: Pre-trained from scratch (or continued pre-training) on 1000 hours of Vietnamese audio from PhoAudiobook.
  • Goal: To establish a robust native understanding of Vietnamese phonemes and prosody without relying heavily on the original English priors.

2. Fine-tuned Base (200 Hours)

  • Filename: model_finetune_200h.pt (example name; use the actual checkpoint filename from the repository)
  • Training Strategy: Fine-tuned on 200 hours of PhoAudiobook data.
  • Base Model: Started from the official F5-TTS Base model.
  • Goal: To quickly adapt the pre-existing capabilities of F5-TTS to Vietnamese with a smaller, high-quality dataset.

Observations & Performance Tips

Word Skipping vs. Repetition

  • Pre-trained Model: Prioritizes natural flow but aligns less strictly with the input text, so it tends to skip words.
  • Fine-tuned Model: Prioritizes text adherence but sacrifices some native accent quality.
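
One way to make the skipping and repetition described above measurable is to transcribe the generated audio and compare the transcript with the input text. The sketch below assumes the transcript is already available as a string (from any speech recognizer or from manual listening, neither of which is part of this repository). `diff_words` is a hypothetical helper name, and the plain word-level diff is an illustration, not a metric reported for these checkpoints.

```python
from difflib import SequenceMatcher


def diff_words(target_text, transcript):
    """Return (missing, extra) word lists for a transcript versus the target text.

    missing: words in the target text that the transcript lacks (possible skips)
    extra:   words in the transcript that the target lacks (possible repetitions)
    """
    ref = target_text.lower().split()
    hyp = transcript.lower().split()
    missing, extra = [], []
    matcher = SequenceMatcher(None, ref, hyp, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            missing.extend(ref[i1:i2])
        if tag in ("insert", "replace"):
            extra.extend(hyp[j1:j2])
    return missing, extra


# Toy example: one skipped word ("là") and one repeated word ("tim").
missing, extra = diff_words(
    "hồ gươm là trái tim thủ đô",
    "hồ gươm trái tim tim thủ đô",
)
print("missing:", missing)
print("extra:", extra)
```

Counting the two lists over a fixed set of test sentences gives a rough way to compare the two checkpoints, or different CFG settings, on the same text.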

Impact of CFG (Classifier-Free Guidance)

Increasing the CFG strength (e.g., above 2.0) makes the model follow the input text more strictly.

  • Higher CFG: Reduces missing words but increases the chance of repeated words.
  • Lower CFG: More diverse/natural flow but higher risk of skipping words.
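
For intuition, classifier-free guidance combines a text-conditioned prediction with an unconditioned one. The sketch below assumes the extrapolation form `cond + cfg_strength * (cond - uncond)`, which is how the F5-TTS sampler appears to apply it. The function name and the toy numbers are illustrative only and do not come from this repository.

```python
def apply_cfg(cond, uncond, cfg_strength):
    """Element-wise guidance: push the conditional prediction away from the
    unconditional one by a factor of cfg_strength (assumed form, see lead-in)."""
    return [c + cfg_strength * (c - u) for c, u in zip(cond, uncond)]


cond = [1.0, 0.5]    # toy prediction with text conditioning
uncond = [0.5, 0.5]  # toy prediction without text conditioning

for strength in (0.0, 2.0, 3.0):
    print(strength, apply_cfg(cond, uncond, strength))
```

With a strength of 0 the output equals the conditional prediction. Larger values amplify whatever the text conditioning changes, which is consistent with stricter text adherence at higher settings.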

Recommended Inference Settings

For the best generation quality and stability with these checkpoints, use the following parameters:

  • Reference Audio: 5 to 10 seconds (clean, single-speaker audio works best)
  • CFG Strength: 2.0
  • Sway Sampling Coefficient: -1.0
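
To illustrate what the sway sampling coefficient does, the sketch below assumes the schedule given in the F5-TTS paper, t = u + s * (cos(pi/2 * u) - 1 + u), where u is a uniformly spaced step in [0, 1] and s is the coefficient. The helper names are hypothetical, and 8 steps are used only to keep the output short.

```python
import math


def sway_sample(u, s):
    """Map a uniform step u in [0, 1] to a flow time t (assumed formula, see lead-in)."""
    return u + s * (math.cos(math.pi / 2 * u) - 1 + u)


def timesteps(nfe_step, s):
    """Flow times for nfe_step function evaluations with coefficient s."""
    return [sway_sample(i / nfe_step, s) for i in range(nfe_step + 1)]


uniform = timesteps(8, 0.0)   # s = 0 leaves the steps evenly spaced
swayed = timesteps(8, -1.0)   # s = -1 is the setting recommended above

for u, t in zip(uniform, swayed):
    print(f"u={u:.3f} -> t={t:.3f}")
```

Under this formula, a coefficient of -1.0 reduces to t = 1 - cos(pi/2 * u), which places more of the steps near t = 0, the early part of the generation process.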

How to Use

To use these checkpoints, load the specific .pt file into your F5-TTS inference setup.

# --model_cfg: config file, only needed for the pre-trained checkpoint
# --ckpt_file: model checkpoint
# --vocab_file: differs between the pre-trained and fine-tuned checkpoints
python src/f5_tts/infer/infer_cli.py \
  --model_cfg "conf/model.yaml" \
  --ckpt_file "model.pt" \
  --ref_audio "audio.wav" \
  --ref_text "lúc trước là hắn đều chọc cho tôi khóc, sau đó lại dỗ dành tôi, nhưng mà lần này hắn lại khóc." \
  --gen_text "hồ gươm, trái tim thủ đô hà nội, là một bức tranh thủy mặc sống động với mặt hồ xanh biếc phẳng lặng như gương, phản chiếu hàng liễu rủ bóng và tháp rùa cổ kính soi mình giữa hồ. buổi sáng, nơi đây tĩnh lặng, ngập tràn không khí trong lành với người tập thể dục, nhưng cũng trở nên rực rỡ, lung linh hơn khi đêm xuống, kết hợp cùng cầu thê húc đỏ son và đền ngọc sơn linh thiêng, mang vẻ đẹp vừa cổ kính, vừa hiện đại, gắn liền với truyền thuyết hào hùng về sự tích trả gươm của vua lê lợi, tạo nên một biểu tượng văn hóa không thể phai mờ" \
  --vocab_file data/vietnamese_vocab.txt \
  --speed 1.0 \
  --cfg_strength 3.0 \
  --nfe_step 64 \
  --sway_sampling_coef -1.0 \
  --remove_silence
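
Because the two checkpoints need different arguments (the config file only for the pre-trained model, and a different vocabulary file for each), it can be convenient to assemble the command in Python and pass it as an argument list, which also avoids shell quoting problems with long Vietnamese text. The following is a minimal sketch, not part of this repository: `build_infer_command` is a hypothetical helper, the file names are the placeholders from the command above, and the default `cfg_strength` of 2.0 follows the recommended settings, whereas the example command passes 3.0.

```python
import subprocess  # only needed if the command is actually run


def build_infer_command(ckpt_file, vocab_file, ref_audio, ref_text, gen_text,
                        model_cfg=None, cfg_strength=2.0, nfe_step=64,
                        sway_sampling_coef=-1.0, speed=1.0, remove_silence=True):
    """Hypothetical helper: return the infer_cli.py call as an argument list."""
    cmd = ["python", "src/f5_tts/infer/infer_cli.py"]
    if model_cfg is not None:
        # Per the note above, only the pre-trained checkpoint needs a config file.
        cmd += ["--model_cfg", model_cfg]
    cmd += [
        "--ckpt_file", ckpt_file,
        "--vocab_file", vocab_file,
        "--ref_audio", ref_audio,
        "--ref_text", ref_text,
        "--gen_text", gen_text,
        "--speed", str(speed),
        "--cfg_strength", str(cfg_strength),
        "--nfe_step", str(nfe_step),
        "--sway_sampling_coef", str(sway_sampling_coef),
    ]
    if remove_silence:
        cmd.append("--remove_silence")
    return cmd


cmd = build_infer_command(
    ckpt_file="model.pt",                    # placeholder checkpoint name
    vocab_file="data/vietnamese_vocab.txt",  # placeholder vocabulary path
    ref_audio="audio.wav",
    ref_text="transcript of audio.wav",      # placeholder reference transcript
    gen_text="hồ gươm, trái tim thủ đô hà nội",
    model_cfg="conf/model.yaml",             # omit for the fine-tuned checkpoint
)
print(cmd)
# subprocess.run(cmd, check=True)  # uncomment inside an F5-TTS checkout
```

Passing the list to `subprocess.run` skips the shell entirely, so commas, quotes, and diacritics in `ref_text` and `gen_text` need no escaping.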

