---
language:
- as
license: cc-by-sa-4.0
library_name: f5-tts
tags:
- text-to-speech
- tts
- f5-tts
- open-bible
- assamese
pipeline_tag: text-to-speech
base_model: SWivid/F5-TTS
datasets:
- davidguzmanr/open-bible-resources
inference: false
---

# F5-TTS Open Bible — Assamese

A zero-shot text-to-speech model for **Assamese**, trained from scratch on the [Open Bible](https://huggingface.co/datasets/davidguzmanr/open-bible-resources) corpus using the [F5-TTS](https://github.com/SWivid/F5-TTS) architecture (diffusion transformer with vocos vocoder, 24 kHz output).

The model takes a short reference audio clip (5–10 seconds) and a target text, and synthesises the target text in the voice of the reference speaker. No fine-tuning per voice is required.

## Files

| File | Purpose |
|------|---------|
| `model_last.pt` | Trained model weights. |
| `vocab.txt` | Character vocabulary built from the training transcripts. |
| `F5-TTS_OpenBible_Assamese.yaml` | Hydra training/inference config (architecture, mel spec settings, tokenizer). |

## Intended use

- Zero-shot TTS for Assamese, controlled by a user-supplied reference clip.
- Research on multilingual TTS, low-resource TTS evaluation, and listening studies on Open Bible–style read-speech.

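
Because the voice is taken from a short reference clip (5–10 seconds), it can help to check a clip's length before passing it to the model. The following is a minimal sketch using only Python's standard `wave` module. It assumes an uncompressed PCM WAV file, and the function names `wav_duration_seconds` and `is_suitable_reference` are illustrative helpers, not part of F5-TTS.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_suitable_reference(path, min_seconds=5.0, max_seconds=10.0):
    """Return True if the clip length falls inside the suggested 5-10 s range.

    The default bounds mirror the range recommended in this card; they are
    parameters of this hypothetical helper, not limits enforced by F5-TTS.
    """
    return min_seconds <= wav_duration_seconds(path) <= max_seconds
```

Calling `is_suitable_reference("/path/to/your-assamese-clip.wav")` before inference gives an early warning when a clip is too short or too long. It says nothing about noise or the number of speakers, which still need to be checked by ear.
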
## How to use

Install F5-TTS:

```bash
pip install git+https://github.com/SWivid/F5-TTS.git
```

Download the checkpoint and run inference:

```python
import soundfile as sf
import torch
from huggingface_hub import hf_hub_download
from hydra.utils import get_class
from omegaconf import OmegaConf

from f5_tts.infer.utils_infer import (
    infer_process,
    load_model,
    load_vocoder,
    preprocess_ref_audio_text,
)

# Fetch the weights, vocabulary, and config from the Hub.
repo_id = "multilingual-tts/F5-TTS-OpenBible-Assamese"
ckpt = hf_hub_download(repo_id, "model_last.pt")
vocab = hf_hub_download(repo_id, "vocab.txt")
config = hf_hub_download(repo_id, "F5-TTS_OpenBible_Assamese.yaml")

device = "cuda" if torch.cuda.is_available() else "cpu"

# Resolve the backbone class named in the config, then load vocoder and model.
model_cfg = OmegaConf.load(config)
model_cls = get_class(f"f5_tts.model.{model_cfg.model.backbone}")
vocoder = load_vocoder(vocoder_name="vocos", is_local=False, device=device)
model = load_model(
    model_cls,
    model_cfg.model.arch,
    ckpt,
    mel_spec_type="vocos",
    vocab_file=vocab,
    use_ema=True,
    device=device,
)

# Supply your own clean, single-speaker reference clip (5–10 s) and its exact transcription.
ref_audio = "/path/to/your-assamese-clip.wav"
ref_text = "Exact transcription of the clip"
gen_text = "..."  # text to synthesise in Assamese

ref_audio_proc, ref_text_proc = preprocess_ref_audio_text(ref_audio, ref_text)
wav, sr, _ = infer_process(
    ref_audio_proc,
    ref_text_proc,
    gen_text,
    model,
    vocoder,
    mel_spec_type="vocos",
    device=device,
)

# Write the synthesised waveform to disk.
sf.write("output.wav", wav, sr)
```

## Training data

- **Source:** `davidguzmanr/open-bible-resources`, config `Assamese`
- **Size:** approximately 30,500 utterances
- **Speakers:** multispeaker; speaker identity is supplied at inference time via the reference clip, not by a fixed speaker id
- **Sample rate:** 24 kHz
- **Maximum utterance duration during training:** 15 s

## Training procedure

- Base architecture: F5-TTS v1 Base (DiT, 1024 dim, 22 layers, 16 heads, text dim 512, 4 convolutional layers).
- Tokenizer: custom character-level, built from the training transcripts.
- Vocoder: vocos.
- Mel spectrogram: 100 channels, hop 256, win 1024, n_fft 1024.
- Optimizer: AdamW, learning rate 7.5e-5, 20,000 warmup updates.
- Training budget: 500,000 optimizer updates on 4 GPUs with mixed precision (bf16), global batch ≈ 112,000 frames.

Audio preprocessing, vocab generation, and config sizing are reproducible via the upstream [open-bible-models](https://github.com/davidguzmanr/open-bible-models) repo.

## Evaluation

Evaluated alongside other Open-Bible TTS systems on character/word error rate (via Meta's Omnilingual ASR) and UTMOSv2 naturalness scores. See the [open-bible-models](https://github.com/davidguzmanr/open-bible-models) repository for the evaluation pipeline and the [open-bible-surveys](https://github.com/davidguzmanr/open-bible-surveys) repository for the human-listening survey methodology.
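
For readers unfamiliar with the error-rate metrics mentioned above, the sketch below shows how character and word error rate are commonly defined: the Levenshtein edit distance between the reference text and the ASR transcript, divided by the reference length. This is an illustrative pure-Python version, not the pipeline used in the open-bible-models repository. It performs no text normalisation and counts characters as Unicode code points, so combining marks in Assamese script are counted individually; the actual evaluation may differ on both points.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character-level edits / reference characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word-level edits / reference words (whitespace split)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

In a TTS evaluation, `reference` would be the text given to the synthesiser and `hypothesis` the ASR transcript of the generated audio. Both rates can exceed 1.0 when the transcript is much longer than the reference.
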