---
language:
- as
license: cc-by-sa-4.0
library_name: f5-tts
tags:
- text-to-speech
- tts
- f5-tts
- open-bible
- assamese
pipeline_tag: text-to-speech
base_model: SWivid/F5-TTS
datasets:
- davidguzmanr/open-bible-resources
inference: false
---
# F5-TTS Open Bible: Assamese

A zero-shot text-to-speech model for **Assamese**, trained from scratch on
the [Open Bible](https://huggingface.co/datasets/davidguzmanr/open-bible-resources)
corpus using the [F5-TTS](https://github.com/SWivid/F5-TTS) architecture
(diffusion transformer with vocos vocoder, 24 kHz output).

The model takes a short reference audio clip (5–10 seconds) and a target text,
and synthesises the target text in the voice of the reference speaker. No
fine-tuning per voice is required.
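
Before running inference, it can help to confirm that a reference clip falls in the suggested 5–10 second range. The following is a minimal sketch using only the Python standard library. It assumes the clip is an uncompressed PCM WAV file, and the helper names `clip_duration_seconds` and `is_suitable_reference` are hypothetical, not part of F5-TTS.

```python
import wave


def clip_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as f:
        return f.getnframes() / f.getframerate()


def is_suitable_reference(path, min_s=5.0, max_s=10.0):
    """Return True if the clip length lies in the suggested 5-10 s range."""
    return min_s <= clip_duration_seconds(path) <= max_s
```

The check covers duration only. Whether the clip is clean and contains a single speaker still has to be judged by ear.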
## Files

| File | Purpose |
|------|---------|
| `model_last.pt` | Trained model weights. |
| `vocab.txt` | Character vocabulary built from the training transcripts. |
| `F5-TTS_OpenBible_Assamese.yaml` | Hydra training/inference config (architecture, mel spec settings, tokenizer). |
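
Because the vocabulary is character-level and built from the training transcripts, text containing characters outside `vocab.txt` may not be synthesised as expected. The sketch below lists such characters. It assumes `vocab.txt` stores one token per line. That layout is an assumption, as are the helper names `read_vocab` and `missing_characters`.

```python
def read_vocab(path):
    """Read vocabulary tokens, assuming one token per line (a space may be a token)."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def missing_characters(text, vocab_tokens):
    """Return characters of `text` absent from the vocabulary, in order of first appearance."""
    known = set(vocab_tokens)
    missing = []
    for ch in text:
        if ch not in known and ch not in missing:
            missing.append(ch)
    return missing
```

For example, `missing_characters(gen_text, read_vocab(vocab))` could be run on the target text before synthesis to spot digits, Latin letters, or punctuation that never occurred in the training transcripts.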
## Intended use

- Zero-shot TTS for Assamese, controlled by a user-supplied reference clip.
- Research on multilingual TTS, low-resource TTS evaluation, and listening
  studies on Open Bible–style read speech.
## How to use

Install F5-TTS:

```bash
pip install git+https://github.com/SWivid/F5-TTS.git
```

Download the checkpoint and run inference:

```python
import torch
from huggingface_hub import hf_hub_download
from hydra.utils import get_class
from omegaconf import OmegaConf

from f5_tts.infer.utils_infer import (
    infer_process,
    load_model,
    load_vocoder,
    preprocess_ref_audio_text,
)

# Fetch the checkpoint, vocabulary, and config from the Hub.
repo_id = "multilingual-tts/F5-TTS-OpenBible-Assamese"
ckpt = hf_hub_download(repo_id, "model_last.pt")
vocab = hf_hub_download(repo_id, "vocab.txt")
config = hf_hub_download(repo_id, "F5-TTS_OpenBible_Assamese.yaml")

device = "cuda" if torch.cuda.is_available() else "cpu"

# Resolve the backbone class named in the config, then load the vocoder and model.
model_cfg = OmegaConf.load(config)
model_cls = get_class(f"f5_tts.model.{model_cfg.model.backbone}")
vocoder = load_vocoder(vocoder_name="vocos", is_local=False, device=device)
model = load_model(
    model_cls, model_cfg.model.arch, ckpt,
    mel_spec_type="vocos", vocab_file=vocab, use_ema=True, device=device,
)

# Supply your own clean, single-speaker reference clip (5–10 s) and its exact transcription.
ref_audio = "/path/to/your-assamese-clip.wav"
ref_text = "Exact transcription of the clip"
gen_text = "..."  # text to synthesise in Assamese

ref_audio_proc, ref_text_proc = preprocess_ref_audio_text(ref_audio, ref_text)
# infer_process returns the waveform, its sample rate, and the spectrogram.
wav, sr, _ = infer_process(
    ref_audio_proc, ref_text_proc, gen_text, model, vocoder,
    mel_spec_type="vocos", device=device,
)
```
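
The snippet above leaves the generated audio in memory as `wav` with sample rate `sr`. One way to write it to disk is sketched below with the standard library only. It assumes `wav` is a one-dimensional sequence of float samples in the range [-1, 1]. The helper `save_wav` is hypothetical. An audio library such as `soundfile` would work equally well.

```python
import struct
import wave


def save_wav(path, samples, sample_rate):
    """Write mono float samples in [-1, 1] to a 16-bit PCM WAV file."""
    pcm = bytearray()
    for x in samples:
        clipped = max(-1.0, min(1.0, float(x)))  # guard against out-of-range values
        pcm += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as f:
        f.setnchannels(1)   # mono
        f.setsampwidth(2)   # 16-bit samples
        f.setframerate(int(sample_rate))
        f.writeframes(bytes(pcm))
```

With the variables from the inference snippet, the call would be `save_wav("output.wav", wav, sr)`.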
## Training data

- **Source:** `davidguzmanr/open-bible-resources`, config `Assamese`
- **Size:** approximately 30,500 utterances
- **Speakers:** multispeaker; speaker identity is supplied at inference time
  via the reference clip, not by a fixed speaker id
- **Sample rate:** 24 kHz
- **Maximum utterance duration during training:** 15 s
## Training procedure

- Base architecture: F5-TTS v1 Base (DiT, 1024 dim, 22 layers, 16 heads,
  text dim 512, 4 convolutional layers).
- Tokenizer: custom character-level tokenizer, built from the training transcripts.
- Vocoder: vocos.
- Mel spectrogram: 100 channels, hop 256, win 1024, n_fft 1024.
- Optimizer: AdamW, learning rate 7.5e-5, 20,000 warmup updates.
- Training budget: 500,000 optimizer updates on 4 GPUs with mixed precision
  (bf16), global batch ≈ 112,000 frames.

Audio preprocessing, vocab generation, and config sizing can be reproduced with
the upstream
[open-bible-models](https://github.com/davidguzmanr/open-bible-models) repository.
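
To make the frame-based numbers above more concrete, the following short calculation converts them into seconds of audio. It uses only the values stated in this card (24 kHz sample rate, hop 256, 15 s maximum utterance, roughly 112,000 frames per global batch, 4 GPUs). The even split across GPUs is an assumption, and because the global batch size is approximate, so are the derived figures.

```python
SAMPLE_RATE = 24_000           # Hz
HOP_LENGTH = 256               # audio samples per mel frame
MAX_UTTERANCE_S = 15           # seconds
GLOBAL_BATCH_FRAMES = 112_000  # approximate, as stated above
NUM_GPUS = 4

# One mel frame is produced every HOP_LENGTH samples.
frames_per_second = SAMPLE_RATE / HOP_LENGTH

# Longest training utterance expressed in mel frames.
max_frames_per_utterance = int(MAX_UTTERANCE_S * frames_per_second)

# Amount of audio consumed by one optimizer update.
audio_seconds_per_update = GLOBAL_BATCH_FRAMES / frames_per_second

# Per-GPU share, assuming the global batch is split evenly.
frames_per_gpu = GLOBAL_BATCH_FRAMES // NUM_GPUS

print(f"{frames_per_second} mel frames per second")
print(f"{max_frames_per_utterance} frames in a 15 s utterance")
print(f"{audio_seconds_per_update / 60:.1f} minutes of audio per update")
print(f"{frames_per_gpu} frames per GPU")
```

Under these assumptions, each optimizer update sees a little under 20 minutes of audio.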
## Evaluation

The model was evaluated alongside other Open Bible TTS systems on character and
word error rate (computed with Meta's Omnilingual ASR) and on UTMOSv2
naturalness scores. See the
[open-bible-models](https://github.com/davidguzmanr/open-bible-models) repository
for the evaluation pipeline and the
[open-bible-surveys](https://github.com/davidguzmanr/open-bible-surveys) repository
for the human-listening survey methodology.
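
For readers unfamiliar with the error-rate metrics, the sketch below shows how character and word error rate are conventionally defined: the Levenshtein edit distance between the ASR transcript of the synthesised audio and the reference text, divided by the reference length. This is an illustration of the standard definition, not the evaluation pipeline itself. Any text normalisation applied there is not reproduced here, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Note that Python strings iterate over Unicode code points, so in Assamese script a base letter and its combining vowel sign count as separate characters here. A real pipeline may normalise or segment text differently.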