# WAXAL MMS-TTS — Luo (luo)

Fine-tuning-ready checkpoint for Luo (luo).
| Item | Value |
|---|---|
| WAXAL dataset config | `google/WaxalNLP` — `luo_tts` |
| Data provider | Loud and Clear |
| WAXAL data license | CC-BY-SA-4.0 |
| Base model | `facebook/mms-tts-ach` |
| Model license | CC-BY-NC 4.0 (MMS base; governs the fine-tuned model) |
## ⚠️ Proxy checkpoint

`facebook/mms-tts-luo` does not exist in MMS-TTS coverage (1,107 languages). This repository fine-tunes from the closest available linguistic donor:
| Item | Value |
|---|---|
| Proxy donor | `facebook/mms-tts-ach` |
| Ranked alternatives | ach, nyn, lug |
| Other WAXAL languages sharing this donor | none |
Proximity rationale: Luo (luo) has no MMS checkpoint. Proxy: ach (Acholi; both are Western Nilotic, Nilo-Saharan — same subgroup with closely related phonology, tone system, and lexicon). Secondary: nyn (Nyankole, Bantu, geographically co-located by Loud and Clear). Tertiary: lug (Luganda, Bantu, same recording provider).
Each recipient language is fine-tuned independently from the same donor base. Donor weights provide acoustic/prosodic warm-start; WAXAL fine-tuning adapts them to the target language.
## What this repository adds

The `facebook/mms-tts-*` Hub checkpoints are inference-only releases that make `run_vits_finetuning.py` crash. This repository applies three patches:
| File | Change |
|---|---|
| `config.json` | `pad_token_id` set to 0 (was `null`) |
| `tokenizer_config.json` | `pad_token` entry added |
| `preprocessor_config.json` | Added: `VitsFeatureExtractor` config from `ylacombe/mms-tts-eng-train` |
Model weights are not stored here. `_name_or_path` in `config.json` points to `facebook/mms-tts-ach`, so `run_vits_finetuning.py` loads weights from that checkpoint at training time.
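The three patches above amount to small JSON edits. A minimal sketch of the first two (plain `json`, no `transformers` needed; the `"<pad>"` token string is an assumption — check the actual tokenizer vocabulary for the value used in this repo):

```python
import json

def patch_config(path="config.json"):
    """Set pad_token_id to 0 so run_vits_finetuning.py can pad batches."""
    with open(path) as f:
        cfg = json.load(f)
    cfg["pad_token_id"] = 0  # was null in the inference-only release
    with open(path, "w") as f:
        json.dump(cfg, f, indent=2)
    return cfg

def patch_tokenizer_config(path="tokenizer_config.json"):
    """Add the pad_token entry the training script expects."""
    with open(path) as f:
        cfg = json.load(f)
    # "<pad>" is an assumed placeholder; use the token mapped to id 0 in vocab.json
    cfg.setdefault("pad_token", "<pad>")
    with open(path, "w") as f:
        json.dump(cfg, f, indent=2)
    return cfg
```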
### preprocessor_config.json

Downloaded verbatim from `ylacombe/mms-tts-eng-train`. The values are VITS architecture constants shared by all MMS-TTS languages.
| Field | Value |
|---|---|
| `feature_extractor_type` | `VitsFeatureExtractor` |
| `feature_size` | 80 |
| `hop_length` | 256 |
| `max_wav_value` | 32768.0 |
| `n_fft` | 1024 |
| `padding_side` | right |
| `padding_value` | 0.0 |
| `return_attention_mask` | False |
| `sampling_rate` | 16000 |
| `spec_gain` | 1 |
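These constants fix the spectrogram geometry. For example, assuming a center-padded STFT (the `torch.stft` default used by VITS-style models), `hop_length = 256` determines the frame count as `num_samples // hop_length + 1` — a sketch in plain arithmetic:

```python
def num_frames(num_samples: int, hop_length: int = 256) -> int:
    """Frame count of a center-padded STFT for a given number of samples."""
    return num_samples // hop_length + 1

# one second of audio at the 16 kHz MMS sampling rate
print(num_frames(16000))  # 16000 // 256 + 1 = 63
```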
## Usage in finetune-hf-vits

```json
{
  "model_name_or_path": "rnjema-unima/mms-tts-luo-baseline",
  "feature_extractor_name": "rnjema-unima/mms-tts-luo-baseline",
  "dataset_name": "google/WaxalNLP",
  "dataset_config_name": "luo_tts",
  "audio_column_name": "audio",
  "text_column_name": "text",
  "train_split_name": "train",
  "eval_split_name": "validation"
}
```
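The config above is passed to `run_vits_finetuning.py` as its single JSON argument. A quick pre-flight check that the file parses and carries the expected fields (a sketch; the key list is an assumption drawn from the fields used above, not the script's full schema):

```python
import json

REQUIRED = {"model_name_or_path", "feature_extractor_name",
            "dataset_name", "dataset_config_name",
            "audio_column_name", "text_column_name"}

def validate_config(path: str) -> dict:
    """Load a finetuning JSON config and fail fast on missing keys."""
    with open(path) as f:
        cfg = json.load(f)
    missing = REQUIRED - cfg.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return cfg
```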
## Inference (after fine-tuning)

```python
import torch
import scipy.io.wavfile  # import the subpackage explicitly; `import scipy` alone may not expose it
from transformers import VitsModel, VitsTokenizer

model = VitsModel.from_pretrained("your-org/your-finetuned-model")
tokenizer = VitsTokenizer.from_pretrained("your-org/your-finetuned-model")

inputs = tokenizer("Your text in Luo.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

scipy.io.wavfile.write("output.wav", model.config.sampling_rate,
                       out.waveform.squeeze().numpy())
```
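One caveat: `scipy.io.wavfile.write` stores a float32 array as a 32-bit float WAV, which some players reject. A small sketch for converting to standard 16-bit PCM first, using the `max_wav_value` constant (32768.0) from the feature-extractor config:

```python
import numpy as np

def float_to_pcm16(waveform: np.ndarray, max_wav_value: float = 32768.0) -> np.ndarray:
    """Clip a [-1, 1] float waveform and rescale it to 16-bit PCM samples."""
    clipped = np.clip(waveform, -1.0, 1.0)
    return (clipped * (max_wav_value - 1)).astype(np.int16)

# e.g. scipy.io.wavfile.write("output.wav", 16000, float_to_pcm16(audio))
```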
## Technical details

| Item | Value |
|---|---|
| Architecture | VITS (end-to-end, no separate vocoder) |
| MMS match type | proxy |
| `pad_token_id` | 0 |
| `vocab_size` | 28 |
| `is_uroman` | false |
| `sampling_rate` | 16000 Hz |
## References

The base model was developed by Vineel Pratap et al. at Meta AI. If you use this model, consider citing the MMS paper:

```bibtex
@article{pratap2023mms,
  title={Scaling Speech Technology to 1,000+ Languages},
  author={Vineel Pratap and Andros Tjandra and Bowen Shi and Paden Tomasello and Arun Babu and Sayani Kundu and Ali Elkahky and Zhaoheng Ni and Apoorv Vyas and Maryam Fazel-Zarandi and Alexei Baevski and Yossi Adi and Xiaohui Zhang and Wei-Ning Hsu and Alexis Conneau and Michael Auli},
  journal={arXiv},
  year={2023}
}
```