# Unified Multilingual SpeechT5 (Hindi & Bengali)
This repository contains Kaggle-ready, memory-optimized pipelines for fine-tuning Microsoft's SpeechT5 architecture for both Text-to-Speech (TTS) and Speech-to-Text (ASR/STT) in two major Indic languages: Hindi and Bengali.
## Features & Optimizations
- **Unified Character Mapping:** A custom dictionary maps Hindi (Devanagari) and Bengali scripts into SpeechT5's default Latin vocabulary, avoiding the need to resize the embedding layer or retrain the tokenizer from scratch.
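A minimal sketch of how such a mapping can work. The characters and Latin targets below are illustrative, not the repository's actual table, which covers the full Devanagari and Bengali ranges:

```python
# Illustrative character map: Latin targets must already exist
# in SpeechT5's default tokenizer vocabulary.
CHAR_MAP = {
    "न": "n", "म": "m", "स": "s", "त": "t",
    "े": "e", "्": "",  # the virama (halant) is dropped
}

def to_latin(text: str) -> str:
    """Map each Indic character to its Latin stand-in; pass others through."""
    return "".join(CHAR_MAP.get(ch, ch) for ch in text)

print(to_latin("नमस्ते"))
```

Because every output character is already in the base vocabulary, the pre-trained text embeddings can be reused as-is.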
- **PEFT (LoRA) Integration:** Freezes roughly 99% of the 144M-parameter model and trains only lightweight adapter modules. This drastically reduces VRAM requirements and enables larger batch sizes.
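A rough sketch of the LoRA setup with the `peft` library. The hyperparameters are placeholders, and `target_modules` assumes the attention projections in the Hugging Face SpeechT5 implementation are named `q_proj`/`v_proj`; check the actual module names in the notebook:

```python
from transformers import SpeechT5ForTextToSpeech
from peft import LoraConfig, get_peft_model

model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")

# Hypothetical adapter hyperparameters for illustration only.
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapters remain trainable
```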
- **Kaggle T4 Tested:** Optimizations built specifically for the free Kaggle 16 GB T4 GPU environment:
  - Strict `CUDA_VISIBLE_DEVICES` handling to prevent multi-GPU `DataParallel` zip crash errors.
  - A `DataCollator` override that pads sequence lengths to an even number, satisfying SpeechT5's 2x reduction-factor constraint.
  - Dataset preprocessing filters that drop overly long audio samples before they exhaust GPU memory.
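The single-GPU pinning and even-length padding tricks above can be sketched in isolation (pure Python with hypothetical names; the real notebook applies the padding inside a `DataCollator`):

```python
import os

# Pin training to a single GPU before torch initializes CUDA,
# so Kaggle's second T4 never triggers DataParallel.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

def pad_to_even(seq, pad_value=0):
    """Pad a sequence so its length is even, matching SpeechT5's
    reduction factor of 2 on the decoder side."""
    if len(seq) % 2 != 0:
        seq = list(seq) + [pad_value]
    return seq

batch = [[1, 2, 3], [4, 5, 6, 7]]
padded = [pad_to_even(s) for s in batch]
print([len(s) for s in padded])  # [4, 4]
```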
## Training Metrics (TTS Model)
The unified Text-to-Speech model was fine-tuned on Kaggle using the dual-language dataset. With PEFT and VRAM-tuned hyperparameters, the training loop completed with the following metrics:
- Total Unified Samples: 2,534 (Hindi + Bengali)
- Total Epochs Completed: ~60
- Global Steps: 2,000
- Initial Loss: `10.11` (at Step 200)
- Final Loss: `8.70` (at Step 2000)
## Repository Structure
- `Unified_Fine_Tune_SpeechT5.ipynb`: The core unified Kaggle notebook for Text-to-Speech (TTS) fine-tuning.
- `Unified_Fine_Tune_SpeechT5_STT.ipynb`: The companion Kaggle notebook for Speech-to-Text (ASR) fine-tuning.
- `Hindi TTS app T5/app.py`: Gradio web interface for deploying your custom Hindi TTS model.
- `Bengali tts app T5/app.py`: Gradio web interface for deploying your custom Bengali TTS model.
## How to Train
1. Create an account on Kaggle and start a new notebook.
2. Set the accelerator to GPU T4 x2.
3. Upload either the TTS or STT `.ipynb` file.
4. If you are using the Hindi dataset (`abhirl/hindi-tts-dataset`), click "Agree and access repository" on its Hugging Face page first, as it is a gated dataset.
5. Provide your Hugging Face API token via `notebook_login()` or Kaggle Secrets so the notebook can download the dataset and later upload your finished LoRA weights.
6. Hit "Run All". The model trains automatically and uploads your adapter checkpoints to your Hugging Face profile.
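The Kaggle Secrets path can look like this inside the notebook. The secret name `HF_TOKEN` is a placeholder: use whatever label you stored under Add-ons > Secrets in the Kaggle notebook editor (this fragment only runs inside a Kaggle session):

```python
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient  # Kaggle-only module

# "HF_TOKEN" is a placeholder label for your stored Hugging Face token.
token = UserSecretsClient().get_secret("HF_TOKEN")
login(token=token)
```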
## Deployment
Once your model has been uploaded to the Hugging Face Hub (e.g. `YourUsername/SpeechT5-Unified-TTS-PEFT`), you can load it in your Gradio interfaces (`app.py`) by combining `PeftModel.from_pretrained` with the base SpeechT5 architecture.
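A possible loading sequence for `app.py`. The adapter repo id is the placeholder from above, and the processor/vocoder names follow the standard `microsoft/speecht5_*` checkpoints (this fragment downloads weights, so it needs network access):

```python
from transformers import SpeechT5ForTextToSpeech, SpeechT5Processor, SpeechT5HifiGan
from peft import PeftModel

# Load the frozen base model, then attach and merge the LoRA adapters.
base = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
model = PeftModel.from_pretrained(base, "YourUsername/SpeechT5-Unified-TTS-PEFT")
model = model.merge_and_unload()  # fold LoRA weights into the base model

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
```

Merging the adapters up front removes the PEFT wrapper, so inference in the Gradio app behaves like a plain SpeechT5 model.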