# Unified Multilingual SpeechT5 (Hindi & Bengali)
This repository contains Kaggle-ready, memory-optimized pipelines for fine-tuning Microsoft's SpeechT5 architecture for both Text-to-Speech (TTS) and Speech-to-Text (ASR/STT) in two major Indic languages: Hindi and Bengali.
## Features & Optimizations
- **Unified Character Mapping:** A custom dictionary maps Hindi (Devanagari) and Bengali scripts into SpeechT5's default Latin vocabulary, avoiding the need to resize the embedding layer or retrain the tokenizer from scratch.
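A minimal sketch of how such a mapping can work. The characters and Latin targets below are illustrative, not the repository's actual table, which covers the full Devanagari and Bengali ranges:

```python
# Illustrative character map: Latin targets must already exist
# in SpeechT5's default tokenizer vocabulary.
CHAR_MAP = {
    "न": "n", "म": "m", "स": "s", "त": "t",
    "े": "e", "्": "",  # the virama (halant) is dropped
}

def to_latin(text: str) -> str:
    """Map each Indic character to its Latin stand-in; pass others through."""
    return "".join(CHAR_MAP.get(ch, ch) for ch in text)

print(to_latin("नमस्ते"))
```

Because every output character is already in the base vocabulary, the pre-trained text embeddings can be reused as-is.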
- **PEFT (LoRA) Integration:** Freezes roughly 99% of the 144M-parameter model and trains only lightweight adapter modules. This drastically reduces VRAM requirements and enables larger batch sizes.
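A rough sketch of the LoRA setup with the `peft` library. The hyperparameters are placeholders, and `target_modules` assumes the attention projections in the Hugging Face SpeechT5 implementation are named `q_proj`/`v_proj`; check the actual module names in the notebook:

```python
from transformers import SpeechT5ForTextToSpeech
from peft import LoraConfig, get_peft_model

model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")

# Hypothetical adapter hyperparameters for illustration only.
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapters remain trainable
```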
- **Kaggle T4 Tested:** Optimizations built specifically for the free Kaggle 16 GB T4 GPU environment:
  - Strict `CUDA_VISIBLE_DEVICES` handling to prevent multi-GPU `DataParallel` zip crash errors.
  - A `DataCollator` override that pads sequence lengths to an even number, satisfying SpeechT5's 2x reduction-factor constraint.
  - Dataset preprocessing filters that drop overly long audio samples before they exhaust GPU memory.
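The single-GPU pinning and even-length padding tricks above can be sketched in isolation (pure Python with hypothetical names; the real notebook applies the padding inside a `DataCollator`):

```python
import os

# Pin training to a single GPU before torch initializes CUDA,
# so Kaggle's second T4 never triggers DataParallel.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

def pad_to_even(seq, pad_value=0):
    """Pad a sequence so its length is even, matching SpeechT5's
    reduction factor of 2 on the decoder side."""
    if len(seq) % 2 != 0:
        seq = list(seq) + [pad_value]
    return seq

batch = [[1, 2, 3], [4, 5, 6, 7]]
padded = [pad_to_even(s) for s in batch]
print([len(s) for s in padded])  # [4, 4]
```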
## Training Metrics (TTS Model)
The unified Text-to-Speech model was fine-tuned on Kaggle using the dual-language dataset. With PEFT and VRAM-tuned hyperparameters, the training loop completed with the following metrics:
- Total Unified Samples: 2,534 (Hindi + Bengali)
- Total Epochs Completed: ~60
- Global Steps: 2,000
- Initial Loss: `10.11` (at Step 200)
- Final Loss: `8.70` (at Step 2000)
## Repository Structure
- `Unified_Fine_Tune_SpeechT5.ipynb`: The core unified Kaggle notebook for Text-to-Speech (TTS) fine-tuning.
- `Unified_Fine_Tune_SpeechT5_STT.ipynb`: The companion Kaggle notebook for Speech-to-Text (ASR) fine-tuning.
- `Hindi TTS app T5/app.py`: Gradio web interface for deploying your custom Hindi TTS model.
- `Bengali tts app T5/app.py`: Gradio web interface for deploying your custom Bengali TTS model.
## How to Train
1. Create an account on Kaggle and start a new notebook.
2. Set the accelerator to GPU T4 x2.
3. Upload either the TTS or STT `.ipynb` file.
4. If you are using the Hindi dataset (`abhirl/hindi-tts-dataset`), click "Agree and access repository" on its Hugging Face page first, as it is a gated dataset.
5. Provide your Hugging Face API token via `notebook_login()` or Kaggle Secrets so the notebook can download the dataset and later upload your finished LoRA weights.
6. Hit "Run All". The model trains automatically and uploads your adapter checkpoints to your Hugging Face profile.
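The Kaggle Secrets path can look like this inside the notebook. The secret name `HF_TOKEN` is a placeholder: use whatever label you stored under Add-ons > Secrets in the Kaggle notebook editor (this fragment only runs inside a Kaggle session):

```python
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient  # Kaggle-only module

# "HF_TOKEN" is a placeholder label for your stored Hugging Face token.
token = UserSecretsClient().get_secret("HF_TOKEN")
login(token=token)
```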
## Deployment
Once your model has been uploaded to the Hugging Face Hub (e.g. `YourUsername/SpeechT5-Unified-TTS-PEFT`), you can load it in your Gradio interfaces (`app.py`) by combining `PeftModel.from_pretrained` with the base SpeechT5 architecture.
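A possible loading sequence for `app.py`. The adapter repo id is the placeholder from above, and the processor/vocoder names follow the standard `microsoft/speecht5_*` checkpoints (this fragment downloads weights, so it needs network access):

```python
from transformers import SpeechT5ForTextToSpeech, SpeechT5Processor, SpeechT5HifiGan
from peft import PeftModel

# Load the frozen base model, then attach and merge the LoRA adapters.
base = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
model = PeftModel.from_pretrained(base, "YourUsername/SpeechT5-Unified-TTS-PEFT")
model = model.merge_and_unload()  # fold LoRA weights into the base model

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
```

Merging the adapters up front removes the PEFT wrapper, so inference in the Gradio app behaves like a plain SpeechT5 model.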