Whisper Large V3 Turbo Malayalam ASR – Full Fine-Tuned Model
Model Description
This model is a fully fine-tuned version of openai/whisper-large-v3-turbo for Malayalam Automatic Speech Recognition (ASR). The model was adapted using the Malayalam subset of the AI4Bharat IndicVoices dataset to improve transcription accuracy for Malayalam speech, especially in low-resource and regional-language ASR settings.
The model is intended for Malayalam speech-to-text transcription and was developed as part of an academic research project on fine-tuning Whisper ASR models for Malayalam.
Model Details
| Field | Description |
|---|---|
| Base model | openai/whisper-large-v3-turbo |
| Fine-tuning method | Full model fine-tuning |
| Language | Malayalam (ml) |
| Task | Automatic Speech Recognition / Transcription |
| Dataset | ai4bharat/IndicVoices, Malayalam subset |
| Sampling rate | 16 kHz |
| Evaluation metric | Word Error Rate (WER) |
| Framework | Hugging Face Transformers, PyTorch |
| Training epochs | 10 |
| Precision | BF16 |
Intended Use
This model can be used for:
- Malayalam speech transcription
- ASR research for low-resource Indic languages
- Academic experiments comparing full fine-tuning and PEFT methods
- Speech-based applications in Malayalam such as accessibility tools, transcription systems, and voice-enabled interfaces
Dataset
The model was trained and evaluated using the Malayalam subset of the AI4Bharat IndicVoices dataset. Audio files were resampled to 16 kHz before feature extraction. The text transcriptions were tokenized using the Whisper tokenizer configured for Malayalam transcription.
The preprocessing pipeline included:
- Loading Malayalam train and validation splits from ai4bharat/IndicVoices
- Removing unused metadata columns
- Casting audio to 16 kHz
- Extracting Whisper log-Mel input features
- Tokenizing Malayalam text labels
- Filtering examples exceeding the maximum decoder target length
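The feature-extraction, tokenization, and length-filtering steps above can be sketched roughly as follows. This is a minimal illustration, not the exact training script: the column names `audio` and `text`, the helper names, and the 448-token limit (taken from `generation_max_length`) are assumptions, and `processor` is expected to be a `WhisperProcessor` loaded elsewhere.

```python
MAX_LABEL_LENGTH = 448  # assumption: same limit as generation_max_length


def prepare_example(example, processor, audio_column="audio", text_column="text"):
    """Convert one raw example into Whisper input features and label ids.

    `audio_column` and `text_column` are hypothetical column names; adjust
    them to the actual dataset schema.
    """
    audio = example[audio_column]  # expected: {"array": ..., "sampling_rate": 16000}
    # Log-Mel features computed by the processor's feature extractor.
    features = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    )
    # Token ids for the Malayalam transcript.
    labels = processor.tokenizer(example[text_column]).input_ids
    return {"input_features": features.input_features[0], "labels": labels}


def within_label_limit(example, max_length=MAX_LABEL_LENGTH):
    """Keep only examples whose tokenized transcript fits the decoder."""
    return len(example["labels"]) <= max_length
```

With the `datasets` library, these helpers would typically be applied via `dataset.cast_column("audio", Audio(sampling_rate=16000))`, then `dataset.map(prepare_example, fn_kwargs={"processor": processor})`, then `dataset.filter(within_label_limit)`.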
Training Configuration

    model_id = "openai/whisper-large-v3-turbo"
    epochs = 10
    batch_size = 32
    learning_rate = 1e-5
    warmup_steps = 1000
    precision = "bf16"
    eval_strategy = "epoch"
    save_strategy = "epoch"
    metric_for_best_model = "wer"
    greater_is_better = False
    generation_max_length = 448
    lr_scheduler_type = "constant"
    seed = 42
    data_seed = 42

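As a rough guide, the settings above could map onto keyword arguments for `Seq2SeqTrainingArguments` in Hugging Face Transformers as sketched below. The mapping is an assumption rather than the original training script: `output_dir` is a hypothetical path, `batch_size` is assumed to be per device, and `predict_with_generate` is added on the assumption that WER was computed from generated text. Older Transformers releases name `eval_strategy` as `evaluation_strategy`.

```python
# Hypothetical mapping of the listed settings to Seq2SeqTrainingArguments kwargs.
training_kwargs = {
    "output_dir": "./whisper-malayalam-ft",  # hypothetical output path
    "num_train_epochs": 10,
    "per_device_train_batch_size": 32,  # assumption: batch_size is per device
    "learning_rate": 1e-5,
    "warmup_steps": 1000,
    "bf16": True,
    "eval_strategy": "epoch",
    "save_strategy": "epoch",
    "metric_for_best_model": "wer",
    "greater_is_better": False,  # lower WER is better
    "predict_with_generate": True,  # assumption: needed to score generated text
    "generation_max_length": 448,
    "lr_scheduler_type": "constant",
    "seed": 42,
    "data_seed": 42,
}

# Usage (requires transformers):
# from transformers import Seq2SeqTrainingArguments
# args = Seq2SeqTrainingArguments(**training_kwargs)
```

One detail worth checking when reproducing the run: in Transformers, the `"constant"` schedule does not apply warmup, so `warmup_steps` would only take effect with `"constant_with_warmup"`.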
Evaluation
The model was evaluated using Word Error Rate (WER), computed with the evaluate and jiwer libraries.
| Model | Fine-tuning strategy | Epochs | WER (%) | Notes |
|---|---|---|---|---|
| Whisper Large V3 Turbo | Zero-shot baseline | N/A | ~102 | Higher baseline WER |
| Whisper Large V3 Turbo Malayalam | Full fine-tuning | 10 | ~56 | Improved Malayalam transcription accuracy |
Note: Replace the WER value above with the exact final `eval_wer` from the completed training run before final publication, if needed.
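For readers unfamiliar with the metric, WER is the number of word-level substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The sketch below is a plain-Python illustration of that definition, not the `evaluate`/`jiwer` implementation used for the reported numbers; the function names are hypothetical and no text normalization is applied.

```python
def _edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance using a single rolling row."""
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def word_error_rate(references, hypotheses):
    """Corpus-level WER: total word edits divided by total reference words."""
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words, hyp_words = ref.split(), hyp.split()
        total_edits += _edit_distance(ref_words, hyp_words)
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references contain no words")
    return total_edits / total_words


print(word_error_rate(["a b c"], ["a x c"]))  # one substitution over three words
```

Because insertions count as errors, WER can exceed 100% when the hypothesis contains many more words than the reference, which is why a zero-shot baseline can score above 100.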
Inference

    import torch
    import librosa
    from transformers import WhisperProcessor, WhisperForConditionalGeneration

    MODEL_ID = "BettySara/whisper-large-v3-malayalam-FT"

    processor = WhisperProcessor.from_pretrained(
        MODEL_ID,
        language="Malayalam",
        task="transcribe"
    )
    model = WhisperForConditionalGeneration.from_pretrained(MODEL_ID)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)

    forced_decoder_ids = processor.get_decoder_prompt_ids(
        language="ml",
        task="transcribe"
    )

    def transcribe(audio_path):
        speech, sr = librosa.load(audio_path, sr=16000)
        inputs = processor(
            speech,
            sampling_rate=16000,
            return_tensors="pt"
        )
        input_features = inputs.input_features.to(device)
        predicted_ids = model.generate(
            input_features,
            forced_decoder_ids=forced_decoder_ids,
            max_length=448
        )
        transcription = processor.batch_decode(
            predicted_ids,
            skip_special_tokens=True
        )[0]
        return transcription

    print(transcribe("sample_malayalam_audio.wav"))

Limitations
- The model is specialized for Malayalam and may not perform well on other languages.
- Performance may vary across dialects, noisy speech, overlapping speakers, and long-form audio.
- Very long audio should be chunked before inference.
- The model may still produce spelling or word-boundary errors in conversational Malayalam.
- Evaluation should be repeated on a larger held-out test set before production use.
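Regarding chunking of long audio: Whisper processes audio in 30-second windows, so one simple approach is to split the waveform into overlapping 30-second segments and transcribe each one. The sketch below only computes the segment boundaries in samples; the function name and the 5-second overlap are illustrative assumptions, and merging the overlapping transcripts is left out.

```python
def chunk_bounds(num_samples, sample_rate=16000, chunk_seconds=30.0, overlap_seconds=5.0):
    """Return (start, end) sample indices of overlapping chunks covering the audio."""
    chunk = int(chunk_seconds * sample_rate)
    step = chunk - int(overlap_seconds * sample_rate)
    if chunk <= 0 or step <= 0:
        raise ValueError("chunk length must be positive and larger than the overlap")
    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # the last chunk reached the end of the audio
        start += step
    return bounds


# A 70-second clip at 16 kHz is covered by three overlapping chunks.
print(chunk_bounds(70 * 16000))
```

Each `(start, end)` pair can be used to slice the 16 kHz waveform before feature extraction. Alternatively, the Transformers `automatic-speech-recognition` pipeline accepts a `chunk_length_s` argument that handles both chunking and stitching.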
Ethical Considerations
This model should be used responsibly. Users should obtain consent before transcribing private speech. The model may produce incorrect transcriptions, so outputs should be reviewed before use in sensitive domains such as healthcare, legal, or official documentation.
Citation
If you use this model, please cite the base Whisper model and the IndicVoices dataset.

    @article{radford2022whisper,
      title={Robust Speech Recognition via Large-Scale Weak Supervision},
      author={Radford, Alec and others},
      journal={arXiv preprint arXiv:2212.04356},
      year={2022}
    }

Author
Developed by Bettilda Sara Santhosh and Gourinath HS (RSET & IHUB School of Learning) as part of research on Malayalam ASR.