Whisper Large V3 Turbo Malayalam ASR – Full Fine-Tuned Model

Model Description

This model is a fully fine-tuned version of openai/whisper-large-v3-turbo for Malayalam Automatic Speech Recognition (ASR). The model was adapted using the Malayalam subset of the AI4Bharat IndicVoices dataset to improve transcription accuracy for Malayalam speech, especially in low-resource and regional-language ASR settings.

The model is intended for Malayalam speech-to-text transcription and was developed as part of an academic research project on fine-tuning Whisper ASR models for Malayalam.

Model Details

| Field | Description |
| --- | --- |
| Base model | openai/whisper-large-v3-turbo |
| Fine-tuning method | Full model fine-tuning |
| Language | Malayalam (ml) |
| Task | Automatic Speech Recognition / Transcription |
| Dataset | ai4bharat/IndicVoices, Malayalam subset |
| Sampling rate | 16 kHz |
| Evaluation metric | Word Error Rate (WER) |
| Framework | Hugging Face Transformers, PyTorch |
| Training epochs | 10 |
| Precision | BF16 |

Intended Use

This model can be used for:

  • Malayalam speech transcription
  • ASR research for low-resource Indic languages
  • Academic experiments comparing full fine-tuning and PEFT methods
  • Speech-based applications in Malayalam such as accessibility tools, transcription systems, and voice-enabled interfaces

Dataset

The model was trained and evaluated using the Malayalam subset of the AI4Bharat IndicVoices dataset. Audio files were resampled to 16 kHz before feature extraction. The text transcriptions were tokenized using the Whisper tokenizer configured for Malayalam transcription.

The preprocessing pipeline included:

  1. Loading Malayalam train and validation splits from ai4bharat/IndicVoices
  2. Removing unused metadata columns
  3. Casting audio to 16 kHz
  4. Extracting Whisper log-Mel input features
  5. Tokenizing Malayalam text labels
  6. Filtering examples exceeding the maximum decoder target length
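
The six steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' actual script: the dataset config name "malayalam", the column names "audio_filepath" and "text", and the 448-token label limit are all assumed here and should be checked against the dataset card and the training code.

```python
MODEL_ID = "openai/whisper-large-v3-turbo"
MAX_LABEL_LENGTH = 448          # assumed limit: Whisper's decoder context size
AUDIO_COL = "audio_filepath"    # assumed audio column name
TEXT_COL = "text"               # assumed transcription column name


def prepare_example(example, feature_extractor, tokenizer):
    """Steps 4-5: compute log-Mel features and token labels for one example."""
    audio = example[AUDIO_COL]
    example["input_features"] = feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    example["labels"] = tokenizer(example[TEXT_COL]).input_ids
    return example


def is_within_label_limit(labels, max_length=MAX_LABEL_LENGTH):
    """Step 6: keep only examples whose label sequence fits the decoder."""
    return len(labels) <= max_length


def build_dataset(split="train"):
    """Run steps 1-6 for one split (downloads the dataset and processor)."""
    # Heavy imports stay local so the helpers above can be used without them.
    from datasets import Audio, load_dataset
    from transformers import WhisperProcessor

    processor = WhisperProcessor.from_pretrained(
        MODEL_ID, language="Malayalam", task="transcribe"
    )
    # Step 1: load the Malayalam subset (config name is an assumption).
    ds = load_dataset("ai4bharat/IndicVoices", "malayalam", split=split)
    # Step 2: drop metadata columns that training does not need.
    keep = {AUDIO_COL, TEXT_COL}
    ds = ds.remove_columns([c for c in ds.column_names if c not in keep])
    # Step 3: decode audio at 16 kHz.
    ds = ds.cast_column(AUDIO_COL, Audio(sampling_rate=16000))
    # Steps 4-5: features and labels.
    ds = ds.map(
        prepare_example,
        fn_kwargs={
            "feature_extractor": processor.feature_extractor,
            "tokenizer": processor.tokenizer,
        },
        remove_columns=[AUDIO_COL, TEXT_COL],
    )
    # Step 6: filter out over-long targets.
    return ds.filter(is_within_label_limit, input_columns=["labels"])
```

Calling build_dataset("train") would return a dataset with input_features and labels columns, ready for a padding data collator.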

Training Configuration

model_id = "openai/whisper-large-v3-turbo"
epochs = 10
batch_size = 32
learning_rate = 1e-5
warmup_steps = 1000
precision = "bf16"
eval_strategy = "epoch"
save_strategy = "epoch"
metric_for_best_model = "wer"
greater_is_better = False
generation_max_length = 448
lr_scheduler_type = "constant"
seed = 42
data_seed = 42
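
For readers reproducing the run with Hugging Face Transformers, the settings above map roughly onto Seq2SeqTrainingArguments as sketched below. The output_dir path, predict_with_generate, load_best_model_at_end, and the reading of batch_size as a per-device value are assumptions, not taken from the authors' script. One detail worth checking: in Transformers the "constant" scheduler does not apply warmup, so warmup_steps = 1000 would only take effect with "constant_with_warmup".

```python
# Keyword arguments mirroring the configuration listed above.
TRAINING_KWARGS = {
    "output_dir": "./whisper-large-v3-turbo-ml",  # hypothetical path
    "num_train_epochs": 10,
    "per_device_train_batch_size": 32,  # assumes batch_size is per device
    "learning_rate": 1e-5,
    "warmup_steps": 1000,               # ignored by the plain "constant" scheduler
    "lr_scheduler_type": "constant",
    "bf16": True,                       # requires BF16-capable hardware
    "eval_strategy": "epoch",           # "evaluation_strategy" in older releases
    "save_strategy": "epoch",
    "metric_for_best_model": "wer",
    "greater_is_better": False,         # lower WER is better
    "load_best_model_at_end": True,     # assumption
    "predict_with_generate": True,      # assumption: decode text during eval for WER
    "generation_max_length": 448,
    "seed": 42,
    "data_seed": 42,
}


def make_training_args(**overrides):
    """Build Seq2SeqTrainingArguments, optionally overriding individual values."""
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(**{**TRAINING_KWARGS, **overrides})
```

For example, make_training_args(lr_scheduler_type="constant_with_warmup") would keep everything else fixed while enabling the warmup phase.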

Evaluation

The model was evaluated using Word Error Rate (WER), computed with the evaluate and jiwer libraries.

| Model | Fine-tuning strategy | Epochs | WER (approx.) | Result |
| --- | --- | --- | --- | --- |
| Whisper Large V3 Turbo | Zero-shot baseline | n/a | ~102 | Higher baseline WER |
| Whisper Large V3 Turbo Malayalam | Full fine-tuning | 10 | ~56 | Improved Malayalam transcription accuracy |

Note: The WER values above are approximate and should be replaced with the exact final eval_wer from the completed training run before final publication.
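
WER is the number of word substitutions, deletions, and insertions divided by the number of reference words: WER = (S + D + I) / N. Because insertions count as errors, WER can exceed 100%, which is how a zero-shot figure of roughly 102 can arise (assuming the reported values are percentages). The sketch below is a hand-rolled illustration of that arithmetic, not the implementation used by evaluate or jiwer, and it applies no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution in a three-word reference.
one_sub = word_error_rate("ഞാൻ വീട്ടിൽ പോകുന്നു", "ഞാൻ വീട്ടിൽ പോയി")

# Three inserted words against a two-word reference: WER above 1.0.
over_one = word_error_rate("a b", "a b c d e")
```

Here one_sub is 1/3 and over_one is 1.5, i.e. 150% when expressed as a percentage.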

Inference

import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

MODEL_ID = "BettySara/whisper-large-v3-malayalam-FT"

# Load the processor (feature extractor + tokenizer) and the fine-tuned model.
processor = WhisperProcessor.from_pretrained(
    MODEL_ID,
    language="Malayalam",
    task="transcribe"
)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_ID)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()

def transcribe(audio_path):
    # Load the audio and resample it to the 16 kHz rate Whisper expects.
    speech, _ = librosa.load(audio_path, sr=16000)

    # Convert the waveform to log-Mel input features.
    inputs = processor(
        speech,
        sampling_rate=16000,
        return_tensors="pt"
    )
    input_features = inputs.input_features.to(device)

    # Pass language and task directly to generate(); recent Transformers
    # releases deprecate forced_decoder_ids for Whisper.
    with torch.no_grad():
        predicted_ids = model.generate(
            input_features,
            language="ml",
            task="transcribe",
            max_length=448
        )

    # Decode token IDs to text, dropping special tokens.
    transcription = processor.batch_decode(
        predicted_ids,
        skip_special_tokens=True
    )[0]

    return transcription

print(transcribe("sample_malayalam_audio.wav"))

Limitations

  • The model is specialized for Malayalam and may not perform well on other languages.
  • Performance may vary across dialects, noisy speech, overlapping speakers, and long-form audio.
  • Very long audio should be chunked before inference.
  • The model may still produce spelling or word-boundary errors in conversational Malayalam.
  • Evaluation should be repeated on a larger held-out test set before production use.
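
As a rough sketch of the chunking mentioned above: Whisper processes audio in 30-second windows, so a long recording can be split into windows of that size and transcribed piece by piece. The helper names below are hypothetical, and transcribe_chunk stands for any callable that takes a 16 kHz waveform and returns text (it is not defined in this card). Naive joining can duplicate or split words at chunk boundaries, especially when an overlap is used.

```python
def chunk_bounds(num_samples, sampling_rate=16000,
                 chunk_seconds=30.0, overlap_seconds=0.0):
    """Return (start, end) sample indices covering the whole recording."""
    chunk = int(chunk_seconds * sampling_rate)
    step = chunk - int(overlap_seconds * sampling_rate)
    if chunk <= 0 or step <= 0:
        raise ValueError("chunk length must be positive and exceed the overlap")

    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # last chunk reached the end of the audio
        start += step
    return bounds


def transcribe_long(speech, transcribe_chunk, **chunk_kwargs):
    """Transcribe each chunk with the given callable and join the texts."""
    pieces = [
        transcribe_chunk(speech[start:end])
        for start, end in chunk_bounds(len(speech), **chunk_kwargs)
    ]
    return " ".join(pieces).strip()


# A 70-second recording at 16 kHz yields two full chunks and one 10-second tail.
bounds_70s = chunk_bounds(70 * 16000)
```

The Transformers automatic-speech-recognition pipeline offers a chunk_length_s argument that handles chunking and boundary merging internally, which may be preferable to manual splitting.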

Ethical Considerations

This model should be used responsibly. Users should obtain consent before transcribing private speech. The model may produce incorrect transcriptions, so outputs should be reviewed before use in sensitive domains such as healthcare, legal, or official documentation.

Citation

If you use this model, please cite the base Whisper model and the IndicVoices dataset.

@article{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and others},
  journal={arXiv preprint arXiv:2212.04356},
  year={2022}
}

Author

Developed by Bettilda Sara Santhosh and Gourinath HS as part of research on Malayalam ASR (RSET & IHUB School of Learning).
