Whisper Large V3 Turbo Malayalam ASR – Full Fine-Tuned Model

Model Description

This model is a fully fine-tuned version of openai/whisper-large-v3-turbo for Malayalam Automatic Speech Recognition (ASR). The model was adapted using the Malayalam subset of the AI4Bharat IndicVoices dataset to improve transcription accuracy for Malayalam speech, especially in low-resource and regional-language ASR settings.

The model is intended for Malayalam speech-to-text transcription and was developed as part of an academic research project on fine-tuning Whisper ASR models for Malayalam.

Model Details

Field	Description
Base model	`openai/whisper-large-v3-turbo`
Fine-tuning method	Full model fine-tuning
Language	Malayalam (`ml`)
Task	Automatic Speech Recognition / Transcription
Dataset	`ai4bharat/IndicVoices`, Malayalam subset
Sampling rate	16 kHz
Evaluation metric	Word Error Rate (WER)
Framework	Hugging Face Transformers, PyTorch
Training epochs	10
Precision	BF16

Intended Use

This model can be used for:

Malayalam speech transcription
ASR research for low-resource Indic languages
Academic experiments comparing full fine-tuning and PEFT methods
Speech-based applications in Malayalam such as accessibility tools, transcription systems, and voice-enabled interfaces

Dataset

The model was trained and evaluated using the Malayalam subset of the AI4Bharat IndicVoices dataset. Audio files were resampled to 16 kHz before feature extraction. The text transcriptions were tokenized using the Whisper tokenizer configured for Malayalam transcription.

The preprocessing pipeline included:

Loading Malayalam train and validation splits from ai4bharat/IndicVoices
Removing unused metadata columns
Casting audio to 16 kHz
Extracting Whisper log-Mel input features
Tokenizing Malayalam text labels
Filtering examples exceeding the maximum decoder target length
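
The steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' actual script: the column names (`audio`, `text`), the dataset config name (`malayalam`), and the 448-token label limit are assumptions made for the example and should be checked against the dataset card. Column removal is folded into the `map` call here for brevity.

```python
MAX_LABEL_LENGTH = 448          # assumption: mirrors generation_max_length in the training config
TARGET_SAMPLING_RATE = 16_000   # Whisper feature extraction expects 16 kHz audio


def prepare_example(example, feature_extractor, tokenizer,
                    audio_column="audio", text_column="text"):
    """Convert one dataset row into log-Mel input features and label token ids."""
    audio = example[audio_column]
    example["input_features"] = feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    example["labels"] = tokenizer(example[text_column]).input_ids
    return example


def within_label_limit(labels, max_length=MAX_LABEL_LENGTH):
    """Keep only examples whose tokenized transcript fits the decoder."""
    return len(labels) <= max_length


def build_split(split, config_name="malayalam", audio_column="audio"):
    """Load one split and run the full pipeline (hypothetical names, see lead-in)."""
    # Local imports so the helpers above stay usable without these packages.
    from datasets import Audio, load_dataset
    from transformers import WhisperProcessor

    processor = WhisperProcessor.from_pretrained(
        "openai/whisper-large-v3-turbo", language="Malayalam", task="transcribe"
    )
    ds = load_dataset("ai4bharat/IndicVoices", config_name, split=split)
    # Resample lazily to 16 kHz when audio is decoded.
    ds = ds.cast_column(audio_column, Audio(sampling_rate=TARGET_SAMPLING_RATE))
    ds = ds.map(
        prepare_example,
        fn_kwargs={
            "feature_extractor": processor.feature_extractor,
            "tokenizer": processor.tokenizer,
            "audio_column": audio_column,
        },
        remove_columns=ds.column_names,  # drops metadata and raw columns
    )
    return ds.filter(within_label_limit, input_columns=["labels"])
```

Calling `build_split("train")` would download the data and the processor, so it needs network access and acceptance of the dataset's terms on the Hub.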

Training Configuration

model_id = "openai/whisper-large-v3-turbo"
epochs = 10
batch_size = 32
learning_rate = 1e-5
warmup_steps = 1000
precision = "bf16"
eval_strategy = "epoch"
save_strategy = "epoch"
metric_for_best_model = "wer"
greater_is_better = False
generation_max_length = 448
lr_scheduler_type = "constant"
seed = 42
data_seed = 42
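
One possible way to express these settings with the Hugging Face `Seq2SeqTrainingArguments` class is sketched below. It is an illustration under assumptions rather than the authors' training script: treating `batch_size` as a per-device value and enabling `predict_with_generate` are guesses made for the example. Note also that in Transformers the plain `"constant"` schedule does not apply warmup (that behaviour belongs to `"constant_with_warmup"`), so `warmup_steps` may have no effect with this combination.

```python
TRAINING_KWARGS = {
    "num_train_epochs": 10,
    "per_device_train_batch_size": 32,  # assumption: the listed batch_size is per device
    "learning_rate": 1e-5,
    "warmup_steps": 1000,
    "lr_scheduler_type": "constant",
    "bf16": True,
    "eval_strategy": "epoch",           # called evaluation_strategy in older releases
    "save_strategy": "epoch",
    "metric_for_best_model": "wer",
    "greater_is_better": False,         # lower WER is better
    "predict_with_generate": True,      # assumption: WER needs generated text at eval time
    "generation_max_length": 448,
    "seed": 42,
    "data_seed": 42,
}


def build_training_args(output_dir):
    """Create the arguments object; output_dir is a hypothetical path chosen by the caller."""
    # Local import so the settings dict can be inspected without transformers installed.
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)
```

The resulting object would typically be passed to a `Seq2SeqTrainer` together with the model, the processed splits, a data collator, and a WER `compute_metrics` function.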

Evaluation

The model was evaluated using Word Error Rate (WER), computed with the evaluate and jiwer libraries.

Model	Fine-tuning Strategy	Epochs	Metric	Result
Whisper Large V3 Turbo	Zero-shot baseline	-	WER	~102 (higher baseline WER)
Whisper Large V3 Turbo Malayalam	Full fine-tuning	10	WER	~56 (improved Malayalam transcription accuracy)

Note: The WER values above are approximate. Replace them with the exact final eval_wer from the completed training run before final publication.
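
For readers unfamiliar with the metric, the sketch below shows how WER is defined: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. It is a plain-Python illustration of the formula, not the `evaluate`/`jiwer` code path used for the reported numbers, and it applies no text normalization. Because insertions count as errors, WER can exceed 1.0 (100%), which is how a zero-shot baseline can score above 100.

```python
def wer(reference, hypothesis):
    """Word Error Rate: word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution over three reference words.
print(wer("ഞാൻ വീട്ടിൽ പോകുന്നു", "ഞാൻ വീട്ടിൽ പോയി"))
# Two substitutions plus one insertion over two reference words: above 1.0.
print(wer("a b", "x y z"))
```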

Inference

import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

MODEL_ID = "BettySara/whisper-large-v3-malayalam-FT"

processor = WhisperProcessor.from_pretrained(
    MODEL_ID,
    language="Malayalam",
    task="transcribe"
)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_ID)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

def transcribe(audio_path):
    # Load the audio and resample it to the 16 kHz rate Whisper expects.
    speech, sr = librosa.load(audio_path, sr=16000)

    inputs = processor(
        speech,
        sampling_rate=16000,
        return_tensors="pt"
    )

    input_features = inputs.input_features.to(device)

    # Pass language and task directly to generate(); recent Transformers
    # releases deprecate forced_decoder_ids for Whisper.
    with torch.no_grad():
        predicted_ids = model.generate(
            input_features,
            language="malayalam",
            task="transcribe",
            max_length=448
        )

    transcription = processor.batch_decode(
        predicted_ids,
        skip_special_tokens=True
    )[0]

    return transcription

print(transcribe("sample_malayalam_audio.wav"))

Limitations

The model is specialized for Malayalam and may not perform well on other languages.
Performance may vary across dialects, noisy speech, overlapping speakers, and long-form audio.
Very long audio should be chunked before inference.
The model may still produce spelling or word-boundary errors in conversational Malayalam.
Evaluation should be repeated on a larger held-out test set before production use.
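
As a rough illustration of the chunking advice above, the sketch below splits a waveform into fixed-length windows matching Whisper's 30-second input, with an optional overlap. The function name and defaults are hypothetical, and cutting at fixed offsets can split words at chunk boundaries, so this is a starting point rather than a recommended long-form pipeline.

```python
def chunk_audio(samples, sampling_rate=16000, chunk_seconds=30.0, overlap_seconds=0.0):
    """Split a 1-D waveform (list or array) into windows of at most chunk_seconds."""
    chunk_len = int(chunk_seconds * sampling_rate)
    step = chunk_len - int(overlap_seconds * sampling_rate)
    if chunk_len <= 0 or step <= 0:
        raise ValueError("chunk_seconds must be positive and larger than overlap_seconds")

    chunks = []
    for start in range(0, len(samples), step):
        chunks.append(samples[start:start + chunk_len])
        if start + chunk_len >= len(samples):
            break  # the last window already reaches the end of the audio
    return chunks


# 70 seconds of silence at 16 kHz gives windows of 30 s, 30 s, and 10 s.
waveform = [0.0] * (16000 * 70)
print([len(chunk) / 16000 for chunk in chunk_audio(waveform)])
```

Each chunk can then be passed through the processor and `model.generate` as in the Inference section, and the partial transcripts joined. The Transformers automatic-speech-recognition pipeline also offers a `chunk_length_s` argument that handles chunking and stitching internally, which may be a more robust option.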

Ethical Considerations

This model should be used responsibly. Users should obtain consent before transcribing private speech. The model may produce incorrect transcriptions, so outputs should be reviewed before use in sensitive domains such as healthcare, legal, or official documentation.

Citation

If you use this model, please cite the base Whisper model and the IndicVoices dataset.

@article{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and others},
  journal={arXiv preprint arXiv:2212.04356},
  year={2022}
}

Author

Developed by Bettilda Sara Santhosh and Gourinath HS (RSET & IHUB School of Learning) as part of research on Malayalam ASR.
