multilingual-whisper-v3-turbo

This model is a fine-tuned version of openai/whisper-large-v3-turbo on the dsfsi-anv dataset. It achieves the following results on the evaluation set:

  • Loss: 0.2427
  • Wer: 0.1501
  • Cer: 0.0510
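Both WER and CER are edit-distance metrics: the Levenshtein distance between the hypothesis and the reference, normalized by reference length, at word and character level respectively. A minimal sketch of how they are computed (plain Python for illustration; evaluation scripts typically use a library such as `jiwer` instead):

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    return levenshtein(list(reference), list(hypothesis)) / len(reference)

print(wer("the cat sat", "the cat sad"))  # one substituted word out of three
```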

Model description

This model is a fine-tuned version of the Whisper Large V3 Turbo model, optimized for multilingual Automatic Speech Recognition (ASR). It has been trained on the ANV (Swivuriso) dataset to improve performance on specific target languages and domains represented in that corpus.

Whisper is a Transformer-based encoder-decoder (sequence-to-sequence) model. It was pre-trained with large-scale weak supervision on noisy audio data, and this fine-tuning step adapts it specifically to the languages and accents found in the dsfsi-anv dataset.

Intended uses & limitations

Intended Uses

  • Automatic Speech Recognition (ASR): The model is primarily intended to transcribe audio in the languages present in the training data.
  • Research: Suitable for researchers studying low-resource language adaptation and fine-tuning efficiency.

Limitations

  • Hallucinations: Like the base Whisper model, this model may generate repetitive text or hallucinations, particularly during silent segments or in audio with background noise.

  • Domain Specificity: Performance may degrade on audio that differs significantly (in terms of accent, noise, or recording quality) from the ANV dataset.

Training and evaluation data

The model was trained on the dsfsi-anv dataset.

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 16
  • optimizer: AdamW (betas=(0.9,0.98), epsilon=1e-08)
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 500
  • training_steps: 10,000
  • framework: PyTorch 2.9.1+cu128 / Transformers 4.57.3
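As a sanity check on these numbers: the effective (total) train batch size is train_batch_size × gradient_accumulation_steps, and the linear schedule ramps the learning rate up over the first 500 steps, then decays it to zero at step 10,000. A minimal sketch in plain Python, mirroring how a linear-warmup linear-decay schedule computes the rate (an illustration, not the Trainer's actual code):

```python
train_batch_size = 8
gradient_accumulation_steps = 2
effective_batch_size = train_batch_size * gradient_accumulation_steps  # 16

base_lr = 1e-5
warmup_steps = 500
total_steps = 10_000

def linear_lr(step: int) -> float:
    """Learning rate at a given optimizer step under linear warmup + linear decay."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps  # ramp up from 0 to base_lr
    return base_lr * (total_steps - step) / (total_steps - warmup_steps)  # decay to 0

print(effective_batch_size)  # 16, matching total_train_batch_size above
print(linear_lr(500))        # peak learning rate: 1e-05
print(linear_lr(10_000))     # 0.0 at the final step
```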

Training results

Epoch   Step    Training Loss   Validation Loss   WER      CER
0.1     1000    0.4108          0.5753            0.3702   0.1237
0.2     2000    0.2326          0.4653            0.2888   0.0881
0.3     3000    0.4429          0.3750            0.2354   0.0782
0.4     4000    0.3309          0.3388            0.2075   0.0674
0.5     5000    0.3298          0.3135            0.1952   0.0635
0.6     6000    0.3238          0.2929            0.1782   0.0592
0.7     7000    0.3926          0.2766            0.1688   0.0545
0.8     8000    0.2261          0.2627            0.1593   0.0519
0.9     9000    0.2197          0.2514            0.1573   0.0506
1.0     10000   0.2276          0.2427            0.1501   0.0510

Usage

This model can be used with the Hugging Face transformers library via the pipeline class.

pip install --upgrade pip
pip install --upgrade transformers datasets[audio] accelerate


import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load your fine-tuned model
model_id = "dsfsi-anv/multilingual-whisper-v3-turbo"
processor_id = "openai/whisper-large-v3-turbo"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(processor_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# Example: Transcribe a sample file
# result = pipe("path/to/audio.wav")
# print(result["text"])
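For long recordings, the pipeline can also split audio into chunks, and generation options can pin the task or output language. The snippet below is a hedged extension of the example above; the commented language hint is illustrative, not a confirmed list of languages for this checkpoint:

```python
# Generation options for decoding; adjust to your data.
generate_kwargs = {
    "task": "transcribe",
    # "language": "english",  # optionally force a language the model supports
}

# result = pipe(
#     "path/to/audio.wav",
#     chunk_length_s=30,            # split long audio into 30-second chunks
#     generate_kwargs=generate_kwargs,
# )
# print(result["text"])
```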

Framework versions

  • Transformers 4.57.3
  • PyTorch 2.9.1+cu128
  • Datasets 4.4.1
  • Tokenizers 0.22.1

BibTeX entry and citation info

@misc{radford2022whisper,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}