# multilingual-whisper-v3-turbo
This model is a fine-tuned version of openai/whisper-large-v3-turbo on the dsfsi-anv dataset. It achieves the following results on the evaluation set:
- Loss: 0.2427
- Wer: 0.1501
- Cer: 0.0510
## Model description
This model is a fine-tuned version of the Whisper Large V3 Turbo model, optimized for multilingual Automatic Speech Recognition (ASR). It has been trained on the ANV (Swivuriso) dataset to improve performance on specific target languages and domains represented in that corpus.
Whisper is a Transformer-based encoder-decoder model, also referred to as a sequence-to-sequence model. The base model was trained with weak supervision on large-scale noisy audio data, and this fine-tuning step adapts it to the languages and accents found in the dsfsi-anv dataset.
## Intended uses & limitations

### Intended Uses
- Automatic Speech Recognition (ASR): The model is primarily intended to transcribe audio in the languages present in the training data.
- Research: Suitable for researchers studying low-resource language adaptation and fine-tuning efficiency.
### Limitations
- Hallucinations: Like the base Whisper model, this model may generate repetitive text or hallucinated content, particularly on silent segments or audio with heavy background noise.
- Domain Specificity: Performance may degrade on audio that differs significantly (in terms of accent, noise, or recording quality) from the ANV dataset.
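A lightweight post-processing check can help surface the repetitive-hallucination failure mode mentioned above. The heuristic below — flagging transcripts dominated by a single repeated word n-gram — is purely illustrative and not part of the model or its training pipeline:

```python
from collections import Counter

def looks_repetitive(text, n=3, threshold=0.5):
    """Flag a transcript where one word n-gram accounts for at least
    `threshold` of all n-grams -- a common signature of hallucination loops."""
    words = text.split()
    if len(words) < n + 1:
        return False
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    _, top_count = Counter(ngrams).most_common(1)[0]
    return top_count / len(ngrams) >= threshold

print(looks_repetitive("thank you thank you thank you thank you thank you"))
```

Flagged transcripts can then be re-run with different decoding settings or discarded, depending on the application.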
## Training and evaluation data
The model was trained on the dsfsi-anv dataset.
- Dataset Name: ANV (Swivuriso)
- Source: https://huggingface.co/dsfsi-anv
## Training procedure

### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 16
- optimizer: AdamW (betas=(0.9,0.98), epsilon=1e-08)
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps: 10,000
- framework: PyTorch 2.9.1+cu128 / Transformers 4.57.3
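For reference, the linear schedule with warmup listed above can be sketched in pure Python (assuming the standard `transformers` behavior of `get_linear_schedule_with_warmup`: the learning rate ramps linearly from 0 to the base rate over the first 500 steps, then decays linearly to 0 at step 10,000):

```python
def linear_schedule_lr(step, base_lr=1e-5, warmup_steps=500, total_steps=10_000):
    """Learning rate at a given optimizer step under linear warmup + linear decay."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the warmup phase
        return base_lr * step / warmup_steps
    # Linear decay from base_lr down to 0 at total_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

print(linear_schedule_lr(500))     # peak rate: 1e-05
print(linear_schedule_lr(10_000))  # end of training: 0.0
```

Note also that the total train batch size of 16 follows from `train_batch_size (8) × gradient_accumulation_steps (2)`.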
### Training results
| Epoch | Step | Training Loss | Validation Loss | WER | CER |
|---|---|---|---|---|---|
| 0.1 | 1000 | 0.4108 | 0.5753 | 0.3702 | 0.1237 |
| 0.2 | 2000 | 0.2326 | 0.4653 | 0.2888 | 0.0881 |
| 0.3 | 3000 | 0.4429 | 0.3750 | 0.2354 | 0.0782 |
| 0.4 | 4000 | 0.3309 | 0.3388 | 0.2075 | 0.0674 |
| 0.5 | 5000 | 0.3298 | 0.3135 | 0.1952 | 0.0635 |
| 0.6 | 6000 | 0.3238 | 0.2929 | 0.1782 | 0.0592 |
| 0.7 | 7000 | 0.3926 | 0.2766 | 0.1688 | 0.0545 |
| 0.8 | 8000 | 0.2261 | 0.2627 | 0.1593 | 0.0519 |
| 0.9 | 9000 | 0.2197 | 0.2514 | 0.1573 | 0.0506 |
| 1.0 | 10000 | 0.2276 | 0.2427 | 0.1501 | 0.0510 |
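The WER and CER columns above follow the standard edit-distance definitions: substitutions, insertions, and deletions divided by the length of the reference (words for WER, characters for CER). A minimal pure-Python sketch of those metrics (the actual evaluation likely used a library such as `jiwer` or `evaluate`; the example strings are illustrative):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (single-row DP)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, substitution/match
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: char-level edit distance / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

print(wer("the cat sat", "the cat sat"))  # 0.0
```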
## Usage

This model can be used with the Hugging Face `transformers` library via the `pipeline` class.

```bash
pip install --upgrade pip
pip install --upgrade transformers datasets[audio] accelerate
```
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the fine-tuned model weights and the base model's processor
model_id = "dsfsi-anv/multilingual-whisper-v3-turbo"
processor_id = "openai/whisper-large-v3-turbo"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(processor_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# Example: transcribe a sample file
# result = pipe("path/to/audio.wav")
# print(result["text"])
```
## Framework versions
- Transformers 4.57.3
- Pytorch 2.9.1+cu128
- Datasets 4.4.1
- Tokenizers 0.22.1
## BibTeX entry and citation info

```bibtex
@misc{radford2022whisper,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```