# Whisper Large V3 Turbo - Fine-tuned for Canadian French (Québécois)
This is a fine-tuned version of openai/whisper-large-v3-turbo for Automatic Speech Recognition (ASR) on Canadian French (Québécois) speech.
It achieves a Word Error Rate (WER) of 6.65% on the validation set, offering significant improvements over the base model for the specific nuances, cadence, and vocabulary of Quebec French.
## ⚡ Optimized Inference (WhisperX / CTranslate2)
If you are using WhisperX, Faster-Whisper, or CTranslate2, please use the dedicated optimized repository which contains the converted Float16 weights:
👉 `ele-sage/whisper-large-v3-turbo-fr-quebecois-ct2`
## Model Details

### Model Description
This model is fine-tuned from OpenAI's Whisper Large V3 Turbo. Unlike general French models, this version was trained on a mix of formal political speech and crowdsourced data to ensure robustness across different registers of Quebec French.
The training data includes the Assemblée Nationale du Québec (formal/political) and Common Voice (informal/diverse), making the model capable of handling both formal addresses and everyday accents.
- Model type: Sequence-to-sequence audio-to-text model (Whisper)
- Language(s): French (fr), specifically Canadian French (fr-CA)
- License: MIT
- Finetuned from model: `openai/whisper-large-v3-turbo`
## How to Get Started with the Model

The model can be used with the `pipeline` API from the 🤗 Transformers library:
```python
from transformers import pipeline
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    "automatic-speech-recognition",
    model="ele-sage/whisper-large-v3-turbo-fr-quebecois",
    device=device,
)

# Transcribe an audio file
audio_path = "path/to/your/audio.wav"
result = pipe(audio_path)
print(result["text"])
```
## Uses

### Direct Use
This model is intended for transcribing audio files containing Canadian French speech. It is particularly useful for:
- Transcribing Quebec media, podcasts, and interviews.
- Transcribing formal settings (political/legal) due to the inclusion of Assemblée Nationale data.
- General Canadian French ASR tasks where standard models struggle with the accent.
## Training Details

### Training Data
The model was fine-tuned on a total of approximately 69 hours of audio, combining three distinct datasets to balance specific dialect accuracy with general French capability:
- Assemblée Nationale du Québec: ~13 hours (Formal, political context).
- Common Voice (Français du Canada): ~16.5 hours (Native Quebecois accents).
- Common Voice (Other French): ~39 hours (General French to prevent overfitting and maintain vocabulary coverage).
| Dataset Source | Duration |
|---|---|
| Assemblée Nationale | 13h 11m 20s |
| Common Voice (CA) | 16h 31m 23s |
| Common Voice (Other) | 39h 12m 17s |
| Total | 68h 55m |
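As a quick sanity check, the per-source durations above sum exactly to the stated total:

```python
# Per-source durations (h, m, s) from the table above
durations = {
    "Assemblée Nationale": (13, 11, 20),
    "Common Voice (CA)": (16, 31, 23),
    "Common Voice (Other)": (39, 12, 17),
}

total_seconds = sum(h * 3600 + m * 60 + s for h, m, s in durations.values())
hours, rem = divmod(total_seconds, 3600)
minutes, seconds = divmod(rem, 60)
print(f"{hours}h {minutes:02d}m {seconds:02d}s")  # → 68h 55m 00s
```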
### Training Procedure

#### Preprocessing
The raw audio was sampled at 16 kHz, and clips longer than 30 seconds were excluded or segmented. SpecAugment (random masking of time and frequency features) was applied during training to augment the data and combat overfitting.
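To illustrate the idea behind the time/frequency masking, here is a toy sketch on plain Python lists (not the actual implementation, which operates on log-mel spectrograms inside the model), using the 0.10 masking probabilities listed under Training Hyperparameters:

```python
import random

def spec_augment(spec, time_prob=0.10, feat_prob=0.10, rng=None):
    """Toy SpecAugment: zero out random time frames and frequency bins
    of a (time x features) spectrogram given as a list of lists."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    n_time, n_feat = len(spec), len(spec[0])
    out = [row[:] for row in spec]  # copy so the input is untouched
    # Time masking: drop whole frames with probability time_prob
    for t in range(n_time):
        if rng.random() < time_prob:
            out[t] = [0.0] * n_feat
    # Frequency masking: drop whole feature bins with probability feat_prob
    for f in range(n_feat):
        if rng.random() < feat_prob:
            for t in range(n_time):
                out[t][f] = 0.0
    return out

# 100 frames x 80 mel bins of dummy energy
augmented = spec_augment([[1.0] * 80 for _ in range(100)])
```

Masked positions carry no information, so the model must rely on surrounding context, which is what makes this an effective regularizer for ASR.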
#### Training Hyperparameters

- Framework: Transformers `Seq2SeqTrainer`
- Model: `openai/whisper-large-v3-turbo`
- Learning Rate: `5e-6`
- LR Scheduler: `cosine`
- Warmup Steps: 100
- Num Epochs: 1
- Batch Size: 4 (with gradient accumulation of 4, effective batch size = 16)
- Optimizer: AdamW
- Precision: `bf16` (BFloat16)
- SpecAugment Settings:
  - `dropout`: 0.2
  - `attention_dropout`: 0.15
  - `activation_dropout`: 0.15
  - `mask_time_prob`: 0.10
  - `mask_feature_prob`: 0.10
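The effective batch size of 16 comes from stepping the optimizer only once per 4 micro-batches of 4 samples. A toy sketch of that accumulation logic (plain Python on a single scalar weight, not the actual `Seq2SeqTrainer` internals):

```python
GRAD_ACCUM = 4  # accumulation steps, as in the config above

def train(micro_batches, lr=0.1):
    """Toy gradient-accumulation loop for a single scalar weight."""
    w, grad_buffer, updates = 0.0, 0.0, 0
    for step, batch in enumerate(micro_batches, start=1):
        # Gradient of a dummy loss (w - mean(batch))^2 / 2, scaled by
        # 1/GRAD_ACCUM so the accumulated gradient averages the micro-batches
        grad = (w - sum(batch) / len(batch)) / GRAD_ACCUM
        grad_buffer += grad
        if step % GRAD_ACCUM == 0:  # optimizer step every GRAD_ACCUM micro-batches
            w -= lr * grad_buffer
            grad_buffer = 0.0
            updates += 1
    return w, updates

# 8 micro-batches of size 4 -> 2 optimizer steps (effective batch size 16)
_, n_updates = train([[1.0] * 4 for _ in range(8)])
print(n_updates)  # → 2
```

Accumulation trades wall-clock time for memory: the gradients of 16 samples fit in the footprint of a batch of 4, which is what makes fine-tuning a large Whisper model feasible on a single GPU.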
### Results
The model demonstrated steady convergence, achieving a final Word Error Rate (WER) of 6.65%.
| Step | Training Loss | Validation Loss | WER |
|---|---|---|---|
| 400 | 0.3294 | 0.2459 | 8.10% |
| 800 | 0.3203 | 0.2375 | 7.98% |
| 1200 | 0.3118 | 0.2234 | 7.85% |
| 1600 | 0.2764 | 0.2094 | 7.12% |
| 2000 | 0.2715 | 0.2008 | 6.81% |
| 2400 | 0.2521 | 0.1967 | 6.65% |
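For reference, the WER reported above is the word-level edit distance between hypothesis and reference, divided by the number of reference words. A minimal sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution out of four reference words
print(wer("bonjour tout le monde", "bonjour tous le monde"))  # → 0.25
```

A 6.65% WER thus means roughly one word-level error (substitution, insertion, or deletion) per fifteen reference words.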
## Citation
If you use this model, please consider citing the original Whisper paper:
```bibtex
@misc{radford2022whisper,
  doi       = {10.48550/ARXIV.2212.04356},
  url       = {https://arxiv.org/abs/2212.04356},
  author    = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title     = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year      = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```