Whisper Large V3 Turbo - Fine-tuned for Canadian French (Québécois)

This is a fine-tuned version of openai/whisper-large-v3-turbo for Automatic Speech Recognition (ASR) on Canadian French (Québécois) speech.

It achieves a Word Error Rate (WER) of 6.65% on the validation set, a significant improvement over the base model on the specific nuances, cadence, and vocabulary of Quebec French.

⚡ Optimized Inference (WhisperX / CTranslate2)

If you are using WhisperX, Faster-Whisper, or CTranslate2, please use the dedicated optimized repository which contains the converted Float16 weights:

👉 ele-sage/whisper-large-v3-turbo-fr-quebecois-ct2

Model Details

Model Description

This model is fine-tuned from OpenAI's Whisper Large V3 Turbo. Unlike general French models, this version was trained on a mix of formal political speech and crowdsourced data to ensure robustness across different registers of Quebec French.

The training data includes the Assemblée Nationale du Québec (formal/political) and Common Voice (informal/diverse), making the model capable of handling both formal addresses and everyday accents.

  • Model type: Sequence-to-sequence audio-to-text model (Whisper)
  • Language(s): French (fr), specifically Canadian French (fr-CA)
  • License: MIT
  • Finetuned from model: openai/whisper-large-v3-turbo

How to Get Started with the Model

The model can be used directly with the pipeline helper from the 🤗 Transformers library.

from transformers import pipeline
import torch

# Use the GPU with half precision when available; fall back to CPU otherwise
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

pipe = pipeline(
  "automatic-speech-recognition",
  model="ele-sage/whisper-large-v3-turbo-fr-quebecois",
  torch_dtype=torch_dtype,
  device=device,
)

# Transcribe an audio file (chunking handles inputs longer than 30 seconds)
audio_path = "path/to/your/audio.wav"
result = pipe(audio_path, chunk_length_s=30)

print(result["text"])

Uses

Direct Use

This model is intended for transcribing audio files containing Canadian French speech. It is particularly useful for:

  • Transcribing Quebec media, podcasts, and interviews.
  • Transcribing formal settings (political/legal) due to the inclusion of Assemblée Nationale data.
  • General Canadian French ASR tasks where standard models struggle with the accent.

Training Details

Training Data

The model was fine-tuned on a total of approximately 69 hours of audio, combining three distinct datasets to balance specific dialect accuracy with general French capability:

  1. Assemblée Nationale du Québec: ~13 hours (Formal, political context).
  2. Common Voice (Français du Canada): ~16.5 hours (Native Quebecois accents).
  3. Common Voice (Other French): ~39 hours (General French to prevent overfitting and maintain vocabulary coverage).
Dataset Source          Duration
Assemblée Nationale     13h 11m 20s
Common Voice (CA)       16h 31m 23s
Common Voice (Other)    39h 12m 17s
Total                   68h 55m
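As a sanity check, the per-source durations in the table sum to the stated total:

```python
# Verify that the per-dataset durations add up to the reported ~69 hours
def to_seconds(h, m, s=0):
    return h * 3600 + m * 60 + s

durations = {
    "Assemblée Nationale": to_seconds(13, 11, 20),
    "Common Voice (CA)": to_seconds(16, 31, 23),
    "Common Voice (Other)": to_seconds(39, 12, 17),
}

total = sum(durations.values())
hours, rem = divmod(total, 3600)
minutes = rem // 60
print(f"{hours}h {minutes:02d}m")  # 68h 55m
```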

Training Procedure

Preprocessing

The raw audio was resampled to 16 kHz, the input rate Whisper expects. Audio clips longer than 30 seconds were excluded or segmented. SpecAugment (masking of time and frequency features) was applied during training as data augmentation to combat overfitting.
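A minimal sketch of the over-30-second segmentation step described above, assuming mono 16 kHz audio given as a flat sequence of samples (the helper name is illustrative, not from the training code):

```python
SAMPLE_RATE = 16_000   # Whisper expects 16 kHz input
MAX_SECONDS = 30       # clips longer than this were excluded or segmented

def segment_audio(samples, max_seconds=MAX_SECONDS, sample_rate=SAMPLE_RATE):
    """Split a 1-D sample sequence into chunks of at most max_seconds each."""
    max_len = max_seconds * sample_rate
    return [samples[i:i + max_len] for i in range(0, len(samples), max_len)]

# A 70-second clip becomes three segments: 30 s + 30 s + 10 s
chunks = segment_audio([0.0] * (70 * SAMPLE_RATE))
print([len(c) / SAMPLE_RATE for c in chunks])  # [30.0, 30.0, 10.0]
```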

Training Hyperparameters

  • Framework: Transformers Seq2SeqTrainer
  • Model: openai/whisper-large-v3-turbo
  • Learning Rate: 5e-6
  • LR Scheduler: cosine
  • Warmup Steps: 100
  • Num Epochs: 1
  • Batch Size: 4 (with gradient accumulation of 4, effective batch size = 16)
  • Optimizer: AdamW
  • Precision: bf16 (BFloat16)
  • SpecAugment Settings:
    • dropout: 0.2
    • attention_dropout: 0.15
    • activation_dropout: 0.15
    • mask_time_prob: 0.10
    • mask_feature_prob: 0.10
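The hyperparameters above map onto a Transformers training setup roughly as follows. This is a sketch, not the exact training script; the argument names follow Seq2SeqTrainingArguments and WhisperConfig, and the output path is illustrative:

```python
from transformers import Seq2SeqTrainingArguments, WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3-turbo")

# SpecAugment and dropout settings from the list above
model.config.apply_spec_augment = True
model.config.mask_time_prob = 0.10
model.config.mask_feature_prob = 0.10
model.config.dropout = 0.2
model.config.attention_dropout = 0.15
model.config.activation_dropout = 0.15

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-turbo-frca",  # illustrative path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,      # effective batch size = 4 * 4 = 16
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    warmup_steps=100,
    num_train_epochs=1,
    bf16=True,
)
```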

Results

The model demonstrated steady convergence, achieving a final Word Error Rate (WER) of 6.65%.

Step    Training Loss    Validation Loss    WER
400     0.3294           0.2459             8.10%
800     0.3203           0.2375             7.98%
1200    0.3118           0.2234             7.85%
1600    0.2764           0.2094             7.12%
2000    0.2715           0.2008             6.81%
2400    0.2521           0.1967             6.65%
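WER counts word-level substitutions, insertions, and deletions against the reference transcript, normalized by reference length. A minimal implementation for illustration (production evaluations typically use a library such as jiwer):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One deleted word out of four reference words -> 25% WER
print(wer("le chat est noir", "le chat noir"))  # 0.25
```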

Citation

If you use this model, please consider citing the original Whisper paper:

@misc{radford2022whisper,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}