Røst-v3-wav2vec2-315m

This is a state-of-the-art Danish speech recognition model, trained as part of the CoRal project by the Alexandra Institute.

This repository contains a Wav2vec2-XLSR-300M model trained on the CoRal-v3 dataset. The CoRal-v3 dataset includes a rich variety of Danish conversational and read-aloud data, distributed across diverse age groups, genders, and dialects. The model is designed for automatic speech recognition (ASR).

Quick Start

Start by installing the required libraries:

$ pip install transformers

Next, you can use the model via the transformers Python package as follows:

>>> from transformers import pipeline
>>> audio = get_audio()  # 16kHz raw audio array
>>> transcriber = pipeline(model="CoRal-project/roest-v3-wav2vec2-315m")
>>> transcriber(audio)
{'text': 'your transcription'}

Model Details

Wav2vec2 is a state-of-the-art model architecture for speech recognition, leveraging self-supervised learning from raw audio data. The pre-trained Wav2vec2-XLS-R model has been fine-tuned for automatic speech recognition on the CoRal-v3 dataset to enhance its performance in recognising Danish speech across different dialects. The model was trained using the CoRal model training framework by running:

uv run accelerate launch \
  --use-deepspeed \
  --zero-stage 2 \
  src/scripts/finetune_asr_model.py \
  model=wav2vec2-small \
  per_device_batch_size=64 \
  max_steps=100000

Note that the dataset, and thus also this model, is licensed under a custom license adapted from OpenRAIL-M, which allows commercial use with a few restrictions (notably prohibiting speech synthesis and biometric identification) - see the license.


Evaluation

The model was evaluated using Character Error Rate (CER): the number of character-level edits (insertions, deletions, and substitutions) needed to turn the model's transcription into the reference, divided by the number of characters in the reference.
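As a concrete illustration of the metric (not the project's evaluation code, which likely uses a library such as jiwer), CER can be computed with a standard Levenshtein dynamic program over characters:

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = character-level edit distance / number of reference characters."""
    ref, hyp = list(reference), list(hypothesis)
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)
```

For example, `character_error_rate("kitten", "sitting")` is 3 edits over 6 reference characters, i.e. 50%.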

Conversational CoRal Performance

| Model | Number of parameters | Finetuned on data of type | CoRal-v3::conversation CER |
|---|---|---|---|
| CoRal-project/roest-whisper-1.5b-v2 | 1540M | Read-aloud and conversation | 11.6% |
| CoRal-project/roest-wav2vec2-315m-v3 (this model) | 315M | Read-aloud and conversation | 13.7% |
| CoRal-project/roest-wav2vec2-315m-v2 | 315M | Read-aloud and conversation | 24.2% |
| CoRal-project/roest-wav2vec2-315m-v1 | 315M | Read-aloud | 17.6% |
| CoRal-project/roest-whisper-1.5b-v1 | 1540M | Read-aloud | 35.6% |
| syvai/hviske-v3-conversation | 1540M | Read-aloud and conversation | 15.1% |
| syvai/hviske-v2 | 1540M | Read-aloud | 29.4% |
| openai/whisper-large-v3 | 1540M | - | 27.5% |

Read-aloud CoRal Performance

| Model | Number of parameters | Finetuned on data of type | CoRal-v3::read_aloud CER |
|---|---|---|---|
| CoRal-project/roest-whisper-1.5b-v2 | 1540M | Read-aloud and conversation | 4.5% |
| CoRal-project/roest-wav2vec2-315m-v3 (this model) | 315M | Read-aloud and conversation | 5.9% |
| CoRal-project/roest-wav2vec2-315m-v2 | 315M | Read-aloud and conversation | 6.4% |
| CoRal-project/roest-wav2vec2-315m-v1 | 315M | Read-aloud | 8.2% |
| CoRal-project/roest-whisper-1.5b-v1 | 1540M | Read-aloud | 4.0% |
| syvai/hviske-v3-conversation | 1540M | Read-aloud and conversation | 4.5% |
| syvai/hviske-v2 | 1540M | Read-aloud | 4.0% |
| openai/whisper-large-v3 | 1540M | - | 10.1% |

Creators and Funders

This model was trained, and this model card written, by Dan Saattrup Smart at the Alexandra Institute.

The CoRal project is funded by the Danish Innovation Fund and consists of the following partners:
