Røst-v3-wav2vec2-315m

This is a state-of-the-art Danish speech recognition model, trained as part of the CoRal project by the Alexandra Institute.

This repository contains a Wav2vec2-XLSR-300M model trained on the CoRal-v3 dataset. The CoRal-v3 dataset includes a rich variety of Danish conversational and read-aloud data, distributed across diverse age groups, genders, and dialects. The model is designed for automatic speech recognition (ASR).

Quick Start

Start by installing the required libraries:

$ pip install transformers

Next, you can use the model via the transformers Python package as follows:

>>> from transformers import pipeline
>>> audio = get_audio()  # 16kHz raw audio array
>>> transcriber = pipeline(model="CoRal-project/roest-v3-wav2vec2-315m")
>>> transcriber(audio)
{'text': 'your transcription'}

Model Details

Wav2vec2 is a state-of-the-art model architecture for speech recognition, leveraging self-supervised learning from raw audio data. The pre-trained Wav2vec2-XLS-R model has been fine-tuned for automatic speech recognition on the CoRal-v3 dataset to enhance its performance in recognising Danish speech across different dialects. The model was trained using the CoRal model training framework by running:

uv run accelerate launch \
  --use-deepspeed \
  --zero-stage 2 \
  src/scripts/finetune_asr_model.py \
  model=wav2vec2-small \
  per_device_batch_size=64 \
  max_steps=100000

Note that the dataset, and thus also this model, is licensed under a custom license adapted from OpenRAIL-M, which allows commercial use with a few restrictions (notably prohibiting speech synthesis and biometric identification) - see the license.


Evaluation

The model was evaluated using Character Error Rate (CER): the number of character-level edits (insertions, deletions, and substitutions) needed to turn the model's transcription into the reference, divided by the number of characters in the reference.
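As a concrete illustration of the metric (not the project's evaluation code, which likely uses a library such as jiwer), CER can be computed with a standard Levenshtein dynamic program over characters:

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = character-level edit distance / number of reference characters."""
    ref, hyp = list(reference), list(hypothesis)
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)
```

For example, `character_error_rate("kitten", "sitting")` is 3 edits over 6 reference characters, i.e. 50%.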

Conversational CoRal Performance

| Model | Number of parameters | Finetuned on data of type | CoRal-v3::conversation CER |
|---|---|---|---|
| CoRal-project/roest-whisper-1.5b-v2 | 1540M | Read-aloud and conversation | 11.6% |
| CoRal-project/roest-wav2vec2-315m-v3 (this model) | 315M | Read-aloud and conversation | 13.7% |
| CoRal-project/roest-wav2vec2-315m-v2 | 315M | Read-aloud and conversation | 24.2% |
| CoRal-project/roest-wav2vec2-315m-v1 | 315M | Read-aloud | 17.6% |
| CoRal-project/roest-whisper-1.5b-v1 | 1540M | Read-aloud | 35.6% |
| syvai/hviske-v3-conversation | 1540M | Read-aloud and conversation | 15.1% |
| syvai/hviske-v2 | 1540M | Read-aloud | 29.4% |
| openai/whisper-large-v3 | 1540M | - | 27.5% |

Read-aloud CoRal Performance

| Model | Number of parameters | Finetuned on data of type | CoRal-v3::read_aloud CER |
|---|---|---|---|
| CoRal-project/roest-whisper-1.5b-v2 | 1540M | Read-aloud and conversation | 4.5% |
| CoRal-project/roest-wav2vec2-315m-v3 (this model) | 315M | Read-aloud and conversation | 5.9% |
| CoRal-project/roest-wav2vec2-315m-v2 | 315M | Read-aloud and conversation | 6.4% |
| CoRal-project/roest-wav2vec2-315m-v1 | 315M | Read-aloud | 8.2% |
| CoRal-project/roest-whisper-1.5b-v1 | 1540M | Read-aloud | 4.0% |
| syvai/hviske-v3-conversation | 1540M | Read-aloud and conversation | 4.5% |
| syvai/hviske-v2 | 1540M | Read-aloud | 4.0% |
| openai/whisper-large-v3 | 1540M | - | 10.1% |

Creators and Funders

This model was trained, and this model card written, by Dan Saattrup Smart at the Alexandra Institute.

The CoRal project is funded by the Danish Innovation Fund and consists of the following partners:
