Røst-v3-wav2vec2-315m
This is a Danish state-of-the-art speech recognition model, trained as part of the CoRal project by the Alexandra Institute.
This repository contains a Wav2vec2 XLS-R 300M model trained on the CoRal-v3 dataset. The CoRal-v3 dataset includes a rich variety of Danish conversational and read-aloud data, distributed across diverse age groups, genders, and dialects. The model is designed for automatic speech recognition (ASR).
Quick Start
Start by installing the required libraries:
$ pip install transformers
Next, use the model through the transformers Python package as follows:
>>> from transformers import pipeline
>>> audio = get_audio()  # placeholder: returns a 16 kHz raw audio array
>>> transcriber = pipeline(model="CoRal-project/roest-v3-wav2vec2-315m")
>>> transcriber(audio)
{'text': 'your transcription'}
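The `get_audio()` call in the snippet above is a placeholder. As a minimal sketch, assuming the recording is a 16-bit PCM mono WAV file already sampled at 16 kHz, the hypothetical helper `read_wav_mono16` below reads it using only the Python standard library and scales the samples to floats. The helper name and the strict format checks are illustrative assumptions, not part of the CoRal tooling.

```python
import array
import sys
import wave

EXPECTED_RATE = 16_000  # sampling rate asked for in the Quick Start comment


def read_wav_mono16(path):
    """Hypothetical helper: read a 16-bit PCM mono WAV file as floats in [-1.0, 1.0)."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM mono audio")
        if wav.getframerate() != EXPECTED_RATE:
            raise ValueError(f"expected {EXPECTED_RATE} Hz, got {wav.getframerate()} Hz")
        frames = wav.readframes(wav.getnframes())

    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    return [sample / 32768.0 for sample in samples]
```

The returned list can be wrapped as `numpy.asarray(samples, dtype="float32")` and passed to `transcriber` in place of `audio`. Recordings in other formats or at other sampling rates would need to be converted and resampled first.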
Model Details
Wav2vec2 is a state-of-the-art model architecture for speech recognition, leveraging self-supervised learning from raw audio data. The pre-trained Wav2vec2-XLS-R has been fine-tuned for automatic speech recognition on the CoRal-v3 dataset to improve its recognition of Danish speech across different dialects. The model was trained using the CoRal model training framework by running:
uv run accelerate launch \
    --use-deepspeed \
    --zero-stage 2 \
    src/scripts/finetune_asr_model.py \
    model=wav2vec2-small \
    per_device_batch_size=64 \
    max_steps=100000
Note that the dataset, and thus also this model, is licensed under a custom license, adapted from OpenRAIL-M, which allows commercial use with a few restrictions (on speech synthesis and biometric identification); see the license for details.
Evaluation
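The model was evaluated using character error rate (CER): the number of character-level substitutions, deletions, and insertions needed to turn the transcription into the reference text, divided by the number of characters in the reference. Lower is better.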
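To make the metric concrete, here is a minimal sketch of a CER computation based on the character-level Levenshtein distance. It is an illustration only, not the CoRal evaluation code: the function names are hypothetical, and any text normalisation applied before scoring (such as lowercasing or punctuation removal) is not reproduced here.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, counted in characters."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            deletion = previous[j] + 1      # reference character missing from hypothesis
            insertion = current[j - 1] + 1  # extra character in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """CER = character edits / number of reference characters."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One substituted character in an 11-character reference.
print(f"{character_error_rate('hej med dig', 'haj med dig'):.1%}")  # prints 9.1%
```

Because insertions count as errors, the CER can exceed 100% when the transcription is much longer than the reference.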
Conversational CoRal Performance
| Model | Number of parameters | Finetuned on data of type | CoRal-v3::conversation CER |
|---|---|---|---|
| CoRal-project/roest-whisper-1.5b-v2 | 1540M | Read-aloud and conversation | 11.6% |
| CoRal-project/roest-v3-wav2vec2-315m (this model) | 315M | Read-aloud and conversation | 13.7% |
| CoRal-project/roest-wav2vec2-315m-v2 | 315M | Read-aloud and conversation | 24.2% |
| CoRal-project/roest-wav2vec2-315m-v1 | 315M | Read-aloud | 17.6% |
| CoRal-project/roest-whisper-1.5b-v1 | 1540M | Read-aloud | 35.6% |
| syvai/hviske-v3-conversation | 1540M | Read-aloud and conversation | 15.1% |
| syvai/hviske-v2 | 1540M | Read-aloud | 29.4% |
| openai/whisper-large-v3 | 1540M | - | 27.5% |
Read-aloud CoRal Performance
| Model | Number of parameters | Finetuned on data of type | CoRal-v3::read_aloud CER |
|---|---|---|---|
| CoRal-project/roest-whisper-1.5b-v2 | 1540M | Read-aloud and conversation | 4.5% |
| CoRal-project/roest-v3-wav2vec2-315m (this model) | 315M | Read-aloud and conversation | 5.9% |
| CoRal-project/roest-wav2vec2-315m-v2 | 315M | Read-aloud and conversation | 6.4% |
| CoRal-project/roest-wav2vec2-315m-v1 | 315M | Read-aloud | 8.2% |
| CoRal-project/roest-whisper-1.5b-v1 | 1540M | Read-aloud | 4.0% |
| syvai/hviske-v3-conversation | 1540M | Read-aloud and conversation | 4.5% |
| syvai/hviske-v2 | 1540M | Read-aloud | 4.0% |
| openai/whisper-large-v3 | 1540M | - | 10.1% |
Creators and Funders
The model was trained and this model card was written by Dan Saattrup Smart at the Alexandra Institute.
The CoRal project is funded by the Danish Innovation Fund and consists of the following partners:
Model tree for CoRal-project/roest-v3-wav2vec2-315m
Base model
facebook/wav2vec2-xls-r-300m