---
datasets:
- Elyadata/Ara-Best-RQ_dataset
language:
- ar
library_name: speechbrain
tags:
- speech
- ssl
- arabic
- dialect
---
# Ara-BEST-RQ-600M-14k
**Ara-BEST-RQ-600M-14k** is a 600M-parameter self-supervised speech representation model for Arabic and Arabic dialects. It is part of the Ara-BEST-RQ family introduced in **[Ara-Best-RQ: Multi Dialectal Arabic SSL](https://arxiv.org/abs/2603.21900)**.
This model was pretrained on the **combined Ara-BEST-RQ dataset**: 13,723h 08m 43s of speech, combining the crawled Ara-BEST-RQ data with other publicly available datasets.
- **Paper:** [Ara-Best-RQ: Multi Dialectal Arabic SSL](https://arxiv.org/abs/2603.21900)
- **Dataset:** [Elyadata/Ara-Best-RQ_dataset](https://huggingface.co/datasets/Elyadata/Ara-Best-RQ_dataset)
- **Implementation:** [elyadata/AraBEST-RQ](https://github.com/elyadata/AraBEST-RQ)
## Model Details
### Model Description
Ara-BEST-RQ is a family of Arabic-focused self-supervised learning (SSL) speech models based on the BEST-RQ framework. The models are designed to learn speech representations that transfer well to Arabic speech processing tasks, including automatic speech recognition (ASR) and dialect identification (DID).
This checkpoint corresponds to the **600M** variant pretrained on the **combined 14k-hour dataset**.
- **Model type:** Self-supervised speech representation model
- **Architecture:** Conformer-based BEST-RQ encoder
- **Parameters:** ~600M (611.6M)
- **Training data:** combined Ara-BEST-RQ dataset
- **Languages:** Arabic, including multiple dialects
- **Primary use:** Speech representation learning / downstream fine-tuning
### Architecture
The 600M Ara-BEST-RQ model uses:
- 24 Conformer encoder layers
- Model dimension: 1024
- 8 attention heads
- Feed-forward dimension: 4096
- GELU activations
- Layer normalization applied before attention (pre-norm)
- Relative position multi-head attention
- Convolutional front-end with two blocks
- Random projection quantizer with 4096 codebook entries of dimension 16
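The random-projection quantizer listed above can be sketched as follows: a frame of input features is projected through a fixed random matrix, L2-normalized, and assigned the index of the nearest normalized codebook vector; neither the projection nor the codebook is trained. This is an illustrative pure-Python toy with reduced dimensions (the model uses 4096 entries of dimension 16), not the training implementation:

```python
import math
import random

rng = random.Random(0)

FEAT_DIM = 8        # toy input-feature dimension
CODE_DIM = 4        # toy codebook-entry dimension (16 in the model)
CODEBOOK_SIZE = 32  # toy codebook size (4096 in the model)

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

# Fixed random projection matrix and fixed normalized codebook:
# in BEST-RQ both are frozen, only the encoder is trained.
projection = [[rng.gauss(0, 1) for _ in range(CODE_DIM)] for _ in range(FEAT_DIM)]
codebook = [l2_normalize([rng.gauss(0, 1) for _ in range(CODE_DIM)])
            for _ in range(CODEBOOK_SIZE)]

def quantize(frame):
    """Return the index of the codebook entry nearest to the
    projected, L2-normalized feature frame."""
    proj = [sum(frame[i] * projection[i][j] for i in range(FEAT_DIM))
            for j in range(CODE_DIM)]
    z = l2_normalize(proj)
    return min(range(CODEBOOK_SIZE),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(z, codebook[k])))

frame = [rng.gauss(0, 1) for _ in range(FEAT_DIM)]
print(quantize(frame))
```

Because both the projection and the codebook are fixed, the same frame always maps to the same label, which gives the encoder a stable prediction target throughout pretraining.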
## Training Data
The model was pretrained on the combined Ara-BEST-RQ dataset: **13,723h 08m 43s** of speech data. The combined set includes the crawled Ara-BEST-RQ data together with other publicly available datasets described in the paper.
The released dataset on Hugging Face provides **metadata only**: YouTube video identifiers and audio segment boundaries. No audio or video files are distributed as part of the dataset.
Dataset link: [Elyadata/Ara-Best-RQ_dataset](https://huggingface.co/datasets/Elyadata/Ara-Best-RQ_dataset)
## Pretraining
The paper reports the following pretraining losses after 300k updates for this model:
| Training set | Train loss | Validation loss |
|---|---:|---:|
| Combined | 3.57 | 3.40 |
## Evaluation
The paper evaluates Ara-BEST-RQ models on automatic speech recognition and dialect identification tasks. The following results are reported for the **Ara-BEST-RQ-600M-14k** model.
### Automatic Speech Recognition
Word error rate (WER, lower is better) on ASR benchmarks:
| Dataset | WER |
|---|---:|
| Common Voice 19.0 Arabic | 18.59 |
| MGB-3 | 28.78 |
| MGB-5 | 54.54 |
| TARIC-SLU | 21.14 |
| Average | 30.76 |
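The reported average is the unweighted mean of the four benchmark WERs:

```python
# WER values copied from the table above
wers = {
    "Common Voice 19.0 Arabic": 18.59,
    "MGB-3": 28.78,
    "MGB-5": 54.54,
    "TARIC-SLU": 21.14,
}
average = sum(wers.values()) / len(wers)
print(round(average, 2))  # 30.76
```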
### Dialect Identification
Results on the ADI-20 dialect identification benchmark:
| Split | Accuracy | Weighted F1 |
|---|---:|---:|
| Validation | 94.66 | 94.71 |
| Test | 92.05 | 92.07 |
## Usage
This is a self-supervised pretrained model intended to be used as a speech encoder or as an initialization checkpoint for downstream fine-tuning.
For training and fine-tuning recipes, please refer to the official implementation:
```bash
git clone https://github.com/elyadata/AraBEST-RQ
cd AraBEST-RQ
```
You can download the checkpoint from Hugging Face using:
```python
from huggingface_hub import snapshot_download

# Download the full model repository into the local Hugging Face cache
# and return the path of the downloaded snapshot
model_dir = snapshot_download("Elyadata/AraBEST-RQ-600M-14k")
print(model_dir)
```
Please refer to the repository configuration and SpeechBrain recipes for the correct model-loading interface.
### Fine-tuning with SpeechBrain
To fine-tune this pretrained Ara-BEST-RQ checkpoint in a SpeechBrain recipe, adapt the `pretrainer` section of your YAML configuration so that it loads both the pretrained model checkpoint and the corresponding normalizer.
Example:
```yaml
pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
collect_in: !ref <save_folder>
loadables:
pt_model: !ref <pt_model>
normalize: !ref <normalize>
paths:
pt_model: !ref <pt_model_path>/model.ckpt
normalize: !ref <pt_model_path>/normalizer.ckpt
```
In your downstream recipe, make sure that:
- `<pt_model>` points to the Ara-BEST-RQ pretrained model object used in your training graph.
- `<normalize>` points to the normalization module used by the recipe.
- `<pt_model_path>` points to the local directory containing `model.ckpt` and `normalizer.ckpt`.
- `<save_folder>` is the experiment directory where SpeechBrain should collect and manage pretrained components.
This setup allows SpeechBrain to initialize the downstream model from the Ara-BEST-RQ SSL checkpoint before fine-tuning on task-specific data.
## Citation
If you use this model, please cite the Ara-BEST-RQ paper:
```bibtex
@misc{elleuch2026arabestrqmultidialectalarabic,
  title={Ara-Best-RQ: Multi Dialectal Arabic SSL},
  author={Haroun Elleuch and Ryan Whetten and Salima Mdhaffar and Yannick Estève and Fethi Bougares},
  year={2026},
  eprint={2603.21900},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.21900},
}
```