---
datasets:
- Elyadata/Ara-Best-RQ_dataset
language:
- ar
library_name: speechbrain
tags:
- speech
- ssl
- arabic
- dialect
---
# Ara-BEST-RQ-600M-14k
**Ara-BEST-RQ-600M-14k** is a 600M-parameter self-supervised speech representation model for Arabic and Arabic dialects. It is part of the Ara-BEST-RQ family introduced in **[Ara-Best-RQ: Multi Dialectal Arabic SSL](https://arxiv.org/abs/2603.21900)**.
This model was pretrained on the **combined Ara-BEST-RQ dataset**: 13,723h 08m 43s of speech, combining the crawled Ara-BEST-RQ data with other publicly available datasets.
- **Paper:** [Ara-Best-RQ: Multi Dialectal Arabic SSL](https://arxiv.org/abs/2603.21900)
- **Dataset:** [Elyadata/Ara-Best-RQ_dataset](https://huggingface.co/datasets/Elyadata/Ara-Best-RQ_dataset)
- **Implementation:** [elyadata/AraBEST-RQ](https://github.com/elyadata/AraBEST-RQ)
## Model Details
### Model Description
Ara-BEST-RQ is a family of Arabic-focused self-supervised learning (SSL) speech models based on the BEST-RQ framework. The models are designed to learn speech representations that transfer well to Arabic speech processing tasks, including automatic speech recognition (ASR) and dialect identification (DID).
This checkpoint corresponds to the **600M** variant pretrained on the **combined 14k-hour dataset**.
- **Model type:** Self-supervised speech representation model
- **Architecture:** Conformer-based BEST-RQ encoder
- **Parameters:** ~600M (611.6M)
- **Training data:** combined Ara-BEST-RQ dataset
- **Languages:** Arabic, including multiple dialects
- **Primary use:** Speech representation learning / downstream fine-tuning
### Architecture
The 600M Ara-BEST-RQ model uses:
- 24 Conformer encoder layers
- Model dimension: 1024
- 8 attention heads
- Feed-forward dimension: 4096
- GELU activations
- Layer normalization before attention
- Relative position multi-head attention
- Convolutional front-end with two blocks
- Random projection quantizer with 4096 codebook entries of dimension 16
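The random projection quantizer is the defining component of BEST-RQ: a frozen (never-trained) random matrix projects input frames into the codebook space, and each frame's pretraining label is the index of the nearest codebook entry. A minimal NumPy sketch of this idea, using the codebook size (4096) and dimension (16) from the list above; the input feature dimension of 80 is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim = 80          # assumed input feature dimension (e.g. Mel filterbanks)
code_dim = 16          # codebook entry dimension, per the model card
codebook_size = 4096   # number of codebook entries, per the model card

# Both the projection and the codebook are randomly initialized and frozen.
projection = rng.normal(size=(feat_dim, code_dim))
codebook = rng.normal(size=(codebook_size, code_dim))
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)  # l2-normalized entries

def quantize(features: np.ndarray) -> np.ndarray:
    """Map a (T, feat_dim) array of frames to (T,) codebook indices."""
    z = features @ projection
    z /= np.linalg.norm(z, axis=1, keepdims=True)
    # With normalized vectors, the nearest entry is the one with the
    # highest cosine similarity.
    return np.argmax(z @ codebook.T, axis=1)

labels = quantize(rng.normal(size=(100, feat_dim)))
```

During pretraining, these discrete labels serve as targets: the encoder must predict the codebook index of masked frames.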
## Training Data
The model was pretrained on the combined Ara-BEST-RQ dataset: **13,723h 08m 43s** of speech data. The combined set includes the crawled Ara-BEST-RQ data together with other publicly available datasets described in the paper.
The released dataset on Hugging Face provides **metadata only**: YouTube video identifiers and audio segment boundaries. No audio or video files are distributed as part of the dataset.
Dataset link: [Elyadata/Ara-Best-RQ_dataset](https://huggingface.co/datasets/Elyadata/Ara-Best-RQ_dataset)
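Since the dataset ships only video identifiers and segment boundaries, reconstructing the audio requires downloading each video yourself and slicing it at the given boundaries. A minimal sketch of the boundary-to-sample conversion, assuming a hypothetical metadata row and a 16 kHz sample rate (the actual schema and sample rate may differ):

```python
# Hypothetical metadata row; the real dataset schema may differ.
row = {"video_id": "xxxxxxxxxxx", "start": 12.48, "end": 17.92}

SAMPLE_RATE = 16_000  # assumed pretraining sample rate

def segment_to_samples(start_s: float, end_s: float, sample_rate: int) -> tuple[int, int]:
    """Convert segment boundaries in seconds to sample indices."""
    return int(round(start_s * sample_rate)), int(round(end_s * sample_rate))

first, last = segment_to_samples(row["start"], row["end"], SAMPLE_RATE)
# Once the corresponding audio is downloaded and resampled:
# segment = audio[first:last]
```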
## Pretraining
The paper reports the following pretraining losses after 300k updates for this model:
| Training set | Train loss | Validation loss |
|---|---:|---:|
| Combined | 3.57 | 3.40 |
## Evaluation
The paper evaluates Ara-BEST-RQ models on automatic speech recognition and dialect identification tasks. The following results are reported for the **Ara-BEST-RQ-600M-14k** model.
### Automatic Speech Recognition
Word error rates (WER, %; lower is better) on Arabic ASR benchmarks:
| Dataset | WER |
|---|---:|
| Common Voice 19.0 Arabic | 18.59 |
| MGB-3 | 28.78 |
| MGB-5 | 54.54 |
| TARIC-SLU | 21.14 |
| Average | 30.76 |
### Dialect Identification
Results on the ADI-20 dialect identification benchmark (accuracy and weighted F1, %):
| Split | Accuracy | Weighted F1 |
|---|---:|---:|
| Validation | 94.66 | 94.71 |
| Test | 92.05 | 92.07 |
## Usage
This is a self-supervised pretrained model intended to be used as a speech encoder or as an initialization checkpoint for downstream fine-tuning.
For training and fine-tuning recipes, please refer to the official implementation:
```bash
git clone https://github.com/elyadata/AraBEST-RQ
cd AraBEST-RQ
```
You can download the checkpoint from Hugging Face using:
```python
from huggingface_hub import snapshot_download
model_dir = snapshot_download("Elyadata/AraBEST-RQ-600M-14k")
print(model_dir)
```
Please refer to the repository configuration and SpeechBrain recipes for the correct model-loading interface.
### Fine-tuning with SpeechBrain
To fine-tune this pretrained Ara-BEST-RQ checkpoint in a SpeechBrain recipe, adapt the `pretrainer` section of your YAML configuration so that it loads both the pretrained model checkpoint and the corresponding normalizer.
Example:
```yaml
pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
collect_in: !ref <save_folder>
loadables:
pt_model: !ref <pt_model>
normalize: !ref <normalize>
paths:
pt_model: !ref <pt_model_path>/model.ckpt
normalize: !ref <pt_model_path>/normalizer.ckpt
```
In your downstream recipe, make sure that:
- `<pt_model>` points to the Ara-BEST-RQ pretrained model object used in your training graph.
- `<normalize>` points to the normalization module used by the recipe.
- `<pt_model_path>` points to the local directory containing `model.ckpt` and `normalizer.ckpt`.
- `<save_folder>` is the experiment directory where SpeechBrain should collect and manage pretrained components.
This setup allows SpeechBrain to initialize the downstream model from the Ara-BEST-RQ SSL checkpoint before fine-tuning on task-specific data.
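Concretely, the `!ref <pt_model_path>/...` placeholders in the YAML above resolve to plain file paths inside the downloaded checkpoint directory. A minimal sketch of that resolution (the directory name is hypothetical; the `model.ckpt` and `normalizer.ckpt` filenames come from the YAML example):

```python
import os

# Hypothetical local checkpoint directory, e.g. the one returned by
# snapshot_download() in the usage example above.
pt_model_path = "results/ara_best_rq_600m_14k"

# These mirror the `paths` section of the Pretrainer YAML.
paths = {
    "pt_model": os.path.join(pt_model_path, "model.ckpt"),
    "normalize": os.path.join(pt_model_path, "normalizer.ckpt"),
}
```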
## Citation
If you use this model, please cite the Ara-BEST-RQ paper:
```bibtex
@misc{elleuch2026arabestrqmultidialectalarabic,
title={Ara-Best-RQ: Multi Dialectal Arabic SSL},
author={Haroun Elleuch and Ryan Whetten and Salima Mdhaffar and Yannick Estève and Fethi Bougares},
year={2026},
eprint={2603.21900},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2603.21900},
}
```