+---
+datasets:
+- Elyadata/Ara-Best-RQ_dataset
+language:
+- ar
+library_name: speechbrain
+tags:
+- speech
+- ssl
+- arabic
+- dialect
+---
+# Ara-BEST-RQ-600M-6k
+**Ara-BEST-RQ-600M-6k** is a 600M-parameter self-supervised speech representation model for Arabic and Arabic dialects. It is part of the Ara-BEST-RQ family introduced in **[Ara-Best-RQ: Multi Dialectal Arabic SSL](https://arxiv.org/abs/2603.21900)**.
+This model was pretrained on the **crawled Ara-BEST-RQ dataset**: 5,639h 04m 27s of Creative Commons Arabic speech collected from publicly available YouTube videos and segmented for self-supervised speech learning.
+- **Paper:** [Ara-Best-RQ: Multi Dialectal Arabic SSL](https://arxiv.org/abs/2603.21900)
+- **Dataset:** [Elyadata/Ara-Best-RQ_dataset](https://huggingface.co/datasets/Elyadata/Ara-Best-RQ_dataset)
+- **Implementation:** [elyadata/AraBEST-RQ](https://github.com/elyadata/AraBEST-RQ)
+## Model Details
+### Model Description
+Ara-BEST-RQ is a family of Arabic-focused self-supervised learning (SSL) speech models based on the BEST-RQ framework. The models are designed to learn speech representations that transfer well to Arabic speech processing tasks, including automatic speech recognition (ASR) and dialect identification (DID).
+This checkpoint is the **600M** variant, pretrained on the **crawled dataset** of about 5.6k hours (the "6k" in the model name).
+- **Model type:** Self-supervised speech representation model
+- **Architecture:** Conformer-based BEST-RQ encoder
+- **Parameters:** ~600M (611.3M)
+- **Training data:** crawled Ara-BEST-RQ dataset
+- **Languages:** Arabic, including multiple dialects
+- **Primary use:** Speech representation learning / downstream fine-tuning
+### Architecture
+The 600M Ara-BEST-RQ model uses:
+- 24 Conformer encoder layers
+- Model dimension: 1024
+- 8 attention heads
+- Feed-forward dimension: 4096
+- GELU activations
+- Layer normalization before attention
+- Relative position multi-head attention
+- Convolutional front-end with two blocks
+- Random projection quantizer with 4096 codebook entries of dimension 16
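To make the last item concrete, here is a minimal NumPy sketch of a BEST-RQ-style random-projection quantizer. Only the codebook size (4096) and code dimension (16) come from this card. The input feature dimension, the Gaussian initialization, and the names `projection`, `codebook`, and `quantize` are assumptions chosen for illustration; the official implementation may differ in all of these.

```python
import numpy as np

CODEBOOK_SIZE = 4096  # from the model card
CODE_DIM = 16         # from the model card
FEATURE_DIM = 320     # assumption: e.g. 4 stacked 80-dim filterbank frames

rng = np.random.default_rng(0)

# Both matrices are drawn once and kept frozen; nothing here is trained.
projection = rng.standard_normal((FEATURE_DIM, CODE_DIM))
codebook = rng.standard_normal((CODEBOOK_SIZE, CODE_DIM))
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)


def quantize(features: np.ndarray) -> np.ndarray:
    """Map (frames, FEATURE_DIM) features to codebook indices of shape (frames,)."""
    projected = features @ projection
    projected = projected / np.linalg.norm(projected, axis=1, keepdims=True)
    # For unit vectors, the nearest codebook entry in L2 distance is the one
    # with the largest dot product.
    return np.argmax(projected @ codebook.T, axis=1)


frames = rng.standard_normal((50, FEATURE_DIM))  # stand-in for real features
targets = quantize(frames)
print(targets.shape, targets.min(), targets.max())
```

During pretraining, indices like `targets` serve as the labels the encoder predicts for masked frames, so the quantizer itself needs no learned parameters.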
+## Training Data
+The model was pretrained on the crawled Ara-BEST-RQ dataset: **5,639h 04m 27s** of Creative Commons speech data.
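As a quick check of how the exact duration relates to the rounded "6k" label, the figure above can be converted to seconds and decimal hours. The variable names are illustrative only.

```python
hours, minutes, seconds = 5639, 4, 27  # duration stated in this card

total_seconds = hours * 3600 + minutes * 60 + seconds
decimal_hours = total_seconds / 3600

print(total_seconds)            # total duration in seconds
print(round(decimal_hours, 2))  # total duration in hours
```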
+The released dataset on Hugging Face provides **metadata only**: YouTube video identifiers and audio segment boundaries. No audio or video files are distributed as part of the dataset.
+Dataset link: [Elyadata/Ara-Best-RQ_dataset](https://huggingface.co/datasets/Elyadata/Ara-Best-RQ_dataset)
+## Pretraining
+The paper reports the following pretraining losses after 300k updates for this model:
+| Training set | Train loss | Validation loss |
+|---|---:|---:|
+| Crawled | 3.53 | 3.70 |
+## Evaluation
+The paper evaluates Ara-BEST-RQ models on automatic speech recognition and dialect identification tasks. The following results are reported for the **Ara-BEST-RQ-600M-6k** model.
+### Automatic Speech Recognition
+Word error rate (WER, in %; lower is better) on ASR benchmarks:
+| Dataset | WER (%) |
+|---|---:|
+| Common Voice 19.0 Arabic | 19.50 |
+| MGB-3 | 30.83 |
+| MGB-5 | 55.78 |
+| TARIC-SLU | 22.41 |
+| Average | 32.13 |
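The Average row appears to be the unweighted mean of the four benchmark scores, which a short check confirms. The dictionary below simply restates the table.

```python
wer = {
    "Common Voice 19.0 Arabic": 19.50,
    "MGB-3": 30.83,
    "MGB-5": 55.78,
    "TARIC-SLU": 22.41,
}

# Unweighted mean: every benchmark counts equally, regardless of its size.
average = sum(wer.values()) / len(wer)
print(f"{average:.2f}")
```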
+### Dialect Identification
+Accuracy and weighted F1 (both in %) on ADI-20:
+| Split | Accuracy (%) | Weighted F1 (%) |
+|---|---:|---:|
+| Validation | 92.86 | 92.87 |
+| Test | 91.05 | 91.04 |
+## Usage
+This is a self-supervised pretrained model intended to be used as a speech encoder or as an initialization checkpoint for downstream fine-tuning.
+For training and fine-tuning recipes, please refer to the official implementation:
+```bash
+git clone https://github.com/elyadata/AraBEST-RQ
+cd AraBEST-RQ
+```
+You can download the checkpoint from Hugging Face using:
+```python
+from huggingface_hub import snapshot_download
+model_dir = snapshot_download("Elyadata/AraBEST-RQ-600M-6k")
+print(model_dir)
+```
+Please refer to the repository configuration and SpeechBrain recipes for the correct model-loading interface.
+### Fine-tuning with SpeechBrain
+To fine-tune this pretrained Ara-BEST-RQ checkpoint in a SpeechBrain recipe, adapt the `pretrainer` section of your YAML configuration so that it loads both the pretrained model checkpoint and the corresponding normalizer.
+Example:
+```yaml
+pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
+    collect_in: !ref <save_folder>
+    loadables:
+        pt_model: !ref <pt_model>
+        normalize: !ref <normalize>
+    paths:
+        pt_model: !ref <pt_model_path>/model.ckpt
+        normalize: !ref <pt_model_path>/normalizer.ckpt
+```
+In your downstream recipe, make sure that:
+- `<pt_model>` points to the Ara-BEST-RQ pretrained model object used in your training graph.
+- `<normalize>` points to the normalization module used by the recipe.
+- `<pt_model_path>` points to the local directory containing `model.ckpt` and `normalizer.ckpt`.
+- `<save_folder>` is the experiment directory where SpeechBrain should collect and manage pretrained components.
+This setup allows SpeechBrain to initialize the downstream model from the Ara-BEST-RQ SSL checkpoint before fine-tuning on task-specific data.
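Before launching a recipe, it can help to confirm that the directory used for `<pt_model_path>` actually contains the two files named in the YAML example. The helper below is a hypothetical convenience, not part of SpeechBrain or the official repository, and it assumes both files sit at the top level of that directory.

```python
from pathlib import Path

# File names taken from the `paths` section of the YAML example.
EXPECTED_FILES = ("model.ckpt", "normalizer.ckpt")


def missing_checkpoint_files(pt_model_path):
    """Return the expected file names not found directly under pt_model_path."""
    root = Path(pt_model_path)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

For example, calling `missing_checkpoint_files(model_dir)` on the directory returned by `snapshot_download` gives an empty list when both files are present at the top level.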
+## Citation
+If you use this model, please cite the Ara-BEST-RQ paper:
+```bibtex
+@misc{elleuch2026arabestrqmultidialectalarabic,
+      title={Ara-Best-RQ: Multi Dialectal Arabic SSL},
+      author={Haroun Elleuch and Ryan Whetten and Salima Mdhaffar and Yannick Estève and Fethi Bougares},
+      year={2026},
+      eprint={2603.21900},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2603.21900},
+}
+```