---
datasets:
- Elyadata/Ara-Best-RQ_dataset
language:
- ar
library_name: speechbrain
tags:
- speech
- ssl
- arabic
- dialect
---
# Ara-BEST-RQ-600M-6k

**Ara-BEST-RQ-600M-6k** is a 600M-parameter self-supervised speech representation model for Arabic and its dialects. It is part of the Ara-BEST-RQ family introduced in **[Ara-Best-RQ: Multi Dialectal Arabic SSL](https://arxiv.org/abs/2603.21900)**.

This model was pretrained on the **crawled Ara-BEST-RQ dataset**: 5,639h 04m 27s of Creative Commons Arabic speech collected from publicly available YouTube videos and segmented for self-supervised learning.

- **Paper:** [Ara-Best-RQ: Multi Dialectal Arabic SSL](https://arxiv.org/abs/2603.21900)
- **Dataset:** [Elyadata/Ara-Best-RQ_dataset](https://huggingface.co/datasets/Elyadata/Ara-Best-RQ_dataset)
- **Implementation:** [elyadata/AraBEST-RQ](https://github.com/elyadata/AraBEST-RQ)
## Model Details

### Model Description

Ara-BEST-RQ is a family of Arabic-focused self-supervised learning (SSL) speech models based on the BEST-RQ framework. The models are designed to learn speech representations that transfer well to Arabic speech processing tasks, including automatic speech recognition (ASR) and dialect identification (DID).

This checkpoint corresponds to the **600M** variant pretrained on the **crawled 6k-hour dataset**.

- **Model type:** Self-supervised speech representation model
- **Architecture:** Conformer-based BEST-RQ encoder
- **Parameters:** ~600M (611.3M)
- **Training data:** Crawled Ara-BEST-RQ dataset (~5,639 hours)
- **Languages:** Arabic, including multiple dialects
- **Primary use:** Speech representation learning / downstream fine-tuning
### Architecture

The 600M Ara-BEST-RQ model uses:

- 24 Conformer encoder layers
- Model dimension: 1024
- 8 attention heads
- Feed-forward dimension: 4096
- GELU activations
- Layer normalization before attention (pre-norm)
- Relative positional multi-head attention
- A two-block convolutional front-end
- A random projection quantizer with 4096 codebook entries of dimension 16 (see the sketch below)
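
For intuition, the following is a minimal sketch of such a random projection quantizer in the spirit of BEST-RQ: a frozen random projection followed by a nearest-neighbour lookup in a frozen random codebook. The input feature dimension and initialization choices here are illustrative assumptions, not the exact pretraining configuration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
feat_dim, code_dim, num_codes = 80, 16, 4096  # feat_dim is an assumption

# Both the projection and the codebook are randomly initialized and
# kept frozen throughout pretraining.
projection = torch.empty(feat_dim, code_dim)
torch.nn.init.xavier_uniform_(projection)
codebook = F.normalize(torch.randn(num_codes, code_dim), dim=-1)

def quantize(features: torch.Tensor) -> torch.Tensor:
    """Map (batch, time, feat_dim) features to discrete target ids."""
    proj = F.normalize(features @ projection, dim=-1)
    # Nearest codebook entry under cosine distance = argmax of the dot
    # product after L2 normalization.
    return (proj @ codebook.T).argmax(dim=-1)

targets = quantize(torch.randn(2, 100, feat_dim))
print(targets.shape)  # torch.Size([2, 100]), ids in [0, 4096)
```

During pretraining, these ids serve as classification targets for the masked frames predicted by the Conformer encoder.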
## Training Data

The model was pretrained on the crawled Ara-BEST-RQ dataset: **5,639h 04m 27s** of Creative Commons speech data.

The released dataset on Hugging Face provides **metadata only**: YouTube video identifiers and audio segment boundaries. No audio or video files are distributed as part of the dataset.

Dataset link: [Elyadata/Ara-Best-RQ_dataset](https://huggingface.co/datasets/Elyadata/Ara-Best-RQ_dataset)
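
Because only metadata is distributed, the audio must be retrieved separately (e.g., from the referenced YouTube videos) before any pretraining run. A minimal sketch for inspecting the metadata is shown below; the split name is an assumption, so consult the dataset card for the actual schema.

```python
from datasets import load_dataset

# Load the metadata-only release: video identifiers and segment
# boundaries, no audio. The "train" split name is an assumption;
# check the dataset card for the actual configuration.
meta = load_dataset("Elyadata/Ara-Best-RQ_dataset", split="train")

print(meta)     # schema and number of segments
print(meta[0])  # one metadata record
```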
## Pretraining

The paper reports the following pretraining losses after 300k updates for this model:

| Training set | Train loss | Validation loss |
|---|---:|---:|
| Crawled | 3.53 | 3.70 |
## Evaluation

The paper evaluates Ara-BEST-RQ models on automatic speech recognition and dialect identification tasks. The following results are reported for the **Ara-BEST-RQ-600M-6k** model.

### Automatic Speech Recognition

Word error rates (WER, %) on ASR benchmarks:

| Dataset | WER (%) |
|---|---:|
| Common Voice 19.0 Arabic | 19.50 |
| MGB-3 | 30.83 |
| MGB-5 | 55.78 |
| TARIC-SLU | 22.41 |
| Average | 32.13 |

### Dialect Identification

Results on ADI-20:

| Split | Accuracy (%) | Weighted F1 (%) |
|---|---:|---:|
| Validation | 92.86 | 92.87 |
| Test | 91.05 | 91.04 |
## Usage

This is a self-supervised pretrained model intended to be used as a speech encoder or as an initialization checkpoint for downstream fine-tuning.

For training and fine-tuning recipes, please refer to the official implementation:

```bash
git clone https://github.com/elyadata/AraBEST-RQ
cd AraBEST-RQ
```

You can download the checkpoint from Hugging Face using:

```python
from huggingface_hub import snapshot_download

model_dir = snapshot_download("Elyadata/AraBEST-RQ-600M-6k")
print(model_dir)
```

Please refer to the repository configuration and the SpeechBrain recipes for the correct model-loading interface.
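
As a quick sanity check, you can list the contents of the downloaded snapshot. The file names `model.ckpt` and `normalizer.ckpt` follow the fine-tuning example below and may differ from the actual repository layout.

```python
import os
from huggingface_hub import snapshot_download

model_dir = snapshot_download("Elyadata/AraBEST-RQ-600M-6k")

# List the snapshot contents; verify the checkpoint file names against
# the repository before wiring them into a recipe.
for name in sorted(os.listdir(model_dir)):
    print(name)
```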
### Fine-tuning with SpeechBrain

To fine-tune this pretrained Ara-BEST-RQ checkpoint in a SpeechBrain recipe, adapt the `pretrainer` section of your YAML configuration so that it loads both the pretrained model checkpoint and the corresponding normalizer.

Example:

```yaml
pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
    collect_in: !ref <save_folder>
    loadables:
        pt_model: !ref <pt_model>
        normalize: !ref <normalize>
    paths:
        pt_model: !ref <pt_model_path>/model.ckpt
        normalize: !ref <pt_model_path>/normalizer.ckpt
```

In your downstream recipe, make sure that:

- `<pt_model>` points to the Ara-BEST-RQ pretrained model object used in your training graph.
- `<normalize>` points to the normalization module used by the recipe.
- `<pt_model_path>` points to the local directory containing `model.ckpt` and `normalizer.ckpt`.
- `<save_folder>` is the experiment directory where SpeechBrain should collect and manage pretrained components.

This setup allows SpeechBrain to initialize the downstream model from the Ara-BEST-RQ SSL checkpoint before fine-tuning on task-specific data.
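
At training time, a SpeechBrain recipe typically triggers the parameter transfer right after the hyperparameters are loaded. A minimal sketch follows, assuming a hypothetical recipe file `hparams/finetune.yaml` built on the YAML above; exact call sites vary across SpeechBrain versions and recipes.

```python
from hyperpyyaml import load_hyperpyyaml

# Load the recipe hyperparameters, including the pretrainer defined above.
# The YAML path is a hypothetical placeholder for your own recipe file.
with open("hparams/finetune.yaml") as f:
    hparams = load_hyperpyyaml(f)

# Fetch the checkpoint files into <save_folder>, then load the collected
# parameters into the corresponding model objects before training starts.
hparams["pretrainer"].collect_files()
hparams["pretrainer"].load_collected()
```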
## Citation

If you use this model, please cite the Ara-BEST-RQ paper:

```bibtex
@misc{elleuch2026arabestrqmultidialectalarabic,
      title={Ara-Best-RQ: Multi Dialectal Arabic SSL},
      author={Haroun Elleuch and Ryan Whetten and Salima Mdhaffar and Yannick Estève and Fethi Bougares},
      year={2026},
      eprint={2603.21900},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.21900},
}
```