---
datasets:
- Elyadata/Ara-Best-RQ_dataset
language:
- ar
library_name: speechbrain
tags:
- speech
- ssl
- arabic
- dialect
---

# Ara-BEST-RQ-300M-6k

**Ara-BEST-RQ-300M-6k** is a 300M-parameter self-supervised speech representation model for Arabic and Arabic dialects. It is part of the Ara-BEST-RQ family introduced in **[Ara-Best-RQ: Multi Dialectal Arabic SSL](https://arxiv.org/abs/2603.21900)**.

This model was pretrained on the **crawled Ara-BEST-RQ dataset**: approximately **5.6k hours** of Creative Commons Arabic speech collected from publicly available YouTube videos and segmented for self-supervised speech learning.

- **Paper:** [Ara-Best-RQ: Multi Dialectal Arabic SSL](https://arxiv.org/abs/2603.21900)
- **Dataset:** [Elyadata/Ara-Best-RQ_dataset](https://huggingface.co/datasets/Elyadata/Ara-Best-RQ_dataset)
- **Implementation:** [elyadata/AraBEST-RQ](https://github.com/elyadata/AraBEST-RQ)

## Model Details

### Model Description

Ara-BEST-RQ is a family of Arabic-focused self-supervised learning (SSL) speech models based on the BEST-RQ framework. The models are designed to learn speech representations that transfer well to Arabic speech processing tasks, including automatic speech recognition (ASR) and dialect identification (DID).

This checkpoint corresponds to the **300M** variant pretrained on the **crawled 6k-hour dataset**.

- **Model type:** Self-supervised speech representation model
- **Architecture:** Conformer-based BEST-RQ encoder
- **Parameters:** ~300M
- **Training data:** Crawled Arabic speech data
- **Training hours:** ~5,640 hours
- **Languages:** Arabic, including multiple dialects
- **Primary use:** Speech representation learning / downstream fine-tuning

### Architecture

The 300M Ara-BEST-RQ model uses the following configuration (a minimal sketch of the random-projection quantizer follows the list):

- 24 Conformer encoder layers
- Model dimension: 848
- 8 attention heads
- Feed-forward dimension: 2048
- GELU activations
- Relative positional multi-head attention
- Convolutional front-end
- Random projection quantizer with 4096 codebook entries
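
In BEST-RQ, the random-projection quantizer produces the discrete pretraining targets: a frozen random projection maps each feature frame into a small embedding space, and the index of the nearest entry in a frozen random codebook is the label the Conformer must predict for masked frames. The snippet below is a minimal PyTorch sketch of that idea, not the Ara-BEST-RQ implementation; the 4096-entry codebook matches the list above, while the input and codebook dimensions are illustrative placeholders.

```python
import torch

class RandomProjectionQuantizer(torch.nn.Module):
    """Minimal BEST-RQ-style quantizer: frozen projection + frozen codebook."""

    def __init__(self, input_dim=80, codebook_dim=16, codebook_size=4096):
        super().__init__()
        # Both the projection and the codebook are randomly initialized and
        # never trained, as in BEST-RQ.
        projection = torch.empty(input_dim, codebook_dim)
        torch.nn.init.xavier_uniform_(projection)
        self.register_buffer("projection", projection)
        codebook = torch.nn.functional.normalize(
            torch.randn(codebook_size, codebook_dim), dim=-1
        )
        self.register_buffer("codebook", codebook)

    def forward(self, feats):
        # feats: (batch, time, input_dim) speech features.
        proj = torch.nn.functional.normalize(feats @ self.projection, dim=-1)
        # Nearest codebook entry (cosine distance on normalized vectors)
        # gives the discrete target for each frame.
        sims = proj @ self.codebook.t()   # (batch, time, codebook_size)
        return sims.argmax(dim=-1)        # (batch, time) target indices


# Example with the 4096-entry codebook listed above; input_dim is a placeholder.
quantizer = RandomProjectionQuantizer(input_dim=80, codebook_size=4096)
targets = quantizer(torch.randn(2, 100, 80))
print(targets.shape)  # torch.Size([2, 100])
```

Because both the projection and the codebook stay frozen, the quantizer contributes no trainable parameters; only the encoder (and its prediction head) is learned during pretraining.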

## Training Data

The model was pretrained on the crawled Ara-BEST-RQ dataset.

The released dataset on Hugging Face provides **metadata only**: YouTube video identifiers and audio segment boundaries. No audio or video files are distributed as part of the dataset.

Dataset link: [Elyadata/Ara-Best-RQ_dataset](https://huggingface.co/datasets/Elyadata/Ara-Best-RQ_dataset)
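
Because only metadata is released, a typical workflow is to read the segment table from the Hub and then retrieve and cut the referenced audio yourself. The snippet below sketches the first step; it assumes the metadata can be read with the standard `datasets` API, and the split and column names should be checked against the dataset card.

```python
from datasets import load_dataset

# Load the segment metadata only (video identifiers and segment boundaries);
# no audio is downloaded here. Split/column names are not guaranteed, so we
# simply inspect whatever the dataset exposes.
meta = load_dataset("Elyadata/Ara-Best-RQ_dataset")

first_split = next(iter(meta))
print(meta)                  # available splits and their columns
print(meta[first_split][0])  # first metadata record
```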

## Evaluation

The paper evaluates Ara-BEST-RQ models on automatic speech recognition and dialect identification tasks. The following results are reported for the **Ara-BEST-RQ crawled 300M** model.

### Automatic Speech Recognition

Word error rate (WER, %) on ASR benchmarks:

| Dataset | WER |
|---|---:|
| Common Voice 19.0 Arabic | 18.67 |
| MGB-3 | 30.85 |
| MGB-5 | 54.18 |
| TARIC-SLU | 23.98 |
| Average | 31.92 |

### Dialect Identification

Results on ADI-20:

| Split | Accuracy (%) | Weighted F1 (%) |
|---|---:|---:|
| Validation | 97.21 | 97.17 |
| Test | 96.02 | 95.98 |

## Usage

This is a self-supervised pretrained model intended to be used as a speech encoder or as an initialization checkpoint for downstream fine-tuning.

For training and fine-tuning recipes, please refer to the official implementation:

```bash
git clone https://github.com/elyadata/AraBEST-RQ
cd AraBEST-RQ
```

You can download the checkpoint from Hugging Face using:

```python
from huggingface_hub import snapshot_download

model_dir = snapshot_download("Elyadata/AraBEST-RQ-300M-6k")
print(model_dir)
```

Please refer to the repository configuration and SpeechBrain recipes for the correct model-loading interface.
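
As a quick sanity check after downloading, the snapshot contents can be listed and the checkpoint tensors inspected with plain PyTorch. This sketch assumes the snapshot ships the `model.ckpt` file referenced in the fine-tuning YAML below; the actual file layout may differ, so treat the path as a placeholder.

```python
import os

import torch

# `model_dir` comes from the snapshot_download call above.
print(sorted(os.listdir(model_dir)))

# Assumption: the snapshot contains the `model.ckpt` referenced in the
# fine-tuning YAML below. Inspect its tensors without building the model.
ckpt_path = os.path.join(model_dir, "model.ckpt")
if os.path.exists(ckpt_path):
    state = torch.load(ckpt_path, map_location="cpu")
    n_params = sum(t.numel() for t in state.values() if torch.is_tensor(t))
    print(f"{len(state)} entries, ~{n_params / 1e6:.0f}M parameters")
```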

### Fine-tuning with SpeechBrain

To fine-tune this pretrained Ara-BEST-RQ checkpoint in a SpeechBrain recipe, adapt the `pretrainer` section of your YAML configuration so that it loads both the pretrained model checkpoint and the corresponding normalizer.

Example:

```yaml
pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
  collect_in: !ref <save_folder>
  loadables:
    pt_model: !ref <pt_model>
    normalize: !ref <normalize>
  paths:
    pt_model: !ref <pt_model_path>/model.ckpt
    normalize: !ref <pt_model_path>/normalizer.ckpt
```

In your downstream recipe, make sure that:

- `<pt_model>` points to the Ara-BEST-RQ pretrained model object used in your training graph.
- `<normalize>` points to the normalization module used by the recipe.
- `<pt_model_path>` points to the local directory containing `model.ckpt` and `normalizer.ckpt`.
- `<save_folder>` is the experiment directory where SpeechBrain should collect and manage pretrained components.

This setup allows SpeechBrain to initialize the downstream model from the Ara-BEST-RQ SSL checkpoint before fine-tuning on task-specific data.
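
On the Python side, a downstream recipe typically collects and loads these pretrained parameters before training starts. The sketch below follows the usual SpeechBrain pattern; `hparams.yaml` is a placeholder for your full recipe hyperparameter file, which must define `pt_model`, `normalize`, and the other references used above.

```python
from hyperpyyaml import load_hyperpyyaml

# Load the full recipe hyperparameters, including the `pretrainer`
# entry shown above. "hparams.yaml" is a placeholder file name.
with open("hparams.yaml") as f:
    hparams = load_hyperpyyaml(f)

# Fetch model.ckpt / normalizer.ckpt into <save_folder>, then load them
# into the `pt_model` and `normalize` modules before fine-tuning.
hparams["pretrainer"].collect_files()
hparams["pretrainer"].load_collected()
```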

## Citation

If you use this model, please cite the Ara-BEST-RQ paper:

```bibtex
@misc{elleuch2026arabestrqmultidialectalarabic,
  title={Ara-Best-RQ: Multi Dialectal Arabic SSL},
  author={Haroun Elleuch and Ryan Whetten and Salima Mdhaffar and Yannick Estève and Fethi Bougares},
  year={2026},
  eprint={2603.21900},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.21900},
}
```