---
datasets:
- Elyadata/Ara-Best-RQ_dataset
language:
- ar
library_name: speechbrain
tags:
- speech
- ssl
- arabic
- dialect
---

# Ara-BEST-RQ-600M-14k

**Ara-BEST-RQ-600M-14k** is a 600M-parameter self-supervised speech representation model for Arabic and Arabic dialects. It is part of the Ara-BEST-RQ family introduced in **[Ara-Best-RQ: Multi Dialectal Arabic SSL](https://arxiv.org/abs/2603.21900)**.

This model was pretrained on the **combined Ara-BEST-RQ dataset**: 13,723h 08m 43s of speech, combining the crawled Ara-BEST-RQ data with other publicly available datasets.

- **Paper:** [Ara-Best-RQ: Multi Dialectal Arabic SSL](https://arxiv.org/abs/2603.21900)
- **Dataset:** [Elyadata/Ara-Best-RQ_dataset](https://huggingface.co/datasets/Elyadata/Ara-Best-RQ_dataset)
- **Implementation:** [elyadata/AraBEST-RQ](https://github.com/elyadata/AraBEST-RQ)

## Model Details

### Model Description

Ara-BEST-RQ is a family of Arabic-focused self-supervised learning (SSL) speech models based on the BEST-RQ framework. The models are designed to learn speech representations that transfer well to Arabic speech processing tasks, including automatic speech recognition (ASR) and dialect identification (DID).

This checkpoint corresponds to the **600M** variant pretrained on the **combined 14k-hour dataset**.
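As a back-of-the-envelope check of the ~600M figure, the encoder dimensions listed under Architecture below can be plugged into a rough per-layer parameter count. This sketch assumes a standard macaron-style Conformer block (two half-step feed-forward modules, one self-attention module, one GLU convolution module) and ignores biases, norms, the depthwise convolution, and the front-end; it is an approximation, not the model's exact breakdown.

```python
# Rough parameter estimate for a 24-layer macaron-style Conformer encoder
# with the dimensions stated in this card. Approximation only.
d = 1024      # model dimension
ffn = 4096    # feed-forward dimension
layers = 24

attn = 4 * d * d            # Q, K, V, and output projections
ff = 2 * (2 * d * ffn)      # two macaron feed-forward modules per block
conv = d * (2 * d) + d * d  # GLU pointwise conv (d -> 2d) + pointwise conv (d -> d)

total = layers * (attn + ff + conv)
print(f"~{total / 1e6:.1f}M encoder parameters")  # ~578.8M
```

The remaining gap to the reported 611.6M is plausibly accounted for by the convolutional front-end, biases, layer norms, and the quantizer projection, which this sketch omits.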
- **Model type:** Self-supervised speech representation model
- **Architecture:** Conformer-based BEST-RQ encoder
- **Parameters:** ~600M (611.6M)
- **Training data:** combined Ara-BEST-RQ dataset
- **Languages:** Arabic, including multiple dialects
- **Primary use:** Speech representation learning / downstream fine-tuning

### Architecture

The 600M Ara-BEST-RQ model uses:

- 24 Conformer encoder layers
- Model dimension: 1024
- 8 attention heads
- Feed-forward dimension: 4096
- GELU activations
- Layer normalization before attention
- Relative position multi-head attention
- Convolutional front-end with two blocks
- Random projection quantizer with 4096 codebook entries of dimension 16

## Training Data

The model was pretrained on the combined Ara-BEST-RQ dataset: **13,723h 08m 43s** of speech data. The combined set includes the crawled Ara-BEST-RQ data together with other publicly available datasets described in the paper.

The released dataset on Hugging Face provides **metadata only**: YouTube video identifiers and audio segment boundaries. No audio or video files are distributed as part of the dataset.

Dataset link: [Elyadata/Ara-Best-RQ_dataset](https://huggingface.co/datasets/Elyadata/Ara-Best-RQ_dataset)

## Pretraining

The paper reports the following pretraining losses after 300k updates for this model:

| Training set | Train loss | Validation loss |
|---|---:|---:|
| Combined | 3.57 | 3.40 |

## Evaluation

The paper evaluates Ara-BEST-RQ models on automatic speech recognition and dialect identification tasks. The following results are reported for the **Ara-BEST-RQ-600M-14k** model.
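As a quick sanity check, the "Average" row in the ASR table of this card is the unweighted mean of the four benchmark WER scores (numbers copied from the table):

```python
# WER scores reported in this card's ASR results table
wers = {
    "Common Voice 19.0 Arabic": 18.59,
    "MGB-3": 28.78,
    "MGB-5": 54.54,
    "TARIC-SLU": 21.14,
}

# Unweighted mean across the four benchmarks
average = sum(wers.values()) / len(wers)
print(f"Average WER: {average:.2f}")  # Average WER: 30.76
```

This matches the reported average of 30.76, confirming it is a simple macro-average rather than a duration- or utterance-weighted one.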
### Automatic Speech Recognition

WER scores on ASR benchmarks:

| Dataset | WER |
|---|---:|
| Common Voice 19.0 Arabic | 18.59 |
| MGB-3 | 28.78 |
| MGB-5 | 54.54 |
| TARIC-SLU | 21.14 |
| Average | 30.76 |

### Dialect Identification

Results on ADI-20:

| Split | Accuracy | Weighted F1 |
|---|---:|---:|
| Validation | 94.66 | 94.71 |
| Test | 92.05 | 92.07 |

## Usage

This is a self-supervised pretrained model intended to be used as a speech encoder or as an initialization checkpoint for downstream fine-tuning. For training and fine-tuning recipes, please refer to the official implementation:

```bash
git clone https://github.com/elyadata/AraBEST-RQ
cd AraBEST-RQ
```

You can download the checkpoint from Hugging Face using:

```python
from huggingface_hub import snapshot_download

model_dir = snapshot_download("Elyadata/AraBEST-RQ-600M-14k")
print(model_dir)
```

Please refer to the repository configuration and SpeechBrain recipes for the correct model-loading interface.

### Fine-tuning with SpeechBrain

To fine-tune this pretrained Ara-BEST-RQ checkpoint in a SpeechBrain recipe, adapt the `pretrainer` section of your YAML configuration so that it loads both the pretrained model checkpoint and the corresponding normalizer. The placeholder names below (e.g. `<pt_model>`, `<save_folder>`) follow SpeechBrain YAML conventions; match them to the names used in your own recipe. Example:

```yaml
pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
    collect_in: !ref <save_folder>
    loadables:
        pt_model: !ref <pt_model>
        normalize: !ref <normalize>
    paths:
        pt_model: !ref <pt_model_path>/model.ckpt
        normalize: !ref <pt_model_path>/normalizer.ckpt
```

In your downstream recipe, make sure that:

- `<pt_model>` points to the Ara-BEST-RQ pretrained model object used in your training graph.
- `<normalize>` points to the normalization module used by the recipe.
- `<pt_model_path>` points to the local directory containing `model.ckpt` and `normalizer.ckpt`.
- `<save_folder>` is the experiment directory where SpeechBrain should collect and manage pretrained components.

This setup allows SpeechBrain to initialize the downstream model from the Ara-BEST-RQ SSL checkpoint before fine-tuning on task-specific data.
## Citation

If you use this model, please cite the Ara-BEST-RQ paper:

```bibtex
@misc{elleuch2026arabestrqmultidialectalarabic,
  title={Ara-Best-RQ: Multi Dialectal Arabic SSL},
  author={Haroun Elleuch and Ryan Whetten and Salima Mdhaffar and Yannick Estève and Fethi Bougares},
  year={2026},
  eprint={2603.21900},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.21900},
}
```