---
datasets:
- Elyadata/Ara-Best-RQ_dataset
language:
- ar
library_name: speechbrain
tags:
- speech
- ssl
- arabic
- dialect
---
# Ara-BEST-RQ-600M-14k
**Ara-BEST-RQ-600M-14k** is a 600M-parameter self-supervised speech representation model for Arabic and Arabic dialects. It is part of the Ara-BEST-RQ family introduced in **[Ara-Best-RQ: Multi Dialectal Arabic SSL](https://arxiv.org/abs/2603.21900)**.
This model was pretrained on the **combined Ara-BEST-RQ dataset**: 13,723h 08m 43s of speech, combining the crawled Ara-BEST-RQ data with other publicly available datasets.
- **Paper:** [Ara-Best-RQ: Multi Dialectal Arabic SSL](https://arxiv.org/abs/2603.21900)
- **Dataset:** [Elyadata/Ara-Best-RQ_dataset](https://huggingface.co/datasets/Elyadata/Ara-Best-RQ_dataset)
- **Implementation:** [elyadata/AraBEST-RQ](https://github.com/elyadata/AraBEST-RQ)
## Model Details
### Model Description
Ara-BEST-RQ is a family of Arabic-focused self-supervised learning (SSL) speech models based on the BEST-RQ framework. The models are designed to learn speech representations that transfer well to Arabic speech processing tasks, including automatic speech recognition (ASR) and dialect identification (DID).
This checkpoint corresponds to the **600M** variant pretrained on the **combined 14k-hour dataset**.
- **Model type:** Self-supervised speech representation model
- **Architecture:** Conformer-based BEST-RQ encoder
- **Parameters:** ~600M (611.6M)
- **Training data:** combined Ara-BEST-RQ dataset
- **Languages:** Arabic, including multiple dialects
- **Primary use:** Speech representation learning / downstream fine-tuning
### Architecture
The 600M Ara-BEST-RQ model uses:
- 24 Conformer encoder layers
- Model dimension: 1024
- 8 attention heads
- Feed-forward dimension: 4096
- GELU activations
- Layer normalization before attention
- Relative position multi-head attention
- Convolutional front-end with two blocks
- Random projection quantizer with 4096 codebook entries of dimension 16
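The random projection quantizer is the defining component of BEST-RQ: a frozen (never-trained) random matrix projects input frames into the codebook space, and each frame's pretraining label is the index of the nearest codebook entry. A minimal NumPy sketch of this idea, using the codebook size (4096) and dimension (16) from the list above; the input feature dimension of 80 is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim = 80          # assumed input feature dimension (e.g. Mel filterbanks)
code_dim = 16          # codebook entry dimension, per the model card
codebook_size = 4096   # number of codebook entries, per the model card

# Both the projection and the codebook are randomly initialized and frozen.
projection = rng.normal(size=(feat_dim, code_dim))
codebook = rng.normal(size=(codebook_size, code_dim))
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)  # l2-normalized entries

def quantize(features: np.ndarray) -> np.ndarray:
    """Map a (T, feat_dim) array of frames to (T,) codebook indices."""
    z = features @ projection
    z /= np.linalg.norm(z, axis=1, keepdims=True)
    # With normalized vectors, the nearest entry is the one with the
    # highest cosine similarity.
    return np.argmax(z @ codebook.T, axis=1)

labels = quantize(rng.normal(size=(100, feat_dim)))
```

During pretraining, these discrete labels serve as targets: the encoder must predict the codebook index of masked frames.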
## Training Data
The model was pretrained on the combined Ara-BEST-RQ dataset: **13,723h 08m 43s** of speech data. The combined set includes the crawled Ara-BEST-RQ data together with other publicly available datasets described in the paper.
The released dataset on Hugging Face provides **metadata only**: YouTube video identifiers and audio segment boundaries. No audio or video files are distributed as part of the dataset.
Dataset link: [Elyadata/Ara-Best-RQ_dataset](https://huggingface.co/datasets/Elyadata/Ara-Best-RQ_dataset)
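Since the dataset ships only video identifiers and segment boundaries, reconstructing the audio requires downloading each video yourself and slicing it at the given boundaries. A minimal sketch of the boundary-to-sample conversion, assuming a hypothetical metadata row and a 16 kHz sample rate (the actual schema and sample rate may differ):

```python
# Hypothetical metadata row; the real dataset schema may differ.
row = {"video_id": "xxxxxxxxxxx", "start": 12.48, "end": 17.92}

SAMPLE_RATE = 16_000  # assumed pretraining sample rate

def segment_to_samples(start_s: float, end_s: float, sample_rate: int) -> tuple[int, int]:
    """Convert segment boundaries in seconds to sample indices."""
    return int(round(start_s * sample_rate)), int(round(end_s * sample_rate))

first, last = segment_to_samples(row["start"], row["end"], SAMPLE_RATE)
# Once the corresponding audio is downloaded and resampled:
# segment = audio[first:last]
```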
## Pretraining
The paper reports the following pretraining losses after 300k updates for this model:
| Training set | Train loss | Validation loss |
|---|---:|---:|
| Combined | 3.57 | 3.40 |
## Evaluation
The paper evaluates Ara-BEST-RQ models on automatic speech recognition and dialect identification tasks. The following results are reported for the **Ara-BEST-RQ-600M-14k** model.
### Automatic Speech Recognition
Word error rates (WER, %; lower is better) on Arabic ASR benchmarks:
| Dataset | WER |
|---|---:|
| Common Voice 19.0 Arabic | 18.59 |
| MGB-3 | 28.78 |
| MGB-5 | 54.54 |
| TARIC-SLU | 21.14 |
| Average | 30.76 |
### Dialect Identification
Results on the ADI-20 dialect identification benchmark (accuracy and weighted F1, %):
| Split | Accuracy | Weighted F1 |
|---|---:|---:|
| Validation | 94.66 | 94.71 |
| Test | 92.05 | 92.07 |
## Usage
This is a self-supervised pretrained model intended to be used as a speech encoder or as an initialization checkpoint for downstream fine-tuning.
For training and fine-tuning recipes, please refer to the official implementation:
```bash
git clone https://github.com/elyadata/AraBEST-RQ
cd AraBEST-RQ
```
You can download the checkpoint from Hugging Face using:
```python
from huggingface_hub import snapshot_download
model_dir = snapshot_download("Elyadata/AraBEST-RQ-600M-14k")
print(model_dir)
```
Please refer to the repository configuration and SpeechBrain recipes for the correct model-loading interface.
### Fine-tuning with SpeechBrain
To fine-tune this pretrained Ara-BEST-RQ checkpoint in a SpeechBrain recipe, adapt the `pretrainer` section of your YAML configuration so that it loads both the pretrained model checkpoint and the corresponding normalizer.
Example:
```yaml
pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
collect_in: !ref <save_folder>
loadables:
pt_model: !ref <pt_model>
normalize: !ref <normalize>
paths:
pt_model: !ref <pt_model_path>/model.ckpt
normalize: !ref <pt_model_path>/normalizer.ckpt
```
In your downstream recipe, make sure that:
- `<pt_model>` points to the Ara-BEST-RQ pretrained model object used in your training graph.
- `<normalize>` points to the normalization module used by the recipe.
- `<pt_model_path>` points to the local directory containing `model.ckpt` and `normalizer.ckpt`.
- `<save_folder>` is the experiment directory where SpeechBrain should collect and manage pretrained components.
This setup allows SpeechBrain to initialize the downstream model from the Ara-BEST-RQ SSL checkpoint before fine-tuning on task-specific data.
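Concretely, the `!ref <pt_model_path>/...` placeholders in the YAML above resolve to plain file paths inside the downloaded checkpoint directory. A minimal sketch of that resolution (the directory name is hypothetical; the `model.ckpt` and `normalizer.ckpt` filenames come from the YAML example):

```python
import os

# Hypothetical local checkpoint directory, e.g. the one returned by
# snapshot_download() in the usage example above.
pt_model_path = "results/ara_best_rq_600m_14k"

# These mirror the `paths` section of the Pretrainer YAML.
paths = {
    "pt_model": os.path.join(pt_model_path, "model.ckpt"),
    "normalize": os.path.join(pt_model_path, "normalizer.ckpt"),
}
```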
## Citation
If you use this model, please cite the Ara-BEST-RQ paper:
```bibtex
@misc{elleuch2026arabestrqmultidialectalarabic,
title={Ara-Best-RQ: Multi Dialectal Arabic SSL},
author={Haroun Elleuch and Ryan Whetten and Salima Mdhaffar and Yannick Estève and Fethi Bougares},
year={2026},
eprint={2603.21900},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2603.21900},
}
```