---
datasets:
- Elyadata/Ara-Best-RQ_dataset
language:
- ar
library_name: speechbrain
tags:
- speech
- ssl
- arabic
- dialect
---

# Ara-BEST-RQ-300M-6k

**Ara-BEST-RQ-300M-6k** is a 300M-parameter self-supervised speech representation model for Arabic and Arabic dialects. It is part of the Ara-BEST-RQ family introduced in **[Ara-Best-RQ: Multi Dialectal Arabic SSL](https://arxiv.org/abs/2603.21900)**.

This model was pretrained on the **crawled Ara-BEST-RQ dataset**: approximately **5.6k hours** of Creative Commons Arabic speech collected from publicly available YouTube videos and segmented for self-supervised speech learning.

- **Paper:** [Ara-Best-RQ: Multi Dialectal Arabic SSL](https://arxiv.org/abs/2603.21900)
- **Dataset:** [Elyadata/Ara-Best-RQ_dataset](https://huggingface.co/datasets/Elyadata/Ara-Best-RQ_dataset)
- **Implementation:** [elyadata/AraBEST-RQ](https://github.com/elyadata/AraBEST-RQ)

## Model Details

### Model Description

Ara-BEST-RQ is a family of Arabic-focused self-supervised learning (SSL) speech models based on the BEST-RQ framework. The models are designed to learn speech representations that transfer well to Arabic speech processing tasks, including automatic speech recognition (ASR) and dialect identification (DID).

This checkpoint corresponds to the **300M** variant pretrained on the **crawled 6k-hour dataset**.

- **Model type:** Self-supervised speech representation model
- **Architecture:** Conformer-based BEST-RQ encoder
- **Parameters:** ~300M
- **Training data:** Crawled Arabic speech data
- **Training hours:** ~5,640 hours
- **Languages:** Arabic, including multiple dialects
- **Primary use:** Speech representation learning / downstream fine-tuning

### Architecture

The 300M Ara-BEST-RQ model uses the following configuration (a minimal sketch of the random-projection quantizer follows the list):

- 24 Conformer encoder layers
- Model dimension: 848
- 8 attention heads
- Feed-forward dimension: 2048
- GELU activations
- Relative positional multi-head attention
- Convolutional front-end
- Random projection quantizer with 4096 codebook entries
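
In BEST-RQ, the random-projection quantizer produces the discrete pretraining targets: a frozen random projection maps each feature frame into a small embedding space, and the index of the nearest entry in a frozen random codebook is the label the Conformer must predict for masked frames. The snippet below is a minimal PyTorch sketch of that idea, not the Ara-BEST-RQ implementation; the 4096-entry codebook matches the list above, while the input and codebook dimensions are illustrative placeholders.

```python
import torch

class RandomProjectionQuantizer(torch.nn.Module):
    """Minimal BEST-RQ-style quantizer: frozen projection + frozen codebook."""

    def __init__(self, input_dim=80, codebook_dim=16, codebook_size=4096):
        super().__init__()
        # Both the projection and the codebook are randomly initialized and
        # never trained, as in BEST-RQ.
        projection = torch.empty(input_dim, codebook_dim)
        torch.nn.init.xavier_uniform_(projection)
        self.register_buffer("projection", projection)
        codebook = torch.nn.functional.normalize(
            torch.randn(codebook_size, codebook_dim), dim=-1
        )
        self.register_buffer("codebook", codebook)

    def forward(self, feats):
        # feats: (batch, time, input_dim) speech features.
        proj = torch.nn.functional.normalize(feats @ self.projection, dim=-1)
        # Nearest codebook entry (cosine distance on normalized vectors)
        # gives the discrete target for each frame.
        sims = proj @ self.codebook.t()   # (batch, time, codebook_size)
        return sims.argmax(dim=-1)        # (batch, time) target indices


# Example with the 4096-entry codebook listed above; input_dim is a placeholder.
quantizer = RandomProjectionQuantizer(input_dim=80, codebook_size=4096)
targets = quantizer(torch.randn(2, 100, 80))
print(targets.shape)  # torch.Size([2, 100])
```

Because both the projection and the codebook stay frozen, the quantizer contributes no trainable parameters; only the encoder (and its prediction head) is learned during pretraining.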

## Training Data

The model was pretrained on the crawled Ara-BEST-RQ dataset.

The released dataset on Hugging Face provides **metadata only**: YouTube video identifiers and audio segment boundaries. No audio or video files are distributed as part of the dataset.

Dataset link: [Elyadata/Ara-Best-RQ_dataset](https://huggingface.co/datasets/Elyadata/Ara-Best-RQ_dataset)
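
Because only metadata is released, a typical workflow is to read the segment table from the Hub and then retrieve and cut the referenced audio yourself. The snippet below sketches the first step; it assumes the metadata can be read with the standard `datasets` API, and the split and column names should be checked against the dataset card.

```python
from datasets import load_dataset

# Load the segment metadata only (video identifiers and segment boundaries);
# no audio is downloaded here. Split/column names are not guaranteed, so we
# simply inspect whatever the dataset exposes.
meta = load_dataset("Elyadata/Ara-Best-RQ_dataset")

first_split = next(iter(meta))
print(meta)                  # available splits and their columns
print(meta[first_split][0])  # first metadata record
```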

## Evaluation

The paper evaluates Ara-BEST-RQ models on automatic speech recognition and dialect identification tasks. The following results are reported for the **Ara-BEST-RQ crawled 300M** model.

### Automatic Speech Recognition

Word error rate (WER, %) on ASR benchmarks:

| Dataset | WER |
|---|---:|
| Common Voice 19.0 Arabic | 18.67 |
| MGB-3 | 30.85 |
| MGB-5 | 54.18 |
| TARIC-SLU | 23.98 |
| Average | 31.92 |

### Dialect Identification

Results on ADI-20:

| Split | Accuracy (%) | Weighted F1 (%) |
|---|---:|---:|
| Validation | 97.21 | 97.17 |
| Test | 96.02 | 95.98 |

## Usage

This is a self-supervised pretrained model intended to be used as a speech encoder or as an initialization checkpoint for downstream fine-tuning.

For training and fine-tuning recipes, please refer to the official implementation:

```bash
git clone https://github.com/elyadata/AraBEST-RQ
cd AraBEST-RQ
```

You can download the checkpoint from Hugging Face using:

```python
from huggingface_hub import snapshot_download

model_dir = snapshot_download("Elyadata/AraBEST-RQ-300M-6k")
print(model_dir)
```

Please refer to the repository configuration and SpeechBrain recipes for the correct model-loading interface.
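
As a quick sanity check after downloading, the snapshot contents can be listed and the checkpoint tensors inspected with plain PyTorch. This sketch assumes the snapshot ships the `model.ckpt` file referenced in the fine-tuning YAML below; the actual file layout may differ, so treat the path as a placeholder.

```python
import os

import torch

# `model_dir` comes from the snapshot_download call above.
print(sorted(os.listdir(model_dir)))

# Assumption: the snapshot contains the `model.ckpt` referenced in the
# fine-tuning YAML below. Inspect its tensors without building the model.
ckpt_path = os.path.join(model_dir, "model.ckpt")
if os.path.exists(ckpt_path):
    state = torch.load(ckpt_path, map_location="cpu")
    n_params = sum(t.numel() for t in state.values() if torch.is_tensor(t))
    print(f"{len(state)} entries, ~{n_params / 1e6:.0f}M parameters")
```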

### Fine-tuning with SpeechBrain

To fine-tune this pretrained Ara-BEST-RQ checkpoint in a SpeechBrain recipe, adapt the `pretrainer` section of your YAML configuration so that it loads both the pretrained model checkpoint and the corresponding normalizer.

Example:

```yaml
pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
  collect_in: !ref <save_folder>
  loadables:
    pt_model: !ref <pt_model>
    normalize: !ref <normalize>
  paths:
    pt_model: !ref <pt_model_path>/model.ckpt
    normalize: !ref <pt_model_path>/normalizer.ckpt
```

In your downstream recipe, make sure that:

- `<pt_model>` points to the Ara-BEST-RQ pretrained model object used in your training graph.
- `<normalize>` points to the normalization module used by the recipe.
- `<pt_model_path>` points to the local directory containing `model.ckpt` and `normalizer.ckpt`.
- `<save_folder>` is the experiment directory where SpeechBrain should collect and manage pretrained components.

This setup allows SpeechBrain to initialize the downstream model from the Ara-BEST-RQ SSL checkpoint before fine-tuning on task-specific data.
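
On the Python side, a downstream recipe typically collects and loads these pretrained parameters before training starts. The sketch below follows the usual SpeechBrain pattern; `hparams.yaml` is a placeholder for your full recipe hyperparameter file, which must define `pt_model`, `normalize`, and the other references used above.

```python
from hyperpyyaml import load_hyperpyyaml

# Load the full recipe hyperparameters, including the `pretrainer`
# entry shown above. "hparams.yaml" is a placeholder file name.
with open("hparams.yaml") as f:
    hparams = load_hyperpyyaml(f)

# Fetch model.ckpt / normalizer.ckpt into <save_folder>, then load them
# into the `pt_model` and `normalize` modules before fine-tuning.
hparams["pretrainer"].collect_files()
hparams["pretrainer"].load_collected()
```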

## Citation

If you use this model, please cite the Ara-BEST-RQ paper:

```bibtex
@misc{elleuch2026arabestrqmultidialectalarabic,
  title={Ara-Best-RQ: Multi Dialectal Arabic SSL},
  author={Haroun Elleuch and Ryan Whetten and Salima Mdhaffar and Yannick Estève and Fethi Bougares},
  year={2026},
  eprint={2603.21900},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.21900},
}
```