HarounElleuch committed · Commit cee6d09 (verified) · 1 parent: 294b1b0

Create README.md

  1. README.md +159 -0
README.md ADDED
@@ -0,0 +1,159 @@
---
datasets:
- Elyadata/Ara-Best-RQ_dataset
language:
- ar
library_name: speechbrain
tags:
- speech
- ssl
- arabic
- dialect
---

# Ara-BEST-RQ-600M-14k

**Ara-BEST-RQ-600M-14k** is a 600M-parameter self-supervised speech representation model for Arabic and Arabic dialects. It is part of the Ara-BEST-RQ family introduced in **[Ara-Best-RQ: Multi Dialectal Arabic SSL](https://arxiv.org/abs/2603.21900)**.

This model was pretrained on the **combined Ara-BEST-RQ dataset**: 13,723h 08m 43s of speech, made up of the crawled Ara-BEST-RQ data and other publicly available datasets.

- **Paper:** [Ara-Best-RQ: Multi Dialectal Arabic SSL](https://arxiv.org/abs/2603.21900)
- **Dataset:** [Elyadata/Ara-Best-RQ_dataset](https://huggingface.co/datasets/Elyadata/Ara-Best-RQ_dataset)
- **Implementation:** [elyadata/AraBEST-RQ](https://github.com/elyadata/AraBEST-RQ)

## Model Details

### Model Description

Ara-BEST-RQ is a family of Arabic-focused self-supervised learning (SSL) speech models based on the BEST-RQ framework. The models are designed to learn speech representations that transfer well to Arabic speech processing tasks, including automatic speech recognition (ASR) and dialect identification (DID).

This checkpoint corresponds to the **600M** variant pretrained on the **combined 14k-hour dataset**.

- **Model type:** Self-supervised speech representation model
- **Architecture:** Conformer-based BEST-RQ encoder
- **Parameters:** ~600M (611.6M)
- **Training data:** combined Ara-BEST-RQ dataset
- **Languages:** Arabic, including multiple dialects
- **Primary use:** Speech representation learning / downstream fine-tuning

### Architecture

The 600M Ara-BEST-RQ model uses:

- 24 Conformer encoder layers
- Model dimension: 1024
- 8 attention heads
- Feed-forward dimension: 4096
- GELU activations
- Layer normalization before attention
- Relative position multi-head attention
- Convolutional front-end with two blocks
- Random projection quantizer with 4096 codebook entries of dimension 16
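
The last bullet refers to the random-projection quantizer that BEST-RQ uses to turn speech features into discrete training targets. The sketch below illustrates the general idea in plain Python, using the codebook size (4096) and code dimension (16) listed above. The input feature size (`INPUT_DIM = 80`), the Gaussian initialization, and all function names are assumptions made for illustration; the actual Ara-BEST-RQ implementation may differ.

```python
import math
import random

CODEBOOK_SIZE = 4096   # from the architecture list above
CODE_DIM = 16          # from the architecture list above
INPUT_DIM = 80         # assumption: size of the feature vector fed to the quantizer


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def make_quantizer(seed=0):
    """Draw the random projection and codebook once; neither is ever trained."""
    rng = random.Random(seed)
    projection = [[rng.gauss(0.0, 1.0) for _ in range(CODE_DIM)]
                  for _ in range(INPUT_DIM)]
    codebook = [l2_normalize([rng.gauss(0.0, 1.0) for _ in range(CODE_DIM)])
                for _ in range(CODEBOOK_SIZE)]
    return projection, codebook


def quantize(frame, projection, codebook):
    """Return the index of the codebook entry nearest to the projected frame."""
    projected = [sum(frame[i] * projection[i][j] for i in range(INPUT_DIM))
                 for j in range(CODE_DIM)]
    projected = l2_normalize(projected)
    # For unit vectors, the nearest entry is the one with the largest dot product.
    scores = [sum(p * c for p, c in zip(projected, code)) for code in codebook]
    return max(range(CODEBOOK_SIZE), key=scores.__getitem__)


projection, codebook = make_quantizer(seed=0)
rng = random.Random(1)
frames = [[rng.gauss(0.0, 1.0) for _ in range(INPUT_DIM)] for _ in range(3)]
targets = [quantize(frame, projection, codebook) for frame in frames]
print(targets)  # three integers in [0, 4095]
```

Because the projection and the codebook stay frozen, a given frame always maps to the same index, which gives the encoder a stable target to predict for masked frames.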

## Training Data

The model was pretrained on the combined Ara-BEST-RQ dataset: **13,723h 08m 43s** of speech data. The combined set includes the crawled Ara-BEST-RQ data together with other publicly available datasets described in the paper.
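
As a quick sanity check on the "14k" in the model name, the duration quoted above can be converted to seconds and decimal hours. This is plain arithmetic on the figure from this card; the variable names are illustrative.

```python
hours, minutes, seconds = 13_723, 8, 43  # "13,723h 08m 43s" from this card

total_seconds = hours * 3600 + minutes * 60 + seconds
total_hours = total_seconds / 3600

print(total_seconds)              # 49403323
print(round(total_hours, 2))      # 13723.15
print(round(total_hours / 1000))  # 14, i.e. roughly 14k hours
```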

The released dataset on Hugging Face provides **metadata only**: YouTube video identifiers and audio segment boundaries. No audio or video files are distributed as part of the dataset.

Dataset link: [Elyadata/Ara-Best-RQ_dataset](https://huggingface.co/datasets/Elyadata/Ara-Best-RQ_dataset)

## Pretraining

The paper reports the following pretraining losses after 300k updates for this model:

| Training set | Train loss | Validation loss |
|---|---:|---:|
| Combined | 3.57 | 3.40 |

## Evaluation

The paper evaluates Ara-BEST-RQ models on automatic speech recognition and dialect identification tasks. The following results are reported for the **Ara-BEST-RQ-600M-14k** model.

### Automatic Speech Recognition

Word error rate (WER) on ASR benchmarks; lower is better:

| Dataset | WER (%) |
|---|---:|
| Common Voice 19.0 Arabic | 18.59 |
| MGB-3 | 28.78 |
| MGB-5 | 54.54 |
| TARIC-SLU | 21.14 |
| Average | 30.76 |
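
For readers unfamiliar with the metric, the sketch below shows one common way to compute WER (word-level edit distance divided by the number of reference words) and checks that the "Average" row is the unweighted mean of the four datasets. The function name and the toy sentences are illustrative assumptions; this card does not state which scoring tool or text normalization the paper used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] = edit distance between the reference words seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return 100.0 * previous[-1] / len(ref)


# Toy example (made up, not benchmark data): one substitution in four words.
print(word_error_rate("the cat sat down", "the cat sit down"))  # 25.0

# Per-dataset WERs copied from the table above.
reported = {
    "Common Voice 19.0 Arabic": 18.59,
    "MGB-3": 28.78,
    "MGB-5": 54.54,
    "TARIC-SLU": 21.14,
}
average = sum(reported.values()) / len(reported)
print(round(average, 2))  # 30.76
```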

### Dialect Identification

Results on ADI-20:

| Split | Accuracy (%) | Weighted F1 (%) |
|---|---:|---:|
| Validation | 94.66 | 94.71 |
| Test | 92.05 | 92.07 |
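
The two metrics in this table can be computed as in the minimal sketch below, where weighted F1 is the per-class F1 averaged with weights proportional to each class's number of true examples (the same definition as scikit-learn's `f1_score(..., average="weighted")`). The function name and the toy labels are assumptions for illustration, not ADI-20 data or the paper's evaluation code.

```python
from collections import Counter


def accuracy_and_weighted_f1(y_true, y_pred):
    """Return (accuracy, weighted F1) in percent for two equal-length label lists."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty label lists of equal length")
    support = Counter(y_true)    # number of true examples per class
    predicted = Counter(y_pred)  # number of predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    weighted_f1 = 0.0
    for label, n_true in support.items():
        tp = correct[label]
        precision = tp / predicted[label] if predicted[label] else 0.0
        recall = tp / n_true
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        weighted_f1 += f1 * n_true / len(y_true)

    accuracy = sum(correct.values()) / len(y_true)
    return 100.0 * accuracy, 100.0 * weighted_f1


# Toy labels (hypothetical dialect codes, made up for illustration).
y_true = ["TUN", "TUN", "EGY", "EGY", "EGY", "MSA"]
y_pred = ["TUN", "EGY", "EGY", "EGY", "EGY", "MSA"]
acc, f1 = accuracy_and_weighted_f1(y_true, y_pred)
print(round(acc, 2), round(f1, 2))  # 83.33 81.75
```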

## Usage

This is a self-supervised pretrained model intended to be used as a speech encoder or as an initialization checkpoint for downstream fine-tuning.

For training and fine-tuning recipes, please refer to the official implementation:

```bash
git clone https://github.com/elyadata/AraBEST-RQ
cd AraBEST-RQ
```

You can download the checkpoint from Hugging Face using:

```python
from huggingface_hub import snapshot_download

# Download the model repository into the local Hugging Face cache
# and get back the path of the downloaded snapshot.
model_dir = snapshot_download("Elyadata/AraBEST-RQ-600M-14k")
print(model_dir)
```

Please refer to the repository configuration and SpeechBrain recipes for the correct model-loading interface.

### Fine-tuning with SpeechBrain

To fine-tune this pretrained Ara-BEST-RQ checkpoint in a SpeechBrain recipe, adapt the `pretrainer` section of your YAML configuration so that it loads both the pretrained model checkpoint and the corresponding normalizer.

Example:

```yaml
pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
    collect_in: !ref <save_folder>
    loadables:
        pt_model: !ref <pt_model>
        normalize: !ref <normalize>
    paths:
        pt_model: !ref <pt_model_path>/model.ckpt
        normalize: !ref <pt_model_path>/normalizer.ckpt
```

In your downstream recipe, make sure that:

- `<pt_model>` points to the Ara-BEST-RQ pretrained model object used in your training graph.
- `<normalize>` points to the normalization module used by the recipe.
- `<pt_model_path>` points to the local directory containing `model.ckpt` and `normalizer.ckpt`.
- `<save_folder>` is the experiment directory where SpeechBrain should collect and manage pretrained components.

This setup allows SpeechBrain to initialize the downstream model from the Ara-BEST-RQ SSL checkpoint before fine-tuning on task-specific data.
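
Before launching a fine-tuning run, it can help to confirm that the directory you plan to use as `<pt_model_path>` actually contains the two files named in the YAML example. The helper below is a small sketch for that check; `missing_checkpoint_files` is a hypothetical name, not part of SpeechBrain or the Ara-BEST-RQ repository, and the temporary directory only stands in for a real download location.

```python
import tempfile
from pathlib import Path

# File names taken from the `paths` entries of the YAML example above.
EXPECTED_FILES = ("model.ckpt", "normalizer.ckpt")


def missing_checkpoint_files(pt_model_path):
    """Return the expected checkpoint files that are absent from the directory."""
    folder = Path(pt_model_path)
    return [name for name in EXPECTED_FILES if not (folder / name).is_file()]


# Demonstration on a throwaway directory that holds only one of the two files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "model.ckpt").touch()
    missing = missing_checkpoint_files(tmp)
    print(missing)  # ['normalizer.ckpt']
```

In practice you would pass the directory returned by `snapshot_download`. SpeechBrain recipes then typically trigger the pretrainer from the training script by calling its `collect_files()` and `load_collected()` methods; check the official recipes for the exact invocation used with this checkpoint.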

## Citation

If you use this model, please cite the Ara-BEST-RQ paper:

```bibtex
@misc{elleuch2026arabestrqmultidialectalarabic,
      title={Ara-Best-RQ: Multi Dialectal Arabic SSL},
      author={Haroun Elleuch and Ryan Whetten and Salima Mdhaffar and Yannick Estève and Fethi Bougares},
      year={2026},
      eprint={2603.21900},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.21900},
}
```