# IdioleX-AR: Style-Aware Arabic Sentence Embeddings
IdioleX-AR is a sentence encoder trained under the IDIOLEX framework for idiolectal representation learning: capturing how text is expressed rather than what it says. Embeddings encode stylistic and dialectal variation across 15 Arabic varieties (including MSA), decoupled from semantic content.
Kantharuban et al., IDIOLEX: Unified and Continuous Representations for Idiolectal and Stylistic Variation (preprint, under review). Code: github.com/AnjaliRuban/IdioleX
## Architecture

```
input_ids → AraBERT v2 (BERT-base) → layer-wise attention → mean pool
          → mean centering → L2 normalize → embedding
```
| Component | Detail |
|---|---|
| Base encoder | aubmindlab/bert-base-arabertv2 |
| Pooling | Learnable layer-wise attention over all 13 hidden states (embedding + 12 transformer layers), then mean pool |
| Centering | Running-mean subtraction estimated over the Arabic training corpus |
| Output | L2-normalized vector, 768-dimensional |
The scalar-mix weights are learned jointly with the encoder, following Rei et al. (2020).
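The scalar-mix pooling described above can be sketched as follows. Note this `LayerwiseAttention` is an illustrative reimplementation (the running-mean centering buffer is replaced by a plain mean-then-normalize step), not the exact code hosted in the repo:

```python
import torch
import torch.nn as nn


class LayerwiseAttention(nn.Module):
    """Illustrative scalar mix over all hidden states (embedding + transformer
    layers), followed by mask-aware mean pooling and L2 normalization."""

    def __init__(self, num_layers: int = 13):
        super().__init__()
        # One learnable scalar per hidden state; softmax makes them a mixture.
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states, attention_mask):
        # hidden_states: tuple of [batch, seq, hidden], one tensor per layer.
        w = torch.softmax(self.weights, dim=0)
        stacked = torch.stack(hidden_states, dim=0)          # [L, B, T, H]
        mixed = (w.view(-1, 1, 1, 1) * stacked).sum(dim=0)   # [B, T, H]
        # Mask-aware mean pooling over tokens.
        mask = attention_mask.unsqueeze(-1).float()          # [B, T, 1]
        pooled = (mixed * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
        # L2 normalization (the repo additionally subtracts a running mean).
        return torch.nn.functional.normalize(pooled, dim=-1)
```

In this formulation the mix weights receive gradients through the ranking loss, so they are learned jointly with the encoder as the card states.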
## Training

### Data
Training data consists of Reddit comments from 15 regional Arabic-language subreddits, collected via the Pushshift archive through December 2024 and filtered for language and quality. Pre-training uses ~80k authors and ~1.77M sentences. Feature-supervised training uses 200 authors per dialect with LLM-annotated linguistic features.
| Variety | Subreddit | Variety | Subreddit |
|---|---|---|---|
| Algerian | r/algeria | Palestinian | r/Palestine |
| Egyptian | r/Egypt | Qatari | r/qatar |
| Iraqi | r/Iraq | Saudi | r/saudiarabia |
| Jordanian | r/jordan | Sudanese | r/Sudan |
| Kuwaiti | r/Kuwait | Syrian | r/Syria |
| Lebanese | r/lebanon | Emirati | r/UAE |
| Libyan | r/Libya | | |
| Moroccan | r/Morocco | | |
| Omani | r/Oman | | |
### Linguistic Features
74 binary dialectal features are extracted sentence-by-sentence using GPT-5-mini,
covering morphosyntax and clause structure (case endings, dual suffixes, future
markers sa/sawfa/dialectal rah/ha/ghadi/bash, progressive markers,
imperfect prefixes bi/ba/ka/3a, copula presence/absence, relative pronouns
alladhi vs. illi/yalli), negation and interrogatives (laysa, lan, lam,
ma...sh, mish, mu, dialectal interrogatives eh/shu/wesh/shinu),
and dialectal lexical/orthographic markers (Egyptian izaay/keda/ba2a, Levantine
baddi/halla/shu, Maghrebi barcha/hshuma, Gulf wain/yalla, non-standard
orthography including hamza omission, tatweel, laughter tokens, etc.).
### Objectives
Training proceeds in two stages:
**Stage 1: Ranking pre-training (full dataset).** A margin ranking loss encourages sentences with higher hierarchical proximity to be closer in embedding space. Each ranking group of 16 sentences is structured so that every sentence has exactly one same-comment neighbor (r = 3), two same-author neighbors (r = 2), four same-dialect neighbors (r = 1), and eight cross-dialect neighbors (r = 0). The margin λ warms up linearly from 0 to 0.5 over the first 25k steps.
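An assumed formulation of this ranking objective, sketched in PyTorch; the paper's exact triple sampling and hinge arrangement may differ:

```python
import torch

# Proximity levels: r=3 same comment, r=2 same author,
# r=1 same dialect, r=0 cross-dialect.
def hierarchical_margin_loss(emb, levels, margin=0.5):
    """Hinge over (anchor, closer, farther) triples: a neighbor at a
    higher proximity level must be more cosine-similar to the anchor
    than any neighbor at a lower level, by at least `margin`."""
    sim = emb @ emb.T  # cosine similarities (embeddings are L2-normalized)
    n = emb.size(0)
    losses = []
    for a in range(n):
        for p in range(n):          # candidate closer-ranked neighbor
            for q in range(n):      # candidate farther-ranked neighbor
                if a in (p, q):
                    continue
                if levels[a][p] > levels[a][q]:
                    losses.append(torch.relu(margin - (sim[a, p] - sim[a, q])))
    return torch.stack(losses).mean()
```

The O(n³) loops are for clarity only; within a group of 16 the triple enumeration is cheap, and a vectorized version would be used in practice.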
**Stage 2: Feature-aware training (annotated subset, α = 0.5).** Three weighted losses are combined:
| Loss | Weight | Purpose |
|---|---|---|
| Margin ranking loss | 1 − α = 0.5 | Proximity-based ranking |
| Feature prediction BCE | 0.25 × α = 0.125 | Predict the 74 binary linguistic features |
| Supervised contrastive (Jaccard-weighted) | α = 0.5 | Feature-similarity alignment |
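A hedged sketch of the Jaccard-weighted supervised contrastive term, assuming each anchor's top-k Jaccard neighbors (k = 5, τ = 0.07 per the hyperparameter table) act as soft positives; the paper's exact weighting may differ:

```python
import torch

def jaccard_supcon(emb, feats, tau=0.07, top_k=5):
    """Supervised contrastive loss where positives are weighted by the
    Jaccard similarity of binary linguistic-feature vectors.
    emb: [n, d] L2-normalized embeddings; feats: [n, 74] binary floats."""
    n = emb.size(0)
    sim = emb @ emb.T / tau                           # temperature-scaled cosine
    inter = feats @ feats.T                           # |A ∩ B| for binary vectors
    union = feats.sum(1, keepdim=True) + feats.sum(1) - inter
    jac = inter / union.clamp(min=1)                  # Jaccard similarity matrix
    jac.fill_diagonal_(0)                             # no self-positives
    self_mask = torch.eye(n, dtype=torch.bool)
    log_prob = sim - torch.logsumexp(
        sim.masked_fill(self_mask, float("-inf")), dim=1, keepdim=True)
    # Keep only each anchor's top-k Jaccard neighbors as weighted positives.
    topk = torch.topk(jac, k=min(top_k, n - 1), dim=1)
    w = torch.zeros_like(jac).scatter_(1, topk.indices, topk.values)
    per_anchor = -(w * log_prob).sum(1) / w.sum(1).clamp(min=1e-8)
    return per_anchor.mean()
```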
A VICReg regularizer (weight 0.25) enforces variance ≥ 1 per dimension and decorrelates embedding dimensions to prevent anisotropy.
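A minimal sketch of that regularizer, assuming the standard VICReg variance-hinge plus covariance-penalty form:

```python
import torch

def vicreg_regularizer(emb, var_target=1.0, eps=1e-4):
    """VICReg-style regularizer: hinge each dimension's standard deviation
    up toward `var_target`, and penalize off-diagonal covariance so that
    embedding dimensions decorrelate (preventing anisotropic collapse)."""
    z = emb - emb.mean(dim=0)                        # center per dimension
    std = torch.sqrt(z.var(dim=0) + eps)
    var_loss = torch.relu(var_target - std).mean()   # hinge on per-dim std
    n, d = z.shape
    cov = (z.T @ z) / (n - 1)                        # [d, d] covariance
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_loss = off_diag.pow(2).sum() / d             # decorrelation term
    return var_loss + cov_loss
```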
## Hyperparameters
| Parameter | Value |
|---|---|
| Base model | aubmindlab/bert-base-arabertv2 |
| Hidden size | 768 |
| Transformer layers | 12 |
| Max sequence length | 512 |
| Batch size | 32 |
| Ranking group size | 16 |
| Feature vector dimension | 74 |
| Feature loss weight (α) | 0.5 |
| Contrastive temperature (τ) | 0.07 |
| Jaccard top-k | 5 |
| Learning rate | 1 × 10⁻⁵ |
| LR warmup | 25k steps |
| Optimizer | Adam |
| Margin (λ) | 0 → 0.5 (linear warmup) |
| Training GPUs | 4 |
| Max training time | ≤ 48 hrs |
## Performance

### Dialect Identification: MADAR 26 (25 city-level dialects + MSA)
| Model | F1 | Exact Match |
|---|---|---|
| IdioleX-AR | 0.43 | 0.43 |
| Finetuned IdioleX-AR | 0.61 | 0.61 |
| Finetuned IdioleX-AR + Lexical | 0.66 | 0.66 |
| Finetuned BERT (baseline) | 0.56 | 0.56 |
| Centroid Clustering w/ BERT | 0.20 | 0.22 |
| Samih et al., 2019 (neural, top shared task) | 0.59 | 0.59 |
| Abu Kwaik & Saad, 2019 (feature-engineered, best shared task) | 0.67 | 0.67 |
Finetuned IdioleX-AR outperforms the top neural submission to the MADAR 26 shared task (Samih et al., 0.59 F1) without any task-specific engineering; adding lexical features raises F1 to 0.66, within one point of the best feature-engineered system.
### Semantic Decoupling: MADAR 26 (parallel sentences)
Because MADAR 26 contains semantically equivalent translations across dialects, any differences in idiolectal similarity scores on this data must arise from linguistic form rather than content. IdioleX-AR shows a clear gradient:
| Grouping | Avg. Idiolectal Similarity |
|---|---|
| Same country (different city) | highest |
| Same dialectal region | intermediate |
| Different dialectal region | lowest |
Differences are statistically significant (p < 0.005), providing strong evidence of content-independent dialectal structure in the embedding space.
The Pearson correlation between IdioleX-AR idiolectal similarity scores and Multilingual-E5 semantic similarity scores on withheld Reddit test pairs is ρ = 0.19, indicating that idiolectal similarity is largely decoupled from semantic similarity.
### Arabic Dialect Generation: AMIYA Shared Task
IdioleX-AR embeddings also serve as a training objective for LLM post-training. Augmenting standard SFT with an IDIOLEX alignment loss consistently improves dialectal adherence (ADI2) while maintaining translation quality (ChrF++) across five Arabic varieties, using only ~50k training pairs.
| Model | Egyptian ADI2 | Moroccan ADI2 |
|---|---|---|
| Allam + IdioleX SFT | 0.48 | 0.54 |
| Allam + SFT only | 0.45 | 0.49 |
| Allam (base) | 0.28 | 0.26 |
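One plausible form of such an alignment term is to treat IdioleX-AR as a frozen style encoder and pull the embedding of each generation toward that of a gold dialectal reference. The function below is a hypothetical sketch under that assumption, not the paper's actual loss:

```python
import torch

def idiolex_alignment_loss(gen_emb: torch.Tensor, ref_emb: torch.Tensor) -> torch.Tensor:
    """Hypothetical SFT alignment term: 1 - cosine similarity between the
    IdioleX-AR embedding of the model's generation and that of a gold
    dialectal reference. Both inputs are assumed L2-normalized [batch, 768]."""
    return (1 - (gen_emb * ref_emb).sum(dim=-1)).mean()
```

Added to the token-level SFT loss with a small weight, a term like this rewards generations whose dialectal *form* matches the reference even when token overlap is low.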
## Usage
```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "AnjaliRuban/idiolex-arabertv2-ar",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("AnjaliRuban/idiolex-arabertv2-ar")

sentences = [
    "إزيك عامل إيه؟",  # Egyptian Arabic ("How are you, what's up?")
    "كيف حالك؟",      # MSA ("How are you?")
]

inputs = tokenizer(
    sentences,
    return_tensors="pt",
    truncation=True,
    padding=True,
    max_length=model.config.model_len,  # 512
)

with torch.no_grad():
    embeddings = model(**inputs)  # [2, 768], L2-normalized

# Cosine similarity (embeddings are L2-normalized, so dot product = cosine sim)
similarity = embeddings @ embeddings.T
print(similarity)
```
## Config
| Parameter | Value |
|---|---|
| `base_model` | aubmindlab/bert-base-arabertv2 |
| `embedding_dim` | 768 |
| `layerwise_pooling` | True |
| `num_layers` | 13 (12 transformer layers + embedding layer) |
| `layer_norm` | False |
| `layer_dropout` | None |
| `mean_center` | True |
| `model_len` | 512 |
## Custom files

This model uses `trust_remote_code=True`. The following files are hosted in this repo:
| File | Purpose |
|---|---|
| `configuration_idiolex.py` | `IdioleXConfig` |
| `modeling_idiolex.py` | `IdioleXModel` |
| `centering.py` | `MeanCenterer`: distributed running-mean buffer |
| `layer_pool.py` | `LayerwiseAttention`: scalar-mix of transformer layers |
| `pooling_utils.py` | `last_token_pool`, `average_pool` |
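The pooling helpers likely resemble standard mask-aware implementations; the sketch below is inferred from the function names alone and may differ from the hosted code:

```python
import torch

def average_pool(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Mask-aware mean over token embeddings: padded positions are zeroed
    and the sum is divided by the count of real tokens."""
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)      # [B, T, 1]
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)

def last_token_pool(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Embedding of each sequence's last non-padding token."""
    lengths = attention_mask.sum(dim=1) - 1                   # final real index
    return hidden[torch.arange(hidden.size(0)), lengths]
```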
## Citation
```bibtex
@article{kantharuban2025idiolex,
  title  = {IDIOLEX: Unified and Continuous Representations for Idiolectal and Stylistic Variation},
  author = {Kantharuban, Anjali and Srivastava, Aarohi and Faisal, Fahim and Ahia, Orevaoghene and
            Anastasopoulos, Antonios and Chiang, David and Tsvetkov, Yulia and Neubig, Graham},
  year   = {2025},
  note   = {Preprint, under review}
}
```