IdioleX-AR: Style-Aware Arabic Sentence Embeddings

IdioleX-AR is a sentence encoder trained under the IDIOLEX framework for idiolectal representation learning: capturing how text is expressed rather than what it says. Embeddings encode stylistic and dialectal variation across 15 Arabic varieties (including MSA), decoupled from semantic content.

Kantharuban et al., IDIOLEX: Unified and Continuous Representations for Idiolectal and Stylistic Variation (preprint, under review). Code: github.com/AnjaliRuban/IdioleX

Architecture

```
input_ids → AraBERT v2 (BERT-base) → layer-wise attention → mean pool
          → mean centering → L2 normalize → embedding
```

| Component | Detail |
|---|---|
| Base encoder | aubmindlab/bert-base-arabertv2 |
| Pooling | Learnable layer-wise attention over all 13 hidden states (embedding + 12 transformer layers), then mean pool |
| Centering | Running-mean subtraction estimated over the Arabic training corpus |
| Output | 768-dimensional L2-normalized vector |

The scalar-mix weights are learned jointly with the encoder, following Rei et al. (2020).
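The pooling pipeline above can be sketched in a few lines. The following is an illustrative NumPy stand-in, assuming softmax-normalized scalar-mix weights; in the actual model the weights and running mean are learned parameters and buffers inside the custom modules:

```python
import numpy as np

def scalar_mix(hidden_states, weights):
    """Softmax-weighted combination of the encoder's 13 hidden states."""
    w = np.exp(weights - weights.max())
    w /= w.sum()
    return sum(wi * h for wi, h in zip(w, hidden_states))

def embed(hidden_states, weights, running_mean):
    mixed = scalar_mix(hidden_states, weights)     # (seq_len, dim)
    pooled = mixed.mean(axis=0)                    # mean pool over tokens
    centered = pooled - running_mean               # corpus mean centering
    return centered / np.linalg.norm(centered)     # L2 normalize

rng = np.random.default_rng(0)
states = [rng.normal(size=(5, 768)) for _ in range(13)]  # embedding + 12 layers
emb = embed(states, np.zeros(13), np.zeros(768))         # uniform mix, zero mean
print(emb.shape, float(np.linalg.norm(emb)))             # 768-dim unit vector
```

With zero scalar weights the mix is uniform over layers; training moves these weights to emphasize whichever layers carry the most stylistic signal.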

Training

Data

Training data consists of Reddit comments from 15 regional Arabic-language subreddits, collected via the Pushshift archive through December 2024 and filtered for language and quality. Pre-training uses ~80k authors and ~1.77M sentences. Feature-supervised training uses 200 authors per dialect with LLM-annotated linguistic features.

| Variety | Subreddit | Variety | Subreddit |
|---|---|---|---|
| Algerian | r/algeria | Palestinian | r/Palestine |
| Egyptian | r/Egypt | Qatari | r/qatar |
| Iraqi | r/Iraq | Saudi | r/saudiarabia |
| Jordanian | r/jordan | Sudanese | r/Sudan |
| Kuwaiti | r/Kuwait | Syrian | r/Syria |
| Lebanese | r/lebanon | Emirati | r/UAE |
| Libyan | r/Libya | | |
| Moroccan | r/Morocco | | |
| Omani | r/Oman | | |

Linguistic Features

74 binary dialectal features are extracted sentence-by-sentence using GPT-5-mini, covering:

- Morphosyntax and clause structure: case endings, dual suffixes, future markers (sa/sawfa vs. dialectal rah/ha/ghadi/bash), progressive markers, imperfect prefixes (bi/ba/ka/3a), copula presence/absence, relative pronouns (alladhi vs. illi/yalli)
- Negation and interrogatives: laysa, lan, lam, ma...sh, mish, mu, and dialectal interrogatives (eh/shu/wesh/shinu)
- Dialectal lexical and orthographic markers: Egyptian izaay/keda/ba2a, Levantine baddi/halla/shu, Maghrebi barcha/hshuma, Gulf wain/yalla, and non-standard orthography (hamza omission, tatweel, laughter tokens, etc.)

Objectives

Training proceeds in two stages:

Stage 1: Ranking pre-training (full dataset). A margin ranking loss encourages sentences with higher hierarchical proximity to be closer in embedding space. Each batch of 16 is structured so every sentence has exactly one same-comment neighbor (r=3), two same-author neighbors (r=2), four same-dialect neighbors (r=1), and eight cross-dialect neighbors (r=0). The margin λ warms up linearly from 0 to 0.5 over 25k steps.
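A minimal sketch of this ranking objective, assuming a pairwise hinge over all proximity-ordered pairs for one anchor (the paper's exact pair sampling and normalization may differ):

```python
import numpy as np

def margin_ranking_loss(sims, ranks, margin=0.5):
    """Hinge loss pushing higher-proximity pairs to higher similarity.

    sims:  (N,) similarities of one anchor to its N batch mates
    ranks: (N,) hierarchical proximity (3=same comment ... 0=cross-dialect)
    Penalizes max(0, margin - (sims[i] - sims[j])) whenever ranks[i] > ranks[j].
    """
    loss, count = 0.0, 0
    for i in range(len(sims)):
        for j in range(len(sims)):
            if ranks[i] > ranks[j]:
                loss += max(0.0, margin - (sims[i] - sims[j]))
                count += 1
    return loss / max(count, 1)

# Batch structure from the text: 1 neighbor at r=3, 2 at r=2, 4 at r=1, 8 at r=0
ranks = np.array([3] + [2] * 2 + [1] * 4 + [0] * 8)
good = margin_ranking_loss(ranks.astype(float), ranks)   # toy sims respect hierarchy
bad = margin_ranking_loss(-ranks.astype(float), ranks)   # toy sims invert it
print(good, bad)  # hierarchy-respecting similarities incur zero loss
```

Toy similarities here are unbounded scalars for illustration; in the model they are cosine similarities between unit-norm embeddings.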

Stage 2: Feature-aware training (annotated subset, α = 0.5).

| Loss | Weight | Purpose |
|---|---|---|
| Margin ranking | 1 − α = 0.5 | Proximity-based ranking |
| Feature prediction BCE | 0.25 × α = 0.125 | Predict the 74 linguistic features |
| Supervised contrastive (Jaccard-weighted) | α = 0.5 | Feature-similarity alignment |
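The Jaccard-weighted supervised contrastive term can be sketched as follows. This is an assumed formulation built from the stated hyperparameters (τ = 0.07, top-k = 5, 74 binary features): each anchor's positives are its top-k most feature-similar batch mates, and each positive's log-softmax term is weighted by Jaccard overlap of the feature vectors.

```python
import numpy as np

def jaccard(a, b):
    """Jaccard similarity between two binary feature vectors."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def jaccard_weighted_supcon(emb, feats, tau=0.07, top_k=5):
    """Supervised contrastive loss with feature-similarity-weighted positives."""
    n = len(emb)
    logits = emb @ emb.T / tau
    total = 0.0
    for i in range(n):
        j_sims = np.array([jaccard(feats[i], feats[j]) if j != i else -1.0
                           for j in range(n)])
        positives = np.argsort(j_sims)[-top_k:]       # top-k by feature overlap
        row = logits[i] - logits[i].max()             # numerical stability
        log_denom = np.log(np.exp(row[np.arange(n) != i]).sum())
        for p in positives:
            total -= j_sims[p] * (row[p] - log_denom)  # weighted log-softmax
    return total / n

rng = np.random.default_rng(0)
feats = rng.integers(0, 2, size=(16, 74)).astype(bool)  # 74 binary features
emb = rng.normal(size=(16, 32))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)       # unit-norm embeddings
loss = jaccard_weighted_supcon(emb, feats)
print(loss)
```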

A VICReg regularizer (weight 0.25) enforces variance ≥ 1 per dimension and decorrelates embedding dimensions to prevent anisotropy.
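The variance and covariance terms of that regularizer can be sketched as below, following the standard VICReg formulation (Bardes et al., 2022); the hinge threshold and scaling here are illustrative, not taken from the paper:

```python
import numpy as np

def vicreg_reg(emb, gamma=1.0, eps=1e-4):
    """VICReg-style regularizer: hinge penalty on per-dimension std below
    gamma, plus squared off-diagonal covariance to decorrelate dimensions."""
    n, d = emb.shape
    std = np.sqrt(emb.var(axis=0) + eps)
    var_term = np.maximum(0.0, gamma - std).mean()   # keep each dim's std >= gamma
    centered = emb - emb.mean(axis=0)
    cov = centered.T @ centered / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    cov_term = (off_diag ** 2).sum() / d             # penalize cross-dim correlation
    return var_term + cov_term

rng = np.random.default_rng(0)
collapsed = np.ones((256, 8))        # fully collapsed embeddings: heavy penalty
spread = rng.normal(size=(256, 8))   # well-spread embeddings: near-zero penalty
print(vicreg_reg(collapsed), vicreg_reg(spread))
```

Collapsed embeddings pay the full variance hinge, which is exactly the anisotropy failure mode the regularizer guards against.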

Hyperparameters

| Parameter | Value |
|---|---|
| Base model | aubmindlab/bert-base-arabertv2 |
| Hidden size | 768 |
| Transformer layers | 12 |
| Max sequence length | 512 |
| Batch size | 32 |
| Ranking group size | 16 |
| Feature vector dimension | 74 |
| Feature loss weight (α) | 0.5 |
| Contrastive temperature (τ) | 0.07 |
| Jaccard top-k | 5 |
| Learning rate | 1 × 10⁻⁵ |
| LR warmup | 25k steps |
| Optimizer | Adam |
| Margin (λ) | 0 → 0.5 (linear warmup) |
| Training GPUs | 4 |
| Max training time | ≤ 48 hrs |

Performance

Dialect Identification: MADAR 26 (25 city-level dialects + MSA)

| Model | F1 | Exact Match |
|---|---|---|
| IdioleX-AR | 0.43 | 0.43 |
| Finetuned IdioleX-AR | 0.61 | 0.61 |
| Finetuned IdioleX-AR + Lexical | 0.66 | 0.66 |
| Finetuned BERT (baseline) | 0.56 | 0.56 |
| Centroid clustering w/ BERT | 0.20 | 0.22 |
| Samih et al., 2019 (neural, top shared-task system) | 0.59 | 0.59 |
| Abu Kwaik & Saad, 2019 (feature-engineered, best shared-task system) | 0.67 | 0.67 |

Finetuned IdioleX-AR with lexical features outperforms the top neural submission to the MADAR 26 shared task by 7 F1 points; even without them, finetuned IdioleX-AR is 2 points ahead with no task-specific engineering.

Semantic Decoupling: MADAR 26 (parallel sentences)

Because MADAR 26 contains semantically equivalent translations across dialects, any differences in idiolectal similarity scores on this data must arise from linguistic form rather than content. IdioleX-AR shows a clear gradient:

| Grouping | Avg. idiolectal similarity |
|---|---|
| Same country (different city) | highest |
| Same dialectal region | intermediate |
| Different dialectal region | lowest |

Differences are statistically significant (p < 0.005), providing strong evidence of content-independent dialectal structure in the embedding space.

Pearson correlation between IdioleX-AR idiolectal similarity scores and Multilingual-E5 semantic similarity scores on withheld Reddit test pairs: ρ = 0.19.
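This decoupling check can be run for any pair of encoders: score all sentence pairs under both models and correlate the two sets of similarities. A schematic version with random stand-in embeddings (IdioleX-AR and Multilingual-E5 outputs would replace these in practice):

```python
import numpy as np

def pairwise_cos(emb):
    """Upper-triangle cosine similarities among a set of embeddings."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = emb @ emb.T
    iu = np.triu_indices(len(emb), k=1)
    return sims[iu]

rng = np.random.default_rng(0)
style_emb = rng.normal(size=(50, 16))  # stand-in for idiolectal embeddings
sem_emb = rng.normal(size=(50, 16))    # stand-in for semantic embeddings
rho = np.corrcoef(pairwise_cos(style_emb), pairwise_cos(sem_emb))[0, 1]
print(rho)
```

A low correlation, as with the reported ρ = 0.19, indicates the two similarity spaces rank sentence pairs largely independently.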

Arabic Dialect Generation: AMIYA Shared Task

IdioleX-AR embeddings also serve as a training objective for LLM post-training. Augmenting standard SFT with an IDIOLEX alignment loss consistently improves dialectal adherence (ADI2) while maintaining translation quality (ChrF++) across five Arabic varieties, using only ~50k training pairs.

| Model | Egyptian ADI2 | Moroccan ADI2 |
|---|---|---|
| Allam + IdioleX SFT | 0.48 | 0.54 |
| Allam + SFT only | 0.45 | 0.49 |
| Allam (base) | 0.28 | 0.26 |
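One way such an alignment term could be wired into post-training (purely illustrative; the paper's actual loss formulation is not specified here, and `beta` is an assumed mixing weight): add a cosine-distance penalty between the output's IdioleX embedding and a target-dialect centroid.

```python
import numpy as np

def sft_with_idiolex_alignment(ce_loss, output_emb, dialect_centroid, beta=0.1):
    """Hypothetical combined objective: token-level cross-entropy plus a
    cosine-distance pull toward the target dialect's embedding centroid."""
    e = output_emb / np.linalg.norm(output_emb)
    c = dialect_centroid / np.linalg.norm(dialect_centroid)
    align = 1.0 - float(e @ c)        # 0 when output matches the dialect centroid
    return ce_loss + beta * align

rng = np.random.default_rng(0)
centroid = rng.normal(size=64)
aligned = sft_with_idiolex_alignment(2.0, centroid, centroid)
off_target = sft_with_idiolex_alignment(2.0, rng.normal(size=64), centroid)
print(aligned, off_target)  # the alignment penalty only adds to the loss
```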

Usage

```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "AnjaliRuban/idiolex-arabertv2-ar",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("AnjaliRuban/idiolex-arabertv2-ar")

sentences = [
    "إزيك؟ عامل إيه؟",  # Egyptian Arabic
    "كيف حالك؟",  # MSA
]

inputs = tokenizer(
    sentences,
    return_tensors="pt",
    truncation=True,
    padding=True,
    max_length=model.config.model_len,  # 512
)

with torch.no_grad():
    embeddings = model(**inputs)  # [2, 768], L2-normalized

# Embeddings are L2-normalized, so the dot product is the cosine similarity
similarity = embeddings @ embeddings.T
print(similarity)
```

Config

| Parameter | Value |
|---|---|
| base_model | aubmindlab/bert-base-arabertv2 |
| embedding_dim | 768 |
| layerwise_pooling | True |
| num_layers | 13 (12 transformer layers + embedding layer) |
| layer_norm | False |
| layer_dropout | None |
| mean_center | True |
| model_len | 512 |

Custom files

This model uses trust_remote_code=True. The following files are hosted in this repo:

| File | Purpose |
|---|---|
| configuration_idiolex.py | IdioleXConfig |
| modeling_idiolex.py | IdioleXModel |
| centering.py | MeanCenterer, a distributed running-mean buffer |
| layer_pool.py | LayerwiseAttention, a scalar mix of transformer layers |
| pooling_utils.py | last_token_pool, average_pool |

Citation

```bibtex
@article{kantharuban2025idiolex,
  title   = {IDIOLEX: Unified and Continuous Representations for Idiolectal and Stylistic Variation},
  author  = {Kantharuban, Anjali and Srivastava, Aarohi and Faisal, Fahim and Ahia, Orevaoghene
             and Anastasopoulos, Antonios and Chiang, David and Tsvetkov, Yulia and Neubig, Graham},
  year    = {2025},
  note    = {Preprint, under review}
}
```