# IdioleX-AR: Style-Aware Arabic Sentence Embeddings
IdioleX-AR is a sentence encoder trained under the IDIOLEX framework for idiolectal representation learning: capturing how text is expressed rather than what it says. Embeddings encode stylistic and dialectal variation across 15 Arabic varieties (including MSA), decoupled from semantic content.
Kantharuban et al., IDIOLEX: Unified and Continuous Representations for Idiolectal and Stylistic Variation (preprint, under review). Code: github.com/AnjaliRuban/IdioleX
## Architecture

```
input_ids → AraBERT v2 (BERT-base) → layer-wise attention → mean pool
          → mean centering → L2 normalize → embedding
```
| Component | Detail |
|---|---|
| Base encoder | aubmindlab/bert-base-arabertv2 |
| Pooling | Learnable layer-wise attention over all 13 hidden states (embedding + 12 transformer layers), then mean pool |
| Centering | Running-mean subtraction estimated over the Arabic training corpus |
| Output | L2-normalized vector, 768-dimensional |
The scalar-mix weights are learned jointly with the encoder, following Rei et al. (2020).
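The scalar-mix pooling described above can be sketched as follows. Note this `LayerwiseAttention` is an illustrative reimplementation (the running-mean centering buffer is replaced by a plain mean-then-normalize step), not the exact code hosted in the repo:

```python
import torch
import torch.nn as nn


class LayerwiseAttention(nn.Module):
    """Illustrative scalar mix over all hidden states (embedding + transformer
    layers), followed by mask-aware mean pooling and L2 normalization."""

    def __init__(self, num_layers: int = 13):
        super().__init__()
        # One learnable scalar per hidden state; softmax makes them a mixture.
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states, attention_mask):
        # hidden_states: tuple of [batch, seq, hidden], one tensor per layer.
        w = torch.softmax(self.weights, dim=0)
        stacked = torch.stack(hidden_states, dim=0)          # [L, B, T, H]
        mixed = (w.view(-1, 1, 1, 1) * stacked).sum(dim=0)   # [B, T, H]
        # Mask-aware mean pooling over tokens.
        mask = attention_mask.unsqueeze(-1).float()          # [B, T, 1]
        pooled = (mixed * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
        # L2 normalization (the repo additionally subtracts a running mean).
        return torch.nn.functional.normalize(pooled, dim=-1)
```

In this formulation the mix weights receive gradients through the ranking loss, so they are learned jointly with the encoder as the card states.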
## Training

### Data
Training data consists of Reddit comments from 15 regional Arabic-language subreddits, collected via the Pushshift archive through December 2024 and filtered for language and quality. Pre-training uses ~80k authors and ~1.77M sentences. Feature-supervised training uses 200 authors per dialect with LLM-annotated linguistic features.
| Variety | Subreddit | Variety | Subreddit |
|---|---|---|---|
| Algerian | r/algeria | Palestinian | r/Palestine |
| Egyptian | r/Egypt | Qatari | r/qatar |
| Iraqi | r/Iraq | Saudi | r/saudiarabia |
| Jordanian | r/jordan | Sudanese | r/Sudan |
| Kuwaiti | r/Kuwait | Syrian | r/Syria |
| Lebanese | r/lebanon | Emirati | r/UAE |
| Libyan | r/Libya | | |
| Moroccan | r/Morocco | | |
| Omani | r/Oman | | |
### Linguistic Features
74 binary dialectal features are extracted sentence-by-sentence using GPT-5-mini,
covering morphosyntax and clause structure (case endings, dual suffixes, future
markers sa/sawfa/dialectal rah/ha/ghadi/bash, progressive markers,
imperfect prefixes bi/ba/ka/3a, copula presence/absence, relative pronouns
alladhi vs. illi/yalli), negation and interrogatives (laysa, lan, lam,
ma...sh, mish, mu, dialectal interrogatives eh/shu/wesh/shinu),
and dialectal lexical/orthographic markers (Egyptian izaay/keda/ba2a, Levantine
baddi/halla/shu, Maghrebi barcha/hshuma, Gulf wain/yalla, non-standard
orthography including hamza omission, tatweel, laughter tokens, etc.).
### Objectives
Training proceeds in two stages:
**Stage 1: Ranking pre-training (full dataset).** A margin ranking loss encourages sentences with higher hierarchical proximity to be closer in embedding space. Each ranking group of 16 sentences is structured so that every sentence has exactly one same-comment neighbor (r = 3), two same-author neighbors (r = 2), four same-dialect neighbors (r = 1), and eight cross-dialect neighbors (r = 0). The margin λ warms up linearly from 0 to 0.5 over the first 25k steps.
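An assumed formulation of this ranking objective, sketched in PyTorch; the paper's exact triple sampling and hinge arrangement may differ:

```python
import torch

# Proximity levels: r=3 same comment, r=2 same author,
# r=1 same dialect, r=0 cross-dialect.
def hierarchical_margin_loss(emb, levels, margin=0.5):
    """Hinge over (anchor, closer, farther) triples: a neighbor at a
    higher proximity level must be more cosine-similar to the anchor
    than any neighbor at a lower level, by at least `margin`."""
    sim = emb @ emb.T  # cosine similarities (embeddings are L2-normalized)
    n = emb.size(0)
    losses = []
    for a in range(n):
        for p in range(n):          # candidate closer-ranked neighbor
            for q in range(n):      # candidate farther-ranked neighbor
                if a in (p, q):
                    continue
                if levels[a][p] > levels[a][q]:
                    losses.append(torch.relu(margin - (sim[a, p] - sim[a, q])))
    return torch.stack(losses).mean()
```

The O(n³) loops are for clarity only; within a group of 16 the triple enumeration is cheap, and a vectorized version would be used in practice.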
**Stage 2: Feature-aware training (annotated subset, α = 0.5).** Three weighted losses are combined:
| Loss | Weight | Purpose |
|---|---|---|
| Margin ranking loss | 1 − α = 0.5 | Proximity-based ranking |
| Feature prediction BCE | 0.25 × α = 0.125 | Predict the 74 binary linguistic features |
| Supervised contrastive (Jaccard-weighted) | α = 0.5 | Feature-similarity alignment |
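A hedged sketch of the Jaccard-weighted supervised contrastive term, assuming each anchor's top-k Jaccard neighbors (k = 5, τ = 0.07 per the hyperparameter table) act as soft positives; the paper's exact weighting may differ:

```python
import torch

def jaccard_supcon(emb, feats, tau=0.07, top_k=5):
    """Supervised contrastive loss where positives are weighted by the
    Jaccard similarity of binary linguistic-feature vectors.
    emb: [n, d] L2-normalized embeddings; feats: [n, 74] binary floats."""
    n = emb.size(0)
    sim = emb @ emb.T / tau                           # temperature-scaled cosine
    inter = feats @ feats.T                           # |A ∩ B| for binary vectors
    union = feats.sum(1, keepdim=True) + feats.sum(1) - inter
    jac = inter / union.clamp(min=1)                  # Jaccard similarity matrix
    jac.fill_diagonal_(0)                             # no self-positives
    self_mask = torch.eye(n, dtype=torch.bool)
    log_prob = sim - torch.logsumexp(
        sim.masked_fill(self_mask, float("-inf")), dim=1, keepdim=True)
    # Keep only each anchor's top-k Jaccard neighbors as weighted positives.
    topk = torch.topk(jac, k=min(top_k, n - 1), dim=1)
    w = torch.zeros_like(jac).scatter_(1, topk.indices, topk.values)
    per_anchor = -(w * log_prob).sum(1) / w.sum(1).clamp(min=1e-8)
    return per_anchor.mean()
```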
A VICReg regularizer (weight 0.25) enforces variance ≥ 1 per dimension and decorrelates embedding dimensions to prevent anisotropy.
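A minimal sketch of that regularizer, assuming the standard VICReg variance-hinge plus covariance-penalty form:

```python
import torch

def vicreg_regularizer(emb, var_target=1.0, eps=1e-4):
    """VICReg-style regularizer: hinge each dimension's standard deviation
    up toward `var_target`, and penalize off-diagonal covariance so that
    embedding dimensions decorrelate (preventing anisotropic collapse)."""
    z = emb - emb.mean(dim=0)                        # center per dimension
    std = torch.sqrt(z.var(dim=0) + eps)
    var_loss = torch.relu(var_target - std).mean()   # hinge on per-dim std
    n, d = z.shape
    cov = (z.T @ z) / (n - 1)                        # [d, d] covariance
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_loss = off_diag.pow(2).sum() / d             # decorrelation term
    return var_loss + cov_loss
```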
## Hyperparameters
| Parameter | Value |
|---|---|
| Base model | aubmindlab/bert-base-arabertv2 |
| Hidden size | 768 |
| Transformer layers | 12 |
| Max sequence length | 512 |
| Batch size | 32 |
| Ranking group size | 16 |
| Feature vector dimension | 74 |
| Feature loss weight (α) | 0.5 |
| Contrastive temperature (τ) | 0.07 |
| Jaccard top-k | 5 |
| Learning rate | 1 × 10⁻⁵ |
| LR warmup | 25k steps |
| Optimizer | Adam |
| Margin (λ) | 0 → 0.5 (linear warmup) |
| Training GPUs | 4 |
| Max training time | ≤ 48 hrs |
## Performance

### Dialect Identification: MADAR 26 (25 city-level dialects + MSA)
| Model | F1 | Exact Match |
|---|---|---|
| IdioleX-AR | 0.43 | 0.43 |
| Finetuned IdioleX-AR | 0.61 | 0.61 |
| Finetuned IdioleX-AR + Lexical | 0.66 | 0.66 |
| Finetuned BERT (baseline) | 0.56 | 0.56 |
| Centroid Clustering w/ BERT | 0.20 | 0.22 |
| Samih et al., 2019 (neural, top shared task) | 0.59 | 0.59 |
| Abu Kwaik & Saad, 2019 (feature-engineered, best shared task) | 0.67 | 0.67 |
Finetuned IdioleX-AR outperforms the top neural submission to the MADAR 26 shared task (Samih et al., 0.59 F1) without any task-specific engineering; adding lexical features raises F1 to 0.66, within one point of the best feature-engineered system.
### Semantic Decoupling: MADAR 26 (parallel sentences)
Because MADAR 26 contains semantically equivalent translations across dialects, any differences in idiolectal similarity scores on this data must arise from linguistic form rather than content. IdioleX-AR shows a clear gradient:
| Grouping | Avg. Idiolectal Similarity |
|---|---|
| Same country (different city) | highest |
| Same dialectal region | intermediate |
| Different dialectal region | lowest |
Differences are statistically significant (p < 0.005), providing strong evidence of content-independent dialectal structure in the embedding space.
The Pearson correlation between IdioleX-AR idiolectal similarity scores and Multilingual-E5 semantic similarity scores on withheld Reddit test pairs is ρ = 0.19, indicating that idiolectal similarity is largely decoupled from semantic similarity.
### Arabic Dialect Generation: AMIYA Shared Task
IdioleX-AR embeddings also serve as a training objective for LLM post-training. Augmenting standard SFT with an IDIOLEX alignment loss consistently improves dialectal adherence (ADI2) while maintaining translation quality (ChrF++) across five Arabic varieties, using only ~50k training pairs.
| Model | Egyptian ADI2 | Moroccan ADI2 |
|---|---|---|
| Allam + IdioleX SFT | 0.48 | 0.54 |
| Allam + SFT only | 0.45 | 0.49 |
| Allam (base) | 0.28 | 0.26 |
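One plausible form of such an alignment term is to treat IdioleX-AR as a frozen style encoder and pull the embedding of each generation toward that of a gold dialectal reference. The function below is a hypothetical sketch under that assumption, not the paper's actual loss:

```python
import torch

def idiolex_alignment_loss(gen_emb: torch.Tensor, ref_emb: torch.Tensor) -> torch.Tensor:
    """Hypothetical SFT alignment term: 1 - cosine similarity between the
    IdioleX-AR embedding of the model's generation and that of a gold
    dialectal reference. Both inputs are assumed L2-normalized [batch, 768]."""
    return (1 - (gen_emb * ref_emb).sum(dim=-1)).mean()
```

Added to the token-level SFT loss with a small weight, a term like this rewards generations whose dialectal *form* matches the reference even when token overlap is low.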
## Usage
```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "AnjaliRuban/idiolex-arabertv2-ar",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("AnjaliRuban/idiolex-arabertv2-ar")

sentences = [
    "إزيك عامل إيه؟",  # Egyptian Arabic ("How are you, what's up?")
    "كيف حالك؟",      # MSA ("How are you?")
]

inputs = tokenizer(
    sentences,
    return_tensors="pt",
    truncation=True,
    padding=True,
    max_length=model.config.model_len,  # 512
)

with torch.no_grad():
    embeddings = model(**inputs)  # [2, 768], L2-normalized

# Cosine similarity (embeddings are L2-normalized, so dot product = cosine sim)
similarity = embeddings @ embeddings.T
print(similarity)
```
## Config
| Parameter | Value |
|---|---|
| `base_model` | aubmindlab/bert-base-arabertv2 |
| `embedding_dim` | 768 |
| `layerwise_pooling` | True |
| `num_layers` | 13 (12 transformer layers + embedding layer) |
| `layer_norm` | False |
| `layer_dropout` | None |
| `mean_center` | True |
| `model_len` | 512 |
## Custom files

This model uses `trust_remote_code=True`. The following files are hosted in this repo:
| File | Purpose |
|---|---|
| `configuration_idiolex.py` | `IdioleXConfig` |
| `modeling_idiolex.py` | `IdioleXModel` |
| `centering.py` | `MeanCenterer`: distributed running-mean buffer |
| `layer_pool.py` | `LayerwiseAttention`: scalar-mix of transformer layers |
| `pooling_utils.py` | `last_token_pool`, `average_pool` |
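The pooling helpers likely resemble standard mask-aware implementations; the sketch below is inferred from the function names alone and may differ from the hosted code:

```python
import torch

def average_pool(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Mask-aware mean over token embeddings: padded positions are zeroed
    and the sum is divided by the count of real tokens."""
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)      # [B, T, 1]
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)

def last_token_pool(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Embedding of each sequence's last non-padding token."""
    lengths = attention_mask.sum(dim=1) - 1                   # final real index
    return hidden[torch.arange(hidden.size(0)), lengths]
```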
## Citation
```bibtex
@article{kantharuban2025idiolex,
  title  = {IDIOLEX: Unified and Continuous Representations for Idiolectal and Stylistic Variation},
  author = {Kantharuban, Anjali and Srivastava, Aarohi and Faisal, Fahim and Ahia, Orevaoghene and
            Anastasopoulos, Antonios and Chiang, David and Tsvetkov, Yulia and Neubig, Graham},
  year   = {2025},
  note   = {Preprint, under review}
}
```