Levantine Arabic Incitement Detector (CLS + Mean + Max)

This repository contains a fine-tuned MARBERTv2 model with a custom pooled architecture for 3-way classification of Levantine Arabic social media text:

  • normal
  • abusive
  • incitement

The training setup uses:

  • class-balanced cross-entropy
  • asymmetric error cost for incitement mistakes
  • a small ordinal penalty
  • an auxiliary lexicon head based on incitement.csv
  • pooled sentence representation: CLS + mean pooling + max pooling
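The exact training objective is not published in this card. The sketch below is a hypothetical reconstruction of how the first three terms could combine; `incitement_loss`, `lambda_ordinal`, and `incite_miss_cost` are illustrative names and values, and the real implementation may formulate or weight the terms differently.

```python
import torch
import torch.nn.functional as F

def incitement_loss(logits, labels, class_weights,
                    lambda_ordinal=0.1, incite_miss_cost=2.0):
    """Illustrative combined loss (labels: 0=normal, 1=abusive, 2=incitement)."""
    # Class-balanced cross-entropy: per-class weights counteract label imbalance.
    ce = F.cross_entropy(logits, labels, weight=class_weights, reduction="none")
    # Asymmetric error cost: examples whose gold label is incitement are
    # up-weighted, so missing incitement costs more than a false alarm.
    cost = torch.where(labels == 2,
                       torch.full_like(ce, incite_miss_cost),
                       torch.ones_like(ce))
    # Small ordinal penalty: distance between the expected class (under the
    # softmax) and the gold class on the normal < abusive < incitement scale.
    probs = logits.softmax(dim=-1)
    expected = (probs * torch.arange(3, device=logits.device)).sum(dim=-1)
    ordinal = (expected - labels.float()).abs()
    return (cost * ce + lambda_ordinal * ordinal).mean()
```

With this formulation, confidently predicting "normal" on an incitement example incurs a strictly larger loss than the mirror-image mistake of predicting "incitement" on a normal example.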

Validation Summary

These are the k-fold cross-validation averages used to select this configuration:

| Metric | Value |
|---|---|
| Accuracy | 82.40% |
| F1 (macro) | 0.8025 |
| F1 (incitement) | 0.7752 |

Labels

| ID | Label |
|---|---|
| 0 | normal |
| 1 | abusive |
| 2 | incitement |

Confusion Matrix

The confusion matrix shipped with this repository is a training-set sanity check for the final model trained on all data; it is not an unbiased test result.

How to Load

This is a custom model wrapper, so load it with the provided model.pt weights plus the MARBERTv2 encoder and tokenizer.

import torch
import re
import unicodedata
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download

# --- 1. CONFIGURATION & SETUP ---
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print('device:', DEVICE)
MAX_LENGTH = 160
USE_NORMALIZED_TEXT_FOR_MODEL = False
REPO_ID = "amitca71/marbertv2-levantine-incitement-detector-cls-mean-max"
ARABIC_DIACRITICS = re.compile(r'[\u0610-\u061A\u064B-\u065F\u0670\u06D6-\u06ED]')

def normalize_arabic(text: str) -> str:
    text = unicodedata.normalize('NFKC', text or '').strip()
    text = ARABIC_DIACRITICS.sub('', text)
    text = text.replace('أ', 'ا').replace('إ', 'ا').replace('آ', 'ا')
    text = text.replace('ى', 'ي').replace('ؤ', 'و').replace('ئ', 'ي').replace('ة', 'ه')
    text = re.sub(r'[^\w\s#@/]', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip().lower()

def text_for_model(text: str, use_normalized: bool = USE_NORMALIZED_TEXT_FOR_MODEL) -> str:
    return normalize_arabic(text) if use_normalized else (text or '').strip()

# --- 2. LOAD MODEL & TOKENIZER ---
print("Loading model and tokenizer...")
tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/MARBERTv2")

# Initialize custom model (Ensure MarbertMultiTask is defined in your script before this)
model = MarbertMultiTask("UBC-NLP/MARBERTv2", pooling_strategy="cls_mean_max")

# Download and load weights
model_path = hf_hub_download(repo_id=REPO_ID, filename="model.pt")
model.load_state_dict(torch.load(model_path, map_location=DEVICE, weights_only=True))
model.to(DEVICE)
model.eval()

# Package everything into the bundle expected by the function
bundle = {
    "model": model,
    "tokenizer": tokenizer,
    "label_map": {0: "normal", 1: "abusive", 2: "incitement"}
}

# --- 3. PREDICTION FUNCTION ---
def predict_one(bundle, text: str):
    encoded = bundle["tokenizer"](
        text_for_model(text),
        truncation=True,
        padding="max_length",
        max_length=MAX_LENGTH,
        return_tensors="pt"
    ).to(DEVICE)

    with torch.no_grad():
        out = bundle["model"](**encoded)

    # Handle different forward() return types
    if isinstance(out, dict):
        logits = out["logits"]
        lexicon_logits = out.get("lexicon_logits", None)
    elif isinstance(out, tuple):
        logits = out[0]
        lexicon_logits = out[1] if len(out) > 1 else None
    else:
        logits = out
        lexicon_logits = None

    probs = torch.softmax(logits, dim=-1).squeeze(0).tolist()
    pred_id = probs.index(max(probs))

    response = {
        "pred_label": bundle["label_map"][pred_id],
        "confidence": probs[pred_id],
        "prob_normal": probs[0],
        "prob_abusive": probs[1],
        "prob_incitement": probs[2],
    }

    if lexicon_logits is not None:
        response["lexicon_signal_prob"] = torch.sigmoid(lexicon_logits).squeeze(0).item()

    return response

# --- 4. EXECUTE ONE CALL ---
sample_text = "انت يا عميل السفارات يا ابن الكلب حسابك عسير"
response = predict_one(bundle, sample_text)

print("\nPrediction Response:")
print(response)
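The loading code above assumes a `MarbertMultiTask` class is already defined; the actual class is not included in this card. The following is a minimal sketch of a wrapper that satisfies the loading and prediction code, including the CLS + mean + max pooling. Layer names and sizes are assumptions and must match the keys in model.pt for `load_state_dict` to succeed.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class MarbertMultiTask(nn.Module):
    """Hypothetical reconstruction: encoder + CLS/mean/max pooling,
    a 3-way classifier head, and an auxiliary lexicon head."""

    def __init__(self, encoder_name: str, pooling_strategy: str = "cls_mean_max",
                 num_labels: int = 3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.pooling_strategy = pooling_strategy
        hidden = self.encoder.config.hidden_size
        self.classifier = nn.Linear(3 * hidden, num_labels)  # main 3-way head
        self.lexicon_head = nn.Linear(3 * hidden, 1)         # auxiliary lexicon head

    @staticmethod
    def _pool(hidden_states, attention_mask):
        # Concatenate [CLS], padding-aware mean, and padding-aware max.
        mask = attention_mask.unsqueeze(-1).float()
        cls_vec = hidden_states[:, 0]
        mean_vec = (hidden_states * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
        max_vec = hidden_states.masked_fill(mask == 0, float("-inf")).max(1).values
        return torch.cat([cls_vec, mean_vec, max_vec], dim=-1)

    def forward(self, input_ids, attention_mask, **kwargs):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        pooled = self._pool(hidden, attention_mask)
        return {"logits": self.classifier(pooled),
                "lexicon_logits": self.lexicon_head(pooled)}
```

The dict return matches the first branch of the `predict_one` dispatch above.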

Limitations

  • The incitement class is built from open-source proxy labels, not a dedicated gold-standard incitement annotation project.
  • Performance is best on Levantine political/social text and may degrade on other Arabic varieties or platforms.
  • The confusion matrix in this card is from the full-data training run and should not be treated as held-out evaluation.
Model size: 0.2B parameters (F32, Safetensors).