🌟 Roman Urdu Emotion Classifier — XLM-R v2

The First and Highest-Accuracy Open-Source Emotion Detection Model for Roman Urdu


Macro F1: 0.9896 · Accuracy: 98.96% · 7 Emotion Classes · 28,000 Training Samples

Trained on social media, WhatsApp chats, and comment data — the actual language 230 million people use.



Why This Model Matters

Roman Urdu is the dominant language of digital Pakistan — and one of the most underserved languages in NLP.

Over 230 million people speak Urdu as a first or second language. In digital spaces — WhatsApp, Twitter/X, Facebook, TikTok comments, YouTube — the overwhelming majority write in Roman Urdu: Urdu expressed in Latin script, without a standardized orthography, heavily mixed with English, and rich in slang, regional variation, and emotionally charged informal expression.

Despite this scale, Roman Urdu remains a severely low-resource language in NLP. The reasons are structural:

  • No standardized spelling — the same word can appear in dozens of valid transliterations
  • Aggressive code-switching between Urdu and English within single sentences
  • A near-total absence of labeled datasets at scale
  • Existing multilingual models (trained primarily on formal Urdu script) generalize poorly to informal Roman Urdu text

This model directly addresses that gap.

roman-urdu-emotion-xlmr-v2 is, to our knowledge, the first publicly available, high-accuracy, open-source emotion classification model for Roman Urdu on HuggingFace. It achieves 98.96% accuracy and 0.9896 Macro F1 across seven emotion classes on a human-validated test set — performance that is competitive with state-of-the-art emotion classifiers for high-resource languages like English.

This is not an incremental contribution. For a language with virtually no prior open-source emotion recognition tooling, this model represents a foundational resource for researchers, developers, and organizations working with Urdu-speaking populations.


Applications

The ability to automatically detect emotion in Roman Urdu text unlocks a wide range of impactful downstream applications:

Mental Health Monitoring

Depression, anxiety, and emotional distress manifest in language long before clinical intervention. This model enables:

  • Passive screening of social media activity for early signs of emotional distress in Urdu-speaking populations
  • Longitudinal tracking of emotional state changes in anonymized text data
  • Support tools for mental health researchers studying Pakistani and South Asian communities — populations historically underrepresented in clinical NLP research
  • Potential integration into counseling support platforms to flag high-distress conversations for human review

Social Media and Public Discourse Analysis

  • Real-time emotion monitoring of public discourse on Pakistani social media
  • Brand sentiment and emotion analysis for Urdu-speaking markets
  • Detection of emotionally charged misinformation or harmful content campaigns
  • Crisis response: identifying fear or anger spikes during public emergencies or natural disasters

Policy and Governance

  • Public opinion analysis of government communications and policy announcements
  • Monitoring community emotional response to news events, elections, and social issues
  • Understanding population emotional needs for targeted resource allocation

Low-Resource NLP Research

  • First benchmark model for Roman Urdu affective computing — a direct baseline for future work
  • Foundation for transfer learning to related low-resource languages (Hindi in Latin script, Punjabi, Sindhi)
  • Demonstrates the viability of continued fine-tuning pipelines for low-resource settings with limited labeled data

Conversational AI and Customer Experience

  • Emotion-aware chatbots and virtual assistants for Urdu-speaking users
  • Customer service systems that detect frustrated or distressed users for priority routing
  • Educational platforms that adapt content delivery based on detected student emotional state

Quick Start

from transformers import pipeline

pipe = pipeline(
    "text-classification",
    model="Khubaib01/roman-urdu-emotion-xlmr-v2",
    trust_remote_code=True,   # required — model uses a custom classification head
    top_k=None,               # returns scores for all 7 classes
)

# Single prediction — all scores
pipe("yaar mun preshan ho rha")

# Top prediction only
result = pipe("bohat khushi ho rhi hai aaj!")
top = max(result[0], key=lambda x: x["score"])
print(f"{top['label']}: {top['score']:.4f}")
# happy: 0.9901

# Batch prediction
texts = [
    "mujhe dar lag rha hai",
    "ye sab dekh ke dil bahut dukha",
    "acha! ye toh maine socha bhi nahi tha",
    "theek hai, koi baat nahi",
]
results = pipe(texts)
for text, scores in zip(texts, results):
    top = max(scores, key=lambda x: x["score"])
    print(f"{top['label']:10} ({top['score']:.3f})  →  {text}")
# fear       (0.987)  →  mujhe dar lag rha hai
# sad        (0.983)  →  ye sab dekh ke dil bahut dukha
# surprise   (0.990)  →  acha! ye toh maine socha bhi nahi tha
# none       (0.998)  →  theek hai, koi baat nahi

Note on trust_remote_code=True: This flag is required because the model uses a custom two-layer MLP classification head rather than the standard HuggingFace linear classifier. The full architecture code (emotion_model.py) is included in this repository and is fully auditable before use.


Labels

Seven classes — Ekman's six universal basic emotions plus a none class for emotionally neutral content.

ID Label Urdu Equivalent Description Example (Roman Urdu)
0 anger غصہ (Gussa) Frustration, rage, irritation yaar mujhe bahut gussa aa rha hai
1 disgust نفرت (Nafrat) Revulsion, strong disapproval ugh ye cheez bilkul pasand nahi
2 fear ڈر (Dar) Anxiety, dread, apprehension mujhe dar lag rha hai is cheez se
3 happy خوشی (Khushi) Joy, happiness, contentment, delight bohat khushi ho rhi hai aaj!
4 sad اداسی (Udaasi) Grief, sorrow, loss, disappointment ye sab dekh ke dil bahut dukha
5 surprise حیرت (Hairat) Astonishment — positive or negative acha! ye toh maine socha bhi nahi
6 none غیر جذباتی (Neutral) Neutral / no dominant emotional signal theek hai, jo hoga dekha jaega
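For downstream code, the table above corresponds to the following id-to-label mapping. This is a convenience sketch mirroring the table; the loaded model's `config.id2label` should be treated as authoritative.

```python
# Label mapping as listed in the table above (a convenience sketch;
# in practice, read model.config.id2label from the loaded model).
ID2LABEL = {
    0: "anger",
    1: "disgust",
    2: "fear",
    3: "happy",
    4: "sad",
    5: "surprise",
    6: "none",
}
LABEL2ID = {label: i for i, label in ID2LABEL.items()}

print(ID2LABEL[3])       # happy
print(LABEL2ID["none"])  # 6
```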

Performance

All evaluation metrics are computed on a held-out test set of 2,801 samples that were withheld from training and validation entirely. Each sample in the test set was independently reviewed by human validators with native Roman Urdu proficiency before inclusion, ensuring that the evaluation reflects real annotation quality rather than automated labeling noise.

Overall Metrics

Metric Score
Accuracy 0.9896
Macro F1 0.9896
Weighted F1 0.9896
Macro Precision 0.9896
Macro Recall 0.9896

Per-Class Results

Class Precision Recall F1-Score Support
anger 0.9975 1.0000 0.9988 401
disgust 0.9823 0.9725 0.9774 400
fear 0.9874 0.9825 0.9850 400
happy 0.9901 1.0000 0.9950 400
sad 0.9800 0.9825 0.9813 400
surprise 0.9900 0.9900 0.9900 400
none 1.0000 1.0000 1.0000 400
macro avg 0.9896 0.9896 0.9896 2801
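Because the test set is nearly balanced, the macro average is simply the unweighted mean of the per-class scores. As a quick sanity check, the reported macro F1 can be recovered from the per-class F1 column above:

```python
# Per-class F1 scores copied from the table above.
per_class_f1 = {
    "anger": 0.9988,
    "disgust": 0.9774,
    "fear": 0.9850,
    "happy": 0.9950,
    "sad": 0.9813,
    "surprise": 0.9900,
    "none": 1.0000,
}

# Macro F1 is the unweighted mean over classes: every class counts
# equally, regardless of its support.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(round(macro_f1, 4))  # 0.9896
```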

Key Observations

Perfect scores on none (F1 = 1.000): The model completely and cleanly separates neutral text from all emotional categories. This is a critical capability for real-world deployment where the majority of messages in any corpus will be emotionally neutral — a poor none classifier pollutes all other class predictions.

Perfect recall on anger (Recall = 1.000, F1 = 0.9988): The model does not miss a single angry text in the test set. In mental health monitoring and crisis detection contexts, this zero-miss behavior on anger is especially valuable — false negatives in distress detection carry higher cost than false positives.

Perfect scores on none and near-perfect on happy and anger: These three classes together account for over 57% of real-world text in informal corpora (neutral, positive, or overtly negative). The model's strongest performance on these highest-frequency categories ensures robust behavior at deployment scale.

Lowest F1 on disgust (0.977): This is consistent with findings across the affective computing literature — disgust and anger share substantial lexical overlap in informal text and represent the hardest emotion pair to separate, even for human annotators. An F1 of 0.977 on disgust is still an exceptionally strong result for this class in a low-resource language setting.

Balanced support across all classes (~400 per class): The near-equal class distribution in the test set means macro F1 = weighted F1 = accuracy = 0.9896 — a meaningful alignment confirming these scores are not inflated by any dominant class.


Visualizations

Benchmark visualizations will be uploaded soon, together with the accompanying paper.

Per-Class F1 Bar Chart

Per-class F1 bar chart


Confusion Matrix

Normalized confusion matrix


Architecture

The model wraps XLM-RoBERTa-base with a custom two-layer MLP classification head that replaces the standard single-linear classifier in HuggingFace's default XLMRobertaForSequenceClassification.

Input: Roman Urdu text
  (tokenized via XLM-R SentencePiece BPE, vocab=250,002, max=512 tokens)
         │
         ▼
┌────────────────────────────────────────────────────┐
│            XLM-RoBERTa-base Encoder                │
│   12 transformer layers  ·  hidden size = 768      │
│   12 attention heads  ·  ~270M parameters          │
│   vocab: 250,002 (multilingual SentencePiece)      │
│   position embeddings: 514 (XLM-R convention)      │
└────────────────────────────────────────────────────┘
         │
         │   [CLS] token representation  (batch × 768)
         ▼
┌────────────────────────────────────────────────────┐
│            Emotion Classification Head             │
│                                                    │
│   LayerNorm(768)                                   │
│        ↓                                           │
│   Dropout(0.35)                                    │
│        ↓                                           │
│   Linear(768 → 256)                                │
│        ↓                                           │
│   GELU activation                                  │
│        ↓                                           │
│   Dropout(0.175)                                   │
│        ↓                                           │
│   Linear(256 → 7)                                  │
└────────────────────────────────────────────────────┘
         │
         ▼
   Emotion logits  (batch × 7)
   → softmax → predicted class + confidence scores

Why a two-layer head? The standard single-layer Linear(768 → 7) collapses all representational transformation into one linear step. For Roman Urdu emotion classification, an intermediate non-linear projection is beneficial because: (1) several emotion classes share substantial surface-level lexical overlap (particularly anger/disgust and fear/sadness); (2) informal text produces highly variable surface forms for the same underlying emotional content; and (3) the GELU projection stage learns a compact emotion-relevant subspace from the full 768-dimensional encoder representation before the final seven-way classification boundary is drawn. This design was validated against a single-layer baseline during v1 development.
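As a sketch, the head described above can be expressed in PyTorch. Layer sizes and dropout rates are taken from the diagram; the repository's emotion_model.py is the authoritative implementation.

```python
import torch
from torch import nn

# Sketch of the classification head in the diagram above (not the
# repository's exact code; see emotion_model.py for the real thing).
emotion_head = nn.Sequential(
    nn.LayerNorm(768),    # normalize the [CLS] representation
    nn.Dropout(0.35),
    nn.Linear(768, 256),  # project into a compact emotion-relevant subspace
    nn.GELU(),
    nn.Dropout(0.175),
    nn.Linear(256, 7),    # seven-way emotion logits
)

# Shape check on a dummy batch of two [CLS] vectors.
emotion_head.eval()  # disable dropout for a deterministic pass
logits = emotion_head(torch.randn(2, 768))
print(logits.shape)  # torch.Size([2, 7])

# Parameter count: LayerNorm (1,536) + Linear(768->256) (196,864)
# + Linear(256->7) (1,799) = 200,199 -- on the order of the ~197k
# reported in the component table below.
n_params = sum(p.numel() for p in emotion_head.parameters())
```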

Component Parameters
XLM-R encoder ~270M
Emotion head ~197k
Total ~270.2M

Training Details

Model Lineage

xlm-roberta-base
    │  HuggingFace — 12 layers, 270M params, 100+ languages
    ▼
Khubaib01/roman-urdu-sentiment-xlm-r
    │  Sentiment fine-tune on Roman Urdu
    ▼
Khubaib01/roman-urdu-emotion-xlmr          ← v1  (21k samples)
    │  Emotion fine-tune, first version
    ▼
Khubaib01/roman-urdu-emotion-xlmr-v2       ← v2  (28k samples, this model)
    Continued fine-tune on expanded corpus

Each stage transfers progressively more task-specific and language-specific knowledge. This lineage allows v2 to achieve near-perfect performance with conservative encoder learning rates that preserve learned representations rather than overwriting them.

Hyperparameters

Parameter Value Rationale
Seed 42 Full reproducibility
Max epochs 10 With early stopping (patience = 3)
Early stopping patience 3 Halts if validation F1 stagnates for 3 consecutive epochs
Train batch size 16
Eval batch size 32
Encoder LR 5e-6 Conservative — warm-started from v1, avoids catastrophic forgetting
Head LR 3e-5 6× encoder LR; head adapts faster to expanded data distribution
LR layer-wise decay 0.90 Lower encoder layers updated less aggressively
Weight decay 0.02 L2 regularization; increased vs v1 (0.01) for larger corpus
Warmup ratio 0.10 10% of total steps for smooth LR ramp-up
Max gradient norm 1.0 Gradient clipping for training stability
Dropout 0.35 Slightly higher than v1 (0.30) to counter larger training set
Label smoothing 0.10 Prevents overconfidence on noisy social media annotations
Mixed precision fp16 NVIDIA GPU training
LR scheduler Cosine with linear warmup
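The early-stopping rule above (halt after 3 consecutive epochs without a validation-F1 improvement) amounts to a simple patience counter. The F1 trajectory below is hypothetical, purely for illustration:

```python
def run_with_early_stopping(f1_by_epoch, patience=3):
    """Return (best_f1, epochs_run): stop once validation F1 has
    failed to improve for `patience` consecutive epochs."""
    best_f1, epochs_without_improvement = float("-inf"), 0
    for epoch, f1 in enumerate(f1_by_epoch, start=1):
        if f1 > best_f1:
            best_f1, epochs_without_improvement = f1, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_f1, epoch  # stop early
    return best_f1, len(f1_by_epoch)

# Hypothetical validation-F1 trajectory: improvement stalls after epoch 3,
# so training halts at epoch 6 instead of running all 10 epochs.
best, stopped_at = run_with_early_stopping(
    [0.95, 0.97, 0.98, 0.979, 0.978, 0.977, 0.99, 0.99, 0.99, 0.99]
)
print(best, stopped_at)  # 0.98 6
```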

Layer-wise Learning Rate Decay

Rather than applying a uniform learning rate across the entire encoder, a layer-wise decay of 0.90 is applied so lower transformer layers receive proportionally smaller updates. For layer l counted from the top output layer downward:

LR(l) = BASE_LR × (0.90)^l = 5e-6 × (0.90)^l

Lower layers encode general linguistic structure (subword morphology, syntax) that transfers across tasks and should be minimally disturbed. Upper layers encode higher-level task-specific semantics and receive rates closer to BASE_LR. The classification head receives HEAD_LR = 3e-5, six times the encoder base rate.
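A minimal sketch of the resulting per-layer rates, using the values above (l = 0 is the topmost encoder layer):

```python
BASE_LR, DECAY, NUM_LAYERS = 5e-6, 0.90, 12

# LR(l) = BASE_LR * DECAY**l, with l counted from the top layer down.
layer_lrs = [BASE_LR * DECAY**l for l in range(NUM_LAYERS)]

print(f"top layer:    {layer_lrs[0]:.2e}")   # 5.00e-06
print(f"next layer:   {layer_lrs[1]:.2e}")   # 4.50e-06
print(f"bottom layer: {layer_lrs[-1]:.2e}")  # 1.57e-06

HEAD_LR = 3e-5  # classification head: 6x the encoder base rate
```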

Loss Function

Cross-entropy with label smoothing (ε = 0.10). Label smoothing distributes a fraction ε of the target probability mass uniformly across non-target classes, preventing the model from becoming pathologically overconfident on noisy, user-generated training data, and improving output calibration at inference time.
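A minimal sketch of the smoothed target distribution for K = 7 classes and ε = 0.10, using PyTorch-style smoothing ((1 − ε) · one-hot + ε/K, as in `torch.nn.CrossEntropyLoss(label_smoothing=0.1)`). The predicted probabilities below are hypothetical:

```python
import math

NUM_CLASSES, EPS = 7, 0.10

def smoothed_targets(true_class, num_classes=NUM_CLASSES, eps=EPS):
    """PyTorch-style label smoothing: (1 - eps) * one_hot + eps / K."""
    return [
        (1 - eps) * (1.0 if c == true_class else 0.0) + eps / num_classes
        for c in range(num_classes)
    ]

targets = smoothed_targets(true_class=3)  # class 3 = "happy"
# True class keeps 1 - eps + eps/K; every other class gets eps/K.
print(round(targets[3], 4))  # 0.9143
print(round(targets[0], 4))  # 0.0143

# Cross-entropy against the smoothed distribution for some hypothetical
# predicted probabilities (must sum to 1):
probs = [0.01, 0.01, 0.01, 0.93, 0.01, 0.01, 0.02]
loss = -sum(t * math.log(p) for t, p in zip(targets, probs))
```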


Dataset

Property v1 v2 (this model)
Training samples 21,000 28,000 (+33%)
Test samples 2,801 (human-validated)
Emotion classes 7 7
Approx. class balance ~3,000/class ~4,000/class
Human validation Yes Yes (expanded team)

Sources: Public social media platforms, public comment sections, WhatsApp-style conversational text corpora. All data was collected from publicly available sources. No personally identifiable information is included.

Corpus language characteristics:

  • Orthographic variability: the same word appears in multiple valid transliterations (khushi, khushee, khushy, khushii)
  • Code-switching: frequent natural mixing of Roman Urdu and English within single utterances
  • Informal register: abbreviations, slang, non-standard punctuation, emoticons, sentence fragments
  • Platform diversity: multiple source platforms to improve domain generalization



Comparison with Related Work

A full benchmark comparison against multilingual baselines and existing Urdu NLP models, inter-annotator agreement (IAA) statistics, and ablation studies are in preparation and will be published in the accompanying research paper.

Preliminary positioning:

Model Language Task Macro F1 Open Source
roman-urdu-emotion-xlmr-v2 (ours) Roman Urdu 7-class emotion 0.9896 Yes
roman-urdu-emotion-xlmr v1 (ours) Roman Urdu 7-class emotion — Yes
xlm-roberta-base (zero-shot) Multilingual 7-class emotion ~0.40–0.55* Yes
Formal Urdu baselines — see paper

*Estimated zero-shot upper bound on informal Roman Urdu — formal evaluation results in the paper.

To our knowledge, no other open-source model on HuggingFace achieves comparable accuracy on Roman Urdu emotion classification at this level of granularity. Prior Urdu NLP work has focused primarily on formal Urdu script (Nastaliq) and binary/ternary sentiment polarity — leaving fine-grained emotion classification in Roman Urdu unaddressed at production-grade performance.


Team

Role Name
Project Lead & ML Engineer Muhammad Khubaib Ahmad
Data Manager Khadija Faisal
Annotator & Validator Ayesha Khalid
Annotator & Validator Muzammil Shadab
Annotator & Validator Faiez Ahmad

Muhammad Khubaib Ahmad conceived the overall research direction, designed the complete data collection, annotation, and validation workflow, developed the full modeling and training pipeline, and led all engineering work across both v1 and v2. This project represents an ongoing commitment to building open, high-quality NLP infrastructure for the Roman Urdu language community.


Limitations

Domain specificity. Trained on informal digital text. Performance may degrade on formal Roman Urdu, literary text, news writing, or highly technical content not represented in the training corpus.

Orthographic coverage. While the training corpus covers the most common spelling variants, novel or region-specific transliterations of rare words may reduce model confidence. This is a structural challenge inherent to any Roman Urdu NLP system in the absence of a standardized orthography.

Disgust–anger boundary. The disgust class carries the lowest F1 (0.977), consistent with the documented challenge of separating disgust from anger in informal text — a difficulty that persists even for human annotators. Applications requiring high-precision disgust detection should account for this boundary.

Sarcasm and irony. Sarcastic expressions common in Pakistani social media may be misclassified because the surface text does not carry the intended emotional signal. Sarcasm detection is a distinct task not addressed by this model.

Code-switching generalization. The model handles Roman Urdu–English code-switching patterns present in training data. Highly English-dominant text or unusual code-switching structures may produce lower-confidence predictions.

Geographic and demographic scope. Training data was primarily collected from Pakistani digital platforms. Urdu-speaking communities in India, the diaspora, or other regions may use lexical and stylistic patterns underrepresented in the corpus.

Not a clinical tool. While this model has clear applications in mental health research, it is a text classifier and has not been validated for clinical decision-making. Any deployment in mental health-adjacent systems requires appropriate clinical oversight and independent ethical review.


Upcoming Work

  • Research paper (in preparation): Full benchmark evaluation against multilingual baselines, inter-annotator agreement (IAA) analysis, ablation studies on architecture and training decisions, and error analysis on the disgust–anger boundary
  • Emotion corpus release: Khubaib01/RomanUrdu-NLP-Emotion-Corpus — the 28k human-labeled training corpus will be released publicly to support reproducible research and downstream development
  • Sarcasm-aware variant: Planned extension incorporating sarcasm detection to improve performance at the disgust–anger boundary
  • Interactive demo: HuggingFace Spaces demo for real-time testing

Citation

If you use this model in your research, please cite:

@misc{muhammad_khubaib_ahmad_2026,
    author       = { Muhammad Khubaib Ahmad and Khadija Faisal },
    title        = { roman-urdu-emotion-xlmr-v2 (Revision 7cd7dd2) },
    year         = 2026,
    url          = { https://huggingface.co/Khubaib01/roman-urdu-emotion-xlmr-v2 },
    doi          = { 10.57967/hf/8347 },
    publisher    = { Hugging Face },
    note         = { Accompanying research paper in preparation }
}

An updated BibTeX entry for the research paper will be added to this card upon publication.


License

Released under the Apache License 2.0.

You are free to use, modify, and distribute this model for both research and commercial purposes with attribution. See LICENSE for full terms.


Built for Roman Urdu. Built for scale. Built open.

If this work is useful to your research or application, please consider starring the repository and citing the paper when it is published.
