🌟 Roman Urdu Emotion Classifier — XLM-R v2

The First and Highest-Accuracy Open-Source Emotion Detection Model for Roman Urdu


Macro F1: 0.9896 · Accuracy: 98.96% · 7 Emotion Classes · 28,000 Training Samples

Trained on social media, WhatsApp chats, and comment data — the actual language 230 million people use.



Why This Model Matters

Roman Urdu is the dominant language of digital Pakistan — and one of the most underserved languages in NLP.

Over 230 million people speak Urdu as a first or second language. In digital spaces — WhatsApp, Twitter/X, Facebook, TikTok comments, YouTube — the overwhelming majority write in Roman Urdu: Urdu expressed in Latin script, without a standardized orthography, heavily mixed with English, and rich in slang, regional variation, and emotionally charged informal expression.

Despite this scale, Roman Urdu remains a severely low-resource language in NLP. The reasons are structural:

  • No standardized spelling — the same word can appear in dozens of valid transliterations
  • Aggressive code-switching between Urdu and English within single sentences
  • A near-total absence of labeled datasets at scale
  • Existing multilingual models (trained primarily on formal Urdu script) generalize poorly to informal Roman Urdu text

This model directly addresses that gap.

roman-urdu-emotion-xlmr-v2 is, to our knowledge, the first publicly available, high-accuracy, open-source emotion classification model for Roman Urdu on HuggingFace. It achieves 98.96% accuracy and 0.9896 Macro F1 across seven emotion classes on a human-validated test set — performance that is competitive with state-of-the-art emotion classifiers for high-resource languages like English.

This is not an incremental contribution. For a language with virtually no prior open-source emotion recognition tooling, this model represents a foundational resource for researchers, developers, and organizations working with Urdu-speaking populations.


Applications

The ability to automatically detect emotion in Roman Urdu text unlocks a wide range of impactful downstream applications:

Mental Health Monitoring

Depression, anxiety, and emotional distress manifest in language long before clinical intervention. This model enables:

  • Passive screening of social media activity for early signs of emotional distress in Urdu-speaking populations
  • Longitudinal tracking of emotional state changes in anonymized text data
  • Support tools for mental health researchers studying Pakistani and South Asian communities — populations historically underrepresented in clinical NLP research
  • Potential integration into counseling support platforms to flag high-distress conversations for human review

Social Media and Public Discourse Analysis

  • Real-time emotion monitoring of public discourse on Pakistani social media
  • Brand sentiment and emotion analysis for Urdu-speaking markets
  • Detection of emotionally charged misinformation or harmful content campaigns
  • Crisis response: identifying fear or anger spikes during public emergencies or natural disasters

Policy and Governance

  • Public opinion analysis of government communications and policy announcements
  • Monitoring community emotional response to news events, elections, and social issues
  • Understanding population emotional needs for targeted resource allocation

Low-Resource NLP Research

  • First benchmark model for Roman Urdu affective computing — a direct baseline for future work
  • Foundation for transfer learning to related low-resource languages (Hindi in Latin script, Punjabi, Sindhi)
  • Demonstrates the viability of continued fine-tuning pipelines for low-resource settings with limited labeled data

Conversational AI and Customer Experience

  • Emotion-aware chatbots and virtual assistants for Urdu-speaking users
  • Customer service systems that detect frustrated or distressed users for priority routing
  • Educational platforms that adapt content delivery based on detected student emotional state

Quick Start

from transformers import pipeline

pipe = pipeline(
    "text-classification",
    model="Khubaib01/roman-urdu-emotion-xlmr-v2",
    trust_remote_code=True,   # required — model uses a custom classification head
    top_k=None,               # returns scores for all 7 classes
)

# Single prediction — all scores
pipe("yaar mun preshan ho rha")

# Top prediction only
result = pipe("bohat khushi ho rhi hai aaj!")
top = max(result[0], key=lambda x: x["score"])
print(f"{top['label']}: {top['score']:.4f}")
# happy: 0.9901

# Batch prediction
texts = [
    "mujhe dar lag rha hai",
    "ye sab dekh ke dil bahut dukha",
    "acha! ye toh maine socha bhi nahi tha",
    "theek hai, koi baat nahi",
]
results = pipe(texts)
for text, scores in zip(texts, results):
    top = max(scores, key=lambda x: x["score"])
    print(f"{top['label']:10} ({top['score']:.3f})  →  {text}")
# fear       (0.987)  →  mujhe dar lag rha hai
# sad        (0.983)  →  ye sab dekh ke dil bahut dukha
# surprise   (0.990)  →  acha! ye toh maine socha bhi nahi tha
# none       (0.998)  →  theek hai, koi baat nahi

Note on trust_remote_code=True: This flag is required because the model uses a custom two-layer MLP classification head rather than the standard HuggingFace linear classifier. The full architecture code (emotion_model.py) is included in this repository and is fully auditable before use.


Labels

Seven classes — Ekman's six universal basic emotions plus a none class for emotionally neutral content.

ID Label Urdu Equivalent Description Example (Roman Urdu)
0 anger غصہ (Gussa) Frustration, rage, irritation yaar mujhe bahut gussa aa rha hai
1 disgust نفرت (Nafrat) Revulsion, strong disapproval ugh ye cheez bilkul pasand nahi
2 fear ڈر (Dar) Anxiety, dread, apprehension mujhe dar lag rha hai is cheez se
3 happy خوشی (Khushi) Joy, happiness, contentment, delight bohat khushi ho rhi hai aaj!
4 sad اداسی (Udaasi) Grief, sorrow, loss, disappointment ye sab dekh ke dil bahut dukha
5 surprise حیرت (Hairat) Astonishment — positive or negative acha! ye toh maine socha bhi nahi
6 none غیر جذباتی (Neutral) Neutral / no dominant emotional signal theek hai, jo hoga dekha jaega
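For downstream code, the table above corresponds to the following id-to-label mapping. This is a convenience sketch mirroring the table; the loaded model's `config.id2label` should be treated as authoritative.

```python
# Label mapping as listed in the table above (a convenience sketch;
# in practice, read model.config.id2label from the loaded model).
ID2LABEL = {
    0: "anger",
    1: "disgust",
    2: "fear",
    3: "happy",
    4: "sad",
    5: "surprise",
    6: "none",
}
LABEL2ID = {label: i for i, label in ID2LABEL.items()}

print(ID2LABEL[3])       # happy
print(LABEL2ID["none"])  # 6
```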

Performance

All evaluation metrics are computed on a held-out test set of 2,801 samples that were withheld from training and validation entirely. Each sample in the test set was independently reviewed by human validators with native Roman Urdu proficiency before inclusion, ensuring that the evaluation reflects real annotation quality rather than automated labeling noise.

Overall Metrics

Metric Score
Accuracy 0.9896
Macro F1 0.9896
Weighted F1 0.9896
Macro Precision 0.9896
Macro Recall 0.9896

Per-Class Results

Class Precision Recall F1-Score Support
anger 0.9975 1.0000 0.9988 401
disgust 0.9823 0.9725 0.9774 400
fear 0.9874 0.9825 0.9850 400
happy 0.9901 1.0000 0.9950 400
sad 0.9800 0.9825 0.9813 400
surprise 0.9900 0.9900 0.9900 400
none 1.0000 1.0000 1.0000 400
macro avg 0.9896 0.9896 0.9896 2801
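Because the test set is nearly balanced, the macro average is simply the unweighted mean of the per-class scores. As a quick sanity check, the reported macro F1 can be recovered from the per-class F1 column above:

```python
# Per-class F1 scores copied from the table above.
per_class_f1 = {
    "anger": 0.9988,
    "disgust": 0.9774,
    "fear": 0.9850,
    "happy": 0.9950,
    "sad": 0.9813,
    "surprise": 0.9900,
    "none": 1.0000,
}

# Macro F1 is the unweighted mean over classes: every class counts
# equally, regardless of its support.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(round(macro_f1, 4))  # 0.9896
```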

Key Observations

Perfect scores on none (F1 = 1.000): The model completely and cleanly separates neutral text from all emotional categories. This is a critical capability for real-world deployment where the majority of messages in any corpus will be emotionally neutral — a poor none classifier pollutes all other class predictions.

Perfect recall on anger (Recall = 1.000, F1 = 0.9988): The model does not miss a single angry text in the test set. In mental health monitoring and crisis detection contexts, this zero-miss behavior on anger is especially valuable — false negatives in distress detection carry higher cost than false positives.

Perfect scores on none and near-perfect on happy and anger: These three classes together account for over 57% of real-world text in informal corpora (neutral, positive, or overtly negative). The model's strongest performance on these highest-frequency categories ensures robust behavior at deployment scale.

Lowest F1 on disgust (0.977): This is consistent with findings across the affective computing literature — disgust and anger share substantial lexical overlap in informal text and represent the hardest emotion pair to separate, even for human annotators. An F1 of 0.977 on disgust is still an exceptionally strong result for this class in a low-resource language setting.

Balanced support across all classes (~400 per class): The near-equal class distribution in the test set means macro F1 = weighted F1 = accuracy = 0.9896 — a meaningful alignment confirming these scores are not inflated by any dominant class.


Visualizations

Benchmark visualizations will be uploaded soon, together with the accompanying paper.

Per-Class F1 Bar Chart

Per-class F1 bar chart


Confusion Matrix

Normalized confusion matrix


Architecture

The model wraps XLM-RoBERTa-base with a custom two-layer MLP classification head that replaces the standard single-linear classifier in HuggingFace's default XLMRobertaForSequenceClassification.

Input: Roman Urdu text
  (tokenized via XLM-R SentencePiece BPE, vocab=250,002, max=512 tokens)
         │
         ▼
┌────────────────────────────────────────────────────┐
│            XLM-RoBERTa-base Encoder                │
│   12 transformer layers  ·  hidden size = 768      │
│   12 attention heads  ·  ~270M parameters          │
│   vocab: 250,002 (multilingual SentencePiece)      │
│   position embeddings: 514 (XLM-R convention)      │
└────────────────────────────────────────────────────┘
         │
         │   [CLS] token representation  (batch × 768)
         ▼
┌────────────────────────────────────────────────────┐
│            Emotion Classification Head             │
│                                                    │
│   LayerNorm(768)                                   │
│        ↓                                           │
│   Dropout(0.35)                                    │
│        ↓                                           │
│   Linear(768 → 256)                                │
│        ↓                                           │
│   GELU activation                                  │
│        ↓                                           │
│   Dropout(0.175)                                   │
│        ↓                                           │
│   Linear(256 → 7)                                  │
└────────────────────────────────────────────────────┘
         │
         ▼
   Emotion logits  (batch × 7)
   → softmax → predicted class + confidence scores

Why a two-layer head? The standard single-layer Linear(768 → 7) collapses all representational transformation into one linear step. For Roman Urdu emotion classification, an intermediate non-linear projection is beneficial because: (1) several emotion classes share substantial surface-level lexical overlap (particularly anger/disgust and fear/sadness); (2) informal text produces highly variable surface forms for the same underlying emotional content; and (3) the GELU projection stage learns a compact emotion-relevant subspace from the full 768-dimensional encoder representation before the final seven-way classification boundary is drawn. This design was validated against a single-layer baseline during v1 development.
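As a sketch, the head described above can be expressed in PyTorch. Layer sizes and dropout rates are taken from the diagram; the repository's emotion_model.py is the authoritative implementation.

```python
import torch
from torch import nn

# Sketch of the classification head in the diagram above (not the
# repository's exact code; see emotion_model.py for the real thing).
emotion_head = nn.Sequential(
    nn.LayerNorm(768),    # normalize the [CLS] representation
    nn.Dropout(0.35),
    nn.Linear(768, 256),  # project into a compact emotion-relevant subspace
    nn.GELU(),
    nn.Dropout(0.175),
    nn.Linear(256, 7),    # seven-way emotion logits
)

# Shape check on a dummy batch of two [CLS] vectors.
emotion_head.eval()  # disable dropout for a deterministic pass
logits = emotion_head(torch.randn(2, 768))
print(logits.shape)  # torch.Size([2, 7])

# Parameter count: LayerNorm (1,536) + Linear(768->256) (196,864)
# + Linear(256->7) (1,799) = 200,199 -- on the order of the ~197k
# reported in the component table below.
n_params = sum(p.numel() for p in emotion_head.parameters())
```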

Component Parameters
XLM-R encoder ~270M
Emotion head ~197k
Total ~270.2M

Training Details

Model Lineage

xlm-roberta-base
    │  HuggingFace — 12 layers, 270M params, 100+ languages
    ▼
Khubaib01/roman-urdu-sentiment-xlm-r
    │  Sentiment fine-tune on Roman Urdu
    ▼
Khubaib01/roman-urdu-emotion-xlmr          ← v1  (21k samples)
    │  Emotion fine-tune, first version
    ▼
Khubaib01/roman-urdu-emotion-xlmr-v2       ← v2  (28k samples, this model)
    Continued fine-tune on expanded corpus

Each stage transfers progressively more task-specific and language-specific knowledge. This lineage allows v2 to achieve near-perfect performance with conservative encoder learning rates that preserve learned representations rather than overwriting them.

Hyperparameters

Parameter Value Rationale
Seed 42 Full reproducibility
Max epochs 10 With early stopping (patience = 3)
Early stopping patience 3 Halts if validation F1 stagnates for 3 consecutive epochs
Train batch size 16
Eval batch size 32
Encoder LR 5e-6 Conservative — warm-started from v1, avoids catastrophic forgetting
Head LR 3e-5 6× encoder LR; head adapts faster to expanded data distribution
LR layer-wise decay 0.90 Lower encoder layers updated less aggressively
Weight decay 0.02 L2 regularization; increased vs v1 (0.01) for larger corpus
Warmup ratio 0.10 10% of total steps for smooth LR ramp-up
Max gradient norm 1.0 Gradient clipping for training stability
Dropout 0.35 Slightly higher than v1 (0.30) to counter larger training set
Label smoothing 0.10 Prevents overconfidence on noisy social media annotations
Mixed precision fp16 NVIDIA GPU training
LR scheduler Cosine with linear warmup
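The early-stopping rule above (halt after 3 consecutive epochs without a validation-F1 improvement) amounts to a simple patience counter. The F1 trajectory below is hypothetical, purely for illustration:

```python
def run_with_early_stopping(f1_by_epoch, patience=3):
    """Return (best_f1, epochs_run): stop once validation F1 has
    failed to improve for `patience` consecutive epochs."""
    best_f1, epochs_without_improvement = float("-inf"), 0
    for epoch, f1 in enumerate(f1_by_epoch, start=1):
        if f1 > best_f1:
            best_f1, epochs_without_improvement = f1, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_f1, epoch  # stop early
    return best_f1, len(f1_by_epoch)

# Hypothetical validation-F1 trajectory: improvement stalls after epoch 3,
# so training halts at epoch 6 instead of running all 10 epochs.
best, stopped_at = run_with_early_stopping(
    [0.95, 0.97, 0.98, 0.979, 0.978, 0.977, 0.99, 0.99, 0.99, 0.99]
)
print(best, stopped_at)  # 0.98 6
```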

Layer-wise Learning Rate Decay

Rather than applying a uniform learning rate across the entire encoder, a layer-wise decay of 0.90 is applied so lower transformer layers receive proportionally smaller updates. For layer l counted from the top output layer downward:

LR(l) = BASE_LR × (0.90)^l = 5e-6 × (0.90)^l

Lower layers encode general linguistic structure (subword morphology, syntax) that transfers across tasks and should be minimally disturbed. Upper layers encode higher-level task-specific semantics and receive rates closer to BASE_LR. The classification head receives HEAD_LR = 3e-5, six times the encoder base rate.
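A minimal sketch of the resulting per-layer rates, using the values above (l = 0 is the topmost encoder layer):

```python
BASE_LR, DECAY, NUM_LAYERS = 5e-6, 0.90, 12

# LR(l) = BASE_LR * DECAY**l, with l counted from the top layer down.
layer_lrs = [BASE_LR * DECAY**l for l in range(NUM_LAYERS)]

print(f"top layer:    {layer_lrs[0]:.2e}")   # 5.00e-06
print(f"next layer:   {layer_lrs[1]:.2e}")   # 4.50e-06
print(f"bottom layer: {layer_lrs[-1]:.2e}")  # 1.57e-06

HEAD_LR = 3e-5  # classification head: 6x the encoder base rate
```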

Loss Function

Cross-entropy with label smoothing (ε = 0.10). Label smoothing distributes a fraction ε of the target probability mass uniformly across non-target classes, preventing the model from becoming pathologically overconfident on noisy, user-generated training data, and improving output calibration at inference time.
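A minimal sketch of the smoothed target distribution for K = 7 classes and ε = 0.10, using PyTorch-style smoothing ((1 − ε) · one-hot + ε/K, as in `torch.nn.CrossEntropyLoss(label_smoothing=0.1)`). The predicted probabilities below are hypothetical:

```python
import math

NUM_CLASSES, EPS = 7, 0.10

def smoothed_targets(true_class, num_classes=NUM_CLASSES, eps=EPS):
    """PyTorch-style label smoothing: (1 - eps) * one_hot + eps / K."""
    return [
        (1 - eps) * (1.0 if c == true_class else 0.0) + eps / num_classes
        for c in range(num_classes)
    ]

targets = smoothed_targets(true_class=3)  # class 3 = "happy"
# True class keeps 1 - eps + eps/K; every other class gets eps/K.
print(round(targets[3], 4))  # 0.9143
print(round(targets[0], 4))  # 0.0143

# Cross-entropy against the smoothed distribution for some hypothetical
# predicted probabilities (must sum to 1):
probs = [0.01, 0.01, 0.01, 0.93, 0.01, 0.01, 0.02]
loss = -sum(t * math.log(p) for t, p in zip(targets, probs))
```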


Dataset

Property v1 v2 (this model)
Training samples 21,000 28,000 (+33%)
Test samples 2,801 (human-validated)
Emotion classes 7 7
Approx. class balance ~3,000/class ~4,000/class
Human validation Yes Yes (expanded team)

Sources: Public social media platforms, public comment sections, WhatsApp-style conversational text corpora. All data was collected from publicly available sources. No personally identifiable information is included.

Corpus language characteristics:

  • Orthographic variability: the same word appears in multiple valid transliterations (khushi, khushee, khushy, khushii)
  • Code-switching: frequent natural mixing of Roman Urdu and English within single utterances
  • Informal register: abbreviations, slang, non-standard punctuation, emoticons, sentence fragments
  • Platform diversity: multiple source platforms to improve domain generalization



Comparison with Related Work

A full benchmark comparison against multilingual baselines and existing Urdu NLP models, inter-annotator agreement (IAA) statistics, and ablation studies are in preparation and will be published in the accompanying research paper.

Preliminary positioning:

Model Language Task Macro F1 Open Source
roman-urdu-emotion-xlmr-v2 (ours) Roman Urdu 7-class emotion 0.9896 Yes
roman-urdu-emotion-xlmr v1 (ours) Roman Urdu 7-class emotion — Yes
xlm-roberta-base (zero-shot) Multilingual 7-class emotion ~0.40–0.55* Yes
Formal Urdu baselines — see paper

*Estimated zero-shot upper bound on informal Roman Urdu — formal evaluation results in the paper.

To our knowledge, no other open-source model on HuggingFace achieves comparable accuracy on Roman Urdu emotion classification at this level of granularity. Prior Urdu NLP work has focused primarily on formal Urdu script (Nastaliq) and binary/ternary sentiment polarity — leaving fine-grained emotion classification in Roman Urdu unaddressed at production-grade performance.


Team

Role Name
Project Lead & ML Engineer Muhammad Khubaib Ahmad
Data Manager Khadija Faisal
Annotator & Validator Ayesha Khalid
Annotator & Validator Muzammil Shadab
Annotator & Validator Faiez Ahmad

Muhammad Khubaib Ahmad conceived the overall research direction, designed the complete data collection, annotation, and validation workflow, developed the full modeling and training pipeline, and led all engineering work across both v1 and v2. This project represents an ongoing commitment to building open, high-quality NLP infrastructure for the Roman Urdu language community.


Limitations

Domain specificity. Trained on informal digital text. Performance may degrade on formal Roman Urdu, literary text, news writing, or highly technical content not represented in the training corpus.

Orthographic coverage. While the training corpus covers the most common spelling variants, novel or region-specific transliterations of rare words may reduce model confidence. This is a structural challenge inherent to any Roman Urdu NLP system in the absence of a standardized orthography.

Disgust–anger boundary. The disgust class carries the lowest F1 (0.977), consistent with the documented challenge of separating disgust from anger in informal text — a difficulty that persists even for human annotators. Applications requiring high-precision disgust detection should account for this boundary.

Sarcasm and irony. Sarcastic expressions common in Pakistani social media may be misclassified because the surface text does not carry the intended emotional signal. Sarcasm detection is a distinct task not addressed by this model.

Code-switching generalization. The model handles Roman Urdu–English code-switching patterns present in training data. Highly English-dominant text or unusual code-switching structures may produce lower-confidence predictions.

Geographic and demographic scope. Training data was primarily collected from Pakistani digital platforms. Urdu-speaking communities in India, the diaspora, or other regions may use lexical and stylistic patterns underrepresented in the corpus.

Not a clinical tool. While this model has clear applications in mental health research, it is a text classifier and has not been validated for clinical decision-making. Any deployment in mental health-adjacent systems requires appropriate clinical oversight and independent ethical review.


Upcoming Work

  • Research paper (in preparation): Full benchmark evaluation against multilingual baselines, inter-annotator agreement (IAA) analysis, ablation studies on architecture and training decisions, and error analysis on the disgust–anger boundary
  • Emotion corpus release: Khubaib01/RomanUrdu-NLP-Emotion-Corpus — the 28k human-labeled training corpus will be released publicly to support reproducible research and downstream development
  • Sarcasm-aware variant: Planned extension incorporating sarcasm detection to improve performance at the disgust–anger boundary
  • Interactive demo: HuggingFace Spaces demo for real-time testing

Citation

If you use this model in your research, please cite:

@misc{muhammad_khubaib_ahmad_2026,
    author       = { Muhammad Khubaib Ahmad and Khadija Faisal },
    title        = { roman-urdu-emotion-xlmr-v2 (Revision 7cd7dd2) },
    year         = 2026,
    url          = { https://huggingface.co/Khubaib01/roman-urdu-emotion-xlmr-v2 },
    doi          = { 10.57967/hf/8347 },
    publisher    = { Hugging Face },
    note         = { Accompanying research paper in preparation }
}

An updated BibTeX entry for the research paper will be added to this card upon publication.


License

Released under the Apache License 2.0.

You are free to use, modify, and distribute this model for both research and commercial purposes with attribution. See LICENSE for full terms.


Built for Roman Urdu. Built for scale. Built open.

If this work is useful to your research or application, please consider starring the repository and citing the paper when it is published.
