🌟 Roman Urdu Emotion Classifier — XLM-R v2
The First and Highest-Accuracy Open-Source Emotion Detection Model for Roman Urdu
Macro F1: 0.9896 · Accuracy: 98.96% · 7 Emotion Classes · 28,000 Training Samples
Trained on social media, WhatsApp chats, and comment data — the actual language 230 million people use.
Table of Contents
- Why This Model Matters
- Applications
- Quick Start
- Labels
- Performance
- Visualizations
- Architecture
- Training Details
- Dataset
- Comparison with Related Work
- Team
- Limitations
- Upcoming Work
- Citation
Why This Model Matters
Roman Urdu is the dominant language of digital Pakistan — and one of the most underserved languages in NLP.
Over 230 million people speak Urdu as a first or second language. In digital spaces — WhatsApp, Twitter/X, Facebook, TikTok comments, YouTube — the overwhelming majority write in Roman Urdu: Urdu expressed in Latin script, without a standardized orthography, heavily mixed with English, and rich in slang, regional variation, and emotionally charged informal expression.
Despite this scale, Roman Urdu remains a severely low-resource language in NLP. The reasons are structural:
- No standardized spelling — the same word can appear in dozens of valid transliterations
- Aggressive code-switching between Urdu and English within single sentences
- A near-total absence of labeled datasets at scale
- Existing multilingual models (trained primarily on formal Urdu script) generalize poorly to informal Roman Urdu text
This model directly addresses that gap.
roman-urdu-emotion-xlmr-v2 is, to our knowledge, the first publicly available, high-accuracy, open-source emotion classification model for Roman Urdu on HuggingFace. It achieves 98.96% accuracy and 0.9896 Macro F1 across seven emotion classes on a human-validated test set — performance that is competitive with state-of-the-art emotion classifiers for high-resource languages like English.
This is not an incremental contribution. For a language with virtually no prior open-source emotion recognition tooling, this model represents a foundational resource for researchers, developers, and organizations working with Urdu-speaking populations.
Applications
The ability to automatically detect emotion in Roman Urdu text unlocks a wide range of impactful downstream applications:
Mental Health Monitoring
Depression, anxiety, and emotional distress manifest in language long before clinical intervention. This model enables:
- Passive screening of social media activity for early signs of emotional distress in Urdu-speaking populations
- Longitudinal tracking of emotional state changes in anonymized text data
- Support tools for mental health researchers studying Pakistani and South Asian communities — populations historically underrepresented in clinical NLP research
- Potential integration into counseling support platforms to flag high-distress conversations for human review
Social Media and Public Discourse Analysis
- Real-time emotion monitoring of public discourse on Pakistani social media
- Brand sentiment and emotion analysis for Urdu-speaking markets
- Detection of emotionally charged misinformation or harmful content campaigns
- Crisis response: identifying fear or anger spikes during public emergencies or natural disasters
Policy and Governance
- Public opinion analysis of government communications and policy announcements
- Monitoring community emotional response to news events, elections, and social issues
- Understanding population emotional needs for targeted resource allocation
Low-Resource NLP Research
- First benchmark model for Roman Urdu affective computing — a direct baseline for future work
- Foundation for transfer learning to related low-resource languages (Hindi in Latin script, Punjabi, Sindhi)
- Demonstrates the viability of continued fine-tuning pipelines for low-resource settings with limited labeled data
Conversational AI and Customer Experience
- Emotion-aware chatbots and virtual assistants for Urdu-speaking users
- Customer service systems that detect frustrated or distressed users for priority routing
- Educational platforms that adapt content delivery based on detected student emotional state
Quick Start
```python
from transformers import pipeline

pipe = pipeline(
    "text-classification",
    model="Khubaib01/roman-urdu-emotion-xlmr-v2",
    trust_remote_code=True,  # required — model uses a custom classification head
    top_k=None,              # returns scores for all 7 classes
)

# Single prediction — all scores
pipe("yaar mun preshan ho rha")

# Top prediction only
result = pipe("bohat khushi ho rhi hai aaj!")
top = max(result[0], key=lambda x: x["score"])
print(f"{top['label']}: {top['score']:.4f}")
# happy: 0.9901

# Batch prediction
texts = [
    "mujhe dar lag rha hai",
    "ye sab dekh ke dil bahut dukha",
    "acha! ye toh maine socha bhi nahi tha",
    "theek hai, koi baat nahi",
]
results = pipe(texts)
for text, scores in zip(texts, results):
    top = max(scores, key=lambda x: x["score"])
    print(f"{top['label']:10} ({top['score']:.3f}) → {text}")
# fear       (0.987) → mujhe dar lag rha hai
# sad        (0.983) → ye sab dekh ke dil bahut dukha
# surprise   (0.990) → acha! ye toh maine socha bhi nahi tha
# none       (0.998) → theek hai, koi baat nahi
```
Note on `trust_remote_code=True`: This flag is required because the model uses a custom two-layer MLP classification head rather than the standard HuggingFace linear classifier. The full architecture code (`emotion_model.py`) is included in this repository and is fully auditable before use.
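In deployment (for example, the priority-routing and monitoring use cases above), it is often useful to route low-confidence predictions to a fallback class or a human-review queue. A minimal sketch of such a policy; the `min_conf` threshold, the fallback choice, and the `predict_with_threshold` helper are illustrative deployment decisions, not part of the model:

```python
import numpy as np

# Label order mirroring the Labels table; the model's config.json is authoritative.
LABELS = ["anger", "disgust", "fear", "happy", "sad", "surprise", "none"]

def predict_with_threshold(probs, min_conf=0.70, fallback="none"):
    """Return (label, confidence), falling back when confidence is low.

    probs: probability vector over the 7 classes, e.g. softmaxed pipeline
    scores. Threshold and fallback are deployment choices.
    """
    i = int(np.argmax(probs))
    if probs[i] >= min_conf:
        return LABELS[i], float(probs[i])
    return fallback, float(probs[i])
```

A stricter variant could instead enqueue low-confidence items for annotator review rather than mapping them to `none`.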
Labels
Seven classes — Ekman's six universal basic emotions plus a none class for emotionally neutral content.
| ID | Label | Urdu Equivalent | Description | Example (Roman Urdu) |
|---|---|---|---|---|
| 0 | anger | غصہ (Gussa) | Frustration, rage, irritation | yaar mujhe bahut gussa aa rha hai |
| 1 | disgust | نفرت (Nafrat) | Revulsion, strong disapproval | ugh ye cheez bilkul pasand nahi |
| 2 | fear | ڈر (Dar) | Anxiety, dread, apprehension | mujhe dar lag rha hai is cheez se |
| 3 | happy | خوشی (Khushi) | Joy, happiness, contentment, delight | bohat khushi ho rhi hai aaj! |
| 4 | sad | اداسی (Udaasi) | Grief, sorrow, loss, disappointment | ye sab dekh ke dil bahut dukha |
| 5 | surprise | حیرت (Hairat) | Astonishment — positive or negative | acha! ye toh maine socha bhi nahi |
| 6 | none | غیر جذباتی (Neutral) | Neutral / no dominant emotional signal | theek hai, jo hoga dekha jaega |
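In code, the table above corresponds to the model's `id2label`/`label2id` configuration. A small sketch of that mapping; the model's `config.json` is the authoritative source:

```python
# Label mapping mirroring the table above (config.json is authoritative).
ID2LABEL = {
    0: "anger",
    1: "disgust",
    2: "fear",
    3: "happy",
    4: "sad",
    5: "surprise",
    6: "none",
}
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}
```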
Performance
All evaluation metrics are computed on a held-out test set of 2,801 samples that were withheld from training and validation entirely. Each sample in the test set was independently reviewed by human validators with native Roman Urdu proficiency before inclusion, ensuring that the evaluation reflects real annotation quality rather than automated labeling noise.
Overall Metrics
| Metric | Score |
|---|---|
| Accuracy | 0.9896 |
| Macro F1 | 0.9896 |
| Weighted F1 | 0.9896 |
| Macro Precision | 0.9896 |
| Macro Recall | 0.9896 |
Per-Class Results
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| anger | 0.9975 | 1.0000 | 0.9988 | 401 |
| disgust | 0.9823 | 0.9725 | 0.9774 | 400 |
| fear | 0.9874 | 0.9825 | 0.9850 | 400 |
| happy | 0.9901 | 1.0000 | 0.9950 | 400 |
| sad | 0.9800 | 0.9825 | 0.9813 | 400 |
| surprise | 0.9900 | 0.9900 | 0.9900 | 400 |
| none | 1.0000 | 1.0000 | 1.0000 | 400 |
| macro avg | 0.9896 | 0.9896 | 0.9896 | 2801 |
Key Observations
Perfect scores on none (F1 = 1.000): The model completely and cleanly separates neutral text from all emotional categories. This is a critical capability for real-world deployment where the majority of messages in any corpus will be emotionally neutral — a poor none classifier pollutes all other class predictions.
Perfect recall on anger (Recall = 1.000, F1 = 0.9988): The model does not miss a single angry text in the test set. In mental health monitoring and crisis detection contexts, this zero-miss behavior on anger is especially valuable — false negatives in distress detection carry higher cost than false positives.
Perfect scores on none and near-perfect on happy and anger: These three classes together account for over 57% of real-world text in informal corpora (neutral, positive, or overtly negative). The model's strongest performance on these highest-frequency categories ensures robust behavior at deployment scale.
Lowest F1 on disgust (0.977): This is consistent with findings across the affective computing literature — disgust and anger share substantial lexical overlap in informal text and represent the hardest emotion pair to separate even for human annotators. An F1 of 0.977 on disgust still represents an exceptionally strong result for this class in a low-resource language setting.
Balanced support across all classes (~400 per class): The near-equal class distribution in the test set means macro F1 = weighted F1 = accuracy = 0.9896 — a meaningful alignment confirming these scores are not inflated by any dominant class.
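The reported numbers follow standard scikit-learn definitions; assuming scikit-learn is available, they can be reproduced from gold labels and predictions as below (toy inputs shown, not the actual test set):

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy gold labels and predictions (NOT the actual test set), just to show
# how the reported metrics are computed.
y_true = ["anger", "happy", "none", "sad", "fear"]
y_pred = ["anger", "happy", "none", "sad", "sad"]   # one fear→sad error

acc      = accuracy_score(y_true, y_pred)
macro    = f1_score(y_true, y_pred, average="macro", zero_division=0)
weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)
# With equal per-class support (as in the balanced 2,801-sample test set),
# macro F1 and weighted F1 coincide.
```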
Visualizations
Benchmark visualizations will be uploaded alongside the accompanying paper.
Per-Class F1 Bar Chart
Confusion Matrix
Architecture
The model wraps XLM-RoBERTa-base with a custom two-layer MLP classification head that replaces the standard single-linear classifier in HuggingFace's default XLMRobertaForSequenceClassification.
Input: Roman Urdu text
(tokenized via XLM-R SentencePiece BPE, vocab=250,002, max=512 tokens)
│
▼
┌────────────────────────────────────────────────────┐
│ XLM-RoBERTa-base Encoder │
│ 12 transformer layers · hidden size = 768 │
│ 12 attention heads · ~270M parameters │
│ vocab: 250,002 (multilingual SentencePiece) │
│ position embeddings: 514 (XLM-R convention) │
└────────────────────────────────────────────────────┘
│
│ [CLS] token representation (batch × 768)
▼
┌────────────────────────────────────────────────────┐
│ Emotion Classification Head │
│ │
│ LayerNorm(768) │
│ ↓ │
│ Dropout(0.35) │
│ ↓ │
│ Linear(768 → 256) │
│ ↓ │
│ GELU activation │
│ ↓ │
│ Dropout(0.175) │
│ ↓ │
│ Linear(256 → 7) │
└────────────────────────────────────────────────────┘
│
▼
Emotion logits (batch × 7)
→ softmax → predicted class + confidence scores
Why a two-layer head? The standard single-layer Linear(768 → 7) collapses all representational transformation into one linear step. For Roman Urdu emotion classification, an intermediate non-linear projection is beneficial because: (1) several emotion classes share substantial surface-level lexical overlap (particularly anger/disgust and fear/sadness); (2) informal text produces highly variable surface forms for the same underlying emotional content; and (3) the GELU projection stage learns a compact emotion-relevant subspace from the full 768-dimensional encoder representation before the final seven-way classification boundary is drawn. This design was validated against a single-layer baseline during v1 development.
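The head diagrammed above can be expressed in a few lines of PyTorch. This is a sketch reconstructed from the diagram; the released `emotion_model.py` in the repository is the authoritative implementation:

```python
import torch
import torch.nn as nn

class EmotionHead(nn.Module):
    """Sketch of the two-layer MLP head described above."""

    def __init__(self, hidden=768, mid=256, n_classes=7, p1=0.35, p2=0.175):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(hidden),
            nn.Dropout(p1),
            nn.Linear(hidden, mid),
            nn.GELU(),
            nn.Dropout(p2),
            nn.Linear(mid, n_classes),
        )

    def forward(self, cls_repr):
        # cls_repr: (batch, 768) [CLS] representations from the encoder
        return self.net(cls_repr)  # (batch, 7) emotion logits

head = EmotionHead().eval()
logits = head(torch.randn(4, 768))
```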
| Component | Parameters |
|---|---|
| XLM-R encoder | ~270M |
| Emotion head | ~197k |
| Total | ~270.2M |
Training Details
Model Lineage
xlm-roberta-base
│ HuggingFace — 12 layers, 270M params, 100+ languages
▼
Khubaib01/roman-urdu-sentiment-xlm-r
│ Sentiment fine-tune on Roman Urdu
▼
Khubaib01/roman-urdu-emotion-xlmr ← v1 (21k samples)
│ Emotion fine-tune, first version
▼
Khubaib01/roman-urdu-emotion-xlmr-v2 ← v2 (28k samples, this model)
Continued fine-tune on expanded corpus
Each stage transfers progressively more task-specific and language-specific knowledge. This lineage allows v2 to achieve near-perfect performance with conservative encoder learning rates that preserve learned representations rather than overwriting them.
Hyperparameters
| Parameter | Value | Rationale |
|---|---|---|
| Seed | 42 | Full reproducibility |
| Max epochs | 10 | With early stopping (patience = 3) |
| Early stopping patience | 3 | Halts if validation F1 stagnates for 3 consecutive epochs |
| Train batch size | 16 | — |
| Eval batch size | 32 | — |
| Encoder LR | 5e-6 | Conservative — warm-started from v1, avoids catastrophic forgetting |
| Head LR | 3e-5 | 6× encoder LR; head adapts faster to expanded data distribution |
| LR layer-wise decay | 0.90 | Lower encoder layers updated less aggressively |
| Weight decay | 0.02 | L2 regularization; increased vs v1 (0.01) for larger corpus |
| Warmup ratio | 0.10 | 10% of total steps for smooth LR ramp-up |
| Max gradient norm | 1.0 | Gradient clipping for training stability |
| Dropout | 0.35 | Slightly higher than v1 (0.30) to counter larger training set |
| Label smoothing | 0.10 | Prevents overconfidence on noisy social media annotations |
| Mixed precision | fp16 | NVIDIA GPU training |
| LR scheduler | Cosine with linear warmup | — |
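The warmup-plus-cosine schedule from the table can be sketched as a step-dependent LR multiplier. This is a simplified stand-in (assuming decay to zero) for the scheduler provided by the `transformers` library, not the exact training code:

```python
import math

def cosine_with_warmup(step, total_steps, warmup_ratio=0.10):
    """LR multiplier: linear ramp over the first 10% of steps,
    then cosine decay from 1.0 down to 0.0."""
    warmup = int(total_steps * warmup_ratio)
    if step < warmup:
        return step / max(1, warmup)
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * progress))
```

Multiplying this factor by each parameter group's base rate yields the effective learning rate at every optimizer step.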
Layer-wise Learning Rate Decay
Rather than applying a uniform learning rate across the entire encoder, a layer-wise decay of 0.90 is applied so lower transformer layers receive proportionally smaller updates. For layer l counted from the top output layer downward:
LR(l) = BASE_LR × (0.90)^l = 5e-6 × (0.90)^l
Lower layers encode general linguistic structure (subword morphology, syntax) that transfers across tasks and should be minimally disturbed. Upper layers encode higher-level task-specific semantics and receive rates closer to BASE_LR. The classification head receives HEAD_LR = 3e-5, six times the encoder base rate.
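Concretely, the per-layer rates under the formula above work out as follows (a quick check, with layer 0 being the top encoder layer):

```python
BASE_LR, HEAD_LR, DECAY, N_LAYERS = 5e-6, 3e-5, 0.90, 12

# LR(l) = BASE_LR * 0.90**l, with l counted from the top layer down.
layer_lrs = [BASE_LR * DECAY**l for l in range(N_LAYERS)]
# Top layer: 5e-6; bottom (12th) layer: 5e-6 * 0.9**11 ≈ 1.57e-6.
```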
Loss Function
Cross-entropy with label smoothing (ε = 0.10). Label smoothing distributes a fraction ε of the target probability mass uniformly across non-target classes, preventing the model from becoming pathologically overconfident on noisy, user-generated training data, and improving output calibration at inference time.
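As described, the smoothed target puts 1 − ε on the gold class and spreads ε over the remaining classes. A NumPy sketch of that formulation (note that some implementations, e.g. PyTorch's built-in `label_smoothing`, instead spread ε over all classes including the target):

```python
import numpy as np

def smoothed_targets(gold, n_classes=7, eps=0.10):
    """Soft label vector: 1 - eps on the gold class, eps spread uniformly
    over the remaining n_classes - 1 classes."""
    t = np.full(n_classes, eps / (n_classes - 1))
    t[gold] = 1.0 - eps
    return t

def smoothed_cross_entropy(logits, gold, eps=0.10):
    """Cross-entropy against the smoothed target distribution."""
    m = logits.max()
    log_probs = logits - m - np.log(np.exp(logits - m).sum())  # stable log-softmax
    return -(smoothed_targets(gold, logits.size, eps) * log_probs).sum()
```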
Dataset
| Property | v1 | v2 (this model) |
|---|---|---|
| Training samples | 21,000 | 28,000 (+33%) |
| Test samples | — | 2,801 (human-validated) |
| Emotion classes | 7 | 7 |
| Approx. class balance | ~3,000/class | ~4,000/class |
| Human validation | Yes | Yes (expanded team) |
Sources: Public social media platforms, public comment sections, WhatsApp-style conversational text corpora. All data was collected from publicly available sources. No personally identifiable information is included.
Corpus language characteristics:
- Orthographic variability: the same word appears in multiple valid transliterations (khushi, khushee, khushy, khushii)
- Code-switching: frequent natural mixing of Roman Urdu and English within single utterances
- Informal register: abbreviations, slang, non-standard punctuation, emoticons, sentence fragments
- Platform diversity: multiple source platforms to improve domain generalization
Related resources:
- `Khubaib01/roman-urdu-sentiment-xlm-r` — parent sentiment model
- `Khubaib01/RomanUrdu-NLP-Sentiment-Corpus` — 134k sentiment-labeled corpus
- `Khubaib01/RomanUrdu-NLP-Emotion-Corpus` (coming soon) — 28k emotion-labeled training corpus
Comparison with Related Work
A full benchmark comparison against multilingual baselines and existing Urdu NLP models, inter-annotator agreement (IAA) statistics, and ablation studies are in preparation and will be published in the accompanying research paper.
Preliminary positioning:
| Model | Language | Task | Macro F1 | Open Source |
|---|---|---|---|---|
| roman-urdu-emotion-xlmr-v2 (ours) | Roman Urdu | 7-class emotion | 0.9896 | Yes |
| roman-urdu-emotion-xlmr v1 (ours) | Roman Urdu | 7-class emotion | — | Yes |
| xlm-roberta-base (zero-shot) | Multilingual | — | ~0.40–0.55* | Yes |
| Formal baselines | See paper | — | — | — |
*Estimated zero-shot upper bound on informal Roman Urdu — formal evaluation results in the paper.
To our knowledge, no other open-source model on HuggingFace achieves comparable accuracy on Roman Urdu emotion classification at this level of granularity. Prior Urdu NLP work has focused primarily on formal Urdu script (Nastaliq) and binary/ternary sentiment polarity — leaving fine-grained emotion classification in Roman Urdu unaddressed at production-grade performance.
Team
| Role | Name |
|---|---|
| Project Lead & ML Engineer | Muhammad Khubaib Ahmad |
| Data Manager | Khadija Faisal |
| Annotator & Validator | Ayesha Khalid |
| Annotator & Validator | Muzammil Shadab |
| Annotator & Validator | Faiez Ahmad |
Muhammad Khubaib Ahmad conceived the overall research direction, designed the complete data collection, annotation, and validation workflow, developed the full modeling and training pipeline, and led all engineering work across both v1 and v2. This project represents an ongoing commitment to building open, high-quality NLP infrastructure for the Roman Urdu language community.
Limitations
Domain specificity. Trained on informal digital text. Performance may degrade on formal Roman Urdu, literary text, news writing, or highly technical content not represented in the training corpus.
Orthographic coverage. While the training corpus covers the most common spelling variants, novel or region-specific transliterations of rare words may reduce model confidence. This is a structural challenge inherent to any Roman Urdu NLP system in the absence of a standardized orthography.
Disgust–anger boundary. The disgust class carries the lowest F1 (0.977), consistent with the documented challenge of separating disgust from anger in informal text — a difficulty that persists even for human annotators. Applications requiring high-precision disgust detection should account for this boundary.
Sarcasm and irony. Sarcastic expressions common in Pakistani social media may be misclassified because the surface text does not carry the intended emotional signal. Sarcasm detection is a distinct task not addressed by this model.
Code-switching generalization. The model handles Roman Urdu–English code-switching patterns present in training data. Highly English-dominant text or unusual code-switching structures may produce lower-confidence predictions.
Geographic and demographic scope. Training data was primarily collected from Pakistani digital platforms. Urdu-speaking communities in India, the diaspora, or other regions may use lexical and stylistic patterns underrepresented in the corpus.
Not a clinical tool. While this model has clear applications in mental health research, it is a text classifier and has not been validated for clinical decision-making. Any deployment in mental health-adjacent systems requires appropriate clinical oversight and independent ethical review.
Upcoming Work
- Research paper (in preparation): Full benchmark evaluation against multilingual baselines, inter-annotator agreement (IAA) analysis, ablation studies on architecture and training decisions, and error analysis on the disgust–anger boundary
- Emotion corpus release: `Khubaib01/RomanUrdu-NLP-Emotion-Corpus` — the 28k human-labeled training corpus will be released publicly to support reproducible research and downstream development
- Sarcasm-aware variant: Planned extension incorporating sarcasm detection to improve performance at the disgust–anger boundary
- Interactive demo: HuggingFace Spaces demo for real-time testing
Citation
If you use this model in your research, please cite:
```bibtex
@misc{muhammad_khubaib_ahmad_2026,
  author    = {Muhammad Khubaib Ahmad and Khadija Faisal},
  title     = {roman-urdu-emotion-xlmr-v2 (Revision 7cd7dd2)},
  year      = {2026},
  url       = {https://huggingface.co/Khubaib01/roman-urdu-emotion-xlmr-v2},
  doi       = {10.57967/hf/8347},
  publisher = {Hugging Face},
  note      = {Accompanying research paper in preparation}
}
```
An updated BibTeX entry for the research paper will be added to this card upon publication.
License
Released under the Apache License 2.0.
You are free to use, modify, and distribute this model for both research and commercial purposes with attribution. See LICENSE for full terms.
Built for Roman Urdu. Built for scale. Built open.
If this work is useful to your research or application, please consider starring the repository and citing the paper when it is published.