🇬🇧 English Sentiment Analysis — Fine-tuned RoBERTa-Large
A Fine-tuned RoBERTa-Large Model for 3-Class English Sentiment Classification
Trained on educational and consumer feedback data for high-precision sentiment understanding
1. Model Overview
This model is a fine-tuned RoBERTa-Large classifier for English sentiment analysis, adapted from siebert/sentiment-roberta-large-english. It classifies English text into three sentiment categories — Positive, Neutral, and Negative — with high confidence and strong generalization across diverse feedback domains.
The model was trained on a curated dataset of English-language educational and consumer feedback, and achieves 92.12% accuracy on the held-out test set, making it well-suited for production-grade NLP pipelines.
2. Model Details
| Attribute | Value |
|---|---|
| Base Model | siebert/sentiment-roberta-large-english |
| Architecture | RoBERTa-Large (Transformer Encoder) |
| Task | 3-class Sentiment Classification |
| Classes | Negative (0), Neutral (1), Positive (2) |
| Total Parameters | 355,362,819 (~355M) |
| Model Size | ~1.42 GB |
| Max Sequence Length | 256 tokens |
| Training Framework | Hugging Face Transformers + PyTorch |
| Compute | Mixed Precision (FP16) |
| Repository | tahamueed23/english-sentiment-roberta-large |
| Last Updated | February 22, 2026 |
| Author | Taha Mueed |
3. Intended Use
This model is designed for any downstream task that requires classifying the sentiment of English text into three categories: Positive, Neutral, or Negative.
| Domain | Application |
|---|---|
| 🎓 Education | Student feedback analysis, course evaluation, teacher reviews |
| 🛒 E-commerce | Product review classification, customer satisfaction scoring |
| 📱 Social Media | Comment moderation, brand monitoring, opinion mining |
| 🏢 Customer Support | Ticket sentiment tagging, call centre transcript analysis |
| 🏥 Healthcare | Patient experience reviews, hospital feedback analysis |
| 📊 Market Research | Consumer opinion tracking, NPS survey classification |
| 🛡️ Content Moderation | Detecting hostile or negative user-generated content |
Out-of-Scope Use
This model is designed for English-only input. It is not suitable for multilingual text, code-switched content, or non-English languages. It was also not trained for tasks like emotion detection beyond sentiment polarity, toxicity classification, or stance detection.
4. Dataset Description
The training data was sourced from a multilingual academic feedback dataset hosted on Google Sheets, containing 50,056 records across five languages. Only English-language entries were selected for this model's training.
Full Dataset Composition (N = 50,056)
| Language | Count | Percentage |
|---|---|---|
| Roman Urdu | 16,916 | 33.79% |
| Urdu | 16,782 | 33.52% |
| English | 16,355 | 32.67% |
| Mixed / Other | 3 | 0.01% |
English Subset — After Filtering and Cleaning (N = 15,650)
| Sentiment Class | Count | Percentage |
|---|---|---|
| Positive | 7,160 | 45.75% |
| Negative | 5,364 | 34.27% |
| Neutral | 3,126 | 19.97% |
The dataset exhibits a moderate class imbalance, with Neutral being the least represented class. This was addressed through class-weighted loss during training (see Section 6).
Train / Validation / Test Split
| Split | Samples | Percentage |
|---|---|---|
| Training | 10,955 | 70.0% |
| Validation | 2,347 | 15.0% |
| Test | 2,348 | 15.0% |
| Total | 15,650 | 100% |
Splits were created using stratified sampling to preserve the original class distribution across all three subsets.
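The stratified split can be sketched in plain Python. In practice sklearn's `train_test_split` with the `stratify=` argument is the usual tool; the hand-rolled version below just illustrates the idea of splitting each class independently so the 70/15/15 ratios hold per class:

```python
import random
from collections import defaultdict

def stratified_split(labels, fracs=(0.70, 0.15, 0.15), seed=42):
    """Split sample indices into train/val/test, preserving label ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    splits = ([], [], [])  # train, validation, test
    for indices in by_class.values():
        rng.shuffle(indices)
        n = len(indices)
        cut1 = round(n * fracs[0])
        cut2 = cut1 + round(n * fracs[1])
        splits[0].extend(indices[:cut1])
        splits[1].extend(indices[cut1:cut2])
        splits[2].extend(indices[cut2:])  # remainder goes to test
    return splits

# Toy run with this dataset's class counts: 0 = Negative, 1 = Neutral, 2 = Positive
labels = [2] * 7160 + [0] * 5364 + [1] * 3126
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```

Because each class is cut at rounded boundaries, the validation/test sizes can differ by a sample from the table above, but the class proportions match in every split.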
5. Data Preprocessing
The following cleaning and normalization steps were applied to the raw English text before training:
- Duplicate removal — 701 duplicate entries were removed based on exact match of the feedback field.
- Null filtering — Records with missing `feedback` or `senti_manual` values were dropped.
- Whitespace normalization — Consecutive whitespace characters were collapsed using regex substitution.
- Short text removal — Texts shorter than 5 characters were discarded as uninformative.
- Empty string removal — Any remaining blank entries after cleaning were excluded.
- Label standardization — Sentiment labels were title-cased and mapped to integers: `Negative → 0`, `Neutral → 1`, `Positive → 2`.
- Label validation — Rows with labels outside the valid set `{0, 1, 2}` were dropped.
After preprocessing, the final cleaned dataset contained 15,650 samples, reduced from 16,355 raw English entries.
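The cleaning steps above can be sketched as a small pure-Python pass over raw records (the field names `feedback` and `senti_manual` follow this section; the exact order of operations in the original pipeline may differ):

```python
import re

LABEL_MAP = {"Negative": 0, "Neutral": 1, "Positive": 2}

def clean_records(records):
    """Apply the documented cleaning steps to raw feedback records."""
    seen, cleaned = set(), []
    for rec in records:
        text, label = rec.get("feedback"), rec.get("senti_manual")
        if text is None or label is None:               # null filtering
            continue
        text = re.sub(r"\s+", " ", text).strip()        # whitespace normalization
        if len(text) < 5:                               # short/empty text removal
            continue
        if text in seen:                                # duplicate removal (exact match)
            continue
        seen.add(text)
        label_id = LABEL_MAP.get(str(label).title())    # label standardization
        if label_id not in (0, 1, 2):                   # label validation
            continue
        cleaned.append({"feedback": text, "label": label_id})
    return cleaned

raw = [
    {"feedback": "Great   course!", "senti_manual": "positive"},
    {"feedback": "Great course!", "senti_manual": "Positive"},   # duplicate after normalization
    {"feedback": "ok", "senti_manual": "Neutral"},               # too short
    {"feedback": None, "senti_manual": "Negative"},              # missing text
    {"feedback": "Poorly organized lectures.", "senti_manual": "negative"},
]
print(clean_records(raw))
```

On this toy input only the first and last records survive, mapped to labels 2 and 0 respectively.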
6. Training Procedure
The model was initialized from siebert/sentiment-roberta-large-english (originally a binary classifier) and adapted to a 3-class classification head by re-initializing the output projection layer (classifier.out_proj) to output 3 logits instead of 2.
Training used a custom weighted loss Trainer (WeightedLossTrainer) that applies per-class weights during cross-entropy loss computation to compensate for the Neutral class underrepresentation.
Class Weights Applied During Training
| Class | Weight |
|---|---|
| Negative | 0.9725 |
| Neutral | 1.6688 |
| Positive | 0.7286 |
Weights were computed using sklearn.utils.class_weight.compute_class_weight with the balanced strategy.
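The `balanced` strategy sets each weight to `n_samples / (n_classes * count_c)`. With approximate per-class counts for the 10,955-sample stratified training split (assumed here as 3,755 Negative, 2,188 Neutral, 5,012 Positive), the formula essentially reproduces the weights in the table:

```python
def balanced_weights(counts):
    """sklearn's 'balanced' heuristic: n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * k) for cls, k in counts.items()}

# Approximate class counts in the training split (70% of each class)
train_counts = {"Negative": 3755, "Neutral": 2188, "Positive": 5012}
for cls, w in balanced_weights(train_counts).items():
    print(f"{cls}: {w:.4f}")
```

The printed values land within a few ten-thousandths of the 0.9725 / 1.6688 / 0.7286 reported above; the tiny residual comes from the exact per-class counts of the real split.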
Additional training strategies included gradient checkpointing to reduce GPU memory usage, FP16 mixed precision for faster computation, and early stopping with a patience of 3 evaluation cycles to prevent overfitting. Model checkpoints were evaluated every 100 steps on the validation set using weighted F1 as the selection criterion, and the best checkpoint was loaded at the end of training.
Total training time: 4,332.72 seconds (~72 minutes) on an NVIDIA Tesla T4 (15.64 GB VRAM).
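`WeightedLossTrainer` itself is not published with the model, but the per-sample loss it applies is ordinary cross-entropy scaled by the weight of the true class (in practice `torch.nn.CrossEntropyLoss(weight=...)` inside an overridden `compute_loss`). A stdlib-only sketch of that computation:

```python
import math

def weighted_cross_entropy(logits, true_class, class_weights):
    """Cross-entropy of one sample, scaled by the weight of its true class."""
    # Numerically stable softmax
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    prob_true = exps[true_class] / sum(exps)
    return -class_weights[true_class] * math.log(prob_true)

weights = [0.9725, 1.6688, 0.7286]   # Negative, Neutral, Positive
logits = [0.2, 1.5, -0.3]            # hypothetical model outputs

# A misread Neutral example is penalized more heavily than a misread Positive one
print(weighted_cross_entropy(logits, 1, weights))
print(weighted_cross_entropy(logits, 2, weights))
```

With a weight above 1.0 on Neutral, gradient updates from Neutral mistakes are amplified relative to the over-represented Positive class, which is exactly the imbalance correction described above.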
7. Hyperparameters
| Parameter | Value |
|---|---|
| Epochs | 4 |
| Per-device Train Batch Size | 8 |
| Per-device Eval Batch Size | 16 |
| Gradient Accumulation Steps | 2 (effective batch size = 16) |
| Learning Rate | 2e-5 |
| LR Scheduler | Cosine |
| Warmup Steps | 500 |
| Weight Decay | 0.01 |
| Optimizer | AdamW (adamw_torch) |
| Mixed Precision | FP16 |
| Gradient Checkpointing | Yes |
| Max Sequence Length | 256 tokens |
| Early Stopping Patience | 3 evaluation cycles |
| Eval / Save Strategy | Every 100 steps |
| Best Model Selection Metric | Weighted F1 |
| Random Seed | 42 |
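The table above maps onto a Hugging Face `TrainingArguments` configuration roughly like the following. This is a sketch, not the exact training script; argument names follow recent `transformers` releases (older versions spell the evaluation flag `evaluation_strategy` rather than `eval_strategy`), and the output directory name is illustrative:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="english-sentiment-roberta-large",
    num_train_epochs=4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=2,      # effective batch size = 16
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_steps=500,
    weight_decay=0.01,
    optim="adamw_torch",
    fp16=True,
    gradient_checkpointing=True,
    eval_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    load_best_model_at_end=True,
    metric_for_best_model="f1",         # weighted F1 from the compute_metrics fn
    seed=42,
)
```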
8. Evaluation Metrics
All metrics reported are computed on the held-out test set (2,348 samples) unless stated otherwise.
Overall Performance
| Metric | Score |
|---|---|
| Accuracy | 92.12% |
| F1-Score (Weighted) | 91.91% |
| Precision (Weighted) | 91.98% |
| Recall (Weighted) | 92.12% |
Validation Set Performance
| Metric | Score |
|---|---|
| Validation Loss | 0.4273 |
| Accuracy | 92.54% |
| F1-Score (Weighted) | 92.42% |
| Precision (Weighted) | 92.46% |
| Recall (Weighted) | 92.54% |
9. Results
Per-Class Performance (Test Set)
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Negative | 91.73% | 96.40% | 94.00% | 805 |
| Neutral | 87.94% | 74.63% | 80.74% | 469 |
| Positive | 93.93% | 96.55% | 95.22% | 1,074 |
| Weighted Avg | 91.98% | 92.12% | 91.91% | 2,348 |
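The weighted averages in the last row are support-weighted means of the per-class scores, which can be checked directly:

```python
# (precision, recall, f1, support) per class, from the table above
per_class = {
    "Negative": (0.9173, 0.9640, 0.9400, 805),
    "Neutral":  (0.8794, 0.7463, 0.8074, 469),
    "Positive": (0.9393, 0.9655, 0.9522, 1074),
}

total = sum(row[3] for row in per_class.values())  # 2,348 test samples
for i, name in enumerate(["precision", "recall", "f1"]):
    weighted = sum(row[i] * row[3] for row in per_class.values()) / total
    print(f"weighted {name}: {weighted:.2%}")
# → weighted precision: 91.98%, weighted recall: 92.12%, weighted f1: 91.91%
```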
Confusion Matrix
| | Predicted: Negative | Predicted: Neutral | Predicted: Positive |
|---|---|---|---|
| Actual: Negative | 776 | 22 | 7 |
| Actual: Neutral | 59 | 350 | 60 |
| Actual: Positive | 11 | 26 | 1,037 |
The model performs very well on the Positive and Negative classes, with F1 scores of 94% or higher. The Neutral class, as expected given its lower training representation and linguistic ambiguity, shows markedly lower recall (74.63%): misclassified neutral texts are split almost evenly between Negative (59) and Positive (60) predictions, a common challenge in 3-class sentiment tasks.
Error Rate: 185 misclassifications out of 2,348 samples (7.88%)
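The accuracy and error count follow directly from the confusion matrix above:

```python
# Rows = actual, columns = predicted (Negative, Neutral, Positive)
confusion = [
    [776, 22, 7],      # actual Negative
    [59, 350, 60],     # actual Neutral
    [11, 26, 1037],    # actual Positive
]

total = sum(sum(row) for row in confusion)        # 2,348 test samples
correct = sum(confusion[i][i] for i in range(3))  # diagonal sum = 2,163
print(f"accuracy: {correct / total:.2%}")         # → 92.12%
print(f"errors:   {total - correct}")             # → 185
```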
Sample Predictions
| Input Text | Predicted Sentiment | Confidence |
|---|---|---|
| "This product is absolutely amazing! Best purchase ever!" | Positive 🟢 | 99.88% |
| "Terrible quality and poor customer service." | Negative 🔴 | 99.64% |
| "It's okay, nothing special but does the job." | Neutral 🟡 | 99.49% |
| "The outdoor seating areas are pleasant and encourage socializing." | Positive 🟢 | 99.90% |
| "Very disappointed with this purchase." | Negative 🔴 | 99.63% |
| "Average service, could be better but could be worse." | Neutral 🟡 | 99.68% |
| "Exceptional quality and fantastic customer support!" | Positive 🟢 | 99.90% |
💻 How to Use
Method 1: Pipeline API (Quickest)
```python
from transformers import pipeline

# Load the model
classifier = pipeline(
    "sentiment-analysis",
    model="tahamueed23/english-sentiment-roberta-large"
)

# Single prediction
result = classifier("This is an amazing product!")[0]

# Map label to sentiment
sentiment_map = {'LABEL_2': 'Positive', 'LABEL_1': 'Neutral', 'LABEL_0': 'Negative'}
print(f"Sentiment: {sentiment_map[result['label']]}")
print(f"Confidence: {result['score']:.2%}")
```
Method 2: Manual Inference
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "tahamueed23/english-sentiment-roberta-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "The lecture was well-structured and easy to follow."
inputs = tokenizer(text, return_tensors="pt", padding=True,
                   truncation=True, max_length=256)

model.eval()
with torch.no_grad():
    outputs = model(**inputs)
    probabilities = torch.softmax(outputs.logits, dim=-1)

predicted_class = torch.argmax(probabilities, dim=1).item()
confidence = probabilities[0][predicted_class].item()

sentiment_map = {0: "Negative", 1: "Neutral", 2: "Positive"}
print(f"Text: {text}")
print(f"Sentiment: {sentiment_map[predicted_class]}")
print(f"Confidence: {confidence:.2%}")

# All class probabilities
for i, label in sentiment_map.items():
    print(f"  {label}: {probabilities[0][i]:.2%}")
```
Method 3: Batch Processing
```python
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="tahamueed23/english-sentiment-roberta-large"
)

texts = [
    "Best experience I've had with this service!",
    "Very disappointing — would not recommend.",
    "It's acceptable, nothing remarkable.",
    "The support team resolved my issue quickly.",
    "The product broke after two days."
]

results = classifier(texts, batch_size=16)
sentiment_map = {'LABEL_2': 'Positive', 'LABEL_1': 'Neutral', 'LABEL_0': 'Negative'}

for text, result in zip(texts, results):
    sentiment = sentiment_map[result['label']]
    print(f"  {text[:60]:<60} → {sentiment:<8} ({result['score']:.2%})")
```
Method 4: Hugging Face Inference API
```python
import requests

API_URL = "https://api-inference.huggingface.co/models/tahamueed23/english-sentiment-roberta-large"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}

def query(text):
    response = requests.post(API_URL, headers=headers, json={"inputs": text})
    return response.json()

output = query("The course material was very well organized.")
print(output)
```
10. Limitations
Language and Input Constraints
| Limitation | Severity | Notes |
|---|---|---|
| Non-English text | ❌ High | Model is designed exclusively for English input |
| Very short texts (< 5 chars) | ❌ High | Insufficient context for reliable classification |
| Code-switched or mixed-language text | ⚠️ Medium | May produce unreliable predictions |
| Sarcasm and irony | ⚠️ Medium | Nuanced text may be misclassified |
| Domain-specific jargon | ⚠️ Low | Performance may degrade on highly specialized content |
| Emojis only (no text) | ❌ High | Requires at least some textual content |
Technical Constraints
| Constraint | Value | Notes |
|---|---|---|
| Max Sequence Length | 256 tokens | Longer texts will be truncated |
| Model Size | ~1.42 GB | Quantization recommended for low-memory environments |
| CPU Inference Speed | ~450ms/sample | GPU strongly recommended for production use |
| GPU Inference Speed | ~45ms/sample | Real-time capable on modern GPUs |
Known Weaknesses
The model shows its weakest performance on the Neutral class (F1: 80.74%, Recall: 74.63%), which is a known challenge in 3-class sentiment tasks. Texts that contain both positive and negative elements, or those that are factually descriptive without strong polarity, are most likely to be misclassified. Error analysis reveals that the most common mistake is misclassifying neutral educational feedback as positive, particularly when the text contains surface-level positive language without an explicit sentiment intent.
11. Ethical Considerations
This model was trained on feedback collected from educational and consumer contexts. Users should be aware of the following considerations before deploying it:
Bias. The training data reflects the language patterns and sentiments present in specific feedback domains (primarily academic and product reviews). The model may not generalize equitably across all demographic groups, writing styles, or cultural contexts. It could underperform on dialects or informal registers underrepresented in the training corpus.
Responsible deployment. Sentiment predictions should not be used as the sole basis for high-stakes decisions affecting individuals (e.g., employee performance review, student grading, or content banning). Human review is strongly recommended for sensitive applications.
Data privacy. Ensure that any text submitted for inference does not contain personally identifiable information (PII), as text is processed through the model at inference time.
Feedback misinterpretation. The Neutral class has lower recall, meaning some genuinely neutral or ambiguous feedback may be incorrectly flagged as Positive or Negative. Downstream consumers should be designed to tolerate this level of uncertainty.
12. Future Improvements
| Planned Update | Version | Description |
|---|---|---|
| Quantized Version | v1.0.1 | INT8 quantization for ~75% model size reduction |
| Improved Neutral Detection | v1.1.0 | Data augmentation and focal loss to address class imbalance |
| Domain Adaptation | v1.2.0 | Specialized fine-tuning for healthcare, legal, or financial text |
| 4-Class Emotion Detection | v2.0.0 | Extending to: Joy, Anger, Sadness, Neutral |
| Multilingual Support | v2.1.0 | Cross-lingual model supporting English + Urdu + Roman Urdu |
| Gradio Demo | — | Interactive web demo on Hugging Face Spaces |
📚 Citation
If you use this model in your research or project, please cite:
@misc{english_sentiment_roberta_2026,
author = {Taha Mueed},
title = {English Sentiment Analysis: A Fine-tuned RoBERTa-Large Model for 3-Class Sentiment Classification},
year = {2026},
publisher = {Hugging Face Hub},
howpublished = {\url{https://huggingface.co/tahamueed23/english-sentiment-roberta-large}},
note = {Version 1.0, Test Accuracy: 92.12\%, F1: 91.91\%}
}
📄 License
This model is released under the MIT License.
MIT License
Copyright (c) 2026 Taha Mueed
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
📬 Contact & Support
| Channel | Details | Response Time |
|---|---|---|
| 🤗 Hugging Face | tahamueed23/english-sentiment-roberta-large | < 24h |
| 📧 Email | tahamueed23@gmail.com | < 72h |
🙏 Acknowledgments
- Hugging Face — Transformers library and model hosting infrastructure
- siebert / Cardiff NLP — Base RoBERTa-Large sentiment model and research
- PyTorch Team — Deep learning framework
- scikit-learn — Evaluation metrics and class weight computation