# XLM-RoBERTa Fine-tuned for Khmer News Classification

## Model Description
This model is a fine-tuned version of FacebookAI/xlm-roberta-base for Khmer-language news text classification. It was trained to categorize Cambodian news articles into predefined topic categories.
- Model type: Text Classification
- Language: Khmer (km)
- License: MIT
- Base model: XLM-RoBERTa Base
## Intended Use

### Direct Use
- Classifying Khmer news articles by topic/category
- Building Khmer-language news aggregators or filters
- Research on low-resource language NLP
### Out-of-Scope Use
- Non-Khmer languages (performance not guaranteed)
- Tasks other than news classification
- Hate speech or harmful content detection
## Training Details

### Training Data
- Dataset splits: train 5.14k rows, validation 661 rows, test 1.54k rows
- Classes: Politics, Economic, Entertainment, Technology, Sport, Life
### Training Procedure
- Base model: xlm-roberta-base
- Framework: Hugging Face Transformers
- Training epochs: 8
- Batch size: 32
- Learning rate: 2e-5
- Optimizer: AdamW
- LR scheduler: Cosine decay
- Warmup ratio: 0.1
- Weight decay: 0.01
- Hardware: NVIDIA T4 GPU (Google Colab)
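The hyperparameters above map directly onto Hugging Face `TrainingArguments`; a minimal sketch, assuming a standard `Trainer` setup (the `output_dir` value is illustrative, not taken from this card):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="xlm-r-khmer-news",   # assumed path, not from the card
    num_train_epochs=8,              # training epochs
    per_device_train_batch_size=32,  # batch size
    learning_rate=2e-5,
    optim="adamw_torch",             # AdamW optimizer
    lr_scheduler_type="cosine",      # cosine decay schedule
    warmup_ratio=0.1,
    weight_decay=0.01,
)
```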
## Evaluation

### Metrics
| Metric | Value |
|---|---|
| Accuracy | 94% |
| F1 (macro) | 94% |
| AUC | 0.9933 |
| Error rate | 0.056 |
### Per-class Results
| Category | Precision | Recall | F1 |
|---|---|---|---|
| economic | 0.91 | 0.94 | 0.93 |
| entertainment | 0.94 | 0.97 | 0.96 |
| life | 0.87 | 0.82 | 0.85 |
| politic | 0.97 | 0.96 | 0.97 |
| sport | 0.97 | 0.99 | 0.98 |
| technology | 0.93 | 0.92 | 0.92 |
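As a quick sanity check, the macro F1 reported above is the unweighted mean of the per-class F1 scores; the following sketch recomputes it from the table's values:

```python
# Per-class F1 scores copied from the table above.
per_class_f1 = {
    "economic": 0.93,
    "entertainment": 0.96,
    "life": 0.85,
    "politic": 0.97,
    "sport": 0.98,
    "technology": 0.92,
}

# Macro F1 averages all classes equally, regardless of class size.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(f"{macro_f1:.3f}")  # 0.935, which rounds to the reported 94%
```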
## How to Use
```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="kidkidmoon/xlm-r-khmer-news-classification"
)

text = "..."  # replace with your Khmer news text
result = classifier(text)
print(result)
```
Or manually with the tokenizer:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "kidkidmoon/xlm-r-khmer-news-classification"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

inputs = tokenizer("your Khmer text here", return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

predicted_class = logits.argmax(dim=-1).item()
print(model.config.id2label[predicted_class])
```
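The manual path above returns raw logits; to report class probabilities instead of just the argmax, apply a softmax (in PyTorch, `logits.softmax(dim=-1)`). The conversion itself is simple enough to sketch in plain Python, with made-up logit values standing in for real model output:

```python
import math

def softmax(logits):
    """Map raw scores to probabilities that sum to 1."""
    # Subtracting the max keeps exp() numerically stable.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the six news classes (illustrative only).
logits = [2.1, -0.3, 0.4, 5.2, -1.0, 0.8]
probs = softmax(logits)
print(probs)  # the largest probability is at index 3, same as argmax(logits)
```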
## Limitations
- Performance may degrade on informal or social media Khmer text
- Trained on news domain only β may not generalize to other text types
- Class imbalance in training data may affect minority categories
## Citation
If you use this model, please cite:
```bibtex
@misc{kimlangsrun2025khmer,
  author    = {Srun Kimlang},
  title     = {XLM-RoBERTa Fine-tuned for Khmer News Classification},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/kidkidmoon/xlm-r-khmer-news-classification}
}
```