XLM-RoBERTa Fine-tuned for Khmer News Classification

Model Description

This model is a fine-tuned version of FacebookAI/xlm-roberta-base for Khmer-language news text classification. It was trained to categorize Cambodian news articles into predefined topic categories.

  • Model type: Text Classification
  • Language: Khmer (km)
  • License: MIT
  • Base model: XLM-RoBERTa Base

Intended Use

Direct Use

  • Classifying Khmer news articles by topic/category
  • Building Khmer-language news aggregators or filters
  • Research on low-resource language NLP

Out-of-Scope Use

  • Non-Khmer languages (performance not guaranteed)
  • Tasks other than news classification
  • Hate speech or harmful content detection

Training Details

Training Data

  • Dataset splits: train 5.14K rows, validation 661 rows, test 1.54K rows
  • Classes: Politics, Economic, Entertainment, Technology, Sport, Life
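Label ids map to these six categories. The mapping below is a hypothetical sketch (alphabetical order, lowercase names matching the per-class results table in the Evaluation section); the authoritative `id2label` ships in the model's `config.json`.

```python
# Hypothetical id2label mapping for the six categories; the real
# mapping should be read from model.config.id2label after loading.
id2label = {
    0: "economic",
    1: "entertainment",
    2: "life",
    3: "politic",
    4: "sport",
    5: "technology",
}
label2id = {name: idx for idx, name in id2label.items()}
```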

Training Procedure

  • Base model: xlm-roberta-base
  • Framework: Hugging Face Transformers
  • Training epochs: 8
  • Batch size: 32
  • Learning rate: 2e-5
  • Optimizer: AdamW
  • LR scheduler: Cosine decay
  • Warmup ratio: 0.1
  • Weight decay: 0.01
  • Hardware: NVIDIA T4 GPU (Google Colab)
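The hyperparameters above can be collected into the keyword arguments one would pass to `transformers.TrainingArguments`. This is a sketch reconstructed from the list; the actual training script is not published.

```python
# Hyperparameters from the training procedure above, gathered as
# TrainingArguments keyword arguments (assumed, not the exact script).
training_kwargs = dict(
    num_train_epochs=8,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    weight_decay=0.01,
)
# e.g. args = TrainingArguments(output_dir="out", **training_kwargs)
```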

Evaluation

Metrics

Metric       Value
Accuracy     94%
F1 (macro)   94%
AUC          0.9933
Error rate   0.056

Per-class Results

Category       Precision  Recall  F1
economic       0.91       0.94    0.93
entertainment  0.94       0.97    0.96
life           0.87       0.82    0.85
politic        0.97       0.96    0.97
sport          0.97       0.99    0.98
technology     0.93       0.92    0.92
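Per-class precision, recall, and F1 like the table above can be produced with scikit-learn's `classification_report`. The snippet below is a sketch on toy labels, not the model's actual test-set predictions.

```python
# Sketch: computing per-class metrics with scikit-learn on toy data
# (the real table comes from predictions on the held-out test set).
from sklearn.metrics import classification_report

y_true = ["sport", "sport", "politic", "life", "economic"]
y_pred = ["sport", "sport", "politic", "economic", "economic"]

report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)
print(report["sport"])  # per-class precision/recall/f1 for "sport"
```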

How to Use

from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="kidkidmoon/xlm-r-khmer-news-classification"
)

text = "αž“αžΆαž™αž€αžšαžŠαŸ’αž‹αž˜αž“αŸ’αžαŸ’αžšαžΈαž”αžΆαž“αžαŸ’αž›αŸ‚αž„αž“αŸ…αž€αŸ’αž“αž»αž„αžŸαž“αŸ’αž“αž·αžŸαžΈαž‘αžŸαžΆαžšαž–αŸαžαŸŒαž˜αžΆαž“"  # Khmer: "The Prime Minister stated at a press conference"
result = classifier(text)
print(result)

Or manually with the tokenizer:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "kidkidmoon/xlm-r-khmer-news-classification"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

inputs = tokenizer("your Khmer text here", return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

predicted_class = logits.argmax(dim=-1).item()
print(model.config.id2label[predicted_class])
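To report a confidence score alongside the predicted label, apply a softmax over the logits from the manual example above. The snippet uses toy logits for the six classes so it runs standalone; in practice `logits` comes from the model call shown earlier.

```python
import torch

# Toy logits for six classes; in practice these come from
# model(**inputs).logits as in the example above.
logits = torch.tensor([[0.1, 2.5, 0.3, 0.2, 0.0, -1.0]])

probs = torch.softmax(logits, dim=-1)        # normalize to probabilities
confidence, predicted = probs.max(dim=-1)    # top class and its probability
print(predicted.item(), round(confidence.item(), 3))
```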

Limitations

  • Performance may degrade on informal or social media Khmer text
  • Trained on news domain only β€” may not generalize to other text types
  • Class imbalance in training data may affect minority categories

Citation

If you use this model, please cite:

@misc{kimlangsrun2025khmer,
  author    = {Srun Kimlang},
  title     = {XLM-RoBERTa Fine-tuned for Khmer News Classification},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/kidkidmoon/xlm-r-khmer-news-classification}
}