XLM-RoBERTa Fine-tuned for Khmer News Classification

Model Description

This model is a fine-tuned version of FacebookAI/xlm-roberta-base for Khmer-language news text classification. It was trained to categorize Cambodian news articles into predefined topic categories.

  • Model type: Text Classification
  • Language: Khmer (km)
  • License: MIT
  • Base model: XLM-RoBERTa Base

Intended Use

Direct Use

  • Classifying Khmer news articles by topic/category
  • Building Khmer-language news aggregators or filters
  • Research on low-resource language NLP

Out-of-Scope Use

  • Non-Khmer languages (performance not guaranteed)
  • Tasks other than news classification
  • Hate speech or harmful content detection

Training Details

Training Data

  • Dataset splits: train 5.14K rows, validation 661 rows, test 1.54K rows
  • Classes: Politics, Economic, Entertainment, Technology, Sport, Life
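Label ids map to these six categories. The mapping below is a hypothetical sketch (alphabetical order, lowercase names matching the per-class results table in the Evaluation section); the authoritative `id2label` ships in the model's `config.json`.

```python
# Hypothetical id2label mapping for the six categories; the real
# mapping should be read from model.config.id2label after loading.
id2label = {
    0: "economic",
    1: "entertainment",
    2: "life",
    3: "politic",
    4: "sport",
    5: "technology",
}
label2id = {name: idx for idx, name in id2label.items()}
```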

Training Procedure

  • Base model: xlm-roberta-base
  • Framework: Hugging Face Transformers
  • Training epochs: 8
  • Batch size: 32
  • Learning rate: 2e-5
  • Optimizer: AdamW
  • LR scheduler: Cosine decay
  • Warmup ratio: 0.1
  • Weight decay: 0.01
  • Hardware: NVIDIA T4 GPU (Google Colab)
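The hyperparameters above can be collected into the keyword arguments one would pass to `transformers.TrainingArguments`. This is a sketch reconstructed from the list; the actual training script is not published.

```python
# Hyperparameters from the training procedure above, gathered as
# TrainingArguments keyword arguments (assumed, not the exact script).
training_kwargs = dict(
    num_train_epochs=8,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    weight_decay=0.01,
)
# e.g. args = TrainingArguments(output_dir="out", **training_kwargs)
```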

Evaluation

Metrics

Metric       Value
Accuracy     94%
F1 (macro)   94%
AUC          0.9933
Error rate   0.056

Per-class Results

Category       Precision  Recall  F1
economic       0.91       0.94    0.93
entertainment  0.94       0.97    0.96
life           0.87       0.82    0.85
politic        0.97       0.96    0.97
sport          0.97       0.99    0.98
technology     0.93       0.92    0.92
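Per-class precision, recall, and F1 like the table above can be produced with scikit-learn's `classification_report`. The snippet below is a sketch on toy labels, not the model's actual test-set predictions.

```python
# Sketch: computing per-class metrics with scikit-learn on toy data
# (the real table comes from predictions on the held-out test set).
from sklearn.metrics import classification_report

y_true = ["sport", "sport", "politic", "life", "economic"]
y_pred = ["sport", "sport", "politic", "economic", "economic"]

report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)
print(report["sport"])  # per-class precision/recall/f1 for "sport"
```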

How to Use

from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="kidkidmoon/xlm-r-khmer-news-classification"
)

text = "αž“αžΆαž™αž€αžšαžŠαŸ’αž‹αž˜αž“αŸ’αžαŸ’αžšαžΈαž”αžΆαž“αžαŸ’αž›αŸ‚αž„αž“αŸ…αž€αŸ’αž“αž»αž„αžŸαž“αŸ’αž“αž·αžŸαžΈαž‘αžŸαžΆαžšαž–αŸαžαŸŒαž˜αžΆαž“"  # Khmer: "The Prime Minister stated at a press conference"
result = classifier(text)
print(result)

Or manually with the tokenizer:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "kidkidmoon/xlm-r-khmer-news-classification"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

inputs = tokenizer("your Khmer text here", return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

predicted_class = logits.argmax(dim=-1).item()
print(model.config.id2label[predicted_class])
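To report a confidence score alongside the predicted label, apply a softmax over the logits from the manual example above. The snippet uses toy logits for the six classes so it runs standalone; in practice `logits` comes from the model call shown earlier.

```python
import torch

# Toy logits for six classes; in practice these come from
# model(**inputs).logits as in the example above.
logits = torch.tensor([[0.1, 2.5, 0.3, 0.2, 0.0, -1.0]])

probs = torch.softmax(logits, dim=-1)        # normalize to probabilities
confidence, predicted = probs.max(dim=-1)    # top class and its probability
print(predicted.item(), round(confidence.item(), 3))
```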

Limitations

  • Performance may degrade on informal or social media Khmer text
  • Trained on news domain only β€” may not generalize to other text types
  • Class imbalance in training data may affect minority categories

Citation

If you use this model, please cite:

@misc{kimlangsrun2025khmer,
  author    = {Srun Kimlang},
  title     = {XLM-RoBERTa Fine-tuned for Khmer News Classification},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/kidkidmoon/xlm-r-khmer-news-classification}
}