# 🌍 Health Topic Question Classification

Multilingual 🇮🇩🇬🇧 | XLM-RoBERTa-Base
A fine-tuned xlm-roberta-base model that classifies the health topic of questions posted on online forums. It supports Bahasa Indonesia and English, making it suitable for social listening, content moderation, and healthcare Q&A pipelines.
## 🔍 Model Overview

- Architecture: xlm-roberta-base
- Task: Text Classification (Health Topic)
- Languages: Indonesian, English
- Domain: Health forum questions / informal health text from social media
- Training Date: 2026-03-16
## 🏷️ Supported Health Topic Labels
| Label | Description |
|---|---|
| 0 | Alergi / Allergy |
| 1 | Obat / Medication |
| 2 | Menstruasi / Menstruation |
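The label table above can be mirrored as a simple id-to-label mapping. Note this is a hedged sketch: the lowercase names follow the `label_cols` list used in the inference examples in this card, and the model's actual `config.id2label` may use different strings.

```python
# Hypothetical id→label mapping mirroring the table above.
ID2LABEL = {
    0: "alergi",      # Allergy
    1: "obat",        # Medication
    2: "menstruasi",  # Menstruation
}

print(ID2LABEL[1])  # obat
```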
## 📊 Model Performance

Evaluated on a held-out validation set:
| Metric | Score |
|---|---|
| F1 Score | 0.94 |
| Precision | 0.94 |
| Recall | 0.94 |
| Accuracy | 0.94 |
| Training Loss | 0.1477 |
| Validation Loss | 0.2057 |
## 🏗️ Training Configuration
| Parameter | Value |
|---|---|
| Base Model | xlm-roberta-base |
| Training Samples | From online health forums (Indonesia) |
| Epochs | 3 |
| Learning Rate | 2e-5 |
| Batch Size | 16 (train), 32 (eval) |
| Optimizer | AdamW |
| Framework | Hugging Face Transformers |
| Training Speed | 24.08 samples/sec |
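As a convenience, the hyperparameters in the table can be collected into a plain dict, e.g. for passing to `transformers.TrainingArguments(**hyperparams)`. This is a sketch: the exact argument names used in the original training run are assumptions.

```python
# Hyperparameters from the table above; argument names follow the
# transformers.TrainingArguments convention but are not confirmed.
hyperparams = {
    "num_train_epochs": 3,
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 32,
}

print(hyperparams["learning_rate"])  # 2e-05
```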
## 🚀 Usage

### Preprocessing Configuration

```python
import re

def clean_text(text):
    """Normalize informal forum text before tokenization."""
    if not isinstance(text, str):
        return text
    text = text.replace("#", "<hashtag>")
    text = re.sub(r"https?://\S+|www\.\S+", "<link>", text)
    text = re.sub(r"\b[\w\.-]+@[\w\.-]+\.\w+\b", "<email>", text)
    text = re.sub(r"@\w+", "<user>", text)
    text = text.replace('"', "").replace("'", "")
    text = text.replace("\n", " ")
    text = text.replace("\\n", " ")
    text = re.sub(r"\s+", " ", text).strip()
    return text
```
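To make the substitutions concrete, here is a subset of them applied step by step to a made-up forum post (the sample text is hypothetical):

```python
import re

# Illustration of the main substitutions performed by clean_text,
# in the same order: hashtags, URLs, emails, @-mentions, whitespace.
raw = "cek @dokter https://example.com info #alergi\nemail: a.b@mail.co"
text = raw
text = text.replace("#", "<hashtag>")
text = re.sub(r"https?://\S+|www\.\S+", "<link>", text)
text = re.sub(r"\b[\w\.-]+@[\w\.-]+\.\w+\b", "<email>", text)
text = re.sub(r"@\w+", "<user>", text)
text = text.replace("\n", " ")
text = re.sub(r"\s+", " ", text).strip()
print(text)  # cek <user> <link> info <hashtag>alergi email: <email>
```

Note that the URL is replaced before the email and @-mention rules run, so `@dokter` becomes `<user>` rather than being caught by the email pattern.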
### Quick Inference (Single Text)

```python
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig

MODEL_NAME = "iqbalpurba26/topic-classification-xlmr"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
config = AutoConfig.from_pretrained(
    MODEL_NAME,
    num_labels=3,
    problem_type="single_label_classification",
)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, config=config)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()  # disable dropout for deterministic inference

label_cols = ["alergi", "obat", "menstruasi"]

def predict_health_topic(text):
    text = clean_text(text)
    inputs = tokenizer(
        text, padding=True, truncation=True, max_length=128, return_tensors="pt"
    ).to(device)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).cpu().numpy()[0]
    top_idx = np.argmax(probs)
    return {
        "text": text,
        "predicted_label": label_cols[top_idx],
        "probability": float(probs[top_idx]),
    }

text = "Saya mengalami gatal-gatal setelah mengonsumsi obat baru."
print(predict_health_topic(text))
```
### Quick Inference (Batch)

```python
def predict_batch(texts, batch_size=16, max_length=128):
    results = []
    for i in range(0, len(texts), batch_size):
        # Apply the same cleaning used in single-text inference.
        batch_texts = [clean_text(t) for t in texts[i:i + batch_size]]
        encodings = tokenizer(
            batch_texts, padding=True, truncation=True,
            max_length=max_length, return_tensors="pt"
        ).to(device)
        with torch.no_grad():
            logits = model(**encodings).logits
        probs = torch.softmax(logits, dim=-1).cpu().numpy()
        for text, prob in zip(batch_texts, probs):
            top_idx = np.argmax(prob)
            results.append({
                "text": text,
                "predicted_label": label_cols[top_idx],
                "probability": float(prob[top_idx]),
            })
    return results
```
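The batching loop above relies on Python list slicing to split the input into fixed-size chunks, with a final partial chunk handled automatically. The logic can be illustrated standalone:

```python
# Same chunking pattern as predict_batch: slices of batch_size items,
# with the last slice allowed to be shorter than batch_size.
texts = [f"q{i}" for i in range(5)]
batch_size = 2
batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
print(batches)  # [['q0', 'q1'], ['q2', 'q3'], ['q4']]
```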
## 🎯 Intended Use Cases
- Health question classification
- Social media health content monitoring
- Forum and Q&A content organization
- Healthcare chatbots and assistants
- Multilingual health data pipelines
## ⚠️ Limitations

- Supports only the defined health topic labels: ["alergi", "obat", "menstruasi"]
- Not optimized for:
- Complex medical records
- Extremely short or ambiguous questions
- Highly slang-heavy or sarcastic expressions
- Performance may vary for heavily code-mixed sentences
- Model may inherit bias from the training forum data
## ⚖️ Ethical Considerations
- Not intended to replace professional medical advice.
- Human-in-the-loop review recommended for sensitive deployments.
- Reflects possible biases from online forum data.
## 🖥️ Hardware Recommendations
- Recommended: GPU (≥ 8GB VRAM)
- CPU inference supported but slower
- FP16 mixed precision supported for faster inference
## 📜 License

Released under the Apache 2.0 License. Free for research and commercial use.
## 📚 Citation

```bibtex
@misc{iqbalpurba20032026healthtopicxlm,
  author    = {M. Iqbal Purba},
  title     = {Multilingual Health Topic Question Classification},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/iqbalpurba26/topic-classification-xlmr}
}
```