🌍 Health Topic Question Classification

Multilingual 🇮🇩🇬🇧 | XLM-RoBERTa-Base

A fine-tuned xlm-roberta-base model that classifies health-related questions from online forums by topic.

This model supports Bahasa Indonesia and English, making it suitable for social listening, content moderation, and healthcare Q&A pipelines.


🔍 Model Overview

  • Architecture: xlm-roberta-base
  • Task: Text Classification (Health Topic)
  • Languages: Indonesian, English
  • Domain: Health forum questions / social media informal health text
  • Training Date: 2026-03-16

🏷️ Supported Health Topic Labels

| Label | Description |
|---|---|
| 0 | Alergi / Allergy |
| 1 | Obat / Medication |
| 2 | Menstruasi / Menstruation |
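If you need the label indices programmatically (e.g. to populate `id2label`/`label2id` in a model config), the table above can be expressed as a plain dict. The short names below are the same `label_cols` used in the usage examples further down:

```python
# Index-to-label mapping for the three supported topics
id2label = {0: "alergi", 1: "obat", 2: "menstruasi"}
label2id = {v: k for k, v in id2label.items()}
```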

📊 Model Performance

Evaluated on a held-out validation set:

| Metric | Score |
|---|---|
| F1 Score | 0.94 |
| Precision | 0.94 |
| Recall | 0.94 |
| Accuracy | 0.94 |
| Training Loss | 0.1477 |
| Validation Loss | 0.2057 |

🏗️ Training Configuration

| Parameter | Value |
|---|---|
| Base Model | xlm-roberta-base |
| Training Data | Online health forum questions (Indonesia) |
| Epochs | 3 |
| Learning Rate | 2e-5 |
| Batch Size | 16 (train), 32 (eval) |
| Optimizer | AdamW |
| Framework | Hugging Face Transformers |
| Training Speed | 24.08 samples/sec |

🚀 Usage

Preprocessing Configuration

```python
import re

def clean_text(text):
    """Normalize informal forum/social-media text before tokenization."""
    if not isinstance(text, str):
        return text
    text = text.replace("#", "<hashtag>")                          # hashtags
    text = re.sub(r"https?://\S+|www\.\S+", "<link>", text)        # URLs
    text = re.sub(r"\b[\w\.-]+@[\w\.-]+\.\w+\b", "<email>", text)  # emails
    text = re.sub(r"@\w+", "<user>", text)                         # @mentions
    text = text.replace('"', "").replace("'", "")                  # strip quotes
    text = text.replace("\n", " ").replace("\\n", " ")             # newlines (real and escaped)
    text = re.sub(r"\s+", " ", text).strip()                       # collapse whitespace
    return text
```
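One subtlety in the helper above: the email pattern must run before the mention pattern, otherwise the `@domain` part of an address would be masked as a user mention. A minimal standalone illustration (condensed from `clean_text` for this example):

```python
import re

def mask(text):
    # Order matters: mask full email addresses first, then bare @mentions
    text = re.sub(r"\b[\w\.-]+@[\w\.-]+\.\w+\b", "<email>", text)
    text = re.sub(r"@\w+", "<user>", text)
    return text

print(mask("Hubungi dokter@klinik.id atau @dokter_kulit"))
# → Hubungi <email> atau <user>
```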

Quick Inference (Single Text)

```python
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig

MODEL_NAME = "iqbalpurba26/topic-classification-xlmr"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
config = AutoConfig.from_pretrained(
    MODEL_NAME,
    num_labels=3,
    problem_type="single_label_classification"
)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    config=config
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()  # disable dropout for inference

label_cols = ["alergi", "obat", "menstruasi"]

def predict_health_topic(text):
    text = clean_text(text)
    inputs = tokenizer(text, padding=True, truncation=True, max_length=128, return_tensors="pt").to(device)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).cpu().numpy()[0]
    top_idx = np.argmax(probs)
    return {
        "text": text,
        "predicted_label": label_cols[top_idx],
        "probability": float(probs[top_idx])
    }

# "I got an itchy rash after taking a new medication."
text = "Saya mengalami gatal-gatal setelah mengonsumsi obat baru."
print(predict_health_topic(text))
```
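`predict_health_topic` turns the model's raw logits into a probability distribution with a softmax before picking the top class. The same computation in plain Python, independent of torch (a sketch with made-up logit values, for illustration only):

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.1, -0.3, 0.8])
print(probs.index(max(probs)))  # → 0 (argmax is unchanged by softmax)
```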

Quick Inference (Batch)

```python
def predict_batch(texts, batch_size=16, max_length=128):
    results = []
    for i in range(0, len(texts), batch_size):
        # Apply the same preprocessing as single-text inference
        batch_texts = [clean_text(t) for t in texts[i:i + batch_size]]
        encodings = tokenizer(batch_texts, padding=True, truncation=True, max_length=max_length, return_tensors="pt").to(device)
        with torch.no_grad():
            logits = model(**encodings).logits
        probs = torch.softmax(logits, dim=-1).cpu().numpy()
        for text, prob in zip(batch_texts, probs):
            top_idx = np.argmax(prob)
            results.append({
                "text": text,
                "predicted_label": label_cols[top_idx],
                "probability": float(prob[top_idx])
            })
    return results
```
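The loop in `predict_batch` relies on Python slicing to split the input into fixed-size chunks; slicing past the end of a list is safe, so the final batch simply holds the remainder. The pattern in isolation:

```python
def chunks(items, batch_size):
    # Split a list into consecutive batches of at most batch_size items
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

print(chunks(list(range(5)), 2))  # → [[0, 1], [2, 3], [4]]
```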

🎯 Intended Use Cases

  • Health question classification
  • Social media health content monitoring
  • Forum and Q&A content organization
  • Healthcare chatbots and assistance
  • Multilingual health data pipelines

⚠️ Limitations

  • Supports only the defined health topic labels: ["alergi", "obat", "menstruasi"]
  • Not optimized for:
    • Complex medical records
    • Extremely short or ambiguous questions
    • Highly slang-heavy or sarcastic expressions
  • Performance may vary for heavily code-mixed sentences
  • Model may inherit bias from the training forum data

⚖️ Ethical Considerations

  • Not intended to replace professional medical advice.
  • Human-in-the-loop review recommended for sensitive deployments.
  • Reflects possible biases from online forum data.

🖥️ Hardware Recommendations

  • Recommended: GPU (≥ 8GB VRAM)
  • CPU inference supported but slower
  • FP16 mixed precision supported for faster inference

📜 License

Released under the Apache 2.0 License.
Free for research and commercial use.


📚 Citation

```bibtex
@misc{iqbalpurba20032026healthtopicxlm,
  author    = {M. Iqbal Purba},
  title     = {Multilingual Health Topic Question Classification},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/iqbalpurba26/topic-classification-xlmr}
}
```