🌍 Health Topic Question Classification

Multilingual 🇮🇩🇬🇧 | XLM-RoBERTa-Base

A fine-tuned xlm-roberta-base model that classifies health-related questions from online forums by topic.

This model supports Bahasa Indonesia and English, making it suitable for social listening, content moderation, and healthcare Q&A pipelines.


🔍 Model Overview

  • Architecture: xlm-roberta-base
  • Task: Text Classification (Health Topic)
  • Languages: Indonesian, English
  • Domain: Health forum questions / social media informal health text
  • Training Date: 2026-03-16

🏷️ Supported Health Topic Labels

| Label | Description |
|---|---|
| 0 | Alergi / Allergy |
| 1 | Obat / Medication |
| 2 | Menstruasi / Menstruation |
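If you need the label indices programmatically (e.g. to populate `id2label`/`label2id` in a model config), the table above can be expressed as a plain dict. The short names below are the same `label_cols` used in the usage examples further down:

```python
# Index-to-label mapping for the three supported topics
id2label = {0: "alergi", 1: "obat", 2: "menstruasi"}
label2id = {v: k for k, v in id2label.items()}
```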

📊 Model Performance

Evaluated on a held-out validation set:

| Metric | Score |
|---|---|
| F1 Score | 0.94 |
| Precision | 0.94 |
| Recall | 0.94 |
| Accuracy | 0.94 |
| Training Loss | 0.1477 |
| Validation Loss | 0.2057 |

🏗️ Training Configuration

| Parameter | Value |
|---|---|
| Base Model | xlm-roberta-base |
| Training Data | Online health forum questions (Indonesia) |
| Epochs | 3 |
| Learning Rate | 2e-5 |
| Batch Size | 16 (train), 32 (eval) |
| Optimizer | AdamW |
| Framework | Hugging Face Transformers |
| Training Speed | 24.08 samples/sec |

🚀 Usage

Preprocessing Configuration

```python
import re

def clean_text(text):
    """Normalize informal forum/social-media text before tokenization."""
    if not isinstance(text, str):
        return text
    text = text.replace("#", "<hashtag>")                          # hashtags
    text = re.sub(r"https?://\S+|www\.\S+", "<link>", text)        # URLs
    text = re.sub(r"\b[\w\.-]+@[\w\.-]+\.\w+\b", "<email>", text)  # emails
    text = re.sub(r"@\w+", "<user>", text)                         # @mentions
    text = text.replace('"', "").replace("'", "")                  # strip quotes
    text = text.replace("\n", " ").replace("\\n", " ")             # newlines (real and escaped)
    text = re.sub(r"\s+", " ", text).strip()                       # collapse whitespace
    return text
```
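One subtlety in the helper above: the email pattern must run before the mention pattern, otherwise the `@domain` part of an address would be masked as a user mention. A minimal standalone illustration (condensed from `clean_text` for this example):

```python
import re

def mask(text):
    # Order matters: mask full email addresses first, then bare @mentions
    text = re.sub(r"\b[\w\.-]+@[\w\.-]+\.\w+\b", "<email>", text)
    text = re.sub(r"@\w+", "<user>", text)
    return text

print(mask("Hubungi dokter@klinik.id atau @dokter_kulit"))
# → Hubungi <email> atau <user>
```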

Quick Inference (Single Text)

```python
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig

MODEL_NAME = "iqbalpurba26/topic-classification-xlmr"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
config = AutoConfig.from_pretrained(
    MODEL_NAME,
    num_labels=3,
    problem_type="single_label_classification"
)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    config=config
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()  # disable dropout for inference

label_cols = ["alergi", "obat", "menstruasi"]

def predict_health_topic(text):
    text = clean_text(text)
    inputs = tokenizer(text, padding=True, truncation=True, max_length=128, return_tensors="pt").to(device)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).cpu().numpy()[0]
    top_idx = np.argmax(probs)
    return {
        "text": text,
        "predicted_label": label_cols[top_idx],
        "probability": float(probs[top_idx])
    }

# "I got an itchy rash after taking a new medication."
text = "Saya mengalami gatal-gatal setelah mengonsumsi obat baru."
print(predict_health_topic(text))
```
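`predict_health_topic` turns the model's raw logits into a probability distribution with a softmax before picking the top class. The same computation in plain Python, independent of torch (a sketch with made-up logit values, for illustration only):

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.1, -0.3, 0.8])
print(probs.index(max(probs)))  # → 0 (argmax is unchanged by softmax)
```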

Quick Inference (Batch)

```python
def predict_batch(texts, batch_size=16, max_length=128):
    results = []
    for i in range(0, len(texts), batch_size):
        # Apply the same preprocessing as single-text inference
        batch_texts = [clean_text(t) for t in texts[i:i + batch_size]]
        encodings = tokenizer(batch_texts, padding=True, truncation=True, max_length=max_length, return_tensors="pt").to(device)
        with torch.no_grad():
            logits = model(**encodings).logits
        probs = torch.softmax(logits, dim=-1).cpu().numpy()
        for text, prob in zip(batch_texts, probs):
            top_idx = np.argmax(prob)
            results.append({
                "text": text,
                "predicted_label": label_cols[top_idx],
                "probability": float(prob[top_idx])
            })
    return results
```
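The loop in `predict_batch` relies on Python slicing to split the input into fixed-size chunks; slicing past the end of a list is safe, so the final batch simply holds the remainder. The pattern in isolation:

```python
def chunks(items, batch_size):
    # Split a list into consecutive batches of at most batch_size items
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

print(chunks(list(range(5)), 2))  # → [[0, 1], [2, 3], [4]]
```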

🎯 Intended Use Cases

  • Health question classification
  • Social media health content monitoring
  • Forum and Q&A content organization
  • Healthcare chatbots and assistance
  • Multilingual health data pipelines

⚠️ Limitations

  • Supports only the defined health topic labels: ["alergi", "obat", "menstruasi"]
  • Not optimized for:
    • Complex medical records
    • Extremely short or ambiguous questions
    • Highly slang-heavy or sarcastic expressions
  • Performance may vary for heavily code-mixed sentences
  • Model may inherit bias from the training forum data

⚖️ Ethical Considerations

  • Not intended to replace professional medical advice.
  • Human-in-the-loop review recommended for sensitive deployments.
  • Reflects possible biases from online forum data.

🖥️ Hardware Recommendations

  • Recommended: GPU (≥ 8GB VRAM)
  • CPU inference supported but slower
  • FP16 mixed precision supported for faster inference

📜 License

Released under the Apache 2.0 License.
Free for research and commercial use.


📚 Citation

```bibtex
@misc{iqbalpurba20032026healthtopicxlm,
  author    = {M. Iqbal Purba},
  title     = {Multilingual Health Topic Question Classification},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/iqbalpurba26/topic-classification-xlmr}
}
```