xlm-roberta-large-online-counseling-oncoco

Fine-tuned XLM-RoBERTa large model for fine-grained message classification in psychosocial online counseling conversations. Trained on the OnCoCo 1.0 dataset.

Try it out: OnCoCo Message Classifier Space

Model Description

This model classifies individual messages from online counseling conversations into one of 66 fine-grained categories — 38 counselor and 28 client categories — covering communication acts such as empathic reflection, problem exploration, motivational interviewing techniques, resource activation, and emotional support.

Messages are prefixed with the speaker role (Counselor: / Client: in English, Berater: / Klient: in German) to allow the model to resolve the role context. At inference time, logits for the other speaker's categories are masked so predictions always fall within the correct role-specific category set.
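The prefixing-and-masking scheme described above can be sketched as follows. This is an illustrative example, not the model's actual API: the `ROLE_PREFIX` mapping and the helper names are ours, and in practice the role of each label id would be derived from `model.config.id2label`.

```python
import torch

# Illustrative sketch of the role handling described above.
# ROLE_PREFIX and the allowed-id sets are assumptions for this example.
ROLE_PREFIX = {"counselor": "Counselor: ", "client": "Client: "}

def prefix_message(role: str, text: str) -> str:
    """Prepend the speaker role so the model can resolve the role context."""
    return ROLE_PREFIX[role] + text

def mask_other_role(logits: torch.Tensor, allowed_ids: list) -> torch.Tensor:
    """Set logits of the other role's categories to -inf, so that after
    softmax all probability mass falls on the correct role's categories."""
    masked = torch.full_like(logits, float("-inf"))
    masked[..., allowed_ids] = logits[..., allowed_ids]
    return masked
```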

The model was developed as part of the OnCoCo project at Technische Hochschule Nürnberg.
It is the best-performing model we trained on this dataset.

Evaluation Results

Evaluated on a held-out 20% test split of the OnCoCo 1.0 dataset (bilingual DE+EN):

Metric          Score
Top-1 Accuracy  0.79
Top-1 Macro F1  0.72
Top-2 Accuracy  0.88
Top-2 Macro F1  0.83
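
The top-k accuracies above can be computed from the per-message probability distributions; a minimal sketch (the function name is ours, not part of the evaluation code):

```python
import numpy as np

def top_k_accuracy(probs: np.ndarray, labels: np.ndarray, k: int) -> float:
    """Fraction of messages whose true category is among the k most
    probable predictions. probs: (n_messages, n_classes); labels: (n_messages,)."""
    top_k = np.argsort(probs, axis=-1)[:, -k:]
    return float((top_k == labels[:, None]).any(axis=1).mean())
```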

Training Details

  • Base model: FacebookAI/xlm-roberta-large
  • Dataset: th-nuernberg/OnCoCoV1 — 5,556 messages (2,778 DE + 2,778 EN translations), 66 categories
  • Split: 80/20 stratified train/test
  • Languages: German (original) and English (GPT-4o translated, manually verified)
  • Role prefixes: Messages are prefixed with Counselor: / Client: (EN) or Berater: / Klient: (DE)
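
A stratified 80/20 split of the kind listed above can be sketched with scikit-learn; the random seed here is an assumption for reproducibility, not the one used for OnCoCo:

```python
from sklearn.model_selection import train_test_split

def stratified_split(texts, labels, test_size=0.2, seed=42):
    """80/20 split that preserves the per-category label distribution.
    The seed is an illustrative assumption."""
    return train_test_split(
        texts, labels, test_size=test_size, stratify=labels, random_state=seed
    )
```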

Usage

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "th-nuernberg/xlm-roberta-large-online-counseling-oncoco"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

text = "Counselor: It sounds like you're feeling overwhelmed. Can you tell me more?"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    probs = F.softmax(model(**inputs).logits, dim=-1).squeeze()

# Print the three most likely categories with their probabilities
top3 = probs.argsort(descending=True)[:3]
for i in top3:
    print(f"{model.config.id2label[i.item()]}: {probs[i].item():.4f}")

To resolve category codes to human-readable descriptions:

import json
from huggingface_hub import hf_hub_download

path = hf_hub_download("th-nuernberg/OnCoCoV1", "code_to_category.json", repo_type="dataset")
with open(path) as f:
    code2cat = json.load(f)

for i in top3:
    code = model.config.id2label[i.item()]
    print(f"{code} ({code2cat.get(code, '?')}): {probs[i].item():.4f}")

Category Taxonomy

The 66 categories are organized hierarchically for both speaker roles:

Counselor (38 categories)

  • Formalities (opening, closing)
  • Moderation
  • Impact factors: analysis & clarification of problems (13), objectives (2), motivation (4), resource activation (5), problem solving (8)
  • Other statements

Client (28 categories)

  • Formalities (opening, closing)
  • Empathy expression (3)
  • Impact factors: problem analysis (8), objectives (2), motivation (2), resource activation (2), coping assistance (6)
  • Other statements

Full label descriptions are available via the code_to_category.json file in the dataset repository.

Intended Use

  • Automated content analysis of online counseling conversations
  • Research on counselor–client communication patterns
  • Educational feedback tools for counselor training
  • Conversational AI research in the mental health domain

Limitations

  • Performance varies across categories; rare categories with few training examples show lower F1 scores
  • Some semantically overlapping categories (e.g., problem statement vs. problem definition) are harder to distinguish
  • English texts are machine-translated from German; some translation artifacts may affect performance on native English counseling texts
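
To see which categories drag down the macro F1, per-class scores can be inspected. A small sketch using scikit-learn (the helper name is ours):

```python
import numpy as np
from sklearn.metrics import f1_score

def weakest_categories(y_true, y_pred, labels, k=3):
    """Return the k categories with the lowest per-class F1,
    e.g. to spot rare categories with few training examples."""
    scores = f1_score(y_true, y_pred, labels=labels, average=None, zero_division=0)
    order = np.argsort(scores)[:k]
    return [(labels[i], float(scores[i])) for i in order]
```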

Citation

If you use this model, please cite the OnCoCo paper:

@inproceedings{albrecht-etal-2026-oncoco,
    title = "{O}n{C}o{C}o 1.0: A Public Dataset for Fine-Grained Message Classification in Online Counseling Conversations",
    author = "Albrecht, Jens and Lehmann, Robert and Poltermann, Aleksandra and Rudolph, Eric and Steigerwald, Philipp and Stieler, Mara",
    booktitle = "Proceedings of the Joint Workshop on Social Context (SoCon) and Integrating NLP and Psychology to Study Social Interactions (NLPSI) at LREC-COLING 2026",
    month = may,
    year = "2026",
    address = "Palma de Mallorca, Spain",
    publisher = "ELRA and ICCL",
}

ArXiv preprint: arXiv:2512.09804

License

CC BY-SA 4.0 — Technische Hochschule Nürnberg
