Mental health NLP research · NCI H9DAI

Multi-Model Framework for Depression Classification and Suicide Risk Detection from Social Media Text

A parallel ensemble of 12 classifiers across 3 clinical datasets — extending Tumaliuan et al. (2024) with modern transformers and SMOTE balancing.

RQ1
Which machine learning model provides the highest Accuracy for identifying depression and suicide risk?
RQ2
Does training on the full dataset (232K), a half split (116K), or a sample (50K) provide a significant boost in Accuracy?
3
Datasets
12
Models trained
98.1%
D3 Accuracy (binary)
+11.4%
vs Base Paper ↑

Extending prior work
Our work vs Tumaliuan et al. (2024)

Dataset 1 is structurally equivalent to the base paper's Filipino Twitter corpus (same 6-class task, same clinical annotation method), making a direct accuracy comparison valid.

Tumaliuan et al. — 2024
Filipino Twitter Depression
Frontiers in Computer Science · word2vec pipeline
Used word2vec (2013) — static embeddings, no negation handling
SVM never tested — absent from evaluation despite being NLP gold standard
XGBoost never tested — gradient boosting entirely absent
Class imbalance listed as limitation — never resolved
Restricted dataset — requires author permission to access
Accuracy not verified — no reproducible baseline reported
Best Accuracy: ~81%
+11.4%
accuracy gain (D1)
MindScan — 2026
English Twitter + Reddit
Zenodo (Nusrat 2024) · XLM-RoBERTa + SVM + XGBoost
XLM-RoBERTa (2019) — contextual embeddings, understands negation
SVM added — best D1 accuracy 92.36%, beats transformer (90.52%)
XGBoost added — accuracy 91.76%, gradient boosting for imbalanced data
SMOTE applied — imbalance resolved, all 6 classes equalised to 2,997
Public dataset — fully reproducible, anyone can verify results
Accuracy verified on held-out 20% test set, same 6-class task
Best Accuracy (D1 SVM): 92.4%

Methodology
Three-step pipeline

CRISP-DM applied across all three datasets — from raw social media text to parallel ensemble predictions.

01
Data
3 clinical datasets spanning Twitter and Reddit, covering depression types, binary detection, and suicide risk.
02
Preprocessing
6-stage text cleaning pipeline + SMOTE oversampling to address class imbalance left unresolved by the base paper.
03
Modelling
Parallel ensemble of 12 classifiers — all run independently on every prediction, never as a sequential cascade.
Dataset Overview
D1 — Depression Types (Zenodo 14233292)
14,983 tweets · 6 classes — Postpartum (3,746), Major Depressive (2,517), Bipolar (2,443), Psychotic (2,312), No Depression (1,985), Atypical (1,980). Psychiatrist-verified labels. Class imbalance ratio: 1.89×.
D2 — Binary Depression (Kaggle: albertobellardini)
10,314 tweets · 2 classes — Not Depressed (8,000) / Depressed (2,314). Severe class imbalance: 3.46×. Twitter short-form text. SMOTE applied to training set (8,251 → 12,800 samples). Trained on Twitter affect patterns — may underdetect atypical presentations.
D3 — Suicide Risk (Kaggle: nikhileswarkomati)
232,074 Reddit posts · 2 classes: Suicide / Non-Suicide (perfectly balanced, 116,037 each). Suicide posts average 200.8 words, non-suicide posts 63 words. We train on a 50K sample and compare it against the full and half splits to answer RQ2.
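The class counts above can be cross-checked in a few lines. This is a minimal sketch: `imbalance_ratio` is a hypothetical helper name, and the ratio is assumed to mean largest class divided by smallest class, which reproduces the reported 1.89× and 3.46×.

```python
# Class counts as reported in the dataset overview.
D1_COUNTS = {
    "Postpartum": 3746,
    "Major Depressive": 2517,
    "Bipolar": 2443,
    "Psychotic": 2312,
    "No Depression": 1985,
    "Atypical": 1980,
}
D2_COUNTS = {"Not Depressed": 8000, "Depressed": 2314}


def imbalance_ratio(counts):
    """Largest class size divided by smallest class size (assumed definition)."""
    return max(counts.values()) / min(counts.values())


d1_total = sum(D1_COUNTS.values())
d2_total = sum(D2_COUNTS.values())
d1_ratio = round(imbalance_ratio(D1_COUNTS), 2)
d2_ratio = round(imbalance_ratio(D2_COUNTS), 2)
print(d1_total, d1_ratio)
print(d2_total, d2_ratio)
```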
Business Context
A clinically-motivated framework for social media monitoring — applicable to platform-level moderation, mental health triage, and early intervention systems. Complements rather than replaces clinical assessment.
Preprocessing Pipeline
6-Stage Text Cleaning
1. Lowercase · 2. Strip URLs & http links · 3. Remove @mentions · 4. Remove # symbols · 5. Strip punctuation · 6. Collapse whitespace. Applied identically across all three datasets for consistency.
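The six stages can be sketched as a single function. This is an illustration under assumptions, not the project's code: `clean_text` and the regular expressions are hypothetical, and "punctuation" is taken to mean ASCII punctuation only.

```python
import re
import string

# Assumed patterns; the source does not give the exact expressions used.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
MENTION_RE = re.compile(r"@\w+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text: str) -> str:
    text = text.lower()                  # 1. lowercase
    text = URL_RE.sub(" ", text)         # 2. strip URLs and http links
    text = MENTION_RE.sub(" ", text)     # 3. remove @mentions
    text = text.replace("#", "")         # 4. remove the # symbol, keep the tag word
    text = text.translate(PUNCT_TABLE)   # 5. strip ASCII punctuation
    return " ".join(text.split())        # 6. collapse whitespace


sample = "Feeling SO low today... @friend1 check https://t.co/abc #Depression!!"
cleaned = clean_text(sample)
print(cleaned)
```

Note that stage 5 also removes apostrophes, so "don't" becomes "dont". Whether negation contractions should be preserved is a design choice this sketch does not settle.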
SMOTE — Synthetic Oversampling
Applied to D1 and D2 training sets only (D3 is pre-balanced). D1: 11,986 → 17,982 samples. D2: 8,251 → 12,800 samples. Creates synthetic clinical neighbours in TF-IDF feature space. Directly addresses the base paper's (Tumaliuan 2024) biggest limitation — they trained on raw imbalanced data.
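The reported sizes follow from equalising every class to the largest one: 6 × 2,997 = 17,982 for D1 and 2 × 6,400 = 12,800 for D2. The interpolation step itself can be sketched in pure Python. This is a simplified illustration of the SMOTE idea, not the implementation used in the project (which is not stated, though a library such as imbalanced-learn would be a typical choice). The `smote_like` name, `k=2`, and the toy vectors are all assumptions.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (simplified SMOTE)."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + lam * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy 2-D stand-ins for TF-IDF vectors of one minority class (made-up values).
minority = [[0.0, 1.0], [0.2, 0.8], [0.4, 0.9], [0.1, 0.6]]
new_points = smote_like(minority, n_new=5)
print(len(new_points))
```

Each synthetic point lies on the line segment between two real minority samples, which is why the method is applied to the training split only.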
Feature Extraction — TF-IDF
TF-IDF vectoriser with unigrams + bigrams, fitted per-dataset on training data only. Captures frequency-weighted term co-occurrence patterns, well-suited for short Twitter text.
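The unigram + bigram features described above can be illustrated without any library. This is a minimal sketch, not the project's vectoriser: `ngrams` and `sublinear_tf` are hypothetical helper names, the sentence is made up, and IDF weighting is left out. The scaling shown (1 + ln tf) is what scikit-learn's `TfidfVectorizer` applies when `sublinear_tf=True`, the setting listed in the FAQ.

```python
import math
from collections import Counter


def ngrams(tokens, n_min=1, n_max=2):
    """All contiguous n-grams for n in [n_min, n_max], joined with spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def sublinear_tf(tokens):
    """Term frequencies scaled as 1 + ln(tf), so repeats count for less."""
    counts = Counter(ngrams(tokens))
    return {term: 1 + math.log(tf) for term, tf in counts.items()}


tokens = "i am not happy not happy at all".split()
weights = sublinear_tf(tokens)
print(sorted(weights))
```

The bigram "not happy" becomes its own feature, so a linear model can weight it separately from "happy" alone.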
Feature Extraction — Tokeniser
XLM-RoBERTa tokeniser (max 128 tokens D1/D2, 256 tokens D3) with padding. Pre-trained multilingual contextual embeddings capture semantic meaning and long-range dependencies — critical for Reddit's longer posts.
Ensemble Strategy & Architecture
4 Models per Dataset (12 total)
Logistic Regression — L2 regularised, max_iter=1000. SVM — LinearSVC, C=1.0. XGBoost — 300 estimators, max_depth=6. XLM-RoBERTa — fine-tuned multilingual transformer, 278M parameters, lr=2e-5, 3 epochs.
Ensemble Vote — Risk Flag Logic
All 12 models run simultaneously on every input. A sequential design (check depression first, then suicide risk) would miss masked suicidality — a clinically documented pre-crisis pattern where affect appears normal but intent is resolved. Parallelism is a safety requirement, not a design preference.
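A sketch of the parallel design, limited to the four D3 models for brevity. Everything here is hypothetical: the function names, the red/amber/green labels, and the keyword-rule stand-ins used in place of trained classifiers. Treating any disagreement between the classical majority and XLM-RoBERTa as amber is also an assumption; the FAQ only describes the case where the classical models flag risk and the transformer does not.

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel(models, text):
    """Call every model on the same text independently; no model's output
    gates another model's execution."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = {name: pool.submit(fn, text) for name, fn in models.items()}
        return {name: fut.result() for name, fut in futures.items()}


def d3_risk_flag(preds):
    """Combine D3 votes into red / amber / green (assumed labels)."""
    classical = [preds["lr"], preds["svm"], preds["xgb"]]
    classical_flag = sum(classical) >= 2   # majority of the 3 classical models
    transformer_flag = preds["xlmr"]
    if classical_flag and transformer_flag:
        return "red"
    if classical_flag or transformer_flag:
        return "amber"                     # disagreement: escalate for human review
    return "green"


# Stand-in models: keyword rules, NOT the trained classifiers.
stub_models = {
    "lr": lambda t: "dying" in t,
    "svm": lambda t: "dying" in t,
    "xgb": lambda t: False,
    "xlmr": lambda t: "dying" in t and "embarrassment" not in t,
}
preds = run_parallel(stub_models, "i'm dying of embarrassment")
flag = d3_risk_flag(preds)
print(preds, flag)
```

Because every model is called on the raw input, a "not depressed" result from one task can never suppress the suicide-risk check from another.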
XGBoost Algorithm Collapse
XGBoost accuracy on D3: 91.6% (50K sample) → 70.5% (Full 232K) → 60.1% (H1 116K). Accuracy falls sharply once the training set grows beyond the 50K sample, although not monotonically: the 116K half scores below the full 232K set. The H1/H2 results are also inconsistent (60.1% vs roughly 71%), which suggests gradient boosting is highly sensitive to the particular data split at this scale and makes it unreliable for large Reddit corpora.
D3 Split Study (RQ2)
D3 trained on 4 configurations: Full (232K), Half 1 (116K), Half 2 (116K), Sample (50K). XLM-RoBERTa accuracy: 98.1% (50K) → 97.8% (H1) → 98.0% (H2/Full). Δ = 0.3 percentage points across up to 4.6× more data. Kolmogorov-Smirnov tests find no significant difference between the split distributions (all p > 0.49), which supports comparing the splits directly.
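For readers unfamiliar with the test, the two-sample Kolmogorov-Smirnov statistic is the largest gap between two empirical cumulative distribution functions. The sketch below computes only that statistic. The source does not say which feature was tested, so post word counts are assumed purely for illustration, and the numbers are made up. In practice `scipy.stats.ks_2samp` returns both the statistic and a p-value.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two
    empirical cumulative distribution functions."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # share of sample A that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # share of sample B that is <= x
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Hypothetical word counts from two halves of a corpus.
h1_word_counts = [12, 45, 63, 80, 150, 201]
h2_word_counts = [10, 47, 60, 85, 148, 210]
d = ks_statistic(h1_word_counts, h2_word_counts)
print(round(d, 4))
```

A small statistic with a large p-value means the test found no evidence of a difference. It does not prove the two distributions are identical.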

Core evaluation
Accuracy Evidence Matrix

All 4 models evaluated across all dataset splits. The highest accuracy in each row is the winner. XGBoost collapses on the larger D3 training sets (Full, Half 1, Half 2).

Dataset / Split          Logistic Regression   SVM     XGBoost   XLM-RoBERTa
D1 Depression Types      91.5%                 92.4%   91.8%     90.5%
D2 Binary Depression     98.9%                 97.1%   99.3%     99.9%
D3 Full (232K)           94.3%                 94.6%   70.5%     98.0%
D3 Half 1 (116K)         93.8%                 94.2%   60.1%     97.8%
D3 Half 2 (116K)         93.7%                 94.2%   ~71%      98.0%
D3 Sample (50K) ★        93.2%                 93.7%   91.6%     98.1%

Note: The full performance evaluation, including Macro F1-Score, Cohen's Kappa, and per-class metrics, is documented in the Final IEEE Report. Accuracy is shown here as the primary comparative metric for cross-dataset validation.
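For reference, the three metrics named in the note can be computed from label lists as follows. This is a minimal sketch using the standard definitions, with made-up labels. The function names are hypothetical, and scikit-learn's `accuracy_score`, `f1_score(average="macro")`, and `cohen_kappa_score` are the usual library equivalents.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def cohen_kappa(y_true, y_pred):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Toy binary example (made-up labels).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
kappa = cohen_kappa(y_true, y_pred)
print(acc, f1, kappa)
```

Macro F1 weights every class equally regardless of size, which is why it is the more informative metric when the test set is imbalanced.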


Key findings
What the results show

Five insights that directly answer the research questions.

01
SVM is the best model for short-form text
On 6-class depression type classification (D1), SVM achieves the highest Accuracy of 92.4%. Tweets average 31 words — too short for transformer contextual embeddings to gain advantage over TF-IDF bigrams.
D1 Accuracy: SVM 92.4%
02
XLM-RoBERTa is the best model for long-form text
On Reddit suicide risk posts (D3), XLM-RoBERTa achieves 98.1% Accuracy with the 50K sample. Suicide posts average 200.8 words, which gives transformer embeddings enough context to dominate every competitor. D1 (Twitter, ~31 words) tells the opposite story.
D3 Accuracy: XLM-RoBERTa 98.1%
03
Increasing data size provided no significant gain
Scaling from 50K to 232K samples produced only a 0.1% change in XLM-RoBERTa Accuracy (98.1% → 98.0%). Adding 182,000 more training examples gave no meaningful improvement, validating the 50K sample.
50K → 232K: Δ Accuracy = 0.1%
04
Social media affect ≠ clinical presentation
D2 was trained on Twitter-style emotional language (explicit distress, slang). Clinical presentations — anhedonia ("nothing feels enjoyable"), fatigue, flat affect — use a different lexicon and are systematically under-flagged. This is the documented Affective vs. Clinical Lexicon Gap: models trained on social media affect fail to recognise diagnostic-criteria language.
D2 limitation — documented failure mode
05
Parallel architecture is the safety net
When D2 misses a clinical presentation, D1 and D3 can still catch it. When classical D3 models over-flag depressive vocabulary, XLM-RoBERTa's contextual understanding overrides them. No single model is sufficient — the parallel ensemble exists precisely because each model's failure mode is different and partially compensated by the others.
Multi-task learning precedent — Zogan et al. 2024

Conclusions
Research verdict

Direct answers to both research questions, and the key limitations of the study.

RQ1 — Best model
No single model wins across all tasks
SVM (92.4%) wins on short-form Twitter text (D1) where TF-IDF bigrams capture enough signal. XLM-RoBERTa wins on binary Twitter depression detection (D2: 99.9%) and on long-form Reddit posts (D3: 98.1%), where contextual embeddings dominate. Model selection must be text-length aware.
SVM for short text · XLM-RoBERTa for long text
RQ2 — Dataset size
More data gave no meaningful gain
Scaling from 50K to 232K training samples produced only a 0.1% change in XLM-RoBERTa Accuracy (98.1% → 98.0%). For this task and model, the 50K sample appears to capture the full signal: roughly 4.6× more data brought no meaningful benefit.
50K sample is sufficient · Δ = 0.1%
Limitation 1 — Affective vs. Clinical Lexicon Gap
Social media affect ≠ clinical diagnostic criteria
D2 was trained on Twitter explicit emotional language. Clinical presentations using diagnostic vocabulary — anhedonia ("nothing feels enjoyable"), psychomotor fatigue, flat affect — do not match that training distribution and are systematically under-flagged. This is empirical evidence of the domain gap between self-reported social media affect and clinical language, not a model defect.
Documented domain gap — Finding 04
Limitation 2 — Classical model lexical overfitting
TF-IDF ignores word order and context
Classical D3 models (LR, SVM, XGBoost) use TF-IDF bag-of-words features. Vocabulary overlapping with r/SuicideWatch posts (e.g. "exhausted", "nothing feels enjoyable") triggers false-positive suicide flags — the model sees matching tokens without understanding the sentence context. XLM-RoBERTa's contextual embeddings override these false positives, demonstrating why the transformer is the reliable D3 winner.
TF-IDF lexical overfitting — defer to XLM-RoBERTa

Live inference
Try it — winner model per task

Sample 3 demonstrates masked suicidality. Try typing clinical-style depressive language ("I feel exhausted, nothing feels enjoyable") to observe the Affective vs. Clinical Lexicon Gap documented in Finding 04.

How the demo works: Flask → HuggingFace proxy · predict_all() inference flow

Research prototype only. Not a clinical tool. If you or someone you know is in crisis, please contact a mental health professional or emergency services immediately.
Analysis results
D3 — Immediate Risk · XLM-RoBERTa
98.1% Accuracy on D3
D2 — Depressed? · XLM-RoBERTa
99.9% Accuracy on D2
D1 — Depression type · SVM
92.4% Accuracy on D1

Defence prep
Frequently asked questions

Grouped by topic for quick navigation during Q&A.

Data & Datasets
D1: 6-class depression type classification (atypical, bipolar, major depressive, no depression, postpartum, psychotic) from Zenodo. Twitter-length text, 14,983 samples (11,986 in the training split). D2: binary depressed/not-depressed from Twitter (10,314 samples, severe 3.46× imbalance). D3: binary suicide/non-suicide from Reddit (232K samples, perfectly balanced 116,037 each; we use a 50K sample of 25K per class). Each dataset has a different task, different text length, and different vocabulary domain, which is precisely why running all three in parallel is informative.
D1 had 1.89× imbalance (atypical class), D2 had 3.46× imbalance. We applied SMOTE to training data only — never the test set. SMOTE interpolates new synthetic samples in TF-IDF feature space between existing minority-class examples. Class weighting was also evaluated; SMOTE showed equal or better Macro F1 in cross-validation. D3 was pre-balanced and required no oversampling.
No. The train/test split (stratified 80/20) is performed first. SMOTE is then applied only to the training portion. The TF-IDF vocabulary is fitted on training data only and applied as a read-only transform to the test set. XLM-RoBERTa uses a fixed pretrained tokeniser. No test sample was ever used to inform any training decision.
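The ordering described above (split first, then fit and oversample on the training portion only) can be sketched with toy data. `stratified_split` is a hypothetical helper written for illustration. The project's actual splitting code is not shown in the source, and the seed, texts, and labels here are made up.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_frac=0.2, seed=42):
    """Return (train_idx, test_idx) with roughly test_frac of each class held out."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_frac)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)


texts = [f"post {i}" for i in range(100)]
labels = [0] * 80 + [1] * 20  # toy 4:1 imbalance

# Step 1: split first, preserving the class ratio in both portions.
train_idx, test_idx = stratified_split(labels)

# Step 2: fit the vocabulary on training text only.
vocab = {tok for i in train_idx for tok in texts[i].split()}

# Step 3: oversampling would touch only the rows in train_idx, never test_idx.
print(len(train_idx), len(test_idx), len(vocab))
```

Tokens that appear only in held-out posts never enter `vocab`, which is the property that keeps test information out of training.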
Methodology & Models
Each captures a different inductive bias: Logistic Regression (linear decision boundary), SVM (maximum-margin), XGBoost (non-linear tree ensemble), XLM-RoBERTa (contextual transformer). Disagreement between models is itself a signal. On D1, SVM (92.4%) beats XLM-RoBERTa (90.5%): short tweets don't give the transformer enough context to gain an advantage. On D3 (Reddit posts, with suicide posts averaging 200.8 words), XLM-RoBERTa (98.1%) dominates every classical model.
max_features=50,000: covers the full relevant vocabulary without noise. ngram_range=(1,2): unigrams + bigrams capture local phrases ("not happy", "kill myself") that unigrams miss. sublinear_tf=True: replaces tf with 1 + log(tf) to dampen high-frequency word dominance. min_df=2: drops terms that appear in only one document, which mostly add noise.
Standard sequence classification fine-tuning: Adam optimiser, lr=2e-5, 3 epochs, linear warmup scheduler. Max token length: 128 for D1/D2 (Twitter-length text), 256 for D3 (Reddit posts average 200.8 words). Cross-entropy loss. Best checkpoint saved by validation accuracy. 278M parameters — multilingual pretraining covers 100 languages.
On the 50K sample, XGBoost achieves 91.6%, which is competitive. At full scale (232K), it collapses to 70.52% (Macro F1: 0.6998). We attribute this to TF-IDF lexical overfitting: vocabulary overlap between "suicide" and "non-suicide" Reddit posts increases with scale, and words like "exhausted", "hopeless", "nothing matters" appear in both classes. Boosted trees memorise these shared token patterns instead of learning discriminative boundaries. H1 (116K) drops further to 60.1%, and H1 vs H2 are inconsistent (60.1% vs roughly 71%), indicating that XGBoost is unstable at this data scale. XLM-RoBERTa stays between 97.8% and 98.1% across all splits.
Results & Evaluation
Text length. D1 tweets average ~31 words. Transformers need rich context to outperform classical methods — contextual embeddings add little value on ~40-token inputs. TF-IDF bigrams on short explicit text (like tweets) already capture the full signal. This is Finding 01 and one of the key research conclusions: model selection must be text-length aware.
The dashboard shows accuracy for accessibility (non-specialist audience). After SMOTE, all training classes are equalised — so accuracy and Macro F1 are closely aligned. The full Macro F1, Cohen's Kappa, and per-class precision/recall are reported in the IEEE technical report. The evidence matrix footnote notes this explicitly.
No. XLM-RoBERTa: 98.1% (50K, NB2) · 98.02% (Full 232K) · 97.78% (H1) · 98.02% (H2). Maximum delta = 0.32 percentage points. KS tests across the three split-study configurations (Full, H1, H2) find no significant distribution differences: suicide class p=0.4967 (H1 vs H2), p=0.9758 (Full vs H1); non-suicide class p=0.8125 (H1 vs H2), p=0.9992 (Full vs H1). All are well above the p=0.05 threshold, so there is no evidence that distribution shift is driving the results. This is Finding 03.
Architecture & Live Demo
Real models. The Flask app proxies every request to a HuggingFace Space (esvanth-mindscan.hf.space) which runs predict.py with all 12 loaded models. There is no hardcoded data — every input goes through the full pipeline. If the Space is sleeping it auto-wakes within ~60 seconds.
It means classical D3 models (LR/SVM/XGBoost) flagged suicide risk by majority vote, but XLM-RoBERTa — the best model at 98.1% accuracy — disagrees. A pure majority vote could trigger false alarms on metaphorical language ("I'm dying of embarrassment"). The amber state expresses uncertainty rather than forcing a binary decision, which maps directly to "escalate for human review" — the appropriate clinical-conservative response.
This is the Affective vs. Clinical Lexicon Gap (Finding 04, documented in NAACL 2024). D2 was trained on Twitter emotional language — explicit distress, slang, emotional punctuation. Clinical presentations use diagnostic vocabulary: anhedonia ("nothing feels enjoyable"), psychomotor fatigue, flat affect. These words are absent from D2's training distribution. This is not a bug — it is an empirical finding about the domain gap between social media affect and clinical language.
Replace TF-IDF classical models with MentalBERT/MentalRoBERTa (Ji et al. 2022) pretrained on mental health forum data. Combine all three tasks in a true multi-task learning setup with a shared encoder and task-specific heads — following the MTL precedent from Zogan et al. (2024). This would address both documented limitations (Affective Lexicon Gap and TF-IDF overfitting) simultaneously.