MindScan — Multi-Model Framework for Depression & Suicide Risk Detection

Extending prior work

Our work vs Tumaliuan et al. (2024)

Dataset 1 is structurally equivalent to the base paper's Filipino Twitter corpus (same 6-class task, same clinical annotation method), which makes a direct comparison valid. The figures below compare accuracy.

Tumaliuan et al. — 2024

Filipino Twitter Depression

Frontiers in Computer Science · word2vec pipeline

Used word2vec (2013) — static embeddings, no negation handling

SVM never tested: absent from evaluation despite being a standard strong baseline for text classification

XGBoost never tested — gradient boosting entirely absent

Class imbalance listed as limitation — never resolved

Restricted dataset — requires author permission to access

Accuracy not verified — no reproducible baseline reported

Best Accuracy ~81%

→

+11.4%

accuracy gain in percentage points (D1)

MindScan — 2026

English Twitter + Reddit

Zenodo (Nusrat 2024) · XLM-RoBERTa + SVM + XGBoost

✓

XLM-RoBERTa (2019) — contextual embeddings, understands negation

✓

SVM added — best D1 accuracy 92.36%, beats transformer (90.52%)

✓

XGBoost added — accuracy 91.76%, gradient boosting for imbalanced data

✓

SMOTE applied: training-set imbalance resolved, all 6 classes equalised to 2,997 samples

✓

Public dataset — fully reproducible, anyone can verify results

✓

Accuracy verified on held-out 20% test set, same 6-class task

Best Accuracy (D1 SVM) 92.4%

Methodology

Three-step pipeline

CRISP-DM applied across all three datasets — from raw social media text to parallel ensemble predictions.

Data

3 clinical datasets spanning Twitter and Reddit, covering depression types, binary detection, and suicide risk.

Preprocessing

6-stage text cleaning pipeline + SMOTE oversampling to address class imbalance left unresolved by the base paper.

Modelling

Parallel ensemble of 12 classifiers — all run independently on every prediction, never as a sequential cascade.

Dataset Overview

D1 — Depression Types (Zenodo 14233292)

14,983 tweets · 6 classes — Postpartum (3,746), Major Depressive (2,517), Bipolar (2,443), Psychotic (2,312), No Depression (1,985), Atypical (1,980). Psychiatrist-verified labels. Class imbalance ratio: 1.89×.

D2 — Binary Depression (Kaggle: albertobellardini)

10,314 tweets · 2 classes — Not Depressed (8,000) / Depressed (2,314). Severe class imbalance: 3.46×. Twitter short-form text. SMOTE applied to training set (8,251 → 12,800 samples). Trained on Twitter affect patterns — may underdetect atypical presentations.

D3 — Suicide Risk (Kaggle: nikhileswarkomati)

232,074 Reddit posts · 2 classes — Suicide / Non-Suicide (perfectly balanced, 116,037 each). Suicide posts average 200.8 words (mean), non-suicide posts 63 words. We sample 50K posts and compare against full/half splits to answer RQ2.

Business Context

A clinically-motivated framework for social media monitoring — applicable to platform-level moderation, mental health triage, and early intervention systems. Complements rather than replaces clinical assessment.

Preprocessing Pipeline

6-Stage Text Cleaning

1. Lowercase · 2. Strip URLs & http links · 3. Remove @mentions · 4. Remove # symbols · 5. Strip punctuation · 6. Collapse whitespace. Applied identically across all three datasets for consistency.
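The six stages above can be sketched as a single function. This is a minimal illustration, not the project's actual code: the name `clean_text` and the exact regular expressions are assumptions.

```python
import re
import string

# Hypothetical patterns; the source does not show the real expressions.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text: str) -> str:
    """Apply the six cleaning stages in the order listed above."""
    text = text.lower()                  # 1. lowercase
    text = URL_RE.sub(" ", text)         # 2. strip URLs and http links
    text = MENTION_RE.sub(" ", text)     # 3. remove @mentions
    text = text.replace("#", "")         # 4. remove '#' but keep the tag word
    text = text.translate(PUNCT_TABLE)   # 5. strip ASCII punctuation
    return " ".join(text.split())        # 6. collapse whitespace


example = clean_text("Feeling SO tired... @friend check https://t.co/abc #NoSleep!!")
```

Order matters in this sketch: URLs and mentions are removed before punctuation, because stripping punctuation first would break the patterns that identify them. Stage 5 also removes apostrophes, so "can't" becomes "cant".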

SMOTE — Synthetic Oversampling

Applied to D1 and D2 training sets only (D3 is pre-balanced). D1: 11,986 → 17,982 samples. D2: 8,251 → 12,800 samples. SMOTE creates synthetic minority-class samples by interpolating between neighbouring examples in TF-IDF feature space. This directly addresses the biggest limitation of the base paper (Tumaliuan 2024), which trained on raw imbalanced data.
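The imbalance ratios and post-SMOTE sizes can be checked with simple arithmetic. The sketch below assumes SMOTE raises every class to the size of the largest training class. The D1 figure of 2,997 per class is stated in the comparison above; the D2 figure of 6,400 per class is inferred here from 12,800 / 2 and is not stated in the source.

```python
def imbalance_ratio(class_counts):
    """Largest class size divided by smallest class size."""
    return max(class_counts) / min(class_counts)


def smote_balanced_size(majority_train_count, n_classes):
    """Training-set size once every class matches the majority class."""
    return majority_train_count * n_classes


# Full-dataset class counts as listed in the Dataset Overview.
d1_counts = [3746, 2517, 2443, 2312, 1985, 1980]
d2_counts = [8000, 2314]

d1_ratio = round(imbalance_ratio(d1_counts), 2)
d2_ratio = round(imbalance_ratio(d2_counts), 2)

d1_after = smote_balanced_size(2997, 6)   # 2,997 per class (stated)
d2_after = smote_balanced_size(6400, 2)   # 6,400 per class (inferred)

# Number of synthetic samples SMOTE would have to add to each training set.
d1_synthetic = d1_after - 11986
d2_synthetic = d2_after - 8251
```

Under these assumptions the results match the reported 1.89× and 3.46× ratios and the 17,982 and 12,800 training sizes.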

Feature Extraction — TF-IDF

TF-IDF vectoriser with unigrams + bigrams, fitted per-dataset on training data only. Captures frequency-weighted term co-occurrence patterns, well-suited for short Twitter text.

Feature Extraction — Tokeniser

XLM-RoBERTa tokeniser (max 128 tokens D1/D2, 256 tokens D3) with padding. Pre-trained multilingual contextual embeddings capture semantic meaning and long-range dependencies — critical for Reddit's longer posts.

Ensemble Strategy & Architecture

4 Models per Dataset (12 total)

Logistic Regression — L2 regularised, max_iter=1000. SVM — LinearSVC, C=1.0. XGBoost — 300 estimators, max_depth=6. XLM-RoBERTa — fine-tuned multilingual transformer, 278M parameters, lr=2e-5, 3 epochs.

Ensemble Vote — Risk Flag Logic

All 12 models run simultaneously on every input. A sequential design (check depression first, then suicide risk) would miss masked suicidality — a clinically documented pre-crisis pattern where affect appears normal but intent is resolved. Parallelism is a safety requirement, not a design preference.

XGBoost Algorithm Collapse

XGBoost accuracy on D3: 91.6% (50K sample), 60.1% (H1, 116K), 71.0% (H2, 116K), 70.5% (Full, 232K). Every configuration larger than the 50K sample performs far worse, although the drop is not monotonic in training size. The two equal-sized halves also disagree (60.1% vs 71.0%), which suggests gradient boosting on these features is highly sensitive to the particular training split at this scale and is unreliable for large Reddit corpora.

D3 Split Study (RQ2)

D3 was trained on 4 configurations: Full (232K), Half 1 (116K), Half 2 (116K), Sample (50K). XLM-RoBERTa accuracy: 98.1% (50K), 97.8% (H1), 98.0% (H2 and Full). The maximum difference is 0.3 percentage points across up to 4.6× more data. Kolmogorov-Smirnov tests find no significant difference between the split distributions (all p > 0.49), which supports the comparison.

Dataset / Split	Logistic Regression	SVM	XGBoost	XLM-RoBERTa
D1 Depression Types	91.5%	92.4%	91.8%	90.5%
D2 Binary Depression	98.9%	97.1%	99.3%	99.9%
D3 Full (232K)	94.3%	94.6%	70.5%	98.0%
D3 Half 1 (116K)	93.8%	94.2%	60.1%	97.8%
D3 Half 2 (116K)	93.7%	94.2%	71.0%	98.0%
D3 Sample (50K) ★	93.2%	93.7%	91.6%	98.1%

Key findings

What the results show

Five insights from the results, including direct answers to both research questions.

SVM is the best model for short-form text

On 6-class depression type classification (D1), SVM achieves the highest Accuracy of 92.4%. Tweets average 31 words — too short for transformer contextual embeddings to gain advantage over TF-IDF bigrams.

D1 Accuracy: SVM 92.4%

XLM-RoBERTa is the best model for long-form text

On Reddit suicide risk posts (D3), XLM-RoBERTa achieves 98.1% Accuracy with the 50K sample. Suicide posts average 200.8 words, rich enough context for transformer embeddings to beat every competitor. D1 (Twitter, ~31 words) tells the opposite story.

D3 Accuracy: XLM-RoBERTa 98.1%

Increasing data size provided no significant gain

Scaling from 50K to 232K samples changed XLM-RoBERTa Accuracy by only 0.1 percentage points (98.1% → 98.0%). Adding roughly 182,000 more training examples gave no meaningful improvement, which supports using the 50K sample.

50K → 232K: Δ Accuracy = 0.1%

Social media affect ≠ clinical presentation

D2 was trained on Twitter-style emotional language (explicit distress, slang). Clinical presentations — anhedonia ("nothing feels enjoyable"), fatigue, flat affect — use a different lexicon and are systematically under-flagged. This is the documented Affective vs. Clinical Lexicon Gap: models trained on social media affect fail to recognise diagnostic-criteria language.

D2 limitation — documented failure mode

Parallel architecture is the safety net

When D2 misses a clinical presentation, D1 and D3 can still catch it. When classical D3 models over-flag depressive vocabulary, XLM-RoBERTa's contextual understanding overrides them. No single model is sufficient — the parallel ensemble exists precisely because each model's failure mode is different and partially compensated by the others.

Multi-task learning precedent — Zogan et al. 2024

Conclusions

Research verdict

Direct answers to both research questions, and the key limitations of the study.

RQ1 — Best model

No single model wins across all tasks

SVM (92.4%) wins on the 6-class short-form Twitter task (D1), where TF-IDF bigrams capture enough signal. XLM-RoBERTa wins on long-form Reddit posts (D3: 98.1%), where contextual embeddings dominate, and also leads on the binary Twitter task (D2: 99.9%). Model selection must be text-length aware.

SVM for short text · XLM-RoBERTa for long text

RQ2 — Dataset size

More data gave no meaningful gain

Scaling from 50K to 232K training samples changed XLM-RoBERTa Accuracy by only 0.1 percentage points (98.1% → 98.0%). For this task and model, the 50K sample appears to capture the available signal: more than 4× as much data brought no meaningful benefit.

50K sample is sufficient · Δ = 0.1%

Limitation 1 — Affective vs. Clinical Lexicon Gap

Social media affect ≠ clinical diagnostic criteria

D2 was trained on Twitter explicit emotional language. Clinical presentations using diagnostic vocabulary — anhedonia ("nothing feels enjoyable"), psychomotor fatigue, flat affect — do not match that training distribution and are systematically under-flagged. This is empirical evidence of the domain gap between self-reported social media affect and clinical language, not a model defect.

Documented domain gap — Finding 04

Limitation 2 — Classical model lexical overfitting

TF-IDF ignores word order and context

Classical D3 models (LR, SVM, XGBoost) use TF-IDF bag-of-words features. Vocabulary overlapping with r/SuicideWatch posts (e.g. "exhausted", "nothing feels enjoyable") triggers false-positive suicide flags — the model sees matching tokens without understanding the sentence context. XLM-RoBERTa's contextual embeddings override these false positives, demonstrating why the transformer is the reliable D3 winner.

TF-IDF lexical overfitting — defer to XLM-RoBERTa

Defence prep

Frequently asked questions

Grouped by topic for quick navigation during Q&A.

Data & Datasets

D1: 6-class depression type classification (atypical, bipolar, major depressive, no depression, postpartum, psychotic) from Zenodo. Twitter-length text, 14,983 samples (11,986 in the training split). D2: binary depressed/not-depressed from Twitter (10,314 samples, severe 3.46× imbalance). D3: binary suicide/non-suicide from Reddit (232K samples, perfectly balanced at 116,037 each; we use a 50K sample of 25K per class). Each dataset has a different task, different text length, and different vocabulary domain, which is precisely why running all three in parallel is informative.

D1 had 1.89× imbalance (atypical class), D2 had 3.46× imbalance. We applied SMOTE to training data only — never the test set. SMOTE interpolates new synthetic samples in TF-IDF feature space between existing minority-class examples. Class weighting was also evaluated; SMOTE showed equal or better Macro F1 in cross-validation. D3 was pre-balanced and required no oversampling.
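The interpolation step described above can be illustrated in a few lines of plain Python. This is a toy sketch of the SMOTE idea on dense vectors, not the project's implementation: the function names, `k=2`, and the seed are illustrative assumptions, and a real pipeline would normally use a library implementation on the TF-IDF matrix.

```python
import math
import random


def nearest_neighbours(point, others, k):
    """Return the k vectors in `others` closest to `point` (Euclidean)."""
    return sorted(others, key=lambda other: math.dist(point, other))[:k]


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority samples by neighbour interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [m for m in minority if m is not base]
        neighbour = rng.choice(nearest_neighbours(base, others, k))
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Three toy minority-class vectors standing in for TF-IDF rows.
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote(minority_points, n_new=5)
```

Every synthetic point lies on a line segment between two real minority samples, so no new region of feature space is invented. This is also why the procedure must only ever see training rows.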

No test data leaks into training. The train/test split (stratified 80/20) is performed first. SMOTE is then applied only to the training portion. The TF-IDF vocabulary is fitted on training data only and applied as a read-only transform to the test set. XLM-RoBERTa uses a fixed pretrained tokeniser. No test sample was ever used to inform any training decision.
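The ordering described above (split first, then fit only on the training portion) can be sketched without any libraries. The helper names, the seed, and the whitespace tokenisation are assumptions made for illustration; they are not taken from the project code.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx), keeping class proportions in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def fit_vocabulary(train_texts):
    """Build the vocabulary from training texts only."""
    return {token for text in train_texts for token in text.split()}


def transform(text, vocabulary):
    """Read-only transform: tokens unseen in training are dropped."""
    return [token for token in text.split() if token in vocabulary]


labels = ["depressed"] * 5 + ["not_depressed"] * 10
train_idx, test_idx = stratified_split(labels)
vocab = fit_vocabulary(["feel tired", "feel sad"])
```

Any oversampling would then be applied to the rows in `train_idx` only, so the rows in `test_idx` never influence the vocabulary or the synthetic samples.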

Methodology & Models

Each model captures a different inductive bias: Logistic Regression (linear decision boundary), SVM (maximum-margin), XGBoost (non-linear tree ensemble), XLM-RoBERTa (contextual transformer). Disagreement between models is itself a signal. On D1, SVM (92.4%) beats XLM-RoBERTa (90.5%) because short tweets don't give the transformer enough context to gain an advantage. On D3 (200.8-word Reddit posts), XLM-RoBERTa (98.1%) beats every classical model.

TF-IDF settings: max_features=50,000 covers the full relevant vocabulary without noise. ngram_range=(1,2) adds bigrams to unigrams, capturing local phrases ("not happy", "kill myself") that unigrams miss. sublinear_tf=True replaces tf with 1 + log(tf) to dampen high-frequency word dominance. min_df=2 removes terms that appear in only one document, which add noise.
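To make these settings concrete, here is a small plain-Python sketch of what three of them do. It assumes scikit-learn's conventions, in which `sublinear_tf` replaces a raw count tf with 1 + log(tf) and `min_df` counts documents rather than occurrences. The helper names are hypothetical, and IDF weighting and normalisation are left out.

```python
import math
from collections import Counter


def ngrams(tokens, n_min=1, n_max=2):
    """All n-grams from n_min to n_max words, like ngram_range=(1, 2)."""
    out = []
    for n in range(n_min, n_max + 1):
        out.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out


def sublinear_tf(tokens):
    """Term weights with sublinear scaling: 1 + log(raw count)."""
    counts = Counter(ngrams(tokens))
    return {term: 1.0 + math.log(count) for term, count in counts.items()}


def filter_min_df(docs_terms, min_df=2):
    """Keep terms that occur in at least min_df different documents."""
    doc_freq = Counter(term for terms in docs_terms for term in set(terms))
    return {term for term, freq in doc_freq.items() if freq >= min_df}


features = ngrams("i am not happy".split())
weights = sublinear_tf(["sad", "sad", "sad"])
kept = filter_min_df([["not happy", "tired"], ["not happy"]])
```

The bigram "not happy" survives as its own feature, which is how a bag-of-words model can register a negation that the unigram "happy" alone would miss.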

Standard sequence classification fine-tuning: Adam optimiser, lr=2e-5, 3 epochs, linear warmup scheduler. Max token length: 128 for D1/D2 (Twitter-length text), 256 for D3 (Reddit posts average 200.8 words). Cross-entropy loss. Best checkpoint saved by validation accuracy. 278M parameters — multilingual pretraining covers 100 languages.
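The linear warmup scheduler mentioned above can be written as a small function. The source gives only the peak learning rate (2e-5). The step counts below are placeholder assumptions, and the linear decay to zero after warmup follows the common convention for this kind of schedule rather than anything stated in the source.

```python
def linear_warmup_lr(step, total_steps, warmup_steps, base_lr=2e-5):
    """Learning rate at `step`: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


# Hypothetical run: 100 optimiser steps with 10 warmup steps.
schedule = [linear_warmup_lr(s, total_steps=100, warmup_steps=10) for s in range(101)]
peak_lr = max(schedule)
```

The ramp keeps early updates small while the newly initialised classification head is still random, which is the usual motivation for warmup when fine-tuning a pretrained encoder.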

On the 50K sample, XGBoost achieves 91.6%, which is competitive. At full scale (232K), it collapses to 70.52% (Macro F1: 0.6998). We attribute this to TF-IDF lexical overfitting: words like "exhausted", "hopeless", and "nothing matters" appear in both "suicide" and "non-suicide" Reddit posts, and the larger splits contain more of this overlapping vocabulary. Boosted trees memorise these shared token patterns instead of learning discriminative boundaries. H1 (116K) drops further to 60.1%, and H1 vs H2 are inconsistent (60.1% vs 71.0%), confirming XGBoost is unstable at this data scale. XLM-RoBERTa stays between 97.8% and 98.1% across all splits.

Results & Evaluation

SVM beats XLM-RoBERTa on D1 because of text length. D1 tweets average ~31 words. Transformers need rich context to outperform classical methods, and contextual embeddings add little value on ~40-token inputs. On short, explicit text such as tweets, TF-IDF bigrams already capture most of the signal. This is Finding 01 and one of the key research conclusions: model selection must be text-length aware.

The dashboard shows accuracy for accessibility (non-specialist audience). After SMOTE, all training classes are equalised — so accuracy and Macro F1 are closely aligned. The full Macro F1, Cohen's Kappa, and per-class precision/recall are reported in the IEEE technical report. The evidence matrix footnote notes this explicitly.

More data did not improve accuracy. XLM-RoBERTa: 98.1% (50K, NB2) · 98.02% (Full 232K) · 97.78% (H1) · 98.02% (H2). The maximum difference is 0.32 percentage points. KS tests across the three split-study configurations (Full, H1, H2) find no significant difference between distributions: suicide class p=0.4967 (H1 vs H2), p=0.9758 (Full vs H1); non-suicide class p=0.8125 (H1 vs H2), p=0.9992 (Full vs H1). All are well above the p=0.05 threshold, so distribution shift is unlikely to be driving the results. This is Finding 03.
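For readers unfamiliar with the test, the two-sample KS statistic is the largest gap between the empirical cumulative distributions of two samples. The sketch below computes only that statistic, not the p-value, which a statistics library would normally supply. The source does not say which feature was compared, so the post word counts used here are an assumption and the values are made up for illustration.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # share of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # share of b that is <= x
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Hypothetical word counts for two splits (illustrative values only).
half_1_lengths = [12, 40, 63, 150, 201, 320]
half_2_lengths = [15, 38, 70, 149, 198, 333]
gap = ks_statistic(half_1_lengths, half_2_lengths)
```

A statistic near 0 means the two splits look alike on that feature, and a value of 1 means they do not overlap at all. A large p-value means the test found no evidence of a difference, which is weaker than proving the distributions are identical.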

Architecture & Live Demo

The live demo runs real models. The Flask app proxies every request to a HuggingFace Space (esvanth-mindscan.hf.space), which runs predict.py with all 12 loaded models. There is no hardcoded data: every input goes through the full pipeline. If the Space is sleeping, it auto-wakes within ~60 seconds.

The amber state means the classical D3 models (LR/SVM/XGBoost) flagged suicide risk by majority vote, but XLM-RoBERTa, the best model at 98.1% accuracy, disagrees. A pure majority vote could trigger false alarms on metaphorical language ("I'm dying of embarrassment"). The amber state expresses uncertainty rather than forcing a binary decision, which maps directly to "escalate for human review", the appropriate clinical-conservative response.
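The amber rule can be expressed as a small decision function. Only the amber case is defined in the text above. The function name, the "red" and "green" labels, and the handling of a transformer-only flag are assumptions added to make the sketch complete.

```python
def d3_risk_state(classical_votes, transformer_flag):
    """Combine D3 classical votes (LR, SVM, XGBoost) with XLM-RoBERTa.

    classical_votes: list of booleans, True meaning "suicide risk".
    transformer_flag: boolean prediction from XLM-RoBERTa.
    """
    classical_majority = sum(classical_votes) > len(classical_votes) / 2
    if classical_majority and not transformer_flag:
        return "amber"  # documented case: escalate for human review
    if transformer_flag:
        return "red"    # assumption: a transformer flag is treated as risk
    return "green"      # assumption: neither side flags risk


state = d3_risk_state([True, True, False], transformer_flag=False)
```

Returning a third state instead of forcing the vote to one side keeps the disagreement visible to whoever reviews the output.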

D2 under-flagging clinical language is the Affective vs. Clinical Lexicon Gap (Finding 04, documented in NAACL 2024). D2 was trained on Twitter emotional language: explicit distress, slang, emotional punctuation. Clinical presentations use diagnostic vocabulary: anhedonia ("nothing feels enjoyable"), psychomotor fatigue, flat affect. These words are absent from D2's training distribution. This is not a bug; it is an empirical finding about the domain gap between social media affect and clinical language.

Future work: replace TF-IDF classical models with MentalBERT/MentalRoBERTa (Ji et al. 2022), pretrained on mental health forum data. Combine all three tasks in a true multi-task learning setup with a shared encoder and task-specific heads, following the MTL precedent from Zogan et al. (2024). This would address both documented limitations (Affective Lexicon Gap and TF-IDF overfitting) simultaneously.

Multi-Model Framework for Depression Classification and Suicide Risk Detection from Social Media Text