MindScan — Mental Health Detection System

Extending prior work

Our work vs Tumaliuan et al. (2024)

Dataset 1 is structurally equivalent to the base paper's Filipino Twitter corpus — same 6-class task, same clinical annotation method — making a direct F1 comparison valid.

Tumaliuan et al. — 2024

Filipino Twitter Depression

Frontiers in Computer Science · word2vec pipeline

Used word2vec (2013) — static embeddings, no negation handling

SVM never tested: absent from the evaluation despite being a standard strong baseline for text classification

XGBoost never tested — gradient boosting entirely absent

Class imbalance listed as limitation — never resolved

Restricted dataset — requires author permission to access

Cohen's Kappa not reported

Best Macro F1: 0.8100

Improvement: +0.1169 Macro F1 (0.8100 → 0.9269), about 14.4% relative

MindScan — 2026

English Twitter + Reddit

Zenodo (Nusrat 2024) · XLM-RoBERTa + SVM + XGBoost

✓ XLM-RoBERTa (2019): contextual embeddings, understands negation

✓ SVM added: achieves the best F1 on D1 (0.9269), ahead of the transformer

✓ XGBoost added: F1=0.9217 on D1, gradient boosting for imbalanced data

✓ SMOTE applied: training-set imbalance addressed, all 6 D1 classes equalised to 2,997

✓ Public dataset: fully reproducible, anyone can verify the results

✓ Cohen's Kappa reported: κ=0.9072 on D1 (almost perfect agreement)

Best Macro F1: 0.9269

Methodology

CRISP-DM pipeline


Business Understanding → Data Understanding → Data Preparation → Modelling → Evaluation → Deployment

Business Understanding

Core question: can a single text input simultaneously answer three clinical questions — what type of depression, is there depression, and is there suicide risk? Parallel architecture chosen because suicidal ideation can exist without depression markers. A sequential pipeline would miss this.

Key decision: all three models run independently in parallel — never as a cascade.
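The parallel design can be sketched as follows. This is a minimal illustration, not the project's code: the three predictor functions are hypothetical stand-ins for the task-specific models, and the keyword rule inside `predict_suicide_risk` is a toy placeholder. The point is only that every task receives the same text and no task sees another task's output.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the three task-specific models.
def predict_type(text):
    return "no depression"

def predict_depressed(text):
    return "not depressed"

def predict_suicide_risk(text):
    # Toy rule for illustration only; a real model would be loaded here.
    return "suicide" if "goodbye" in text.lower() else "non-suicide"

TASKS = {
    "depression_type": predict_type,
    "depressed": predict_depressed,
    "suicide_risk": predict_suicide_risk,
}

def analyse(text):
    """Run every task on the same input; no task gates another."""
    with ThreadPoolExecutor(max_workers=len(TASKS)) as pool:
        futures = {name: pool.submit(fn, text) for name, fn in TASKS.items()}
        return {name: future.result() for name, future in futures.items()}

result = analyse("I have sorted everything out. Goodbye.")
```

Because each task is submitted independently, a "not depressed" result cannot suppress the suicide-risk result.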

Four research questions

RQ1

Can a unified NLP pipeline trained on multiple independently sourced datasets provide clinically distinct mental health signals from the same text input?

RQ2

Does replacing word2vec with contextual transformer embeddings (XLM-RoBERTa) consistently improve performance across all tasks and datasets?

RQ3

How do classical ML algorithms (SVM, XGBoost, Logistic Regression) compare against transformer-based models on imbalanced multi-class psychiatric text classification?

RQ4

Can a parallel multi-model architecture detect mental health risk cases that sequential gating would miss — specifically, suicidal ideation in the absence of classic depression markers?

Data Understanding

EDA run on all three datasets. D1 imbalance: 1.89× (postpartum 3,746 vs atypical 1,980). D2 imbalance: 3.46× (not depressed 8,000 vs depressed 2,314). D3 perfectly balanced at 116,037 each. Key EDA finding: Reddit suicide posts average 200.8 words vs 62.2 for non-suicidal — a 3.2× length difference.

This length asymmetry drove the max_length=256 decision for XLM-RoBERTa on Dataset 3.

Class imbalance per dataset

D1 imbalance ratio: 1.89×

D2 imbalance ratio: 3.46× (severe)

D3 imbalance ratio: 1.0× (balanced)

D3 avg words (suicide): 200.8 words

Data Preparation

Same cleaning pipeline on all three: lowercase → remove URLs → strip @mentions → drop # symbol (keep word) → remove punctuation → collapse whitespace. Then 80/20 stratified split (random_state=42, matching Tumaliuan). SMOTE applied to D1 and D2 training sets only.

SMOTE result: D1 training grew 11,986→17,982. D2 grew 8,251→12,800. D3 skipped — already balanced.
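The reported sizes follow from SMOTE raising every class to the majority class count. The sketch below checks that arithmetic and shows the interpolation step at the heart of SMOTE. Two assumptions: the D2 majority training count of 6,400 is inferred as 80% of 8,000 under a stratified split, and `interpolate` is a simplified illustration of the idea, not the library implementation the project used.

```python
def balanced_size(n_classes, majority_train_count):
    """Training-set size once every class matches the majority class."""
    return n_classes * majority_train_count

d1_after = balanced_size(6, 2997)   # 2,997 is the D1 figure reported above
d2_after = balanced_size(2, 6400)   # 6,400 is inferred, see the note above
d1_synthetic = d1_after - 11986     # synthetic rows added to D1
d2_synthetic = d2_after - 8251      # synthetic rows added to D2

def interpolate(sample, neighbour, lam):
    """Core SMOTE step: a point on the segment between a minority sample
    and one of its nearest minority-class neighbours (0 <= lam <= 1)."""
    return [s + lam * (n - s) for s, n in zip(sample, neighbour)]

synthetic_point = interpolate([0.0, 1.0], [1.0, 3.0], 0.5)
# synthetic_point == [0.5, 2.0]
```

Oversampling only the training split, as described above, keeps synthetic points out of the test set.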

Before → After cleaning (D1 example)

BEFORE:
"@user I've been so depressed 😢
check https://t.co/xyz #mentalhealth"
AFTER:
"ive been so depressed check mentalhealth"
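The cleaning steps listed above can be sketched with the standard `re` module. The regular expressions are our assumptions; the project's exact patterns are not shown on this page. Applied to the BEFORE text, these steps remove only the URL itself, so the word "check" in front of it is kept.

```python
import re

def clean_text(text):
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)                   # strip @mentions
    text = text.replace("#", "")                        # drop '#', keep the word
    text = re.sub(r"[^\w\s]", "", text)                 # remove punctuation and emoji
    return re.sub(r"\s+", " ", text).strip()            # collapse whitespace

raw = "@user I've been so depressed 😢\ncheck https://t.co/xyz #mentalhealth"
cleaned = clean_text(raw)
# cleaned == "ive been so depressed check mentalhealth"
```

URLs and mentions are removed before punctuation so that their `://`, `.` and `@` markers are still present when the patterns run.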

Modelling

Four algorithms per dataset: Logistic Regression (baseline), SVM/LinearSVC (absent from base paper), XGBoost (absent from base paper), XLM-RoBERTa (replaces word2vec). Classical models use TF-IDF with a vocabulary cap of 50K features (D1/D2) or 60K (D3); the fitted D1 vocabulary has 34,615 features. XLM-RoBERTa fine-tuned for 3 epochs on a T4 GPU.

XLM-RoBERTa: 278M parameters, pre-trained on 2.5TB in 100 languages. Understands negation — "I'm not fine" ≠ "I'm fine".
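For contrast with the contextual model, the sketch below shows how word n-grams let a bag-of-words model see some negation. It only counts unigrams and bigrams; a TF-IDF vectoriser would then weight these counts. The function name and the whitespace tokenisation are our assumptions, not the project's vectoriser settings.

```python
from collections import Counter

def ngram_counts(text, n_max=2):
    """Count word n-grams of length 1..n_max in whitespace-tokenised text."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

negated = ngram_counts("im not fine")
plain = ngram_counts("im fine")
only_in_negated = sorted(set(negated) - set(plain))
# only_in_negated == ['im not', 'not', 'not fine']
```

With unigrams alone the two sentences differ by a single token, "not"; the bigram "not fine" gives a linear model a feature tied directly to the negated phrase.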

Training times on Tesla T4 GPU

Classical models (all 9): ~20 min total (CPU)

XLM-RoBERTa D1: ~9.4 min

XLM-RoBERTa D2: ~7.8 min

XLM-RoBERTa D3: ~12.2 min

Evaluation

Three metrics: Macro F1 (primary, same as base paper), Cohen's Kappa (agreement beyond chance), Accuracy. All 12 models beat the 0.81 baseline. Surprising finding: SVM beats XLM-RoBERTa on D1. XLM-RoBERTa dominates on D2 and D3 where texts are longer.

SVM wins D1 (F1=0.9269 vs 0.9117), plausibly because tweets averaging about 31 words give contextual embeddings little room to gain an advantage over TF-IDF features. The transformer's edge grows with text length.
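For reference, Macro F1 and Cohen's Kappa can be computed from label lists as sketched below. This is a plain-Python illustration of the standard definitions; the project most likely used library implementations.

```python
from collections import Counter

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def cohens_kappa(y_true, y_pred):
    """Agreement beyond chance: (observed - expected) / (1 - expected)."""
    n = len(y_true)
    observed = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    expected = sum(true_counts[l] * pred_counts[l] for l in true_counts) / (n * n)
    return (observed - expected) / (1 - expected)

y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
f1 = macro_f1(y_true, y_pred)        # (2/3 + 4/5) / 2
kappa = cohens_kappa(y_true, y_pred)  # (0.75 - 0.5) / (1 - 0.5) = 0.5
```

When the true labels are split 50/50 between two classes, the expected agreement is 0.5 and kappa reduces to 2 × accuracy − 1. The D3 results are consistent with this, for example 2 × 0.9368 − 1 = 0.8736 for SVM.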

Best model per dataset

D1 (SVM wins): F1=0.9269

XLM-RoBERTa was 4th of the four deployed models (F1=0.9117)

D2 (XLM-RoBERTa wins): F1=0.9993

Near-perfect binary classification

D3 (XLM-RoBERTa wins): F1=0.9810

+4.42 points over SVM (longer posts)

Deployment

Flask backend loads 12 classifiers at startup (~30s). POST /predict endpoint runs all models in parallel and returns structured JSON. Frontend shows all 4 model predictions per dataset with confidence bars. Majority vote (3/4) triggers suicide risk alert. Deployed locally in VS Code, accessible at localhost:5000.

Classical models load in 6.4s. XLM-RoBERTa adds ~25s on CPU. Random Forest excluded (646 MB, weakest model on D3).
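The 3-of-4 majority vote for the suicide risk alert can be sketched as below. The model keys and label strings are hypothetical; only the "at least 3 of 4" rule comes from the description above.

```python
def suicide_alert(predictions, positive_label="suicide", min_votes=3):
    """Return True when at least min_votes models predict the positive label."""
    votes = sum(1 for label in predictions.values() if label == positive_label)
    return votes >= min_votes

d3_predictions = {
    "logistic_regression": "suicide",
    "svm": "suicide",
    "xgboost": "non-suicide",
    "xlm_roberta": "suicide",
}
alert = suicide_alert(d3_predictions)  # 3 of 4 votes, so the alert fires
```

A 2-2 split does not trigger the alert under this rule, so the threshold trades some sensitivity for fewer false alarms.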

Deployed file sizes

LR + SVM + XGBoost (all 9): ~15 MB

xlmr_d1_final/: 1,077 MB

xlmr_d2_final/: 1,077 MB

xlmr_d3_final/: 1,077 MB

Training data

Three datasets, three questions

A separate set of models is trained on each dataset, and each dataset answers a different clinical question.

Depression type classification — 6 classes

Nusrat et al. (2024) · Zenodo 14233292 · English Twitter · Psychiatrist-verified

14,983 tweets · 6 classes · 1.89× imbalance

Class distribution

postpartum: 3,746

major depressive: 2,517

bipolar: 2,443

psychotic: 2,312

no depression: 1,985

atypical: 1,980

Key stats

Avg tweet length: 31.4 words

After SMOTE: 11,986 → 17,982

TF-IDF features: 34,615

XLM-RoBERTa max_len: 128 tokens

Best model (F1): SVM, 0.9269

Binary depression detection

albertobellardini · Kaggle · Twitter · Labels: 0 (not depressed) / 1 (depressed)

10,314 tweets · 2 classes · 3.46× imbalance

Class distribution

not depressed (0): 8,000

depressed (1): 2,314

Key stats

Avg tweet length: 15.1 words

After SMOTE: 8,251 → 12,800

Label note: 0/1 mapped to readable text in UI

Best model (F1): XLM-RoBERTa, 0.9993

Suicide risk detection

nikhileswarkomati · Kaggle · Reddit (r/SuicideWatch) · 232K rows, sampled to 50K

50,000 posts used · 2 classes · 1.0× (balanced)

Class distribution (sampled)

non-suicide: 25,000

suicide: 25,000

Key stats

Avg length, suicide posts: 200.8 words

Avg length, non-suicide posts: 62.2 words

SMOTE needed? No, already balanced

XLM-RoBERTa max_len: 256 tokens (2× tweets)

Best model (F1): XLM-RoBERTa, 0.9810

Trained, not deployed

Why Random Forest was trained but excluded from deployment

Random Forest was fully trained and evaluated across all three datasets. It was excluded from the deployed app for two reasons — size and performance.

Size — too large

rf_d1.pkl: 240.4 MB

rf_d2.pkl: 71.7 MB

rf_d3.pkl: 333.9 MB

Total: 646 MB

Performance: last on D3, 4th of 5 on D1

D1, 4th of 5: F1=0.9129

D2, 3rd of 5: F1=0.9880

D3, 5th of 5: F1=0.8800

Only model below 0.90 on any dataset in the main evaluation

646 MB of pkl files would add an estimated 30–60 seconds to server startup and consume ~2 GB of RAM, for a model that is outperformed by every other algorithm on D3 and by all three classical models on D1 (it edges out only XLM-RoBERTa there, 0.9129 vs 0.9117). Results are fully reported in the paper: the model was evaluated, not ignored.

Project structure

Every file, explained


MindScan/
    app.py (Flask)
    predict.py (Python)
    requirements.txt
    README.md
    templates/
        index.html (HTML)
    models/
        classical/
            *.pkl (18 files, ~15 MB)
        transformers/
            xlmr_d1_final/, xlmr_d2_final/, xlmr_d3_final/ (3.2 GB)
    notebooks/
        DA_Notebook_One.ipynb.ipynb
        DA_2_Notebook.ipynb.ipynb
    report/
        mindscan_report.tex (IEEE)

app.py

MindScan/app.py

The Flask web server. Loads all 12 models once at startup — not per request. Serves the UI at GET /, exposes POST /predict for predictions, and GET /health for status checks. Starting the server takes ~30 seconds while XLM-RoBERTa models load into CPU memory.

Flask 3.0 · POST /predict · GET /health · startup model load
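The load-once pattern described for app.py can be illustrated without Flask. In this sketch the loaders, the model keys and the JSON response shape are all hypothetical; it shows only that models are built once at startup and that each request reuses them.

```python
import json

def make_stub(label):
    # Hypothetical loader: returns a callable standing in for a trained model.
    return lambda: (lambda text: label)

LOADERS = {
    "d1_svm": make_stub("postpartum"),
    "d2_svm": make_stub("not depressed"),
    "d3_svm": make_stub("non-suicide"),
}

class ModelRegistry:
    """Loads every model once; request handlers only read from it."""

    def __init__(self, loaders):
        self.load_calls = 0
        self.models = {}
        for name, loader in loaders.items():
            self.models[name] = loader()
            self.load_calls += 1

    def predict_all(self, text):
        return {name: model(text) for name, model in self.models.items()}

REGISTRY = ModelRegistry(LOADERS)  # built once, at import/startup time

def handle_predict(request_body):
    """What a POST /predict handler might do: parse, predict, serialise."""
    text = json.loads(request_body)["text"]
    return json.dumps({"text_length": len(text),
                       "predictions": REGISTRY.predict_all(text)})

first = json.loads(handle_predict('{"text": "hello"}'))
second = json.loads(handle_predict('{"text": "hello again"}'))
```

After both requests `REGISTRY.load_calls` is still 3, which is the property that keeps per-request latency independent of model loading time.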

Live inference

Try it — all 12 models

Sample 3 is the most interesting — it demonstrates masked suicidality, the key clinical finding of the project.

Research prototype only. Not a clinical tool. If you or someone you know is in crisis, please contact a mental health professional or emergency services immediately.


⚡ This is the key research finding — masked suicidality

Dataset 2 says "Not Depressed" — there are no classic depression markers in this text. But Dataset 3 flags high suicide risk. This is clinically documented: people in the final stages of a suicide plan often present calm, resolved language rather than sadness. A sequential pipeline that gates suicide detection behind depression detection would miss this completely. This is why our parallel architecture matters.
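The difference between gating and running in parallel can be made concrete with a small comparison. The label strings and function names here are hypothetical; the logic mirrors the scenario described above, where the depression model says "not depressed" and the suicide model flags risk.

```python
def sequential_gate(depressed_label, suicide_label):
    """Cascade: the suicide result is only consulted if depression is detected."""
    if depressed_label != "depressed":
        return {"depressed": depressed_label, "suicide_risk": None}
    return {"depressed": depressed_label, "suicide_risk": suicide_label}

def parallel(depressed_label, suicide_label):
    """Both results are always reported."""
    return {"depressed": depressed_label, "suicide_risk": suicide_label}

def missed_by_gate(depressed_label, suicide_label):
    """True when the parallel design reports a risk the cascade never surfaces."""
    gated = sequential_gate(depressed_label, suicide_label)
    full = parallel(depressed_label, suicide_label)
    return full["suicide_risk"] == "suicide" and gated["suicide_risk"] is None

masked_case = missed_by_gate("not depressed", "suicide")   # True
ordinary_case = missed_by_gate("depressed", "suicide")     # False
```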

Analysis results

For each dataset (Dataset 1: depression type, Dataset 2: depressed or not, Dataset 3: suicide risk) the demo shows the winning model's prediction, followed by the predictions of all 4 models per dataset (purple marks the winner) and XLM-RoBERTa's full probability distribution across all 6 depression types.

Key findings

What the results mean

Four insights that go beyond the numbers.

SVM beats a transformer on short text

On 6-class depression type, SVM (F1=0.9269) outperforms XLM-RoBERTa (F1=0.9117). Tweets average 31 words — too short for contextual embeddings to gain advantage over TF-IDF bigrams.

D1: SVM 0.9269 vs XLM-R 0.9117

Transformer advantage scales with text length

XLM-RoBERTa's margin over SVM grows from -1.52 points (D1, 31 words) to +4.42 points (D3, 200 words). Contextual embeddings need rich context to outperform classical methods.

D3: XLM-R 0.9810 vs SVM 0.9368
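The margins quoted above are differences in Macro F1 expressed in points (hundredths). A quick check from the reported scores; D2 is left out because the SVM score on D2 is not listed in the results table.

```python
F1_SCORES = {
    "D1": {"xlmr": 0.9117, "svm": 0.9269},
    "D3": {"xlmr": 0.9810, "svm": 0.9368},
}

def margin_points(xlmr_f1, svm_f1):
    """XLM-RoBERTa's Macro F1 lead over SVM, in points (F1 difference x 100)."""
    return round((xlmr_f1 - svm_f1) * 100, 2)

margins = {name: margin_points(s["xlmr"], s["svm"]) for name, s in F1_SCORES.items()}
# margins == {'D1': -1.52, 'D3': 4.42}
```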

Masked suicidality — the case for parallel models

A text can be "Not Depressed" on D2 while flagging high suicide risk on D3. This is clinically documented: calm, resolved language can precede a crisis without classic depression markers. A sequential pipeline misses this.

Try Sample 3 in the demo

SMOTE addressed the limitation the base paper listed

Tumaliuan listed class imbalance as an unresolved limitation. After SMOTE, Dataset 1's rarest class (atypical, 1,980 tweets) achieved F1=0.992 with XLM-RoBERTa, one of the highest per-class scores in the project.

Atypical class: F1=0.992

Dataset	Model	Accuracy	Macro F1	Cohen's κ	Note
D1	SVM	92.36%	0.9269	0.9072	★ Best D1
D1	XGBoost	91.76%	0.9217	0.9000
D1	Logistic Regression	91.52%	0.9179	0.8971
D1	XLM-RoBERTa	90.52%	0.9117	0.8852	4th, SVM wins
D2	XLM-RoBERTa	99.95%	0.9993	0.9986	★ Best D2
D2	XGBoost	99.27%	0.9895	0.9789
D2	Logistic Regression	98.89%	0.9839	0.9678
D3	XLM-RoBERTa	98.10%	0.9810	0.9620	★ Best D3
D3	SVM	93.68%	0.9368	0.8736
D3	Logistic Regression	93.18%	0.9318	0.8636
Baseline	Tumaliuan et al. (2024)	—	0.8100	—

Professor assigned — research validation

Dataset 3 split study — does more data help?

Dataset 3 has 232,074 Reddit posts. Our deployed models trained on 50K (25K per class). The professor asked us to split the full corpus into halves and retrain to validate whether our sample was sufficient and representative.

✓ Verdict: our 50K models are validated. Keep them deployed.

Our XLM-RoBERTa 50K model (F1=0.9810) matches the full 232K model (F1=0.9802); the 0.0008 difference is negligible. Adding about 182,000 more posts gave no meaningful gain. The KS test found no significant difference between the H1, H2 and Full distributions (p=0.49–0.99). XGBoost collapsed on the larger splits (H1: F1=0.5521), which suggests the 50K configuration was the more stable one for it.

Our deployed: 50K (25K suicide, 25K non-suicide)

Split study, Full: 232K (116K suicide, 116K non-suicide)

Split study, H1: 116K (58K suicide, 58K non-suicide)

Split study, H2: 116K (58K suicide, 58K non-suicide)

Split	Model	Accuracy	Macro F1	Cohen's κ	AUC-ROC	Verdict
Our 50K ★	XLM-RoBERTa	98.10%	0.9810	0.9620	—	Best overall
Our 50K	SVM	93.68%	0.9368	0.8736	0.9831
Our 50K	Logistic Regression	93.18%	0.9318	0.8636	0.9817
Our 50K	XGBoost	91.62%	0.9162	0.8324	—
Full 232K	XLM-RoBERTa	98.02%	0.9802	0.9604	—	−0.0008 vs 50K
Full 232K	SVM	94.60%	0.9460	0.8919	0.9862
Full 232K	Logistic Regression	94.34%	0.9434	0.8868	0.9858
Full 232K	XGBoost	70.52%	0.6998	0.4104	0.7064	Collapsed ↓
H1 116K	XLM-RoBERTa	97.78%	0.9778	0.9556	—
H1 116K	SVM	94.18%	0.9418	0.8836	0.9835
H1 116K	Logistic Regression	93.84%	0.9384	0.8769	0.9824
H1 116K	XGBoost	60.11%	0.5521	0.2017	0.6051	Worst result ↓
H2 116K	XLM-RoBERTa	98.02%	0.9802	0.9604	—
H2 116K	SVM	94.21%	0.9421	0.8842	0.9850
H2 116K	Logistic Regression	93.74%	0.9374	0.8748	0.9832
H2 116K	XGBoost	71.00%	0.7085	0.4201	0.6805	Collapsed ↓

XLM-RoBERTa is data-efficient

50K gives F1=0.9810. Full 232K gives F1=0.9802. Adding 182K more training samples made no meaningful difference — the model reached near-ceiling with our sample.

Gap: only 0.0008 F1

XGBoost collapses at scale

F1 drops from 0.9162 (our 50K) to 0.5521 on H1 (116K). H1 vs H2 inconsistency of 0.1564 is flagged INCONSISTENT — model instability, not data quality.

H1 vs H2 gap: 0.1564 ↑
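The H1 vs H2 gap can be computed for every model from the split table. The 0.05 cut-off below is an assumed value for illustration; the threshold actually used to flag a model as INCONSISTENT is not stated here.

```python
SPLIT_F1 = {
    "XLM-RoBERTa": {"H1": 0.9778, "H2": 0.9802},
    "SVM": {"H1": 0.9418, "H2": 0.9421},
    "Logistic Regression": {"H1": 0.9384, "H2": 0.9374},
    "XGBoost": {"H1": 0.5521, "H2": 0.7085},
}
THRESHOLD = 0.05  # assumed cut-off, not taken from the study

def split_gaps(split_f1, threshold=THRESHOLD):
    """Absolute Macro F1 gap between the two halves, with a consistency flag."""
    report = {}
    for model, scores in split_f1.items():
        gap = round(abs(scores["H1"] - scores["H2"]), 4)
        report[model] = (gap, "INCONSISTENT" if gap > threshold else "consistent")
    return report

gaps = split_gaps(SPLIT_F1)
# gaps["XGBoost"] == (0.1564, "INCONSISTENT")
```

The other three models differ by at most 0.0024 between halves, which is why the instability is attributed to XGBoost and not to the data.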

Sample was representative

KS test p-values range from 0.49 to 0.99 across all class/split comparisons, so there is no evidence that H1, H2, and Full differ in distribution. This supports treating the 50K sample as representative.

KS p=0.4967 (H1 vs H2)
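The two-sample Kolmogorov-Smirnov statistic is the largest vertical distance between two empirical CDFs. The sketch below computes only that statistic in plain Python. Which quantity the study compared is not stated here, so the sample values are made-up post lengths; for the p-value, `scipy.stats.ks_2samp` returns both the statistic and the p-value.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Maximum absolute difference between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    d = 0.0
    # Both CDFs are step functions, so the maximum occurs at a sample point.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Made-up word counts, for illustration only.
same = ks_statistic([10, 50, 120, 300], [10, 50, 120, 300])   # 0.0
shifted = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])            # 0.5
disjoint = ks_statistic([1, 2, 3], [4, 5, 6])                 # 1.0
```

A small statistic with a large p-value, as reported above, means the test cannot distinguish the two samples; it does not by itself prove they come from the same distribution.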

Three datasets.
Twelve models.
One input.