Mental health NLP research

Three datasets.
Twelve models.
One input.

A parallel multi-model system that simultaneously analyses text across three clinical dimensions — extending Tumaliuan et al. (2024) with modern transformers, SMOTE balancing, and cross-platform generalisation.

XLM-RoBERTa · SVM · XGBoost · LR · CRISP-DM · Flask deployment · Cohen's Kappa

Extending prior work
Our work vs Tumaliuan et al. (2024)

Dataset 1 is structurally equivalent to the base paper's Filipino Twitter corpus — same 6-class task, same clinical annotation method — making a direct F1 comparison valid.

Tumaliuan et al. — 2024
Filipino Twitter Depression
Frontiers in Computer Science · word2vec pipeline
Used word2vec (2013) — static embeddings, no negation handling
SVM never tested — absent from evaluation despite being NLP gold standard
XGBoost never tested — gradient boosting entirely absent
Class imbalance listed as limitation — never resolved
Restricted dataset — requires author permission to access
Cohen's Kappa not reported
Best Macro F1: 0.8100
+0.1169 macro F1 (0.8100 → 0.9269)
about 14.4% relative improvement
MindScan — 2026
English Twitter + Reddit
Zenodo (Nusrat 2024) · XLM-RoBERTa + SVM + XGBoost
XLM-RoBERTa (2019) — contextual embeddings, understands negation
SVM added — achieves best F1 on D1 (0.9269), beats transformer
XGBoost added — F1=0.9217, gradient boosting for imbalanced data
SMOTE applied — imbalance resolved, all 6 classes equalised to 2,997
Public dataset — fully reproducible, anyone can verify results
Cohen's Kappa reported — κ=0.9072 (almost perfect agreement)
Best Macro F1: 0.9269

Methodology
CRISP-DM pipeline

Each stage below records exactly what happened at that step, with the real numbers and decisions.

01 Business Understanding
02 Data Understanding
03 Data Preparation
04 Modelling
05 Evaluation
06 Deployment
Business Understanding
Core question: can a single text input simultaneously answer three clinical questions — what type of depression, is there depression, and is there suicide risk? Parallel architecture chosen because suicidal ideation can exist without depression markers. A sequential pipeline would miss this.
Key decision: all three models run independently in parallel — never as a cascade.
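The difference between a cascade and the parallel design can be sketched with two toy classifiers. Everything here is hypothetical: `depression_model` and `suicide_model` are keyword stubs standing in for the trained D2 and D3 models, and the example text is invented for illustration only.

```python
# Toy stand-ins for the trained D2 and D3 classifiers (hypothetical keyword rules).
def depression_model(text):
    markers = ("sad", "hopeless", "depressed")
    return "depressed" if any(m in text.lower() for m in markers) else "not depressed"

def suicide_model(text):
    markers = ("goodbye", "final", "end it")
    return "suicide risk" if any(m in text.lower() for m in markers) else "no risk"

def sequential(text):
    """Cascade: the suicide check only runs if depression is detected first."""
    result = {"depression": depression_model(text)}
    if result["depression"] == "depressed":
        result["suicide"] = suicide_model(text)
    else:
        result["suicide"] = "not evaluated"
    return result

def parallel(text):
    """Parallel design: every model sees the text, regardless of the others."""
    return {"depression": depression_model(text), "suicide": suicide_model(text)}

calm_text = "I have made my final arrangements. Goodbye everyone."
cascade_result = sequential(calm_text)
parallel_result = parallel(calm_text)
```

With these stubs the cascade never evaluates suicide risk for `calm_text`, because the depression gate returns "not depressed", while the parallel version still reports the risk signal.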
Four research questions
RQ1: Can a unified NLP pipeline trained on multiple independently sourced datasets provide clinically distinct mental health signals from the same text input?
RQ2: Does replacing word2vec with contextual transformer embeddings (XLM-RoBERTa) consistently improve performance across all tasks and datasets?
RQ3: How do classical ML algorithms (SVM, XGBoost, Logistic Regression) compare against transformer-based models on imbalanced multi-class psychiatric text classification?
RQ4: Can a parallel multi-model architecture detect mental health risk cases that sequential gating would miss, specifically suicidal ideation in the absence of classic depression markers?
Data Understanding
EDA run on all three datasets. D1 imbalance: 1.89× (postpartum 3,746 vs atypical 1,980). D2 imbalance: 3.46× (not depressed 8,000 vs depressed 2,314). D3 perfectly balanced at 116,037 each. Key EDA finding: Reddit suicide posts average 200.8 words vs 62.2 for non-suicidal — a 3.2× length difference.
This length asymmetry drove the max_length=256 decision for XLM-RoBERTa on Dataset 3.
Class imbalance per dataset
D1 imbalance ratio: 1.89×
D2 imbalance ratio: 3.46× (severe)
D3 imbalance ratio: 1.0× (balanced)
D3 avg words (suicide): 200.8 words
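The imbalance ratios can be reproduced from the class counts reported on this page, assuming the ratio is defined as largest class divided by smallest class (that definition is inferred from the numbers, not stated explicitly).

```python
def imbalance_ratio(class_counts):
    """Largest class size divided by smallest class size."""
    return max(class_counts.values()) / min(class_counts.values())

# Class counts as reported for each dataset.
d1 = {"postpartum": 3746, "major depressive": 2517, "bipolar": 2443,
      "psychotic": 2312, "no depression": 1985, "atypical": 1980}
d2 = {"not depressed": 8000, "depressed": 2314}
d3 = {"non-suicide": 116037, "suicide": 116037}

ratios = {name: round(imbalance_ratio(counts), 2)
          for name, counts in (("D1", d1), ("D2", d2), ("D3", d3))}
```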
Data Preparation
Same cleaning pipeline on all three: lowercase → remove URLs → strip @mentions → drop # symbol (keep word) → remove punctuation → collapse whitespace. Then 80/20 stratified split (random_state=42, matching Tumaliuan). SMOTE applied to D1 and D2 training sets only.
SMOTE result: D1 training grew 11,986→17,982. D2 grew 8,251→12,800. D3 skipped — already balanced.
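The core idea of SMOTE is to create synthetic minority samples by interpolating between a real minority sample and one of its nearest minority neighbours. The sketch below is a minimal pure-Python illustration of that idea, not the project's code; `smote_like`, the toy 2-D points and the value of `k` are all assumptions made for the example.

```python
import random

def smote_like(minority, n_new, k=5, seed=42):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest neighbours by squared Euclidean distance, excluding base itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic

# Toy feature vectors: 4 minority samples to be grown to a majority size of 10.
minority = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.4, 0.2]]
majority_size = 10
new_points = smote_like(minority, majority_size - len(minority), k=2)
balanced_minority = minority + new_points
```

The same equalisation explains the reported totals: bringing each D1 class up to 2,997 training rows gives 6 × 2,997 = 17,982, and bringing both D2 classes up to 6,400 gives 12,800.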
Before → After cleaning (D1 example)
BEFORE:
"@user I've been so depressed 😢
check https://t.co/xyz #mentalhealth"
AFTER:
"ive been so depressed check mentalhealth"
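The cleaning steps listed above can be sketched as a single function. The exact regular expressions used in the project are not shown on this page, so the patterns below are assumptions; in particular, treating "remove punctuation" as "drop everything except ASCII letters, digits and whitespace" is what removes the emoji here.

```python
import re

def clean_text(text):
    """Apply the cleaning steps in the order described in the text."""
    text = text.lower()                                  # lowercase
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # remove URLs
    text = re.sub(r"@\w+", " ", text)                    # strip @mentions
    text = text.replace("#", "")                         # drop '#', keep the word
    text = re.sub(r"[^a-z0-9\s]", "", text)              # remove punctuation (and emoji)
    text = re.sub(r"\s+", " ", text).strip()             # collapse whitespace
    return text

raw = "@user I've been so depressed 😢\ncheck https://t.co/xyz #mentalhealth"
cleaned = clean_text(raw)
```

Only the URL token is removed, so a word that precedes a link stays in the cleaned text.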
Modelling
Four algorithms per dataset: Logistic Regression (baseline), SVM/LinearSVC (absent from base paper), XGBoost (absent from base paper), XLM-RoBERTa (replaces word2vec). Classical models use TF-IDF capped at 50K features (D1/D2) or 60K (D3); D1 yields 34,615 features in practice. XLM-RoBERTa fine-tuned 3 epochs on T4 GPU.
XLM-RoBERTa: 278M parameters, pre-trained on 2.5TB in 100 languages. Understands negation — "I'm not fine" ≠ "I'm fine".
Training times (XLM-RoBERTa on Tesla T4 GPU; classical models on CPU)
Classical models (all 9): ~20 min total (CPU)
XLM-RoBERTa D1: ~9.4 min
XLM-RoBERTa D2: ~7.8 min
XLM-RoBERTa D3: ~12.2 min
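To make the classical feature step concrete, here is a small pure-Python TF-IDF sketch over unigrams and bigrams. It is an illustration only: the project's actual vectoriser settings are not given on this page, and `ngrams`, `tfidf` and the two example sentences are hypothetical. The weighting uses a smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation, which is a common default.

```python
import math
from collections import Counter

def ngrams(text, n_max=2):
    """Return all unigrams and bigrams of a whitespace-tokenised text."""
    tokens = text.split()
    grams = []
    for n in range(1, n_max + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams

def tfidf(docs, max_features=50_000):
    """Return one {term: weight} dict per document, L2-normalised."""
    doc_grams = [Counter(ngrams(d)) for d in docs]
    # Document frequency: iterating a Counter yields each distinct term once.
    df = Counter(g for grams in doc_grams for g in grams)
    vocab = [g for g, _ in df.most_common(max_features)]  # feature cap
    n = len(docs)
    idf = {g: math.log((1 + n) / (1 + df[g])) + 1 for g in vocab}
    rows = []
    for grams in doc_grams:
        row = {g: c * idf[g] for g, c in grams.items() if g in idf}
        norm = math.sqrt(sum(v * v for v in row.values())) or 1.0
        rows.append({g: v / norm for g, v in row.items()})
    return rows

docs = ["i am not fine", "i am fine"]
rows = tfidf(docs)
```

In this toy case the bigram "not fine" appears only in the first document and receives a higher weight than the shared bigram "i am", which is how a bag-of-n-grams model can pick up simple negation without contextual embeddings.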
Evaluation
Three metrics: Macro F1 (primary, same as base paper), Cohen's Kappa (agreement beyond chance), Accuracy. All 12 models beat the 0.81 baseline. Surprising finding: SVM beats XLM-RoBERTa on D1. XLM-RoBERTa leads on D2 and D3, and the D3 posts are far longer than tweets.
SVM wins D1 (F1=0.9269 vs 0.9117), plausibly because 31-word tweets give contextual embeddings little room to beat TF-IDF features. Between D1 and D3 the transformer's edge grows with text length; D2 (15.1-word tweets, XLM-RoBERTa best) shows that length is not the only factor.
Best model per dataset
D1: SVM wins, F1=0.9269 (XLM-RoBERTa was 4th, F1=0.9117)
D2: XLM-RoBERTa wins, F1=0.9993 (near-perfect binary classification)
D3: XLM-RoBERTa wins, F1=0.9810 (+4.42 points over SVM; longer posts)
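Macro F1 and Cohen's Kappa can both be computed from a confusion matrix. The sketch below shows the standard definitions; the matrix `cm` is hypothetical, chosen as a balanced 10,000-row binary test set with 98.10% accuracy so that it mirrors the shape of the D3 result, and is not the project's actual confusion matrix.

```python
def macro_f1(cm):
    """cm[i][j] = number of items with true class i predicted as class j."""
    k = len(cm)
    scores = []
    for c in range(k):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(k)) - tp
        fn = sum(cm[c]) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / k  # unweighted mean over classes

def cohens_kappa(cm):
    """Agreement beyond chance: (observed - expected) / (1 - expected)."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    observed = sum(cm[c][c] for c in range(k)) / total
    expected = sum(
        sum(cm[c]) * sum(cm[r][c] for r in range(k)) for c in range(k)
    ) / total ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical balanced binary confusion matrix (5,000 rows per true class).
cm = [[4905, 95], [95, 4905]]
f1 = macro_f1(cm)
kappa = cohens_kappa(cm)
```

On a perfectly balanced binary test set, chance agreement is 0.5, so Kappa reduces to 2 × accuracy − 1. That is consistent with the D3 figures reported here (accuracy 98.10%, κ = 0.9620).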
Deployment
Flask backend loads 12 classifiers at startup (~30s). POST /predict endpoint runs all models in parallel and returns structured JSON. Frontend shows all 4 model predictions per dataset with confidence bars. Majority vote (3/4) triggers suicide risk alert. Deployed locally in VS Code, accessible at localhost:5000.
Classical models load in 6.4s. XLM-RoBERTa adds ~25s on CPU. Random Forest excluded (646 MB, worst performer).
Deployed file sizes
LR + SVM + XGBoost (all 9): ~15 MB
xlmr_d1_final/: 1,077 MB
xlmr_d2_final/: 1,077 MB
xlmr_d3_final/: 1,077 MB
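One way to run several already-loaded models on the same input and apply the 3-of-4 alert rule is sketched below. This is an assumption-laden illustration rather than the contents of app.py: the thread pool, the `run_all` and `majority_alert` helpers, the stub models and the reading of "3/4" as "at least three of four" are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def majority_alert(votes, threshold=3):
    """True when at least `threshold` models predict the 'suicide' class."""
    return sum(1 for v in votes if v == "suicide") >= threshold

def run_all(models, text):
    """Run every model on the same text concurrently; returns {name: label}."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = {name: pool.submit(fn, text) for name, fn in models.items()}
        return {name: fut.result() for name, fut in futures.items()}

# Stub predictors standing in for the four Dataset 3 models.
d3_models = {
    "lr": lambda text: "suicide",
    "svm": lambda text: "suicide",
    "xgboost": lambda text: "non-suicide",
    "xlmr": lambda text: "suicide",
}
predictions = run_all(d3_models, "example post")
alert = majority_alert(predictions.values())
```

Because each model receives the text independently, a disagreement such as the XGBoost stub above stays visible in `predictions` while the alert is still decided by the vote.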

Training data
Three datasets, three questions

Each dataset trains its own independent set of models, and each answers a different clinical question. The full statistics for every dataset are listed below.

D1: Depression type classification (6 classes)
Nusrat et al. (2024) · Zenodo 14233292 · English Twitter · Psychiatrist-verified
14,983 tweets · 6 classes · 1.89× imbalance
Class distribution
postpartum: 3,746
major depressive: 2,517
bipolar: 2,443
psychotic: 2,312
no depression: 1,985
atypical: 1,980
Key stats
Avg tweet length: 31.4 words
After SMOTE: 11,986 → 17,982
TF-IDF features: 34,615
XLM-RoBERTa max_len: 128 tokens
Best model (F1): SVM, 0.9269
D2: Binary depression detection
albertobellardini · Kaggle · Twitter · Labels: 0 (not depressed) / 1 (depressed)
10,314 tweets · 2 classes · 3.46× imbalance
Class distribution
not depressed (0): 8,000
depressed (1): 2,314
Key stats
Avg tweet length: 15.1 words
After SMOTE: 8,251 → 12,800
Label note: 0/1 mapped to readable text in UI
Best model (F1): XLM-RoBERTa, 0.9993
D3: Suicide risk detection
nikhileswarkomati · Kaggle · Reddit (r/SuicideWatch) · 232K rows, sampled to 50K
50,000 posts used · 2 classes · 1.0× (balanced)
Class distribution (sampled)
non-suicide: 25,000
suicide: 25,000
Key stats
Avg, suicide posts: 200.8 words
Avg, non-suicide posts: 62.2 words
SMOTE needed? No, already balanced
XLM-RoBERTa max_len: 256 tokens (2× tweets)
Best model (F1): XLM-RoBERTa, 0.9810
RF: trained, not deployed
Why Random Forest was trained but excluded from deployment
Random Forest was fully trained and evaluated across all three datasets. It was excluded from the deployed app for two reasons — size and performance.
Size: too large
rf_d1.pkl: 240.4 MB
rf_d2.pkl: 71.7 MB
rf_d3.pkl: 333.9 MB
Total: 646 MB
Performance: last on D3, mid-table elsewhere
D1: 4th of 5, F1=0.9129
D2: 3rd of 5, F1=0.9880
D3: 5th of 5, F1=0.8800
Only model below 0.90 on any dataset
646 MB of pkl files would add 30–60 seconds to server startup and consume ~2 GB of RAM for a model that finishes last on D3 and behind all three other classical models on D1. Results are fully reported in the paper: the model was evaluated, not ignored.

Project structure
Every file, explained

The tree below shows where each file lives; app.py is described in detail underneath.

MindScan/
  app.py (Flask)
  predict.py (Python)
  requirements.txt
  README.md
  templates/
    index.html (HTML)
  models/
    classical/
      *.pkl (18 files, ~15 MB)
    transformers/
      xlmr_d1/d2/d3_final/ (3.2 GB)
  notebooks/
    DA_Notebook_One.ipynb.ipynb
    DA_2_Notebook.ipynb.ipynb
  report/
    mindscan_report.tex (IEEE)
app.py
MindScan/app.py
The Flask web server. Loads all 12 models once at startup — not per request. Serves the UI at GET /, exposes POST /predict for predictions, and GET /health for status checks. Starting the server takes ~30 seconds while XLM-RoBERTa models load into CPU memory.
Flask 3.0 · POST /predict · GET /health · startup model load

Live inference
Try it — all 12 models

Sample 3 is the most interesting — it demonstrates masked suicidality, the key clinical finding of the project.

Research prototype only. Not a clinical tool. If you or someone you know is in crisis, please contact a mental health professional or emergency services immediately.
⚡ This is the key research finding — masked suicidality
Dataset 2 says "Not Depressed" — there are no classic depression markers in this text. But Dataset 3 flags high suicide risk. This is clinically documented: people in the final stages of a suicide plan often present calm, resolved language rather than sadness. A sequential pipeline that gates suicide detection behind depression detection would miss this completely. This is why our parallel architecture matters.

Key findings
What the results mean

Four insights that go beyond the numbers.

01
SVM beats a transformer on short text
On 6-class depression type, SVM (F1=0.9269) outperforms XLM-RoBERTa (F1=0.9117). Tweets average 31 words — too short for contextual embeddings to gain advantage over TF-IDF bigrams.
D1: SVM 0.9269 vs XLM-R 0.9117
02
Transformer advantage scales with text length
XLM-RoBERTa's margin over SVM grows from -1.52 points (D1, 31-word tweets) to +4.42 points (D3, where suicide-class posts average 200.8 words). On these two datasets, contextual embeddings pay off only when the text offers rich context.
D3: XLM-R 0.9810 vs SVM 0.9368
03
Masked suicidality — the case for parallel models
A text can be "Not Depressed" on D2 while flagging high suicide risk on D3. Clinically documented: calm, resolved language before a crisis without classic depression markers. Sequential pipeline misses this.
Try Sample 3 in the demo
04
SMOTE addressed the limitation the base paper listed
Tumaliuan listed class imbalance as an unresolved limitation. After SMOTE, Dataset 1's rarest class (atypical, 1,980 tweets) achieved F1=0.992 with XLM-RoBERTa, well above that model's 0.9117 macro F1 on D1.
Atypical class: F1=0.992
Dataset | Model | Accuracy | Macro F1 | Cohen's κ | Note
D1 | SVM | 92.36% | 0.9269 | 0.9072 | ★ Best D1
D1 | XGBoost | 91.76% | 0.9217 | 0.9000 |
D1 | Logistic Regression | 91.52% | 0.9179 | 0.8971 |
D1 | XLM-RoBERTa | 90.52% | 0.9117 | 0.8852 | 4th; SVM wins
D2 | XLM-RoBERTa | 99.95% | 0.9993 | 0.9986 | ★ Best D2
D2 | XGBoost | 99.27% | 0.9895 | 0.9789 |
D2 | Logistic Regression | 98.89% | 0.9839 | 0.9678 |
D3 | XLM-RoBERTa | 98.10% | 0.9810 | 0.9620 | ★ Best D3
D3 | SVM | 93.68% | 0.9368 | 0.8736 |
D3 | Logistic Regression | 93.18% | 0.9318 | 0.8636 |
Tumaliuan et al. (2024) baseline | | | 0.8100 | |

Professor assigned — research validation
Dataset 3 split study — does more data help?

Dataset 3 has 232,074 Reddit posts. Our deployed models trained on 50K (25K per class). The professor asked us to split the full corpus into halves and retrain to validate whether our sample was sufficient and representative.
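Splitting a labelled corpus into two class-balanced halves can be sketched as below. The page does not say how the halves were drawn, so the per-class shuffle, the `stratified_halves` helper, the seed and the toy label list are all assumptions; the sketch only shows one way to obtain halves with equal class proportions like those reported for H1 and H2.

```python
import random
from collections import defaultdict

def stratified_halves(labels, seed=42):
    """Split row indices into two halves, keeping each class evenly divided."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    h1, h2 = [], []
    for indices in by_class.values():
        rng.shuffle(indices)          # randomise order within the class
        mid = len(indices) // 2
        h1 += indices[:mid]
        h2 += indices[mid:]
    return h1, h2

# Toy corpus: 10 rows per class instead of 116,037.
labels = ["suicide"] * 10 + ["non-suicide"] * 10
h1, h2 = stratified_halves(labels)
```

Shuffling within each class before cutting keeps both halves balanced and avoids any ordering effect in the source file.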

Verdict — our 50K models are validated. Keep them deployed.
Our XLM-RoBERTa 50K model (F1=0.9810) matches the full 232K model (F1=0.9802); the 0.0008 difference is negligible, so adding 182,000 more training samples gave no meaningful gain. The KS test found no significant difference between the H1, H2 and Full distributions (p=0.49–0.99). XGBoost collapsed on the larger splits (H1: F1=0.5521), whereas it held F1=0.9162 on our 50K sample.
Our deployed: 50K (25K suicide, 25K non-suicide)
Split study, Full: 232K (116K suicide, 116K non-suicide)
Split study, H1: 116K (58K suicide, 58K non-suicide)
Split study, H2: 116K (58K suicide, 58K non-suicide)
Split | Model | Accuracy | Macro F1 | Cohen's κ | AUC-ROC | Verdict
Our 50K | ★ XLM-RoBERTa | 98.10% | 0.9810 | 0.9620 | | Best overall
Our 50K | SVM | 93.68% | 0.9368 | 0.8736 | 0.9831 |
Our 50K | Logistic Regression | 93.18% | 0.9318 | 0.8636 | 0.9817 |
Our 50K | XGBoost | 91.62% | 0.9162 | 0.8324 | |
Full 232K | XLM-RoBERTa | 98.02% | 0.9802 | 0.9604 | | −0.0008 vs 50K
Full 232K | SVM | 94.60% | 0.9460 | 0.8919 | 0.9862 |
Full 232K | Logistic Regression | 94.34% | 0.9434 | 0.8868 | 0.9858 |
Full 232K | XGBoost | 70.52% | 0.6998 | 0.4104 | 0.7064 | Collapsed ↓
H1 116K | XLM-RoBERTa | 97.78% | 0.9778 | 0.9556 | |
H1 116K | SVM | 94.18% | 0.9418 | 0.8836 | 0.9835 |
H1 116K | Logistic Regression | 93.84% | 0.9384 | 0.8769 | 0.9824 |
H1 116K | XGBoost | 60.11% | 0.5521 | 0.2017 | 0.6051 | Worst result ↓
H2 116K | XLM-RoBERTa | 98.02% | 0.9802 | 0.9604 | |
H2 116K | SVM | 94.21% | 0.9421 | 0.8842 | 0.9850 |
H2 116K | Logistic Regression | 93.74% | 0.9374 | 0.8748 | 0.9832 |
H2 116K | XGBoost | 71.00% | 0.7085 | 0.4201 | 0.6805 | Collapsed ↓
01
XLM-RoBERTa is data-efficient
50K gives F1=0.9810. Full 232K gives F1=0.9802. Adding 182K more training samples made no meaningful difference — the model reached near-ceiling with our sample.
Gap: only 0.0008 F1
02
XGBoost collapses at scale
F1 drops from 0.9162 (our 50K) to 0.5521 on H1 (116K). H1 vs H2 inconsistency of 0.1564 is flagged INCONSISTENT — model instability, not data quality.
H1 vs H2 gap: 0.1564 ↑
03
Sample was representative
KS test p-values range from 0.49 to 0.99 across all class/split comparisons, so there is no statistical evidence that H1, H2, and Full differ in distribution. We found no sign that the 50K sample was biased.
KS p=0.4967 (H1 vs H2)
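For readers unfamiliar with the two-sample Kolmogorov-Smirnov test, the sketch below computes the KS statistic (the largest gap between two empirical distribution functions) and an asymptotic p-value in pure Python. It is a teaching approximation, not the project's code: which feature was compared is not stated on this page, so the post-length lists are invented, and `ks_two_sample` uses a standard series approximation rather than an exact small-sample p-value.

```python
import math
from bisect import bisect_right

def ks_two_sample(a, b):
    """Return (D, approximate p-value) for the two-sample KS test."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    # D is the largest absolute gap between the two empirical CDFs.
    d = max(abs(bisect_right(a, x) / n - bisect_right(b, x) / m) for x in a + b)
    effective_n = n * m / (n + m)
    lam = (math.sqrt(effective_n) + 0.12 + 0.11 / math.sqrt(effective_n)) * d
    if lam < 0.2:
        return d, 1.0  # the series below is unreliable here; p is ~1 anyway
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
                  for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)

# Hypothetical post lengths (in words) drawn from two halves of a corpus.
h1_lengths = [12, 40, 63, 85, 120, 150, 201, 260, 310, 400]
h2_lengths = [15, 38, 60, 90, 118, 155, 198, 255, 320, 390]
d_stat, p_value = ks_two_sample(h1_lengths, h2_lengths)
```

A large p-value means the test found no evidence that the two samples differ; it does not prove that the underlying distributions are identical.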