A parallel ensemble of 12 classifiers across 3 clinical datasets — extending Tumaliuan et al. (2024) with modern transformers and SMOTE balancing.
Dataset 1 is structurally equivalent to the base paper's Filipino Twitter corpus — same 6-class task, same clinical annotation method — making a direct F1 comparison valid.
CRISP-DM applied across all three datasets — from raw social media text to parallel ensemble predictions.
All 4 models evaluated across all dataset splits; values are accuracy. Bold marks the best model per row. XGBoost collapses on the larger D3 training sets, falling to 60.1–71.0% on the 116K and 232K splits.
| Dataset / Split | Logistic Regression | SVM | XGBoost | XLM-RoBERTa |
|---|---|---|---|---|
| D1 Depression Types | 91.5% | **92.4%** | 91.8% | 90.5% |
| D2 Binary Depression | 98.9% | 97.1% | 99.3% | **99.9%** |
| D3 Full (232K) | 94.3% | 94.6% | 70.5% | **98.0%** |
| D3 Half 1 (116K) | 93.8% | 94.2% | 60.1% | **97.8%** |
| D3 Half 2 (116K) | 93.7% | 94.2% | 71.0% | **98.0%** |
| D3 Sample (50K) ★ | 93.2% | 93.7% | 91.6% | **98.1%** |
Note: The full performance evaluation, including Macro F1-Score, Cohen's Kappa, and per-class metrics, is documented in the Final IEEE Report. Accuracy is shown here as the primary comparative metric for cross-dataset validation.
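To make the three headline metrics concrete, here is a minimal pure-Python sketch of accuracy, Macro F1, and Cohen's Kappa. The toy labels ("none", "dep") and the function names are illustrative assumptions, not the project's data or code; scikit-learn's `accuracy_score`, `f1_score(average="macro")`, and `cohen_kappa_score` compute the same quantities.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); treated as 0 when the class is never hit.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def cohen_kappa(y_true, y_pred):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    # Kappa is undefined when p_e == 1; this sketch returns 0.0 in that case.
    return (p_o - p_e) / (1 - p_e) if p_e != 1 else 0.0


# Hypothetical imbalanced toy example: a model that always predicts "none".
y_true = ["none"] * 8 + ["dep"] * 2
y_pred = ["none"] * 10

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
kappa = cohen_kappa(y_true, y_pred)
print(f"accuracy={acc:.3f} macro_f1={f1:.3f} kappa={kappa:.3f}")
```

In this toy case accuracy is 0.8 while Macro F1 is about 0.44 and Kappa is 0, which illustrates why chance-corrected and per-class metrics are commonly reported alongside accuracy on imbalanced data.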
Four insights that directly answer the research questions.
Direct answers to both research questions, and the key limitations of the study.
Sample 3 demonstrates masked suicidality. Try typing clinical-style depressive language ("I feel exhausted, nothing feels enjoyable") to observe the Affective vs. Clinical Lexicon Gap documented in Finding 04.
How the demo works: Flask → HuggingFace proxy · predict_all() inference flow
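As a rough illustration of the inference flow, the sketch below shows what a `predict_all()`-style fan-out could look like. It assumes the 12 classifiers are the 4 model families trained on each of the 3 datasets; the registry layout, the stub predictors, and the returned structure are all hypothetical and are not taken from the project's predict.py.

```python
from typing import Callable, Dict, Tuple

DATASETS = ("D1", "D2", "D3")
MODEL_FAMILIES = ("Logistic Regression", "SVM", "XGBoost", "XLM-RoBERTa")

Predictor = Callable[[str], str]


def make_stub(dataset: str, family: str) -> Predictor:
    """Hypothetical stand-in for a trained model; returns a placeholder label."""
    def predict(text: str) -> str:
        return f"{dataset}-placeholder-label"
    return predict


# One entry per (dataset, model family) pair: 3 x 4 = 12 classifiers.
REGISTRY: Dict[Tuple[str, str], Predictor] = {
    (d, m): make_stub(d, m) for d in DATASETS for m in MODEL_FAMILIES
}


def predict_all(text: str) -> Dict[str, Dict[str, str]]:
    """Send the same input to every classifier and group results by dataset."""
    results: Dict[str, Dict[str, str]] = {d: {} for d in DATASETS}
    for (dataset, family), model in REGISTRY.items():
        results[dataset][family] = model(text)
    return results


output = predict_all("I feel exhausted, nothing feels enjoyable")
for dataset, per_model in output.items():
    print(dataset, per_model)
```

Each classifier sees the same raw text and answers independently, so the caller can compare the models side by side rather than receiving a single merged verdict.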
Answers to common questions, grouped by topic for quick navigation during Q&A.
SMOTE was applied to the training data only, never the test set. SMOTE interpolates new synthetic samples in TF-IDF feature space between existing minority-class examples. Class weighting was also evaluated; SMOTE showed equal or better Macro F1 in cross-validation. D3 was pre-balanced and required no oversampling.

TF-IDF settings:
- max_features=50,000: covers the full relevant vocabulary without noise.
- ngram_range=(1,2): unigrams plus bigrams capture local phrases ("not happy", "kill myself") that unigrams alone miss.
- sublinear_tf=True: replaces the raw term frequency tf with 1 + log(tf) to dampen high-frequency word dominance.
- min_df=2: drops terms that appear in fewer than two documents, mostly hapax legomena (words appearing only once) that add noise.

XLM-RoBERTa fine-tuning: lr=2e-5, 3 epochs, linear warmup scheduler. Max token length: 128 for D1/D2 (Twitter-length text), 256 for D3 (Reddit posts average 200.8 words). Cross-entropy loss. Best checkpoint saved by validation accuracy. 278M parameters; multilingual pretraining covers 100 languages.

The demo proxies requests from Flask to a HuggingFace Space (esvanth-mindscan.hf.space), which runs predict.py with all 12 loaded models. There is no hardcoded data; every input goes through the full pipeline. If the Space is sleeping, it auto-wakes within ~60 seconds.
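The training-only oversampling described in the SMOTE answer above can be sketched as follows. This is a simplified, pure-Python illustration of SMOTE-style interpolation, not the project's code: the function name `smote_like`, the toy dense vectors standing in for TF-IDF rows, and the choice of `k` are assumptions. In practice a library implementation such as imbalanced-learn's `SMOTE` would typically be used; the source does not say which was.

```python
import math
import random


def smote_like(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each on the line segment between a
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # Nearest neighbours of x among the *other* minority samples.
        others = [m for j, m in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda m: math.dist(x, m))[:k]
        nb = rng.choice(neighbours)
        u = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + u * (b - a) for a, b in zip(x, nb)])
    return synthetic


# Hypothetical toy data: class 0 is the majority, class 1 the minority.
X_train = [[0.0, 1.0], [0.1, 0.9], [0.2, 1.0], [0.0, 0.8], [1.0, 0.0], [0.8, 0.2]]
y_train = [0, 0, 0, 0, 1, 1]
X_test = [[0.1, 1.0], [0.9, 0.1]]  # held out; never resampled
y_test = [0, 1]

minority = [x for x, y in zip(X_train, y_train) if y == 1]
n_needed = y_train.count(0) - y_train.count(1)
new_points = smote_like(minority, n_needed, k=1)

# Only the training split grows; the test split keeps its original distribution.
X_bal = X_train + new_points
y_bal = y_train + [1] * n_needed
print(len(X_bal), y_bal.count(0), y_bal.count(1), len(X_test))
```

Keeping the test split untouched means reported scores reflect the original class distribution rather than synthetic samples.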