MindScan: Complete Project Reference
NCI H9DAI · MSc Artificial Intelligence · 2026
Disclaimer: This is a research prototype built for academic coursework. It is not a clinical tool and must never be used for actual medical diagnosis or mental health assessment.
Table of Contents
- Project Purpose
- Tech Stack
- Directory Structure
- Core Features
- Components & Modules
- Data Models & API Contract
- API Endpoints
- Configuration & Setup
- AI / ML Architecture
- Entry Points & Running the App
- Dependencies
- Testing & Evaluation
- Key Findings & Anomalies
- Research Context
1. Project Purpose
MindScan is a multi-model mental health text analysis system. It runs 12 machine learning classifiers simultaneously across 3 independent datasets, returning three clinically distinct assessments from a single text input:
| Assessment | Task | Classes |
|---|---|---|
| Depression Type | Multi-class classification | postpartum, major depressive, bipolar, psychotic, no depression, atypical |
| Binary Depression | Binary classification | Depressed / Not Depressed |
| Suicide Risk | Binary classification | Suicide Risk / No Suicide Risk |
Research goal: Extend Tumaliuan et al. (2024) with modern transformer embeddings (XLM-RoBERTa), strong classical baselines (SVM), and SMOTE balancing, raising F1 from the baseline's 0.81 to 0.9269 (+0.1169 absolute, about 14.4% relative).
Key architectural decision (parallel, not sequential): All 3 datasets run independently. Suicidal ideation can exist without depression markers; a sequential pipeline would gate out those cases entirely. Research Question 4 (RQ4) explicitly tests whether the parallel design catches cases that sequential would miss.
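The difference between the two designs can be illustrated with stub predictions. This is a minimal sketch, not code from predict.py: `run_parallel`, `run_sequential`, and the dictionary keys are hypothetical names, and the gating rule is the example mentioned in this document (only check suicide risk if depression is detected).

```python
def run_parallel(preds: dict) -> dict:
    """Parallel design: every task's output is reported regardless of the others."""
    return dict(preds)


def run_sequential(preds: dict) -> dict:
    """Hypothetical gated design: suicide risk is only evaluated for 'Depressed' inputs."""
    out = {"binary_depression": preds["binary_depression"]}
    if preds["binary_depression"] == "Depressed":
        out["suicide_risk"] = preds["suicide_risk"]
    else:
        out["suicide_risk"] = None  # task never evaluated
    return out


# Stub model outputs for a text with suicidal ideation but no depression markers.
preds = {"binary_depression": "Not Depressed", "suicide_risk": "Suicide Risk"}

parallel = run_parallel(preds)
sequential = run_sequential(preds)

# The case is visible in the parallel output but gated out of the sequential one.
missed_by_sequential = (
    parallel["suicide_risk"] == "Suicide Risk" and sequential["suicide_risk"] is None
)
print(missed_by_sequential)
```

Counting such disagreements over a labelled test set is one way RQ4 could be quantified.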
2. Tech Stack
| Layer | Technology | Version |
|---|---|---|
| Web framework | Flask | 3.0.3 |
| Classical ML | scikit-learn | 1.6.1 |
| Gradient boosting | XGBoost | 2.0.3 |
| Transformer models | HuggingFace Transformers | 4.41.2 |
| Deep learning runtime | PyTorch | 2.3.0 |
| Model serialization | joblib | 1.4.2 |
| Numerical ops | NumPy | 1.26.4 |
| Frontend | HTML5 + vanilla JS + CSS3 | - |
| Fonts | Instrument Serif, Geist, DM Mono | - |
No database. No frontend framework. All model state is held in memory after startup.
3. Directory Structure
MindScan/
├── app.py                              # Flask entry point (94 lines)
├── predict.py                          # Core prediction logic (303 lines)
├── requirements.txt                    # Python dependencies (7 packages)
├── README.md                           # Quick-start guide
├── ABOUT.md                            # This file: full project reference
├── .gitignore                          # models/ excluded (too large for git)
│
├── templates/
│   └── index.html                      # Single-page web UI
│
├── notebooks/
│   ├── DA_Notebook_One.ipynb           # Classical model training (2,269 lines)
│   └── DA_2_Notebook.ipynb             # XLM-RoBERTa training (13,178 lines)
│
├── models/
│   ├── classical/
│   │   ├── le_d1.pkl                   # LabelEncoder, D1 (543 bytes)
│   │   ├── le_d2.pkl                   # LabelEncoder, D2
│   │   ├── le_d3.pkl                   # LabelEncoder, D3
│   │   ├── tfidf_d1.pkl                # TF-IDF vectorizer, D1 (1.4 MB, 34,615 features)
│   │   ├── tfidf_d2.pkl                # TF-IDF vectorizer, D2 (569 KB, 50,000 features)
│   │   ├── tfidf_d3.pkl                # TF-IDF vectorizer, D3 (2.3 MB, 60,000 features)
│   │   ├── logistic_regression_d1.pkl  # LR, D1 (1.6 MB)
│   │   ├── logistic_regression_d2.pkl  # LR, D2 (120 KB)
│   │   ├── logistic_regression_d3.pkl  # LR, D3 (470 KB)
│   │   ├── svm_d1.pkl                  # LinearSVC, D1 (1.6 MB)
│   │   ├── svm_d2.pkl                  # LinearSVC, D2 (120 KB)
│   │   ├── svm_d3.pkl                  # LinearSVC, D3 (470 KB)
│   │   ├── xgboost_d1.pkl              # XGBoost, D1 (3.1 MB)
│   │   ├── xgboost_d2.pkl              # XGBoost, D2 (362 KB)
│   │   ├── xgboost_d3.pkl              # XGBoost, D3 (702 KB)
│   │   ├── random_forest_d1/d2/d3.pkl  # RF, NOT deployed (241 MB + 72 MB + 334 MB)
│   │   ├── classical_results.csv       # Performance metrics table
│   │   └── *.png                       # Confusion matrices + EDA plots (16 images)
│   │
│   └── transformers/
│       ├── xlmr_d1_final/              # Fine-tuned XLM-RoBERTa, D1 (1.1 GB)
│       │   ├── config.json             # Model architecture config
│       │   ├── model.safetensors       # Weights (1.1 GB)
│       │   ├── tokenizer.json          # SentencePiece tokenizer (17 MB)
│       │   └── tokenizer_config.json   # Tokenizer metadata
│       ├── xlmr_d2_final/              # Fine-tuned XLM-RoBERTa, D2 (1.1 GB)
│       └── xlmr_d3_final/              # Fine-tuned XLM-RoBERTa, D3 (1.1 GB)
│
├── venv/                               # Python virtual environment
└── .venv/                              # Backup venv (both in .gitignore)
Total disk usage: ~3.2 GB (dominated by 3 × 1.1 GB transformer weights)
4. Core Features
Four Models Per Dataset (12 Total)
Each of the 3 datasets is evaluated by 4 independent models. All run on every request:
- Logistic Regression (TF-IDF input)
- SVM / LinearSVC (TF-IDF input)
- XGBoost (TF-IDF input)
- XLM-RoBERTa fine-tuned (raw text input)
Risk Aggregation
- If 3 or more of the 4 Dataset 3 models flag suicide risk → risk_flag = true
- UI renders a red danger banner
- Response includes "suicide_votes": "X/4 models flagged suicide risk"
Text Preprocessing Pipeline
Applied to all input before TF-IDF vectorization (raw text passed to transformers):
lowercase → remove URLs (http/www/https) → strip @mentions
→ remove # symbols (word kept) → delete punctuation → normalize whitespace
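The pipeline above can be sketched with the standard library. This is an illustrative reconstruction from the step list, not the actual body of clean_text in predict.py; the exact regular expressions are assumptions.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text: str) -> str:
    """Apply the documented cleaning steps in order (regexes are assumed)."""
    text = text.lower()                                   # lowercase
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)    # remove URLs
    text = re.sub(r"@\w+", " ", text)                     # strip @mentions
    text = text.replace("#", "")                          # drop '#', keep the word
    text = text.translate(_PUNCT_TABLE)                   # delete punctuation
    return re.sub(r"\s+", " ", text).strip()              # normalize whitespace


print(clean_text("Check THIS out: https://t.co/abc @user #Sad   day!!"))
```

Whatever the real implementation looks like, the same function has to be used at training and inference time, otherwise the TF-IDF vocabulary will not match the incoming tokens.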
UI Features (index.html)
- Live demo textarea with 5000-character limit and real-time counter
- Sample text buttons for quick testing
- Results display: winner card + 4 model confidence bars per dataset
- Class probability breakdown (expandable)
- Risk flag banner (red = danger, green = safe)
- CRISP-DM interactive timeline (6 stages, collapsible detail panels)
- Dataset explorer with class distribution bars
- Model card grid with F1 scores
- Project folder tree with file detail pane
- Animated stat counters in hero section
- Comparison panel vs Tumaliuan et al. (2024) baseline
5. Components & Modules
app.py (94 lines)
Flask application. Responsibilities:
- Initializes Flask app
- Calls load_all_models() at startup (blocks until complete)
- Defines 3 routes: /, /predict, /health
- Input validation: max 5000 chars, non-empty, valid JSON
- Prints startup progress with emoji checkmarks to console
- Serves on 0.0.0.0:5000 (accessible on local network)
predict.py (303 lines)
Core prediction engine. Key functions:
| Function | Purpose |
|---|---|
| load_all_models() | Loads all 12 models + encoders + tokenizer into _models global dict |
| clean_text(text) | Regex-based text cleaning (same logic used in both training notebooks) |
| predict_classical(text, ds) | TF-IDF vectorization + sklearn predict / decision_function |
| predict_transformer(text, ds) | Tokenization → forward pass → softmax probabilities |
| predict_all(raw_text) | Main orchestrator: cleans text, runs all 12 models, returns full result dict |
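A rough picture of what the orchestrator does for one dataset is sketched below with stub models in place of the loaded classifiers. `predict_dataset` and `stub_models` are hypothetical names, and the winner rule (highest reported confidence) is an assumption inferred from the sample response in Section 6, not confirmed from predict.py.

```python
from typing import Callable, Dict, Tuple

Prediction = Tuple[str, float]          # (label, confidence in [0, 1])
Model = Callable[[str], Prediction]


def predict_dataset(text: str, models: Dict[str, Model], task: str) -> dict:
    """Run every model for one dataset and summarise the outputs."""
    outputs = {}
    for name, model in models.items():
        label, confidence = model(text)
        outputs[name] = {"label": label, "confidence": round(confidence, 3)}
    # Assumption: the "winner" is the model reporting the highest confidence.
    winner = max(outputs, key=lambda name: outputs[name]["confidence"])
    return {
        "task": task,
        "models": outputs,
        "winner_model": winner,
        "winner_prediction": outputs[winner]["label"],
        "winner_confidence": outputs[winner]["confidence"],
    }


# Stub models returning the confidences shown in the sample response.
stub_models = {
    "Logistic Regression": lambda text: ("postpartum", 0.958),
    "SVM": lambda text: ("postpartum", 0.828),
    "XGBoost": lambda text: ("postpartum", 0.999),
    "XLM-RoBERTa": lambda text: ("postpartum", 0.997),
}

result = predict_dataset("example text", stub_models, "Depression Type (6 Classes)")
print(result["winner_model"], result["winner_confidence"])
```

Calling the same helper once per dataset, with no dependency between the calls, is what makes the design parallel in the sense used in this document.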
Confidence normalization: All models report confidence on a 0 to 1 scale:
- LR / XGBoost: predict_proba() (native)
- SVM: softmax(decision_function()) (LinearSVC does not implement predict_proba)
- Transformer: softmax(logits)
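The SVM case deserves a closer look, because decision_function returns one score per class for multi-class problems but a single signed score for binary ones. The sketch below is a pure-Python illustration; `svm_confidences` is a hypothetical helper, and expanding the binary score to [-s, +s] before the softmax is an assumption about how predict.py handles D2 and D3.

```python
import math
from typing import List, Sequence, Union


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def svm_confidences(decision: Union[float, Sequence[float]]) -> List[float]:
    """Turn one sample's decision_function output into scores that sum to 1.

    Multi-class: one margin per class, passed straight to softmax.
    Binary: a single signed margin s, expanded to [-s, +s] (assumed here),
    which makes the positive-class score equal to sigmoid(2 * s).
    """
    if isinstance(decision, (int, float)):
        return softmax([-decision, decision])
    return softmax(list(decision))


print(svm_confidences([2.0, 1.0, 0.1]))  # multi-class margins (D1 style)
print(svm_confidences(1.5))              # binary margin (D2 / D3 style)
```

These values are useful for ranking and for drawing confidence bars, but they are not calibrated probabilities and are not directly comparable to predict_proba() output.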
D2 label mapping: Raw integer labels (0, 1) mapped to "Not Depressed" / "Depressed". Handles both str and int types for robustness.
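A minimal sketch of such a mapping, assuming the encoder may hand back either an integer or its string form; `D2_LABELS` and `map_d2_label` are hypothetical names, not identifiers from predict.py.

```python
D2_LABELS = {0: "Not Depressed", 1: "Depressed"}


def map_d2_label(raw) -> str:
    """Accept 0/1 as int or str (for example "1") and return the display label."""
    return D2_LABELS[int(raw)]


print(map_d2_label(1), "|", map_d2_label("0"))
```

Any value outside 0 and 1 raises an error here, which is arguably preferable to silently mislabelling a prediction.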
templates/index.html
Single-page application. All UI logic in vanilla JS:
- fetch('/predict', { method: 'POST', ... }): AJAX prediction call
- Tab switching, progress bar animation, accordion expand/collapse
- Counter animations for stats in hero section
- No build step, no bundler, no external JS framework
6. Data Models & API Contract
Request
POST /predict
Content-Type: application/json
{ "text": "string, max 5000 characters" }
Response
{
  "dataset1": {
    "task": "Depression Type (6 Classes)",
    "models": {
      "Logistic Regression": { "label": "postpartum", "confidence": 0.958 },
      "SVM": { "label": "postpartum", "confidence": 0.828 },
      "XGBoost": { "label": "postpartum", "confidence": 0.999 },
      "XLM-RoBERTa": { "label": "postpartum", "confidence": 0.997 }
    },
    "winner_model": "XGBoost",
    "winner_prediction": "postpartum",
    "winner_confidence": 0.999,
    "class_probs": {
      "postpartum": 0.997,
      "bipolar": 0.001,
      "major depressive": 0.001,
      "psychotic": 0.0,
      "no depression": 0.0,
      "atypical": 0.001
    }
  },
  "dataset2": {
    "task": "Binary Depression Detection",
    "models": { ... },
    "winner_model": "XLM-RoBERTa",
    "winner_prediction": "Depressed",
    "winner_confidence": 0.998,
    "class_probs": { "Depressed": 0.998, "Not Depressed": 0.002 }
  },
  "dataset3": {
    "task": "Suicide Risk Assessment",
    "models": { ... },
    "winner_model": "XLM-RoBERTa",
    "winner_prediction": "Suicide Risk",
    "winner_confidence": 0.993,
    "class_probs": { "Suicide Risk": 0.993, "No Suicide Risk": 0.007 }
  },
  "risk_flag": true,
  "suicide_votes": "4/4 models flagged suicide risk",
  "processing_time_ms": 2341
}
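The risk_flag and suicide_votes fields follow the voting rule described under Risk Aggregation (3 or more of the 4 Dataset 3 models). A minimal sketch of that rule, where `aggregate_suicide_risk` is a hypothetical helper name rather than a function known to exist in predict.py:

```python
from typing import Iterable


def aggregate_suicide_risk(d3_labels: Iterable[str], threshold: int = 3) -> dict:
    """Count Dataset 3 models that predicted 'Suicide Risk' and apply the threshold."""
    labels = list(d3_labels)
    votes = sum(1 for label in labels if label == "Suicide Risk")
    return {
        "risk_flag": votes >= threshold,
        "suicide_votes": f"{votes}/{len(labels)} models flagged suicide risk",
    }


# Labels from LR, SVM, XGBoost and XLM-RoBERTa for one input (stub values).
print(aggregate_suicide_risk(["Suicide Risk", "Suicide Risk", "No Suicide Risk", "Suicide Risk"]))
```

With a threshold of 3, a 2/4 split leaves risk_flag false even though half of the models raised a flag, so the vote string is worth surfacing alongside the boolean.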
Internal Model State (_models dict in predict.py)
| Key | Type | Description |
|---|---|---|
| le_d1/d2/d3 | LabelEncoder | Decodes integer predictions to class names |
| tfidf_d1/d2/d3 | TfidfVectorizer | Converts cleaned text to sparse feature vectors |
| logistic_regression_d1/d2/d3 | LogisticRegression | Linear baseline |
| svm_d1/d2/d3 | LinearSVC | SVM classifier |
| xgboost_d1/d2/d3 | XGBClassifier | Gradient boosting |
| tokenizer | XLMRobertaTokenizer | Shared SentencePiece tokenizer (all 3 transformer models) |
| xlmr_d1/d2/d3 | XLMRobertaForSequenceClassification | Fine-tuned transformer |
| xlmr_d1/d2/d3_len | int | Max token length: 128 / 128 / 256 |
| device | str | 'cuda' or 'cpu' |
7. API Endpoints
GET /
Returns index.html. No parameters.
POST /predict
| Scenario | HTTP Status | Response |
|---|---|---|
| Success | 200 | Full prediction JSON (see above) |
| Missing text field | 400 | { "error": "..." } |
| Empty text | 400 | { "error": "..." } |
| Text > 5000 chars | 400 | { "error": "..." } |
| Models not loaded yet | 503 | { "error": "..." } |
| Prediction exception | 500 | { "error": "..." } |
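The 400 and 503 rows of the table can be expressed as a framework-independent check. This is a sketch only: `validate_payload` is a hypothetical helper, the order of the checks is assumed, and the error strings are placeholders because the real messages are not given in this document.

```python
import json

MAX_CHARS = 5000  # limit stated in this document


def validate_payload(raw_body: bytes, models_ready: bool = True):
    """Return (status_code, payload); payload carries the text on 200, else an error."""
    if not models_ready:
        return 503, {"error": "models are still loading"}          # placeholder message
    try:
        data = json.loads(raw_body)
    except (ValueError, TypeError):
        return 400, {"error": "body must be valid JSON"}           # placeholder message
    if not isinstance(data, dict) or "text" not in data:
        return 400, {"error": "missing 'text' field"}              # placeholder message
    text = data["text"]
    if not isinstance(text, str) or not text.strip():
        return 400, {"error": "text must be a non-empty string"}   # placeholder message
    if len(text) > MAX_CHARS:
        return 400, {"error": f"text exceeds {MAX_CHARS} characters"}
    return 200, {"text": text}


print(validate_payload(b'{"text": "I feel fine today"}'))
```

The 500 row is not covered here; it would come from wrapping the prediction call itself in a try/except inside the route handler.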
Typical latency: ~2 to 3 seconds on CPU (XLM-RoBERTa dominates inference time)
GET /health
{ "status": "ok", "models_ready": true }
Use for polling during the ~30-second startup window.
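A client could poll the endpoint as sketched below, using only the standard library. The function names, the 60-second timeout, and the 2-second interval are illustrative choices; connection errors are treated as "not ready yet" on the assumption that the server may not accept requests while models load.

```python
import json
import time
import urllib.request


def fetch_health(url: str = "http://localhost:5000/health") -> dict:
    """GET /health and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def wait_until_ready(fetch=fetch_health, timeout_s=60.0, interval_s=2.0, sleep=time.sleep) -> bool:
    """Poll until models_ready is true or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if fetch().get("models_ready"):
                return True
        except OSError:
            pass  # connection refused or timed out: server not up yet
        sleep(interval_s)
    return False


# Example with a fake fetcher so the snippet runs without a server.
responses = iter([{"status": "ok", "models_ready": False},
                  {"status": "ok", "models_ready": True}])
print(wait_until_ready(fetch=lambda: next(responses), interval_s=0, sleep=lambda s: None))
```

Passing `fetch` and `sleep` as parameters keeps the loop testable without a running Flask process.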
8. Configuration & Setup
Environment Variables
None required. All paths computed relative to app.py using os.path.dirname(__file__).
Setup Steps
# 1. Download models from Google Drive and place them in models/classical/ and models/transformers/
# 2. Create virtual environment
python -m venv venv
source venv/bin/activate # Mac / Linux
venv\Scripts\activate # Windows
# 3. Install dependencies
pip install -r requirements.txt
# 4. Start server
python app.py
# 5. Open browser
# http://localhost:5000
Expected Startup Console Output
=======================================================
MindScan - Starting up
=======================================================
Loading models... (XLM-RoBERTa takes ~30s on CPU)
✓ Loaded encoders/tfidf for d1
✓ Loaded encoders/tfidf for d2
✓ Loaded encoders/tfidf for d3
✓ Loaded logistic_regression_d1
✓ Loaded svm_d1
✓ Loaded xgboost_d1
... (repeated for d2, d3)
✓ Using device: cpu
✓ Tokeniser loaded
✓ Loaded XLM-RoBERTa d1 (max_length=128)
✓ Loaded XLM-RoBERTa d2 (max_length=128)
✓ Loaded XLM-RoBERTa d3 (max_length=256)
✅ All models ready
Open: http://localhost:5000
=======================================================
.gitignore Exclusions
- models/: All trained model files (too large for git, download separately)
- venv/, .venv/: Virtual environments
- __pycache__/, *.pyc, *.pyo
- .ipynb_checkpoints/
- .DS_Store, Thumbs.db
9. AI / ML Architecture
The Three Datasets
| | D1 | D2 | D3 |
|---|---|---|---|
| Source | Nusrat et al. (2024), Zenodo 14233292 | albertobellardini, Kaggle | nikhileswarkomati, Kaggle |
| Platform | Twitter | Twitter | Reddit |
| Size | 14,983 tweets | 10,314 tweets | 50,000 posts |
| Task | 6-class depression type | Binary depression | Binary suicide risk |
| Avg text length | 31.4 words | ~30 words | 62 to 200 words (by class) |
| Class balance | Imbalanced (1.89×) | Severely imbalanced (3.46×) | Balanced (1.0×) |
| SMOTE applied | Yes: 11,986 → 17,982 | Yes: 8,251 → 12,800 | No |
All datasets use stratified 80/20 train/test split, random_state=42. Test sets are never touched by SMOTE (realistic evaluation).
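The order of operations (split first, oversample the training split only) is illustrated below on toy data. This is a pure-Python stand-in and not the notebooks' code: `stratified_split` and `smote_like` are hypothetical helpers, and `smote_like` implements only the core SMOTE idea of interpolating between a sample and one of its nearest same-class neighbours.

```python
import random
from collections import Counter


def stratified_split(labels, test_frac=0.2, seed=42):
    """Return (train_idx, test_idx), holding out test_frac of every class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_frac)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx


def smote_like(points, n_new, k=3, seed=42):
    """Create n_new synthetic points on segments between same-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(points))
        base = points[i]
        others = [p for j, p in enumerate(points) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])   # one of the k nearest neighbours
        gap = rng.random()                   # position along base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy data: 20 majority points (class 0) and 10 minority points (class 1).
X = [[float(i), float(i % 3)] for i in range(30)]
y = [0] * 20 + [1] * 10

train_idx, test_idx = stratified_split(y)
X_train = [X[i] for i in train_idx]
y_train = [y[i] for i in train_idx]

# Oversample the minority class in the training split only; test_idx is never touched.
counts = Counter(y_train)
minority = [x for x, label in zip(X_train, y_train) if label == 1]
new_points = smote_like(minority, counts[0] - counts[1])
X_train += new_points
y_train += [1] * len(new_points)

print(Counter(y_train), len(test_idx))
```

Oversampling before the split would leak synthetic neighbours of test samples into training, which is why the test set is held out first.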
TF-IDF Feature Dimensions
| Dataset | Features |
|---|---|
| D1 | 34,615 |
| D2 | 50,000 |
| D3 | 60,000 |
XLM-RoBERTa Architecture
- Base model: xlm-roberta-base
- Parameters: 278 million
- Layers: 12 transformer layers, 12 attention heads
- Pre-training data: 2.5 TB of text across 100 languages
- Fine-tuning: 3 epochs on Google Colab T4 GPU
- Weight format: safetensors (more secure and efficient than .bin)
- Max token lengths: D1 = 128, D2 = 128, D3 = 256
- D3 uses 256 because Reddit posts average 200.8 words for the suicide class (3.2× longer than non-suicidal posts at 62.2 words)
- Tokenizer: Shared single instance across all 3 models (not duplicated)
Why SVM Beats Transformer on D1
XLM-RoBERTa's contextual embeddings need enough tokens per input to show an advantage over bag-of-words TF-IDF. D1 tweets average only 31.4 words, which is likely too short for context to matter. SVM achieves F1=0.9269 on D1, while XLM-RoBERTa scores lower. The transformer's advantage grows with text length: it is the best model on D3, where suicide-class posts average about 200 words.
Why Random Forest Was Excluded from Deployment
| Model | D1 F1 | D3 F1 | Aggregate Size |
|---|---|---|---|
| XGBoost | competitive | competitive | ~4.2 MB |
| Random Forest | worst on D1 | worst on D3 | 647 MB |
Size penalty not justified by performance. Random Forest .pkl files remain in models/classical/ but are never loaded by predict.py.
Performance Results
| Dataset | Model | Macro F1 | Cohen's κ |
|---|---|---|---|
| D1: Depression Type | SVM | 0.9269 | 0.9072 |
| D1: Depression Type | XGBoost | ~0.90 | - |
| D1: Depression Type | XLM-RoBERTa | lower | - |
| D2: Binary Depression | XLM-RoBERTa | 0.9993 | 0.9986 |
| D3: Suicide Risk | XLM-RoBERTa | 0.9810 | 0.9620 |
| Baseline (Tumaliuan 2024) | - | 0.81 | - |
Improvement over baseline: +0.1169 F1 (0.81 → 0.9269), about 14.4% relative
10. Entry Points & Running the App
Production (Local)
python app.py
Opens on http://localhost:5000. Server also accessible on local network via http://<your-ip>:5000.
Training (Notebooks β Google Colab Only)
| Notebook | Purpose | Runtime Required |
|---|---|---|
| notebooks/DA_Notebook_One.ipynb | Train LR, SVM, XGBoost on all 3 datasets; generate metrics CSV and confusion matrix PNGs | CPU (Colab free tier) |
| notebooks/DA_2_Notebook.ipynb | Fine-tune XLM-RoBERTa on all 3 datasets; run full model comparison | T4 GPU (Colab) |
Both notebooks save outputs to Google Drive at MindScan_Models/.
11. Dependencies
flask==3.0.3          Web framework + routing
scikit-learn==1.6.1   LR, LinearSVC, TfidfVectorizer, LabelEncoder, evaluation metrics (SMOTE is not part of scikit-learn; it ships in the separate imbalanced-learn package)
xgboost==2.0.3        Gradient boosting classifier
transformers==4.41.2  XLM-RoBERTa model + tokenizer (HuggingFace)
torch==2.3.0          PyTorch runtime (GPU optional; CUDA auto-detected)
joblib==1.4.2         Pickle serialization for large sklearn objects
numpy==1.26.4         Numerical operations, softmax computation
No Node.js / npm dependencies. Pure Python backend, vanilla JS frontend (no build step).
12. Testing & Evaluation
No Automated Test Suite
The project has no pytest, unittest, or CI/CD pipeline. Evaluation is:
Quantitative (offline, notebook):
- Macro F1 score (primary metric; averages per-class F1 with equal weight, so minority classes count as much as majority ones)
- Cohen's Kappa (measures agreement beyond chance; reported next to Macro F1 in the Performance Results table)
- Accuracy
- Confusion matrices (saved as PNG to models/classical/)
- classical_results.csv: full metrics table for all classical models
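For readers unfamiliar with the two headline metrics, the sketch below computes them from scratch on a six-sample toy example. It is a didactic reimplementation of the standard definitions, not the notebooks' evaluation code, which presumably relies on a library.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def cohen_kappa(y_true, y_pred):
    """(observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(y_true)
    p_observed = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_chance = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_observed - p_chance) / (1 - p_chance)  # undefined when p_chance == 1


y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1]
print(round(macro_f1(y_true, y_pred), 4), round(cohen_kappa(y_true, y_pred), 4))
```

Here accuracy is 5/6, but kappa is lower because half of that agreement would be expected by chance given the label frequencies.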
Visual (EDA plots):
- eda_d1.png, eda_d2.png, eda_d3.png: class distributions and text length histograms
Manual (UI):
- Sample text buttons in the live demo for smoke-testing the prediction pipeline
- All 4 model predictions + confidence bars shown simultaneously
13. Key Findings & Anomalies
SVM beats XLM-RoBERTa on D1: Short tweets (31.4 words avg) don't provide enough context for transformer embeddings to outperform TF-IDF bag-of-words. Classical ML is not always inferior to modern deep learning.
D3 text length asymmetry: Suicide posts (200.8 words avg) are 3.2× longer than non-suicidal posts (62.2 words). This drove the max_length=256 decision for the D3 transformer.
Near-perfect D2 score (F1=0.9993): Binary depression on tweets is almost perfectly separable with XLM-RoBERTa, likely due to strong lexical signals in the dataset.
Parallel architecture prevents missed cases: Sequential gating (e.g., only check suicide if depression detected) would miss suicidal ideation in people who show no depression markers. All 3 tasks always run.
Confidence computation differs by model type: SVM uses softmax(decision_function()) because LinearSVC lacks native probability calibration. All outputs are normalized to the 0 to 1 range for UI consistency.
Transformer weights in safetensors format: Newer, more secure format vs. PyTorch .bin. Resists pickle deserialization attacks.
SMOTE only on training data: Oversampling is applied only to training splits. Test sets remain unmodified to reflect real-world class distributions.
Random Forest technically present but never loaded: The .pkl files exist in models/classical/ but predict.py has no code path that loads them.
14. Research Context
Project: NCI H9DAI, Data Analytics for Artificial Intelligence
Degree: MSc Artificial Intelligence
Year: 2026
Methodology: CRISP-DM (6 stages: Business Understanding → Data Understanding → Data Preparation → Modelling → Evaluation → Deployment)
Baseline paper: Tumaliuan et al. (2024), depression detection on Filipino Twitter, F1=0.81
Research Questions:
- RQ1: Can classical ML (SVM, LR, XGBoost) exceed the 0.81 baseline?
- RQ2: Can XLM-RoBERTa further improve on classical ML?
- RQ3: Does SMOTE balancing improve F1 on imbalanced datasets?
- RQ4: Does the parallel architecture catch cases a sequential pipeline would miss?
Datasets used (English/multilingual, broader than baseline's Filipino-only scope):
- D1: Zenodo 14233292 (Nusrat et al.)
- D2: Kaggle (albertobellardini)
- D3: Kaggle (nikhileswarkomati)
Key contributions over baseline:
- Multi-dataset parallel evaluation (vs. single dataset)
- XLM-RoBERTa multilingual transformer (vs. no transformer)
- SMOTE balancing (vs. no balancing strategy)
- Cohen's Kappa reporting (vs. accuracy/F1 only)
- Explainable per-model confidence scores in UI