
MindScan: Complete Project Reference

NCI H9DAI · MSc Artificial Intelligence · 2026

Disclaimer: This is a research prototype built for academic coursework. It is not a clinical tool and must never be used for actual medical diagnosis or mental health assessment.


Table of Contents

  1. Project Purpose
  2. Tech Stack
  3. Directory Structure
  4. Core Features
  5. Components & Modules
  6. Data Models & API Contract
  7. API Endpoints
  8. Configuration & Setup
  9. AI / ML Architecture
  10. Entry Points & Running the App
  11. Dependencies
  12. Testing & Evaluation
  13. Key Findings & Anomalies
  14. Research Context

1. Project Purpose

MindScan is a multi-model mental health text analysis system. It runs 12 machine learning classifiers (4 per dataset, trained on 3 independent datasets) on every input and returns three clinically distinct assessments from a single text:

| Assessment | Task | Classes |
| --- | --- | --- |
| Depression Type | Multi-class classification | postpartum, major depressive, bipolar, psychotic, no depression, atypical |
| Binary Depression | Binary classification | Depressed / Not Depressed |
| Suicide Risk | Binary classification | Suicide Risk / No Suicide Risk |

Research goal: Extend Tumaliuan et al. (2024) with modern transformer embeddings (XLM-RoBERTa), classical ML gold standards (SVM), and SMOTE balancing, achieving a +12.7% F1 improvement (0.81 → 0.9269) over the baseline.

Key architectural decision: parallel, not sequential. The models for all 3 datasets run independently on every input. Suicidal ideation can exist without depression markers; a sequential pipeline would gate out those cases entirely. Research Question 4 (RQ4) explicitly tests whether the parallel design catches cases that a sequential one would miss.
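The difference between the two designs can be illustrated with a small sketch. The classifiers below are toy keyword stand-ins and `run_sequential` is a hypothetical gated pipeline written only for comparison; neither is taken from predict.py.

```python
from typing import Callable, Dict

Classifier = Callable[[str], str]


def run_parallel(text: str, tasks: Dict[str, Classifier]) -> Dict[str, str]:
    """Run every task on the text, regardless of what the other tasks return."""
    return {name: clf(text) for name, clf in tasks.items()}


def run_sequential(text: str, depression_clf: Classifier,
                   suicide_clf: Classifier) -> Dict[str, str]:
    """Hypothetical gated design: the suicide check runs only after a positive depression result."""
    result = {"depression": depression_clf(text)}
    if result["depression"] == "Depressed":
        result["suicide"] = suicide_clf(text)
    return result


# Toy stand-ins for trained models (assumption: simple keyword rules).
def toy_depression(text: str) -> str:
    return "Depressed" if "hopeless" in text.lower() else "Not Depressed"


def toy_suicide(text: str) -> str:
    return "Suicide Risk" if "end it all" in text.lower() else "No Suicide Risk"


text = "Most days are fine, but lately I think about how to end it all."
parallel = run_parallel(text, {"depression": toy_depression, "suicide": toy_suicide})
sequential = run_sequential(text, toy_depression, toy_suicide)
```

Here `parallel` contains a suicide assessment, while `sequential` never reaches the suicide classifier because the depression gate returned "Not Depressed".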


2. Tech Stack

| Layer | Technology | Version |
| --- | --- | --- |
| Web framework | Flask | 3.0.3 |
| Classical ML | scikit-learn | 1.6.1 |
| Gradient boosting | XGBoost | 2.0.3 |
| Transformer models | HuggingFace Transformers | 4.41.2 |
| Deep learning runtime | PyTorch | 2.3.0 |
| Model serialization | joblib | 1.4.2 |
| Numerical ops | NumPy | 1.26.4 |
| Frontend | HTML5 + vanilla JS + CSS3 | - |
| Fonts | Instrument Serif, Geist, DM Mono | - |

No database. No frontend framework. All model state is held in memory after startup.


3. Directory Structure

MindScan/
├── app.py                              # Flask entry point (94 lines)
├── predict.py                          # Core prediction logic (303 lines)
├── requirements.txt                    # Python dependencies (7 packages)
├── README.md                           # Quick-start guide
├── ABOUT.md                            # This file - full project reference
├── .gitignore                          # models/ excluded (too large for git)
│
├── templates/
│   └── index.html                      # Single-page web UI
│
├── notebooks/
│   ├── DA_Notebook_One.ipynb           # Classical model training (2,269 lines)
│   └── DA_2_Notebook.ipynb             # XLM-RoBERTa training (13,178 lines)
│
├── models/
│   ├── classical/
│   │   ├── le_d1.pkl                   # LabelEncoder - D1 (543 bytes)
│   │   ├── le_d2.pkl                   # LabelEncoder - D2
│   │   ├── le_d3.pkl                   # LabelEncoder - D3
│   │   ├── tfidf_d1.pkl                # TF-IDF vectorizer - D1 (1.4 MB, 34,615 features)
│   │   ├── tfidf_d2.pkl                # TF-IDF vectorizer - D2 (569 KB, 50,000 features)
│   │   ├── tfidf_d3.pkl                # TF-IDF vectorizer - D3 (2.3 MB, 60,000 features)
│   │   ├── logistic_regression_d1.pkl  # LR - D1 (1.6 MB)
│   │   ├── logistic_regression_d2.pkl  # LR - D2 (120 KB)
│   │   ├── logistic_regression_d3.pkl  # LR - D3 (470 KB)
│   │   ├── svm_d1.pkl                  # LinearSVC - D1 (1.6 MB)
│   │   ├── svm_d2.pkl                  # LinearSVC - D2 (120 KB)
│   │   ├── svm_d3.pkl                  # LinearSVC - D3 (470 KB)
│   │   ├── xgboost_d1.pkl              # XGBoost - D1 (3.1 MB)
│   │   ├── xgboost_d2.pkl              # XGBoost - D2 (362 KB)
│   │   ├── xgboost_d3.pkl              # XGBoost - D3 (702 KB)
│   │   ├── random_forest_d1/d2/d3.pkl  # RF - NOT deployed (241 MB + 72 MB + 334 MB)
│   │   ├── classical_results.csv       # Performance metrics table
│   │   └── *.png                       # Confusion matrices + EDA plots (16 images)
│   │
│   └── transformers/
│       ├── xlmr_d1_final/              # Fine-tuned XLM-RoBERTa - D1 (1.1 GB)
│       │   ├── config.json             # Model architecture config
│       │   ├── model.safetensors       # Weights (1.1 GB)
│       │   ├── tokenizer.json          # SentencePiece tokenizer (17 MB)
│       │   └── tokenizer_config.json   # Tokenizer metadata
│       ├── xlmr_d2_final/              # Fine-tuned XLM-RoBERTa - D2 (1.1 GB)
│       └── xlmr_d3_final/              # Fine-tuned XLM-RoBERTa - D3 (1.1 GB)
│
├── venv/                               # Python virtual environment
└── .venv/                              # Backup venv (both in .gitignore)

Total disk usage: ~3.2 GB (dominated by 3 × 1.1 GB transformer weights)


4. Core Features

Four Models Per Dataset (12 Total)

Each of the 3 datasets is evaluated by 4 independent models. All run on every request:

  1. Logistic Regression (TF-IDF input)
  2. SVM / LinearSVC (TF-IDF input)
  3. XGBoost (TF-IDF input)
  4. XLM-RoBERTa fine-tuned (raw text input)

Risk Aggregation

  • If 3 or more of the 4 Dataset 3 models flag suicide risk → risk_flag = true
  • UI renders a red danger banner
  • Response includes "suicide_votes": "X/4 models flagged suicide risk"
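The voting rule above can be sketched as follows. This is a minimal illustration, assuming the four Dataset 3 labels arrive as a dict keyed by model name; the function name `aggregate_suicide_votes` is hypothetical and not taken from predict.py.

```python
from typing import Dict


def aggregate_suicide_votes(d3_labels: Dict[str, str], threshold: int = 3) -> dict:
    """Count Dataset 3 models that predicted 'Suicide Risk' and apply the 3-of-4 rule."""
    votes = sum(1 for label in d3_labels.values() if label == "Suicide Risk")
    total = len(d3_labels)
    return {
        "risk_flag": votes >= threshold,
        "suicide_votes": f"{votes}/{total} models flagged suicide risk",
    }


example = aggregate_suicide_votes({
    "Logistic Regression": "Suicide Risk",
    "SVM": "Suicide Risk",
    "XGBoost": "No Suicide Risk",
    "XLM-RoBERTa": "Suicide Risk",
})
```

With three of four models voting for risk, `example["risk_flag"]` is true and the vote string reads "3/4 models flagged suicide risk".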

Text Preprocessing Pipeline

Applied to all input before TF-IDF vectorization (raw text passed to transformers):

lowercase → remove URLs (http/www/https) → strip @mentions
→ remove # symbols (word kept) → delete punctuation → normalize whitespace
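A minimal sketch of these steps, in the order listed. The regular expressions are assumptions; the actual patterns in predict.py's clean_text may differ.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text: str) -> str:
    """Apply the cleaning steps in the documented order (patterns are illustrative)."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)                   # strip @mentions
    text = text.replace("#", "")                        # drop the # symbol, keep the word
    text = text.translate(_PUNCT_TABLE)                 # delete punctuation
    return re.sub(r"\s+", " ", text).strip()            # normalize whitespace


cleaned = clean_text("Check https://example.com @user #Sad day!!  ")
# cleaned == "check sad day"
```

URLs and mentions are removed before punctuation is deleted, so that characters such as `:`, `/`, and `@` are still present for the patterns to match.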

UI Features (index.html)

  • Live demo textarea with 5000-character limit and real-time counter
  • Sample text buttons for quick testing
  • Results display: winner card + 4 model confidence bars per dataset
  • Class probability breakdown (expandable)
  • Risk flag banner (red = danger, green = safe)
  • CRISP-DM interactive timeline (6 stages, collapsible detail panels)
  • Dataset explorer with class distribution bars
  • Model card grid with F1 scores
  • Project folder tree with file detail pane
  • Animated stat counters in hero section
  • Comparison panel vs Tumaliuan et al. (2024) baseline

5. Components & Modules

app.py (94 lines)

Flask application. Responsibilities:

  • Initializes Flask app
  • Calls load_all_models() at startup (blocks until complete)
  • Defines 3 routes: /, /predict, /health
  • Input validation: max 5000 chars, non-empty, valid JSON
  • Prints startup progress with emoji checkmarks to console
  • Serves on 0.0.0.0:5000 (accessible on local network)

predict.py (303 lines)

Core prediction engine. Key functions:

| Function | Purpose |
| --- | --- |
| load_all_models() | Loads all 12 models, the encoders, and the tokenizer into the _models global dict |
| clean_text(text) | Regex-based text cleaning (same logic used in both training notebooks) |
| predict_classical(text, ds) | TF-IDF vectorization + sklearn predict / decision_function |
| predict_transformer(text, ds) | Tokenization → forward pass → softmax probabilities |
| predict_all(raw_text) | Main orchestrator: cleans text, runs all 12 models, returns full result dict |

Confidence normalization: All models normalize to 0–1:

  • LR / XGBoost: predict_proba() (native)
  • SVM: softmax(decision_function()) (LinearSVC has no proba by default)
  • Transformer: softmax(logits)
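The SVM case can be sketched with NumPy as below. The multi-class branch follows the softmax(decision_function()) description directly. The binary branch is an assumption: LinearSVC returns a single signed margin for two classes, and this sketch expands it to [-s, +s] before the softmax, which may not match what predict.py does.

```python
import numpy as np


def softmax(scores: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D score vector."""
    shifted = scores - np.max(scores)
    exp = np.exp(shifted)
    return exp / exp.sum()


def svm_confidences(decision_scores) -> np.ndarray:
    """Map the decision_function output for ONE sample to values in 0-1 that sum to 1."""
    scores = np.asarray(decision_scores, dtype=float).ravel()
    if scores.size == 1:
        # Assumption: binary case, one signed margin s; treat it as scores [-s, +s].
        scores = np.array([-scores[0], scores[0]])
    return softmax(scores)


multi = svm_confidences([2.0, 0.5, -1.0])   # e.g. three of the six D1 classes
binary = svm_confidences(1.5)               # positive margin favours class 1
```

These values are a normalization of margins for display purposes, not calibrated probabilities.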

D2 label mapping: Raw integer labels (0, 1) mapped to "Not Depressed" / "Depressed". Handles both str and int types for robustness.

templates/index.html

Single-page application. All UI logic in vanilla JS:

  • fetch('/predict', { method: 'POST', ... }): AJAX prediction call
  • Tab switching, progress bar animation, accordion expand/collapse
  • Counter animations for stats in hero section
  • No build step, no bundler, no external JS framework

6. Data Models & API Contract

Request

POST /predict
Content-Type: application/json

{ "text": "string, max 5000 characters" }

Response

{
  "dataset1": {
    "task": "Depression Type (6 Classes)",
    "models": {
      "Logistic Regression": { "label": "postpartum", "confidence": 0.958 },
      "SVM":                  { "label": "postpartum", "confidence": 0.828 },
      "XGBoost":              { "label": "postpartum", "confidence": 0.999 },
      "XLM-RoBERTa":         { "label": "postpartum", "confidence": 0.997 }
    },
    "winner_model": "XGBoost",
    "winner_prediction": "postpartum",
    "winner_confidence": 0.999,
    "class_probs": {
      "postpartum": 0.997,
      "bipolar": 0.001,
      "major depressive": 0.001,
      "psychotic": 0.0,
      "no depression": 0.0,
      "atypical": 0.001
    }
  },
  "dataset2": {
    "task": "Binary Depression Detection",
    "models": { ... },
    "winner_model": "XLM-RoBERTa",
    "winner_prediction": "Depressed",
    "winner_confidence": 0.998,
    "class_probs": { "Depressed": 0.998, "Not Depressed": 0.002 }
  },
  "dataset3": {
    "task": "Suicide Risk Assessment",
    "models": { ... },
    "winner_model": "XLM-RoBERTa",
    "winner_prediction": "Suicide Risk",
    "winner_confidence": 0.993,
    "class_probs": { "Suicide Risk": 0.993, "No Suicide Risk": 0.007 }
  },
  "risk_flag": true,
  "suicide_votes": "4/4 models flagged suicide risk",
  "processing_time_ms": 2341
}
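A client for this contract can be sketched with the standard library alone. The helper names (`build_predict_request`, `predict`, `summarize_winners`) are hypothetical, and the base URL assumes the default local server address.

```python
import json
import urllib.request


def build_predict_request(text: str,
                          base_url: str = "http://localhost:5000") -> urllib.request.Request:
    """Build the POST /predict request with a JSON body of the form {"text": ...}."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text: str, base_url: str = "http://localhost:5000",
            timeout: float = 30.0) -> dict:
    """Send the request to a running server and decode the JSON response."""
    request = build_predict_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


def summarize_winners(result: dict) -> dict:
    """Reduce a full response to (winner_model, winner_prediction) per dataset."""
    return {
        key: (result[key]["winner_model"], result[key]["winner_prediction"])
        for key in ("dataset1", "dataset2", "dataset3")
    }
```

Calling `summarize_winners(predict("some text"))` against a running instance would yield one winner pair per dataset; the `risk_flag` and `suicide_votes` fields remain available on the full response.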

Internal Model State (_models dict in predict.py)

| Key | Type | Description |
| --- | --- | --- |
| le_d1/d2/d3 | LabelEncoder | Decodes integer predictions to class names |
| tfidf_d1/d2/d3 | TfidfVectorizer | Converts cleaned text to sparse feature vectors |
| logistic_regression_d1/d2/d3 | LogisticRegression | Linear baseline |
| svm_d1/d2/d3 | LinearSVC | SVM classifier |
| xgboost_d1/d2/d3 | XGBClassifier | Gradient boosting |
| tokenizer | XLMRobertaTokenizer | Shared SentencePiece tokenizer (all 3 transformer models) |
| xlmr_d1/d2/d3 | XLMRobertaForSequenceClassification | Fine-tuned transformer |
| xlmr_d1/d2/d3_len | int | Max token length: 128 / 128 / 256 |
| device | str | 'cuda' or 'cpu' |

7. API Endpoints

GET /

Returns index.html. No parameters.

POST /predict

| Scenario | HTTP Status | Response |
| --- | --- | --- |
| Success | 200 | Full prediction JSON (see above) |
| Missing text field | 400 | { "error": "..." } |
| Empty text | 400 | { "error": "..." } |
| Text > 5000 chars | 400 | { "error": "..." } |
| Models not loaded yet | 503 | { "error": "..." } |
| Prediction exception | 500 | { "error": "..." } |
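The status codes in the table can be expressed as a small validation function. This is a sketch only: the order of the checks and the error strings are placeholders chosen for illustration, since the real messages returned by app.py are not reproduced here.

```python
from typing import Any, Optional, Tuple

MAX_CHARS = 5000


def validate_payload(payload: Any, models_ready: bool) -> Tuple[int, Optional[str]]:
    """Return (http_status, error_message); the message is None when the input is valid.

    All error strings below are hypothetical placeholders.
    """
    if not models_ready:
        return 503, "models are still loading"
    if not isinstance(payload, dict) or "text" not in payload:
        return 400, "missing 'text' field"
    text = payload["text"]
    if not isinstance(text, str) or not text.strip():
        return 400, "text is empty"
    if len(text) > MAX_CHARS:
        return 400, f"text exceeds {MAX_CHARS} characters"
    return 200, None
```

A 500 response is not produced here; per the table it corresponds to an exception raised during prediction itself, after validation has passed.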

Typical latency: ~2–3 seconds on CPU (XLM-RoBERTa dominates inference time)

GET /health

{ "status": "ok", "models_ready": true }

Use for polling during the ~30-second startup window.
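A polling loop for the startup window might look like the sketch below. The function names, timeout, and interval are assumptions; connection errors are treated as "not ready yet" because the server may not be accepting connections while models load.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_health_http(base_url: str = "http://localhost:5000") -> dict:
    """GET /health and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
        return json.load(resp)


def wait_until_ready(fetch_health: Callable[[], dict] = fetch_health_http,
                     timeout_s: float = 60.0,
                     interval_s: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep,
                     clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll until the health payload reports models_ready, or give up after timeout_s."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        try:
            if fetch_health().get("models_ready"):
                return True
        except OSError:
            # urllib's URLError subclasses OSError: server not reachable yet.
            pass
        sleep(interval_s)
    return False
```

Passing `fetch_health`, `sleep`, and `clock` as parameters keeps the loop testable without a running server.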


8. Configuration & Setup

Environment Variables

None required. All paths computed relative to app.py using os.path.dirname(__file__).

Setup Steps

# 1. Download models from Google Drive and place them in models/classical/ and models/transformers/

# 2. Create virtual environment
python -m venv venv
source venv/bin/activate        # Mac / Linux
venv\Scripts\activate           # Windows

# 3. Install dependencies
pip install -r requirements.txt

# 4. Start server
python app.py

# 5. Open browser
# http://localhost:5000

Expected Startup Console Output

=======================================================
  MindScan - Starting up
=======================================================
  Loading models... (XLM-RoBERTa takes ~30s on CPU)
  ✓ Loaded encoders/tfidf for d1
  ✓ Loaded encoders/tfidf for d2
  ✓ Loaded encoders/tfidf for d3
  ✓ Loaded logistic_regression_d1
  ✓ Loaded svm_d1
  ✓ Loaded xgboost_d1
  ... (repeated for d2, d3)
  ✓ Using device: cpu
  ✓ Tokeniser loaded
  ✓ Loaded XLM-RoBERTa d1 (max_length=128)
  ✓ Loaded XLM-RoBERTa d2 (max_length=128)
  ✓ Loaded XLM-RoBERTa d3 (max_length=256)
  ✅ All models ready
  🌐 Open: http://localhost:5000
=======================================================

.gitignore Exclusions

  • models/: All trained model files (too large for git, download separately)
  • venv/, .venv/: Virtual environments
  • __pycache__/, *.pyc, *.pyo
  • .ipynb_checkpoints/
  • .DS_Store, Thumbs.db

9. AI / ML Architecture

The Three Datasets

| | D1 | D2 | D3 |
| --- | --- | --- | --- |
| Source | Nusrat et al. (2024), Zenodo 14233292 | albertobellardini, Kaggle | nikhileswarkomati, Kaggle |
| Platform | Twitter | Twitter | Reddit |
| Size | 14,983 tweets | 10,314 tweets | 50,000 posts |
| Task | 6-class depression type | Binary depression | Binary suicide risk |
| Avg text length | 31.4 words | ~30 words | 62–200 words |
| Class balance | Imbalanced (1.89×) | Severely imbalanced (3.46×) | Balanced (1.0×) |
| SMOTE applied | Yes: 11,986 → 17,982 | Yes: 8,251 → 12,800 | No |

All datasets use stratified 80/20 train/test split, random_state=42. Test sets are never touched by SMOTE (realistic evaluation).
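The split and SMOTE row counts quoted for D1 and D2 can be cross-checked with a few lines of arithmetic. The sketch assumes that the test size is rounded up (as scikit-learn's train_test_split does) and that SMOTE raised every class to the majority-class count, which is the default behaviour in imbalanced-learn; the helper names are hypothetical.

```python
import math


def train_rows(total: int, test_fraction: float = 0.2) -> int:
    """Training rows left after an 80/20 split, with the test size rounded up."""
    return total - math.ceil(total * test_fraction)


def majority_after_smote(balanced_total: int, n_classes: int) -> int:
    """If every class was raised to the majority count, total = n_classes * majority."""
    majority, remainder = divmod(balanced_total, n_classes)
    if remainder:
        raise ValueError("total is not an even multiple of the class count")
    return majority


d1_train = train_rows(14_983)                    # 11986
d2_train = train_rows(10_314)                    # 8251
d1_majority = majority_after_smote(17_982, 6)    # 2997 per class after balancing
d2_majority = majority_after_smote(12_800, 2)    # 6400 per class after balancing
d2_minority = d2_train - d2_majority             # 1851 minority rows before SMOTE
d2_ratio = round(d2_majority / d2_minority, 2)   # 3.46
```

Under these assumptions the D2 majority-to-minority ratio on the training split comes out at 3.46, matching the imbalance figure quoted for D2.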

TF-IDF Feature Dimensions

| Dataset | Features |
| --- | --- |
| D1 | 34,615 |
| D2 | 50,000 |
| D3 | 60,000 |

XLM-RoBERTa Architecture

  • Base model: xlm-roberta-base
  • Parameters: 278 million
  • Layers: 12 transformer layers, 12 attention heads
  • Pre-training data: 2.5 TB of text across 100 languages
  • Fine-tuning: 3 epochs on Google Colab T4 GPU
  • Weight format: safetensors (more secure and efficient than .bin)
  • Max token lengths: D1 = 128, D2 = 128, D3 = 256
    • D3 uses 256 because Reddit posts average 200.8 words for the suicide class (3.2× longer than non-suicidal posts at 62.2 words)
  • Tokenizer: Shared single instance across all 3 models (not duplicated)

Why SVM Beats Transformer on D1

A likely explanation is that XLM-RoBERTa's contextual embeddings need enough tokens per input to show an advantage over bag-of-words TF-IDF. D1 tweets average only 31.4 words, which leaves little context to exploit. SVM achieves F1=0.9269 on D1, while XLM-RoBERTa scores lower. The transformer's advantage grows with text length and dominates on D3, where suicide-class posts average 200.8 words.

Why Random Forest Was Excluded from Deployment

| Model | D1 F1 | D3 F1 | Aggregate Size |
| --- | --- | --- | --- |
| XGBoost | competitive | competitive | ~4.2 MB |
| Random Forest | worst on D1 | worst on D3 | 647 MB |

Size penalty not justified by performance. Random Forest .pkl files remain in models/classical/ but are never loaded by predict.py.

Performance Results

| Dataset | Model | Macro F1 | Cohen's κ |
| --- | --- | --- | --- |
| D1: Depression Type | SVM | 0.9269 | 0.9072 |
| D1: Depression Type | XGBoost | ~0.90 | - |
| D1: Depression Type | XLM-RoBERTa | lower | - |
| D2: Binary Depression | XLM-RoBERTa | 0.9993 | 0.9986 |
| D3: Suicide Risk | XLM-RoBERTa | 0.9810 | 0.9620 |
| Baseline (Tumaliuan 2024) | - | 0.81 | - |

Improvement over baseline: +12.7%
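The formula behind the improvement figure is not stated. For readers who want to reproduce the comparison from the two endpoints quoted in this document (0.81 and 0.9269), the sketch below computes both common conventions, the absolute gain in F1 points and the gain relative to the baseline; it makes no claim about which convention the reported figure follows.

```python
baseline_f1 = 0.81     # Tumaliuan et al. (2024), as quoted above
best_f1 = 0.9269       # SVM on D1, as quoted above

absolute_gain = best_f1 - baseline_f1          # difference in F1
relative_gain = absolute_gain / baseline_f1    # difference as a fraction of the baseline

absolute_points = round(absolute_gain * 100, 2)   # 11.69 F1 points
relative_percent = round(relative_gain * 100, 1)  # 14.4 percent of the baseline
```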


10. Entry Points & Running the App

Production (Local)

python app.py

Opens on http://localhost:5000. Server also accessible on local network via http://<your-ip>:5000.

Training (Notebooks, Google Colab Only)

| Notebook | Purpose | Runtime Required |
| --- | --- | --- |
| notebooks/DA_Notebook_One.ipynb | Train LR, SVM, XGBoost on all 3 datasets; generate metrics CSV and confusion matrix PNGs | CPU (Colab free tier) |
| notebooks/DA_2_Notebook.ipynb | Fine-tune XLM-RoBERTa on all 3 datasets; run full model comparison | T4 GPU (Colab) |

Both notebooks save outputs to Google Drive at MindScan_Models/.


11. Dependencies

flask==3.0.3          Web framework + routing
scikit-learn==1.6.1   LR, LinearSVC, TfidfVectorizer, LabelEncoder, evaluation metrics
xgboost==2.0.3        Gradient boosting classifier
transformers==4.41.2  XLM-RoBERTa model + tokenizer (HuggingFace)
torch==2.3.0          PyTorch runtime (GPU optional; CUDA auto-detected)
joblib==1.4.2         Pickle serialization for large sklearn objects
numpy==1.26.4         Numerical operations, softmax computation

No Node.js / npm dependencies. Pure Python backend, vanilla JS frontend (no build step).


12. Testing & Evaluation

No Automated Test Suite

The project has no pytest or unittest suite and no CI/CD pipeline. Evaluation relies on the following:

Quantitative (offline, notebook):

  • Macro F1 score (primary metric; handles class imbalance)
  • Cohen's Kappa (measures agreement beyond chance; reported for the best model on each dataset)
  • Accuracy
  • Confusion matrices (saved as PNG to models/classical/)
  • classical_results.csv: full metrics table for all classical models

Visual (EDA plots):

  • eda_d1.png, eda_d2.png, eda_d3.png: class distributions and text length histograms

Manual (UI):

  • Sample text buttons in the live demo for smoke-testing the prediction pipeline
  • All 4 model predictions + confidence bars shown simultaneously

13. Key Findings & Anomalies

  1. SVM beats XLM-RoBERTa on D1: Short tweets (31.4 words avg) appear to provide too little context for transformer embeddings to outperform TF-IDF bag-of-words. Classical ML is not always inferior to modern deep learning.

  2. D3 text length asymmetry: Suicide posts (200.8 words avg) are 3.2× longer than non-suicidal posts (62.2 words). This drove the max_length=256 decision for the D3 transformer.

  3. Near-perfect D2 score (F1=0.9993): Binary depression on tweets is almost perfectly separable with XLM-RoBERTa, likely due to strong lexical signals in the dataset.

  4. Parallel architecture prevents missed cases: Sequential gating (e.g., only check suicide if depression detected) would miss suicidal ideation in people who show no depression markers. All 3 tasks always run.

  5. Confidence computation differs by model type: SVM uses softmax(decision_function()) because LinearSVC lacks native probability calibration. All outputs are normalized to 0–1 for UI consistency.

  6. Transformer weights in safetensors format: A newer format than pickle-based PyTorch .bin checkpoints. Loading a safetensors file does not execute arbitrary code, which avoids pickle deserialization attacks.

  7. SMOTE only on training data: Oversampling is applied only to training splits. Test sets remain unmodified to reflect real-world class distributions.

  8. Random Forest technically present but never loaded: The .pkl files exist in models/classical/ but predict.py has no code path that loads them.


14. Research Context

Project: NCI H9DAI, Data Analytics for Artificial Intelligence
Degree: MSc Artificial Intelligence
Year: 2026
Methodology: CRISP-DM (6 stages: Business Understanding → Data Understanding → Data Preparation → Modelling → Evaluation → Deployment)

Baseline paper: Tumaliuan et al. (2024), depression detection on Filipino Twitter, F1=0.81

Research Questions:

  • RQ1: Can classical ML (SVM, LR, XGBoost) exceed the 0.81 baseline?
  • RQ2: Can XLM-RoBERTa further improve on classical ML?
  • RQ3: Does SMOTE balancing improve F1 on imbalanced datasets?
  • RQ4: Does the parallel architecture catch cases a sequential pipeline would miss?

Datasets used (English/multilingual, broader than baseline's Filipino-only scope):

  • D1: Zenodo 14233292 (Nusrat et al.)
  • D2: Kaggle, albertobellardini
  • D3: Kaggle, nikhileswarkomati

Key contributions over baseline:

  • Multi-dataset parallel evaluation (vs. single dataset)
  • XLM-RoBERTa multilingual transformer (vs. no transformer)
  • SMOTE balancing (vs. no balancing strategy)
  • Cohen's Kappa reporting (vs. accuracy/F1 only)
  • Explainable per-model confidence scores in UI
