Brain / ABOUT.md

Upload folder using huggingface_hub

016c645 verified about 1 month ago

preview code

raw

history blame contribute delete

20.9 kB

MindScan — Complete Project Reference

NCI H9DAI · MSc Artificial Intelligence · 2026

Disclaimer: This is a research prototype built for academic coursework. It is not a clinical tool and must never be used for actual medical diagnosis or mental health assessment.

Project Purpose
Tech Stack
Directory Structure
Core Features
Components & Modules
Data Models & API Contract
API Endpoints
Configuration & Setup
AI / ML Architecture
Entry Points & Running the App
Dependencies
Testing & Evaluation
Key Findings & Anomalies
Research Context

1. Project Purpose

MindScan is a multi-model mental health text analysis system. It runs 12 machine learning classifiers simultaneously across 3 independent datasets, returning three clinically distinct assessments from a single text input:

Assessment	Task	Classes
Depression Type	Multi-class classification	postpartum, major depressive, bipolar, psychotic, no depression, atypical
Binary Depression	Binary classification	Depressed / Not Depressed
Suicide Risk	Binary classification	Suicide Risk / No Suicide Risk

Research goal: Extend Tumaliuan et al. (2024) with modern transformer embeddings (XLM-RoBERTa), classical ML gold standards (SVM), and SMOTE balancing — achieving a +12.7% F1 improvement (0.81 → 0.9269) over the baseline.

Key architectural decision — parallel, not sequential: All 3 datasets run independently. Suicidal ideation can exist without depression markers; a sequential pipeline would gate out those cases entirely. Research Question 4 (RQ4) explicitly tests whether the parallel design catches cases that sequential would miss.

2. Tech Stack

Layer	Technology	Version
Web framework	Flask	3.0.3
Classical ML	scikit-learn	1.6.1
Gradient boosting	XGBoost	2.0.3
Transformer models	HuggingFace Transformers	4.41.2
Deep learning runtime	PyTorch	2.3.0
Model serialization	joblib	1.4.2
Numerical ops	NumPy	1.26.4
Frontend	HTML5 + vanilla JS + CSS3	—
Fonts	Instrument Serif, Geist, DM Mono	—

No database. No frontend framework. All model state is held in memory after startup.

3. Directory Structure

MindScan/
├── app.py                              # Flask entry point (94 lines)
├── predict.py                          # Core prediction logic (303 lines)
├── requirements.txt                    # Python dependencies (7 packages)
├── README.md                           # Quick-start guide
├── ABOUT.md                            # This file — full project reference
├── .gitignore                          # models/ excluded (too large for git)
│
├── templates/
│   └── index.html                      # Single-page web UI
│
├── notebooks/
│   ├── DA_Notebook_One.ipynb           # Classical model training (2,269 lines)
│   └── DA_2_Notebook.ipynb             # XLM-RoBERTa training (13,178 lines)
│
├── models/
│   ├── classical/
│   │   ├── le_d1.pkl                   # LabelEncoder — D1 (543 bytes)
│   │   ├── le_d2.pkl                   # LabelEncoder — D2
│   │   ├── le_d3.pkl                   # LabelEncoder — D3
│   │   ├── tfidf_d1.pkl                # TF-IDF vectorizer — D1 (1.4 MB, 34,615 features)
│   │   ├── tfidf_d2.pkl                # TF-IDF vectorizer — D2 (569 KB, 50,000 features)
│   │   ├── tfidf_d3.pkl                # TF-IDF vectorizer — D3 (2.3 MB, 60,000 features)
│   │   ├── logistic_regression_d1.pkl  # LR — D1 (1.6 MB)
│   │   ├── logistic_regression_d2.pkl  # LR — D2 (120 KB)
│   │   ├── logistic_regression_d3.pkl  # LR — D3 (470 KB)
│   │   ├── svm_d1.pkl                  # LinearSVC — D1 (1.6 MB)
│   │   ├── svm_d2.pkl                  # LinearSVC — D2 (120 KB)
│   │   ├── svm_d3.pkl                  # LinearSVC — D3 (470 KB)
│   │   ├── xgboost_d1.pkl              # XGBoost — D1 (3.1 MB)
│   │   ├── xgboost_d2.pkl              # XGBoost — D2 (362 KB)
│   │   ├── xgboost_d3.pkl              # XGBoost — D3 (702 KB)
│   │   ├── random_forest_d1/d2/d3.pkl  # RF — NOT deployed (241 MB + 72 MB + 334 MB)
│   │   ├── classical_results.csv       # Performance metrics table
│   │   └── *.png                       # Confusion matrices + EDA plots (16 images)
│   │
│   └── transformers/
│       ├── xlmr_d1_final/              # Fine-tuned XLM-RoBERTa — D1 (1.1 GB)
│       │   ├── config.json             # Model architecture config
│       │   ├── model.safetensors       # Weights (1.1 GB)
│       │   ├── tokenizer.json          # BPE tokenizer (17 MB)
│       │   └── tokenizer_config.json   # Tokenizer metadata
│       ├── xlmr_d2_final/              # Fine-tuned XLM-RoBERTa — D2 (1.1 GB)
│       └── xlmr_d3_final/              # Fine-tuned XLM-RoBERTa — D3 (1.1 GB)
│
├── venv/                               # Python virtual environment
└── .venv/                              # Backup venv (both in .gitignore)

Total disk usage: ~3.2 GB (dominated by 3 × 1.1 GB transformer weights)

4. Core Features

Four Models Per Dataset (12 Total)

Each of the 3 datasets is evaluated by 4 independent models. All run on every request:

Logistic Regression (TF-IDF input)
SVM / LinearSVC (TF-IDF input)
XGBoost (TF-IDF input)
XLM-RoBERTa fine-tuned (raw text input)

Risk Aggregation

If 3 or more of the 4 Dataset 3 models flag suicide risk → risk_flag = true
UI renders a red danger banner
Response includes "suicide_votes": "X/4 models flagged suicide risk"

Text Preprocessing Pipeline

Applied to all input before TF-IDF vectorization (raw text passed to transformers):

lowercase → remove URLs (http/www/https) → strip @mentions
→ remove # symbols (word kept) → delete punctuation → normalize whitespace

UI Features (index.html)

Live demo textarea with 5000-character limit and real-time counter
Sample text buttons for quick testing
Results display: winner card + 4 model confidence bars per dataset
Class probability breakdown (expandable)
Risk flag banner (red = danger, green = safe)
CRISP-DM interactive timeline (6 stages, collapsible detail panels)
Dataset explorer with class distribution bars
Model card grid with F1 scores
Project folder tree with file detail pane
Animated stat counters in hero section
Comparison panel vs Tumaliuan et al. (2024) baseline

5. Components & Modules

`app.py` (94 lines)

Flask application. Responsibilities:

Initializes Flask app
Calls load_all_models() at startup (blocks until complete)
Defines 3 routes: /, /predict, /health
Input validation: max 5000 chars, non-empty, valid JSON
Prints startup progress with emoji checkmarks to console
Serves on 0.0.0.0:5000 (accessible on local network)

`predict.py` (303 lines)

Core prediction engine. Key functions:

Function	Purpose
`load_all_models()`	Loads all 12 models + encoders + tokenizer into `_models` global dict
`clean_text(text)`	Regex-based text cleaning (same logic used in both training notebooks)
`predict_classical(text, ds)`	TF-IDF vectorization + sklearn predict / decision_function
`predict_transformer(text, ds)`	Tokenization → forward pass → softmax probabilities
`predict_all(raw_text)`	Main orchestrator: cleans text, runs all 12 models, returns full result dict

Confidence normalization: All models normalize to 0–1:

LR / XGBoost: predict_proba() (native)
SVM: softmax(decision_function()) (LinearSVC has no proba by default)
Transformer: softmax(logits)

D2 label mapping: Raw integer labels (0, 1) mapped to "Not Depressed" / "Depressed". Handles both str and int types for robustness.

`templates/index.html`

Single-page application. All UI logic in vanilla JS:

fetch('/predict', { method: 'POST', ... }) — AJAX prediction call
Tab switching, progress bar animation, accordion expand/collapse
Counter animations for stats in hero section
No build step, no bundler, no external JS framework

6. Data Models & API Contract

Request

POST /predict
Content-Type: application/json

{ "text": "string — max 5000 characters" }

Response

{
  "dataset1": {
    "task": "Depression Type (6 Classes)",
    "models": {
      "Logistic Regression": { "label": "postpartum", "confidence": 0.958 },
      "SVM":                  { "label": "postpartum", "confidence": 0.828 },
      "XGBoost":              { "label": "postpartum", "confidence": 0.999 },
      "XLM-RoBERTa":         { "label": "postpartum", "confidence": 0.997 }
    },
    "winner_model": "XGBoost",
    "winner_prediction": "postpartum",
    "winner_confidence": 0.999,
    "class_probs": {
      "postpartum": 0.997,
      "bipolar": 0.001,
      "major depressive": 0.001,
      "psychotic": 0.0,
      "no depression": 0.0,
      "atypical": 0.001
    }
  },
  "dataset2": {
    "task": "Binary Depression Detection",
    "models": { ... },
    "winner_model": "XLM-RoBERTa",
    "winner_prediction": "Depressed",
    "winner_confidence": 0.998,
    "class_probs": { "Depressed": 0.998, "Not Depressed": 0.002 }
  },
  "dataset3": {
    "task": "Suicide Risk Assessment",
    "models": { ... },
    "winner_model": "XLM-RoBERTa",
    "winner_prediction": "Suicide Risk",
    "winner_confidence": 0.993,
    "class_probs": { "Suicide Risk": 0.993, "No Suicide Risk": 0.007 }
  },
  "risk_flag": true,
  "suicide_votes": "4/4 models flagged suicide risk",
  "processing_time_ms": 2341
}

Internal Model State (`_models` dict in predict.py)

Key	Type	Description
`le_d1/d2/d3`	`LabelEncoder`	Decodes integer predictions to class names
`tfidf_d1/d2/d3`	`TfidfVectorizer`	Converts cleaned text to sparse feature vectors
`logistic_regression_d1/d2/d3`	`LogisticRegression`	Linear baseline
`svm_d1/d2/d3`	`LinearSVC`	SVM classifier
`xgboost_d1/d2/d3`	`XGBClassifier`	Gradient boosting
`tokenizer`	`XLMRobertaTokenizer`	Shared BPE tokenizer (all 3 transformer models)
`xlmr_d1/d2/d3`	`XLMRobertaForSequenceClassification`	Fine-tuned transformer
`xlmr_d1/d2/d3_len`	`int`	Max token length: 128 / 128 / 256
`device`	`str`	`'cuda'` or `'cpu'`

7. API Endpoints

`GET /`

Returns index.html. No parameters.

`POST /predict`

Scenario	HTTP Status	Response
Success	200	Full prediction JSON (see above)
Missing `text` field	400	`{ "error": "..." }`
Empty text	400	`{ "error": "..." }`
Text > 5000 chars	400	`{ "error": "..." }`
Models not loaded yet	503	`{ "error": "..." }`
Prediction exception	500	`{ "error": "..." }`

Typical latency: ~2–3 seconds on CPU (XLM-RoBERTa dominates inference time)

`GET /health`

{ "status": "ok", "models_ready": true }

Use for polling during the ~30-second startup window.

8. Configuration & Setup

Environment Variables

None required. All paths computed relative to app.py using os.path.dirname(__file__).

Setup Steps

# 1. Download models from Google Drive → place in models/classical/ and models/transformers/

# 2. Create virtual environment
python -m venv venv
source venv/bin/activate        # Mac / Linux
venv\Scripts\activate           # Windows

# 3. Install dependencies
pip install -r requirements.txt

# 4. Start server
python app.py

# 5. Open browser
# http://localhost:5000

Expected Startup Console Output

=======================================================
  MindScan — Starting up
=======================================================
  Loading models... (XLM-RoBERTa takes ~30s on CPU)
  ✓ Loaded encoders/tfidf for d1
  ✓ Loaded encoders/tfidf for d2
  ✓ Loaded encoders/tfidf for d3
  ✓ Loaded logistic_regression_d1
  ✓ Loaded svm_d1
  ✓ Loaded xgboost_d1
  ... (repeated for d2, d3)
  ✓ Using device: cpu
  ✓ Tokeniser loaded
  ✓ Loaded XLM-RoBERTa d1 (max_length=128)
  ✓ Loaded XLM-RoBERTa d2 (max_length=128)
  ✓ Loaded XLM-RoBERTa d3 (max_length=256)
  ✅ All models ready
  🌐 Open: http://localhost:5000
=======================================================

`.gitignore` Exclusions

models/ — All trained model files (too large for git, download separately)
venv/, .venv/ — Virtual environments
__pycache__/, *.pyc, *.pyo
.ipynb_checkpoints/
.DS_Store, Thumbs.db

9. AI / ML Architecture

The Three Datasets

	D1	D2	D3
Source	Nusrat et al. (2024) — Zenodo 14233292	albertobellardini — Kaggle	nikhileswarkomati — Kaggle
Platform	Twitter	Twitter	Reddit
Size	14,983 tweets	10,314 tweets	50,000 posts
Task	6-class depression type	Binary depression	Binary suicide risk
Avg text length	31.4 words	~30 words	62–200 words
Class balance	Imbalanced (1.89×)	Severely imbalanced (3.46×)	Balanced (1.0×)
SMOTE applied	Yes — 11,986 → 17,982	Yes — 8,251 → 12,800	No

All datasets use stratified 80/20 train/test split, random_state=42. Test sets are never touched by SMOTE (realistic evaluation).

TF-IDF Feature Dimensions

Dataset	Features
D1	34,615
D2	50,000
D3	60,000

XLM-RoBERTa Architecture

Base model: xlm-roberta-base
Parameters: 278 million
Layers: 12 transformer layers, 12 attention heads
Pre-training data: 2.5 TB of text across 100 languages
Fine-tuning: 3 epochs on Google Colab T4 GPU
Weight format: safetensors (more secure and efficient than .bin)
Max token lengths: D1 = 128, D2 = 128, D3 = 256
- D3 uses 256 because Reddit posts average 200.8 words for the suicide class (3.2× longer than non-suicidal posts at 62.2 words)
Tokenizer: Shared single instance across all 3 models (not duplicated)

Why SVM Beats Transformer on D1

XLM-RoBERTa's contextual embeddings require sufficient token sequence length to demonstrate advantage over bag-of-words TF-IDF. D1 tweets average only 31.4 words — too short for context to matter. SVM achieves F1=0.9269 vs XLM-RoBERTa's lower score on D1. The transformer's advantage grows with text length and dominates on D3 (avg 200+ words).

Why Random Forest Was Excluded from Deployment

Model	D1 F1	D3 F1	Aggregate Size
XGBoost	competitive	competitive	~4.2 MB
Random Forest	worst on D1	worst on D3	647 MB

Size penalty not justified by performance. Random Forest .pkl files remain in models/classical/ but are never loaded by predict.py.

Performance Results

Dataset	Model	Macro F1	Cohen's κ
D1 — Depression Type	SVM	0.9269	0.9072
D1 — Depression Type	XGBoost	~0.90	—
D1 — Depression Type	XLM-RoBERTa	lower	—
D2 — Binary Depression	XLM-RoBERTa	0.9993	0.9986
D3 — Suicide Risk	XLM-RoBERTa	0.9810	0.9620
Baseline (Tumaliuan 2024)	—	0.81	—

Improvement over baseline: +12.7%

10. Entry Points & Running the App

Production (Local)

python app.py

Opens on http://localhost:5000. Server also accessible on local network via http://<your-ip>:5000.

Training (Notebooks — Google Colab Only)

Notebook	Purpose	Runtime Required
`notebooks/DA_Notebook_One.ipynb`	Train LR, SVM, XGBoost on all 3 datasets; generate metrics CSV and confusion matrix PNGs	CPU (Colab free tier)
`notebooks/DA_2_Notebook.ipynb`	Fine-tune XLM-RoBERTa on all 3 datasets; run full model comparison	T4 GPU (Colab)

Both notebooks save outputs to Google Drive at MindScan_Models/.

11. Dependencies

flask==3.0.3          Web framework + routing
scikit-learn==1.6.1   LR, LinearSVC, TfidfVectorizer, LabelEncoder, SMOTE metrics
xgboost==2.0.3        Gradient boosting classifier
transformers==4.41.2  XLM-RoBERTa model + tokenizer (HuggingFace)
torch==2.3.0          PyTorch runtime (GPU optional — CUDA auto-detected)
joblib==1.4.2         Pickle serialization for large sklearn objects
numpy==1.26.4         Numerical operations, softmax computation

No Node.js / npm dependencies. Pure Python backend, vanilla JS frontend (no build step).

12. Testing & Evaluation

No Automated Test Suite

The project has no pytest, unittest, or CI/CD pipeline. Evaluation is:

Quantitative (offline, notebook):

Macro F1 score (primary metric — handles class imbalance)
Cohen's Kappa (measures agreement beyond chance — reported for D1)
Accuracy
Confusion matrices (saved as PNG to models/classical/)
classical_results.csv — full metrics table for all classical models

Visual (EDA plots):

eda_d1.png, eda_d2.png, eda_d3.png — class distributions and text length histograms

Manual (UI):

Sample text buttons in the live demo for smoke-testing the prediction pipeline
All 4 model predictions + confidence bars shown simultaneously

13. Key Findings & Anomalies

SVM beats XLM-RoBERTa on D1 — Short tweets (31.4 words avg) don't provide enough context for transformer embeddings to outperform TF-IDF bag-of-words. Classical ML is not always inferior to modern deep learning.
D3 text length asymmetry — Suicide posts (200.8 words avg) are 3.2× longer than non-suicidal posts (62.2 words). This drove the max_length=256 decision for the D3 transformer.
Near-perfect D2 score (F1=0.9993) — Binary depression on tweets is almost perfectly separable with XLM-RoBERTa, likely due to strong lexical signals in the dataset.
Parallel architecture prevents missed cases — Sequential gating (e.g., only check suicide if depression detected) would miss suicidal ideation in people who show no depression markers. All 3 tasks always run.
Confidence computation differs by model type — SVM uses softmax(decision_function()) because LinearSVC lacks native probability calibration. All outputs are normalized to 0–1 for UI consistency.
Transformer weights in safetensors format — Newer, more secure format vs. PyTorch .bin. Resists pickle deserialization attacks.
SMOTE only on training data — Oversampling is applied only to training splits. Test sets remain unmodified to reflect real-world class distributions.
Random Forest technically present but never loaded — The .pkl files exist in models/classical/ but predict.py has no code path that loads them.

14. Research Context

Project: NCI H9DAI — Data Analytics for Artificial Intelligence
Degree: MSc Artificial Intelligence
Year: 2026
Methodology: CRISP-DM (6 stages: Business Understanding → Data Understanding → Data Preparation → Modelling → Evaluation → Deployment)

Baseline paper: Tumaliuan et al. (2024) — depression detection on Filipino Twitter, F1=0.81

Research Questions:

RQ1: Can classical ML (SVM, LR, XGBoost) exceed the 0.81 baseline?
RQ2: Can XLM-RoBERTa further improve on classical ML?
RQ3: Does SMOTE balancing improve F1 on imbalanced datasets?
RQ4: Does the parallel architecture catch cases a sequential pipeline would miss?

Datasets used (English/multilingual, broader than baseline's Filipino-only scope):

D1: Zenodo 14233292 (Nusrat et al.)
D2: Kaggle — albertobellardini
D3: Kaggle — nikhileswarkomati

Key contributions over baseline:

Multi-dataset parallel evaluation (vs. single dataset)
XLM-RoBERTa multilingual transformer (vs. no transformer)
SMOTE balancing (vs. no balancing strategy)
Cohen's Kappa reporting (vs. accuracy/F1 only)
Explainable per-model confidence scores in UI

NCI H9DAI · Data Analytics for Artificial Intelligence · MSc Artificial Intelligence · 2026