# MindScan: Complete Project Reference
### NCI H9DAI · MSc Artificial Intelligence · 2026
> **Disclaimer:** This is a research prototype built for academic coursework. It is **not** a clinical tool and must never be used for actual medical diagnosis or mental health assessment.
---
## Table of Contents
1. [Project Purpose](#1-project-purpose)
2. [Tech Stack](#2-tech-stack)
3. [Directory Structure](#3-directory-structure)
4. [Core Features](#4-core-features)
5. [Components & Modules](#5-components--modules)
6. [Data Models & API Contract](#6-data-models--api-contract)
7. [API Endpoints](#7-api-endpoints)
8. [Configuration & Setup](#8-configuration--setup)
9. [AI / ML Architecture](#9-ai--ml-architecture)
10. [Entry Points & Running the App](#10-entry-points--running-the-app)
11. [Dependencies](#11-dependencies)
12. [Testing & Evaluation](#12-testing--evaluation)
13. [Key Findings & Anomalies](#13-key-findings--anomalies)
14. [Research Context](#14-research-context)
---
## 1. Project Purpose
MindScan is a multi-model mental health text analysis system. It runs **12 machine learning classifiers simultaneously** across **3 independent datasets**, returning three clinically distinct assessments from a single text input:
| Assessment | Task | Classes |
|---|---|---|
| Depression Type | Multi-class classification | postpartum, major depressive, bipolar, psychotic, no depression, atypical |
| Binary Depression | Binary classification | Depressed / Not Depressed |
| Suicide Risk | Binary classification | Suicide Risk / No Suicide Risk |
**Research goal:** Extend Tumaliuan et al. (2024) with modern transformer embeddings (XLM-RoBERTa), classical ML gold standards (SVM), and SMOTE balancing, achieving a **+12.7% F1 improvement** (0.81 → 0.9269) over the baseline.
**Key architectural decision: parallel, not sequential.** All 3 datasets run independently. Suicidal ideation can exist without depression markers; a sequential pipeline would gate out those cases entirely. Research Question 4 (RQ4) explicitly tests whether the parallel design catches cases that a sequential one would miss.
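
The difference between the two designs can be illustrated with a minimal sketch. The two detector functions below are hypothetical keyword stubs, not the project's models; they exist only to show how a depression gate can hide a suicide-risk signal that a parallel design would still report.

```python
from typing import Dict

# Hypothetical stand-ins for the real classifiers (illustration only).
def detects_depression(text: str) -> bool:
    return "hopeless" in text.lower()

def detects_suicide_risk(text: str) -> bool:
    return "end my life" in text.lower()

def sequential_pipeline(text: str) -> Dict[str, bool]:
    """The suicide check runs only if the depression gate fires."""
    depressed = detects_depression(text)
    suicide = detects_suicide_risk(text) if depressed else False
    return {"depressed": depressed, "suicide_risk": suicide}

def parallel_pipeline(text: str) -> Dict[str, bool]:
    """Both checks always run, independently of each other."""
    return {
        "depressed": detects_depression(text),
        "suicide_risk": detects_suicide_risk(text),
    }

sample = "I feel fine most days, but I plan to end my life."
seq = sequential_pipeline(sample)   # gate is closed, so the risk is never checked
par = parallel_pipeline(sample)     # risk is checked regardless of the gate
```

With this sample, the sequential version reports no suicide risk because the depression stub never fires, while the parallel version flags it.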
---
## 2. Tech Stack
| Layer | Technology | Version |
|---|---|---|
| Web framework | Flask | 3.0.3 |
| Classical ML | scikit-learn | 1.6.1 |
| Gradient boosting | XGBoost | 2.0.3 |
| Transformer models | HuggingFace Transformers | 4.41.2 |
| Deep learning runtime | PyTorch | 2.3.0 |
| Model serialization | joblib | 1.4.2 |
| Numerical ops | NumPy | 1.26.4 |
| Frontend | HTML5 + vanilla JS + CSS3 | n/a |
| Fonts | Instrument Serif, Geist, DM Mono | n/a |
No database. No frontend framework. All model state is held in memory after startup.
---
## 3. Directory Structure
```
MindScan/
├── app.py                              # Flask entry point (94 lines)
├── predict.py                          # Core prediction logic (303 lines)
├── requirements.txt                    # Python dependencies (7 packages)
├── README.md                           # Quick-start guide
├── ABOUT.md                            # This file: full project reference
├── .gitignore                          # models/ excluded (too large for git)
│
├── templates/
│   └── index.html                      # Single-page web UI
│
├── notebooks/
│   ├── DA_Notebook_One.ipynb           # Classical model training (2,269 lines)
│   └── DA_2_Notebook.ipynb             # XLM-RoBERTa training (13,178 lines)
│
├── models/
│   ├── classical/
│   │   ├── le_d1.pkl                   # LabelEncoder, D1 (543 bytes)
│   │   ├── le_d2.pkl                   # LabelEncoder, D2
│   │   ├── le_d3.pkl                   # LabelEncoder, D3
│   │   ├── tfidf_d1.pkl                # TF-IDF vectorizer, D1 (1.4 MB, 34,615 features)
│   │   ├── tfidf_d2.pkl                # TF-IDF vectorizer, D2 (569 KB, 50,000 features)
│   │   ├── tfidf_d3.pkl                # TF-IDF vectorizer, D3 (2.3 MB, 60,000 features)
│   │   ├── logistic_regression_d1.pkl  # LR, D1 (1.6 MB)
│   │   ├── logistic_regression_d2.pkl  # LR, D2 (120 KB)
│   │   ├── logistic_regression_d3.pkl  # LR, D3 (470 KB)
│   │   ├── svm_d1.pkl                  # LinearSVC, D1 (1.6 MB)
│   │   ├── svm_d2.pkl                  # LinearSVC, D2 (120 KB)
│   │   ├── svm_d3.pkl                  # LinearSVC, D3 (470 KB)
│   │   ├── xgboost_d1.pkl              # XGBoost, D1 (3.1 MB)
│   │   ├── xgboost_d2.pkl              # XGBoost, D2 (362 KB)
│   │   ├── xgboost_d3.pkl              # XGBoost, D3 (702 KB)
│   │   ├── random_forest_d1/d2/d3.pkl  # RF, NOT deployed (241 MB + 72 MB + 334 MB)
│   │   ├── classical_results.csv       # Performance metrics table
│   │   └── *.png                       # Confusion matrices + EDA plots (16 images)
│   │
│   └── transformers/
│       ├── xlmr_d1_final/              # Fine-tuned XLM-RoBERTa, D1 (1.1 GB)
│       │   ├── config.json             # Model architecture config
│       │   ├── model.safetensors       # Weights (1.1 GB)
│       │   ├── tokenizer.json          # SentencePiece tokenizer (17 MB)
│       │   └── tokenizer_config.json   # Tokenizer metadata
│       ├── xlmr_d2_final/              # Fine-tuned XLM-RoBERTa, D2 (1.1 GB)
│       └── xlmr_d3_final/              # Fine-tuned XLM-RoBERTa, D3 (1.1 GB)
│
├── venv/                               # Python virtual environment
└── .venv/                              # Backup venv (both in .gitignore)
```
**Total disk usage:** ~3.2 GB (dominated by 3 × 1.1 GB transformer weights)
---
## 4. Core Features
### Four Models Per Dataset (12 Total)
Each of the 3 datasets is evaluated by 4 independent models. All run on every request:
1. Logistic Regression (TF-IDF input)
2. SVM / LinearSVC (TF-IDF input)
3. XGBoost (TF-IDF input)
4. XLM-RoBERTa fine-tuned (raw text input)
### Risk Aggregation
- If **3 or more of the 4 Dataset 3 models** flag suicide risk → `risk_flag = true`
- UI renders a red danger banner
- Response includes `"suicide_votes": "X/4 models flagged suicide risk"`
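
The voting rule above can be sketched in a few lines. The function name `aggregate_risk` and the input shape (model name mapped to predicted label) are assumptions made for illustration; only the threshold, the `"Suicide Risk"` label, and the two output fields come from this document.

```python
from typing import Dict

SUICIDE_LABEL = "Suicide Risk"  # label name as it appears in the API response
VOTE_THRESHOLD = 3              # "3 or more of the 4" Dataset 3 models

def aggregate_risk(d3_labels: Dict[str, str]) -> Dict[str, object]:
    """Count Dataset 3 models that predicted suicide risk and apply the threshold."""
    votes = sum(1 for label in d3_labels.values() if label == SUICIDE_LABEL)
    total = len(d3_labels)
    return {
        "risk_flag": votes >= VOTE_THRESHOLD,
        "suicide_votes": f"{votes}/{total} models flagged suicide risk",
    }

result = aggregate_risk({
    "Logistic Regression": "Suicide Risk",
    "SVM": "Suicide Risk",
    "XGBoost": "No Suicide Risk",
    "XLM-RoBERTa": "Suicide Risk",
})
```

Here three of four models agree, so `risk_flag` is set; with only two votes it would stay false.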
### Text Preprocessing Pipeline
Applied to all input before TF-IDF vectorization; the transformers receive the raw text instead:
```
lowercase → remove URLs (http/https/www) → strip @mentions
→ remove # symbols (word kept) → delete punctuation → normalize whitespace
```
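
A minimal sketch of these steps using only the standard library is shown below. The regular expressions are assumptions chosen to match the description; the patterns actually used in `predict.py` may differ, which is why the function is named `clean_text_sketch` rather than `clean_text`.

```python
import re
import string

def clean_text_sketch(text: str) -> str:
    """Approximate the documented cleaning steps, in the documented order."""
    text = text.lower()                                    # lowercase
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)     # remove URLs
    text = re.sub(r"@\w+", " ", text)                      # strip @mentions
    text = text.replace("#", "")                           # drop '#', keep the word
    # delete ASCII punctuation (assumption: non-ASCII symbols are left alone)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()               # normalize whitespace

cleaned = clean_text_sketch("Check https://example.com @user I'm SO #tired... ")
# cleaned == "check im so tired"
```

Note that URL removal has to run before punctuation removal; otherwise the `://` and dots would be stripped first and the URL pattern would no longer match.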
### UI Features (index.html)
- Live demo textarea with 5000-character limit and real-time counter
- Sample text buttons for quick testing
- Results display: winner card + 4 model confidence bars per dataset
- Class probability breakdown (expandable)
- Risk flag banner (red = danger, green = safe)
- CRISP-DM interactive timeline (6 stages, collapsible detail panels)
- Dataset explorer with class distribution bars
- Model card grid with F1 scores
- Project folder tree with file detail pane
- Animated stat counters in hero section
- Comparison panel vs Tumaliuan et al. (2024) baseline
---
## 5. Components & Modules
### `app.py` (94 lines)
Flask application. Responsibilities:
- Initializes Flask app
- Calls `load_all_models()` at startup (blocks until complete)
- Defines 3 routes: `/`, `/predict`, `/health`
- Input validation: max 5000 chars, non-empty, valid JSON
- Prints startup progress with emoji checkmarks to console
- Serves on `0.0.0.0:5000` (accessible on local network)
### `predict.py` (303 lines)
Core prediction engine. Key functions:
| Function | Purpose |
|---|---|
| `load_all_models()` | Loads all 12 models + encoders + tokenizer into `_models` global dict |
| `clean_text(text)` | Regex-based text cleaning (same logic used in both training notebooks) |
| `predict_classical(text, ds)` | TF-IDF vectorization + sklearn predict / decision_function |
| `predict_transformer(text, ds)` | Tokenization → forward pass → softmax probabilities |
| `predict_all(raw_text)` | Main orchestrator: cleans text, runs all 12 models, returns full result dict |
**Confidence normalization:** All models report confidence on a 0–1 scale:
- LR / XGBoost: `predict_proba()` (native)
- SVM: `softmax(decision_function())` (`LinearSVC` does not implement `predict_proba`)
- Transformer: `softmax(logits)`
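
As a rough illustration of the SVM case, the sketch below turns decision scores for one sample into values that sum to 1. It uses plain Python rather than NumPy so it runs anywhere. The helper name `svm_confidences` is hypothetical, and the binary branch is an assumption: `LinearSVC.decision_function` returns a single signed margin per sample for two-class problems, and this document does not say how `predict.py` expands it.

```python
import math
from typing import List, Sequence, Union

def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)                           # subtract the max to avoid overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def svm_confidences(decision: Union[float, Sequence[float]]) -> List[float]:
    """Map decision_function output for ONE sample to pseudo-probabilities."""
    if isinstance(decision, (int, float)):
        # Assumption: a binary margin d is treated as scores [0, d],
        # which makes the positive-class value equal to sigmoid(d).
        decision = [0.0, float(decision)]
    return softmax(list(decision))

multi = svm_confidences([2.0, 0.5, -1.0])   # three-class example
binary = svm_confidences(0.0)               # sample exactly on the boundary
```

These values are convenient for drawing confidence bars, but they are not calibrated probabilities in the way `predict_proba()` outputs from a probabilistic model are.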
**D2 label mapping:** Raw integer labels (0, 1) mapped to `"Not Depressed"` / `"Depressed"`. Handles both `str` and `int` types for robustness.
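
One way such a mapping could be written is sketched below. The names `D2_LABELS` and `map_d2_label` are hypothetical, and the 0/1 assignment follows the order in which the labels are listed above.

```python
from typing import Union

# Assumed mapping: 0 -> "Not Depressed", 1 -> "Depressed"
D2_LABELS = {0: "Not Depressed", 1: "Depressed"}

def map_d2_label(raw: Union[int, str]) -> str:
    """Accept the label as an int (0/1) or as a string ("0"/"1")."""
    return D2_LABELS[int(raw)]

labels = [map_d2_label(x) for x in (0, "1", 1, "0")]
```

Converting through `int()` means the same lookup serves both types, whichever one the label encoder happens to return.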
### `templates/index.html`
Single-page application. All UI logic in vanilla JS:
- `fetch('/predict', { method: 'POST', ... })`: AJAX prediction call
- Tab switching, progress bar animation, accordion expand/collapse
- Counter animations for stats in hero section
- No build step, no bundler, no external JS framework
---
## 6. Data Models & API Contract
### Request
```http
POST /predict
Content-Type: application/json

{ "text": "string, max 5000 characters" }
```
### Response
```json
{
"dataset1": {
"task": "Depression Type (6 Classes)",
"models": {
"Logistic Regression": { "label": "postpartum", "confidence": 0.958 },
"SVM": { "label": "postpartum", "confidence": 0.828 },
"XGBoost": { "label": "postpartum", "confidence": 0.999 },
"XLM-RoBERTa": { "label": "postpartum", "confidence": 0.997 }
},
"winner_model": "XGBoost",
"winner_prediction": "postpartum",
"winner_confidence": 0.999,
"class_probs": {
"postpartum": 0.997,
"bipolar": 0.001,
"major depressive": 0.001,
"psychotic": 0.0,
"no depression": 0.0,
"atypical": 0.001
}
},
"dataset2": {
"task": "Binary Depression Detection",
"models": { ... },
"winner_model": "XLM-RoBERTa",
"winner_prediction": "Depressed",
"winner_confidence": 0.998,
"class_probs": { "Depressed": 0.998, "Not Depressed": 0.002 }
},
"dataset3": {
"task": "Suicide Risk Assessment",
"models": { ... },
"winner_model": "XLM-RoBERTa",
"winner_prediction": "Suicide Risk",
"winner_confidence": 0.993,
"class_probs": { "Suicide Risk": 0.993, "No Suicide Risk": 0.007 }
},
"risk_flag": true,
"suicide_votes": "4/4 models flagged suicide risk",
"processing_time_ms": 2341
}
```
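
A client can reduce this response to a few readable lines. The sketch below is one possible approach; the `summarize` helper is hypothetical and relies only on the field names shown in the example response. The sample dictionary is trimmed to the fields the helper reads.

```python
from typing import Any, Dict, List

def summarize(response: Dict[str, Any]) -> List[str]:
    """Build one line per dataset plus a final line for the risk flag."""
    lines = []
    for key in ("dataset1", "dataset2", "dataset3"):
        block = response[key]
        lines.append(
            f'{block["task"]}: {block["winner_prediction"]} '
            f'({block["winner_model"]}, {block["winner_confidence"]:.1%})'
        )
    flag = "RISK" if response["risk_flag"] else "no risk flag"
    lines.append(f'{flag}: {response["suicide_votes"]}')
    return lines

example = {
    "dataset1": {"task": "Depression Type (6 Classes)", "winner_model": "XGBoost",
                 "winner_prediction": "postpartum", "winner_confidence": 0.999},
    "dataset2": {"task": "Binary Depression Detection", "winner_model": "XLM-RoBERTa",
                 "winner_prediction": "Depressed", "winner_confidence": 0.998},
    "dataset3": {"task": "Suicide Risk Assessment", "winner_model": "XLM-RoBERTa",
                 "winner_prediction": "Suicide Risk", "winner_confidence": 0.993},
    "risk_flag": True,
    "suicide_votes": "4/4 models flagged suicide risk",
}
summary = summarize(example)
```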
### Internal Model State (`_models` dict in predict.py)
| Key | Type | Description |
|---|---|---|
| `le_d1/d2/d3` | `LabelEncoder` | Decodes integer predictions to class names |
| `tfidf_d1/d2/d3` | `TfidfVectorizer` | Converts cleaned text to sparse feature vectors |
| `logistic_regression_d1/d2/d3` | `LogisticRegression` | Linear baseline |
| `svm_d1/d2/d3` | `LinearSVC` | SVM classifier |
| `xgboost_d1/d2/d3` | `XGBClassifier` | Gradient boosting |
| `tokenizer` | `XLMRobertaTokenizer` | Shared SentencePiece tokenizer (all 3 transformer models) |
| `xlmr_d1/d2/d3` | `XLMRobertaForSequenceClassification` | Fine-tuned transformer |
| `xlmr_d1/d2/d3_len` | `int` | Max token length: 128 / 128 / 256 |
| `device` | `str` | `'cuda'` or `'cpu'` |
---
## 7. API Endpoints
### `GET /`
Returns `index.html`. No parameters.
### `POST /predict`
| Scenario | HTTP Status | Response |
|---|---|---|
| Success | 200 | Full prediction JSON (see above) |
| Missing `text` field | 400 | `{ "error": "..." }` |
| Empty text | 400 | `{ "error": "..." }` |
| Text > 5000 chars | 400 | `{ "error": "..." }` |
| Models not loaded yet | 503 | `{ "error": "..." }` |
| Prediction exception | 500 | `{ "error": "..." }` |
**Typical latency:** ~2–3 seconds on CPU (XLM-RoBERTa dominates inference time)
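
The status codes in the table can be read as a sequence of checks. The sketch below is framework-free and does not reproduce `app.py`: the function name `validate_payload`, the order of the checks, and the error messages are all assumptions (the table elides the actual messages).

```python
from typing import Any, Dict, Tuple

MAX_CHARS = 5000  # limit stated in this document

def validate_payload(payload: Any, models_ready: bool = True) -> Tuple[int, Dict[str, str]]:
    """Return (http_status, error_body); the body is empty when the input is valid."""
    if not models_ready:
        return 503, {"error": "models are still loading"}       # placeholder message
    if not isinstance(payload, dict) or "text" not in payload:
        return 400, {"error": "missing 'text' field"}           # placeholder message
    text = payload["text"]
    if not isinstance(text, str) or not text.strip():
        return 400, {"error": "text is empty"}                  # placeholder message
    if len(text) > MAX_CHARS:
        return 400, {"error": f"text exceeds {MAX_CHARS} characters"}
    return 200, {}

status_ok, _ = validate_payload({"text": "I have been feeling low lately."})
status_long, _ = validate_payload({"text": "a" * (MAX_CHARS + 1)})
```

A 500 response is not modelled here because it would come from an exception raised during prediction, after validation has passed.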
### `GET /health`
```json
{ "status": "ok", "models_ready": true }
```
Use for polling during the ~30-second startup window.
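
A polling loop along these lines could wait for the models before sending the first request. This is a sketch under assumptions: `fetch_health` and `wait_until_ready` are hypothetical helper names, the timeout and interval defaults are arbitrary, and the URL is the local address used elsewhere in this document.

```python
import json
import time
import urllib.request
from typing import Callable, Dict

def fetch_health(url: str = "http://localhost:5000/health") -> Dict[str, object]:
    """GET /health and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))

def wait_until_ready(fetch: Callable[[], Dict[str, object]] = fetch_health,
                     timeout_s: float = 60.0,
                     interval_s: float = 1.0) -> bool:
    """Poll until the health payload reports models_ready, or give up at the timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if fetch().get("models_ready"):
                return True
        except OSError:
            pass  # connection refused or timed out: the server is not up yet
        time.sleep(interval_s)
    return False
```

Calling `wait_until_ready()` with no arguments polls the local server; passing a custom `fetch` callable makes the loop easy to exercise without a running server.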
---
## 8. Configuration & Setup
### Environment Variables
None required. All paths computed relative to `app.py` using `os.path.dirname(__file__)`.
### Setup Steps
```bash
# 1. Download models from Google Drive → place in models/classical/ and models/transformers/
# 2. Create virtual environment
python -m venv venv
source venv/bin/activate   # Mac / Linux
# venv\Scripts\activate    # Windows: run this instead of the line above
# 3. Install dependencies
pip install -r requirements.txt
# 4. Start server
python app.py
# 5. Open browser
# http://localhost:5000
```
### Expected Startup Console Output
```
=======================================================
MindScan - Starting up
=======================================================
Loading models... (XLM-RoBERTa takes ~30s on CPU)
✓ Loaded encoders/tfidf for d1
✓ Loaded encoders/tfidf for d2
✓ Loaded encoders/tfidf for d3
✓ Loaded logistic_regression_d1
✓ Loaded svm_d1
✓ Loaded xgboost_d1
... (repeated for d2, d3)
✓ Using device: cpu
✓ Tokeniser loaded
✓ Loaded XLM-RoBERTa d1 (max_length=128)
✓ Loaded XLM-RoBERTa d2 (max_length=128)
✓ Loaded XLM-RoBERTa d3 (max_length=256)
✅ All models ready
🌐 Open: http://localhost:5000
=======================================================
```
### `.gitignore` Exclusions
- `models/`: all trained model files (too large for git, download separately)
- `venv/`, `.venv/`: virtual environments
- `__pycache__/`, `*.pyc`, `*.pyo`
- `.ipynb_checkpoints/`
- `.DS_Store`, `Thumbs.db`
---
## 9. AI / ML Architecture
### The Three Datasets
| | D1 | D2 | D3 |
|---|---|---|---|
| **Source** | Nusrat et al. (2024), Zenodo 14233292 | albertobellardini (Kaggle) | nikhileswarkomati (Kaggle) |
| **Platform** | Twitter | Twitter | Reddit |
| **Size** | 14,983 tweets | 10,314 tweets | 50,000 posts |
| **Task** | 6-class depression type | Binary depression | Binary suicide risk |
| **Avg text length** | 31.4 words | ~30 words | 62–200 words |
| **Class balance** | Imbalanced (1.89×) | Severely imbalanced (3.46×) | Balanced (1.0×) |
| **SMOTE applied** | Yes: 11,986 → 17,982 | Yes: 8,251 → 12,800 | No |
All datasets use stratified 80/20 train/test split, `random_state=42`. Test sets are never touched by SMOTE (realistic evaluation).
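
The SMOTE figures in the table can be cross-checked with simple arithmetic. The sketch assumes that SMOTE raised every minority class to the majority-class count (the default behaviour in imbalanced-learn) and that the imbalance ratio is majority count divided by minority count; neither assumption is stated explicitly in this document.

```python
# D2 (binary): 8,251 training rows before SMOTE, 12,800 after.
d2_train_before = 8_251
d2_train_after = 12_800
d2_majority = d2_train_after // 2             # both classes end at the majority count
d2_minority = d2_train_before - d2_majority   # what the minority class had originally
d2_ratio = round(d2_majority / d2_minority, 2)

# D1 (6 classes): 17,982 rows after SMOTE should split evenly across classes.
d1_train_after = 17_982
d1_per_class = d1_train_after // 6
d1_divides_evenly = d1_train_after % 6 == 0

# 80% of each dataset goes to training (stratified 80/20 split).
d1_train_before = int(14_983 * 0.8)
d2_train_check = int(10_314 * 0.8)
```

Under these assumptions D2 works out to 6,400 majority and 1,851 minority samples, a ratio of 3.46, and D1 ends with 2,997 samples per class, which is consistent with the values in the table.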
### TF-IDF Feature Dimensions
| Dataset | Features |
|---|---|
| D1 | 34,615 |
| D2 | 50,000 |
| D3 | 60,000 |
### XLM-RoBERTa Architecture
- **Base model:** `xlm-roberta-base`
- **Parameters:** 278 million
- **Layers:** 12 transformer layers, 12 attention heads
- **Pre-training data:** 2.5 TB of text across 100 languages
- **Fine-tuning:** 3 epochs on Google Colab T4 GPU
- **Weight format:** `safetensors` (more secure and efficient than `.bin`)
- **Max token lengths:** D1 = 128, D2 = 128, D3 = 256
  - D3 uses 256 because Reddit posts in the suicide class average 200.8 words (3.2× longer than non-suicidal posts at 62.2 words)
- **Tokenizer:** Shared single instance across all 3 models (not duplicated)
### Why SVM Beats Transformer on D1
XLM-RoBERTa's contextual embeddings need sufficiently long token sequences to show an advantage over bag-of-words TF-IDF. D1 tweets average only 31.4 words, which leaves little context to exploit. SVM achieves F1=0.9269 on D1, while XLM-RoBERTa scores lower. The transformer's advantage grows with text length, and it is the strongest model on D3, where suicide-class posts average 200.8 words.
### Why Random Forest Was Excluded from Deployment
| Model | D1 F1 | D3 F1 | Aggregate Size |
|---|---|---|---|
| XGBoost | competitive | competitive | ~4.2 MB |
| Random Forest | worst on D1 | worst on D3 | **647 MB** |
Size penalty not justified by performance. Random Forest `.pkl` files remain in `models/classical/` but are never loaded by `predict.py`.
### Performance Results
| Dataset | Model | Macro F1 | Cohen's κ |
|---|---|---|---|
| D1: Depression Type | **SVM** | **0.9269** | **0.9072** |
| D1: Depression Type | XGBoost | ~0.90 | n/a |
| D1: Depression Type | XLM-RoBERTa | lower | n/a |
| D2: Binary Depression | **XLM-RoBERTa** | **0.9993** | **0.9986** |
| D3: Suicide Risk | **XLM-RoBERTa** | **0.9810** | **0.9620** |
| Baseline (Tumaliuan 2024) | n/a | 0.81 | n/a |
**Improvement over baseline: +12.7%**
---
## 10. Entry Points & Running the App
### Production (Local)
```bash
python app.py
```
Opens on `http://localhost:5000`. Server also accessible on local network via `http://<your-ip>:5000`.
### Training (Notebooks, Google Colab Only)
| Notebook | Purpose | Runtime Required |
|---|---|---|
| `notebooks/DA_Notebook_One.ipynb` | Train LR, SVM, XGBoost on all 3 datasets; generate metrics CSV and confusion matrix PNGs | CPU (Colab free tier) |
| `notebooks/DA_2_Notebook.ipynb` | Fine-tune XLM-RoBERTa on all 3 datasets; run full model comparison | **T4 GPU** (Colab) |
Both notebooks save outputs to Google Drive at `MindScan_Models/`.
---
## 11. Dependencies
```
flask==3.0.3 Web framework + routing
scikit-learn==1.6.1 LR, LinearSVC, TfidfVectorizer, LabelEncoder, evaluation metrics
xgboost==2.0.3 Gradient boosting classifier
transformers==4.41.2 XLM-RoBERTa model + tokenizer (HuggingFace)
torch==2.3.0 PyTorch runtime (GPU optional; CUDA auto-detected)
joblib==1.4.2 Pickle serialization for large sklearn objects
numpy==1.26.4 Numerical operations, softmax computation
```
No Node.js / npm dependencies. Pure Python backend, vanilla JS frontend (no build step).
---
## 12. Testing & Evaluation
### No Automated Test Suite
The project has no `pytest`, `unittest`, or CI/CD pipeline. Evaluation is:
**Quantitative (offline, notebook):**
- Macro F1 score (primary metric; every class counts equally, so it is robust to class imbalance)
- Cohen's Kappa (measures agreement beyond chance; reported for the best model on each dataset)
- Accuracy
- Confusion matrices (saved as PNG to `models/classical/`)
- `classical_results.csv`: full metrics table for all classical models
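
For reference, the two headline metrics can be computed from label lists as sketched below. This is a dependency-free illustration of the standard definitions, not the notebook code, which would more likely call scikit-learn's `f1_score` and `cohen_kappa_score`.

```python
from collections import Counter
from typing import List, Sequence

def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores: List[float] = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def cohen_kappa(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """(observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(y_true)
    p_o = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)  # undefined if chance agreement is exactly 1

y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
f1 = macro_f1(y_true, y_pred)        # (2/3 + 4/5) / 2
kappa = cohen_kappa(y_true, y_pred)  # (0.75 - 0.5) / (1 - 0.5)
```

In the toy example, accuracy is 0.75 but kappa is only 0.5, because half of the agreement would be expected by chance given the label frequencies.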
**Visual (EDA plots):**
- `eda_d1.png`, `eda_d2.png`, `eda_d3.png`: class distributions and text length histograms
**Manual (UI):**
- Sample text buttons in the live demo for smoke-testing the prediction pipeline
- All 4 model predictions + confidence bars shown simultaneously
---
## 13. Key Findings & Anomalies
1. **SVM beats XLM-RoBERTa on D1.** Short tweets (31.4 words on average) don't provide enough context for transformer embeddings to outperform TF-IDF bag-of-words features. Classical ML is not always inferior to modern deep learning.
2. **D3 text length asymmetry.** Suicide posts (200.8 words on average) are 3.2× longer than non-suicidal posts (62.2 words). This drove the `max_length=256` decision for the D3 transformer.
3. **Near-perfect D2 score (F1=0.9993).** Binary depression on tweets is almost perfectly separable with XLM-RoBERTa, likely due to strong lexical signals in the dataset.
4. **Parallel architecture prevents missed cases.** Sequential gating (e.g., only checking for suicide risk if depression is detected) would miss suicidal ideation in people who show no depression markers. All 3 tasks always run.
5. **Confidence computation differs by model type.** SVM uses `softmax(decision_function())` because `LinearSVC` does not implement `predict_proba`. All outputs are normalized to 0–1 for UI consistency.
6. **Transformer weights in safetensors format.** This is a newer format than PyTorch `.bin`. It stores raw tensors rather than pickled objects, so loading weights is not exposed to pickle deserialization attacks.
7. **SMOTE only on training data.** Oversampling is applied only to training splits. Test sets remain unmodified to reflect real-world class distributions.
8. **Random Forest technically present but never loaded.** The `.pkl` files exist in `models/classical/`, but `predict.py` has no code path that loads them.
---
## 14. Research Context
**Project:** NCI H9DAI, Data Analytics for Artificial Intelligence
**Degree:** MSc Artificial Intelligence
**Year:** 2026
**Methodology:** CRISP-DM (6 stages: Business Understanding → Data Understanding → Data Preparation → Modelling → Evaluation → Deployment)
**Baseline paper:** Tumaliuan et al. (2024), depression detection on Filipino Twitter, F1=0.81
**Research Questions:**
- RQ1: Can classical ML (SVM, LR, XGBoost) exceed the 0.81 baseline?
- RQ2: Can XLM-RoBERTa further improve on classical ML?
- RQ3: Does SMOTE balancing improve F1 on imbalanced datasets?
- RQ4: Does the parallel architecture catch cases a sequential pipeline would miss?
**Datasets used (English/multilingual, broader than baseline's Filipino-only scope):**
- D1: Zenodo 14233292 (Nusrat et al.)
- D2: Kaggle (albertobellardini)
- D3: Kaggle (nikhileswarkomati)
**Key contributions over baseline:**
- Multi-dataset parallel evaluation (vs. single dataset)
- XLM-RoBERTa multilingual transformer (vs. no transformer)
- SMOTE balancing (vs. no balancing strategy)
- Cohen's Kappa reporting (vs. accuracy/F1 only)
- Explainable per-model confidence scores in UI