# MindScan — Complete Project Reference ### NCI H9DAI · MSc Artificial Intelligence · 2026 > **Disclaimer:** This is a research prototype built for academic coursework. It is **not** a clinical tool and must never be used for actual medical diagnosis or mental health assessment. --- ## Table of Contents 1. [Project Purpose](#1-project-purpose) 2. [Tech Stack](#2-tech-stack) 3. [Directory Structure](#3-directory-structure) 4. [Core Features](#4-core-features) 5. [Components & Modules](#5-components--modules) 6. [Data Models & API Contract](#6-data-models--api-contract) 7. [API Endpoints](#7-api-endpoints) 8. [Configuration & Setup](#8-configuration--setup) 9. [AI / ML Architecture](#9-ai--ml-architecture) 10. [Entry Points & Running the App](#10-entry-points--running-the-app) 11. [Dependencies](#11-dependencies) 12. [Testing & Evaluation](#12-testing--evaluation) 13. [Key Findings & Anomalies](#13-key-findings--anomalies) 14. [Research Context](#14-research-context) --- ## 1. Project Purpose MindScan is a multi-model mental health text analysis system. It runs **12 machine learning classifiers simultaneously** across **3 independent datasets**, returning three clinically distinct assessments from a single text input: | Assessment | Task | Classes | |---|---|---| | Depression Type | Multi-class classification | postpartum, major depressive, bipolar, psychotic, no depression, atypical | | Binary Depression | Binary classification | Depressed / Not Depressed | | Suicide Risk | Binary classification | Suicide Risk / No Suicide Risk | **Research goal:** Extend Tumaliuan et al. (2024) with modern transformer embeddings (XLM-RoBERTa), classical ML gold standards (SVM), and SMOTE balancing — achieving a **+12.7% F1 improvement** (0.81 → 0.9269) over the baseline. **Key architectural decision — parallel, not sequential:** All 3 datasets run independently. Suicidal ideation can exist without depression markers; a sequential pipeline would gate out those cases entirely. Research Question 4 (RQ4) explicitly tests whether the parallel design catches cases that sequential would miss. --- ## 2. Tech Stack | Layer | Technology | Version | |---|---|---| | Web framework | Flask | 3.0.3 | | Classical ML | scikit-learn | 1.6.1 | | Gradient boosting | XGBoost | 2.0.3 | | Transformer models | HuggingFace Transformers | 4.41.2 | | Deep learning runtime | PyTorch | 2.3.0 | | Model serialization | joblib | 1.4.2 | | Numerical ops | NumPy | 1.26.4 | | Frontend | HTML5 + vanilla JS + CSS3 | — | | Fonts | Instrument Serif, Geist, DM Mono | — | No database. No frontend framework. All model state is held in memory after startup. --- ## 3. Directory Structure ``` MindScan/ ├── app.py # Flask entry point (94 lines) ├── predict.py # Core prediction logic (303 lines) ├── requirements.txt # Python dependencies (7 packages) ├── README.md # Quick-start guide ├── ABOUT.md # This file — full project reference ├── .gitignore # models/ excluded (too large for git) │ ├── templates/ │ └── index.html # Single-page web UI │ ├── notebooks/ │ ├── DA_Notebook_One.ipynb # Classical model training (2,269 lines) │ └── DA_2_Notebook.ipynb # XLM-RoBERTa training (13,178 lines) │ ├── models/ │ ├── classical/ │ │ ├── le_d1.pkl # LabelEncoder — D1 (543 bytes) │ │ ├── le_d2.pkl # LabelEncoder — D2 │ │ ├── le_d3.pkl # LabelEncoder — D3 │ │ ├── tfidf_d1.pkl # TF-IDF vectorizer — D1 (1.4 MB, 34,615 features) │ │ ├── tfidf_d2.pkl # TF-IDF vectorizer — D2 (569 KB, 50,000 features) │ │ ├── tfidf_d3.pkl # TF-IDF vectorizer — D3 (2.3 MB, 60,000 features) │ │ ├── logistic_regression_d1.pkl # LR — D1 (1.6 MB) │ │ ├── logistic_regression_d2.pkl # LR — D2 (120 KB) │ │ ├── logistic_regression_d3.pkl # LR — D3 (470 KB) │ │ ├── svm_d1.pkl # LinearSVC — D1 (1.6 MB) │ │ ├── svm_d2.pkl # LinearSVC — D2 (120 KB) │ │ ├── svm_d3.pkl # LinearSVC — D3 (470 KB) │ │ ├── xgboost_d1.pkl # XGBoost — D1 (3.1 MB) │ │ ├── xgboost_d2.pkl # XGBoost — D2 (362 KB) │ │ ├── xgboost_d3.pkl # XGBoost — D3 (702 KB) │ │ ├── random_forest_d1/d2/d3.pkl # RF — NOT deployed (241 MB + 72 MB + 334 MB) │ │ ├── classical_results.csv # Performance metrics table │ │ └── *.png # Confusion matrices + EDA plots (16 images) │ │ │ └── transformers/ │ ├── xlmr_d1_final/ # Fine-tuned XLM-RoBERTa — D1 (1.1 GB) │ │ ├── config.json # Model architecture config │ │ ├── model.safetensors # Weights (1.1 GB) │ │ ├── tokenizer.json # BPE tokenizer (17 MB) │ │ └── tokenizer_config.json # Tokenizer metadata │ ├── xlmr_d2_final/ # Fine-tuned XLM-RoBERTa — D2 (1.1 GB) │ └── xlmr_d3_final/ # Fine-tuned XLM-RoBERTa — D3 (1.1 GB) │ ├── venv/ # Python virtual environment └── .venv/ # Backup venv (both in .gitignore) ``` **Total disk usage:** ~3.2 GB (dominated by 3 × 1.1 GB transformer weights) --- ## 4. Core Features ### Four Models Per Dataset (12 Total) Each of the 3 datasets is evaluated by 4 independent models. All run on every request: 1. Logistic Regression (TF-IDF input) 2. SVM / LinearSVC (TF-IDF input) 3. XGBoost (TF-IDF input) 4. XLM-RoBERTa fine-tuned (raw text input) ### Risk Aggregation - If **3 or more of the 4 Dataset 3 models** flag suicide risk → `risk_flag = true` - UI renders a red danger banner - Response includes `"suicide_votes": "X/4 models flagged suicide risk"` ### Text Preprocessing Pipeline Applied to all input before TF-IDF vectorization (raw text passed to transformers): ``` lowercase → remove URLs (http/www/https) → strip @mentions → remove # symbols (word kept) → delete punctuation → normalize whitespace ``` ### UI Features (index.html) - Live demo textarea with 5000-character limit and real-time counter - Sample text buttons for quick testing - Results display: winner card + 4 model confidence bars per dataset - Class probability breakdown (expandable) - Risk flag banner (red = danger, green = safe) - CRISP-DM interactive timeline (6 stages, collapsible detail panels) - Dataset explorer with class distribution bars - Model card grid with F1 scores - Project folder tree with file detail pane - Animated stat counters in hero section - Comparison panel vs Tumaliuan et al. (2024) baseline --- ## 5. Components & Modules ### `app.py` (94 lines) Flask application. Responsibilities: - Initializes Flask app - Calls `load_all_models()` at startup (blocks until complete) - Defines 3 routes: `/`, `/predict`, `/health` - Input validation: max 5000 chars, non-empty, valid JSON - Prints startup progress with emoji checkmarks to console - Serves on `0.0.0.0:5000` (accessible on local network) ### `predict.py` (303 lines) Core prediction engine. Key functions: | Function | Purpose | |---|---| | `load_all_models()` | Loads all 12 models + encoders + tokenizer into `_models` global dict | | `clean_text(text)` | Regex-based text cleaning (same logic used in both training notebooks) | | `predict_classical(text, ds)` | TF-IDF vectorization + sklearn predict / decision_function | | `predict_transformer(text, ds)` | Tokenization → forward pass → softmax probabilities | | `predict_all(raw_text)` | Main orchestrator: cleans text, runs all 12 models, returns full result dict | **Confidence normalization:** All models normalize to 0–1: - LR / XGBoost: `predict_proba()` (native) - SVM: `softmax(decision_function())` (LinearSVC has no proba by default) - Transformer: `softmax(logits)` **D2 label mapping:** Raw integer labels (0, 1) mapped to `"Not Depressed"` / `"Depressed"`. Handles both `str` and `int` types for robustness. ### `templates/index.html` Single-page application. All UI logic in vanilla JS: - `fetch('/predict', { method: 'POST', ... })` — AJAX prediction call - Tab switching, progress bar animation, accordion expand/collapse - Counter animations for stats in hero section - No build step, no bundler, no external JS framework --- ## 6. Data Models & API Contract ### Request ```json POST /predict Content-Type: application/json { "text": "string — max 5000 characters" } ``` ### Response ```json { "dataset1": { "task": "Depression Type (6 Classes)", "models": { "Logistic Regression": { "label": "postpartum", "confidence": 0.958 }, "SVM": { "label": "postpartum", "confidence": 0.828 }, "XGBoost": { "label": "postpartum", "confidence": 0.999 }, "XLM-RoBERTa": { "label": "postpartum", "confidence": 0.997 } }, "winner_model": "XGBoost", "winner_prediction": "postpartum", "winner_confidence": 0.999, "class_probs": { "postpartum": 0.997, "bipolar": 0.001, "major depressive": 0.001, "psychotic": 0.0, "no depression": 0.0, "atypical": 0.001 } }, "dataset2": { "task": "Binary Depression Detection", "models": { ... }, "winner_model": "XLM-RoBERTa", "winner_prediction": "Depressed", "winner_confidence": 0.998, "class_probs": { "Depressed": 0.998, "Not Depressed": 0.002 } }, "dataset3": { "task": "Suicide Risk Assessment", "models": { ... }, "winner_model": "XLM-RoBERTa", "winner_prediction": "Suicide Risk", "winner_confidence": 0.993, "class_probs": { "Suicide Risk": 0.993, "No Suicide Risk": 0.007 } }, "risk_flag": true, "suicide_votes": "4/4 models flagged suicide risk", "processing_time_ms": 2341 } ``` ### Internal Model State (`_models` dict in predict.py) | Key | Type | Description | |---|---|---| | `le_d1/d2/d3` | `LabelEncoder` | Decodes integer predictions to class names | | `tfidf_d1/d2/d3` | `TfidfVectorizer` | Converts cleaned text to sparse feature vectors | | `logistic_regression_d1/d2/d3` | `LogisticRegression` | Linear baseline | | `svm_d1/d2/d3` | `LinearSVC` | SVM classifier | | `xgboost_d1/d2/d3` | `XGBClassifier` | Gradient boosting | | `tokenizer` | `XLMRobertaTokenizer` | Shared BPE tokenizer (all 3 transformer models) | | `xlmr_d1/d2/d3` | `XLMRobertaForSequenceClassification` | Fine-tuned transformer | | `xlmr_d1/d2/d3_len` | `int` | Max token length: 128 / 128 / 256 | | `device` | `str` | `'cuda'` or `'cpu'` | --- ## 7. API Endpoints ### `GET /` Returns `index.html`. No parameters. ### `POST /predict` | Scenario | HTTP Status | Response | |---|---|---| | Success | 200 | Full prediction JSON (see above) | | Missing `text` field | 400 | `{ "error": "..." }` | | Empty text | 400 | `{ "error": "..." }` | | Text > 5000 chars | 400 | `{ "error": "..." }` | | Models not loaded yet | 503 | `{ "error": "..." }` | | Prediction exception | 500 | `{ "error": "..." }` | **Typical latency:** ~2–3 seconds on CPU (XLM-RoBERTa dominates inference time) ### `GET /health` ```json { "status": "ok", "models_ready": true } ``` Use for polling during the ~30-second startup window. --- ## 8. Configuration & Setup ### Environment Variables None required. All paths computed relative to `app.py` using `os.path.dirname(__file__)`. ### Setup Steps ```bash # 1. Download models from Google Drive → place in models/classical/ and models/transformers/ # 2. Create virtual environment python -m venv venv source venv/bin/activate # Mac / Linux venv\Scripts\activate # Windows # 3. Install dependencies pip install -r requirements.txt # 4. Start server python app.py # 5. Open browser # http://localhost:5000 ``` ### Expected Startup Console Output ``` ======================================================= MindScan — Starting up ======================================================= Loading models... (XLM-RoBERTa takes ~30s on CPU) ✓ Loaded encoders/tfidf for d1 ✓ Loaded encoders/tfidf for d2 ✓ Loaded encoders/tfidf for d3 ✓ Loaded logistic_regression_d1 ✓ Loaded svm_d1 ✓ Loaded xgboost_d1 ... (repeated for d2, d3) ✓ Using device: cpu ✓ Tokeniser loaded ✓ Loaded XLM-RoBERTa d1 (max_length=128) ✓ Loaded XLM-RoBERTa d2 (max_length=128) ✓ Loaded XLM-RoBERTa d3 (max_length=256) ✅ All models ready 🌐 Open: http://localhost:5000 ======================================================= ``` ### `.gitignore` Exclusions - `models/` — All trained model files (too large for git, download separately) - `venv/`, `.venv/` — Virtual environments - `__pycache__/`, `*.pyc`, `*.pyo` - `.ipynb_checkpoints/` - `.DS_Store`, `Thumbs.db` --- ## 9. AI / ML Architecture ### The Three Datasets | | D1 | D2 | D3 | |---|---|---|---| | **Source** | Nusrat et al. (2024) — Zenodo 14233292 | albertobellardini — Kaggle | nikhileswarkomati — Kaggle | | **Platform** | Twitter | Twitter | Reddit | | **Size** | 14,983 tweets | 10,314 tweets | 50,000 posts | | **Task** | 6-class depression type | Binary depression | Binary suicide risk | | **Avg text length** | 31.4 words | ~30 words | 62–200 words | | **Class balance** | Imbalanced (1.89×) | Severely imbalanced (3.46×) | Balanced (1.0×) | | **SMOTE applied** | Yes — 11,986 → 17,982 | Yes — 8,251 → 12,800 | No | All datasets use stratified 80/20 train/test split, `random_state=42`. Test sets are never touched by SMOTE (realistic evaluation). ### TF-IDF Feature Dimensions | Dataset | Features | |---|---| | D1 | 34,615 | | D2 | 50,000 | | D3 | 60,000 | ### XLM-RoBERTa Architecture - **Base model:** `xlm-roberta-base` - **Parameters:** 278 million - **Layers:** 12 transformer layers, 12 attention heads - **Pre-training data:** 2.5 TB of text across 100 languages - **Fine-tuning:** 3 epochs on Google Colab T4 GPU - **Weight format:** `safetensors` (more secure and efficient than `.bin`) - **Max token lengths:** D1 = 128, D2 = 128, D3 = 256 - D3 uses 256 because Reddit posts average 200.8 words for the suicide class (3.2× longer than non-suicidal posts at 62.2 words) - **Tokenizer:** Shared single instance across all 3 models (not duplicated) ### Why SVM Beats Transformer on D1 XLM-RoBERTa's contextual embeddings require sufficient token sequence length to demonstrate advantage over bag-of-words TF-IDF. D1 tweets average only 31.4 words — too short for context to matter. SVM achieves F1=0.9269 vs XLM-RoBERTa's lower score on D1. The transformer's advantage grows with text length and dominates on D3 (avg 200+ words). ### Why Random Forest Was Excluded from Deployment | Model | D1 F1 | D3 F1 | Aggregate Size | |---|---|---|---| | XGBoost | competitive | competitive | ~4.2 MB | | Random Forest | worst on D1 | worst on D3 | **647 MB** | Size penalty not justified by performance. Random Forest `.pkl` files remain in `models/classical/` but are never loaded by `predict.py`. ### Performance Results | Dataset | Model | Macro F1 | Cohen's κ | |---|---|---|---| | D1 — Depression Type | **SVM** | **0.9269** | **0.9072** | | D1 — Depression Type | XGBoost | ~0.90 | — | | D1 — Depression Type | XLM-RoBERTa | lower | — | | D2 — Binary Depression | **XLM-RoBERTa** | **0.9993** | **0.9986** | | D3 — Suicide Risk | **XLM-RoBERTa** | **0.9810** | **0.9620** | | Baseline (Tumaliuan 2024) | — | 0.81 | — | **Improvement over baseline: +12.7%** --- ## 10. Entry Points & Running the App ### Production (Local) ```bash python app.py ``` Opens on `http://localhost:5000`. Server also accessible on local network via `http://:5000`. ### Training (Notebooks — Google Colab Only) | Notebook | Purpose | Runtime Required | |---|---|---| | `notebooks/DA_Notebook_One.ipynb` | Train LR, SVM, XGBoost on all 3 datasets; generate metrics CSV and confusion matrix PNGs | CPU (Colab free tier) | | `notebooks/DA_2_Notebook.ipynb` | Fine-tune XLM-RoBERTa on all 3 datasets; run full model comparison | **T4 GPU** (Colab) | Both notebooks save outputs to Google Drive at `MindScan_Models/`. --- ## 11. Dependencies ``` flask==3.0.3 Web framework + routing scikit-learn==1.6.1 LR, LinearSVC, TfidfVectorizer, LabelEncoder, SMOTE metrics xgboost==2.0.3 Gradient boosting classifier transformers==4.41.2 XLM-RoBERTa model + tokenizer (HuggingFace) torch==2.3.0 PyTorch runtime (GPU optional — CUDA auto-detected) joblib==1.4.2 Pickle serialization for large sklearn objects numpy==1.26.4 Numerical operations, softmax computation ``` No Node.js / npm dependencies. Pure Python backend, vanilla JS frontend (no build step). --- ## 12. Testing & Evaluation ### No Automated Test Suite The project has no `pytest`, `unittest`, or CI/CD pipeline. Evaluation is: **Quantitative (offline, notebook):** - Macro F1 score (primary metric — handles class imbalance) - Cohen's Kappa (measures agreement beyond chance — reported for D1) - Accuracy - Confusion matrices (saved as PNG to `models/classical/`) - `classical_results.csv` — full metrics table for all classical models **Visual (EDA plots):** - `eda_d1.png`, `eda_d2.png`, `eda_d3.png` — class distributions and text length histograms **Manual (UI):** - Sample text buttons in the live demo for smoke-testing the prediction pipeline - All 4 model predictions + confidence bars shown simultaneously --- ## 13. Key Findings & Anomalies 1. **SVM beats XLM-RoBERTa on D1** — Short tweets (31.4 words avg) don't provide enough context for transformer embeddings to outperform TF-IDF bag-of-words. Classical ML is not always inferior to modern deep learning. 2. **D3 text length asymmetry** — Suicide posts (200.8 words avg) are 3.2× longer than non-suicidal posts (62.2 words). This drove the max_length=256 decision for the D3 transformer. 3. **Near-perfect D2 score (F1=0.9993)** — Binary depression on tweets is almost perfectly separable with XLM-RoBERTa, likely due to strong lexical signals in the dataset. 4. **Parallel architecture prevents missed cases** — Sequential gating (e.g., only check suicide if depression detected) would miss suicidal ideation in people who show no depression markers. All 3 tasks always run. 5. **Confidence computation differs by model type** — SVM uses `softmax(decision_function())` because `LinearSVC` lacks native probability calibration. All outputs are normalized to 0–1 for UI consistency. 6. **Transformer weights in safetensors format** — Newer, more secure format vs. PyTorch `.bin`. Resists pickle deserialization attacks. 7. **SMOTE only on training data** — Oversampling is applied only to training splits. Test sets remain unmodified to reflect real-world class distributions. 8. **Random Forest technically present but never loaded** — The `.pkl` files exist in `models/classical/` but `predict.py` has no code path that loads them. --- ## 14. Research Context **Project:** NCI H9DAI — Data Analytics for Artificial Intelligence **Degree:** MSc Artificial Intelligence **Year:** 2026 **Methodology:** CRISP-DM (6 stages: Business Understanding → Data Understanding → Data Preparation → Modelling → Evaluation → Deployment) **Baseline paper:** Tumaliuan et al. (2024) — depression detection on Filipino Twitter, F1=0.81 **Research Questions:** - RQ1: Can classical ML (SVM, LR, XGBoost) exceed the 0.81 baseline? - RQ2: Can XLM-RoBERTa further improve on classical ML? - RQ3: Does SMOTE balancing improve F1 on imbalanced datasets? - RQ4: Does the parallel architecture catch cases a sequential pipeline would miss? **Datasets used (English/multilingual, broader than baseline's Filipino-only scope):** - D1: Zenodo 14233292 (Nusrat et al.) - D2: Kaggle — albertobellardini - D3: Kaggle — nikhileswarkomati **Key contributions over baseline:** - Multi-dataset parallel evaluation (vs. single dataset) - XLM-RoBERTa multilingual transformer (vs. no transformer) - SMOTE balancing (vs. no balancing strategy) - Cohen's Kappa reporting (vs. accuracy/F1 only) - Explainable per-model confidence scores in UI --- *NCI H9DAI · Data Analytics for Artificial Intelligence · MSc Artificial Intelligence · 2026*