Brain / README.md

Upload folder using huggingface_hub

016c645 verified about 1 month ago

4.38 kB

	# MindScan — Mental Health Detection System
	### NCI H9DAI Research Project 2026 · MSc Artificial Intelligence

	A multi-model mental health text analysis system that runs 12 ML classifiers across 3 datasets simultaneously, returning depression type, binary depression likelihood, and suicide risk scores for any input text.

	---

	## Project Structure

	```
	MindScan/
	├── app.py Flask backend — start here
	├── predict.py Prediction logic (all 12 models)
	├── requirements.txt Python dependencies
	├── README.md This file
	├── templates/
	│ └── index.html UI (served by Flask at localhost:5000)
	├── models/
	│ ├── classical/ Download from Google Drive (see below)
	│ └── transformers/ Download from Google Drive (see below)
	└── notebooks/
	├── DA_Notebook_One.ipynb Classical model training
	└── DA_2_Notebook.ipynb XLM-RoBERTa + comparison
	```

	---
	## Github Link
	https://github.com/Amod069/MindScan



	## Setup

	### 1. Download model files from Google Drive
	Download `MindScan_Models/` from Google Drive and place the contents like this:
	https://drive.google.com/drive/folders/16jfsPUcdekDWqtk4evTjQHQO2YoKJdpQ?usp=sharing

	```
	models/
	├── classical/
	│ ├── le_d1.pkl, le_d2.pkl, le_d3.pkl
	│ ├── tfidf_d1.pkl, tfidf_d2.pkl, tfidf_d3.pkl
	│ ├── logistic_regression_d1.pkl, _d2.pkl, _d3.pkl
	│ ├── svm_d1.pkl, _d2.pkl, _d3.pkl
	│ └── xgboost_d1.pkl, _d2.pkl, _d3.pkl
	└── transformers/
	├── xlmr_d1_final/
	├── xlmr_d2_final/
	└── xlmr_d3_final/
	```

	### 2. Create Python environment
	```bash
	python -m venv venv

	# Mac/Linux
	source venv/bin/activate

	# Windows
	venv\Scripts\activate
	```

	### 3. Install dependencies
	```bash
	pip install -r requirements.txt
	```

	### 4. Run the server
	```bash
	python app.py
	```

	### 5. Open the UI
	```
	http://localhost:5000
	```

	Note: First startup takes ~30 seconds while XLM-RoBERTa models load into memory.

	---

	## The 3 Datasets

	\| \| Dataset \| Source \| Size \| Task \|
	\|---\|---\|---\|---\|---\|
	\| D1 \| Nusrat et al. (2024) \| Zenodo 14233292 \| 14,983 tweets \| 6-class depression type \|
	\| D2 \| albertobellardini \| Kaggle \| 10,314 tweets \| Binary depression \|
	\| D3 \| nikhileswarkomati \| Kaggle \| 50,000 Reddit posts \| Binary suicide risk \|

	## The 4 Models (per dataset = 12 total)

	1. Logistic Regression — simple linear baseline
	2. SVM (LinearSVC) — classical NLP gold standard
	3. XGBoost — gradient boosting
	4. XLM-RoBERTa — transformer, contextual embeddings

	Note: Random Forest excluded from deployment (646 MB files, worst performer on D1/D3).

	---

	## Real Results

	\| Dataset \| Best Model \| Macro F1 \| Cohen's Kappa \|
	\|---\|---\|---\|---\|
	\| D1 Depression Type \| SVM \| 0.9269 \| 0.9072 \|
	\| D2 Binary Depression \| XLM-RoBERTa \| 0.9993 \| 0.9986 \|
	\| D3 Suicide Risk \| XLM-RoBERTa \| 0.9810 \| 0.9620 \|

	Key finding: SVM outperforms XLM-RoBERTa on 6-class psychiatric classification (D1). All models exceed the Tumaliuan et al. (2024) benchmark of F1=0.81.

	---

	## API

	POST /predict
	```json
	// Request
	{ "text": "your text here" }

	// Response
	{
	"dataset1": {
	"task": "Depression Type (6 Classes)",
	"models": {
	"Logistic Regression": { "label": "postpartum", "confidence": 0.958 },
	"SVM": { "label": "postpartum", "confidence": 0.828 },
	"XGBoost": { "label": "postpartum", "confidence": 0.999 },
	"XLM-RoBERTa": { "label": "postpartum", "confidence": 0.997 }
	},
	"winner_model": "XGBoost",
	"winner_prediction": "postpartum",
	"winner_confidence": 0.999,
	"class_probs": { "postpartum": 0.997, "bipolar": 0.001, ... }
	},
	"dataset2": { ... },
	"dataset3": { ... },
	"risk_flag": false,
	"suicide_votes": "0/4 models flagged suicide risk",
	"processing_time_ms": 2341
	}
	```

	GET /health
	```json
	{ "status": "ok", "models_ready": true }
	```

	---

	## Disclaimer

	This system is a research prototype built for academic coursework. It is not a clinical tool and must never be used for actual medical diagnosis or mental health assessment. All datasets are from publicly available sources for research purposes only.

	---

	NCI H9DAI · Data Analytics for Artificial Intelligence · 2026