IamGrooooot committed · Commit 53a6def · 0 Parent(s)

Model E: Unsupervised PCA + clustering risk stratification

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. MODEL_CARD.md +199 -0
  2. README.md +87 -0
  3. config.json +8 -0
  4. documentation/README.md +151 -0
  5. pipeline.yml +29 -0
  6. requirements.txt +1 -0
  7. setup.cfg +7 -0
  8. training/README.md +6 -0
  9. training/src/README.md +8 -0
  10. training/src/modelling/__pycache__/run_model.cpython-313.pyc +0 -0
  11. training/src/modelling/additional_code_onevsone_onevsrest_approaches.py +346 -0
  12. training/src/modelling/dtc_params.json +3 -0
  13. training/src/modelling/event_calculations.py +183 -0
  14. training/src/modelling/hierarchical_params.json +4 -0
  15. training/src/modelling/kmeans_params.json +4 -0
  16. training/src/modelling/one_vs_rest_BLR.py +377 -0
  17. training/src/modelling/one_vs_rest_DTC.py +380 -0
  18. training/src/modelling/predict_clusters.py +70 -0
  19. training/src/modelling/run_model.py +355 -0
  20. training/src/modelling/validate.py +260 -0
  21. training/src/processing/README.md +24 -0
  22. training/src/processing/__init__.py +1 -0
  23. training/src/processing/mappings/Comorbidity feature review for models & clin summary update v2 May 2021.xlsx +0 -0
  24. training/src/processing/mappings/README.md +7 -0
  25. training/src/processing/mappings/diag_copd_resp_desc.json +5 -0
  26. training/src/processing/mappings/inhaler_mapping.json +55 -0
  27. training/src/processing/mappings/test_mapping.json +1 -0
  28. training/src/processing/misc/process_gples.py +73 -0
  29. training/src/processing/misc/process_validation_adm.py +28 -0
  30. training/src/processing/misc/process_validation_presc.py +20 -0
  31. training/src/processing/process_admissions.py +153 -0
  32. training/src/processing/process_comorbidities.py +161 -0
  33. training/src/processing/process_demographics.py +74 -0
  34. training/src/processing/process_labs.py +247 -0
  35. training/src/processing/process_prescribing.py +145 -0
  36. training/src/processing/utils/README.md +11 -0
  37. training/src/processing/utils/__init__.py +1 -0
  38. training/src/processing/utils/adm_common.py +77 -0
  39. training/src/processing/utils/adm_processing.py +146 -0
  40. training/src/processing/utils/adm_reduction.py +65 -0
  41. training/src/processing/utils/common.py +132 -0
  42. training/src/processing/utils/comorb_processing.py +20 -0
  43. training/src/processing/utils/labs_processing.py +16 -0
  44. training/src/processing/utils/presc_common.py +68 -0
  45. training/src/reduction/README.md +12 -0
  46. training/src/reduction/__init__.py +1 -0
  47. training/src/reduction/clean_and_scale_test.py +173 -0
  48. training/src/reduction/clean_and_scale_train.py +171 -0
  49. training/src/reduction/combine.py +217 -0
  50. training/src/reduction/post_proc_reduction.py +37 -0
MODEL_CARD.md ADDED
@@ -0,0 +1,199 @@
+ ---
+ language: en
+ license: apache-2.0
+ tags:
+ - healthcare
+ - ehr
+ - copd
+ - clinical-risk
+ - tabular
+ - scikit-learn
+ - clustering
+ - unsupervised
+ pipeline_tag: tabular-classification
+ library_name: sklearn
+ ---
+
+ # COPD Open Models — Model E (Unsupervised Risk Stratification)
+
+ ## Model Details
+
+ Model E groups COPD patients into risk clusters using **PCA dimensionality reduction** followed by **KMeans or hierarchical clustering**. Clusters are designed to support risk stratification — identifying whether patients are receiving the appropriate level of care for their apparent risk. Clusters update with new data to track how patient risk evolves over time.
+
+ ### Key Characteristics
+
+ - **Unsupervised** — no target labels required; clusters are derived from patient feature similarity alone.
+ - **Two-stage PCA** — Stage 1 selects features explaining ≥90% variance; Stage 2 reduces to 3 components explaining ≥80% variance.
+ - **Modular pipeline** — processing, reduction, and modelling stages are fully separated and independently reusable.
+ - Training code is fully decoupled from cloud infrastructure — runs locally with no Azure dependencies.
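The two-stage reduction can be sketched with scikit-learn's variance-threshold form of `PCA`. This is an illustrative approximation on random data, not the repository's actual reduction code: the 90% threshold and the 3-component target come from the bullets above, while the data and shapes here are made up.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 70))  # placeholder for the ~70 processed features

# Stage 1: retain enough components to explain >= 90% of the variance
stage1 = PCA(n_components=0.90).fit(X)
X_reduced = stage1.transform(X)

# Stage 2: reduce further to 3 components for clustering and 3D plotting
stage2 = PCA(n_components=3).fit(X_reduced)
X_3d = stage2.transform(X_reduced)

print(X_3d.shape)  # (500, 3)
```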
+
+ > **Note:** This repository contains no real patient-level data. All included data files are synthetic or example data for pipeline validation.
+
+ ### Model Type
+
+ Unsupervised clustering (PCA + KMeans / Agglomerative Hierarchical Clustering), validated via Decision Tree Classifier on cluster labels.
+
+ ### Release Notes
+
+ - **Phase 1 (current):** Models C, E, H published as the initial "COPD Open Models" collection.
+ - **Phase 2 (planned):** Additional models may follow after codebase sanitisation.
+
+ ---
+
+ ## Intended Use
+
+ This model and code are published as **reference implementations** for research, education, and benchmarking on COPD risk stratification tasks.
+
+ ### Intended Users
+
+ - ML practitioners exploring unsupervised healthcare ML pipelines
+ - Researchers comparing dimensionality reduction and clustering approaches for EHR data
+ - Developers building internal prototypes (non-clinical)
+
+ ### Out-of-Scope Uses
+
+ - **Not** for clinical decision-making, triage, diagnosis, or treatment planning.
+ - **Not** a substitute for clinical judgement or validated clinical tools.
+ - Do **not** deploy in healthcare settings without an appropriate regulatory, clinical safety, and information governance framework.
+
+ ### Regulatory Considerations (SaMD)
+
+ Regulatory status for software depends on the intended purpose expressed in documentation, labelling, and promotional materials. Downstream users integrating or deploying this model should determine whether their implementation qualifies as Software as a Medical Device (SaMD) and identify the legal "manufacturer" responsible for compliance and post-market obligations.
+
+ ---
+
+ ## Training Data
+
+ - **Source:** NHS EHR-derived datasets (training performed on controlled datasets; not distributed here).
+ - **Data available in this repo:** Synthetic/example datasets only.
+ - **Cohort:** COPD patients from hospital admissions, laboratory, pharmacy, and demographic records.
+ - **Data split:** 60% train / 15% validation / 20% test (random_state=42). RECEIVER and Scale-Up cohorts held out as external validation sets.
+
+ ### Features (~70 after processing)
+
+ | Category | Features |
+ |----------|----------|
+ | **Admissions** | adm_per_year, total_hosp_days, mean_los, copd_per_year, resp_per_year, copd_resp_per_year, days_since_copd_resp, days_since_adm, days_since_rescue, anxiety_depression_per_year |
+ | **Demographics** | age, sex_bin, ethnicity (7 categories), marital_status (one-hot), SIMD quintile/decile/vigintile |
+ | **Laboratory (2-year medians)** | 26 lab tests: albumin, ALT, AST, alkaline phosphatase, basophils, CRP, chloride, creatinine, eosinophils, eGFR, haematocrit, haemoglobin, lymphocytes, MCH, MCV, monocytes, neutrophils, platelets, potassium, RBC, sodium, total bilirubin, urea, WBC, neutrophil-to-lymphocyte ratio; plus labs_per_year |
+ | **Prescribing** | single/double/triple_inhaler_per_year, salbutamol_per_year, rescue_meds_per_year, anxiety_depression_presc_per_year, presc_per_year |
+ | **Comorbidities** | 30 binary flags (cardiovascular, respiratory, metabolic, oncology conditions), comorb_per_year |
+
+ ### Data Preprocessing
+
+ 1. **Imputation** — grouped median imputation by (age_bin × sex_bin). Age binned into 10 quantiles. Days-since features for patients with <5 years in cohort filled with dataset maximum.
+ 2. **Scaling** — MinMaxScaler to [0, 1], fit on training data only; test/validation/external sets transformed using saved scaler.
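Both preprocessing steps can be sketched on a toy frame. The column name `albumin_med_2yr` and the values are illustrative; only the (age_bin × sex_bin) grouping scheme and the fit-on-train-only scaling convention come from the steps above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for the processed feature table
df = pd.DataFrame({
    "age_bin": [1, 1, 2, 2, 2],
    "sex_bin": [0, 0, 1, 1, 1],
    "albumin_med_2yr": [38.0, np.nan, 41.0, 39.0, np.nan],
})

# 1. Grouped median imputation by (age_bin x sex_bin)
df["albumin_med_2yr"] = (
    df.groupby(["age_bin", "sex_bin"])["albumin_med_2yr"]
      .transform(lambda s: s.fillna(s.median()))
)

# 2. Fit MinMaxScaler on training data only; reuse the fitted scaler elsewhere
scaler = MinMaxScaler().fit(df[["albumin_med_2yr"]])
scaled = scaler.transform(df[["albumin_med_2yr"]])
print(scaled.min(), scaled.max())  # 0.0 1.0
```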
+
+ ---
+
+ ## Training Procedure
+
+ ### Training Framework
+
+ - pandas, numpy, scikit-learn, matplotlib, mlflow
+
+ ### Algorithms
+
+ | Component | Algorithm | Parameters |
+ |-----------|-----------|------------|
+ | **PCA Stage 1** | sklearn.decomposition.PCA | Selects features at ≥90% cumulative explained variance |
+ | **PCA Stage 2** | sklearn.decomposition.PCA | Reduces to 3 components at ≥80% cumulative explained variance |
+ | **Clustering (primary)** | sklearn.cluster.AgglomerativeClustering | n_clusters=3, linkage='ward' |
+ | **Clustering (alternative)** | sklearn.cluster.KMeans | n_clusters=3, random_state=10 |
+ | **Validation** | sklearn.tree.DecisionTreeClassifier | random_state=42; trained on cluster labels |
+
+ ### Cluster Count Selection
+
+ Davies-Bouldin Index and Silhouette Score are calculated for **k=2 through k=9**. Both metrics are logged to MLflow for inspection before selecting the final cluster count.
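The metric sweep can be sketched as below. Synthetic blobs stand in for the PCA-reduced training data, and `print` stands in for the MLflow logging; the k = 2..9 range and the two metrics come from the paragraph above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic stand-in for the PCA-reduced training data
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

scores = {}
for k in range(2, 10):  # k = 2 through 9
    labels = KMeans(n_clusters=k, random_state=10, n_init=10).fit_predict(X)
    scores[k] = (davies_bouldin_score(X, labels), silhouette_score(X, labels))

# Lower Davies-Bouldin / higher silhouette indicate better separation
best_db = min(scores, key=lambda k: scores[k][0])
best_sil = max(scores, key=lambda k: scores[k][1])
print(best_db, best_sil)
```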
+
+ ### Evaluation Design
+
+ - Clustering quality: **Davies-Bouldin Index** (lower is better), **Silhouette Score** (higher is better).
+ - Cluster validation: **Decision Tree accuracy** on held-out data (can the clustering be reliably reproduced?).
+ - Clinical validation: cluster-level outcome rates (admissions, mortality, time-to-event), demographic breakdowns, medication therapy distributions.
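The cluster-validation step (can a Decision Tree reproduce the cluster assignments on held-out data?) can be sketched on synthetic blobs; the DTC hyperparameters match the table above, everything else is illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, _ = make_blobs(n_samples=400, centers=3, random_state=42)

# The cluster labels become the "ground truth" the tree must reproduce
labels = KMeans(n_clusters=3, random_state=10, n_init=10).fit_predict(X)

X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=42)
dtc = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
acc = accuracy_score(y_te, dtc.predict(X_te))
print(round(acc, 3))
```

A high accuracy suggests the clusters occupy separable regions of feature space; a low one suggests the partition is not reproducible from the features alone.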
+
+ ---
+
+ ## Evaluation Results
+
+ > Replace this section with measured results from your training run.
+
+ | Metric | Value | Notes |
+ |--------|-------|-------|
+ | Davies-Bouldin Index | TBD | Lower is better |
+ | Silhouette Score | TBD | Range [-1, 1], higher is better |
+ | DTC Accuracy | TBD | Decision Tree on validation set |
+ | Cluster sizes | TBD | Patient counts per cluster |
+
+ ### Caveats on Metrics
+
+ - Cluster quality metrics assess geometric separation, not clinical meaningfulness — clinical validation requires outcome analysis.
+ - Results depend on the feature set, imputation strategy, and patient population.
+
+ ---
+
+ ## Bias, Risks, and Limitations
+
+ - **Dataset shift:** EHR coding practices, care pathways, and population characteristics vary across sites and time periods.
+ - **Feature availability:** Lab test availability varies by patient; imputation strategy affects cluster assignment.
+ - **Fairness:** Cluster composition may correlate with age, sex, or deprivation — interpret with care.
+ - **Misuse risk:** Cluster labels are not validated risk scores. Using them to drive clinical action without clinical safety processes can cause harm.
+ - **Interpretability:** PCA components are linear combinations of features — clinical interpretation requires examining loadings.
+
+ ---
+
+ ## How to Use
+
+ ### Pipeline Execution Order
+
+ ```bash
+ # 1. Install dependencies
+ pip install pandas numpy scikit-learn matplotlib mlflow joblib tableone
+
+ # 2. Process raw data (run each independently)
+ python training/src/processing/process_demographics.py
+ python training/src/processing/process_admissions.py
+ python training/src/processing/process_comorbidities.py
+ python training/src/processing/process_labs.py
+ python training/src/processing/process_prescribing.py
+
+ # 3. Combine and reduce features (run in order)
+ python training/src/reduction/combine.py
+ python training/src/reduction/post_proc_reduction.py
+ python training/src/reduction/remove_ids.py
+ python training/src/reduction/clean_and_scale_train.py
+ python training/src/reduction/clean_and_scale_test.py
+
+ # 4. Run clustering model
+ python training/src/modelling/run_model.py
+
+ # 5. Predict clusters for new patients
+ python training/src/modelling/predict_clusters.py
+ ```
+
+ ### Adapting to Your Data
+
+ Replace input file paths in `config.json` with your own EHR data extracts. The pipeline expects CSVs with patient ID, admission records, lab results, pharmacy records, and demographics.
+
+ ---
+
+ ## Environmental Impact
+
+ Training computational requirements are minimal — PCA and clustering on tabular data complete in seconds on a standard laptop.
+
+ ---
+
+ ## Citation
+
+ If you use this model or code, please cite:
+
+ - This repository: *(add citation format / Zenodo DOI if minted)*
+ - Associated publications: *(forthcoming)*
+
+ ## Authors and Contributors
+
+ - **Storm ID** (maintainers)
+
+ ## License
+
+ This model and code are released under the **Apache 2.0** license.
README.md ADDED
@@ -0,0 +1,87 @@
+ # Model E
+
+ Model E is an unsupervised learning model, built with the aim of grouping patients within the COPD cohort into _k_ clusters as a means of risk stratification. These clusters are updated with new incoming data, with the cluster for each patient monitored in order to track how their risk changes over time. Results will be used to determine if patients are receiving the correct level of care for their apparent risk.
+
+ The model uses EHR data (information on patients' admission history, lab test data, prescribing data, comorbidities and demographic data) which is first pre-processed, then reduced using Principal Components Analysis (PCA). The 3D components produced by PCA are then passed to a variety of clustering algorithms, with results plotted for inspection.
+
+ ## Structure
+
+ ```
+ C:.
+ | pipeline.yml
+ | README.md
+ | requirements.txt
+ | setup.cfg
+ |
+ +---documentation
+ | README.md
+ |
+ +---training
+ | | README.md
+ | |
+ | +---src
+ | | | README.md
+ | | |
+ | | +---data
+ | | |
+ | | +---modelling
+ | | | | Cluster Data.ipynb
+ | | | |
+ | | +---processing
+ | | | | process_admissions.py
+ | | | | process_comorbidities.py
+ | | | | process_demographics.py
+ | | | | process_gples.py
+ | | | | process_labs.py
+ | | | | process_prescribing.py
+ | | | | README.md
+ | | | | __init__.py
+ | | | |
+ | | | +---mappings
+ | | | | | Comorbidity feature review for models & clin summary update v2 May 2021.xlsx
+ | | | | | diag_copd_resp_desc.json
+ | | | | | inhaler_mapping.json
+ | | | | | README.md
+ | | | | | test_mapping.json
+ | | | | |
+ | | | |
+ | | | \---utils
+ | | | adm_common.py
+ | | | adm_processing.py
+ | | | adm_reduction.py
+ | | | common.py
+ | | | comorb_processing.py
+ | | | labs_processing.py
+ | | | README.md
+ | | | __init__.py
+ | | |
+ | | \---reduction
+ | | | clean_and_scale_test.py
+ | | | clean_and_scale_train.py
+ | | | combine.py
+ | | | README.md
+ | | | remove_ids.py
+ | | | __init__.py
+ | | |
+ | | \---utils
+ | | reduction.py
+ | | __init__.py
+ | |
+ | \---tests
+ | README.md
+ |
+ \---validation
+ +---parameter_calculation
+ | CAT_MRC_score_metrics_calculation.py
+ | Fitbit_groups_calculation.py
+ | GOLD_grade_GOLD_group_calculation.py
+ | NIV_parameters_calculation.py
+ | PRO_LOGIC_exacerbation_calculations.py
+ | README.md
+ | Time_to_death_calculation.py
+ | Time_to_first_admission_calculations.py
+ | Time_to_first_event_calculations.py
+ |
+ \---risk_score_calculation
+ combined_risk_score_RC_SU1.py
+ ```
config.json ADDED
@@ -0,0 +1,8 @@
+ {
+ "date": "2019-12-31",
+ "extract_data_path": "<YOUR_DATA_PATH>/EXAMPLE_STUDY_DATA/",
+ "rec_data_path": "<YOUR_DATA_PATH>/EXAMPLE_STUDY_DATA/",
+ "sup_data_path": "<YOUR_DATA_PATH>/SU_IDs/",
+ "model_data_path": "<YOUR_DATA_PATH>/Model_E_Extracts/",
+ "model_type": "hierarchical"
+ }
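A sketch of reading this configuration from Python. The keys are taken from the file above; the loading code itself is illustrative (the repository's scripts may load the file differently), and the JSON is embedded as a string here so the snippet is self-contained.

```python
import json

# The configuration shown above, embedded for illustration
config_text = """
{
  "date": "2019-12-31",
  "extract_data_path": "<YOUR_DATA_PATH>/EXAMPLE_STUDY_DATA/",
  "rec_data_path": "<YOUR_DATA_PATH>/EXAMPLE_STUDY_DATA/",
  "sup_data_path": "<YOUR_DATA_PATH>/SU_IDs/",
  "model_data_path": "<YOUR_DATA_PATH>/Model_E_Extracts/",
  "model_type": "hierarchical"
}
"""
cfg = json.loads(config_text)
print(cfg["model_type"], cfg["date"])  # hierarchical 2019-12-31
```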
documentation/README.md ADDED
@@ -0,0 +1,151 @@
+ # Model E
+
+ ## Abstract
+
+ Model E is an unsupervised learning model, built with the aim of grouping patients within the COPD cohort into k clusters as a means of risk stratification. Clusters are updated with new incoming data, with the cluster for each patient monitored in order to track how their risk changes over time. Results will be used to determine if patients are receiving the correct level of care for their apparent risk.
+
+ ## Aims
+
+ 1. To use an unsupervised learning method to cluster patients within the COPD cohort into k clusters based on a variety of features.
+
+ 2. Cluster new data and update clusters accordingly. Monitor the identified cluster for each patient and alert if they transition between clusters.
+
+ 3. Determine whether patients are on the incorrect type of care based on their clusters.
+
+ ## Data - EXAMPLE_STUDY_DATA
+
+ The tables below detail the raw EHR features processed for model training, along with the resulting processed feature set.
+
+ ### Raw features
+
+ #### Admissions/Comorbidities - SMR01_Cohort3R.csv
+
+ Feature name | Description |
+ -------------|-------------|
+ SafeHavenID | Patient ID |
+ ETHGRP | Ethnicity |
+ ADMDATE | Date of admission |
+ DISDATE | Date of discharge |
+ DIAGxDesc (x=1-6) | Diagnosis columns 1-6 |
+ STAY | Length of stay (days) |
+
+ #### Demographics - Demographics_Cohort3R.csv
+
+ Feature name | Description |
+ -------------|-------------|
+ SafeHavenID | Patient ID |
+ OBF_DOB | Date of birth |
+ SEX | Sex |
+ Marital_Status | Marital status |
+ SIMD_2009/12/16_QUINTILE | SIMD ranks to quintiles for 2009, 2012 and 2016 data zones |
+ SIMD_2009/12/16_DECILE | SIMD ranks to deciles for 2009, 2012 and 2016 data zones |
+ SIMD_2009/12/16_VIGINTILE | SIMD ranks to vigintiles for 2009, 2012 and 2016 data zones |
+
+ #### Prescribing - Pharmacy_Cohort3R.csv
+
+ Feature name | Description |
+ -------------|-------------|
+ SafeHavenID | Patient ID |
+ PRESC_DATE | Date of prescription |
+ PI_BNF_Item_Code | Code describing specific medicine as found in the British National Formulary (BNF) reference book |
+ PI_Approved_Name | Name of medicine |
+
+ #### Labs - SCI_Store_Cohort3R.csv
+
+ Feature name | Description |
+ -------------|-------------|
+ SafeHavenID | Patient ID |
+ SAMPLEDATE | Date lab test was taken |
+ CLINICALCODEDESCRIPTION | Name of test |
+ QUANTITYVALUE | Test value |
+ RANGEHIGHVALUE | Test range highest value |
+ RANGELOWVALUE | Test range lowest value |
+
+ #### Mappings
+
+ - `inhaler_mapping.json`: Inhaler mappings for any Chapter 3 BNF code inhaler prescriptions present in the SafeHaven prescribing dataset. Information on NHS inhaler types, found [here](https://www.coch.nhs.uk/media/172781/3-respiratory-system.pdf), was used to create the mapping.
+
+ - `test_mapping.json`: A mapping created for any of the top 20 most frequently occurring lab tests, plus any lab tests found relevant for indicating COPD severity in Model A. This mapping creates a common name for a specific test and lists any related names the test may appear under within the SCI Store dataset.
+
+ - `Comorbidity feature review for models & clin summary update v2 May 2021.xlsx`: A mapping between diagnosis names found in SMR01 and their associated comorbidities (taken from Model A).
+
+ - `diag_copd_resp_desc.json`: DIAGDesc for COPD and respiratory admissions.
+
+ ### Processed features
+
+ #### Demographics features
+
+ The below features are saved to be used for any necessary validation, but are not included for model training.
+
+ Feature name | Description |
+ -------------|-------------|
+ eth_grp | Ethnicity one-hot-encoded into 1 of 7 categories |
+ entry_dataset | Dataset patient first appeared in within the health board region |
+ first_entry | Date of first appearance in the health board region |
+ obf_dob | Patient DOB at respective date |
+ sex_bin | Sex in binary format: F=1, M=0 |
+ marital_status_m | Married |
+ marital_status_n | Not Known |
+ marital_status_o | Other |
+ marital_status_s | Single |
+ age_bin | Age bucket based on training data (1 of 10) |
+ days_since_copd_resp_med | Median days since COPD or respiratory admission |
+ days_since_adm_med | Median days since any admission |
+ days_since_rescue_med | Median days since rescue event |
+ simd_quintile | SIMD ranks to quintile for closest year data zone |
+ simd_decile | SIMD ranks to decile for closest year data zone |
+ simd_vigintile | SIMD ranks to vigintile for closest year data zone |
+
+ #### Final feature set
+
+ The final feature set contains 50 features, as detailed below.
+
+ Feature name | Description |
+ -------------|-------------|
+ SafeHavenID | Patient ID |
+ year | Data year |
+ total_hosp_days | Total hospital days in current year |
+ mean_los | Average length of stay per year |
+ ggc_years | Total years appearing in the health board region |
+ age | Patient age |
+ EVENT_per_year | Total events per year where EVENT=adm/comorb/salbutamol/rescue_meds/presc/labs/copd_resp |
+ EVENT_to_date | Total events to date where EVENT=adm/copd/resp/presc/rescue/labs |
+ days_since_EVENT | Days since event where EVENT=adm/copd_resp/rescue |
+ TEST_med_2yr | Median test value from previous 2 years, where TEST=alt/ast/albumin/alkaline_phosphatase/basophils/c_reactive_protein/chloride/creatinine/eosinophils/estimated_gfr/haematocrit/haemoglobin/lymphocytes/mch/mean_cell_volume/monocytes/neutrophils/platelets/potassium/red_blood_count/sodium/total_bilirubin/urea/white_blood_count/neut_lymph |
+ n_inhaler | Yearly inhaler prescription count where n=single/double/triple |
+
+ These features are further reduced using Principal Components Analysis (PCA) to produce a reduced feature set containing:
+
+ Feature name |
+ -------------|
+ age |
+ ggc_years |
+ comorb/presc/labs_per_year |
+ presc/labs/rescue_to_date |
+ days_since_adm/copd_resp/rescue |
+ albumin/estimated_gfr/haemoglobin/labs/red_blood_count_med_2yr |
+
+ ### Method
+
+ Raw datasets are loaded in and processed into a format suitable for machine learning to be applied. Features are then reduced to 1 row per SafeHavenID per year by selecting the:
+
+ - Median value for lab tests taken in the previous 2 years
+ - Sum of any binary/count features
+ - Last value of any to-date features
+
+ Once reduced, the datasets are then joined on SafeHavenID and year.
+
+ At this stage SafeHavenIDs present in both the Receiver and Scale-Up cohorts are removed. The remaining data is then split into training and testing sets in a subject-wise fashion, with 20% of the remaining patients in the testing set.
+
+ Each of these sets of data (training, testing, receiver and scale-up) is min-max scaled so that all features lie between 0 and 1. Note that all validation/testing sets (testing, receiver and scale-up) use the pre-trained scaler used to process the training set.
+
+ Data is then passed through a pipeline where:
+
+ - PCA is applied to reduce the processed dataset with 50+ features down to 15 features, which are then further reduced to 6 principal components.
+ - The Davies-Bouldin Score is used to select the cluster number for the training set.
+ - Training data is clustered using the K-Means algorithm, with results plotted using matplotlib.
+ - The test, receiver and scale-up datasets are reduced using the PCA transformation fitted on the training set.
+ - Clusters are calculated for all validation data.
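The pipeline steps above can be sketched end-to-end. Synthetic blobs and a direct 6-component PCA stand in for the real scaled feature matrices and the staged reduction; the fit-on-train, transform-everything-else convention is the point being illustrated.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic stand-ins for the scaled train and held-out feature matrices
X_train, _ = make_blobs(n_samples=300, n_features=20, centers=3, random_state=42)
X_test, _ = make_blobs(n_samples=100, n_features=20, centers=3, random_state=42)

# Reduce the training features with PCA ...
pca = PCA(n_components=6).fit(X_train)
Z_train = pca.transform(X_train)

# ... cluster the reduced training data with K-Means ...
km = KMeans(n_clusters=3, random_state=10, n_init=10).fit(Z_train)

# ... then apply the *fitted* PCA and K-Means to the held-out sets
Z_test = pca.transform(X_test)
test_clusters = km.predict(Z_test)
print(test_clusters.shape)  # (100,)
```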
pipeline.yml ADDED
@@ -0,0 +1,29 @@
+ trigger:
+   branches:
+     include:
+     - main
+     - release/*
+
+ jobs:
+ - job: 'build'
+   pool:
+     vmImage: 'ubuntu-latest'
+
+   steps:
+   - task: UsePythonVersion@0
+     inputs:
+       versionSpec: '3.8'
+       architecture: 'x64'
+     displayName: 'Specify Python version'
+
+   - script: |
+       python -m pip install --upgrade pip
+     displayName: 'Install pip'
+
+   - script: |
+       pip install -r requirements.txt
+     displayName: 'Install CI dependencies'
+
+   - script: |
+       flake8
+     displayName: 'Run linting'
requirements.txt ADDED
@@ -0,0 +1 @@
+ flake8
setup.cfg ADDED
@@ -0,0 +1,7 @@
+ [tool:pytest]
+ filterwarnings =
+     ignore::DeprecationWarning
+ [flake8]
+ ignore = E501,W293,W292
+ exclude = .git,__pycache__,docs/source/conf.py,old,build,dist
+ max-complexity = 10
training/README.md ADDED
@@ -0,0 +1,6 @@
+ # Training
+
+ Training scripts are split into `src` and `tests` directories, where `src` is further segmented into:
+ - `processing`: containing scripts to process raw EHR data
+ - `reduction`: containing scripts to combine, reduce, fill and scale processed EHR data for modelling
+ - `modelling`: containing notebooks and scripts required for model training
training/src/README.md ADDED
@@ -0,0 +1,8 @@
+ # Processing
+
+ This folder contains scripts to process raw EHR data for training.
+
+ Note that scripts must be run in the below order:
+ 1. `processing`
+ 2. `reduction`
+ 3. `modelling`
training/src/modelling/__pycache__/run_model.cpython-313.pyc ADDED
Binary file (16.5 kB)
training/src/modelling/additional_code_onevsone_onevsrest_approaches.py ADDED
@@ -0,0 +1,346 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import pandas as pd
2
+ import numpy as np
3
+ from sklearn.datasets import load_iris
4
+ from sklearn.tree import DecisionTreeClassifier
5
+ from sklearn.metrics import confusion_matrix
6
+ from sklearn.model_selection import train_test_split
7
+ import matplotlib.pyplot as plt
8
+ from sklearn.multiclass import OneVsRestClassifier
9
+ from sklearn.tree import plot_tree
10
+ from tabulate import tabulate
11
+ from sklearn.linear_model import LogisticRegression
12
+ import mlflow
13
+ from sklearn.metrics import accuracy_score
14
+ from sklearn.metrics import ConfusionMatrixDisplay
15
+ from sklearn.tree.export import export_text
16
+ from sklearn import tree
17
+ from itertools import combinations
18
+
19
+
20
+ # load in the data
21
+ data = load_iris()
22
+ iris = data
23
+ # convert to a dataframe
24
+ df = pd.DataFrame(data.data, columns=data.feature_names)
25
+ # create the species column
26
+ df['Species'] = data.target
27
+ # replace this with the actual names
28
+ target = np.unique(data.target)
29
+ target_names = np.unique(data.target_names)
30
+ targets = dict(zip(target, target_names))
31
+ df['Species'] = df['Species'].replace(targets)
32
+
33
+ # extract features and target variables
34
+ x = df.drop(columns="Species")
35
+ y = df["Species"]
36
+ # save the feature name and target variables
37
+ feature_names = x.columns
38
+ labels = y.unique()
39
+
40
+ # split the dataset
41
+ X_train, test_x, y_train, test_lab = train_test_split(x, y, test_size=0.4, random_state=42)
42
+
43
+
44
+ # The below is for classic logistic regression binary classifier one vs rest,
45
+ # explainability is based on the coefficents in logistic regression
46
+
47
+ # Create a One-vs-Rest logistic regression classifier
48
+ clf = LogisticRegression(random_state=0, multi_class='ovr')
49
+
50
+ # Train the classifier on the Iris dataset
51
+ clf.fit(X_train, y_train)
52
+
53
+ # Get the number of classes and features
54
+ n_classes = len(set(iris.target))
55
+ n_features = iris.data.shape[1]
56
+
57
+ # Create a figure with one subplot for each class
58
+ fig, axs = plt.subplots(n_classes, 1, figsize=(10, 5 * n_classes))
59
+
60
+ # Loop over each class
61
+ for i in range(n_classes):
62
+ # Get the feature importances for the current class
63
+ coef = clf.coef_[i]
64
+ importance = coef
65
+
66
+ # Sort the feature importances in descending order
67
+ indices = np.argsort(importance)[::-1]
68
+
69
+ # Create a bar plot of the feature importances
70
+ axs[i].bar(range(n_features), importance[indices])
71
+ axs[i].set_xticks(range(n_features))
72
+ axs[i].set_xticklabels(np.array(iris.feature_names)[indices], rotation=90)
73
+ axs[i].set_xlabel('Features')
74
+ axs[i].set_ylabel('Importance')
75
+ axs[i].set_title('Feature Importance for Class {}'.format(iris.target_names[i]))
76
+
77
+ # Adjust the spacing between subplots
78
+ fig.tight_layout()
79
+
80
+ # Show the plot
81
+ plt.show()
82
+
83
+
84
+ # Make predictions on the test data
85
+ val_pred = clf.predict(test_x)
86
+ accuracy = accuracy_score(test_lab, val_pred)
87
+ mlflow.log_metric('dtc accuracy', accuracy)
88
+
89
+ cm = confusion_matrix(test_lab, val_pred, labels=clf.classes_)
90
+ disp = ConfusionMatrixDisplay(
91
+ confusion_matrix=cm, display_labels=clf.classes_)
92
+ disp.plot()
93
+ plt.tight_layout()
94
+ mlflow.log_figure(disp.figure_, 'fig/' + 'confusion_matrix' + '.png')
95
+
96
+
97
+ # The below is for one vs rest desicion trees with explainability importance values are
98
+ # calculated based on the reduction of impurity measured by the Gini index.
99
+ # Create a One-vs-Rest Decision Tree classifier
100
+ clf_pre = DecisionTreeClassifier(random_state=0)
101
+ clf = OneVsRestClassifier(clf_pre)
102
+
103
+ # Train the classifier on the Iris dataset
104
+ clf.fit(X_train, y_train)
105
+
106
+ # Get the number of classes and features
107
+ n_classes = len(set(iris.target))
108
+ n_features = iris.data.shape[1]
109
+
110
+ # Create a figure with one subplot for each class
111
+ fig, axs = plt.subplots(n_classes, 1, figsize=(10, 5 * n_classes))
112
+
113
+ # Loop over each class
114
+ for i in range(n_classes):
115
+ # Get the feature importances for the current class
116
+ importance = clf.estimators_[i].feature_importances_
117
+
118
+ # Sort the feature importances in descending order
119
+ indices = np.argsort(importance)[::-1]
120
+
121
+ # Create a bar plot of the feature importances
122
+ axs[i].bar(range(n_features), importance[indices])
123
+ axs[i].set_xticks(range(n_features))
124
+ axs[i].set_xticklabels(np.array(iris.feature_names)[indices], rotation=90)
125
+ axs[i].set_xlabel('Features')
126
+ axs[i].set_ylabel('Importance')
127
+ axs[i].set_title('Feature Importance for Class {}'.format(iris.target_names[i]))
128
+
129
+ # Adjust the spacing between subplots
130
+ fig.tight_layout()
131
+
132
+ # Show the plot
133
+ plt.show()
134
+
135
+
136
+ y_pred_DTC = clf.predict(test_x)
137
+ accuracy = accuracy_score(test_lab, y_pred_DTC)
138
+ mlflow.log_metric('dtc accuracy', accuracy)
139
+
140
+ cm = confusion_matrix(test_lab, y_pred_DTC, labels=clf.classes_)
141
+ disp = ConfusionMatrixDisplay(
142
+ confusion_matrix=cm, display_labels=clf.classes_)
143
+ disp.plot()
144
+ plt.tight_layout()
145
+ mlflow.log_figure(disp.figure_, 'fig/' + 'confusion_matrix' + '.png')
146
+
147
+
148
+ # Show decision tree for each class: two methods
149
+
150
+ # Get the feature names
151
+ feature_names = iris.feature_names
152
+
153
+ # Loop over each decision tree classifier in the one-vs-rest classifier
154
+ for i, estimator in enumerate(clf.estimators_):
155
+ # Export the decision rules for the current tree
156
+ tree_rules = export_text(estimator, feature_names=feature_names)
157
+
158
+ # Print the decision rules for the current tree
159
+ print(f"Decision rules for tree for cluster {i}:")
160
+ print(tree_rules)
161
+
162
+ # assume clf is your one vs rest classifier
163
+ for i, estimator in enumerate(clf.estimators_):
164
+ fig, ax = plt.subplots(figsize=(12, 8))
165
+ tree.plot_tree(estimator,
166
+ feature_names=feature_names,
167
+ class_names=labels,
168
+ rounded=True,
169
+ filled=True,
170
+ fontsize=14,
171
+ ax=ax)
172
+ ax.set_title(f'Tree {i+1}')
173
+ plt.show()
174
+
175
+
176
+ # One vs one approach
177
+
178
+ # BLR
179
+ # Create a multinomial logistic regression classifier (scikit-learn's LogisticRegression has no one-vs-one mode)
180
+ clf = LogisticRegression(random_state=0, multi_class='multinomial', solver='lbfgs')
181
+
182
+ # Train the classifier on the Iris dataset
183
+ clf.fit(X_train, y_train)
184
+
185
+ # Get the number of classes and features
186
+ n_classes = len(set(iris.target))
187
+ n_features = iris.data.shape[1]
188
+
189
+ # Create a figure with one subplot for each class combination
190
+ fig, axs = plt.subplots(n_classes * (n_classes - 1) // 2, 1, figsize=(10, 5 * n_classes * (n_classes - 1) // 2))
191
+
192
+ # Loop over each class combination
193
+ index = 0
194
+ for i in range(n_classes):
195
+ for j in range(i + 1, n_classes):
196
+ # Get the coefficients for the current class combination (multinomial coef_ has one row per class, which matches the pair count only when n_classes == 3)
197
+ coef = clf.coef_[index]
198
+ importance = coef
199
+
200
+ # Sort the feature importances in descending order
201
+ indices = np.argsort(importance)[::-1]
202
+
203
+ # Create a bar plot of the feature importances
204
+ axs[index].bar(range(n_features), importance[indices])
205
+ axs[index].set_xticks(range(n_features))
206
+ axs[index].set_xticklabels(np.array(iris.feature_names)[indices], rotation=90)
207
+ axs[index].set_xlabel('Features')
208
+ axs[index].set_ylabel('Importance')
209
+ axs[index].set_title('Feature Importance for Class Combination {} vs {}'.format(iris.target_names[i], iris.target_names[j]))
210
+ index += 1
211
+
212
+ # Adjust the spacing between subplots
213
+ fig.tight_layout()
214
+
215
+ # Show the plot
216
+ plt.show()
217
+
218
+
219
+ # Make predictions on the test data
220
+ y_pred_ovo = clf.predict(test_x)
221
+ accuracy = accuracy_score(test_lab, y_pred_ovo)
222
+ mlflow.log_metric('blr accuracy', accuracy)
223
+
224
+ # Get confusion matrix
225
+ cm = confusion_matrix(test_lab, y_pred_ovo, labels=clf.classes_)
226
+ disp = ConfusionMatrixDisplay(
227
+ confusion_matrix=cm, display_labels=clf.classes_)
228
+ disp.plot()
229
+ plt.tight_layout()
230
+ mlflow.log_figure(disp.figure_, 'fig/' + 'confusion_matrix' + '.png')
231
+
232
+
233
+ # Decision tree classifier
234
+
235
+ # assume clf is your one vs one classifier
236
+ for i, (c1, c2) in enumerate(combinations(clf.classes_, 2)):
237
+ # restrict to the samples belonging to the current pair of classes
+ pair_mask = (y_train == c1) | (y_train == c2)
+
+ # train a decision tree on the current pair of classes only (true one-vs-one)
+ estimator = DecisionTreeClassifier()
+ estimator.fit(X_train[pair_mask], y_train[pair_mask])
243
+
244
+ # get feature importances
245
+ importances = estimator.feature_importances_
246
+
247
+ # create a bar plot showing feature importances for the current tree
248
+ fig, ax = plt.subplots(figsize=(8, 6))
249
+ ax.bar(np.arange(len(feature_names)), importances)
250
+ ax.set_xticks(np.arange(len(feature_names)))
251
+ ax.set_xticklabels(feature_names, rotation=45, ha='right')
252
+ ax.set_title(f'Tree {i+1}: {c1} vs {c2} Feature Importances')
253
+ ax.set_ylabel('Importance')
254
+ plt.tight_layout()
255
+ plt.show()
256
+
257
+ # initialize a list to store feature importances for each tree
258
+ importances_all = []
259
+
260
+ # assume clf is your one vs one classifier
261
+ for i, (c1, c2) in enumerate(combinations(clf.classes_, 2)):
262
+ # restrict to the samples belonging to the current pair of classes
+ pair_mask = (y_train == c1) | (y_train == c2)
+
+ # train a decision tree on the current pair of classes only (true one-vs-one)
+ estimator = DecisionTreeClassifier()
+ estimator.fit(X_train[pair_mask], y_train[pair_mask])
268
+
269
+ # get feature importances and store them in the list
270
+ importances = estimator.feature_importances_
271
+ importances_all.append(importances)
272
+
273
+ # plot the decision tree with feature importances
274
+ fig, ax = plt.subplots(figsize=(12, 8))
275
+ tree.plot_tree(estimator,
276
+ feature_names=feature_names,
277
+ class_names=[str(c1), str(c2)],
278
+ rounded=True,
279
+ filled=True,
280
+ fontsize=14,
281
+ ax=ax)
282
+
283
+ # add feature importances to title
284
+ title = f'Tree {i+1}: {c1} vs {c2}\n'
285
+ title += 'Feature importances:\n'
286
+ for feature, importance in zip(feature_names, importances):
287
+ title += f'{feature}: {importance:.3f}\n'
288
+ ax.set_title(title)
289
+
290
+
291
+ # Get confusion matrix
292
+ cm = confusion_matrix(test_lab, y_pred_ovo, labels=clf.classes_)
293
+ disp = ConfusionMatrixDisplay(
294
+ confusion_matrix=cm, display_labels=clf.classes_)
295
+ disp.plot()
296
+ plt.tight_layout()
297
+ mlflow.log_figure(disp.figure_, 'fig/' + 'confusion_matrix' + '.png')
298
+
299
+
300
+ # Example of code to show explainability (one vs rest for a specific instance)
301
+
302
+ # Split the data into training and testing sets
303
+ X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
304
+
305
+ # Train a binary classifier for each class
306
+ binary_classifiers = {}
307
+ for i in range(len(iris.target_names)):
308
+ binary_y_train = np.where(y_train == i, 1, 0)
309
+ model = DecisionTreeClassifier(random_state=42)
310
+ model.fit(X_train, binary_y_train)
311
+ binary_classifiers[i] = model
312
+
313
+ # Choose a specific instance to explain (here, index 7 of the test set)
314
+ instance = X_test[7]
315
+
316
+ # Get the predicted probability scores for each class for the instance
317
+ probs = []
318
+ for i in range(len(iris.target_names)):
319
+ binary_classifier = binary_classifiers[i]
320
+ prob = binary_classifier.predict_proba(instance.reshape(1, -1))[0, 1]
321
+ probs.append(prob)
322
+
323
+ # Get the index of the class with the highest probability score
324
+ predicted_class = np.argmax(probs)
325
+
326
+ # Extract the binary classifier with the highest probability score
327
+ binary_classifier = binary_classifiers[predicted_class]
328
+
329
+ # Plot the decision tree for the binary classifier with the highest probability score
330
+ fig, ax = plt.subplots(figsize=(12, 12))
331
+ plot_tree(binary_classifier, filled=True, rounded=True, ax=ax, feature_names=iris.feature_names, class_names=['not ' + iris.target_names[predicted_class], iris.target_names[predicted_class]])
332
+ plt.show()
333
+
334
+ # Print the predicted class and probability for the instance
335
+ predicted_prob = probs[predicted_class]
336
+ print('Predicted Class:', predicted_class)
337
+ print('Predicted Probability:', predicted_prob)
338
+
339
+
340
+ # Create a table with the ID, characteristics, true class label, and predicted class label for each sample in the test data
341
+ table_test = np.column_stack((np.arange(len(y_test)) + 1, X_test, y_test, y_pred_ovo, y_pred_DTC))
342
+ header_test = np.concatenate((['ID'], iris.feature_names, ['True Class', 'Predicted Class_BLR', 'Predicted Class_DTC']))
343
+ table_test = np.vstack((header_test, table_test))
344
+
345
+ # Print the table for the test data
346
+ print(tabulate(table_test))
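The one-vs-rest explainability idea used above (one binary tree per class, with Gini impurity-reduction importances read off each fitted tree) can be sketched end-to-end. This is a minimal illustration on synthetic data, not the study data; the dataset shape and names are invented:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for cluster-labelled data: 3 classes, 4 features
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)

# One binary Gini-based tree per class; each underlying estimator
# exposes its own impurity-reduction feature importances
clf = OneVsRestClassifier(DecisionTreeClassifier(random_state=0)).fit(X, y)
importances = [est.feature_importances_ for est in clf.estimators_]

assert len(importances) == 3                        # one vector per class
assert all(np.isclose(v.sum(), 1.0) for v in importances)
```

Each per-class vector can then be bar-plotted exactly as in the loops above.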
training/src/modelling/dtc_params.json ADDED
@@ -0,0 +1,3 @@
1
+ {
2
+ "random_state": 42
3
+ }
training/src/modelling/event_calculations.py ADDED
@@ -0,0 +1,183 @@
1
+ """
2
+ Find information on COPD, respiratory, rescue med and death event tracking
3
+ for patients within a timeframe
4
+ """
5
+ import json
6
+ import pandas as pd
7
+ import numpy as np
8
+
9
+
10
+ merged_cols = ['adm_per_year', 'copd_resp_per_year',
11
+ 'anxiety_depression_per_year',
12
+ 'rescue_meds_per_year', 'anxiety_depression_presc_per_year']
13
+ base_cols = ['admission_any', 'admission_copd_resp',
14
+ 'admission_anxiety_depression',
15
+ 'presc_rescue_med', 'presc_anxiety_depression']
16
+ n_cols = ["n_" + col for col in base_cols]
17
+ adm_cols = ['SafeHavenID', 'ADMDATE', 'admission_any', 'admission_copd_resp']
18
+ presc_cols = ['SafeHavenID', 'PRESC_DATE', 'rescue_meds']
19
+
20
+
21
+ def read_deaths(extract_data_path):
22
+ """
23
+ Read in deaths dataset
24
+ --------
25
+ :param extract_data_path: path to data extracts
26
+ :return: dataframe
27
+ """
28
+ filename = extract_data_path + 'Deaths_Cohort3R.csv'
29
+ cols = ['SafeHavenID', 'DOD']
30
+ df = pd.read_csv(filename, usecols=cols).drop_duplicates()
31
+ df['DOD'] = pd.to_datetime(df.DOD)
32
+
33
+ return df
34
+
35
+
36
+ def filter_data(df, date_col, eoy_date, start_date, end_date, typ):
37
+ """
38
+ Filter data to only include events occurring within given date range
39
+ --------
40
+ :param df: dataframe
41
+ :param date_col: str name of date column
42
+ :param eoy_date: end of year date
43
+ :param start_date: validation data start date
44
+ :param end_date: validation data end date
45
+ :param typ: type of data: 'adm', 'presc', 'merged', 'deaths'
46
+ :return: filtered dataframe
47
+ """
48
+ if typ == 'merged':
49
+ df = df[df.eoy == eoy_date]
50
+ else:
51
+ df = df[(df[date_col] >= start_date) & (df[date_col] < end_date)]
52
+
53
+ return df
54
+
55
+
56
+ def calc_time_to_event(df, date_col, start_date, new_col):
57
+ """
58
+ Calculate time to next event
59
+ --------
60
+ :param df: dataframe
61
+ :param date_col: str name of date column
62
+ :param start_date: validation data start date
63
+ :param new_col: new column name
64
+ :return: dataframe with SafeHavenID days to event
65
+ """
66
+ df_next = df.groupby('SafeHavenID').agg(next_event=(date_col, min))
67
+ df_next = (df_next - start_date) / np.timedelta64(1, 'M')
68
+ df_next.columns = ['time_to_' + new_col]
69
+
70
+ return df_next
71
+
72
+
73
+ def bucket_time_to_event(df):
74
+ """
75
+ Calculate time in months to next event and bucket into
76
+ 1, 3, 6, 12, 12+ months.
77
+ --------
78
+ :param df: dataframe
79
+ :return: dataframe with event times in categories
80
+ """
81
+ month = [-1, 1, 3, 6, 12, 13]
82
+ label = ['1', '3', '6', '12', '12+']
83
+ df = df.apply(lambda x: pd.cut(x, month, labels=label))
84
+ df = df.fillna('12+')
85
+
86
+ return df
87
+
88
+
89
+ def calculate_event_metrics(data_path, eoy_date, start_date, end_date):
90
+ """
91
+ Generate tables with number of events in 12 months and
92
+ boolean for events
93
+ --------
94
+ :param data_path: path to generated data
95
+ :param eoy_date: end of year date
96
+ :param start_date: validation data start date
97
+ :param end_date: validation data end date
98
+ """
99
+ # Load in data
100
+ merged = pd.read_pickle(data_path + 'merged.pkl')
101
+
102
+ # Select relevant dates and columns
103
+ merged = filter_data(
104
+ merged, 'eoy', eoy_date, start_date, end_date, 'merged')
105
+ df_event = merged[['SafeHavenID'] + merged_cols].copy()  # copy to avoid SettingWithCopyWarning
106
+
107
+ # Create frame with total events within 12mo period
108
+ df_count = df_event.copy()
109
+ df_count.columns = ['SafeHavenID'] + n_cols
110
+ df_count.to_pickle(data_path + 'metric_table_counts.pkl')
111
+
112
+ # Create frame with boolean events within 12mo period
113
+ df_event[merged_cols] = (df_event[merged_cols] > 0).astype(int)
114
+ df_event.columns = ['SafeHavenID'] + base_cols
115
+ df_event.to_pickle(data_path + 'metric_table_events.pkl')
116
+
117
+
118
+ def calculate_next_event(data_path, extract_data_path, eoy_date,
119
+ start_date, end_date):
120
+ """
121
+ Generate table with the time in 1, 3, 6, 12, 12+ months
122
+ --------
123
+ :param data_path: path to generated data
124
+ :param extract_data_path: path to data extracts
125
+ :param eoy_date: end of year date
126
+ :param start_date: validation data start date
127
+ :param end_date: validation data end date
128
+ """
129
+ # Find next adm events
130
+ adm = pd.read_pickle(data_path + 'validation_adm_proc.pkl')
131
+ adm = filter_data(
132
+ adm, 'ADMDATE', eoy_date, start_date, end_date, 'adm')
133
+ adm['admission_any'] = 1
134
+ adm['admission_copd_resp'] = adm.copd_event | adm.resp_event
135
+ adm = adm[adm_cols]
136
+ time_to_adm_any = calc_time_to_event(
137
+ adm, 'ADMDATE', start_date, 'admission_any')
138
+ time_to_adm_copd = calc_time_to_event(
139
+ adm[adm.admission_copd_resp == 1], 'ADMDATE', start_date,
140
+ 'admission_copd_resp')
141
+
142
+ # Find next presc events
143
+ presc = pd.read_pickle(data_path + 'validation_presc_proc.pkl')
144
+ presc = filter_data(
145
+ presc, 'PRESC_DATE', eoy_date, start_date, end_date, 'presc')
146
+ presc = presc[presc_cols]
147
+ presc = presc[presc.rescue_meds == 1]
148
+ time_to_rescue = calc_time_to_event(
149
+ presc, 'PRESC_DATE', start_date, 'presc_rescue_med')
150
+
151
+ # Find next deaths
152
+ deaths = read_deaths(extract_data_path)
153
+ deaths = filter_data(
154
+ deaths, 'DOD', eoy_date, start_date, end_date, 'deaths')
155
+ deaths['death'] = 1
156
+ time_to_death = calc_time_to_event(
157
+ deaths, 'DOD', start_date, 'death')
158
+
159
+ # Merge results
160
+ frames = [time_to_adm_any, time_to_adm_copd, time_to_rescue, time_to_death]
161
+ results = pd.concat(frames, join='outer', axis=1)
162
+ results = bucket_time_to_event(results)
163
+ results.to_pickle(data_path + 'metric_table_next.pkl')
164
+
165
+
166
+ def main():
167
+
168
+ # Load in config items
169
+ with open('../../../config.json') as json_config_file:
170
+ config = json.load(json_config_file)
171
+
172
+ data_path = config['model_data_path']
173
+ extract_data_path = config['extract_data_path']
174
+ eoy_date = pd.to_datetime(config['date'])
175
+ start_date = eoy_date + pd.Timedelta(days=1)
176
+ end_date = eoy_date + pd.offsets.DateOffset(years=1)
177
+
178
+ calculate_event_metrics(data_path, eoy_date, start_date, end_date)
179
+ calculate_next_event(data_path, extract_data_path, eoy_date,
180
+ start_date, end_date)
181
+
182
+
183
+ main()
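The month-bucketing in bucket_time_to_event hinges on pd.cut with edges [-1, 1, 3, 6, 12, 13] (right-closed intervals) and on fillna('12+') absorbing patients with no observed event. A minimal standalone sketch, with invented example values:

```python
import pandas as pd

# Same edges/labels as bucket_time_to_event: (-1, 1] -> '1', (1, 3] -> '3', ...
month = [-1, 1, 3, 6, 12, 13]
label = ['1', '3', '6', '12', '12+']

# Hypothetical months-to-event values; NaN means no event was observed
times = pd.Series([0.5, 2.0, 7.5, 11.9, 12.4, float('nan')])

# Values outside the edges (and NaN) become NaN, then default to '12+'
buckets = pd.cut(times, month, labels=label).fillna('12+')
print(list(buckets))  # ['1', '3', '12', '12', '12+', '12+']
```

Note that events beyond 13 months also fall out of the bins and are mapped to '12+' by the fillna step.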
training/src/modelling/hierarchical_params.json ADDED
@@ -0,0 +1,4 @@
1
+ {
2
+ "n_clusters": 3,
3
+ "linkage": "ward"
4
+ }
training/src/modelling/kmeans_params.json ADDED
@@ -0,0 +1,4 @@
1
+ {
2
+ "n_clusters": 3,
3
+ "random_state": 10
4
+ }
training/src/modelling/one_vs_rest_BLR.py ADDED
@@ -0,0 +1,377 @@
1
+ """
2
+ Modelling process
3
+ """
4
+ import pandas as pd
5
+ import numpy as np
6
+ import pickle
7
+ import matplotlib.pyplot as plt
8
+ import mlflow
9
+ from matplotlib import rcParams
10
+ from sklearn.cluster import AgglomerativeClustering, KMeans
11
+ from sklearn.decomposition import PCA
12
+ from sklearn.metrics import (davies_bouldin_score, silhouette_score,
13
+ accuracy_score, confusion_matrix,
14
+ ConfusionMatrixDisplay)
15
+ from sklearn.linear_model import LogisticRegression
16
+ # from sklearn.multiclass import OneVsRestClassifier
17
+ import os
18
+
19
+
20
+ # Set-up figures
21
+ rcParams['figure.figsize'] = 20, 5
22
+ rcParams['axes.spines.top'] = False
23
+ rcParams['axes.spines.right'] = False
24
+
25
+ # Set parameters for current run
26
+ year = 2019
27
+ model_type = 'hierarchical'
28
+ data_type = 'train'
29
+ k = 3
30
+ stamp = str(pd.Timestamp.now(tz='GMT+0'))[:16].replace(':', '').replace(' ', '_')
31
+ data_path = '<YOUR_DATA_PATH>/Model_E_Extracts/'
32
+
33
+ # Set MLFlow parameters
34
+ mlflow.set_tracking_uri("file:/.")
35
+ tracking_uri = mlflow.get_tracking_uri()
36
+ experiment_name = 'Model E: one vs rest adaption BLR ' + model_type
37
+ run_name = "_".join((str(year), model_type, stamp))
38
+ description = "Clustering model with one vs rest adaption (BLR) for COPD data in " + str(year)
39
+
40
+
41
+ def extract_year(df, year):
42
+ """
43
+ Extract 1 year of data
44
+ --------
45
+ :param df: dataframe to extract from
46
+ :param year: year to select data from
47
+ :return: data from chosen year
48
+ """
49
+ return df[df.year == year]
50
+
51
+
52
+ def read_yearly_data(typ, year):
53
+ """
54
+ Read in data for year required
55
+ --------
56
+ :param typ: type of data to read in
57
+ :param year: year to select data from
58
+ :return: data from chosen year and ids
59
+ """
60
+ df = pd.read_pickle(data_path + 'min_max_' + typ + '.pkl')
61
+ df_year = extract_year(df, year)
62
+ ids = df_year.pop('SafeHavenID').to_list()
63
+ df_year = df_year.drop('year', axis=1)
64
+
65
+ return df_year, ids
66
+
67
+
68
+ def plot_variance(df, typ):
69
+ """
70
+ Plot PCA variance
71
+ ---------
72
+ :param df: dataframe to process with PCA
73
+ :param typ: type of plot - for 'full' data or 'reduced'
74
+ :return: pca object
75
+ """
76
+ pca = PCA().fit(df)
77
+ n = list(range(1, len(df.columns) + 1))
78
+ evr = pca.explained_variance_ratio_.cumsum()
79
+ fig, ax = plt.subplots()
80
+ ax.plot(n, evr)
81
+ title = 'PCA Variance - ' + typ
82
+ ax.set_title(title, size=20)
83
+ ax.set_xlabel('Number of principal components')
84
+ ax.set_ylabel('Cumulative explained variance')
85
+ ax.grid()
86
+ plt.tight_layout()
87
+ mlflow.log_figure(fig, 'fig/' + title + '.png')
88
+
89
+ return pca
90
+
91
+
92
+ def extract_pca_loadings(df, pca_object):
93
+ """
94
+ Extract PCA loadings
95
+ --------
96
+ :param df: dataframe to reduce with pca
97
+ :param pca_object: pca object with feature loadings
98
+ :return: loadings table
99
+ """
100
+ cols = df.columns
101
+ loadings = pd.DataFrame(
102
+ data=pca_object.components_.T * np.sqrt(pca_object.explained_variance_),
103
+ columns=[f'PC{i}' for i in range(1, len(cols) + 1)],
104
+ index=cols)
105
+
106
+ return loadings
107
+
108
+
109
+ def plot_loadings(loadings):
110
+ """
111
+ Plot loadings for PC1 returned from PCA
112
+ --------
113
+ :param loadings: table of feature correlations to PC1
114
+ :return: updated loadings table
115
+ """
116
+ loadings_abs = loadings.abs().sort_values(by='PC1', ascending=False)
117
+ pc1_abs = loadings_abs[['PC1']].reset_index()
118
+ col_map = {'index': 'Attribute', 'PC1': 'AbsCorrWithPC1'}
119
+ pc1_abs = pc1_abs.rename(col_map, axis=1)
120
+ fig, ax = plt.subplots()
121
+ pc1_abs.plot(ax=ax, kind='bar')
122
+ title = 'PCA loading scores (PC1)'
123
+ ax.set_title(title, size=20)
124
+ ax.set_xticks(ticks=pc1_abs.index, labels=pc1_abs.Attribute, rotation='vertical')
125
+ ax.set_xlabel('Attribute')
126
+ ax.set_ylabel('AbsCorrWithPC1')
127
+ plt.tight_layout()
128
+ mlflow.log_figure(fig, 'fig/' + title + '.png')
129
+
130
+ return pc1_abs
131
+
132
+
133
+ def extract_array(df, pca_object, typ):
134
+ """
135
+ Extract data to pass to clustering algos
136
+ --------
137
+ :param df: dataframe to convert
138
+ :param pca_object: initialised PCA object
139
+ :param typ: type of return needed, either 'train' or 'test'
140
+ :return: converted array (and PCA object if training)
141
+ """
142
+ if typ == 'train':
143
+ pca_func = pca_object.fit_transform
144
+ else:
145
+ pca_func = pca_object.transform
146
+
147
+ pca_data = pd.DataFrame(pca_func(df)).to_numpy()
148
+
149
+ if typ == 'train':
150
+ pca_file = data_path + run_name + '_pca.pkl'
151
+ pickle.dump(pca_object, open(pca_file, 'wb'))
152
+
153
+ return pca_data
154
+
155
+
156
+ def get_kmeans_score(data, k):
157
+ '''
158
+ Calculate K-Means Davies Bouldin and Silhouette scores
159
+ --------
160
+ :param data: dataset to fit K-Means to
161
+ :param k: number of centers/clusters
162
+ :return: Scores
163
+ '''
164
+ kmeans = KMeans(n_clusters=k)
165
+ model = kmeans.fit_predict(data)
166
+ db_score = davies_bouldin_score(data, model)
167
+ sil_score = silhouette_score(data, model)
168
+
169
+ return db_score, sil_score
170
+
171
+
172
+ def plot_DB(df):
173
+ """
174
+ Extract Davies-Bouldin score and plot for a range of cluster numbers,
175
+ applied using K-Means clustering.
176
+
177
+ "Davies Bouldin index represents the average 'similarity' of clusters,
178
+ where similarity is a measure that relates cluster distance to cluster
179
+ size" - the lowest score indicates best cluster set.
180
+ --------
181
+ :param df: dataframe to plot from
182
+ """
183
+ db_scores = []
184
+ sil_scores = []
185
+ centers = list(range(2, 10))
186
+ for center in centers:
187
+ db_score, sil_score = get_kmeans_score(df, center)
188
+ db_scores.append(db_score)
189
+ sil_scores.append(sil_score)
190
+
191
+ # Plot DB
192
+ fig, ax = plt.subplots()
193
+ ax.plot(centers, db_scores, linestyle='--', marker='o', color='b')
194
+ ax.set_xlabel('K')
195
+ ax.set_ylabel('Davies Bouldin score')
196
+ title = 'Davies Bouldin score vs. K'
197
+ ax.set_title(title, size=20)
198
+ plt.tight_layout()
199
+ mlflow.log_figure(fig, 'fig/' + title + '.png')
200
+
201
+ # Plot silhouette
202
+ fig, ax = plt.subplots()
203
+ ax.plot(centers, sil_scores, linestyle='--', marker='o', color='b')
204
+ ax.set_xlabel('K')
205
+ ax.set_ylabel('Silhouette score')
206
+ title = 'Silhouette score vs. K'
207
+ ax.set_title(title, size=20)
208
+ plt.tight_layout()
209
+ mlflow.log_figure(fig, 'fig/' + title + '.png')
210
+
211
+
212
+ def plot_clust(df, labels):
213
+ """
214
+ Plot clusters
215
+ --------
216
+ :param df: dataframe to plot clusters from
217
+ :param labels: cluster labels
218
+ """
219
+ fig = plt.figure(figsize=(10, 10))
220
+ ax = fig.add_subplot(111, projection='3d')
221
+ sc = ax.scatter(df[:, 0], df[:, 1], df[:, 2], c=labels)
222
+ ax.set_xlabel('Principal Component 1')
223
+ ax.set_ylabel('Principal Component 2')
224
+ ax.set_zlabel('Principal Component 3')
225
+ ax.legend(*sc.legend_elements(), title='clusters')
226
+ title = 'Clusters'
227
+ ax.set_title(title, size=20)
228
+ plt.tight_layout()
229
+ mlflow.log_figure(fig, 'fig/' + title + '.png')
230
+
231
+
232
+ def save_clusters(typ, labels):
233
+ """
234
+ Save results from clustering
235
+ --------
236
+ :param typ: type of datasets - train, val
237
+ :param labels: labels from clustering to add to df
238
+ :param cols: columns to use for training
239
+ :return: reduced dataframe in numpy format
240
+ """
241
+ df_full = pd.read_pickle(data_path + 'filled_' + typ + '.pkl')
242
+ df = df_full[df_full.year == year].copy()  # copy to avoid SettingWithCopyWarning
243
+ df['cluster'] = labels
244
+ df.to_pickle(data_path + '_'.join((run_name, typ, 'clusters.pkl')))
245
+
246
+
247
+ def main():
248
+
249
+ # Read in data
250
+ df_train, train_ids = read_yearly_data('train', year)
251
+ df_val, val_ids = read_yearly_data('val', year)
252
+
253
+ # Set up ML Flow
254
+ print('Setting up ML Flow run')
255
+ mlflow.set_tracking_uri('http://127.0.0.1:5000/')
256
+ mlflow.set_experiment(experiment_name)
257
+ mlflow.start_run(run_name=run_name, description=description)
258
+ mlflow.set_tag("model.name", model_type)
259
+ mlflow.set_tag("model.training_data", "EXAMPLE_STUDY_DATA")
260
+ mlflow.set_tag("model.training_year", year)
261
+ mlflow.log_param("n_cols", len(df_train.columns) - 1)
262
+ mlflow.log_param("k", k)
263
+
264
+ # Select top features using PCA feature importance
265
+ print('Feature reduction stage 1')
266
+ pca = plot_variance(df_train, 'full')
267
+ loadings = extract_pca_loadings(df_train, pca)
268
+ pc1_abs_loadings = plot_loadings(loadings)
269
+ variance_full = pca.explained_variance_ratio_.cumsum()
270
+
271
+ n_cols = np.argmax(variance_full >= 0.9) + 1
272
+
273
+ mlflow.log_param("pca_stage_1", n_cols)
274
+ columns = pc1_abs_loadings.Attribute[:n_cols].values
275
+ np.save(data_path + run_name + '_cols.npy', columns)
276
+
277
+ # Reduce data by selecting n columns
278
+ df_train_reduced = df_train[columns]
279
+ df_val_reduced = df_val[columns]
280
+
281
+ # Convert columns to principal components
282
+ print('Feature reduction stage 2')
283
+ pca_n_cols = plot_variance(df_train_reduced, 'reduced')
284
+ variance_reduced = pca_n_cols.explained_variance_ratio_.cumsum()
285
+
286
+ n_components = np.argmax(variance_reduced >= 0.8) + 1
287
+ mlflow.log_param("pca_stage_2", n_components)
288
+ pca_reduced = PCA(n_components=n_components)
289
+ data_train = extract_array(df_train_reduced, pca_reduced, 'train')
290
+ data_val = extract_array(df_val_reduced, pca_reduced, 'test')
291
+
292
+ # Find best cluster number
293
+ print('Detecting best cluster number')
294
+ plot_DB(data_train)
295
+
296
+ # Fit clustering model
297
+ print('Cluster model training')
298
+ data = np.concatenate((data_train, data_val))
299
+ cluster_model = AgglomerativeClustering(n_clusters=k, linkage="ward")
300
+ # cluster_model = KMeans(n_clusters=k, random_state=10)
301
+ cluster_model.fit(data)
302
+ cluster_model_file = data_path + "_".join((run_name, model_type, 'cluster_model.pkl'))
303
+ pickle.dump(cluster_model, open(cluster_model_file, 'wb'))
304
+
305
+ # Split labels
306
+ labels = cluster_model.labels_
307
+ train_labels = labels[:len(train_ids)]
308
+ val_labels = labels[len(train_ids):]
309
+ save_clusters('train', train_labels)
310
+ save_clusters('val', val_labels)
311
+
312
+ # Plot cluster results
313
+ plot_clust(data, labels)
314
+
315
+ # Train and validate classifier
316
+ print('BLR classifier training')
317
+
318
+ # Create a One-vs-Rest logistic regression classifier
319
+ clf = LogisticRegression(random_state=42, multi_class='ovr')
320
+ clf.fit(df_train_reduced.to_numpy(), train_labels)
321
+ clf_model_file = data_path + run_name + '_blr_model.pkl'
322
+ pickle.dump(clf, open(clf_model_file, 'wb'))
323
+
324
+ # Create a figure with one feature importance subplot for each class
325
+ n_classes = len(set(train_labels))
326
+ n_features = df_train_reduced.shape[1]
327
+
328
+ fig, axs = plt.subplots(n_classes, 1, figsize=(10, 5 * n_classes))
329
+
330
+ # Set the vertical spacing between subplots
331
+ fig.subplots_adjust(hspace=0.99)
332
+
333
+ # Loop over each class
334
+ for i in range(n_classes):
335
+ # Get the feature importances for the current class
336
+ coef = clf.coef_[i]
337
+ importance = coef
338
+
339
+ # Sort the feature importances in descending order
340
+ indices = np.argsort(importance)[::-1]
341
+
342
+ # Create a bar plot of the feature importances
343
+ axs[i].bar(range(n_features), importance[indices])
344
+ axs[i].set_xticks(range(n_features))
345
+ axs[i].set_xticklabels(np.array(df_train_reduced.columns)[indices], rotation=90, fontsize=9)
346
+ axs[i].set_xlabel('Features')
347
+ axs[i].set_ylabel('Importance')
348
+ axs[i].set_title('Class {} Feature Importance'.format(i))
349
+
350
+ # save the plot to a temporary file
351
+ tmpfile = "plot.png"
352
+ fig.savefig(tmpfile)
353
+
354
+ # log the plot to MLflow (log_artifact takes a local path; the second
+ # argument names an artifact directory, not a file)
+ mlflow.log_artifact(tmpfile, "fig")
357
+
358
+ # remove the temporary file
359
+ os.remove(tmpfile)
360
+
361
+ # Make predictions on the test data
362
+ val_pred = clf.predict(df_val_reduced.to_numpy())
363
+ accuracy = accuracy_score(val_labels, val_pred)
364
+ mlflow.log_metric('blr accuracy', accuracy)
365
+
366
+ cm = confusion_matrix(val_labels, val_pred, labels=clf.classes_)
367
+ disp = ConfusionMatrixDisplay(
368
+ confusion_matrix=cm, display_labels=clf.classes_)
369
+ disp.plot()
370
+ plt.tight_layout()
371
+ mlflow.log_figure(disp.figure_, 'fig/' + 'confusion_matrix' + '.png')
372
+
373
+ # Stop ML Flow
374
+ mlflow.end_run()
375
+
376
+
377
+ main()
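The two-stage reduction in main() (rank features by absolute PC1 loading and keep enough columns to cover 90% of cumulative variance, then project the survivors onto components covering 80%) can be sketched standalone. The data here is random noise purely for shape-checking; the thresholds mirror the script, and the float n_components form is an equivalent scikit-learn shortcut for the script's explicit cumsum count:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # synthetic stand-in for min-max scaled features

# Stage 1: loadings = components^T * sqrt(explained variance); rank by |PC1|
pca_full = PCA().fit(X)
loadings = pca_full.components_.T * np.sqrt(pca_full.explained_variance_)
order = np.argsort(np.abs(loadings[:, 0]))[::-1]
n_cols = int(np.argmax(pca_full.explained_variance_ratio_.cumsum() >= 0.9)) + 1
X_reduced = X[:, order[:n_cols]]

# Stage 2: a float n_components asks PCA for the smallest component count
# reaching that variance fraction (>= 0.8 here)
pca_reduced = PCA(n_components=0.8)
X_final = pca_reduced.fit_transform(X_reduced)

assert X_final.shape[1] <= X_reduced.shape[1] <= X.shape[1]
```

The resulting X_final plays the role of data_train passed to the clustering step.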
training/src/modelling/one_vs_rest_DTC.py ADDED
@@ -0,0 +1,380 @@
1
+ """
2
+ Modelling process
3
+ """
4
+ import pandas as pd
5
+ import numpy as np
6
+ import pickle
7
+ import matplotlib.pyplot as plt
8
+ import mlflow
9
+ from matplotlib import rcParams
10
+ from sklearn.cluster import AgglomerativeClustering, KMeans
11
+ from sklearn.decomposition import PCA
12
+ from sklearn.metrics import (davies_bouldin_score, silhouette_score,
13
+ accuracy_score, confusion_matrix,
14
+ ConfusionMatrixDisplay)
15
+ from sklearn.multiclass import OneVsRestClassifier
16
+ from sklearn.tree import DecisionTreeClassifier # , export_text
17
+ import os
18
+
19
+
20
+ # Set-up figures
21
+ rcParams['figure.figsize'] = 20, 5
22
+ rcParams['axes.spines.top'] = False
23
+ rcParams['axes.spines.right'] = False
24
+
25
+ # Set parameters for current run
26
+ year = 2019
27
+ model_type = 'hierarchical'
28
+ data_type = 'train'
29
+ k = 3
30
+ stamp = str(pd.Timestamp.now(tz='GMT+0'))[:16].replace(':', '').replace(' ', '_')
31
+ data_path = '<YOUR_DATA_PATH>/Model_E_Extracts/'
32
+
33
+ # Set MLFlow parameters
34
+ mlflow.set_tracking_uri("file:/.")
35
+ tracking_uri = mlflow.get_tracking_uri()
36
+ experiment_name = 'Model E: one vs rest adaption DTC ' + model_type
37
+ run_name = "_".join((str(year), model_type, stamp))
38
+ description = "Clustering model with one vs rest adaption (DTC) for COPD data in " + str(year)
39
+
40
+
41
+ def extract_year(df, year):
42
+ """
43
+ Extract 1 year of data
44
+ --------
45
+ :param df: dataframe to extract from
46
+ :param year: year to select data from
47
+ :return: data from chosen year
48
+ """
49
+ return df[df.year == year]
50
+
51
+
52
+ def read_yearly_data(typ, year):
53
+ """
54
+ Read in data for year required
55
+ --------
56
+ :param typ: type of data to read in
57
+ :param year: year to select data from
58
+ :return: data from chosen year and ids
59
+ """
60
+ df = pd.read_pickle(data_path + 'min_max_' + typ + '.pkl')
61
+ df_year = extract_year(df, year)
62
+ ids = df_year.pop('SafeHavenID').to_list()
63
+ df_year = df_year.drop('year', axis=1)
64
+
65
+ return df_year, ids
66
+
67
+
68
+ def plot_variance(df, typ):
69
+ """
70
+ Plot PCA variance
71
+ ---------
72
+ :param df: dataframe to process with PCA
73
+ :param typ: type of plot - for 'full' data or 'reduced'
74
+ :return: pca object
75
+ """
76
+ pca = PCA().fit(df)
77
+ n = list(range(1, len(df.columns) + 1))
78
+ evr = pca.explained_variance_ratio_.cumsum()
79
+ fig, ax = plt.subplots()
80
+ ax.plot(n, evr)
81
+ title = 'PCA Variance - ' + typ
82
+ ax.set_title(title, size=20)
83
+ ax.set_xlabel('Number of principal components')
84
+ ax.set_ylabel('Cumulative explained variance')
85
+ ax.grid()
86
+ plt.tight_layout()
87
+ mlflow.log_figure(fig, 'fig/' + title + '.png')
88
+
89
+ return pca
90
+
91
+
92
+ def extract_pca_loadings(df, pca_object):
93
+ """
94
+ Extract PCA loadings
95
+ --------
96
+ :param df: dataframe to reduce with pca
97
+ :param pca_object: pca object with feature loadings
98
+ :return: loadings table
99
+ """
100
+ cols = df.columns
101
+ loadings = pd.DataFrame(
102
+ data=pca_object.components_.T * np.sqrt(pca_object.explained_variance_),
103
+ columns=[f'PC{i}' for i in range(1, len(cols) + 1)],
104
+ index=cols)
105
+
106
+ return loadings
107
+
108
+
109
+ def plot_loadings(loadings):
110
+ """
111
+ Plot loadings for PC1 returned from PCA
112
+ --------
113
+ :param loadings: table of feature correlations to PC1
114
+ :return: updated loadings table
115
+ """
116
+ loadings_abs = loadings.abs().sort_values(by='PC1', ascending=False)
117
+ pc1_abs = loadings_abs[['PC1']].reset_index()
118
+ col_map = {'index': 'Attribute', 'PC1': 'AbsCorrWithPC1'}
119
+ pc1_abs = pc1_abs.rename(col_map, axis=1)
120
+ fig, ax = plt.subplots()
121
+ pc1_abs.plot(ax=ax, kind='bar')
122
+ title = 'PCA loading scores (PC1)'
123
+ ax.set_title(title, size=20)
124
+ ax.set_xticks(ticks=pc1_abs.index, labels=pc1_abs.Attribute, rotation='vertical')
125
+ ax.set_xlabel('Attribute')
126
+ ax.set_ylabel('AbsCorrWithPC1')
127
+ plt.tight_layout()
128
+ mlflow.log_figure(fig, 'fig/' + title + '.png')
129
+
130
+ return pc1_abs
131
+
132
+
133
+ def extract_array(df, pca_object, typ):
134
+ """
135
+ Extract data to pass to clustering algos
136
+ --------
137
+ :param df: dataframe to convert
138
+ :param pca_object: initialised PCA object
139
+ :param typ: type of return needed, either 'train' or 'test'
140
+ :return: converted array (and PCA object if training)
141
+ """
142
+ if typ == 'train':
143
+ pca_func = pca_object.fit_transform
144
+ else:
145
+ pca_func = pca_object.transform
146
+
147
+ pca_data = pd.DataFrame(pca_func(df)).to_numpy()
148
+
149
+ if typ == 'train':
150
+ pca_file = data_path + run_name + '_pca.pkl'
151
+ pickle.dump(pca_object, open(pca_file, 'wb'))
152
+
153
+ return pca_data
154
+
155
+
156
+ def get_kmeans_score(data, k):
157
+ '''
158
+ Calculate K-Means Davies Bouldin and Silhouette scores
159
+ --------
160
+ :param data: dataset to fit K-Means to
161
+ :param k: number of centers/clusters
162
+ :return: Scores
163
+ '''
164
+ kmeans = KMeans(n_clusters=k)
165
+ model = kmeans.fit_predict(data)
166
+ db_score = davies_bouldin_score(data, model)
167
+ sil_score = silhouette_score(data, model)
168
+
169
+ return db_score, sil_score
170
+
171
+
172
+ def plot_DB(df):
173
+ """
174
+ Extract David Bouldin score and plot for a range of cluster numbers,
175
+ applied using K-Means clustering.
176
+
177
+ "Davies Bouldin index represents the average 'similarity' of clusters,
178
+ where similarity is a measure that relates cluster distance to cluster
179
+ size" - the lowest score indicates best cluster set.
180
+ --------
181
+ :param df: dataframe to plot from
182
+ """
183
+ db_scores = []
184
+ sil_scores = []
185
+ centers = list(range(2, 10))
186
+ for center in centers:
187
+ db_score, sil_score = get_kmeans_score(df, center)
188
+ db_scores.append(db_score)
189
+ sil_scores.append(sil_score)
190
+
191
+ # Plot DB
192
+ fig, ax = plt.subplots()
193
+ ax.plot(centers, db_scores, linestyle='--', marker='o', color='b')
194
+ ax.set_xlabel('K')
195
+ ax.set_ylabel('Davies Bouldin score')
196
+ title = 'Davies Bouldin score vs. K'
197
+ ax.set_title(title, size=20)
198
+ plt.tight_layout()
199
+ mlflow.log_figure(fig, 'fig/' + title + '.png')
200
+
201
+ # Plot silhouette
202
+ fig, ax = plt.subplots()
203
+ ax.plot(centers, sil_scores, linestyle='--', marker='o', color='b')
204
+ ax.set_xlabel('K')
205
+ ax.set_ylabel('Silhouette score')
206
+ title = 'Silhouette score vs. K'
207
+ ax.set_title(title, size=20)
208
+ plt.tight_layout()
209
+ mlflow.log_figure(fig, 'fig/' + title + '.png')
210
+
211
+
212
+ def plot_clust(df, labels):
213
+ """
214
+ Plot clusters
215
+ --------
216
+ :param df: dataframe to plot clusters from
217
+ :param labels: cluster labels
218
+ """
219
+ fig = plt.figure(figsize=(10, 10))
220
+ ax = fig.add_subplot(111, projection='3d')
221
+ sc = ax.scatter(df[:, 0], df[:, 1], df[:, 2], c=labels)
222
+ ax.set_xlabel('Principal Component 1')
223
+ ax.set_ylabel('Principal Component 2')
224
+ ax.set_zlabel('Principal Component 3')
225
+ ax.legend(*sc.legend_elements(), title='clusters')
226
+ title = 'Clusters'
227
+ ax.set_title(title, size=20)
228
+ plt.tight_layout()
229
+ mlflow.log_figure(fig, 'fig/' + title + '.png')
230
+
231
+
232
+ def save_clusters(typ, labels):
233
+ """
234
+ Save results from clustering
235
+ --------
236
+ :param typ: type of datasets - train, val
237
+ :param labels: labels from clustering to add to df
238
+ :param cols: columns to use for training
239
+ :return: reduced dataframe in numpy format
240
+ """
241
+ df_full = pd.read_pickle(data_path + 'filled_' + typ + '.pkl')
242
+ df = df_full[df_full.year == year]
243
+ df['cluster'] = labels
244
+ df.to_pickle(data_path + '_'.join((run_name, typ, 'clusters.pkl')))
245
+
246
+
247
+ def main():
248
+
249
+ # Read in data
250
+ df_train, train_ids = read_yearly_data('train', year)
251
+ df_val, val_ids = read_yearly_data('val', year)
252
+
253
+ # Set up ML Flow
254
+ print('Setting up ML Flow run')
255
+ mlflow.set_tracking_uri('http://127.0.0.1:5000/')
256
+ mlflow.set_experiment(experiment_name)
257
+ mlflow.start_run(run_name=run_name, description=description)
258
+ mlflow.set_tag("model.name", model_type)
259
+ mlflow.set_tag("model.training_data", "EXAMPLE_STUDY_DATA")
260
+ mlflow.set_tag("model.training_year", year)
261
+ mlflow.log_param("n_cols", len(df_train.columns) - 1)
262
+ mlflow.log_param("k", k)
263
+
264
+ # Select top features using PCA feature importance
265
+ print('Feature reduction stage 1')
266
+ pca = plot_variance(df_train, 'full')
267
+ loadings = extract_pca_loadings(df_train, pca)
268
+ pc1_abs_loadings = plot_loadings(loadings)
269
+ variance_full = pca.explained_variance_ratio_.cumsum()
270
+
271
+ n_cols = np.argmax(variance_full >= 0.9) + 1
272
+
273
+ mlflow.log_param("pca_stage_1", n_cols)
274
+ columns = pc1_abs_loadings.Attribute[:n_cols].values
275
+ np.save(data_path + run_name + '_cols.npy', columns)
276
+
277
+ # Reduce data by selecting n columns
278
+ df_train_reduced = df_train[columns]
279
+ df_val_reduced = df_val[columns]
280
+
281
+ # Convert columns to principal components
282
+ print('Feature reduction stage 2')
283
+ pca_n_cols = plot_variance(df_train_reduced, 'reduced')
284
+ variance_reduced = pca_n_cols.explained_variance_ratio_.cumsum()
285
+
286
+ n_components = np.argmax(variance_reduced >= 0.8) + 1
287
+ mlflow.log_param("pca_stage_2", n_components)
288
+ pca_reduced = PCA(n_components=n_components)
289
+ data_train = extract_array(df_train_reduced, pca_reduced, 'train')
290
+ data_val = extract_array(df_val_reduced, pca_reduced, 'test')
291
+
292
+ # Find best cluster number
293
+ print('Detecting best cluster number')
294
+ plot_DB(data_train)
295
+
296
+ # Fit clustering model
297
+ print('Cluster model training')
298
+ data = np.concatenate((data_train, data_val))
299
+ cluster_model = AgglomerativeClustering(n_clusters=k, linkage="ward")
300
+ # cluster_model = KMeans(n_clusters=k, random_state=10)
301
+ cluster_model.fit(data)
302
+ cluster_model_file = data_path + "_".join((run_name, model_type, 'cluster_model.pkl'))
303
+ pickle.dump(cluster_model, open(cluster_model_file, 'wb'))
304
+
305
+ # Split labels
306
+ labels = cluster_model.labels_
307
+ train_labels = labels[:len(train_ids)]
308
+ val_labels = labels[len(train_ids):]
309
+ save_clusters('train', train_labels)
310
+ save_clusters('val', val_labels)
311
+
312
+ # Plot cluster results
313
+ plot_clust(data, labels)
314
+
315
+ # Train and validate classifier
316
+ print('BLR classifier training')
317
+
318
+ # Create a One-vs-Rest DecisionTreeClassifier
319
+ clf_pre = DecisionTreeClassifier(random_state=42)
320
+ clf = OneVsRestClassifier(clf_pre)
321
+ clf.fit(df_train_reduced.to_numpy(), train_labels)
322
+ clf_model_file = data_path + run_name + '_dtc_model.pkl'
323
+ pickle.dump(clf, open(clf_model_file, 'wb'))
324
+
325
+ # Create a figure with one feature importance subplot for each class
326
+ n_classes = len(set(train_labels))
327
+ n_features = df_train_reduced.shape[1]
328
+
329
+ fig, axs = plt.subplots(n_classes, 1, figsize=(10, 5 * n_classes))
330
+
331
+ # Set the vertical spacing between subplots
332
+ fig.subplots_adjust(hspace=0.99)
333
+
334
+ # Loop over each class
335
+ for i in range(n_classes):
336
+ # Get the feature importances for the current class
337
+ importance = clf.estimators_[i].feature_importances_
338
+
339
+ # Sort the feature importances in descending order
340
+ indices = np.argsort(importance)[::-1]
341
+
342
+ # Create a bar plot of the feature importances
343
+ axs[i].bar(range(n_features), importance[indices])
344
+ axs[i].set_xticks(range(n_features))
345
+ axs[i].set_xticklabels(np.array(df_train_reduced.columns)[indices], rotation=90, fontsize=9)
346
+ axs[i].set_xlabel('Features')
347
+ axs[i].set_ylabel('Importance')
348
+ axs[i].set_title('Class {} Feature Importance'.format(i))
349
+
350
+ # Adjust the spacing between the subplots
351
+ plt.subplots_adjust(hspace=0.5)
352
+
353
+ # save the plot to a temporary file
354
+ tmpfile = "plot.png"
355
+ fig.savefig(tmpfile)
356
+
357
+ # log the plot to MLflow
358
+ with open(tmpfile, "rb") as fig:
359
+ mlflow.log_artifact(tmpfile, "feature_importance.png")
360
+
361
+ # remove the temporary file
362
+ os.remove(tmpfile)
363
+
364
+ # Make predictions on the test data
365
+ val_pred = clf.predict(df_val_reduced.to_numpy())
366
+ accuracy = accuracy_score(val_labels, val_pred)
367
+ mlflow.log_metric('dtc accuracy', accuracy)
368
+
369
+ cm = confusion_matrix(val_labels, val_pred, labels=clf.classes_)
370
+ disp = ConfusionMatrixDisplay(
371
+ confusion_matrix=cm, display_labels=clf.classes_)
372
+ disp.plot()
373
+ plt.tight_layout()
374
+ mlflow.log_figure(disp.figure_, 'fig/' + 'confusion_matrix' + '.png')
375
+
376
+ # Stop ML Flow
377
+ mlflow.end_run()
378
+
379
+
380
+ main()
training/src/modelling/predict_clusters.py ADDED
@@ -0,0 +1,70 @@
+ import sys
+ import json
+ import pickle
+ import pandas as pd
+ import numpy as np
+
+
+ def extract_year(df, eoy_date):
+     """
+     Extract 1 year of data
+     --------
+     :param df: dataframe to extract from
+     :param eoy_date: user-specified end of year date
+     :return: data from chosen year
+     """
+     return df[df.eoy == eoy_date]
+
+
+ def read_yearly_data(data_path, data_type, eoy_date):
+     """
+     Read in data for the year required
+     --------
+     :param data_path: path to generated data
+     :param data_type: type of data to read in
+     :param eoy_date: user-specified end of year date
+     :return: data from chosen year and ids
+     """
+     df = pd.read_pickle(data_path + 'min_max_' + data_type + '.pkl')
+     df_year = extract_year(df, eoy_date)
+     ids = df_year.pop('SafeHavenID').to_list()
+     df_year = df_year.drop('eoy', axis=1)
+
+     return df_year, ids
+
+
+ def main():
+
+     # Load in config items
+     with open('../../../config.json') as json_config_file:
+         config = json.load(json_config_file)
+
+     # Set model parameters
+     eoy_date = config['date']
+     data_path = config['model_data_path']
+
+     # Get data type and run name from the command line
+     data_type = sys.argv[1]
+     run_name = sys.argv[2]
+
+     # Read data
+     print('Loading data')
+     columns = np.load(data_path + run_name + '_cols.npy', allow_pickle=True)
+     df_scaled, ids = read_yearly_data(data_path, data_type, eoy_date)
+     df_scaled_reduced = df_scaled[columns]
+     df_unscaled_full = pd.read_pickle(data_path + 'filled_' + data_type + '.pkl')
+     df_unscaled = extract_year(df_unscaled_full, eoy_date).copy()
+
+     # Load model
+     print('Loading model')
+     clf_model_file = data_path + run_name + '_dtc_model.pkl'
+     with open(clf_model_file, 'rb') as f:
+         clf = pickle.load(f)
+
+     # Predict on new data
+     print('Predicting clusters')
+     labels = clf.predict(df_scaled_reduced.to_numpy())
+     df_unscaled['cluster'] = labels
+     df_unscaled.to_pickle(data_path + '_'.join((run_name, data_type, 'clusters.pkl')))
+
+
+ if __name__ == '__main__':
+     main()
training/src/modelling/run_model.py ADDED
@@ -0,0 +1,355 @@
+ """
+ Modelling process
+ """
+ import json
+ import pickle
+ import pandas as pd
+ import numpy as np
+ import matplotlib.pyplot as plt
+ import mlflow
+ from matplotlib import rcParams
+ from sklearn.cluster import AgglomerativeClustering, KMeans
+ from sklearn.tree import DecisionTreeClassifier as DTC
+ from sklearn.decomposition import PCA
+ from sklearn.metrics import (davies_bouldin_score, silhouette_score,
+                              accuracy_score, confusion_matrix,
+                              ConfusionMatrixDisplay)
+
+
+ # Set up figures
+ rcParams['figure.figsize'] = 20, 5
+ rcParams['axes.spines.top'] = False
+ rcParams['axes.spines.right'] = False
+
+
+ def extract_year(df, eoy_date):
+     """
+     Extract 1 year of data
+     --------
+     :param df: dataframe to extract from
+     :param eoy_date: user-specified EOY date for training
+     :return: data from chosen year
+     """
+     return df[df.eoy == eoy_date]
+
+
+ def read_yearly_data(data_path, typ, eoy_date):
+     """
+     Read in data for the year required
+     --------
+     :param data_path: path to generated data
+     :param typ: type of data to read in
+     :param eoy_date: end of year date to select data from
+     :return: data from chosen year and ids
+     """
+     df = pd.read_pickle(data_path + 'min_max_' + typ + '.pkl')
+     df_year = extract_year(df, eoy_date)
+     ids = df_year.pop('SafeHavenID').to_list()
+     df_year = df_year.drop('eoy', axis=1)
+
+     return df_year, ids
+
+
+ def plot_variance(df, typ):
+     """
+     Plot PCA variance
+     ---------
+     :param df: dataframe to process with PCA
+     :param typ: type of plot - for 'full' data or 'reduced'
+     :return: pca object
+     """
+     pca = PCA().fit(df)
+     n = list(range(1, len(df.columns) + 1))
+     evr = pca.explained_variance_ratio_.cumsum()
+     fig, ax = plt.subplots()
+     ax.plot(n, evr)
+     title = 'PCA Variance - ' + typ
+     ax.set_title(title, size=20)
+     ax.set_xlabel('Number of principal components')
+     ax.set_ylabel('Cumulative explained variance')
+     ax.grid()
+     plt.tight_layout()
+     mlflow.log_figure(fig, 'fig/' + title + '.png')
+
+     return pca
+
+
+ def extract_pca_loadings(df, pca_object):
+     """
+     Extract PCA loadings
+     --------
+     :param df: dataframe to reduce with pca
+     :param pca_object: pca object with feature loadings
+     :return: loadings table
+     """
+     cols = df.columns
+     loadings = pd.DataFrame(
+         data=pca_object.components_.T * np.sqrt(pca_object.explained_variance_),
+         columns=[f'PC{i}' for i in range(1, len(cols) + 1)],
+         index=cols)
+
+     return loadings
+
+
+ def plot_loadings(loadings):
+     """
+     Plot loadings for PC1 returned from PCA
+     --------
+     :param loadings: table of feature correlations to PC1
+     :return: updated loadings table
+     """
+     loadings_abs = loadings.abs().sort_values(by='PC1', ascending=False)
+     pc1_abs = loadings_abs[['PC1']].reset_index()
+     col_map = {'index': 'Attribute', 'PC1': 'AbsCorrWithPC1'}
+     pc1_abs = pc1_abs.rename(col_map, axis=1)
+     fig, ax = plt.subplots()
+     pc1_abs.plot(ax=ax, kind='bar')
+     title = 'PCA loading scores (PC1)'
+     ax.set_title(title, size=20)
+     ax.set_xticks(ticks=pc1_abs.index, labels=pc1_abs.Attribute, rotation='vertical')
+     ax.set_xlabel('Attribute')
+     ax.set_ylabel('AbsCorrWithPC1')
+     plt.tight_layout()
+     mlflow.log_figure(fig, 'fig/' + title + '.png')
+
+     return pc1_abs
+
+
+ def extract_array(df, data_path, run_name, pca_object, typ):
+     """
+     Extract data to pass to clustering algos
+     --------
+     :param df: dataframe to convert
+     :param data_path: path to generated data
+     :param run_name: name of run in ML Flow
+     :param pca_object: initialised PCA object
+     :param typ: type of return needed, either 'train' or 'test'
+     :return: converted array (the fitted PCA object is also pickled if training)
+     """
+     if typ == 'train':
+         pca_func = pca_object.fit_transform
+     else:
+         pca_func = pca_object.transform
+
+     pca_data = pd.DataFrame(pca_func(df)).to_numpy()
+
+     if typ == 'train':
+         pca_file = data_path + run_name + '_pca.pkl'
+         with open(pca_file, 'wb') as f:
+             pickle.dump(pca_object, f)
+
+     return pca_data
+
+
+ def get_kmeans_score(data, k):
+     """
+     Calculate K-Means Davies Bouldin and Silhouette scores
+     --------
+     :param data: dataset to fit K-Means to
+     :param k: number of centers/clusters
+     :return: scores
+     """
+     kmeans = KMeans(n_clusters=k)
+     model = kmeans.fit_predict(data)
+     db_score = davies_bouldin_score(data, model)
+     sil_score = silhouette_score(data, model)
+
+     return db_score, sil_score
+
+
+ def plot_DB(df):
+     """
+     Extract the Davies Bouldin score and plot it for a range of cluster
+     numbers, applied using K-Means clustering.
+
+     "The Davies Bouldin index represents the average 'similarity' of
+     clusters, where similarity is a measure that relates cluster distance
+     to cluster size" - the lowest score indicates the best cluster set.
+     --------
+     :param df: dataframe to plot from
+     """
+     db_scores = []
+     sil_scores = []
+     centers = list(range(2, 10))
+     for center in centers:
+         db_score, sil_score = get_kmeans_score(df, center)
+         db_scores.append(db_score)
+         sil_scores.append(sil_score)
+
+     # Plot DB
+     fig, ax = plt.subplots()
+     ax.plot(centers, db_scores, linestyle='--', marker='o', color='b')
+     ax.set_xlabel('K')
+     ax.set_ylabel('Davies Bouldin score')
+     title = 'Davies Bouldin score vs. K'
+     ax.set_title(title, size=20)
+     plt.tight_layout()
+     mlflow.log_figure(fig, 'fig/' + title + '.png')
+
+     # Plot silhouette
+     fig, ax = plt.subplots()
+     ax.plot(centers, sil_scores, linestyle='--', marker='o', color='b')
+     ax.set_xlabel('K')
+     ax.set_ylabel('Silhouette score')
+     title = 'Silhouette score vs. K'
+     ax.set_title(title, size=20)
+     plt.tight_layout()
+     mlflow.log_figure(fig, 'fig/' + title + '.png')
+
+
+ def plot_clust(df, labels):
+     """
+     Plot clusters
+     --------
+     :param df: array to plot clusters from
+     :param labels: cluster labels
+     """
+     fig = plt.figure(figsize=(10, 10))
+     ax = fig.add_subplot(111, projection='3d')
+     sc = ax.scatter(df[:, 0], df[:, 1], df[:, 2], c=labels)
+     ax.set_xlabel('Principal Component 1')
+     ax.set_ylabel('Principal Component 2')
+     ax.set_zlabel('Principal Component 3')
+     ax.legend(*sc.legend_elements(), title='clusters')
+     title = 'Clusters'
+     ax.set_title(title, size=20)
+     plt.tight_layout()
+     mlflow.log_figure(fig, 'fig/' + title + '.png')
+
+
+ def save_clusters(data_path, run_name, eoy_date, typ, labels):
+     """
+     Save results from clustering
+     --------
+     :param data_path: path to generated data
+     :param run_name: name of run in ML Flow
+     :param eoy_date: user-specified end of year date
+     :param typ: type of dataset - train, val
+     :param labels: labels from clustering to add to the dataframe
+     """
+     df_full = pd.read_pickle(data_path + 'filled_' + typ + '.pkl')
+     df = df_full[df_full.eoy == eoy_date].copy()
+     df['cluster'] = labels
+     df.to_pickle(data_path + '_'.join((run_name, typ, 'clusters.pkl')))
+
+
+ def main():
+
+     # Load in config files
+     with open('../../../config.json') as json_config_file:
+         config = json.load(json_config_file)
+
+     # Set model parameters
+     eoy_date = config['date']
+     data_path = config['model_data_path']
+     model_type = config['model_type']
+
+     # Load in model config
+     with open(model_type + '_params.json') as json_params_file:
+         model_params = json.load(json_params_file)
+
+     # Create ML Flow run details
+     stamp = str(pd.Timestamp.now(tz='GMT+0'))[:16].replace(':', '').replace(' ', '_')
+     experiment_name = 'Model E - Date Specific: ' + model_type
+     run_name = "_".join((str(eoy_date), model_type, stamp))
+     description = "Clustering model for COPD data in the year prior to " + str(eoy_date)
+
+     # Set up ML Flow
+     print('Setting up ML Flow run')
+     mlflow.set_tracking_uri('http://127.0.0.1:5000/')
+     mlflow.set_experiment(experiment_name)
+     mlflow.start_run(run_name=run_name, description=description)
+     mlflow.set_tag("model.name", model_type)
+     mlflow.set_tag("model.training_data", config['extract_data_path'])
+     mlflow.set_tag("model.training_date", eoy_date)
+     mlflow.log_param("k", model_params['n_clusters'])
+
+     # Read in data
+     df_train, train_ids = read_yearly_data(data_path, 'train', eoy_date)
+     df_val, val_ids = read_yearly_data(data_path, 'val', eoy_date)
+     mlflow.log_param("n_cols", len(df_train.columns))
+
+     # Select top features using PCA feature importance
+     print('Feature reduction stage 1')
+     pca = plot_variance(df_train, 'full')
+     loadings = extract_pca_loadings(df_train, pca)
+     pc1_abs_loadings = plot_loadings(loadings)
+     variance_full = pca.explained_variance_ratio_.cumsum()
+     n_cols = np.argmax(variance_full >= 0.9) + 1
+     mlflow.log_param("pca_stage_1", n_cols)
+     columns = pc1_abs_loadings.Attribute[:n_cols].values
+     np.save(data_path + run_name + '_cols.npy', columns)
+
+     # Reduce data by selecting n columns
+     df_train_reduced = df_train[columns]
+     df_val_reduced = df_val[columns]
+
+     # Convert columns to principal components
+     print('Feature reduction stage 2')
+     pca_n_cols = plot_variance(df_train_reduced, 'reduced')
+     variance_reduced = pca_n_cols.explained_variance_ratio_.cumsum()
+     n_components = np.argmax(variance_reduced >= 0.8) + 1
+     mlflow.log_param("pca_stage_2", n_components)
+     pca_reduced = PCA(n_components=n_components)
+     data_train = extract_array(
+         df_train_reduced, data_path, run_name, pca_reduced, 'train')
+     data_val = extract_array(
+         df_val_reduced, data_path, run_name, pca_reduced, 'test')
+
+     # Find best cluster number
+     print('Detecting best cluster number')
+     plot_DB(data_train)
+
+     # Fit clustering model
+     print('Cluster model training')
+     data = np.concatenate((data_train, data_val))
+     if model_type == 'hierarchical':
+         cluster_model = AgglomerativeClustering(**model_params)
+     else:
+         cluster_model = KMeans(**model_params)
+     cluster_model.fit(data)
+     cluster_model_file = data_path + "_".join((run_name, model_type, 'cluster_model.pkl'))
+     with open(cluster_model_file, 'wb') as f:
+         pickle.dump(cluster_model, f)
+
+     # Split labels
+     labels = cluster_model.labels_
+     train_labels = labels[:len(train_ids)]
+     val_labels = labels[len(train_ids):]
+     save_clusters(data_path, run_name, eoy_date, 'train', train_labels)
+     save_clusters(data_path, run_name, eoy_date, 'val', val_labels)
+
+     # Plot cluster results
+     plot_clust(data, labels)
+
+     # Read in DTC parameters
+     with open('dtc_params.json') as dtc_params_file:
+         dtc_params = json.load(dtc_params_file)
+
+     # Train and validate classifier
+     print('Decision tree classifier training')
+     clf = DTC(**dtc_params).fit(df_train_reduced.to_numpy(), train_labels)
+     clf_model_file = data_path + run_name + '_dtc_model.pkl'
+     with open(clf_model_file, 'wb') as f:
+         pickle.dump(clf, f)
+
+     # Calculate metrics
+     val_pred = clf.predict(df_val_reduced.to_numpy())
+     accuracy = accuracy_score(val_labels, val_pred)
+     mlflow.log_metric('dtc accuracy', accuracy)
+
+     # Plot confusion matrix
+     cm = confusion_matrix(val_labels, val_pred, labels=clf.classes_)
+     disp = ConfusionMatrixDisplay(
+         confusion_matrix=cm, display_labels=clf.classes_)
+     disp.plot()
+     plt.tight_layout()
+     mlflow.log_figure(disp.figure_, 'fig/confusion_matrix.png')
+
+     # Stop ML Flow
+     mlflow.end_run()
+
+
+ if __name__ == '__main__':
+     main()
training/src/modelling/validate.py ADDED
@@ -0,0 +1,260 @@
+ """
+ Validation process
+ """
+ import sys
+ import json
+ import pandas as pd
+ import numpy as np
+ import matplotlib.pyplot as plt
+ import mlflow
+ from matplotlib import rcParams
+ from tableone import TableOne
+
+
+ # Set up figures
+ rcParams['figure.figsize'] = 20, 5
+ rcParams['axes.spines.top'] = False
+ rcParams['axes.spines.right'] = False
+
+
+ def plot_cluster_size(df, data_type):
+     """
+     Produce a bar plot of cluster size
+     --------
+     :param df: dataframe to plot
+     :param data_type: type of data - train, test, val, rec, sup
+     """
+     # Number of patients
+     fig, ax = plt.subplots()
+     df.groupby('cluster').size().plot(ax=ax, kind='barh')
+     title = "Patient Cohorts"
+     ax.set_title(title, size=20)
+     ax.set_xlabel("Number of Patients")
+     ax.set_ylabel("Cluster")
+     plt.tight_layout()
+     mlflow.log_figure(fig, 'fig/' + title.replace(' ', '_') + '_' + data_type + '.png')
+
+
+ def plot_feature_hist(df, col, data_type):
+     """
+     Produce a histogram plot for a chosen feature
+     --------
+     :param df: dataframe to plot
+     :param col: feature column to plot
+     :param data_type: type of data - train, test, val, rec, sup
+     """
+     fig, ax = plt.subplots()
+     df.groupby('cluster')[col].plot(ax=ax, kind='hist', alpha=0.5)
+     ax.set_xlabel(col)
+     title = col + ' Histogram'
+     ax.set_title(title, size=20)
+     ax.legend()
+     plt.tight_layout()
+     mlflow.log_figure(fig, 'fig/' + title.replace(' ', '_') + '_' + data_type + '.png')
+
+
+ def plot_feature_bar(data, col, typ, data_type):
+     """
+     Produce a bar plot for a chosen feature
+     --------
+     :param data: dataframe to plot
+     :param col: feature column to plot
+     :param typ: 'count' or 'percentage'
+     :param data_type: type of data - train, test, val, rec, sup
+     """
+     if typ == 'count':
+         to_plot = data.groupby(['cluster']).apply(
+             lambda x: x.groupby(col).size())
+         x_label = "Number"
+     else:
+         to_plot = data.groupby(['cluster']).apply(
+             lambda x: 100 * x.groupby(col).size() / len(x))
+         x_label = "Percentage"
+     fig, ax = plt.subplots()
+     to_plot.plot(ax=ax, kind='barh')
+     title = "Patient Cohorts"
+     ax.set_title(title, size=20)
+     ax.set_xlabel(x_label + " of patients")
+     ax.set_ylabel("Cluster")
+     plt.tight_layout()
+     mlflow.log_figure(fig, 'fig/' + '_'.join((title.replace(' ', '_'), col, data_type + '.png')))
+
+
+ def plot_cluster_bar(data, typ, data_type):
+     """
+     Produce a per-cluster percentage bar plot
+     --------
+     :param data: data to plot
+     :param typ: name of the metric being plotted - used as the title and in the filename
+     :param data_type: type of data - train, test, val, rec, sup
+     """
+     fig, ax = plt.subplots()
+     data.plot(ax=ax, kind='bar')
+     ax.set_title(typ, size=20)
+     ax.set_xlabel("Cluster")
+     ax.set_ylabel("Percentage")
+     ax.set_ylim(0, 100)
+     plt.tight_layout()
+     mlflow.log_figure(fig, 'fig/' + typ + '_' + data_type + '.png')
+
+
+ def plot_events(df, data_type):
+     """
+     Plot events in the next 12 months based on the metric table
+     --------
+     :param df: metric table
+     :param data_type: type of data - train, test, val, rec, sup
+     """
+     df = df.drop('SafeHavenID', axis=1).set_index('cluster')
+     events = df.groupby('cluster').apply(lambda x: 100 * x.apply(
+         lambda col: len(col[col == 1]) / len(col)))
+     plot_cluster_bar(events, 'events', data_type)
+
+
+ def process_deceased_metrics(col):
+     """
+     Process the deceased column for plotting
+     -------
+     :param col: column to process
+     """
+     n_deceased = 100 * ((col[col < '12+']).count()) / len(col)
+     res = pd.DataFrame({'alive': [100 - n_deceased], 'deceased': [n_deceased]})
+
+     return res
+
+
+ def plot_deceased(df, data_type):
+     """
+     Plot survival in the next 12 months based on the metric table
+     --------
+     :param df: metric table
+     :param data_type: type of data - train, test, val, rec, sup
+     """
+     survival = df.groupby('cluster')['time_to_death'].apply(
+         process_deceased_metrics).reset_index().drop(
+         'level_1', axis=1).set_index('cluster')
+     plot_cluster_bar(survival, 'survival', data_type)
+
+
+ def plot_therapies(df_year, results, data_type):
+     """
+     Plot patient therapies per cluster
+     --------
+     :param df_year: unscaled data for the current year
+     :param results: cluster results and SafeHaven id
+     :param data_type: type of data - train, test, val, rec, sup
+     """
+     # Inhaler data for training group
+     therapies = df_year[['SafeHavenID', 'single_inhaler', 'double_inhaler', 'triple_inhaler']]
+     res_therapies = pd.merge(therapies, results, on='SafeHavenID', how='inner')
+
+     # Find counts/percentage per cluster
+     inhaler_cols = ['single_inhaler', 'double_inhaler', 'triple_inhaler']
+     inhals = res_therapies[['cluster'] + inhaler_cols].set_index('cluster')
+     in_res = inhals.groupby('cluster').apply(
+         lambda x: x.apply(lambda col: 100 * (col[col > 0].count()) / len(col)))
+
+     # Number of people without an inhaler prescription
+     no_in = res_therapies.groupby('cluster').apply(
+         lambda x: 100 * len(x[(x[inhaler_cols] == 0).all(axis=1)]) / len(x)).values
+
+     # Rename columns for plotting
+     in_res.columns = [c[0] for c in in_res.columns.str.split('_')]
+
+     # Add those with no inhaler
+     in_res['no_inhaler'] = no_in
+
+     plot_cluster_bar(in_res, 'therapies', data_type)
+
+
+ def main():
+
+     # Load in config items
+     with open('../../../config.json') as json_config_file:
+         config = json.load(json_config_file)
+     data_path = config['model_data_path']
+
+     # Get data type, run name and run id from the command line
+     data_type = sys.argv[1]
+     run_name = sys.argv[2]
+     run_id = sys.argv[3]
+
+     # Set MLFlow parameters
+     model_type = 'hierarchical'
+     experiment_name = 'Model E - Date Specific: ' + model_type
+     mlflow.set_tracking_uri('http://127.0.0.1:5000/')
+     mlflow.set_experiment(experiment_name)
+     mlflow.start_run(run_id=run_id)
+
+     # Read in unscaled data, results and column names used to train the model
+     columns = np.load(data_path + run_name + '_cols.npy', allow_pickle=True)
+     df_clusters = pd.read_pickle(data_path + "_".join((run_name, data_type, 'clusters.pkl')))
+     df_reduced = df_clusters[list(columns) + ['cluster']]
+
+     # Number of patients
+     plot_cluster_size(df_reduced, data_type)
+
+     # Generate mean/std table
+     t1_year = TableOne(df_reduced, categorical=[], groupby='cluster', pval=True)
199
+ t1yr_file = data_path + 't1_year_' + run_name + '_' + data_type + '.html'
200
+ t1_year.to_html(t1yr_file)
201
+ mlflow.log_artifact(t1yr_file)
202
+
203
+ # Histogram feature plots
204
+ plot_feature_hist(df_clusters, 'age', data_type)
205
+ plot_feature_hist(df_clusters, 'albumin_med_2yr', data_type)
206
+
207
+ # Bar plots
208
+ df_clusters['sex'] = df_clusters['sex_bin'].map({0: 'Male', 1: 'Female'})
209
+ plot_feature_bar(df_clusters, 'sex', 'percent', data_type)
210
+ plot_feature_bar(df_clusters, 'simd_decile', 'percent', data_type)
211
+
212
+ # Metrics for following 12 months
213
+ df_events = pd.read_pickle(data_path + 'metric_table_events.pkl')
214
+ df_counts = pd.read_pickle(data_path + 'metric_table_counts.pkl')
215
+ df_next = pd.read_pickle(data_path + 'metric_table_next.pkl')
216
+
217
+ # Merge cluster number with SafeHavenID and metrics
218
+ clusters = df_clusters[['SafeHavenID', 'cluster']]
219
+ df_events = clusters.merge(df_events, on='SafeHavenID', how='left').fillna(0)
220
+ df_counts = clusters.merge(df_counts, on='SafeHavenID', how='left').fillna(0)
221
+ df_next = clusters.merge(df_next, on='SafeHavenID', how='left').fillna('12+')
222
+
223
+ # Generate TableOne for events
224
+ cat_cols = df_events.columns[2:]
225
+ df_events[cat_cols] = df_events[cat_cols].astype('int')
226
+ event_limit = dict(zip(cat_cols, 5 * [1]))
227
+ event_order = dict(zip(cat_cols, 5 * [[1, 0]]))
228
+ t1_events = TableOne(df_events[df_events.columns[1:]], groupby='cluster',
229
+ limit=event_limit, order=event_order)
230
+ t1_events_file = data_path + '_'.join(('t1', data_type, 'events', run_name + '.html'))
231
+ t1_events.to_html(t1_events_file)
232
+ mlflow.log_artifact(t1_events_file)
233
+
234
+ # Generate TableOne for event counts
235
+ count_cols = df_counts.columns[2:]
236
+ df_counts[count_cols] = df_counts[count_cols].astype('int')
237
+ t1_counts = TableOne(df_counts[df_counts.columns[1:]], categorical=[], groupby='cluster')
238
+ t1_counts_file = data_path + '_'.join(('t1', data_type, 'counts', run_name + '.html'))
239
+ t1_counts.to_html(t1_counts_file)
240
+ mlflow.log_artifact(t1_counts_file)
241
+
242
+ # Generate TableOne for time to next events
243
+ next_cols = df_next.columns[2:]
244
+ next_event_order = dict(zip(next_cols, 5 * [['1', '3', '6', '12', '12+']]))
245
+ t1_next = TableOne(df_next[df_next.columns[1:]], groupby='cluster',
246
+ order=next_event_order)
247
+ t1_next_file = data_path + '_'.join(('t1', data_type, 'next', run_name + '.html'))
248
+ t1_next.to_html(t1_next_file)
249
+ mlflow.log_artifact(t1_next_file)
250
+
251
+ # Plot metrics
252
+ plot_events(df_events, data_type)
253
+ plot_deceased(df_next, data_type)
254
+ plot_therapies(df_clusters, clusters, data_type)
255
+
256
+ # Stop ML Flow
257
+ mlflow.end_run()
258
+
259
+
260
+ main()
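The per-cluster percentage pattern used throughout the plotting script above (e.g. in `plot_events`) can be sketched in isolation; the cluster labels and the single event column below are illustrative, not from the study data:

```python
import pandas as pd

# Per-cluster share of patients whose event flag equals 1, mirroring the
# groupby/apply used in plot_events (illustrative data).
df = pd.DataFrame({
    'cluster': [0, 0, 0, 1, 1],
    'copd_event': [1, 0, 1, 0, 0],
}).set_index('cluster')

events = df.groupby('cluster').apply(
    lambda grp: 100 * grp.apply(lambda col: len(col[col == 1]) / len(col)))
```

The result is a DataFrame indexed by cluster with one percentage column per event flag, ready for `plot_cluster_bar`.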
training/src/processing/README.md ADDED
@@ -0,0 +1,24 @@
1
+ # Processing
2
+
3
+ This folder contains scripts for processing raw EHR data, along with the mappings required to carry out the initial processing steps.
4
+
5
+ Before running any scripts, first create a directory called 'Model_E_Extracts' within the 'S:/data' directory.
6
+
7
+ _NB: The below processing scripts can be run in any order._
8
+
9
+ ### Admissions
10
+
11
+ - process_admissions.py - SMR01 COPD/Resp admissions per patient per year
12
+ - process_comorbidities.py - SMR01 comorbidities per patient per year
13
+
14
+ ### Demographics
15
+
16
+ - process_demographics.py - DOB, sex, marital status and SIMD data
17
+
18
+ ### Labs
19
+
20
+ - process_labs.py - lab test values per patient per year, taking the median lab test value from the 2 years prior
21
+
22
+ ### Prescribing
23
+
24
+ - process_prescribing.py - prescriptions per patient per year
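The processing scripts all read paths and the reference date from `config.json` at the repo root. Based only on the keys these scripts access (`extract_data_path`, `model_data_path`, `date`), it takes roughly this shape; the values below are illustrative placeholders:

```json
{
    "extract_data_path": "<YOUR_DATA_PATH>/EXAMPLE_STUDY_DATA/",
    "model_data_path": "<YOUR_DATA_PATH>/Model_E_Extracts/",
    "date": "2019-06-30"
}
```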
training/src/processing/__init__.py ADDED
@@ -0,0 +1 @@
1
+ # Empty file for folder to be recognised as module
training/src/processing/mappings/Comorbidity feature review for models & clin summary update v2 May 2021.xlsx ADDED
Binary file (98.3 kB). View file
 
training/src/processing/mappings/README.md ADDED
@@ -0,0 +1,7 @@
1
+ # Mappings
2
+
3
+ This folder contains a range of mappings used within the processing stages of model E:
4
+ - `inhaler_mapping.json`: Inhaler mappings for any Chapter 3 BNF code inhaler prescriptions present in the SafeHaven prescribing dataset. Information on NHS inhaler types, found [here](https://www.coch.nhs.uk/media/172781/3-respiratory-system.pdf), was used to create the mapping.
5
+ - `test_mapping.json`: A mapping created for any of the top 20 most frequently occurring lab tests, plus any lab tests found relevant for indicating COPD severity in Model A. This mapping creates a common name for a specific test and lists any related names the test may appear under within the SCI Store dataset.
6
+ - `Comorbidity feature review for models & clin summary update v2 May 2021.xlsx`: A mapping between diagnosis names found in SMR01 and their associated comorbidities (taken from Model A).
7
+ - `diag_copd_resp_desc.json`: DIAGDesc values used to flag COPD, respiratory and anxiety/depression admissions
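A typical way to consume a class-to-drug-names mapping such as `inhaler_mapping.json` is to invert it into a drug-to-class lookup for labelling prescriptions. This is a hypothetical usage sketch; the two entries below are a small subset, inlined so the example is self-contained:

```python
# Invert class -> [drug names] into drug name -> class (subset for illustration).
inhaler_mapping = {
    "SABA": ["SALBUTAMOL", "TERBUTALINE SULFATE"],
    "ICS": ["BUDESONIDE", "CICLESONIDE"],
}
drug_to_class = {drug: cls
                 for cls, drugs in inhaler_mapping.items()
                 for drug in drugs}
```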
training/src/processing/mappings/diag_copd_resp_desc.json ADDED
@@ -0,0 +1,5 @@
1
+ {
2
+ "copd": "CHRONIC OBSTRUCTIVE PULMONARY DISEASE",
3
+ "resp": ["PNEUMONITIS DUE TO FOOD AND VOMIT", "RESPIRATORY FAILURE, UNSPECIFIED; TYPE UNSPECIFIED", "CHRONIC RESPIRATORY FAILURE; TYPE II [HYPERCAPNIC]", "BRONCHOPNEUMONIA, UNSPECIFIED", "DYSPNOEA", "PLEURAL EFFUSION IN CONDITIONS CLASSIFIED ELSEWHERE", "RESPIRATORY FAILURE, UNSPECIFIED; TYPE [HYPERCAPNIC]", "PLEURAL EFFUSION, NOT ELSEWHERE CLASSIFIED", "CHRONIC RESPIRATORY FAILURE", "OTHER BACTERIAL PNEUMONIA", "ABN MICROBIOLOGICAL FINDINGS IN SPECS FROM RESPIRATORY ORGANS AND THORAX", "RESPIRATORY FAILURE, UNSPECIFIED", "PNEUMONIA, UNSPECIFIED", "LOBAR PNEUMONIA, UNSPECIFIED", "COUGH", "PLEURAL PLAQUE WITH PRESENCE OF ASBESTOS", "PLEURAL PLAQUE WITHOUT ASBESTOS", "OTHER DISORDERS OF LUNG", "OTHER SPECIFIED PLEURAL CONDITIONS", "PULMONARY COLLAPSE", "ACQUIRED ABSENCE OF LUNG [PART OF]", "ASPHYXIATION", "RESPIRATORY FAILURE, UNSPECIFIED; TYPE [HYPOXIC]", "TRACHEOSTOMY STATUS", "ACUTE RESPIRATORY FAILURE", "UNSPECIFIED ACUTE LOWER RESPIRATORY INFECTION", "OTHER SPECIFIED SYMPTOMS AND SIGNS INVOLVING THE CIRC AND RESP SYSTEMS", "BACTERIAL PNEUMONIA, UNSPECIFIED", "PYOTHORAX WITHOUT FISTULA", "DISEASES OF BRONCHUS, NOT ELSEWHERE CLASSIFIED", "PNEUMONIA DUE TO HAEMOPHILUS INFLUENZAE", "ABNORMAL SPUTUM", "OTHER POSTPROCEDURAL RESPIRATORY DISORDERS", "OTHER AND UNSPECIFIED ABNORMALITIES OF BREATHING", "INFLUENZA WITH OTHER RESP MANIFESTATIONS, SEASONAL INFLUENZA VIRUS IDENTIF", "PERSONAL HISTORY OF DISEASES OF THE RESPIRATORY SYSTEM", "PNEUMONIA DUE TO STREPTOCOCCUS PNEUMONIAE", "WHEEZING", "CHEST PAIN ON BREATHING", "HAEMOPTYSIS", "INFLUENZA WITH OTHER MANIFESTATIONS, VIRUS NOT IDENTIFIED", "OTHER SPECIFIED RESPIRATORY DISORDERS", "ACUTE UPPER RESPIRATORY INFECTION, UNSPECIFIED", "T.B. OF LUNG, W/O MENTION OF BACTERIOLOGICAL OR HISTOLOGICAL CONFIRMATION", "DEPENDENCE ON RESPIRATOR", "PLEURISY", "BRONCHITIS, NOT SPECIFIED AS ACUTE OR CHRONIC"],
4
+ "anxiety_depression": ["ADVERSE EFFECTS OF OTHER SEDATIVES, HYPNOTICS AND ANTIANXIETY DRUGS", "ADVERSE EFFECTS OF SEDATIVE, HYPNOTIC AND ANTIANXIETY DRUG, UNSPECIFIED", "ANXIETY DISORDER, UNSPECIFIED", "ANXIOUS [AVOIDANT] PERSONALITY DISORDER", "GENERALIZED ANXIETY DISORDER", "MIXED ANXIETY AND DEPRESSIVE DISORDER", "OTHER MIXED ANXIETY DISORDERS", "OTHER PHOBIC ANXIETY DISORDERS", "OTHER SPECIFIED ANXIETY DISORDERS", "PANIC DISORDER [EPISODIC PAROXYSMAL ANXIETY]", "PHOBIC ANXIETY DISORDER, UNSPECIFIED", "ADVERSE EFFECTS OF MONOAMINE-OXIDASE-INHIBITOR ANTIDEPRESSANTS", "ADVERSE EFFECTS OF OTHER AND UNSPECIFIED ANTIDEPRESSANTS", "ADVERSE EFFECTS OF TRICYCLIC AND TETRACYCLIC ANTIDEPRESSANTS", "BIPOLAR AFFECT DISORDER, CUR EPISODE SEVERE DEPRESSION WITH PSYCHOTIC SYMP", "BIPOLAR AFFECTIVE DISORDER, CURR EPISODE SEV DEPRESSION W/O PSYCHOTIC SYMP", "BIPOLAR AFFECTIVE DISORDER, CURRENT EPISODE MILD OR MODERATE DEPRESSION", "BRIEF DEPRESSIVE REACTION", "DEPRESSIVE EPISODE, UNSPECIFIED", "MILD DEPRESSIVE EPISODE", "MIXED ANXIETY AND DEPRESSIVE DISORDER", "MODERATE DEPRESSIVE EPISODE", "OTHER DEPRESSIVE EPISODES", "OTHER RECURRENT DEPRESSIVE DISORDERS", "POISONING BY MONOAMINE-OXIDASE-INHIBITOR ANTIDEPRESSANTS", "POISONING BY OTHER AND UNSPECIFIED ANTIDEPRESSANTS", "POISONING BY TRICYCLIC AND TETRACYCLIC ANTIDEPRESSANTS", "POST-SCHIZOPHRENIC DEPRESSION", "RECURRENT DEPRESSIVE DISORDER, CURRENT EPISODE MODERATE", "RECURRENT DEPRESSIVE DISORDER, CURRENT EPISODE SEVERE W/O PSYCHOTIC SYMPT", "RECURRENT DEPRESSIVE DISORDER, CURRENT EPISODE SEVERE WITH PSYCHOTIC SYMPT", "RECURRENT DEPRESSIVE DISORDER, UNSPECIFIED", "SCHIZOAFFECTIVE DISORDER, DEPRESSIVE TYPE", "SEVERE DEPRESSIVE EPISODE WITH PSYCHOTIC SYMPTOMS", "SEVERE DEPRESSIVE EPISODE WITHOUT PSYCHOTIC SYMPTOMS"]
5
+ }
training/src/processing/mappings/inhaler_mapping.json ADDED
@@ -0,0 +1,55 @@
1
+ {
2
+ "SABA": [
3
+ "BAMBUTEROL HYDROCHLORIDE",
4
+ "SALBUTAMOL",
5
+ "TERBUTALINE SULFATE"
6
+ ],
7
+ "LABA": [
8
+ "FORMOTEROL FUMARATE",
9
+ "INDACATEROL",
10
+ "OLODATEROL",
11
+ "SALMETEROL"
12
+ ],
13
+ "LAMA": [
14
+ "ACLIDINIUM BROMIDE",
15
+ "TIOTROPIUM",
16
+ "UMECLIDINIUM BROMIDE"
17
+ ],
18
+ "SAMA": [
19
+ "IPRATROPIUM BROMIDE"
+ ],
+ "LABA-LAMA": [
21
+ "ACLIDINIUM BROMIDE AND FORMOTEROL FUMARATE",
22
+ "INDACATEROL WITH GLYCOPYRRONIUM BROMIDE",
23
+ "TIOTROPIUM AND OLODATEROL",
24
+ "UMECLIDINIUM BROMIDE AND VILANTEROL TRIFENATATE"
25
+ ],
26
+ "ICS": [
27
+ "BECLOMETASONE DIPROPIONATE",
28
+ "BUDESONIDE",
29
+ "CICLESONIDE",
30
+ "FLUTICASONE PROPIONATE",
31
+ "MOMETASONE FUROATE"
32
+ ],
33
+ "LABA-ICS": [
34
+ "BECLOMETASONE DIPROPIONATE AND FORMOTEROL FUMARATE",
35
+ "BUDESONIDE WITH FORMOTEROL FUMARATE",
36
+ "FLUTICASONE FUROATE AND VILANTEROL",
37
+ "FLUTICASONE PROPIONATE AND FORMOTEROL FUMARATE",
38
+ "SALMETEROL WITH FLUTICASONE PROPIONATE"
39
+ ],
40
+ "LAMA +LABA-ICS": [
41
+ "BECLOMETASONE DIPROPIONATE AND FORMOTEROL FUMARATE AND GLYCOPYRRONIUM",
42
+ "FLUTICASONE FUROATE WITH UMECLIDINIUM BROMIDE AND VILANTEROL TRIFENATATE"
43
+ ],
44
+ "LABA-LAMA-ICS": [],
45
+ "SABA + SAMA": [
46
+ "SALBUTAMOL WITH IPRATROPIUM"
47
+ ],
48
+ "MCS": [
49
+ "NEDOCROMIL SODIUM",
50
+ "SODIUM CROMOGLICATE"
51
+ ],
52
+ "Ignore": [
53
+ "MENTHOL WITH EUCALYPTUS"
54
+ ]
55
+ }
training/src/processing/mappings/test_mapping.json ADDED
@@ -0,0 +1 @@
1
+ {"ALT": ["A.L.T.", "Alanine Transaminase"], "AST": ["A.S.T.", "Aspartate Transam", "Aspartate Transamina"], "Alkaline Phosphatase": "Alkaline Phos.", "Basophils": ["ABS BASOPHIL", "BASOPHIL (MANUAL)", "BASOPHILS", "Basophil count"], "C Reactive Protein": "C-reactive Protein", "Eosinophils": ["EOSINOPHIL (MANUAL)", "EOSINOPHILS", "Eosinophil count", "EOSINOPHILS ABSOLUTE", "Eosinophils\u017d"], "Haematocrit": "HAEMATOCRIT", "Haemoglobin": ["HAEMOGLOBIN", "HAEMOGLOBIN A1c"], "Lymphocytes": ["ABSOLUTE LYMPHOCYTES", "LYMPHOCYTES", "Lymphocyte Count", "Lymphocyte count"], "Mean Cell Volume": ["MEAN CELL VOLUME", "Mean cell volume"], "Monocytes": ["ABSOLUTE MONOCYTES", "MONOCYTES", "Monocyte count"], "Neutrophils": ["ABSOLUTE NEUTROPHILS", "NEUTROPHILS", "Neutrophil count"], "PCO2 Temp Corrected": "PCO2 temp corrected", "Platelets": ["PLATELET COUNT", "PLATELETS", "Platelet Count", "Platelet count"], "Red Blood Count": ["Red Cell Count", "RED BLOOD COUNT", "RED CELL COUNT", "Red Blood Cell Count", "Red blood cell (RBC) count", "Red blood count"], "Serum Vitamin B12": ["Serum vitamin B12", "SERUM B12"], "White Blood Count": ["WBC", "WBC - Biological Fl", "WHITE BLOOD CELLS", "WHITE BLOOD COUNT", "White Cell Count", "White blood count"]}
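Note that the values in `test_mapping.json` are either a single alias string or a list of aliases (e.g. `"Alkaline Phosphatase": "Alkaline Phos."`), so any code building an alias-to-common-name lookup has to normalise both shapes. A small sketch using a subset of the mapping:

```python
# Build alias -> common name, handling both str and list values.
test_mapping = {
    "ALT": ["A.L.T.", "Alanine Transaminase"],
    "Alkaline Phosphatase": "Alkaline Phos.",
}
alias_to_common = {}
for common, aliases in test_mapping.items():
    if isinstance(aliases, str):  # single alias stored as a bare string
        aliases = [aliases]
    for alias in aliases:
        alias_to_common[alias] = common
```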
training/src/processing/misc/process_gples.py ADDED
@@ -0,0 +1,73 @@
1
+ """
2
+ Process GPLES data
3
+ --------
4
+ Extract the number of COPD GP events per patient per year
5
+ """
6
+ import pandas as pd
7
+ from utils.common import read_data, first_patient_appearance
8
+
9
+
10
+ def initialize_gples_data(file):
11
+ """
12
+ Load in and convert GPLES dataset to correct format
13
+ --------
14
+ :param file: filename to read from
15
+ :return: gples dataframe with correct column names and types
16
+ """
17
+ print('Loading GPLES data')
18
+
19
+ # Read in data
20
+ gp_cols = ['SafeHavenID', 'EventDate', 'ShortName']
21
+ gp_types = ['int', 'object', 'str']
22
+ df = read_data(file, gp_cols, gp_types)
23
+
24
+ # Drop nulls and duplicates
25
+ df = df.dropna().drop_duplicates()
26
+
27
+ # Convert date columns to correct type
28
+ df.columns = ['SafeHavenID', 'ADMDATE', 'ShortName']
29
+ df['ADMDATE'] = pd.to_datetime(df['ADMDATE'])
30
+
31
+ # Only track COPD events
32
+ df = df[df.ShortName == 'COPD'][['SafeHavenID', 'ADMDATE']]
33
+ df['gp_copd_event'] = 1
34
+
35
+ return df
36
+
37
+
38
+ def extract_yearly_data(df):
39
+ """
40
+ Extract data per year from GPLES dataset
41
+ --------
42
+ :param df: gples dataframe to be processed
43
+ :return: reduced gples dataset
44
+ """
45
+ print('Reducing GPLES data')
46
+
47
+ # Extract year column for historical features
48
+ df['year'] = df.ADMDATE.dt.year
49
+
50
+ # Extract yearly data
51
+ group_cols = ['SafeHavenID', 'year']
52
+ gples_events = df.groupby(group_cols)[['gp_copd_event']].sum()
53
+
54
+ return gples_events
55
+
56
+
57
+ def main():
58
+
59
+ # Load data
60
+ gp_file = "<YOUR_DATA_PATH>/EXAMPLE_STUDY_DATA/GPLES_Cohort3R.csv"
61
+ gples = initialize_gples_data(gp_file)
62
+
63
+ # Save first date in dataset
64
+ first_patient_appearance(gples, 'ADMDATE', 'gples')
65
+
66
+ # Reduce GPLES to 1 row per year per ID
67
+ gples_yearly = extract_yearly_data(gples)
68
+
69
+ # Save data
70
+ gples_yearly.to_pickle('<YOUR_DATA_PATH>/Model_E_Extracts/gples_proc.pkl')
71
+
72
+
73
+ main()
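The yearly reduction in `extract_yearly_data` above boils down to a groupby-sum over patient and year. A self-contained sketch with illustrative IDs and dates:

```python
import pandas as pd

# Count COPD GP events per patient per year (illustrative data).
df = pd.DataFrame({
    'SafeHavenID': [1, 1, 1, 2],
    'ADMDATE': pd.to_datetime(['2018-01-05', '2018-07-09',
                               '2019-03-02', '2018-12-31']),
})
df['gp_copd_event'] = 1
df['year'] = df.ADMDATE.dt.year

yearly = df.groupby(['SafeHavenID', 'year'])[['gp_copd_event']].sum()
```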
training/src/processing/misc/process_validation_adm.py ADDED
@@ -0,0 +1,28 @@
1
+ from utils.adm_common import (initialize_adm_data, correct_stays,
2
+ track_copd_resp)
3
+
4
+
5
+ def main():
6
+
7
+ # Load in data
8
+ adm_file = "<YOUR_DATA_PATH>/EXAMPLE_STUDY_DATA/SMR01_Cohort3R.csv"
9
+ adm = initialize_adm_data(adm_file)
10
+
11
+ # Fill null STAY data and combine transfer admissions
12
+ adm = correct_stays(adm)
13
+
14
+ # Track COPD and respiratory events
15
+ adm = track_copd_resp(adm)
16
+
17
+ # Select relevant columns
18
+ adm_reduced = adm[['SafeHavenID', 'ADMDATE', 'copd_event', 'resp_event']]
19
+
20
+ # Track events
21
+ adm_reduced['copd_resp_event'] = adm_reduced['copd_event'] | adm_reduced['resp_event']
22
+ adm_reduced['adm_event'] = 1
23
+
24
+ # Save data
25
+ adm_reduced.to_pickle('<YOUR_DATA_PATH>/Model_E_Extracts/validation_adm_proc-og.pkl')
26
+
27
+
28
+ main()
training/src/processing/misc/process_validation_presc.py ADDED
@@ -0,0 +1,20 @@
1
+ from utils.presc_common import initialize_presc_data, track_medication
2
+
3
+
4
+ def main():
5
+
6
+ # Read in data
7
+ presc_file = "<YOUR_DATA_PATH>/EXAMPLE_STUDY_DATA/Pharmacy_Cohort3R.csv"
8
+ presc = initialize_presc_data(presc_file)
9
+
10
+ # Track salbutamol and rescue meds
11
+ presc = track_medication(presc)
12
+
13
+ # Reduce columns
14
+ presc_reduced = presc[['SafeHavenID', 'PRESC_DATE', 'rescue_meds']]
15
+
16
+ # Save data
17
+ presc_reduced.to_pickle('<YOUR_DATA_PATH>/Model_E_Extracts/validation_presc_proc-og.pkl')
18
+
19
+
20
+ main()
training/src/processing/process_admissions.py ADDED
@@ -0,0 +1,153 @@
1
+ """
2
+ Process SMR01 admission data
3
+ --------
4
+ Clean and process admission data while adding tracking for COPD and respiratory
5
+ admissions per year for each SafeHavenID
6
+ """
7
+ import json
8
+ import pandas as pd
9
+ from datetime import date
10
+ from dateutil.relativedelta import relativedelta
11
+ from utils.common import add_hist_adm_presc, first_patient_appearance
12
+ from utils.adm_common import (initialize_adm_data, correct_stays,
13
+ track_copd_resp)
14
+ from utils.adm_processing import (convert_ethgrp_desc, mode_ethnicity,
15
+ search_diag)
16
+ from utils.adm_reduction import fill_missing_years, calc_adm_per_year
17
+
18
+
19
+ def process_ethnicity(df):
20
+ """
21
+ Find relevant ethnic group for each patient, accounting for null data
22
+ --------
23
+ :param df: admission dataframe to be updated
24
+ :return: admission dataframe with ethnicity cleaned and updated
25
+ """
26
+ print('Processing ethnicity')
27
+
28
+ # Fill in missing ethnicities
29
+ df = df.rename(columns={'ETHGRP': 'eth_grp'})
30
+ df['eth_grp'] = df.eth_grp.str.strip()
31
+ df['eth_grp'] = df.groupby('SafeHavenID')['eth_grp'].apply(
32
+ lambda x: x.ffill().bfill().fillna('Unknown'))
33
+
34
+ # Convert to 1 of 7 ethnic groups
35
+ df['eth_grp'] = [convert_ethgrp_desc(eth) for eth in df.eth_grp]
36
+
37
+ # Find most commonly occurring ethnicity per SafeHavenID
38
+ df = df.groupby('SafeHavenID').apply(mode_ethnicity, 'eth_grp')
39
+
40
+ return df
41
+
42
+
43
+ def add_eoy_column(df, dt_col, eoy_date):
44
+ """
45
+ Add EOY relative to user-specified end date
46
+ --------
47
+ :param df: dataframe
48
+ :param dt_col: date column in dataframe
49
+ :param eoy_date: EOY date from config
50
+ :return: updated df with EOY column added
51
+ """
52
+ # Needed to stop error with creating a new column
53
+ df = df.reset_index(drop=True)
54
+
55
+ # Add column with user-specified end of year date
56
+ end_date = pd.to_datetime(eoy_date)
57
+ end_month = end_date.month
58
+ end_day = end_date.day
59
+
60
+ # Add for every year
61
+ df['eoy'] = [date(y, end_month, end_day) for y in df[dt_col].dt.year]
62
+
63
+ # Check that EOY date is after dt_col for each entry
64
+ eoy_index = df.columns[df.columns == 'eoy']
65
+ adm_vs_eoy = df[dt_col] > df.eoy
66
+ row_index = df.index[adm_vs_eoy]
67
+ df.loc[row_index, eoy_index] = df[adm_vs_eoy].eoy + relativedelta(years=1)
68
+ df['eoy'] = pd.to_datetime(df.eoy)
69
+
70
+ return df
71
+
72
+
73
+ def extract_yearly_data(df):
74
+ """
75
+ Extract features on a yearly basis for each SafeHavenID
76
+ --------
77
+ :param df: admission dataframe to be updated
78
+ :return: dataframe with feature values per year
79
+ """
80
+ print('Reducing to 1 row SafeHavenID per year')
81
+
82
+ # Track rows which are admissions
83
+ df['adm'] = 1
84
+
85
+ # Add rows from years where patient did not have admissions
86
+ df = df.groupby('SafeHavenID').apply(fill_missing_years)
87
+ df = df.reset_index(drop=True)
88
+
89
+ # Add any historical count columns
90
+ df = df.groupby('SafeHavenID').apply(add_hist_adm_presc, 'adm', 'ADMDATE')
91
+ df = df.reset_index(drop=True)
92
+
93
+ # Reduce data to 1 row per year
94
+ df = calc_adm_per_year(df)
95
+
96
+ # Select columns in final order
97
+ final_cols = ['eth_grp', 'adm_per_year', 'total_hosp_days',
98
+ 'mean_los', 'copd_per_year', 'resp_per_year',
99
+ 'anxiety_depression_per_year', 'days_since_copd',
100
+ 'days_since_resp', 'days_since_adm', 'adm_to_date',
101
+ 'copd_to_date', 'resp_to_date', 'anxiety_depression_to_date',
102
+ 'copd_date', 'resp_date', 'adm_date']
103
+
104
+ df = df[final_cols]
105
+
106
+ return df
107
+
108
+
109
+ def main():
110
+
111
+ # Load in config items
112
+ with open('../../../config.json') as json_config_file:
113
+ config = json.load(json_config_file)
114
+
115
+ # Load in data
116
+ adm_file = config['extract_data_path'] + 'SMR01_Cohort3R.csv'
117
+ adm = initialize_adm_data(adm_file)
118
+
119
+ # Fill null STAY data and combine transfer admissions
120
+ adm = correct_stays(adm)
121
+
122
+ # Save first date in dataset
123
+ data_path = config['model_data_path']
124
+ first_patient_appearance(adm, 'ADMDATE', 'adm', data_path)
125
+
126
+ # Process ethnicity data
127
+ adm = process_ethnicity(adm)
128
+
129
+ # Track COPD and respiratory events
130
+ adm = track_copd_resp(adm)
131
+
132
+ # Track anxiety event
133
+ adm = search_diag(adm, 'anxiety_depression')
134
+
135
+ # Select relevant columns
136
+ reduced_cols = ['SafeHavenID', 'eth_grp', 'ADMDATE', 'STAY', 'copd_event',
137
+ 'resp_event', 'anxiety_depression_event']
138
+ adm_reduced = adm[reduced_cols]
139
+
140
+ # Save per event dataset
141
+ adm_reduced.to_pickle(data_path + 'validation_adm_proc.pkl')
142
+
143
+ # Add column relative to user-specified date
144
+ adm_reduced = add_eoy_column(adm_reduced, 'ADMDATE', config['date'])
145
+
146
+ # Extract yearly data
147
+ adm_yearly = extract_yearly_data(adm_reduced)
148
+
149
+ # Save data
150
+ adm_yearly.to_pickle(data_path + 'adm_proc.pkl')
151
+
152
+
153
+ main()
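The `add_eoy_column` logic above assigns each admission the next occurrence of a configured month/day boundary, rolling forward a year when the admission falls after that date. A minimal sketch of the same steps, with illustrative dates:

```python
import pandas as pd
from datetime import date
from dateutil.relativedelta import relativedelta

# Assign each row its end-of-year boundary relative to a configured date.
adm = pd.DataFrame({'ADMDATE': pd.to_datetime(['2019-02-01', '2019-11-15'])})
end = pd.to_datetime('2019-06-30')

adm['eoy'] = [date(y, end.month, end.day) for y in adm.ADMDATE.dt.year]
# Roll the boundary forward a year where the admission falls after it
late = adm.ADMDATE > pd.to_datetime(adm.eoy)
adm.loc[late, 'eoy'] = adm.loc[late, 'eoy'] + relativedelta(years=1)
adm['eoy'] = pd.to_datetime(adm.eoy)
```

Here the February admission keeps the 2019-06-30 boundary, while the November one is rolled forward to 2020-06-30.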
training/src/processing/process_comorbidities.py ADDED
@@ -0,0 +1,161 @@
1
+ """
2
+ Process SMR01 comorbidities data
3
+ --------
4
+ Clean and process comorbidities, tracking specific comorbidities and returning
5
+ the total number of comorbidities per patient per year
6
+ """
7
+ import json
8
+ import pandas as pd
9
+ from datetime import date
10
+ from dateutil.relativedelta import relativedelta
11
+ from utils.common import track_event
12
+ from utils.adm_common import initialize_adm_data, correct_stays
13
+ from utils.comorb_processing import diagnosis_mapping_lists
14
+
15
+
16
+ def track_comorbidity(df, excel_file, sheet_name, diag_names):
17
+ """
18
+ Map from admission descriptions to comorbidities using provided sheet.
19
+ Add new column for each comorbidity.
20
+ --------
21
+ :param df: pandas dataframe
22
+ :param excel_file: str filename for diagnosis mapping
23
+ :param sheet_name: str sheet name for diagnosis mapping
24
+ :param diag_names: list of diagnoses
25
+ :return: dataframe update with diagnosis mapping
26
+ """
27
+ print('Tracking comorbidities')
28
+
29
+ # Load in mappings
30
+ mapping = diagnosis_mapping_lists(excel_file, sheet_name, diag_names)
31
+
32
+ # Select relevant columns
33
+ diag_columns = ['DIAG1Desc', 'DIAG2Desc', 'DIAG3Desc', 'DIAG4Desc',
34
+ 'DIAG5Desc', 'DIAG6Desc']
35
+ df_diag = df[diag_columns]
36
+
37
+ # Create column for each comorbidity
38
+ for key in mapping:
39
+ com = mapping[key]
40
+ com_bool = df_diag.apply(lambda x: track_event(x, com, False))
41
+ com_int = com_bool.any(axis=1).astype(int)
42
+ df[key] = com_int
43
+
44
+ return df
45
+
46
+
47
+ def fill_comorbidities(df, diag_names):
48
+ """
49
+ Forward-fill comorbidity flags per patient
50
+ --------
51
+ :param df: dataframe of groupby values
52
+ :param diag_names: list of diagnoses
53
+ :return: updated dataframe
54
+ """
55
+
56
+ df[diag_names] = df[diag_names].replace(to_replace=0, method='ffill')
57
+
58
+ return df
59
+
60
+
61
+ def add_eoy_column(df, dt_col, eoy_date):
62
+ """
63
+ Add EOY relative to user-specified end date
64
+ --------
65
+ :param df: dataframe
66
+ :param dt_col: date column in dataframe
67
+ :param eoy_date: EOY date from config
68
+ :return: updated df with EOY column added
69
+ """
70
+ # Needed to stop error with creating a new column
71
+ df = df.reset_index(drop=True)
72
+
73
+ # Add column with user-specified end of year date
74
+ end_date = pd.to_datetime(eoy_date)
75
+ end_month = end_date.month
76
+ end_day = end_date.day
77
+
78
+ # Add for every year
79
+ df['eoy'] = [date(y, end_month, end_day) for y in df[dt_col].dt.year]
80
+
81
+ # Check that EOY date is after dt_col for each entry
82
+ eoy_index = df.columns[df.columns == 'eoy']
83
+ adm_vs_eoy = df[dt_col] > df.eoy
84
+ row_index = df.index[adm_vs_eoy]
85
+ df.loc[row_index, eoy_index] = df[adm_vs_eoy].eoy + relativedelta(years=1)
86
+ df['eoy'] = pd.to_datetime(df.eoy)
87
+
88
+ return df
89
+
90
+
91
+ def add_yearly_stats(df):
92
+ """
93
+ Sum comorbidities per patient per year
94
+ --------
95
+ :param df: dataframe to update
96
+ :return: sum of comorbidities per patient per year
97
+ """
98
+ print('Adding comorbidity count per year')
99
+
100
+ # Drop cols not required anymore
101
+ cols_2_drop = ['ADMDATE', 'DISDATE', 'STAY', 'ETHGRP', 'DIAG1Desc',
102
+ 'DIAG2Desc', 'DIAG3Desc', 'DIAG4Desc', 'DIAG5Desc',
103
+ 'DIAG6Desc']
104
+ df = df.drop(cols_2_drop, axis=1)
105
+
106
+ # Sum comorbidities
107
+ df = df.groupby(['SafeHavenID', 'eoy']).last().sum(axis=1)
108
+ df = df.to_frame().rename(columns={0: 'comorb_per_year'})
109
+
110
+ return df
111
+
112
+
113
+ def main():
114
+
115
+ # Load in config items
116
+ with open('../../../config.json') as json_config_file:
117
+ config = json.load(json_config_file)
118
+
119
+ # Load in data
120
+ adm_file = config['extract_data_path'] + 'SMR01_Cohort3R.csv'
121
+ adm = initialize_adm_data(adm_file)
122
+
123
+ # Fill null STAY data and combine transfer admissions
124
+ adm = correct_stays(adm)
125
+
126
+ # Prepare text data - strip string columns
127
+ adm = adm.apply(lambda x: x.str.strip() if x.dtype == 'object' else x)
128
+
129
+ # Track comorbidities
130
+ excel_file = "mappings/Comorbidity feature review for models & clin " \
131
+ "summary update v2 May 2021.xlsx"
132
+ sheet_name = 'Diagnosis category mapping3'
133
+ diag_names = ['Ischaemic_hd', 'Atrial_fib', 'pacemake', 'periph_vasc',
134
+ 'cog_imp', 'HF1', 'LV_sys', 'valv_hd', 'HF_pres_ejec',
135
+ 'hypertension', 'Cerebrovascula_dis', 'Diabetes_mel',
136
+ 'Osteoporosis', 'frailty', 'liver_dis', 'metastat_canc',
137
+ 'headneck_canc', 'breast_canc', 'gi_canc', 'other_canc',
138
+ 'kidney_dis', 'Asthma_ov', 'Pulmonary_fib',
139
+ 'Obstructive_apnoea', 'Pulmonary_hyp', 'Previous_pneum',
140
+ 'DVT_PTE', 'Lung_cancer', 'Bronchiectasis', 'Resp_fail']
141
+ adm_comorb = track_comorbidity(adm, excel_file, sheet_name, diag_names)
142
+
143
+ # Drop date column
144
+ adm_comorb = adm_comorb.sort_values('ADMDATE').reset_index(drop=True)
145
+
146
+ # Drop fill comorb cols
147
+ print('Filling comorbidities')
148
+ adm_filled = adm_comorb.groupby('SafeHavenID').apply(
149
+ fill_comorbidities, diag_names)
150
+
151
+ # Add column relative to user-specified date
152
+ adm_filled = add_eoy_column(adm_filled, 'ADMDATE', config['date'])
153
+
154
+ # Add yearly stats
155
+ adm_yearly = add_yearly_stats(adm_filled)
156
+
157
+ # Save data
158
+ adm_yearly.to_pickle(config['model_data_path'] + 'comorb_proc.pkl')
159
+
160
+
161
+ main()
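The forward-fill in `fill_comorbidities` above means that once a comorbidity flag is 1 for a patient, later 0s become 1. The `replace(to_replace=0, method='ffill')` call it uses is deprecated in recent pandas; an equivalent mask/ffill sketch (flags illustrative):

```python
import pandas as pd

# Carry comorbidity flags forward: 0s after a 1 become 1; leading 0s stay 0.
flags = pd.DataFrame({'hypertension': [0, 1, 0, 0],
                      'diabetes': [0, 0, 0, 1]})
filled = flags.mask(flags.eq(0)).ffill().fillna(0).astype(int)
```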
training/src/processing/process_demographics.py ADDED
@@ -0,0 +1,74 @@
1
+ """
2
+ Process demographics data
3
+ --------
4
+ Process DOB, sex, marital status and SIMD data
5
+ """
6
+ import json
7
+ from utils.common import read_data, correct_column_names
8
+
9
+
10
+ def initialize_demo_data(demo_file):
11
+ """
12
+ Load in demographics dataset to correct format
13
+ --------
14
+ :param demo_file: demographics data file name
15
+ :return: demographics dataframe with correct column names and types
16
+ """
17
+ print('Loading demographic data')
18
+
19
+ # Read in data
20
+ demo_cols = ['SafeHavenID', 'OBF_DOB', 'SEX', 'MARITAL_STATUS',
21
+ 'SIMD_2009_QUINTILE', 'SIMD_2009_DECILE',
22
+ 'SIMD_2009_VIGINTILE', 'SIMD_2012_QUINTILE',
23
+ 'SIMD_2012_DECILE', 'SIMD_2012_VIGINTILE',
24
+ 'SIMD_2016_QUINTILE', 'SIMD_2016_DECILE',
25
+ 'SIMD_2016_VIGINTILE']
26
+ demo_types = ['int', 'object', 'str', 'str', 'float', 'float', 'float',
27
+ 'float', 'float', 'float', 'float', 'float', 'float']
28
+ df = read_data(demo_file, demo_cols, demo_types)
29
+
30
+ # Nulls dropped later in process, only drop duplicates
31
+ df = df.drop_duplicates()
32
+
33
+ return df
34
+
35
+
36
+ def process_sex(df):
37
+ """
38
+ Process sex column in demographics
39
+ --------
40
+ :param df: dataframe to update
41
+ :return: updated dataframe
42
+ """
43
+ print('One-hot encoding sex')
44
+
45
+ df['sex_bin'] = (df.SEX == 'F').astype(int)
46
+
47
+ return df
48
+
49
+
50
+ def main():
51
+
52
+ # Load in config items
53
+ with open('../../../config.json') as json_config_file:
54
+ config = json.load(json_config_file)
55
+
56
+ # Load in data
57
+ demo_file = config['extract_data_path'] + 'Demographics_Cohort3R.csv'
58
+ demo = initialize_demo_data(demo_file)
59
+
60
+ # Create binary sex column
61
+ demo = process_sex(demo)
62
+
63
+ # Drop original columns
64
+ demo = demo.drop('SEX', axis=1)
65
+
66
+ # Correct column names
67
+ new_cols = correct_column_names(demo.columns[1:], 'demo')
68
+ demo.columns = ['SafeHavenID'] + new_cols
69
+
70
+ # Save data
71
+ demo.to_pickle(config['model_data_path'] + 'demo_proc.pkl')
72
+
73
+
74
+ main()
training/src/processing/process_labs.py ADDED
@@ -0,0 +1,247 @@
1
+ """
2
+ Script for preprocessing labs data
3
+ --------
4
+ Track median values for labs tests over the previous 2 years for patients
5
+ with resulting dataset containing 1 row of information per patient per year
6
+ """
7
+ import json
8
+ import pandas as pd
9
+ import numpy as np
10
+ from datetime import date
11
+ from dateutil.relativedelta import relativedelta
12
+ from utils.common import (read_data, correct_column_names,
13
+ first_patient_appearance)
14
+ from utils.labs_processing import add_total_labs
15
+
16
+
17
+ def initialize_labs_data(labs_file):
18
+ """
19
+ Load in labs dataset to correct format
20
+ --------
21
+ :param labs_file: labs data file name
22
+ :return: labs dataframe with correct column names and types
23
+ """
24
+ print('Loading labs data')
25
+
26
+ # Read in data
27
+ old_cols = ['SafeHavenID', 'SAMPLEDATE', 'CLINICALCODEDESCRIPTION',
28
+ 'QUANTITYVALUE', 'RANGEHIGHVALUE', 'RANGELOWVALUE']
29
+ labs_types = ['int', 'object', 'str', 'float', 'float', 'float']
30
+ df = read_data(labs_file, old_cols, labs_types)
31
+
32
+ # Rename columns to CamelCase
33
+ new_cols = ['SafeHavenID', 'SampleDate', 'ClinicalCodeDescription',
34
+ 'QuantityValue', 'RangeHighValue', 'RangeLowValue']
35
+ mapping = dict(zip(old_cols, new_cols))
36
+ df = df.rename(columns=mapping)
37
+
38
+ # Drop any nulls, duplicates or negative (broken) test values
39
+ df = df.dropna().drop_duplicates()
40
+
41
+ # Check tests are valid (values > -1)
42
+ num_cols = ['QuantityValue', 'RangeHighValue', 'RangeLowValue']
43
+ df = df[(df[num_cols] > -1).all(axis=1)]
44
+
45
+ # Select final columns
46
+ final_cols = ['SafeHavenID', 'SampleDate', 'ClinicalCodeDescription',
47
+ 'QuantityValue']
48
+ df = df[final_cols]
49
+
50
+ # Convert date
51
+ df['SampleDate'] = pd.to_datetime(df.SampleDate)
52
+
53
+ return df
54
+
55
+
56
+ def clean_labs(df):
57
+ """
58
+ Clean descriptions and select relevant tests
59
+ --------
60
+ :param df: pandas dataframe
61
+ :return: cleaned dataframe
62
+ """
63
+ print('Cleaning labs data')
64
+
65
+ lab_tests = ['ALT', 'AST', 'Albumin', 'Alkaline Phosphatase', 'Basophils',
66
+ 'C Reactive Protein', 'Chloride', 'Creatinine', 'Eosinophils',
67
+ 'Estimated GFR', 'Haematocrit', 'Haemoglobin', 'Lymphocytes',
68
+ 'MCH', 'Mean Cell Volume', 'Monocytes', 'Neutrophils',
69
+ 'PCO2 (temp corrected', 'Platelets', 'Potassium',
70
+ 'Red Blood Count', 'Serum vitamin B12', 'Sodium',
71
+ 'Total Bilirubin', 'Urea', 'White Blood Count']
72
+
73
+ # Strip any whitespaces
74
+ str_col = 'ClinicalCodeDescription'
75
+ df[str_col] = df[str_col].str.strip()
76
+
77
+ # Read in test mapping
78
+ with open('mappings/test_mapping.json') as json_file:
79
+ test_mapping = json.load(json_file)
80
+
81
+ # Correct names for relevant tests
82
+ for k, v in test_mapping.items():
83
+ df[str_col] = df[str_col].replace(v, k)
84
+
85
+ # Select relevant tests
86
+ df = df[[desc in lab_tests for desc in df[str_col]]]
87
+
88
+ return df
89
+
90
+
91
+ def add_neut_lypmh(df):
92
+ """
93
+ Pivot dataframe and calculate neut_lypmh feature
94
+ --------
95
+ :param df: pandas dataframe
96
+ :return: pivoted dataframe
97
+ """
98
+ print('Calculating neut_lypmh data')
99
+
100
+ # Pivot table with CCDesc as headers and QuantityValue as values
101
+ df = pd.pivot_table(
102
+ df, index=['SafeHavenID', 'SampleDate'],
103
+ columns=['ClinicalCodeDescription'], values='QuantityValue',
104
+ dropna=True).reset_index()
105
+
106
+ # Add neut_lymph feature
107
+ df['neut_lymph'] = df.Neutrophils / df.Lymphocytes
108
+
109
+ # Replace any infinite values
110
+ df['neut_lymph'] = df.neut_lymph.replace([np.inf, -np.inf], np.nan)
111
+
112
+ return df
113
+
114
+
115
+ def add_eoy_column(df, dt_col, eoy_date):
116
+ """
117
+ Add EOY relative to user-specified end date
118
+ --------
119
+ :param df: dataframe
120
+ :param dt_col: date column in dataframe
121
+ :param eoy_date: EOY date from config
122
+ :return: updated df with EOY column added
123
+ """
124
+ # Needed to stop error with creating a new column
125
+ df = df.reset_index(drop=True)
126
+
127
+ # Add column with user-specified end of year date
128
+ end_date = pd.to_datetime(eoy_date)
129
+ end_month = end_date.month
130
+ end_day = end_date.day
131
+
132
+ # Add for every year
133
+ df['eoy'] = [date(y, end_month, end_day) for y in df[dt_col].dt.year]
134
+
135
+ # Check that EOY date is after dt_col for each entry
136
+ eoy_index = df.columns[df.columns == 'eoy']
137
+ adm_vs_eoy = df[dt_col] > df.eoy
138
+ row_index = df.index[adm_vs_eoy]
139
+ df.loc[row_index, eoy_index] = df[adm_vs_eoy].eoy + relativedelta(years=1)
140
+ df['eoy'] = pd.to_datetime(df.eoy)
141
+
142
+ return df
143
+
144
+
145
+ def reduce_labs_data(df, dt_col):
146
+ """
147
+ Reduce dataset to 1 row per ID per year looking back at the median values
148
+ over the previous 2 years
149
+ --------
150
+ :param df: pandas dataframe
151
+ :param dt_col: date column
152
+ :return: reduced labs dataframe
153
+ """
154
+ print('Reducing labs to 1 row per patient per year')
155
+
156
+ group_cols = ['SafeHavenID', 'eoy']
157
+ med_cols = ['ALT', 'AST', 'Albumin', 'Alkaline Phosphatase', 'Basophils',
158
+ 'C Reactive Protein', 'Chloride', 'Creatinine', 'Eosinophils',
159
+ 'Estimated GFR', 'Haematocrit', 'Haemoglobin', 'Lymphocytes',
160
+ 'MCH', 'Mean Cell Volume', 'Monocytes', 'Neutrophils',
161
+ 'Platelets', 'Potassium', 'Red Blood Count', 'Sodium',
162
+ 'Total Bilirubin', 'Urea', 'White Blood Count', 'neut_lymph']
163
+
164
+ # Add column to track labs per year
165
+ df['labs'] = 1
166
+
167
+ # Sort by date and extract year
168
+ df = df.sort_values(dt_col)
169
+
170
+ # Include data from previous year
171
+ shifted = df[['eoy']] + pd.DateOffset(years=1)
172
+ new_tab = df[['SafeHavenID', dt_col] + med_cols].join(shifted)
173
+ combined_cols = ['SafeHavenID', 'eoy', dt_col] + med_cols
174
+ combined = pd.concat([df[combined_cols], new_tab])
175
+ combined = combined.sort_values(dt_col)
176
+
177
+ # Extract median data for last 2 years
178
+ df_med = combined.groupby(group_cols).median()
179
+
180
+ # Rename median columns
181
+ new_med_cols = [col + '_med_2yr' for col in df_med.columns]
182
+ df_med.columns = new_med_cols
183
+
184
+ # Only carry forward year data that appeared in df
185
+ test = []
186
+ for k, v in df.groupby('SafeHavenID')['eoy'].unique().to_dict().items():
187
+ test.append(df_med.loc[(k, v), ])
188
+ df_med = pd.concat(test)
189
+
190
+ # Extract features to find last value of
191
+ df_last = df[group_cols + ['labs_to_date']]
192
+ df_last = df_last.groupby(group_cols).last()
193
+
194
+ # Extract features to calculate sum of
195
+ df_sum = df[group_cols + ['labs']]
196
+ df_sum = df.groupby(group_cols)['labs'].sum()
197
+
198
+ # Rename sum columns
199
+ df_sum = df_sum.to_frame()
200
+ df_sum.columns = ['labs_per_year']
201
+
202
+ # Merge datasets
203
+ df_annual = df_med.join(df_last).join(df_sum)
204
+
205
+ return df_annual
206
+
207
+
208
+ def main():
209
+
210
+ # Load in config items
211
+ with open('../../../config.json') as json_config_file:
212
+ config = json.load(json_config_file)
213
+
214
+ # Load in data
215
+ labs_file = config['extract_data_path'] + 'SCI_Store_Cohort3R.csv'
216
+ labs = initialize_labs_data(labs_file)
217
+
218
+ # Clean data
219
+ labs = clean_labs(labs)
220
+
221
+ # Save first date in dataset
222
+ data_path = config['model_data_path']
223
+ first_patient_appearance(labs, 'SampleDate', 'labs', data_path)
224
+
225
+ # Pivot and add neut_lypmh
226
+ labs = add_neut_lypmh(labs)
227
+
228
+ # Add EOY column relative to user specified date
229
+ labs = add_eoy_column(labs, 'SampleDate', config['date'])
230
+ labs = labs.sort_values('SampleDate')
231
+
232
+ # Track each lab event
233
+ labs['labs_to_date'] = 1
234
+ labs = labs.groupby('SafeHavenID').apply(add_total_labs)
235
+ labs = labs.reset_index(drop=True)
236
+
237
+ # Reduce labs to 1 row per ID per year
238
+ labs_yearly = reduce_labs_data(labs, 'SampleDate')
239
+
240
+ # Correct column names
241
+ labs_yearly.columns = correct_column_names(labs_yearly.columns, 'labs')
242
+
243
+ # Save data
244
+ labs_yearly.to_pickle(data_path + 'labs_proc.pkl')
245
+
246
+
247
+ main()
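The 2-year lookback in `reduce_labs_data` works by concatenating a copy of the data with `eoy` shifted forward one year, so grouping by `(id, eoy)` and taking the median covers the current and previous year. A minimal sketch with made-up values (not real lab results):

```python
import pandas as pd

# One patient, one ALT value per year (toy data)
df = pd.DataFrame({
    'SafeHavenID': [1, 1, 1],
    'eoy': pd.to_datetime(['2018-12-31', '2019-12-31', '2020-12-31']),
    'ALT': [10.0, 20.0, 40.0],
})

# Duplicate each row with eoy pushed forward a year, then take group medians
shifted = df.assign(eoy=df.eoy + pd.DateOffset(years=1))
combined = pd.concat([df, shifted])
med = combined.groupby(['SafeHavenID', 'eoy'])['ALT'].median()

# The 2019 group now holds the 2018 and 2019 values: median of 10 and 20
result_2019 = med.loc[(1, pd.Timestamp('2019-12-31'))]
```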
training/src/processing/process_prescribing.py ADDED
@@ -0,0 +1,145 @@
+ """
+ Script for preprocessing pharmacy data
+ --------
+ Process pharmacy data and track inhaler prescriptions and rescue meds
+ """
+ import json
+ import pandas as pd
+ from datetime import date
+ from dateutil.relativedelta import relativedelta
+ from utils.common import (add_hist_adm_presc, correct_column_names,
+                           first_patient_appearance)
+ from utils.presc_common import initialize_presc_data, track_medication
+
+
+ def add_inhaler_mappings(df):
+     """
+     Load inhaler prescription mappings and track where they appear in the data
+     --------
+     :param df: dataframe
+     :return: dataframe with column added for each inhaler type
+     """
+     print('Mapping inhaler prescriptions')
+
+     # Load in inhaler mapping
+     with open('mappings/inhaler_mapping.json') as json_file:
+         inhaler_mapping = json.load(json_file)
+
+     for k, v in inhaler_mapping.items():
+         df[k + '_inhaler'] = df.PI_Approved_Name.str.contains(
+             '|'.join(v)).astype(int)
+
+     # Remove for now as empty
+     df = df.drop(['LABA-LAMA-ICS_inhaler', 'Ignore_inhaler'], axis=1)
+
+     return df
+
+
+ def add_eoy_column(df, dt_col, eoy_date):
+     """
+     Add EOY relative to user-specified end date
+     --------
+     :param df: dataframe
+     :param dt_col: date column in dataframe
+     :param eoy_date: EOY date from config
+     :return: updated df with EOY column added
+     """
+     # Needed to stop error with creating a new column
+     df = df.reset_index(drop=True)
+
+     # Add column with user-specified end of year date
+     end_date = pd.to_datetime(eoy_date)
+     end_month = end_date.month
+     end_day = end_date.day
+
+     # Add for every year
+     df['eoy'] = [date(y, end_month, end_day) for y in df[dt_col].dt.year]
+
+     # If dt_col falls after that year's EOY date, roll EOY forward a year
+     eoy_index = df.columns[df.columns == 'eoy']
+     adm_vs_eoy = df[dt_col] > df.eoy
+     row_index = df.index[adm_vs_eoy]
+     df.loc[row_index, eoy_index] = df[adm_vs_eoy].eoy + relativedelta(years=1)
+     df['eoy'] = pd.to_datetime(df.eoy)
+
+     return df
+
+
+ def calc_presc_per_year(df):
+     """
+     Reduce data to 1 row per year
+     --------
+     :param df: dataframe to be reduced
+     :return: reduced dataframe
+     """
+     print('Reducing to 1 row per year')
+
+     # Add end of year columns
+     eoy_cols = ['presc_to_date', 'days_since_rescue', 'rescue_to_date',
+                 'anxiety_depression_presc_to_date', 'rescue_date']
+     last = df.groupby(['SafeHavenID', 'eoy'])[eoy_cols].last()
+
+     # Total columns
+     sum_cols = ['SALBUTAMOL', 'SABA_inhaler', 'LABA_inhaler', 'LAMA_inhaler',
+                 'SAMA_inhaler', 'ICS_inhaler', 'LABA-ICS_inhaler',
+                 'LAMA +LABA-ICS_inhaler', 'SABA + SAMA_inhaler',
+                 'MCS_inhaler', 'rescue_meds', 'presc', 'anxiety_depression_presc']
+     total_cols = [col + '_per_year' for col in sum_cols]
+     total = df.groupby(['SafeHavenID', 'eoy'])[sum_cols].sum()
+     total.columns = total_cols
+
+     # Join together
+     results = last.join(total)
+
+     return results
+
+
+ def main():
+
+     # Load in config items
+     with open('../../../config.json') as json_config_file:
+         config = json.load(json_config_file)
+
+     # Load in data
+     presc_file = config['extract_data_path'] + 'Pharmacy_Cohort3R.csv'
+     presc = initialize_presc_data(presc_file)
+
+     # Save first date in dataset
+     data_path = config['model_data_path']
+     first_patient_appearance(presc, 'PRESC_DATE', 'presc', data_path)
+
+     # Add inhaler mapping
+     presc = add_inhaler_mappings(presc)
+
+     # Track salbutamol and rescue meds
+     presc = track_medication(presc)
+
+     # Drop columns
+     cols_2_drop = ['PI_Approved_Name', 'PI_BNF_Item_Code', 'code']
+     presc = presc.drop(cols_2_drop, axis=1)
+
+     # Add EOY column relative to user-specified date
+     presc = add_eoy_column(presc, 'PRESC_DATE', config['date'])
+
+     # Track rows which are prescriptions
+     presc['presc'] = 1
+
+     # Add any historical count columns
+     presc = presc.groupby('SafeHavenID').apply(
+         add_hist_adm_presc, 'presc', 'PRESC_DATE')
+     presc = presc.reset_index(drop=True)
+
+     # Save per event dataset
+     presc.to_pickle(data_path + 'validation_presc_proc.pkl')
+
+     # Reduce data to 1 row per year
+     presc_yearly = calc_presc_per_year(presc)
+
+     # Correct column names
+     presc_yearly.columns = correct_column_names(presc_yearly.columns, 'presc')
+
+     # Save data
+     presc_yearly.to_pickle(data_path + 'presc_proc.pkl')
+
+
+ main()
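The inhaler-mapping loop above flags each row whose approved drug name matches any name in a mapping list. A toy sketch; the drug names and mapping below are illustrative, not the contents of `inhaler_mapping.json`:

```python
import pandas as pd

# Hypothetical mapping and prescription rows
inhaler_mapping = {'SABA': ['SALBUTAMOL', 'TERBUTALINE'],
                   'ICS': ['BECLOMETASONE']}
df = pd.DataFrame({'PI_Approved_Name': ['SALBUTAMOL SULFATE',
                                        'BECLOMETASONE DIPROPIONATE',
                                        'PARACETAMOL']})

# One 0/1 column per inhaler type, matching any listed name as a substring
for k, v in inhaler_mapping.items():
    df[k + '_inhaler'] = df.PI_Approved_Name.str.contains(
        '|'.join(v)).astype(int)
```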
training/src/processing/utils/README.md ADDED
@@ -0,0 +1,11 @@
+ # Processing Utilities
+
+ This folder contains processing utilities called within the main processing scripts in the folder above.
+
+ - `adm/comorb/labs_processing.py` contain utilities for processing each specific type of data
+
+ - `adm_reduction.py` contains reduction functions required for processing admissions
+
+ - `common.py` functions are used across processing for all datasets
+
+ - `adm_common.py` functions are used in both admissions and comorbidities
training/src/processing/utils/__init__.py ADDED
@@ -0,0 +1 @@
+ # Empty file for folder to be recognised as module
training/src/processing/utils/adm_common.py ADDED
@@ -0,0 +1,77 @@
+ """
+ Utility functions common across admission processing
+ (admissions/comorbidities/gples)
+ """
+ import pandas as pd
+ from utils.common import read_data
+ from utils.adm_processing import (update_null_stay, calculate_total_stay,
+                                   search_diag)
+
+
+ def initialize_adm_data(adm_file):
+     """
+     Load in and convert admission dataset to correct format
+     --------
+     :param adm_file: admission data file name
+     :return: admission dataframe with correct column names and types
+     """
+     print('Loading admission data')
+
+     # Read in data
+     adm_cols = ['SafeHavenID', 'ETHGRP', 'ADMDATE', 'DISDATE', 'STAY',
+                 'DIAG1Desc', 'DIAG2Desc', 'DIAG3Desc', 'DIAG4Desc',
+                 'DIAG5Desc', 'DIAG6Desc']
+     adm_types = ['int', 'object', 'object', 'object', 'int',
+                  'str', 'str', 'str', 'str', 'str', 'str']
+     df = read_data(adm_file, adm_cols, adm_types)
+
+     # Drop duplicates - nulls needed in DIAGDesc columns
+     df = df.drop_duplicates()
+
+     # Convert date columns to correct type
+     df['ADMDATE'] = pd.to_datetime(df['ADMDATE'])
+     df['DISDATE'] = pd.to_datetime(df['DISDATE'])
+
+     return df
+
+
+ def correct_stays(df):
+     """
+     Fill any null STAY data and consolidate any transfer admissions into single
+     admission occurrences
+     --------
+     :param df: admission dataframe to be corrected
+     :return: admission dataframe with null stays filled and transfers combined
+     """
+     print('Correcting stays')
+
+     # Update any null STAY data using ADM and DIS dates
+     df = update_null_stay(df)
+
+     # Correct stays for patients passed across departments
+     df = df.sort_values(['SafeHavenID', 'ADMDATE', 'DISDATE'])
+     df = df.groupby('SafeHavenID').apply(calculate_total_stay)
+     df = df.reset_index(drop=True)
+
+     return df
+
+
+ def track_copd_resp(df):
+     """
+     Search for COPD and/or respiratory admissions
+     --------
+     :param df: admission dataframe to be updated
+     :return: updated dataframe with events tracked
+     """
+     print('Tracking events')
+
+     # Strip DIAGDesc columns
+     df = df.apply(lambda x: x.str.strip() if x.dtype == 'object' else x)
+
+     # Track COPD admissions
+     df = search_diag(df, 'copd')
+
+     # Track respiratory admissions
+     df = search_diag(df, 'resp')
+
+     return df
training/src/processing/utils/adm_processing.py ADDED
@@ -0,0 +1,146 @@
+ """
+ Admission processing utilities
+ """
+ import json
+ from utils.common import track_event
+
+
+ def update_null_stay(df):
+     """
+     Calculate length of stay based on ADM/DISDATE for null STAY values
+     --------
+     :param df: pandas dataframe to be updated
+     :return: updated dataframe
+     """
+     # Check for nulls
+     is_null = df.STAY.isnull()
+
+     # If null calculate total length of stay from the admission dates
+     if is_null.any():
+         stay = df.loc[is_null, 'DISDATE'] - df.loc[is_null, 'ADMDATE']
+         df.loc[is_null, 'STAY'] = stay.dt.days.astype(float)
+
+     return df
+
+
+ def calculate_total_stay(df):
+     """
+     Convert admissions with same ADMDATE as previous DISDATE to single
+     admission where patient has been transferred between departments
+     --------
+     :param df: pandas dataframe to be updated
+     :return: updated dataframe
+     """
+     df.reset_index(inplace=True, drop=True)
+     rows_to_drop = []
+
+     # If ADMDATE matches previous DISDATE, mark as transfer and combine
+     df['transfer'] = df.ADMDATE.eq(df.DISDATE.shift())
+     for index, row in df.iloc[1:].iterrows():
+         if row.transfer:
+             df.loc[index, 'ADMDATE'] = df.iloc[index - 1].ADMDATE
+             df.loc[index, 'STAY'] = row.STAY + df.iloc[index - 1].STAY
+             rows_to_drop.append(index - 1)
+
+     # Drop original individual rows in transfer
+     df.drop(rows_to_drop, inplace=True)
+
+     # Drop tracking column
+     df.drop('transfer', axis=1, inplace=True)
+
+     return df
+
+
+ def convert_ethgrp_desc(eth):
+     """
+     Find ethnic group based on given ETHGRP string
+     --------
+     :param eth: str ethnic group description in the style of SMR01 data
+     :return: string ethnicity
+     """
+     if ("White" in eth) | ("Irish" in eth) | ("Welsh" in eth) | ("English" in eth):
+         return "White"
+
+     elif eth.startswith("British"):
+         return "White"
+
+     elif "mixed" in eth:
+         return "Mixed"
+
+     elif (("Asian" in eth) | ("Pakistani" in eth) | ("Indian" in eth)
+           | ("Bangladeshi" in eth) | ("Chinese" in eth)):
+         return "Asian"
+
+     elif ("Black" in eth) | ("Caribbean" in eth) | ("African" in eth):
+         return "Black"
+
+     elif ("Arab" in eth) | ("other ethnic" in eth):
+         return "Other"
+
+     elif "Refused" in eth:
+         return "Refused"
+
+     else:
+         return "Unknown"
+
+
+ def mode_ethnicity(v, eth_col):
+     """
+     Select the most commonly occurring ethnicity for each patient in groupby
+     --------
+     :param v: pandas patient dataframe to be updated
+     :param eth_col: str ethnicity column
+     :return: updated subset of data with common ethnicity per ID
+     """
+     eth = v[eth_col]
+     n = eth.nunique()
+     has_unk = eth.str.contains('Unknown')
+     any_unk = has_unk.any()
+     wout_unk = ~has_unk
+     has_ref = eth.str.contains('Refused')
+     any_ref = has_ref.any()
+     wout_ref = ~has_ref
+
+     # Select ethnicities excluding 'Unknown' or 'Refused' where possible
+     if any_unk & any_ref & (n > 2):
+         eth = eth[wout_unk & wout_ref]
+     elif any_unk & (n > 1):
+         eth = eth[wout_unk]
+     elif any_ref & (n > 1):
+         eth = eth[wout_ref]
+
+     # Select the most commonly appearing ethnicity
+     main_eth = eth.mode().values[0]
+     v[eth_col] = main_eth
+
+     return v
+
+
+ def search_diag(df, typ):
+     """
+     Search diagnosis columns for descriptions indicative of copd or resp events
+     --------
+     :param df: dataframe to search
+     :param typ: 'copd', 'resp' or 'anxiety_depression'
+     :return: dataframe with column added tracking specific type of admission
+     """
+     # Columns to search
+     diag_cols = ['DIAG1Desc', 'DIAG2Desc', 'DIAG3Desc', 'DIAG4Desc',
+                  'DIAG5Desc', 'DIAG6Desc']
+
+     # Load mappings
+     with open('mappings/diag_copd_resp_desc.json') as json_file:
+         copd_resp_desc = json.load(json_file)
+
+     # Select mappings relevant to desired type of admission
+     desc = copd_resp_desc[typ]
+
+     # copd descriptions will only require searching a single specific phrase
+     single = typ == 'copd'
+
+     # Search columns and track
+     df[typ + '_event'] = df[diag_cols].apply(
+         lambda x: track_event(x, desc, single)).any(axis=1).astype(int)
+
+     return df
@@ -0,0 +1,65 @@
 
+ """
+ Admission reduction utilities
+ """
+ import pandas as pd
+ from datetime import date
+
+
+ def fill_missing_years(df):
+     """
+     Add admission data from years where patient is missing from the dataset
+     --------
+     :param df: dataframe to be updated
+     :return: dataframe with missing years added
+     """
+     df = df.sort_values('ADMDATE')
+     year_col = df.eoy.dt.year.tolist()
+     end_month = df.eoy.dt.month.iloc[0]
+     end_day = df.eoy.dt.day.iloc[0]
+
+     # We only want missing years
+     year_range = range(year_col[0] + 1, year_col[-1])
+     years = [y for y in year_range if y not in year_col]
+
+     # If any years missing add rows
+     if len(years) > 0:
+         sh_id = df.SafeHavenID.iloc[0]
+         eth_grp = df.eth_grp.iloc[0]
+         adm_dates = pd.to_datetime([date(y, end_month, end_day) for y in years])
+         data = {'SafeHavenID': sh_id, 'eth_grp': eth_grp, 'ADMDATE': adm_dates,
+                 'STAY': 0, 'copd_event': 0, 'resp_event': 0, 'eoy': adm_dates,
+                 'adm': 0, 'anxiety_depression_event': 0}
+         missed_years = pd.DataFrame(data)
+         df = pd.concat([df, missed_years]).sort_values('ADMDATE')
+
+     return df
+
+
+ def calc_adm_per_year(df):
+     """
+     Reduce data to 1 row per year
+     --------
+     :param df: dataframe to be reduced
+     :return: reduced dataframe
+     """
+     # Last EOY columns
+     eoy_cols = ['eth_grp', 'days_since_copd', 'days_since_resp', 'days_since_adm',
+                 'adm_to_date', 'copd_to_date', 'resp_to_date',
+                 'anxiety_depression_to_date', 'copd_date', 'resp_date', 'adm_date']
+     last = df.groupby(['SafeHavenID', 'eoy'])[eoy_cols].last()
+
+     # Average column
+     los = df.groupby(['SafeHavenID', 'eoy'])[['STAY']].mean()
+     los.columns = ['mean_los']
+
+     # Total columns
+     sum_cols = ['adm', 'copd_event', 'resp_event', 'anxiety_depression_event', 'STAY']
+     total_cols = ['adm_per_year', 'copd_per_year', 'resp_per_year',
+                   'anxiety_depression_per_year', 'total_hosp_days']
+     total = df.groupby(['SafeHavenID', 'eoy'])[sum_cols].sum()
+     total.columns = total_cols
+
+     # Join together
+     results = last.join(los).join(total)
+
+     return results
training/src/processing/utils/common.py ADDED
@@ -0,0 +1,132 @@
+ """
+ Utilities required across all processing scripts
+ """
+ import pandas as pd
+ import numpy as np
+
+
+ def read_data(file, cols, types):
+     """
+     Read in data source
+     --------
+     :param file: string filename
+     :param cols: string list of column names
+     :param types: string list of column types
+     :return: dataframe
+     """
+     schema = dict(zip(cols, types))
+     df = pd.read_csv(file, usecols=cols, encoding="cp1252", dtype=schema)
+
+     return df
+
+
+ def first_patient_appearance(df, dt_col, typ, data_path):
+     """
+     Save first appearance of patient in dataset
+     --------
+     :param df: dataframe to check
+     :param dt_col: date column to sort by
+     :param typ: type of dataset being used
+     :param data_path: path to data extracts
+     :return: None; dataframe of first dates saved to pickle
+     """
+     df = df.sort_values(dt_col)
+     df_first = df.groupby('SafeHavenID')[dt_col].first()
+     df_first = df_first.to_frame().reset_index()
+     df_first.columns = ['SafeHavenID', 'first_adm']
+     df_first.to_pickle(data_path + typ + '_first_dates.pkl')
+
+
+ def add_days_since_event(df, typ, dt_col):
+     """
+     Historical features: add days since features e.g. copd/resp/rescue
+     --------
+     :param df: dataframe to be updated
+     :param typ: 'rescue', 'adm', 'copd' or 'resp' feature to be created
+     :param dt_col: str date column name
+     :return: updated dataframe with historical column added
+     """
+     if typ == 'rescue':
+         event_col = 'rescue_meds'
+     elif typ == 'adm':
+         event_col = 'adm'
+     else:
+         event_col = typ + '_event'
+     date_col = typ + '_date'
+     days_col = 'days_since_' + typ
+     df[date_col] = df.apply(
+         lambda x: x[dt_col] if x[event_col] else np.nan, axis=1).ffill()
+     if df[date_col].isna().all():
+         df[days_col] = np.nan
+     else:
+         df[days_col] = (df.eoy - df[date_col]).dt.days
+
+     return df
+
+
+ def track_event(x, desc, single):
+     """
+     Fill nulls and search to see if x matches a description
+     --------
+     :param x: str list of features to track
+     :param desc: str list to compare
+     :param single: boolean for checking against a single description e.g.
+         "COPD" True otherwise False
+     :return: tracked feature list
+     """
+     x = x.fillna('')
+
+     # COPD only has a single description to search
+     if single:
+         result = [desc in s for s in x]
+
+     # Respiratory has a list of descriptions to search
+     else:
+         result = [s in desc for s in x]
+
+     return result
+
+
+ def add_hist_adm_presc(df, typ, dt_col):
+     """
+     Historical features: add days since and to-date features
+     --------
+     :param df: dataframe to be updated
+     :param typ: type of data - 'adm' or 'presc'
+     :param dt_col: string name of date column
+     :return: updated dataframe with historical columns added
+     """
+     if typ == 'presc':
+         df = df.sort_values(dt_col).reset_index(drop=True)
+         df = add_days_since_event(df, 'rescue', dt_col)
+         df['rescue_to_date'] = df.rescue_meds.cumsum()
+         df['anxiety_depression_presc_to_date'] = df.anxiety_depression_presc.cumsum()
+     else:
+         for col in ['adm', 'copd', 'resp']:
+             df = add_days_since_event(df, col, dt_col)
+         for col in ['copd', 'resp', 'anxiety_depression']:
+             df[col + '_to_date'] = df[col + '_event'].cumsum()
+
+     # Add counter for events to date
+     df[typ + '_to_date'] = df[typ].cumsum()
+
+     return df
+
+
+ def correct_column_names(cols, typ):
+     """
+     Convert column names to lower case and fill any spaces with underscores
+     --------
+     :param cols: string list of column names
+     :param typ: type of dataset being updated
+     :return: cleaned column names
+     """
+     print('Correcting column headers')
+
+     if typ == 'presc':
+         lower_cols = cols.str.replace('[+-]', ' ', regex=True).str.lower()
+         new_cols = ["_".join(col.split()) for col in lower_cols]
+     else:
+         new_cols = cols.str.lower().str.replace(' ', '_').tolist()
+
+     return new_cols
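For prescribing columns, `correct_column_names` turns `+`/`-` into spaces, lowercases, and collapses whitespace runs into single underscores. A quick sketch on two of the inhaler column names:

```python
import pandas as pd

# Same transform applied in the 'presc' branch of correct_column_names
cols = pd.Index(['LAMA +LABA-ICS_inhaler', 'SABA + SAMA_inhaler'])
lower_cols = cols.str.replace('[+-]', ' ', regex=True).str.lower()
new_cols = ['_'.join(col.split()) for col in lower_cols]
```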
training/src/processing/utils/comorb_processing.py ADDED
@@ -0,0 +1,20 @@
+ """
+ Comorbidities processing utilities
+ """
+ import pandas as pd
+
+
+ def diagnosis_mapping_lists(excel_file, sheet_name, diagnosis_names):
+     """
+     Create mapping between diagnoses and comorbidities
+     --------
+     :param excel_file: str filename for diagnosis mapping
+     :param sheet_name: str sheet name for diagnosis mapping
+     :param diagnosis_names: str list of diagnoses
+     :return: dictionary of diagnosis names and values
+     """
+     df_diag = pd.read_excel(excel_file, sheet_name, skiprows=range(0, 1))
+     df_lists = df_diag.T.values.tolist()
+     diag_lists = [[s.strip() for s in x if pd.notna(s)] for x in df_lists]
+
+     return dict(zip(diagnosis_names, diag_lists))
training/src/processing/utils/labs_processing.py ADDED
@@ -0,0 +1,16 @@
+ """
+ Labs processing utilities
+ """
+
+
+ def add_total_labs(df):
+     """
+     Historical features: to-date features
+     --------
+     :param df: dataframe to be updated
+     :return: updated dataframe with historical columns added
+     """
+     # Add counter for labs to date
+     df['labs_to_date'] = df.labs_to_date.cumsum()
+
+     return df
training/src/processing/utils/presc_common.py ADDED
@@ -0,0 +1,68 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import pandas as pd
2
+ from utils.common import read_data
3
+
4
+
5
+ steroid_codes = ['0603020T0AAACAC', '0603020T0AABKBK', '0603020T0AAAXAX',
+                  '0603020T0AAAGAG', '0603020T0AABHBH', '0603020T0AAACAC',
+                  '0603020T0AABKBK', '0603020T0AABNBN', '0603020T0AAAGAG',
+                  '0603020T0AABHBH']
+
+ antib_codes = ['0501013B0AAAAAA', '0501013B0AAABAB', '0501030I0AAABAB',
+                '0501030I0AAAAAA', '0501050B0AAAAAA', '0501050B0AAADAD',
+                '0501013K0AAAJAJ']
+
+ exac_meds = steroid_codes + antib_codes
+
+
+ def initialize_presc_data(presc_file):
+     """
+     Load the prescribing dataset into the correct format
+     --------
+     :param presc_file: prescribing data file name
+     :return: prescribing dataframe with correct column names and types
+     """
+     print('Loading prescribing data')
+
+     # Read in data
+     presc_cols = ['SafeHavenID', 'PRESC_DATE', 'PI_Approved_Name',
+                   'PI_BNF_Item_Code']
+     presc_types = ['int', 'object', 'str', 'str']
+     df = read_data(presc_file, presc_cols, presc_types)
+
+     # Drop any nulls or duplicates
+     df = df.dropna()
+     df = df.drop_duplicates()
+
+     # Convert date
+     df['PRESC_DATE'] = pd.to_datetime(df.PRESC_DATE)
+
+     return df
+
+
+ def track_medication(df):
+     """
+     Track salbutamol and rescue medication prescriptions
+     https://openprescribing.net/bnf/
+     --------
+     :param df: dataframe
+     :return: dataframe with tracked meds
+     """
+     print('Tracking medication')
+
+     # Extract BNF codes without brand info
+     df['code'] = df.PI_BNF_Item_Code.str[0:9]
+
+     # Add flag for salbutamol - marked important by Chris
+     df['SALBUTAMOL'] = (df.code == '0301011R0').astype(int)
+
+     # Track rescue meds
+     df['rescue_meds'] = df.PI_BNF_Item_Code.str.contains(
+         '|'.join(exac_meds)).astype(int)
+
+     # Track anxiety and depression medication, excluding DUMMY records
+     ad_bnf = ('040102', '0403', '0204000R0', '0408010AE')
+     ad_events = df.PI_BNF_Item_Code.str.startswith(ad_bnf).fillna(False)
+     drop_dummy = ~df.PI_Approved_Name.isin(['DUMMY', 'DUMMY REJECTED'])
+     df['anxiety_depression_presc'] = (drop_dummy & ad_events).astype(int)
+
+     return df
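The prefix-based flagging in `track_medication` can be sketched on a toy frame; the rows below are illustrative only, not study data:

```python
import pandas as pd

# Toy prescribing rows (illustrative codes, not study data)
df = pd.DataFrame({
    'PI_Approved_Name': ['SALBUTAMOL', 'SERTRALINE', 'DUMMY'],
    'PI_BNF_Item_Code': ['0301011R0AAAPAP', '0403030Q0AAAAAA',
                         '0403030Q0AAAAAA'],
})

# The first nine characters of a BNF item code identify the chemical,
# ignoring brand/pack information
df['code'] = df.PI_BNF_Item_Code.str[:9]
df['SALBUTAMOL'] = (df.code == '0301011R0').astype(int)

# Anxiety/depression prefixes, excluding placeholder DUMMY records
ad_bnf = ('040102', '0403', '0204000R0', '0408010AE')
ad_events = df.PI_BNF_Item_Code.str.startswith(ad_bnf)
not_dummy = ~df.PI_Approved_Name.isin(['DUMMY', 'DUMMY REJECTED'])
df['anxiety_depression_presc'] = (ad_events & not_dummy).astype(int)
```

The second row is flagged as an anxiety/depression prescription, while the third matches the prefix but is suppressed as a DUMMY record.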
training/src/reduction/README.md ADDED
@@ -0,0 +1,12 @@
+ # Reduction
+
+ This folder contains scripts for combining, reducing, filling and scaling processed EHR data for modelling.
+
+ Note that the scripts must be run in the below order:
+ 1. `combine.py` - combine datasets and perform any post-processing
+ 2. `post_proc_reduction.py` - combine columns to reduce zero-valued features
+ 3. `remove_ids.py` - remove receiver, scale-up and test IDs
+ 4. `clean_and_scale_train.py` - impute nulls and min-max scale training data
+ 5. `clean_and_scale_test.py` - impute nulls and min-max scale testing data
+
+ _NB: The data_type in `clean_and_scale_test.py` can be changed to rec, sup, val and test._
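A minimal driver for the ordering above might look like the sketch below. It assumes the scripts are launched from this folder and that each one exits non-zero on failure; the `'test'` argument is one of the data types listed in the NB:

```python
import subprocess
import sys

# The documented pipeline order; the final step takes a data_type
# argument (rec, sup, val or test)
STEPS = [
    ['combine.py'],
    ['post_proc_reduction.py'],
    ['remove_ids.py'],
    ['clean_and_scale_train.py'],
    ['clean_and_scale_test.py', 'test'],
]


def run_pipeline(steps=STEPS):
    for step in steps:
        # check=True aborts at the first failure so later stages never
        # run against stale intermediate files
        subprocess.run([sys.executable] + step, check=True)
```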
training/src/reduction/__init__.py ADDED
@@ -0,0 +1 @@
+ # Empty file for folder to be recognised as module
training/src/reduction/clean_and_scale_test.py ADDED
@@ -0,0 +1,173 @@
+ """
+ TESTING
+ Impute any null data, save ethnicity info for each ID and scale the
+ final dataset
+
+ NB: This script can be used for merged receiver, scale-up or testing data
+ """
+ import json
+ import sys
+ import joblib
+ import pandas as pd
+ import numpy as np
+ from numpy import loadtxt
+
+
+ ds_cols = ['days_since_copd_resp', 'days_since_adm', 'days_since_rescue']
+
+ null_cols = ['alt_med_2yr', 'ast_med_2yr', 'albumin_med_2yr',
+              'alkaline_phosphatase_med_2yr', 'basophils_med_2yr',
+              'c_reactive_protein_med_2yr', 'chloride_med_2yr',
+              'creatinine_med_2yr', 'eosinophils_med_2yr',
+              'estimated_gfr_med_2yr', 'haematocrit_med_2yr',
+              'haemoglobin_med_2yr', 'lymphocytes_med_2yr',
+              'mch_med_2yr', 'mean_cell_volume_med_2yr',
+              'monocytes_med_2yr', 'neutrophils_med_2yr',
+              'platelets_med_2yr', 'potassium_med_2yr',
+              'red_blood_count_med_2yr', 'sodium_med_2yr',
+              'total_bilirubin_med_2yr', 'urea_med_2yr',
+              'white_blood_count_med_2yr', 'neut_lymph_med_2yr',
+              'days_since_copd_resp', 'days_since_adm', 'days_since_rescue']
+
+ cols2drop = ['eth_grp', 'entry_dataset', 'first_entry', 'obf_dob',
+              'marital_status', 'label', 'simd_vigintile', 'simd_decile',
+              'simd_quintile', 'sex_bin']
+
+
+ def calc_age_bins_test(df, data_path):
+     """
+     Load training bin edges and assign them to the testing data
+     --------
+     :param df: dataframe to be updated
+     :param data_path: path to generated data
+     :return: updated dataframe
+     """
+     edges = loadtxt(data_path + 'age_bins_train.csv', delimiter=',')
+     # Use pd.cut with the saved training edges (rather than re-deriving
+     # quantiles from the test data) so test bins match the training bins
+     categories = pd.cut(
+         df['age'], bins=edges, labels=edges[1:], include_lowest=True)
+     df['age_bin'] = categories.astype(int)
+
+     return df
+
+
+ def create_label(df):
+     """
+     Create a label containing the age and sex bins of the data
+     --------
+     :param df: dataframe
+     :return: dataframe with label added
+     """
+     df['label'] = df['age_bin'].astype(str) + '_' + df['sex_bin'].astype(str)
+     df = df.drop('age_bin', axis=1)
+
+     return df
+
+
+ def fill_nulls(label, df, medians):
+     """
+     Fill any null values in testing/REC/SUP data with median values from
+     training data.
+     --------
+     :param label: string label containing age and sex bin values, e.g. '51_0'
+         for a male patient in the less-than-51 age bin
+     :param df: dataframe
+     :param medians: dataframe of training set medians for each label and
+         column
+     :return: filled dataframe for specified label
+     """
+     meds = medians[medians['label'] == label].iloc[0]
+     df_2_fill = df[df['label'] == label].copy()
+     for col in null_cols:
+         df_2_fill[col] = df_2_fill[col].fillna(meds[col])
+
+     return df_2_fill
+
+
+ def ds_fill_5year_test(df, col, max_vals):
+     """
+     Fill days_since_X columns where the patient has been in the dataset
+     less than 5 years
+     --------
+     :param df: dataframe to be updated
+     :param col: column to check
+     :param max_vals: series with columns and their max value from training
+     :return: dataframe with column nulls filled where patient has ggc_years < 5
+     """
+     df_5years = df.ggc_years < 5
+     df.loc[df_5years, col] = df.loc[df_5years, col].fillna(max_vals[col])
+
+     return df
+
+
+ def scale_data_test(df, scaler):
+     """
+     Min-max scale the final dataset
+     --------
+     :param df: dataframe to be scaled
+     :param scaler: scaler object fitted on the training data
+     :return: scaled dataset for modelling
+     """
+     all_cols = df.columns.drop(['SafeHavenID', 'eoy'])
+     data_scaled = scaler.transform(df[all_cols].to_numpy())
+     df_scaled = pd.DataFrame(data_scaled, columns=all_cols)
+     df_final = (df[['SafeHavenID', 'eoy']]
+                 .reset_index(drop=True)
+                 .join(df_scaled))
+
+     return df_final
+
+
+ def main():
+
+     # Load in config items
+     with open('../../../config.json') as json_config_file:
+         config = json.load(json_config_file)
+
+     # Get generated data path
+     data_path = config['model_data_path']
+
+     # Get data type from the command line (rec, sup, val or test)
+     data_type = sys.argv[1]
+
+     # Read in data
+     df = pd.read_pickle(data_path + 'merged_' + data_type + '.pkl')
+
+     # Load training age groups and apply to data
+     df = calc_age_bins_test(df, data_path)
+
+     # Load in training median for each age-bin/sex-bin labelled group
+     df_medians = pd.read_pickle(data_path + 'medians.pkl')
+     df_medians = df_medians.reset_index()
+     df_medians = create_label(df_medians)
+     df = create_label(df)
+     labels = df_medians['label']
+
+     # Fill null days_since columns for patients with ggc_years < 5
+     max_vals = pd.read_pickle(data_path + 'maxs.pkl')
+     for col in ds_cols:
+         df = ds_fill_5year_test(df, col, max_vals)
+
+     # Fill remaining nulls using training medians
+     df_filled = pd.concat([fill_nulls(x, df, df_medians) for x in labels])
+
+     # Convert ds_cols from timedelta to int days
+     day = np.timedelta64(1, 'D')
+     for col in ds_cols:
+         df_filled[col] = (df_filled[col] / day).astype(int)
+
+     # Save processed data before scaling
+     df_filled.to_pickle(data_path + 'filled_' + data_type + '.pkl')
+
+     # Drop non-modelling columns
+     df_filled = df_filled.drop(cols2drop, axis=1)
+
+     # Load in min-max scaler from training set
+     scaler = joblib.load(data_path + 'min_max_scaler_train.pkl')
+     df_filled = scale_data_test(df_filled, scaler)
+
+     # Save final dataset
+     df_filled.to_pickle(data_path + 'min_max_' + data_type + '.pkl')
+
+
+ if __name__ == '__main__':
+     main()
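The key step in `calc_age_bins_test` is reusing the training quantile edges rather than re-binning the test ages from scratch. With toy values in place of the real age data:

```python
import pandas as pd

# Training: derive decile edges from the training ages (toy values)
train_ages = pd.Series(range(40, 90))
_, edges = pd.qcut(train_ages, q=10, retbins=True)

# Testing: assign the *training* edges with pd.cut, so a test patient
# lands in the same bin a training patient of that age would
test_ages = pd.Series([41, 67, 88])
bins = pd.cut(test_ages, bins=edges, labels=edges[1:], include_lowest=True)
age_bin = bins.astype(float)
```

Calling `pd.qcut` again on the test ages would instead derive new quantiles from the test distribution, so the two datasets would no longer share a binning.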
training/src/reduction/clean_and_scale_train.py ADDED
@@ -0,0 +1,171 @@
+ """
+ TRAIN
+ Impute any null data, save ethnicity info for each ID and scale the
+ final dataset
+ """
+ import json
+ import joblib
+ import pandas as pd
+ import numpy as np
+ from numpy import savetxt
+ from sklearn.preprocessing import MinMaxScaler
+ from utils.reduction import calc_ds_med
+
+
+ demo_cols = ['age_bin', 'sex_bin']
+
+ ds_cols = ['days_since_copd_resp', 'days_since_adm', 'days_since_rescue']
+
+ null_cols = ['alt_med_2yr', 'ast_med_2yr', 'albumin_med_2yr',
+              'alkaline_phosphatase_med_2yr', 'basophils_med_2yr',
+              'c_reactive_protein_med_2yr', 'chloride_med_2yr',
+              'creatinine_med_2yr', 'eosinophils_med_2yr',
+              'estimated_gfr_med_2yr', 'haematocrit_med_2yr',
+              'haemoglobin_med_2yr', 'lymphocytes_med_2yr',
+              'mch_med_2yr', 'mean_cell_volume_med_2yr',
+              'monocytes_med_2yr', 'neutrophils_med_2yr',
+              'platelets_med_2yr', 'potassium_med_2yr',
+              'red_blood_count_med_2yr', 'sodium_med_2yr',
+              'total_bilirubin_med_2yr', 'urea_med_2yr',
+              'white_blood_count_med_2yr', 'neut_lymph_med_2yr']
+
+ cols2drop = ['eth_grp', 'entry_dataset', 'first_entry', 'obf_dob',
+              'sex_bin', 'marital_status', 'age_bin',
+              'days_since_copd_resp_med', 'days_since_adm_med',
+              'days_since_rescue_med', 'simd_vigintile', 'simd_decile',
+              'simd_quintile']
+
+
+ def calc_age_bins_train(df, data_path):
+     """
+     Split ages into 10 bins and save the bin edges for filling test data
+     --------
+     :param df: dataframe to be updated
+     :param data_path: path to generated data
+     :return: updated dataframe
+     """
+     # Split the age column into 10 quantile buckets, labelling each bin
+     # with its upper edge
+     _, edges = pd.qcut(df['age'], q=10, precision=0, retbins=True)
+     categories = pd.qcut(df['age'], q=10, precision=0, labels=edges[1:])
+     df['age_bin'] = categories.astype(int)
+
+     # Save the bin edges for the test data
+     savetxt(data_path + 'age_bins_train.csv', edges, delimiter=',')
+
+     return df
+
+
+ def calc_df_med(df, data_path):
+     """
+     Calculate the medians for all columns in the dataset
+     --------
+     :param df: dataframe to update
+     :param data_path: path to generated data
+     :return: dataframe with days_since median columns added
+     """
+     # Calculate the median for all columns except SafeHavenID and eoy
+     all_cols = df.columns.drop(['SafeHavenID', 'eoy'])
+     df_median = df[all_cols].groupby(demo_cols).median()
+
+     # Calculate medians for ds_cols
+     ds_med = df[demo_cols + ds_cols].groupby(demo_cols).apply(calc_ds_med)
+
+     # Join ds_cols medians to the median table
+     df_median = df_median.join(ds_med)
+
+     # Save medians for imputing testing data
+     df_median.to_pickle(data_path + 'medians.pkl')
+
+     # Rename and add to the original dataframe
+     ds_med.columns += '_med'
+     df = df.join(ds_med, on=demo_cols)
+
+     return df
+
+
+ def ds_fill_5year_train(df, col):
+     """
+     Fill days_since_X columns where the patient has been in the dataset
+     less than 5 years
+     --------
+     :param df: dataframe to be updated
+     :param col: column to check
+     :return: dataframe with column nulls filled where patient has ggc_years < 5
+     """
+     df_5years = df.ggc_years < 5
+     df.loc[df_5years, col] = df.loc[df_5years, col].fillna(df[col].max())
+
+     return df
+
+
+ def scale_data_train(df, data_path, scaler):
+     """
+     Min-max scale the final dataset
+     --------
+     :param df: dataframe to be scaled
+     :param data_path: path to generated data
+     :param scaler: scaler object to fit and apply to df
+     :return: scaled dataset for modelling
+     """
+     all_cols = df.columns.drop(['SafeHavenID', 'eoy'])
+     data_scaled = scaler.fit_transform(df[all_cols].to_numpy())
+     df_scaled = pd.DataFrame(data_scaled, columns=all_cols)
+     df_final = (df[['SafeHavenID', 'eoy']]
+                 .reset_index(drop=True)
+                 .join(df_scaled))
+
+     # Save the fitted scaler for testing
+     joblib.dump(scaler, data_path + 'min_max_scaler_train.pkl')
+
+     return df_final
+
+
+ def main():
+
+     # Load in config items
+     with open('../../../config.json') as json_config_file:
+         config = json.load(json_config_file)
+     data_path = config['model_data_path']
+
+     # Read in combined data
+     df = pd.read_pickle(data_path + 'merged_train.pkl')
+
+     # Calculate age bins
+     df = calc_age_bins_train(df, data_path)
+
+     # Calculate medians for each column for imputation
+     df = calc_df_med(df, data_path)
+
+     # Fill null columns with the group median (transform keeps alignment)
+     df[null_cols] = df.groupby(demo_cols)[null_cols].transform(
+         lambda x: x.fillna(x.median()))
+
+     # Fill null days_since columns and convert to int days
+     day = np.timedelta64(1, 'D')
+     df[ds_cols].max().to_pickle(data_path + 'maxs.pkl')
+     for col in ds_cols:
+         df = ds_fill_5year_train(df, col)
+         df[col] = df[col].fillna(df[col + '_med'])
+         df[col] = (df[col] / day).astype(int)
+
+     # Save processed data before scaling
+     df.to_pickle(data_path + 'filled_train.pkl')
+
+     # Drop non-modelling columns
+     df = df.drop(cols2drop, axis=1)
+
+     # Initialize scaler
+     scaler = MinMaxScaler()
+
+     # Scale final dataset
+     df_final = scale_data_train(df, data_path, scaler)
+
+     # Save final dataset
+     df_final.to_pickle(data_path + 'min_max_train.pkl')
+
+
+ if __name__ == '__main__':
+     main()
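The fit-on-train / transform-on-test contract between `scale_data_train` and `scale_data_test` reduces to the usual scikit-learn pattern; with toy numbers in place of the feature matrix:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Fit on training data only; the fitted object is what gets pickled
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[5.0], [12.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)
# The test set is transformed with the training min/max, never refitted,
# so values outside the training range can fall outside [0, 1]
test_scaled = scaler.transform(test)
```

Refitting on the test set would silently give the model differently scaled features at evaluation time, which is exactly what saving and reloading `min_max_scaler_train.pkl` avoids.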
training/src/reduction/combine.py ADDED
@@ -0,0 +1,217 @@
+ """
+ To Do:
+     - Refactor script to be more readable / smaller main function
+ """
+ import json
+ import pandas as pd
+ import numpy as np
+ from datetime import timedelta
+
+
+ def read_pkl_data(dataset, data_path, path_type):
+     """
+     Read in a pickled dataset
+     --------
+     :param dataset: type of dataset to read in
+     :param data_path: path to generated data
+     :param path_type: type of path to read from ('data' or 'date')
+     :return: dataframe
+     """
+     print('Reading in ' + dataset)
+
+     file_path = data_path + dataset
+     if path_type == 'data':
+         file_path += '_proc.pkl'
+     else:
+         file_path += '_first_dates.pkl'
+
+     return pd.read_pickle(file_path)
+
+
+ def fill_eth_grp_data(df):
+     """
+     Fill nulls in the eth_grp column introduced in joining
+     :param df: dataframe to update
+     :return: filled dataframe
+     """
+     df['eth_grp'] = df.groupby('SafeHavenID').eth_grp.apply(
+         lambda x: x.ffill().bfill())
+     df['eth_grp'] = df['eth_grp'].fillna('Unknown')
+
+     return df
+
+
+ def fill_to_date_columns(df):
+     """
+     Fill nulls in to_date columns introduced in joining
+     :param df: dataframe to update
+     :return: filled dataframe
+     """
+     to_date_cols = ['adm_to_date', 'copd_to_date', 'resp_to_date',
+                     'presc_to_date', 'rescue_to_date', 'labs_to_date',
+                     'anxiety_depression_to_date',
+                     'anxiety_depression_presc_to_date']
+     df[to_date_cols] = df.groupby('SafeHavenID')[to_date_cols].apply(
+         lambda x: x.ffill().fillna(0))
+
+     return df
+
+
+ def fill_yearly_columns(df):
+     """
+     Fill nulls in yearly columns introduced in joining
+     :param df: dataframe to update
+     :return: filled dataframe
+     """
+     zero_cols = ['adm_per_year', 'total_hosp_days', 'mean_los',
+                  'copd_per_year', 'resp_per_year', 'comorb_per_year',
+                  'salbutamol_per_year',
+                  'saba_inhaler_per_year', 'laba_inhaler_per_year',
+                  'lama_inhaler_per_year', 'sama_inhaler_per_year',
+                  'ics_inhaler_per_year', 'laba_ics_inhaler_per_year',
+                  'lama_laba_ics_inhaler_per_year', 'saba_sama_inhaler_per_year',
+                  'mcs_inhaler_per_year', 'rescue_meds_per_year',
+                  'presc_per_year', 'labs_per_year',
+                  'anxiety_depression_per_year', 'anxiety_depression_presc_per_year']
+     df[zero_cols] = df[zero_cols].fillna(0)
+
+     return df
+
+
+ def fill_days_since(df, typ):
+     """
+     Fill days_since_copd/resp/rescue/adm
+     :param df: dataframe to update
+     :param typ: type of feature to fill ('copd', 'resp', 'rescue', 'adm')
+     :return: filled dataframe
+     """
+     df['days_since_' + typ] = df.eoy - df[typ + '_date'].ffill()
+
+     return df
+
+
+ def process_first_dates(df):
+     """
+     Process the dataframe containing each patient's first date in the
+     health board region
+     --------
+     :param df: dataframe to process
+     :return: processed dataframe
+     """
+     df = df.set_index('SafeHavenID')
+     entry_dataset = df.idxmin(axis=1).apply(lambda x: x.split('_')[1])
+     first_entry = df.min(axis=1)
+     df['entry_dataset'] = entry_dataset
+     df['first_entry'] = first_entry
+     df_reduced = df[['entry_dataset', 'first_entry']].reset_index()
+
+     return df_reduced
+
+
+ def find_closest_simd(v):
+     """
+     Find the closest SIMD release for each row 'v'
+     --------
+     :param v: row of data from apply statement
+     :return: row with simd_quintile/decile/vigintile taken from the most
+         recent SIMD release at or before the row's end of year
+     """
+     simd_years = [2009, 2012, 2016]
+     bools = [v.eoy.year >= year for year in simd_years]
+     if any(bools):
+         simd_year = str(simd_years[np.where(bools)[0][-1]])
+         v['simd_quintile'] = v['simd_' + simd_year + '_quintile']
+         v['simd_decile'] = v['simd_' + simd_year + '_decile']
+         v['simd_vigintile'] = v['simd_' + simd_year + '_vigintile']
+     else:
+         v['simd_quintile'] = np.nan
+         v['simd_decile'] = np.nan
+         v['simd_vigintile'] = np.nan
+
+     return v
+
+
+ def main():
+
+     # Load in config items
+     with open('../../../config.json') as json_config_file:
+         config = json.load(json_config_file)
+     data_path = config['model_data_path']
+
+     # Read in data
+     adm = read_pkl_data('adm', data_path, 'data')
+     comorb = read_pkl_data('comorb', data_path, 'data')
+     presc = read_pkl_data('presc', data_path, 'data')
+     labs = read_pkl_data('labs', data_path, 'data')
+     demo = read_pkl_data('demo', data_path, 'data')
+
+     # Join datasets
+     df = adm.join(
+         comorb, how='left').join(
+         presc, how='outer').join(
+         labs, how='outer')
+     df = df.reset_index()
+
+     # Fill nulls introduced in joining
+     print('Filling data')
+     df = fill_eth_grp_data(df)
+     df = fill_to_date_columns(df)
+     df = fill_yearly_columns(df)
+
+     # Fill days_since columns
+     for typ in ['copd', 'resp', 'rescue', 'adm']:
+         df = df.groupby('SafeHavenID').apply(fill_days_since, typ)
+
+     # Reduce to a single column
+     ds_cols = ['days_since_copd', 'days_since_resp']
+     df['days_since_copd_resp'] = df[ds_cols].min(axis=1)
+
+     # Read in first date data
+     print('Adding first dates')
+     adm_dates = read_pkl_data('adm', data_path, 'date')
+     presc_dates = read_pkl_data('presc', data_path, 'date')
+     labs_dates = read_pkl_data('labs', data_path, 'date')
+
+     # Merge first date data
+     first_dates = pd.merge(
+         pd.merge(adm_dates, presc_dates, how='outer', on='SafeHavenID'),
+         labs_dates, how='outer', on='SafeHavenID')
+
+     # Save first dates if needed
+     first_dates.to_pickle(data_path + 'overall_first_dates.pkl')
+
+     # Process first dates
+     date_data = process_first_dates(first_dates)
+
+     # Merge first dates data with dataframe
+     print('Merging data')
+     df_merged = pd.merge(df, date_data, on='SafeHavenID', how='inner')
+
+     # Add years in health board region
+     ggc_years = (df_merged.eoy - df_merged.first_entry) / np.timedelta64(1, 'Y')
+     df_merged['ggc_years'] = round(ggc_years)
+
+     # Merge demographics
+     df_merged = pd.merge(df_merged, demo, on='SafeHavenID')
+
+     # Calculate age relative to end of year
+     dt_diff = df_merged.eoy - pd.to_datetime(df_merged.obf_dob)
+     df_merged['age'] = dt_diff // timedelta(days=365.2425)
+
+     # Find closest SIMD
+     df_merged = df_merged.apply(find_closest_simd, axis=1)
+
+     # Drop additional columns
+     cols2drop = ['copd_date', 'resp_date', 'adm_date', 'rescue_date',
+                  'simd_2009_quintile', 'simd_2009_decile',
+                  'simd_2009_vigintile', 'simd_2012_quintile',
+                  'simd_2012_decile', 'simd_2012_vigintile',
+                  'simd_2016_quintile', 'simd_2016_decile',
+                  'simd_2016_vigintile', 'days_since_copd',
+                  'days_since_resp']
+     df_merged = df_merged.drop(cols2drop, axis=1)
+
+     # Save dataset
+     df_merged.to_pickle(data_path + 'merged_full.pkl')
+
+
+ if __name__ == '__main__':
+     main()
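The release-selection logic in `find_closest_simd` picks the most recent SIMD year at or before each row's end of year. In isolation, that rule is simply:

```python
import pandas as pd

# SIMD release years handled by combine.py
SIMD_YEARS = [2009, 2012, 2016]


def closest_simd_year(eoy):
    """Most recent SIMD release at or before eoy, else None."""
    candidates = [year for year in SIMD_YEARS if eoy.year >= year]
    return candidates[-1] if candidates else None
```

So a row with end of year 2013 picks the 2012 release, a 2016 row picks 2016, and pre-2009 rows get no SIMD value (NaN in the full function).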
training/src/reduction/post_proc_reduction.py ADDED
@@ -0,0 +1,37 @@
+ import json
+ import pandas as pd
+
+
+ single_inhaler = ['saba_inhaler_per_year', 'laba_inhaler_per_year',
+                   'lama_inhaler_per_year', 'sama_inhaler_per_year',
+                   'ics_inhaler_per_year', 'mcs_inhaler_per_year']
+ double_inhaler = ['laba_ics_inhaler_per_year', 'saba_sama_inhaler_per_year']
+ triple_inhaler = 'lama_laba_ics_inhaler_per_year'
+ adm_cols = ['copd_per_year', 'resp_per_year']
+
+
+ def main():
+
+     # Load in config items
+     with open('../../../config.json') as json_config_file:
+         config = json.load(json_config_file)
+     data_path = config['model_data_path']
+
+     # Read in original data before scaling
+     df = pd.read_pickle(data_path + 'merged_full.pkl')
+
+     # Create new reduced columns
+     df['single_inhaler'] = df[single_inhaler].sum(axis=1)
+     df['double_inhaler'] = df[double_inhaler].sum(axis=1)
+     df['triple_inhaler'] = df[triple_inhaler]
+     df['copd_resp_per_year'] = df[adm_cols].sum(axis=1)
+
+     # Drop original columns
+     cols2drop = single_inhaler + double_inhaler + [triple_inhaler] + adm_cols
+     df = df.drop(cols2drop, axis=1)
+
+     # Save data
+     df.to_pickle(data_path + 'merged.pkl')
+
+
+ if __name__ == '__main__':
+     main()
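The column-collapsing step above amounts to a row-wise sum over the sparse per-year counts followed by dropping the originals. On a toy frame mirroring the `single_inhaler` reduction:

```python
import pandas as pd

# Two sparse per-year counts collapsed into one combined feature
# (toy column values, mirroring the single_inhaler reduction)
df = pd.DataFrame({
    'saba_inhaler_per_year': [2, 0, 0],
    'laba_inhaler_per_year': [0, 1, 0],
})
single = ['saba_inhaler_per_year', 'laba_inhaler_per_year']
df['single_inhaler'] = df[single].sum(axis=1)
df = df.drop(single, axis=1)
```

Summing across the group keeps the total prescribing signal while removing columns that are zero for most patients.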