IamGrooooot committed · Commit 53a6def · 0 Parent(s)

Model E: Unsupervised PCA + clustering risk stratification

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. MODEL_CARD.md +199 -0
  2. README.md +87 -0
  3. config.json +8 -0
  4. documentation/README.md +151 -0
  5. pipeline.yml +29 -0
  6. requirements.txt +1 -0
  7. setup.cfg +7 -0
  8. training/README.md +6 -0
  9. training/src/README.md +8 -0
  10. training/src/modelling/__pycache__/run_model.cpython-313.pyc +0 -0
  11. training/src/modelling/additional_code_onevsone_onevsrest_approaches.py +346 -0
  12. training/src/modelling/dtc_params.json +3 -0
  13. training/src/modelling/event_calculations.py +183 -0
  14. training/src/modelling/hierarchical_params.json +4 -0
  15. training/src/modelling/kmeans_params.json +4 -0
  16. training/src/modelling/one_vs_rest_BLR.py +377 -0
  17. training/src/modelling/one_vs_rest_DTC.py +380 -0
  18. training/src/modelling/predict_clusters.py +70 -0
  19. training/src/modelling/run_model.py +355 -0
  20. training/src/modelling/validate.py +260 -0
  21. training/src/processing/README.md +24 -0
  22. training/src/processing/__init__.py +1 -0
  23. training/src/processing/mappings/Comorbidity feature review for models & clin summary update v2 May 2021.xlsx +0 -0
  24. training/src/processing/mappings/README.md +7 -0
  25. training/src/processing/mappings/diag_copd_resp_desc.json +5 -0
  26. training/src/processing/mappings/inhaler_mapping.json +55 -0
  27. training/src/processing/mappings/test_mapping.json +1 -0
  28. training/src/processing/misc/process_gples.py +73 -0
  29. training/src/processing/misc/process_validation_adm.py +28 -0
  30. training/src/processing/misc/process_validation_presc.py +20 -0
  31. training/src/processing/process_admissions.py +153 -0
  32. training/src/processing/process_comorbidities.py +161 -0
  33. training/src/processing/process_demographics.py +74 -0
  34. training/src/processing/process_labs.py +247 -0
  35. training/src/processing/process_prescribing.py +145 -0
  36. training/src/processing/utils/README.md +11 -0
  37. training/src/processing/utils/__init__.py +1 -0
  38. training/src/processing/utils/adm_common.py +77 -0
  39. training/src/processing/utils/adm_processing.py +146 -0
  40. training/src/processing/utils/adm_reduction.py +65 -0
  41. training/src/processing/utils/common.py +132 -0
  42. training/src/processing/utils/comorb_processing.py +20 -0
  43. training/src/processing/utils/labs_processing.py +16 -0
  44. training/src/processing/utils/presc_common.py +68 -0
  45. training/src/reduction/README.md +12 -0
  46. training/src/reduction/__init__.py +1 -0
  47. training/src/reduction/clean_and_scale_test.py +173 -0
  48. training/src/reduction/clean_and_scale_train.py +171 -0
  49. training/src/reduction/combine.py +217 -0
  50. training/src/reduction/post_proc_reduction.py +37 -0
MODEL_CARD.md ADDED
@@ -0,0 +1,199 @@
+ ---
+ language: en
+ license: apache-2.0
+ tags:
+ - healthcare
+ - ehr
+ - copd
+ - clinical-risk
+ - tabular
+ - scikit-learn
+ - clustering
+ - unsupervised
+ pipeline_tag: tabular-classification
+ library_name: sklearn
+ ---
+
+ # COPD Open Models — Model E (Unsupervised Risk Stratification)
+
+ ## Model Details
+
+ Model E groups COPD patients into risk clusters using **PCA dimensionality reduction** followed by **KMeans or hierarchical clustering**. Clusters are designed to support risk stratification — identifying whether patients are receiving the appropriate level of care for their apparent risk. Clusters update with new data to track how patient risk evolves over time.
+
+ ### Key Characteristics
+
+ - **Unsupervised** — no target labels required; clusters are derived from patient feature similarity alone.
+ - **Two-stage PCA** — Stage 1 selects features explaining ≥90% variance; Stage 2 reduces to 3 components explaining ≥80% variance.
+ - **Modular pipeline** — processing, reduction, and modelling stages are fully separated and independently reusable.
+ - Training code is fully decoupled from cloud infrastructure — runs locally with no Azure dependencies.
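The two-stage reduction can be sketched with scikit-learn's variance-threshold form of `PCA`. This is an illustrative approximation on random data, not the repository's actual reduction code: the 90% threshold and the 3-component target come from the bullets above, while the data and shapes here are made up.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 70))  # placeholder for the ~70 processed features

# Stage 1: retain enough components to explain >= 90% of the variance
stage1 = PCA(n_components=0.90).fit(X)
X_reduced = stage1.transform(X)

# Stage 2: reduce further to 3 components for clustering and 3D plotting
stage2 = PCA(n_components=3).fit(X_reduced)
X_3d = stage2.transform(X_reduced)

print(X_3d.shape)  # (500, 3)
```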
+
+ > **Note:** This repository contains no real patient-level data. All included data files are synthetic or example data for pipeline validation.
+
+ ### Model Type
+
+ Unsupervised clustering (PCA + KMeans / Agglomerative Hierarchical Clustering), validated via Decision Tree Classifier on cluster labels.
+
+ ### Release Notes
+
+ - **Phase 1 (current):** Models C, E, H published as the initial "COPD Open Models" collection.
+ - **Phase 2 (planned):** Additional models may follow after codebase sanitisation.
+
+ ---
+
+ ## Intended Use
+
+ This model and code are published as **reference implementations** for research, education, and benchmarking on COPD risk stratification tasks.
+
+ ### Intended Users
+
+ - ML practitioners exploring unsupervised healthcare ML pipelines
+ - Researchers comparing dimensionality reduction and clustering approaches for EHR data
+ - Developers building internal prototypes (non-clinical)
+
+ ### Out-of-Scope Uses
+
+ - **Not** for clinical decision-making, triage, diagnosis, or treatment planning.
+ - **Not** a substitute for clinical judgement or validated clinical tools.
+ - Do **not** deploy in healthcare settings without an appropriate regulatory, clinical safety, and information governance framework.
+
+ ### Regulatory Considerations (SaMD)
+
+ Regulatory status for software depends on the intended purpose expressed in documentation, labelling, and promotional materials. Downstream users integrating or deploying this model should determine whether their implementation qualifies as Software as a Medical Device (SaMD) and identify the legal "manufacturer" responsible for compliance and post-market obligations.
+
+ ---
+
+ ## Training Data
+
+ - **Source:** NHS EHR-derived datasets (training performed on controlled datasets; not distributed here).
+ - **Data available in this repo:** Synthetic/example datasets only.
+ - **Cohort:** COPD patients from hospital admissions, laboratory, pharmacy, and demographic records.
+ - **Data split:** 60% train / 15% validation / 20% test (random_state=42). RECEIVER and Scale-Up cohorts held out as external validation sets.
+
+ ### Features (~70 after processing)
+
+ | Category | Features |
+ |----------|----------|
+ | **Admissions** | adm_per_year, total_hosp_days, mean_los, copd_per_year, resp_per_year, copd_resp_per_year, days_since_copd_resp, days_since_adm, days_since_rescue, anxiety_depression_per_year |
+ | **Demographics** | age, sex_bin, ethnicity (7 categories), marital_status (one-hot), SIMD quintile/decile/vigintile |
+ | **Laboratory (2-year medians)** | 26 lab tests: albumin, ALT, AST, alkaline phosphatase, basophils, CRP, chloride, creatinine, eosinophils, eGFR, haematocrit, haemoglobin, lymphocytes, MCH, MCV, monocytes, neutrophils, platelets, potassium, RBC, sodium, total bilirubin, urea, WBC, neutrophil-to-lymphocyte ratio; plus labs_per_year |
+ | **Prescribing** | single/double/triple_inhaler_per_year, salbutamol_per_year, rescue_meds_per_year, anxiety_depression_presc_per_year, presc_per_year |
+ | **Comorbidities** | 30 binary flags (cardiovascular, respiratory, metabolic, oncology conditions), comorb_per_year |
+
+ ### Data Preprocessing
+
+ 1. **Imputation** — grouped median imputation by (age_bin × sex_bin). Age binned into 10 quantiles. Days-since features for patients with <5 years in cohort filled with dataset maximum.
+ 2. **Scaling** — MinMaxScaler to [0, 1], fit on training data only; test/validation/external sets transformed using saved scaler.
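Both preprocessing steps can be sketched on a toy frame. The column name `albumin_med_2yr` and the values are illustrative; only the (age_bin × sex_bin) grouping scheme and the fit-on-train-only scaling convention come from the steps above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for the processed feature table
df = pd.DataFrame({
    "age_bin": [1, 1, 2, 2, 2],
    "sex_bin": [0, 0, 1, 1, 1],
    "albumin_med_2yr": [38.0, np.nan, 41.0, 39.0, np.nan],
})

# 1. Grouped median imputation by (age_bin x sex_bin)
df["albumin_med_2yr"] = (
    df.groupby(["age_bin", "sex_bin"])["albumin_med_2yr"]
      .transform(lambda s: s.fillna(s.median()))
)

# 2. Fit MinMaxScaler on training data only; reuse the fitted scaler elsewhere
scaler = MinMaxScaler().fit(df[["albumin_med_2yr"]])
scaled = scaler.transform(df[["albumin_med_2yr"]])
print(scaled.min(), scaled.max())  # 0.0 1.0
```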
+
+ ---
+
+ ## Training Procedure
+
+ ### Training Framework
+
+ - pandas, numpy, scikit-learn, matplotlib, mlflow
+
+ ### Algorithms
+
+ | Component | Algorithm | Parameters |
+ |-----------|-----------|------------|
+ | **PCA Stage 1** | sklearn.decomposition.PCA | Selects features at ≥90% cumulative explained variance |
+ | **PCA Stage 2** | sklearn.decomposition.PCA | Reduces to 3 components at ≥80% cumulative explained variance |
+ | **Clustering (primary)** | sklearn.cluster.AgglomerativeClustering | n_clusters=3, linkage='ward' |
+ | **Clustering (alternative)** | sklearn.cluster.KMeans | n_clusters=3, random_state=10 |
+ | **Validation** | sklearn.tree.DecisionTreeClassifier | random_state=42; trained on cluster labels |
+
+ ### Cluster Count Selection
+
+ Davies-Bouldin Index and Silhouette Score are calculated for **k=2 through k=9**. Both metrics are logged to MLflow for inspection before selecting the final cluster count.
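The metric sweep can be sketched as below. Synthetic blobs stand in for the PCA-reduced training data, and `print` stands in for the MLflow logging; the k = 2..9 range and the two metrics come from the paragraph above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic stand-in for the PCA-reduced training data
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

scores = {}
for k in range(2, 10):  # k = 2 through 9
    labels = KMeans(n_clusters=k, random_state=10, n_init=10).fit_predict(X)
    scores[k] = (davies_bouldin_score(X, labels), silhouette_score(X, labels))

# Lower Davies-Bouldin / higher silhouette indicate better separation
best_db = min(scores, key=lambda k: scores[k][0])
best_sil = max(scores, key=lambda k: scores[k][1])
print(best_db, best_sil)
```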
+
+ ### Evaluation Design
+
+ - Clustering quality: **Davies-Bouldin Index** (lower is better), **Silhouette Score** (higher is better).
+ - Cluster validation: **Decision Tree accuracy** on held-out data (can the clustering be reliably reproduced?).
+ - Clinical validation: cluster-level outcome rates (admissions, mortality, time-to-event), demographic breakdowns, medication therapy distributions.
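The cluster-validation step (can a Decision Tree reproduce the cluster assignments on held-out data?) can be sketched on synthetic blobs; the DTC hyperparameters match the table above, everything else is illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, _ = make_blobs(n_samples=400, centers=3, random_state=42)

# The cluster labels become the "ground truth" the tree must reproduce
labels = KMeans(n_clusters=3, random_state=10, n_init=10).fit_predict(X)

X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=42)
dtc = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
acc = accuracy_score(y_te, dtc.predict(X_te))
print(round(acc, 3))
```

A high accuracy suggests the clusters occupy separable regions of feature space; a low one suggests the partition is not reproducible from the features alone.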
+
+ ---
+
+ ## Evaluation Results
+
+ > Replace this section with measured results from your training run.
+
+ | Metric | Value | Notes |
+ |--------|-------|-------|
+ | Davies-Bouldin Index | TBD | Lower is better |
+ | Silhouette Score | TBD | Range [-1, 1], higher is better |
+ | DTC Accuracy | TBD | Decision Tree on validation set |
+ | Cluster sizes | TBD | Patient counts per cluster |
+
+ ### Caveats on Metrics
+
+ - Cluster quality metrics assess geometric separation, not clinical meaningfulness — clinical validation requires outcome analysis.
+ - Results depend on the feature set, imputation strategy, and patient population.
+
+ ---
+
+ ## Bias, Risks, and Limitations
+
+ - **Dataset shift:** EHR coding practices, care pathways, and population characteristics vary across sites and time periods.
+ - **Feature availability:** Lab test availability varies by patient; imputation strategy affects cluster assignment.
+ - **Fairness:** Cluster composition may correlate with age, sex, or deprivation — interpret with care.
+ - **Misuse risk:** Cluster labels are not validated risk scores. Using them to drive clinical action without clinical safety processes can cause harm.
+ - **Interpretability:** PCA components are linear combinations of features — clinical interpretation requires examining loadings.
+
+ ---
+
+ ## How to Use
+
+ ### Pipeline Execution Order
+
+ ```bash
+ # 1. Install dependencies
+ pip install pandas numpy scikit-learn matplotlib mlflow joblib tableone
+
+ # 2. Process raw data (run each independently)
+ python training/src/processing/process_demographics.py
+ python training/src/processing/process_admissions.py
+ python training/src/processing/process_comorbidities.py
+ python training/src/processing/process_labs.py
+ python training/src/processing/process_prescribing.py
+
+ # 3. Combine and reduce features (run in order)
+ python training/src/reduction/combine.py
+ python training/src/reduction/post_proc_reduction.py
+ python training/src/reduction/remove_ids.py
+ python training/src/reduction/clean_and_scale_train.py
+ python training/src/reduction/clean_and_scale_test.py
+
+ # 4. Run clustering model
+ python training/src/modelling/run_model.py
+
+ # 5. Predict clusters for new patients
+ python training/src/modelling/predict_clusters.py
+ ```
+
+ ### Adapting to Your Data
+
+ Replace input file paths in `config.json` with your own EHR data extracts. The pipeline expects CSVs with patient ID, admission records, lab results, pharmacy records, and demographics.
+
+ ---
+
+ ## Environmental Impact
+
+ Training computational requirements are minimal — PCA and clustering on tabular data complete in seconds on a standard laptop.
+
+ ---
+
+ ## Citation
+
+ If you use this model or code, please cite:
+
+ - This repository: *(add citation format / Zenodo DOI if minted)*
+ - Associated publications: *(forthcoming)*
+
+ ## Authors and Contributors
+
+ - **Storm ID** (maintainers)
+
+ ## License
+
+ This model and code are released under the **Apache 2.0** license.
README.md ADDED
@@ -0,0 +1,87 @@
+ # Model E
+
+ Model E is an unsupervised learning model, built with the aim of grouping patients within the COPD cohort into _k_ clusters as a means of risk stratification. These clusters are updated with new incoming data, with the cluster for each patient monitored in order to track how their risk changes over time. Results will be used to determine if patients are receiving the correct level of care for their apparent risk.
+
+ The model uses EHR data (information on patients' admission history, lab test data, prescribing data, comorbidities and demographic data) which is first pre-processed, then reduced using Principal Components Analysis (PCA). The 3D components produced by PCA are then passed to a variety of clustering algorithms, with results plotted for inspection.
+
+ ## Structure
+
+ ```
+ C:.
+ | pipeline.yml
+ | README.md
+ | requirements.txt
+ | setup.cfg
+ |
+ +---documentation
+ | README.md
+ |
+ +---training
+ | | README.md
+ | |
+ | +---src
+ | | | README.md
+ | | |
+ | | +---data
+ | | |
+ | | +---modelling
+ | | | | Cluster Data.ipynb
+ | | | |
+ | | +---processing
+ | | | | process_admissions.py
+ | | | | process_comorbidities.py
+ | | | | process_demographics.py
+ | | | | process_gples.py
+ | | | | process_labs.py
+ | | | | process_prescribing.py
+ | | | | README.md
+ | | | | __init__.py
+ | | | |
+ | | | +---mappings
+ | | | | | Comorbidity feature review for models & clin summary update v2 May 2021.xlsx
+ | | | | | diag_copd_resp_desc.json
+ | | | | | inhaler_mapping.json
+ | | | | | README.md
+ | | | | | test_mapping.json
+ | | | | |
+ | | | |
+ | | | \---utils
+ | | | adm_common.py
+ | | | adm_processing.py
+ | | | adm_reduction.py
+ | | | common.py
+ | | | comorb_processing.py
+ | | | labs_processing.py
+ | | | README.md
+ | | | __init__.py
+ | | |
+ | | \---reduction
+ | | | clean_and_scale_test.py
+ | | | clean_and_scale_train.py
+ | | | combine.py
+ | | | README.md
+ | | | remove_ids.py
+ | | | __init__.py
+ | | |
+ | | \---utils
+ | | reduction.py
+ | | __init__.py
+ | |
+ | \---tests
+ | README.md
+ |
+ \---validation
+ +---parameter_calculation
+ | CAT_MRC_score_metrics_calculation.py
+ | Fitbit_groups_calculation.py
+ | GOLD_grade_GOLD_group_calculation.py
+ | NIV_parameters_calculation.py
+ | PRO_LOGIC_exacerbation_calculations.py
+ | README.md
+ | Time_to_death_calculation.py
+ | Time_to_first_admission_calculations.py
+ | Time_to_first_event_calculations.py
+ |
+ \---risk_score_calculation
+ combined_risk_score_RC_SU1.py
+ ```
config.json ADDED
@@ -0,0 +1,8 @@
+ {
+ "date": "2019-12-31",
+ "extract_data_path": "<YOUR_DATA_PATH>/EXAMPLE_STUDY_DATA/",
+ "rec_data_path": "<YOUR_DATA_PATH>/EXAMPLE_STUDY_DATA/",
+ "sup_data_path": "<YOUR_DATA_PATH>/SU_IDs/",
+ "model_data_path": "<YOUR_DATA_PATH>/Model_E_Extracts/",
+ "model_type": "hierarchical"
+ }
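A sketch of reading this configuration from Python. The keys are taken from the file above; the loading code itself is illustrative (the repository's scripts may load the file differently), and the JSON is embedded as a string here so the snippet is self-contained.

```python
import json

# The configuration shown above, embedded for illustration
config_text = """
{
  "date": "2019-12-31",
  "extract_data_path": "<YOUR_DATA_PATH>/EXAMPLE_STUDY_DATA/",
  "rec_data_path": "<YOUR_DATA_PATH>/EXAMPLE_STUDY_DATA/",
  "sup_data_path": "<YOUR_DATA_PATH>/SU_IDs/",
  "model_data_path": "<YOUR_DATA_PATH>/Model_E_Extracts/",
  "model_type": "hierarchical"
}
"""
cfg = json.loads(config_text)
print(cfg["model_type"], cfg["date"])  # hierarchical 2019-12-31
```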
documentation/README.md ADDED
@@ -0,0 +1,151 @@
+ # Model E
+
+ ## Abstract
+
+ Model E is an unsupervised learning model, built with the aim of grouping patients within the COPD cohort into k clusters as a means of risk stratification. Clusters are updated with new incoming data, with the cluster for each patient monitored in order to track how their risk changes over time. Results will be used to determine if patients are receiving the correct level of care for their apparent risk.
+
+ ## Aims
+
+ 1. To use an unsupervised learning method to cluster patients within the COPD cohort into k clusters based on a variety of features.
+
+ 2. Cluster new data and update clusters accordingly. Monitor the identified cluster for each patient and alert if they transition between clusters.
+
+ 3. Determine whether patients are on the incorrect type of care based on their clusters.
+
+ ## Data - EXAMPLE_STUDY_DATA
+
+ The tables below detail the raw EHR features processed for model training, along with the resulting processed feature set.
+
+ ### Raw features
+
+ #### Admissions/Comorbidities - SMR01_Cohort3R.csv
+
+ Feature name | Description |
+ -------------|-------------|
+ SafeHavenID | Patient ID |
+ ETHGRP | Ethnicity |
+ ADMDATE | Date of admission |
+ DISDATE | Date of discharge |
+ DIAGxDesc (x=1-6) | Diagnosis columns 1-6 |
+ STAY | Length of stay (days) |
+
+ #### Demographics - Demographics_Cohort3R.csv
+
+ Feature name | Description |
+ -------------|-------------|
+ SafeHavenID | Patient ID |
+ OBF_DOB | Date of birth |
+ SEX | Sex |
+ Marital_Status | Marital status |
+ SIMD_2009/12/16_QUINTILE | SIMD ranks to quintiles for 2009, 2012 and 2016 data zones |
+ SIMD_2009/12/16_DECILE | SIMD ranks to deciles for 2009, 2012 and 2016 data zones |
+ SIMD_2009/12/16_VIGINTILE | SIMD ranks to vigintiles for 2009, 2012 and 2016 data zones |
+
+ #### Prescribing - Pharmacy_Cohort3R.csv
+
+ Feature name | Description |
+ -------------|-------------|
+ SafeHavenID | Patient ID |
+ PRESC_DATE | Date of prescription |
+ PI_BNF_Item_Code | Code describing specific medicine as found in the British National Formulary (BNF) reference book |
+ PI_Approved_Name | Name of medicine |
+
+ #### Labs - SCI_Store_Cohort3R.csv
+
+ Feature name | Description |
+ -------------|-------------|
+ SafeHavenID | Patient ID |
+ SAMPLEDATE | Date lab test was taken |
+ CLINICALCODEDESCRIPTION | Name of test |
+ QUANTITYVALUE | Test value |
+ RANGEHIGHVALUE | Test range highest value |
+ RANGELOWVALUE | Test range lowest value |
+
+ #### Mappings
+
+ - `inhaler_mapping.json`: Inhaler mappings for any Chapter 3 BNF code inhaler prescriptions present in the SafeHaven prescribing dataset. Information on NHS inhaler types, found [here](https://www.coch.nhs.uk/media/172781/3-respiratory-system.pdf), was used to create the mapping.
+
+ - `test_mapping.json`: A mapping created for any of the top 20 most frequently occurring lab tests, plus any lab tests found relevant for indicating COPD severity in Model A. This mapping creates a common name for a specific test and lists any related names the test may appear under within the SCI Store dataset.
+
+ - `Comorbidity feature review for models & clin summary update v2 May 2021.xlsx`: A mapping between diagnosis names found in SMR01 and their associated comorbidities (taken from Model A).
+
+ - `diag_copd_resp_desc.json`: DIAGDesc for COPD and respiratory admissions.
+
+ ### Processed features
+
+ #### Demographics features
+
+ The below features are saved to be used for any necessary validation, but are not included for model training.
+
+ Feature name | Description |
+ -------------|-------------|
+ eth_grp | Ethnicity one-hot-encoded into 1 of 7 categories |
+ entry_dataset | Dataset patient first appeared in within the health board region |
+ first_entry | Date of first appearance in the health board region |
+ obf_dob | Patient DOB at respective date |
+ sex_bin | Sex in binary format: F=1, M=0 |
+ marital_status_m | Married |
+ marital_status_n | Not Known |
+ marital_status_o | Other |
+ marital_status_s | Single |
+ age_bin | Age bucket based on training data (1 of 10) |
+ days_since_copd_resp_med | Median days since COPD or respiratory admission |
+ days_since_adm_med | Median days since any admission |
+ days_since_rescue_med | Median days since rescue event |
+ simd_quintile | SIMD ranks to quintile for closest year data zone |
+ simd_decile | SIMD ranks to decile for closest year data zone |
+ simd_vigintile | SIMD ranks to vigintile for closest year data zone |
+
+ #### Final feature set
+
+ The final feature set contains 50 features, as detailed below.
+
+ Feature name | Description |
+ -------------|-------------|
+ SafeHavenID | Patient ID |
+ year | Data year |
+ total_hosp_days | Total hospital days in current year |
+ mean_los | Average length of stay per year |
+ ggc_years | Total years appearing in the health board region |
+ age | Patient age |
+ EVENT_per_year | Total events per year where EVENT=adm/comorb/salbutamol/rescue_meds/presc/labs/copd_resp |
+ EVENT_to_date | Total events to date where EVENT=adm/copd/resp/presc/rescue/labs |
+ days_since_EVENT | Days since event where EVENT=adm/copd_resp/rescue |
+ TEST_med_2yr | Median test value from previous 2 years, where TEST=alt/ast/albumin/alkaline_phosphatase/basophils/c_reactive_protein/chloride/creatinine/eosinophils/estimated_gfr/haematocrit/haemoglobin/lymphocytes/mch/mean_cell_volume/monocytes/neutrophils/platelets/potassium/red_blood_count/sodium/total_bilirubin/urea/white_blood_count/neut_lymph |
+ n_inhaler | Yearly inhaler prescription count where n=single/double/triple |
+
+ These features are further reduced using Principal Components Analysis (PCA) to produce a reduced feature set containing:
+
+ Feature name |
+ -------------|
+ age |
+ ggc_years |
+ comorb/presc/labs_per_year |
+ presc/labs/rescue_to_date |
+ days_since_adm/copd_resp/rescue |
+ albumin/estimated_gfr/haemoglobin/labs/red_blood_count_med_2yr |
+
+ ### Method
+
+ Raw datasets are loaded in and processed into a format suitable for machine learning to be applied. Features are then reduced to 1 row per SafeHavenID per year by selecting the:
+
+ - Median value for lab tests taken in the previous 2 years
+ - Sum of any binary/count features
+ - Last value of any to-date features
+
+ Once reduced, the datasets are then joined on SafeHavenID and year.
+
+ At this stage SafeHavenIDs present in both the Receiver and Scale-Up cohorts are removed. The remaining data is then split into training and testing sets in a subject-wise fashion, with 20% of the remaining patients in the testing set.
+
+ Each of these sets of data (training, testing, receiver and scale-up) is min-max scaled so that all features lie between 0 and 1. Note that all validation/testing sets (testing, receiver and scale-up) use the pre-trained scaler used to process the training set.
+
+ Data is then passed through a pipeline where:
+
+ - PCA is applied to reduce the processed dataset with 50+ features down to 15 features, which are then further reduced to 6 principal components.
+ - The Davies-Bouldin Score is used to select the cluster number for the training set.
+ - Training data is clustered using the K-Means algorithm, with results plotted using matplotlib.
+ - The test, receiver and scale-up datasets are reduced using the PCA transformation fitted on the training set.
+ - Clusters are calculated for all validation data.
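The pipeline steps above can be sketched end-to-end. Synthetic blobs and a direct 6-component PCA stand in for the real scaled feature matrices and the staged reduction; the fit-on-train, transform-everything-else convention is the point being illustrated.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic stand-ins for the scaled train and held-out feature matrices
X_train, _ = make_blobs(n_samples=300, n_features=20, centers=3, random_state=42)
X_test, _ = make_blobs(n_samples=100, n_features=20, centers=3, random_state=42)

# Reduce the training features with PCA ...
pca = PCA(n_components=6).fit(X_train)
Z_train = pca.transform(X_train)

# ... cluster the reduced training data with K-Means ...
km = KMeans(n_clusters=3, random_state=10, n_init=10).fit(Z_train)

# ... then apply the *fitted* PCA and K-Means to the held-out sets
Z_test = pca.transform(X_test)
test_clusters = km.predict(Z_test)
print(test_clusters.shape)  # (100,)
```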
pipeline.yml ADDED
@@ -0,0 +1,29 @@
+ trigger:
+   branches:
+     include:
+     - main
+     - release/*
+
+ jobs:
+ - job: 'build'
+   pool:
+     vmImage: 'ubuntu-latest'
+
+   steps:
+   - task: UsePythonVersion@0
+     inputs:
+       versionSpec: '3.8'
+       architecture: 'x64'
+     displayName: 'Specify Python version'
+
+   - script: |
+       python -m pip install --upgrade pip
+     displayName: 'Install pip'
+
+   - script: |
+       pip install -r requirements.txt
+     displayName: 'Install CI dependencies'
+
+   - script: |
+       flake8
+     displayName: 'Run linting'
requirements.txt ADDED
@@ -0,0 +1 @@
+ flake8
setup.cfg ADDED
@@ -0,0 +1,7 @@
+ [tool:pytest]
+ filterwarnings =
+     ignore::DeprecationWarning
+ [flake8]
+ ignore = E501,W293,W292
+ exclude = .git,__pycache__,docs/source/conf.py,old,build,dist
+ max-complexity = 10
training/README.md ADDED
@@ -0,0 +1,6 @@
+ # Training
+
+ Training scripts are split into `src` and `tests` directories, where `src` is further segmented into:
+ - `processing`: containing scripts to process raw EHR data
+ - `reduction`: containing scripts to combine, reduce, fill and scale processed EHR data for modelling
+ - `modelling`: containing notebooks and scripts required for model training
training/src/README.md ADDED
@@ -0,0 +1,8 @@
+ # Processing
+
+ This folder contains scripts to process raw EHR data for training.
+
+ Note that scripts must be run in the below order:
+ 1. `processing`
+ 2. `reduction`
+ 3. `modelling`
training/src/modelling/__pycache__/run_model.cpython-313.pyc ADDED
Binary file (16.5 kB)
training/src/modelling/additional_code_onevsone_onevsrest_approaches.py ADDED
@@ -0,0 +1,346 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import pandas as pd
2
+ import numpy as np
3
+ from sklearn.datasets import load_iris
4
+ from sklearn.tree import DecisionTreeClassifier
5
+ from sklearn.metrics import confusion_matrix
6
+ from sklearn.model_selection import train_test_split
7
+ import matplotlib.pyplot as plt
8
+ from sklearn.multiclass import OneVsRestClassifier
9
+ from sklearn.tree import plot_tree
10
+ from tabulate import tabulate
11
+ from sklearn.linear_model import LogisticRegression
12
+ import mlflow
13
+ from sklearn.metrics import accuracy_score
14
+ from sklearn.metrics import ConfusionMatrixDisplay
15
+ from sklearn.tree.export import export_text
16
+ from sklearn import tree
17
+ from itertools import combinations
18
+
19
+
20
+ # load in the data
21
+ data = load_iris()
22
+ iris = data
23
+ # convert to a dataframe
24
+ df = pd.DataFrame(data.data, columns=data.feature_names)
25
+ # create the species column
26
+ df['Species'] = data.target
27
+ # replace this with the actual names
28
+ target = np.unique(data.target)
29
+ target_names = np.unique(data.target_names)
30
+ targets = dict(zip(target, target_names))
31
+ df['Species'] = df['Species'].replace(targets)
32
+
33
+ # extract features and target variables
34
+ x = df.drop(columns="Species")
35
+ y = df["Species"]
36
+ # save the feature name and target variables
37
+ feature_names = x.columns
38
+ labels = y.unique()
39
+
40
+ # split the dataset
41
+ X_train, test_x, y_train, test_lab = train_test_split(x, y, test_size=0.4, random_state=42)
42
+
43
+
44
+ # The below is for classic logistic regression binary classifier one vs rest,
45
+ # explainability is based on the coefficents in logistic regression
46
+
47
+ # Create a One-vs-Rest logistic regression classifier
48
+ clf = LogisticRegression(random_state=0, multi_class='ovr')
49
+
50
+ # Train the classifier on the Iris dataset
51
+ clf.fit(X_train, y_train)
52
+
53
+ # Get the number of classes and features
54
+ n_classes = len(set(iris.target))
55
+ n_features = iris.data.shape[1]
56
+
57
+ # Create a figure with one subplot for each class
58
+ fig, axs = plt.subplots(n_classes, 1, figsize=(10, 5 * n_classes))
59
+
60
+ # Loop over each class
61
+ for i in range(n_classes):
62
+ # Get the feature importances for the current class
63
+ coef = clf.coef_[i]
64
+ importance = coef
65
+
66
+ # Sort the feature importances in descending order
67
+ indices = np.argsort(importance)[::-1]
68
+
69
+ # Create a bar plot of the feature importances
70
+ axs[i].bar(range(n_features), importance[indices])
71
+ axs[i].set_xticks(range(n_features))
72
+ axs[i].set_xticklabels(np.array(iris.feature_names)[indices], rotation=90)
73
+ axs[i].set_xlabel('Features')
74
+ axs[i].set_ylabel('Importance')
75
+ axs[i].set_title('Feature Importance for Class {}'.format(iris.target_names[i]))
76
+
77
+ # Adjust the spacing between subplots
78
+ fig.tight_layout()
79
+
80
+ # Show the plot
81
+ plt.show()
82
+
83
+
84
+ # Make predictions on the test data
85
+ val_pred = clf.predict(test_x)
86
+ accuracy = accuracy_score(test_lab, val_pred)
87
+ mlflow.log_metric('dtc accuracy', accuracy)
88
+
89
+ cm = confusion_matrix(test_lab, val_pred, labels=clf.classes_)
90
+ disp = ConfusionMatrixDisplay(
91
+ confusion_matrix=cm, display_labels=clf.classes_)
92
+ disp.plot()
93
+ plt.tight_layout()
94
+ mlflow.log_figure(disp.figure_, 'fig/' + 'confusion_matrix' + '.png')
95
+
96
+
97
+ # The below is for one vs rest desicion trees with explainability importance values are
98
+ # calculated based on the reduction of impurity measured by the Gini index.
99
+ # Create a One-vs-Rest Decision Tree classifier
100
+ clf_pre = DecisionTreeClassifier(random_state=0)
101
+ clf = OneVsRestClassifier(clf_pre)
102
+
103
+ # Train the classifier on the Iris dataset
104
+ clf.fit(X_train, y_train)
105
+
106
+ # Get the number of classes and features
107
+ n_classes = len(set(iris.target))
108
+ n_features = iris.data.shape[1]
109
+
110
+ # Create a figure with one subplot for each class
111
+ fig, axs = plt.subplots(n_classes, 1, figsize=(10, 5 * n_classes))
112
+
113
+ # Loop over each class
114
+ for i in range(n_classes):
115
+ # Get the feature importances for the current class
116
+ importance = clf.estimators_[i].feature_importances_
117
+
118
+ # Sort the feature importances in descending order
119
+ indices = np.argsort(importance)[::-1]
120
+
121
+ # Create a bar plot of the feature importances
122
+ axs[i].bar(range(n_features), importance[indices])
123
+ axs[i].set_xticks(range(n_features))
124
+ axs[i].set_xticklabels(np.array(iris.feature_names)[indices], rotation=90)
125
+ axs[i].set_xlabel('Features')
126
+ axs[i].set_ylabel('Importance')
127
+ axs[i].set_title('Feature Importance for Class {}'.format(iris.target_names[i]))
128
+
129
+ # Adjust the spacing between subplots
130
+ fig.tight_layout()
131
+
132
+ # Show the plot
133
+ plt.show()
134
+
135
+
136
+ y_pred_DTC = clf.predict(test_x)
137
+ accuracy = accuracy_score(test_lab, y_pred_DTC)
138
+ mlflow.log_metric('dtc accuracy', accuracy)
139
+
140
+ cm = confusion_matrix(test_lab, y_pred_DTC, labels=clf.classes_)
141
+ disp = ConfusionMatrixDisplay(
142
+ confusion_matrix=cm, display_labels=clf.classes_)
143
+ disp.plot()
144
+ plt.tight_layout()
145
+ mlflow.log_figure(disp.figure_, 'fig/' + 'confusion_matrix' + '.png')
146
+
147
+
148
+ # Show decision tree for each class: two methods
149
+
150
+ # Get the feature names
151
+ feature_names = iris.feature_names
152
+
153
+ # Loop over each decision tree classifier in the one-vs-rest classifier
154
+ for i, estimator in enumerate(clf.estimators_):
155
+ # Export the decision rules for the current tree
156
+ tree_rules = export_text(estimator, feature_names=feature_names)
157
+
158
+ # Print the decision rules for the current tree
159
+ print(f"Decision rules for tree for cluster {i}:")
160
+ print(tree_rules)
161
+
162
+ # assume clf is your one vs rest classifier
163
+ for i, estimator in enumerate(clf.estimators_):
164
+ fig, ax = plt.subplots(figsize=(12, 8))
165
+ tree.plot_tree(estimator,
166
+ feature_names=feature_names,
167
+ class_names=labels,
168
+ rounded=True,
169
+ filled=True,
170
+ fontsize=14,
171
+ ax=ax)
172
+ ax.set_title(f'Tree {i+1}')
173
+ plt.show()
174
+
175
+
176
+ # One vs one approach
177
+
178
+ # BLR
179
+ # Create a multinomial logistic regression classifier (scikit-learn's LogisticRegression has no one-vs-one mode)
180
+ clf = LogisticRegression(random_state=0, multi_class='multinomial', solver='lbfgs')
181
+
182
+ # Train the classifier on the Iris dataset
183
+ clf.fit(X_train, y_train)
184
+
185
+ # Get the number of classes and features
186
+ n_classes = len(set(iris.target))
187
+ n_features = iris.data.shape[1]
188
+
189
+ # Create a figure with one subplot for each class combination
190
+ fig, axs = plt.subplots(n_classes * (n_classes - 1) // 2, 1, figsize=(10, 5 * n_classes * (n_classes - 1) // 2))
191
+
192
+ # Loop over each class combination
193
+ index = 0
194
+ for i in range(n_classes):
195
+ for j in range(i + 1, n_classes):
196
+ # Get the coefficients for the current class combination (multinomial coef_ has one row per class, which matches the pair count only when n_classes == 3)
197
+ coef = clf.coef_[index]
198
+ importance = coef
199
+
200
+ # Sort the feature importances in descending order
201
+ indices = np.argsort(importance)[::-1]
202
+
203
+ # Create a bar plot of the feature importances
204
+ axs[index].bar(range(n_features), importance[indices])
205
+ axs[index].set_xticks(range(n_features))
206
+ axs[index].set_xticklabels(np.array(iris.feature_names)[indices], rotation=90)
207
+ axs[index].set_xlabel('Features')
208
+ axs[index].set_ylabel('Importance')
209
+ axs[index].set_title('Feature Importance for Class Combination {} vs {}'.format(iris.target_names[i], iris.target_names[j]))
210
+ index += 1
211
+
212
+ # Adjust the spacing between subplots
213
+ fig.tight_layout()
214
+
215
+ # Show the plot
216
+ plt.show()
217
+
218
+
219
+ # Make predictions on the test data
220
+ y_pred_ovo = clf.predict(test_x)
221
+ accuracy = accuracy_score(test_lab, y_pred_ovo)
222
+ mlflow.log_metric('blr accuracy', accuracy)
223
+
224
+ # Get confusion matrix
225
+ cm = confusion_matrix(test_lab, y_pred_ovo, labels=clf.classes_)
226
+ disp = ConfusionMatrixDisplay(
227
+ confusion_matrix=cm, display_labels=clf.classes_)
228
+ disp.plot()
229
+ plt.tight_layout()
230
+ mlflow.log_figure(disp.figure_, 'fig/' + 'confusion_matrix' + '.png')
231
+
232
+
233
+ # Decision tree classifier
234
+
235
+ # assume clf is your one vs one classifier
236
+ for i, (c1, c2) in enumerate(combinations(clf.classes_, 2)):
237
+ # restrict to the samples belonging to the current pair of classes
+ pair_mask = (y_train == c1) | (y_train == c2)
+
+ # train a decision tree on the current pair of classes only (true one-vs-one)
+ estimator = DecisionTreeClassifier()
+ estimator.fit(X_train[pair_mask], y_train[pair_mask])
243
+
244
+ # get feature importances
245
+ importances = estimator.feature_importances_
246
+
247
+ # create a bar plot showing feature importances for the current tree
248
+ fig, ax = plt.subplots(figsize=(8, 6))
249
+ ax.bar(np.arange(len(feature_names)), importances)
250
+ ax.set_xticks(np.arange(len(feature_names)))
251
+ ax.set_xticklabels(feature_names, rotation=45, ha='right')
252
+ ax.set_title(f'Tree {i+1}: {c1} vs {c2} Feature Importances')
253
+ ax.set_ylabel('Importance')
254
+ plt.tight_layout()
255
+ plt.show()
256
+
257
+ # initialize a list to store feature importances for each tree
258
+ importances_all = []
259
+
260
+ # assume clf is your one vs one classifier
261
+ for i, (c1, c2) in enumerate(combinations(clf.classes_, 2)):
262
+ # restrict to the samples belonging to the current pair of classes
+ pair_mask = (y_train == c1) | (y_train == c2)
+
+ # train a decision tree on the current pair of classes only (true one-vs-one)
+ estimator = DecisionTreeClassifier()
+ estimator.fit(X_train[pair_mask], y_train[pair_mask])
268
+
269
+ # get feature importances and store them in the list
270
+ importances = estimator.feature_importances_
271
+ importances_all.append(importances)
272
+
273
+ # plot the decision tree with feature importances
274
+ fig, ax = plt.subplots(figsize=(12, 8))
275
+ tree.plot_tree(estimator,
276
+ feature_names=feature_names,
277
+ class_names=[str(c1), str(c2)],
278
+ rounded=True,
279
+ filled=True,
280
+ fontsize=14,
281
+ ax=ax)
282
+
283
+ # add feature importances to title
284
+ title = f'Tree {i+1}: {c1} vs {c2}\n'
285
+ title += 'Feature importances:\n'
286
+ for feature, importance in zip(feature_names, importances):
287
+ title += f'{feature}: {importance:.3f}\n'
288
+ ax.set_title(title)
289
+
290
+
291
+ # Get confusion matrix
292
+ cm = confusion_matrix(test_lab, y_pred_ovo, labels=clf.classes_)
293
+ disp = ConfusionMatrixDisplay(
294
+ confusion_matrix=cm, display_labels=clf.classes_)
295
+ disp.plot()
296
+ plt.tight_layout()
297
+ mlflow.log_figure(disp.figure_, 'fig/' + 'confusion_matrix' + '.png')
298
+
299
+
300
+ # Example of code to show explainability (one vs rest for a specific instance)
301
+
302
+ # Split the data into training and testing sets
303
+ X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
304
+
305
+ # Train a binary classifier for each class
306
+ binary_classifiers = {}
307
+ for i in range(len(iris.target_names)):
308
+ binary_y_train = np.where(y_train == i, 1, 0)
309
+ model = DecisionTreeClassifier(random_state=42)
310
+ model.fit(X_train, binary_y_train)
311
+ binary_classifiers[i] = model
312
+
313
+ # Choose a specific instance to explain (here, index 7 of the test set)
314
+ instance = X_test[7]
315
+
316
+ # Get the predicted probability scores for each class for the instance
317
+ probs = []
318
+ for i in range(len(iris.target_names)):
319
+ binary_classifier = binary_classifiers[i]
320
+ prob = binary_classifier.predict_proba(instance.reshape(1, -1))[0, 1]
321
+ probs.append(prob)
322
+
323
+ # Get the index of the class with the highest probability score
324
+ predicted_class = np.argmax(probs)
325
+
326
+ # Extract the binary classifier with the highest probability score
327
+ binary_classifier = binary_classifiers[predicted_class]
328
+
329
+ # Plot the decision tree for the binary classifier with the highest probability score
330
+ fig, ax = plt.subplots(figsize=(12, 12))
331
+ plot_tree(binary_classifier, filled=True, rounded=True, ax=ax, feature_names=iris.feature_names, class_names=['not ' + iris.target_names[predicted_class], iris.target_names[predicted_class]])
332
+ plt.show()
333
+
334
+ # Print the predicted class and probability for the instance
335
+ predicted_prob = probs[predicted_class]
336
+ print('Predicted Class:', predicted_class)
337
+ print('Predicted Probability:', predicted_prob)
338
+
339
+
340
+ # Create a table with the ID, characteristics, true class label, and predicted class label for each sample in the test data
341
+ table_test = np.column_stack((np.arange(len(y_test)) + 1, X_test, y_test, y_pred_ovo, y_pred_DTC))
342
+ header_test = np.concatenate((['ID'], iris.feature_names, ['True Class', 'Predicted Class_BLR', 'Predicted Class_DTC']))
343
+ table_test = np.vstack((header_test, table_test))
344
+
345
+ # Print the table for the test data
346
+ print(tabulate(table_test))
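The one-vs-rest explainability idea used above (one binary tree per class, with Gini impurity-reduction importances read off each fitted tree) can be sketched end-to-end. This is a minimal illustration on synthetic data, not the study data; the dataset shape and names are invented:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for cluster-labelled data: 3 classes, 4 features
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)

# One binary Gini-based tree per class; each underlying estimator
# exposes its own impurity-reduction feature importances
clf = OneVsRestClassifier(DecisionTreeClassifier(random_state=0)).fit(X, y)
importances = [est.feature_importances_ for est in clf.estimators_]

assert len(importances) == 3                        # one vector per class
assert all(np.isclose(v.sum(), 1.0) for v in importances)
```

Each per-class vector can then be bar-plotted exactly as in the loops above.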
training/src/modelling/dtc_params.json ADDED
@@ -0,0 +1,3 @@
1
+ {
2
+ "random_state": 42
3
+ }
training/src/modelling/event_calculations.py ADDED
@@ -0,0 +1,183 @@
1
+ """
2
+ Find information on COPD, respiratory, rescue med and death event tracking
3
+ for patients within a timeframe
4
+ """
5
+ import json
6
+ import pandas as pd
7
+ import numpy as np
8
+
9
+
10
+ merged_cols = ['adm_per_year', 'copd_resp_per_year',
11
+ 'anxiety_depression_per_year',
12
+ 'rescue_meds_per_year', 'anxiety_depression_presc_per_year']
13
+ base_cols = ['admission_any', 'admission_copd_resp',
14
+ 'admission_anxiety_depression',
15
+ 'presc_rescue_med', 'presc_anxiety_depression']
16
+ n_cols = ["n_" + col for col in base_cols]
17
+ adm_cols = ['SafeHavenID', 'ADMDATE', 'admission_any', 'admission_copd_resp']
18
+ presc_cols = ['SafeHavenID', 'PRESC_DATE', 'rescue_meds']
19
+
20
+
21
+ def read_deaths(extract_data_path):
22
+ """
23
+ Read in deaths dataset
24
+ --------
25
+ :param extract_data_path: path to data extracts
26
+ :return: dataframe
27
+ """
28
+ filename = extract_data_path + 'Deaths_Cohort3R.csv'
29
+ cols = ['SafeHavenID', 'DOD']
30
+ df = pd.read_csv(filename, usecols=cols).drop_duplicates()
31
+ df['DOD'] = pd.to_datetime(df.DOD)
32
+
33
+ return df
34
+
35
+
36
+ def filter_data(df, date_col, eoy_date, start_date, end_date, typ):
37
+ """
38
+ Filter data to only include events occurring within given date range
39
+ --------
40
+ :param df: dataframe
41
+ :param date_col: str name of date column
42
+ :param eoy_date: end of year date
43
+ :param start_date: validation data start date
44
+ :param end_date: validation data end date
45
+ :param typ: type of data: 'adm', 'presc', 'merged', 'deaths'
46
+ :return: filtered dataframe
47
+ """
48
+ if typ == 'merged':
49
+ df = df[df.eoy == eoy_date]
50
+ else:
51
+ df = df[(df[date_col] >= start_date) & (df[date_col] < end_date)]
52
+
53
+ return df
54
+
55
+
56
+ def calc_time_to_event(df, date_col, start_date, new_col):
57
+ """
58
+ Calculate time to next event
59
+ --------
60
+ :param df: dataframe
61
+ :param date_col: str name of date column
62
+ :param start_date: validation data start date
63
+ :param new_col: new column name
64
+ :return: dataframe with SafeHavenID days to event
65
+ """
66
+ df_next = df.groupby('SafeHavenID').agg(next_event=(date_col, min))
67
+ df_next = (df_next - start_date) / np.timedelta64(1, 'M')
68
+ df_next.columns = ['time_to_' + new_col]
69
+
70
+ return df_next
71
+
72
+
73
+ def bucket_time_to_event(df):
74
+ """
75
+ Calculate time in months to next event and bucket into
76
+ 1, 3, 6, 12, 12+ months.
77
+ --------
78
+ :param df: dataframe
79
+ :return: dataframe with event times in categories
80
+ """
81
+ month = [-1, 1, 3, 6, 12, 13]
82
+ label = ['1', '3', '6', '12', '12+']
83
+ df = df.apply(lambda x: pd.cut(x, month, labels=label))
84
+ df = df.fillna('12+')
85
+
86
+ return df
87
+
88
+
89
+ def calculate_event_metrics(data_path, eoy_date, start_date, end_date):
90
+ """
91
+ Generate tables with number of events in 12 months and
92
+ boolean for events
93
+ --------
94
+ :param data_path: path to generated data
95
+ :param eoy_date: end of year date
96
+ :param start_date: validation data start date
97
+ :param end_date: validation data end date
98
+ """
99
+ # Load in data
100
+ merged = pd.read_pickle(data_path + 'merged.pkl')
101
+
102
+ # Select relevant dates and columns
103
+ merged = filter_data(
104
+ merged, 'eoy', eoy_date, start_date, end_date, 'merged')
105
+ df_event = merged[['SafeHavenID'] + merged_cols].copy()  # copy to avoid SettingWithCopyWarning
106
+
107
+ # Create frame with total events within 12mo period
108
+ df_count = df_event.copy()
109
+ df_count.columns = ['SafeHavenID'] + n_cols
110
+ df_count.to_pickle(data_path + 'metric_table_counts.pkl')
111
+
112
+ # Create frame with boolean events within 12mo period
113
+ df_event[merged_cols] = (df_event[merged_cols] > 0).astype(int)
114
+ df_event.columns = ['SafeHavenID'] + base_cols
115
+ df_event.to_pickle(data_path + 'metric_table_events.pkl')
116
+
117
+
118
+ def calculate_next_event(data_path, extract_data_path, eoy_date,
119
+ start_date, end_date):
120
+ """
121
+ Generate table with the time in 1, 3, 6, 12, 12+ months
122
+ --------
123
+ :param data_path: path to generated data
124
+ :param extract_data_path: path to data extracts
125
+ :param eoy_date: end of year date
126
+ :param start_date: validation data start date
127
+ :param end_date: validation data end date
128
+ """
129
+ # Find next adm events
130
+ adm = pd.read_pickle(data_path + 'validation_adm_proc.pkl')
131
+ adm = filter_data(
132
+ adm, 'ADMDATE', eoy_date, start_date, end_date, 'adm')
133
+ adm['admission_any'] = 1
134
+ adm['admission_copd_resp'] = adm.copd_event | adm.resp_event
135
+ adm = adm[adm_cols]
136
+ time_to_adm_any = calc_time_to_event(
137
+ adm, 'ADMDATE', start_date, 'admission_any')
138
+ time_to_adm_copd = calc_time_to_event(
139
+ adm[adm.admission_copd_resp == 1], 'ADMDATE', start_date,
140
+ 'admission_copd_resp')
141
+
142
+ # Find next presc events
143
+ presc = pd.read_pickle(data_path + 'validation_presc_proc.pkl')
144
+ presc = filter_data(
145
+ presc, 'PRESC_DATE', eoy_date, start_date, end_date, 'presc')
146
+ presc = presc[presc_cols]
147
+ presc = presc[presc.rescue_meds == 1]
148
+ time_to_rescue = calc_time_to_event(
149
+ presc, 'PRESC_DATE', start_date, 'presc_rescue_med')
150
+
151
+ # Find next deaths
152
+ deaths = read_deaths(extract_data_path)
153
+ deaths = filter_data(
154
+ deaths, 'DOD', eoy_date, start_date, end_date, 'deaths')
155
+ deaths['death'] = 1
156
+ time_to_death = calc_time_to_event(
157
+ deaths, 'DOD', start_date, 'death')
158
+
159
+ # Merge results
160
+ frames = [time_to_adm_any, time_to_adm_copd, time_to_rescue, time_to_death]
161
+ results = pd.concat(frames, join='outer', axis=1)
162
+ results = bucket_time_to_event(results)
163
+ results.to_pickle(data_path + 'metric_table_next.pkl')
164
+
165
+
166
+ def main():
167
+
168
+ # Load in config items
169
+ with open('../../../config.json') as json_config_file:
170
+ config = json.load(json_config_file)
171
+
172
+ data_path = config['model_data_path']
173
+ extract_data_path = config['extract_data_path']
174
+ eoy_date = pd.to_datetime(config['date'])
175
+ start_date = eoy_date + pd.Timedelta(days=1)
176
+ end_date = eoy_date + pd.offsets.DateOffset(years=1)
177
+
178
+ calculate_event_metrics(data_path, eoy_date, start_date, end_date)
179
+ calculate_next_event(data_path, extract_data_path, eoy_date,
180
+ start_date, end_date)
181
+
182
+
183
+ main()
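The month-bucketing in bucket_time_to_event hinges on pd.cut with edges [-1, 1, 3, 6, 12, 13] (right-closed intervals) and on fillna('12+') absorbing patients with no observed event. A minimal standalone sketch, with invented example values:

```python
import pandas as pd

# Same edges/labels as bucket_time_to_event: (-1, 1] -> '1', (1, 3] -> '3', ...
month = [-1, 1, 3, 6, 12, 13]
label = ['1', '3', '6', '12', '12+']

# Hypothetical months-to-event values; NaN means no event was observed
times = pd.Series([0.5, 2.0, 7.5, 11.9, 12.4, float('nan')])

# Values outside the edges (and NaN) become NaN, then default to '12+'
buckets = pd.cut(times, month, labels=label).fillna('12+')
print(list(buckets))  # ['1', '3', '12', '12', '12+', '12+']
```

Note that events beyond 13 months also fall out of the bins and are mapped to '12+' by the fillna step.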
training/src/modelling/hierarchical_params.json ADDED
@@ -0,0 +1,4 @@
1
+ {
2
+ "n_clusters": 3,
3
+ "linkage": "ward"
4
+ }
training/src/modelling/kmeans_params.json ADDED
@@ -0,0 +1,4 @@
1
+ {
2
+ "n_clusters": 3,
3
+ "random_state": 10
4
+ }
training/src/modelling/one_vs_rest_BLR.py ADDED
@@ -0,0 +1,377 @@
1
+ """
2
+ Modelling process
3
+ """
4
+ import pandas as pd
5
+ import numpy as np
6
+ import pickle
7
+ import matplotlib.pyplot as plt
8
+ import mlflow
9
+ from matplotlib import rcParams
10
+ from sklearn.cluster import AgglomerativeClustering, KMeans
11
+ from sklearn.decomposition import PCA
12
+ from sklearn.metrics import (davies_bouldin_score, silhouette_score,
13
+ accuracy_score, confusion_matrix,
14
+ ConfusionMatrixDisplay)
15
+ from sklearn.linear_model import LogisticRegression
16
+ # from sklearn.multiclass import OneVsRestClassifier
17
+ import os
18
+
19
+
20
+ # Set-up figures
21
+ rcParams['figure.figsize'] = 20, 5
22
+ rcParams['axes.spines.top'] = False
23
+ rcParams['axes.spines.right'] = False
24
+
25
+ # Set parameters for current run
26
+ year = 2019
27
+ model_type = 'hierarchical'
28
+ data_type = 'train'
29
+ k = 3
30
+ stamp = str(pd.Timestamp.now(tz='GMT+0'))[:16].replace(':', '').replace(' ', '_')
31
+ data_path = '<YOUR_DATA_PATH>/Model_E_Extracts/'
32
+
33
+ # Set MLFlow parameters
34
+ mlflow.set_tracking_uri("file:/.")
35
+ tracking_uri = mlflow.get_tracking_uri()
36
+ experiment_name = 'Model E: one vs rest adaption BLR ' + model_type
37
+ run_name = "_".join((str(year), model_type, stamp))
38
+ description = "Clustering model with one vs rest adaption (BLR) for COPD data in " + str(year)
39
+
40
+
41
+ def extract_year(df, year):
42
+ """
43
+ Extract 1 year of data
44
+ --------
45
+ :param df: dataframe to extract from
46
+ :param year: year to select data from
47
+ :return: data from chosen year
48
+ """
49
+ return df[df.year == year]
50
+
51
+
52
+ def read_yearly_data(typ, year):
53
+ """
54
+ Read in data for year required
55
+ --------
56
+ :param typ: type of data to read in
57
+ :param year: year to select data from
58
+ :return: data from chosen year and ids
59
+ """
60
+ df = pd.read_pickle(data_path + 'min_max_' + typ + '.pkl')
61
+ df_year = extract_year(df, year)
62
+ ids = df_year.pop('SafeHavenID').to_list()
63
+ df_year = df_year.drop('year', axis=1)
64
+
65
+ return df_year, ids
66
+
67
+
68
+ def plot_variance(df, typ):
69
+ """
70
+ Plot PCA variance
71
+ ---------
72
+ :param df: dataframe to process with PCA
73
+ :param typ: type of plot - for 'full' data or 'reduced'
74
+ :return: pca object
75
+ """
76
+ pca = PCA().fit(df)
77
+ n = list(range(1, len(df.columns) + 1))
78
+ evr = pca.explained_variance_ratio_.cumsum()
79
+ fig, ax = plt.subplots()
80
+ ax.plot(n, evr)
81
+ title = 'PCA Variance - ' + typ
82
+ ax.set_title(title, size=20)
83
+ ax.set_xlabel('Number of principal components')
84
+ ax.set_ylabel('Cumulative explained variance')
85
+ ax.grid()
86
+ plt.tight_layout()
87
+ mlflow.log_figure(fig, 'fig/' + title + '.png')
88
+
89
+ return pca
90
+
91
+
92
+ def extract_pca_loadings(df, pca_object):
93
+ """
94
+ Extract PCA loadings
95
+ --------
96
+ :param df: dataframe to reduce with pca
97
+ :param pca_object: pca object with feature loadings
98
+ :return: loadings table
99
+ """
100
+ cols = df.columns
101
+ loadings = pd.DataFrame(
102
+ data=pca_object.components_.T * np.sqrt(pca_object.explained_variance_),
103
+ columns=[f'PC{i}' for i in range(1, len(cols) + 1)],
104
+ index=cols)
105
+
106
+ return loadings
107
+
108
+
109
+ def plot_loadings(loadings):
110
+ """
111
+ Plot loadings for PC1 returned from PCA
112
+ --------
113
+ :param loadings: table of feature correlations to PC1
114
+ :return: updated loadings table
115
+ """
116
+ loadings_abs = loadings.abs().sort_values(by='PC1', ascending=False)
117
+ pc1_abs = loadings_abs[['PC1']].reset_index()
118
+ col_map = {'index': 'Attribute', 'PC1': 'AbsCorrWithPC1'}
119
+ pc1_abs = pc1_abs.rename(col_map, axis=1)
120
+ fig, ax = plt.subplots()
121
+ pc1_abs.plot(ax=ax, kind='bar')
122
+ title = 'PCA loading scores (PC1)'
123
+ ax.set_title(title, size=20)
124
+ ax.set_xticks(ticks=pc1_abs.index, labels=pc1_abs.Attribute, rotation='vertical')
125
+ ax.set_xlabel('Attribute')
126
+ ax.set_ylabel('AbsCorrWithPC1')
127
+ plt.tight_layout()
128
+ mlflow.log_figure(fig, 'fig/' + title + '.png')
129
+
130
+ return pc1_abs
131
+
132
+
133
+ def extract_array(df, pca_object, typ):
134
+ """
135
+ Extract data to pass to clustering algos
136
+ --------
137
+ :param df: dataframe to convert
138
+ :param pca_object: initialised PCA object
139
+ :param typ: type of return needed, either 'train' or 'test'
140
+ :return: converted array (and PCA object if training)
141
+ """
142
+ if typ == 'train':
143
+ pca_func = pca_object.fit_transform
144
+ else:
145
+ pca_func = pca_object.transform
146
+
147
+ pca_data = pd.DataFrame(pca_func(df)).to_numpy()
148
+
149
+ if typ == 'train':
150
+ pca_file = data_path + run_name + '_pca.pkl'
151
+ pickle.dump(pca_object, open(pca_file, 'wb'))
152
+
153
+ return pca_data
154
+
155
+
156
+ def get_kmeans_score(data, k):
157
+ '''
158
+ Calculate K-Means Davies Bouldin and Silhouette scores
159
+ --------
160
+ :param data: dataset to fit K-Means to
161
+ :param k: number of centers/clusters
162
+ :return: Scores
163
+ '''
164
+ kmeans = KMeans(n_clusters=k)
165
+ model = kmeans.fit_predict(data)
166
+ db_score = davies_bouldin_score(data, model)
167
+ sil_score = silhouette_score(data, model)
168
+
169
+ return db_score, sil_score
170
+
171
+
172
+ def plot_DB(df):
173
+ """
174
+ Extract Davies-Bouldin score and plot for a range of cluster numbers,
175
+ applied using K-Means clustering.
176
+
177
+ "Davies Bouldin index represents the average 'similarity' of clusters,
178
+ where similarity is a measure that relates cluster distance to cluster
179
+ size" - the lowest score indicates best cluster set.
180
+ --------
181
+ :param df: dataframe to plot from
182
+ """
183
+ db_scores = []
184
+ sil_scores = []
185
+ centers = list(range(2, 10))
186
+ for center in centers:
187
+ db_score, sil_score = get_kmeans_score(df, center)
188
+ db_scores.append(db_score)
189
+ sil_scores.append(sil_score)
190
+
191
+ # Plot DB
192
+ fig, ax = plt.subplots()
193
+ ax.plot(centers, db_scores, linestyle='--', marker='o', color='b')
194
+ ax.set_xlabel('K')
195
+ ax.set_ylabel('Davies Bouldin score')
196
+ title = 'Davies Bouldin score vs. K'
197
+ ax.set_title(title, size=20)
198
+ plt.tight_layout()
199
+ mlflow.log_figure(fig, 'fig/' + title + '.png')
200
+
201
+ # Plot silhouette
202
+ fig, ax = plt.subplots()
203
+ ax.plot(centers, sil_scores, linestyle='--', marker='o', color='b')
204
+ ax.set_xlabel('K')
205
+ ax.set_ylabel('Silhouette score')
206
+ title = 'Silhouette score vs. K'
207
+ ax.set_title(title, size=20)
208
+ plt.tight_layout()
209
+ mlflow.log_figure(fig, 'fig/' + title + '.png')
210
+
211
+
212
+ def plot_clust(df, labels):
213
+ """
214
+ Plot clusters
215
+ --------
216
+ :param df: dataframe to plot clusters from
217
+ :param labels: cluster labels
218
+ """
219
+ fig = plt.figure(figsize=(10, 10))
220
+ ax = fig.add_subplot(111, projection='3d')
221
+ sc = ax.scatter(df[:, 0], df[:, 1], df[:, 2], c=labels)
222
+ ax.set_xlabel('Principal Component 1')
223
+ ax.set_ylabel('Principal Component 2')
224
+ ax.set_zlabel('Principal Component 3')
225
+ ax.legend(*sc.legend_elements(), title='clusters')
226
+ title = 'Clusters'
227
+ ax.set_title(title, size=20)
228
+ plt.tight_layout()
229
+ mlflow.log_figure(fig, 'fig/' + title + '.png')
230
+
231
+
232
+ def save_clusters(typ, labels):
233
+ """
234
+ Save results from clustering
235
+ --------
236
+ :param typ: type of datasets - train, val
237
+ :param labels: labels from clustering to add to df
238
+ :param cols: columns to use for training
239
+ :return: reduced dataframe in numpy format
240
+ """
241
+ df_full = pd.read_pickle(data_path + 'filled_' + typ + '.pkl')
242
+ df = df_full[df_full.year == year].copy()  # copy to avoid SettingWithCopyWarning
243
+ df['cluster'] = labels
244
+ df.to_pickle(data_path + '_'.join((run_name, typ, 'clusters.pkl')))
245
+
246
+
247
+ def main():
248
+
249
+ # Read in data
250
+ df_train, train_ids = read_yearly_data('train', year)
251
+ df_val, val_ids = read_yearly_data('val', year)
252
+
253
+ # Set up ML Flow
254
+ print('Setting up ML Flow run')
255
+ mlflow.set_tracking_uri('http://127.0.0.1:5000/')
256
+ mlflow.set_experiment(experiment_name)
257
+ mlflow.start_run(run_name=run_name, description=description)
258
+ mlflow.set_tag("model.name", model_type)
259
+ mlflow.set_tag("model.training_data", "EXAMPLE_STUDY_DATA")
260
+ mlflow.set_tag("model.training_year", year)
261
+ mlflow.log_param("n_cols", len(df_train.columns) - 1)
262
+ mlflow.log_param("k", k)
263
+
264
+ # Select top features using PCA feature importance
265
+ print('Feature reduction stage 1')
266
+ pca = plot_variance(df_train, 'full')
267
+ loadings = extract_pca_loadings(df_train, pca)
268
+ pc1_abs_loadings = plot_loadings(loadings)
269
+ variance_full = pca.explained_variance_ratio_.cumsum()
270
+
271
+ n_cols = np.argmax(variance_full >= 0.9) + 1
272
+
273
+ mlflow.log_param("pca_stage_1", n_cols)
274
+ columns = pc1_abs_loadings.Attribute[:n_cols].values
275
+ np.save(data_path + run_name + '_cols.npy', columns)
276
+
277
+ # Reduce data by selecting n columns
278
+ df_train_reduced = df_train[columns]
279
+ df_val_reduced = df_val[columns]
280
+
281
+ # Convert columns to principal components
282
+ print('Feature reduction stage 2')
283
+ pca_n_cols = plot_variance(df_train_reduced, 'reduced')
284
+ variance_reduced = pca_n_cols.explained_variance_ratio_.cumsum()
285
+
286
+ n_components = np.argmax(variance_reduced >= 0.8) + 1
287
+ mlflow.log_param("pca_stage_2", n_components)
288
+ pca_reduced = PCA(n_components=n_components)
289
+ data_train = extract_array(df_train_reduced, pca_reduced, 'train')
290
+ data_val = extract_array(df_val_reduced, pca_reduced, 'test')
291
+
292
+ # Find best cluster number
293
+ print('Detecting best cluster number')
294
+ plot_DB(data_train)
295
+
296
+ # Fit clustering model
297
+ print('Cluster model training')
298
+ data = np.concatenate((data_train, data_val))
299
+ cluster_model = AgglomerativeClustering(n_clusters=k, linkage="ward")
300
+ # cluster_model = KMeans(n_clusters=k, random_state=10)
301
+ cluster_model.fit(data)
302
+ cluster_model_file = data_path + "_".join((run_name, model_type, 'cluster_model.pkl'))
303
+ pickle.dump(cluster_model, open(cluster_model_file, 'wb'))
304
+
305
+ # Split labels
306
+ labels = cluster_model.labels_
307
+ train_labels = labels[:len(train_ids)]
308
+ val_labels = labels[len(train_ids):]
309
+ save_clusters('train', train_labels)
310
+ save_clusters('val', val_labels)
311
+
312
+ # Plot cluster results
313
+ plot_clust(data, labels)
314
+
315
+ # Train and validate classifier
316
+ print('BLR classifier training')
317
+
318
+ # Create a One-vs-Rest logistic regression classifier
319
+ clf = LogisticRegression(random_state=42, multi_class='ovr')
320
+ clf.fit(df_train_reduced.to_numpy(), train_labels)
321
+ clf_model_file = data_path + run_name + '_blr_model.pkl'
322
+ pickle.dump(clf, open(clf_model_file, 'wb'))
323
+
324
+ # Create a figure with one feature importance subplot for each class
325
+ n_classes = len(set(train_labels))
326
+ n_features = df_train_reduced.shape[1]
327
+
328
+ fig, axs = plt.subplots(n_classes, 1, figsize=(10, 5 * n_classes))
329
+
330
+ # Set the vertical spacing between subplots
331
+ fig.subplots_adjust(hspace=0.99)
332
+
333
+ # Loop over each class
334
+ for i in range(n_classes):
335
+ # Get the feature importances for the current class
336
+ coef = clf.coef_[i]
337
+ importance = coef
338
+
339
+ # Sort the feature importances in descending order
340
+ indices = np.argsort(importance)[::-1]
341
+
342
+ # Create a bar plot of the feature importances
343
+ axs[i].bar(range(n_features), importance[indices])
344
+ axs[i].set_xticks(range(n_features))
345
+ axs[i].set_xticklabels(np.array(df_train_reduced.columns)[indices], rotation=90, fontsize=9)
346
+ axs[i].set_xlabel('Features')
347
+ axs[i].set_ylabel('Importance')
348
+ axs[i].set_title('Class {} Feature Importance'.format(i))
349
+
350
+ # save the plot to a temporary file
351
+ tmpfile = "plot.png"
352
+ fig.savefig(tmpfile)
353
+
354
+ # log the plot to MLflow (log_artifact takes a local path; the second
+ # argument names an artifact directory, not a file)
+ mlflow.log_artifact(tmpfile, "fig")
357
+
358
+ # remove the temporary file
359
+ os.remove(tmpfile)
360
+
361
+ # Make predictions on the test data
362
+ val_pred = clf.predict(df_val_reduced.to_numpy())
363
+ accuracy = accuracy_score(val_labels, val_pred)
364
+ mlflow.log_metric('blr accuracy', accuracy)
365
+
366
+ cm = confusion_matrix(val_labels, val_pred, labels=clf.classes_)
367
+ disp = ConfusionMatrixDisplay(
368
+ confusion_matrix=cm, display_labels=clf.classes_)
369
+ disp.plot()
370
+ plt.tight_layout()
371
+ mlflow.log_figure(disp.figure_, 'fig/' + 'confusion_matrix' + '.png')
372
+
373
+ # Stop ML Flow
374
+ mlflow.end_run()
375
+
376
+
377
+ main()
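The two-stage reduction in main() (rank features by absolute PC1 loading and keep enough columns to cover 90% of cumulative variance, then project the survivors onto components covering 80%) can be sketched standalone. The data here is random noise purely for shape-checking; the thresholds mirror the script, and the float n_components form is an equivalent scikit-learn shortcut for the script's explicit cumsum count:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # synthetic stand-in for min-max scaled features

# Stage 1: loadings = components^T * sqrt(explained variance); rank by |PC1|
pca_full = PCA().fit(X)
loadings = pca_full.components_.T * np.sqrt(pca_full.explained_variance_)
order = np.argsort(np.abs(loadings[:, 0]))[::-1]
n_cols = int(np.argmax(pca_full.explained_variance_ratio_.cumsum() >= 0.9)) + 1
X_reduced = X[:, order[:n_cols]]

# Stage 2: a float n_components asks PCA for the smallest component count
# reaching that variance fraction (>= 0.8 here)
pca_reduced = PCA(n_components=0.8)
X_final = pca_reduced.fit_transform(X_reduced)

assert X_final.shape[1] <= X_reduced.shape[1] <= X.shape[1]
```

The resulting X_final plays the role of data_train passed to the clustering step.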
training/src/modelling/one_vs_rest_DTC.py ADDED
@@ -0,0 +1,380 @@
1
+ """
2
+ Modelling process
3
+ """
4
+ import pandas as pd
5
+ import numpy as np
6
+ import pickle
7
+ import matplotlib.pyplot as plt
8
+ import mlflow
9
+ from matplotlib import rcParams
10
+ from sklearn.cluster import AgglomerativeClustering, KMeans
11
+ from sklearn.decomposition import PCA
12
+ from sklearn.metrics import (davies_bouldin_score, silhouette_score,
13
+ accuracy_score, confusion_matrix,
14
+ ConfusionMatrixDisplay)
15
+ from sklearn.multiclass import OneVsRestClassifier
16
+ from sklearn.tree import DecisionTreeClassifier # , export_text
17
+ import os
18
+
19
+
20
+ # Set-up figures
21
+ rcParams['figure.figsize'] = 20, 5
22
+ rcParams['axes.spines.top'] = False
23
+ rcParams['axes.spines.right'] = False
24
+
25
+ # Set parameters for current run
26
+ year = 2019
27
+ model_type = 'hierarchical'
28
+ data_type = 'train'
29
+ k = 3
30
+ stamp = str(pd.Timestamp.now(tz='GMT+0'))[:16].replace(':', '').replace(' ', '_')
31
+ data_path = '<YOUR_DATA_PATH>/Model_E_Extracts/'
32
+
33
+ # Set MLFlow parameters
34
+ mlflow.set_tracking_uri("file:/.")
35
+ tracking_uri = mlflow.get_tracking_uri()
36
+ experiment_name = 'Model E: one vs rest adaption DTC ' + model_type
37
+ run_name = "_".join((str(year), model_type, stamp))
38
+ description = "Clustering model with one vs rest adaption (DTC) for COPD data in " + str(year)
39
+
40
+
41
+ def extract_year(df, year):
42
+ """
43
+ Extract 1 year of data
44
+ --------
45
+ :param df: dataframe to extract from
46
+ :param year: year to select data from
47
+ :return: data from chosen year
48
+ """
49
+ return df[df.year == year]
50
+
51
+
52
+ def read_yearly_data(typ, year):
53
+ """
54
+ Read in data for year required
55
+ --------
56
+ :param typ: type of data to read in
57
+ :param year: year to select data from
58
+ :return: data from chosen year and ids
59
+ """
60
+ df = pd.read_pickle(data_path + 'min_max_' + typ + '.pkl')
61
+ df_year = extract_year(df, year)
62
+ ids = df_year.pop('SafeHavenID').to_list()
63
+ df_year = df_year.drop('year', axis=1)
64
+
65
+ return df_year, ids
66
+
67
+
68
+ def plot_variance(df, typ):
69
+ """
70
+ Plot PCA variance
71
+ ---------
72
+ :param df: dataframe to process with PCA
73
+ :param typ: type of plot - for 'full' data or 'reduced'
74
+ :return: pca object
75
+ """
76
+ pca = PCA().fit(df)
77
+ n = list(range(1, len(df.columns) + 1))
78
+ evr = pca.explained_variance_ratio_.cumsum()
79
+ fig, ax = plt.subplots()
80
+ ax.plot(n, evr)
81
+ title = 'PCA Variance - ' + typ
82
+ ax.set_title(title, size=20)
83
+ ax.set_xlabel('Number of principal components')
84
+ ax.set_ylabel('Cumulative explained variance')
85
+ ax.grid()
86
+ plt.tight_layout()
87
+ mlflow.log_figure(fig, 'fig/' + title + '.png')
88
+
89
+ return pca
90
+
91
+
92
+ def extract_pca_loadings(df, pca_object):
93
+ """
94
+ Extract PCA loadings
95
+ --------
96
+ :param df: dataframe to reduce with pca
97
+ :param pca_object: pca object with feature loadings
98
+ :return: loadings table
99
+ """
100
+ cols = df.columns
101
+ loadings = pd.DataFrame(
102
+ data=pca_object.components_.T * np.sqrt(pca_object.explained_variance_),
103
+ columns=[f'PC{i}' for i in range(1, len(cols) + 1)],
104
+ index=cols)
105
+
106
+ return loadings
107
+
108
+
109
+ def plot_loadings(loadings):
110
+ """
111
+ Plot loadings for PC1 returned from PCA
112
+ --------
113
+ :param loadings: table of feature correlations to PC1
114
+ :return: updated loadings table
115
+ """
116
+ loadings_abs = loadings.abs().sort_values(by='PC1', ascending=False)
117
+ pc1_abs = loadings_abs[['PC1']].reset_index()
118
+ col_map = {'index': 'Attribute', 'PC1': 'AbsCorrWithPC1'}
119
+ pc1_abs = pc1_abs.rename(col_map, axis=1)
120
+ fig, ax = plt.subplots()
121
+ pc1_abs.plot(ax=ax, kind='bar')
122
+ title = 'PCA loading scores (PC1)'
123
+ ax.set_title(title, size=20)
124
+ ax.set_xticks(ticks=pc1_abs.index, labels=pc1_abs.Attribute, rotation='vertical')
125
+ ax.set_xlabel('Attribute')
126
+ ax.set_ylabel('AbsCorrWithPC1')
127
+ plt.tight_layout()
128
+ mlflow.log_figure(fig, 'fig/' + title + '.png')
129
+
130
+ return pc1_abs
131
+
132
+
133
+ def extract_array(df, pca_object, typ):
134
+ """
135
+ Extract data to pass to clustering algos
136
+ --------
137
+ :param df: dataframe to convert
138
+ :param pca_object: initialised PCA object
139
+ :param typ: type of return needed, either 'train' or 'test'
140
+ :return: converted array (and PCA object if training)
141
+ """
142
+ if typ == 'train':
143
+ pca_func = pca_object.fit_transform
144
+ else:
145
+ pca_func = pca_object.transform
146
+
147
+ pca_data = pd.DataFrame(pca_func(df)).to_numpy()
148
+
149
+ if typ == 'train':
150
+ pca_file = data_path + run_name + '_pca.pkl'
151
+ pickle.dump(pca_object, open(pca_file, 'wb'))
152
+
153
+ return pca_data
154
+
155
+
156
+ def get_kmeans_score(data, k):
157
+ '''
158
+ Calculate K-Means Davies Bouldin and Silhouette scores
159
+ --------
160
+ :param data: dataset to fit K-Means to
161
+ :param k: number of centers/clusters
162
+ :return: Scores
163
+ '''
164
+ kmeans = KMeans(n_clusters=k)
165
+ model = kmeans.fit_predict(data)
166
+ db_score = davies_bouldin_score(data, model)
167
+ sil_score = silhouette_score(data, model)
168
+
169
+ return db_score, sil_score
170
+
171
+
172
+ def plot_DB(df):
173
+ """
174
+ Extract David Bouldin score and plot for a range of cluster numbers,
175
+ applied using K-Means clustering.
176
+
177
+ "Davies Bouldin index represents the average 'similarity' of clusters,
178
+ where similarity is a measure that relates cluster distance to cluster
179
+ size" - the lowest score indicates best cluster set.
180
+ --------
181
+ :param df: dataframe to plot from
182
+ """
183
+ db_scores = []
184
+ sil_scores = []
185
+ centers = list(range(2, 10))
186
+ for center in centers:
187
+ db_score, sil_score = get_kmeans_score(df, center)
188
+ db_scores.append(db_score)
189
+ sil_scores.append(sil_score)
190
+
191
+ # Plot DB
192
+ fig, ax = plt.subplots()
193
+ ax.plot(centers, db_scores, linestyle='--', marker='o', color='b')
194
+ ax.set_xlabel('K')
195
+ ax.set_ylabel('Davies Bouldin score')
196
+ title = 'Davies Bouldin score vs. K'
197
+ ax.set_title(title, size=20)
198
+ plt.tight_layout()
199
+ mlflow.log_figure(fig, 'fig/' + title + '.png')
200
+
201
+ # Plot silhouette
202
+ fig, ax = plt.subplots()
203
+ ax.plot(centers, sil_scores, linestyle='--', marker='o', color='b')
204
+ ax.set_xlabel('K')
205
+ ax.set_ylabel('Silhouette score')
206
+ title = 'Silhouette score vs. K'
207
+ ax.set_title(title, size=20)
208
+ plt.tight_layout()
209
+ mlflow.log_figure(fig, 'fig/' + title + '.png')
210
+
211
+
212
+ def plot_clust(df, labels):
213
+ """
214
+ Plot clusters
215
+ --------
216
+ :param df: dataframe to plot clusters from
217
+ :param labels: cluster labels
218
+ """
219
+ fig = plt.figure(figsize=(10, 10))
220
+ ax = fig.add_subplot(111, projection='3d')
221
+ sc = ax.scatter(df[:, 0], df[:, 1], df[:, 2], c=labels)
222
+ ax.set_xlabel('Principal Component 1')
223
+ ax.set_ylabel('Principal Component 2')
224
+ ax.set_zlabel('Principal Component 3')
225
+ ax.legend(*sc.legend_elements(), title='clusters')
226
+ title = 'Clusters'
227
+ ax.set_title(title, size=20)
228
+ plt.tight_layout()
229
+ mlflow.log_figure(fig, 'fig/' + title + '.png')
230
+
231
+
232
+ def save_clusters(typ, labels):
233
+ """
234
+ Save results from clustering
235
+ --------
236
+ :param typ: type of datasets - train, val
237
+ :param labels: labels from clustering to add to df
238
+ :param cols: columns to use for training
239
+ :return: reduced dataframe in numpy format
240
+ """
241
+ df_full = pd.read_pickle(data_path + 'filled_' + typ + '.pkl')
242
+ df = df_full[df_full.year == year]
243
+ df['cluster'] = labels
244
+ df.to_pickle(data_path + '_'.join((run_name, typ, 'clusters.pkl')))
245
+
246
+
247
+ def main():
248
+
249
+ # Read in data
250
+ df_train, train_ids = read_yearly_data('train', year)
251
+ df_val, val_ids = read_yearly_data('val', year)
252
+
253
+ # Set up ML Flow
254
+ print('Setting up ML Flow run')
255
+ mlflow.set_tracking_uri('http://127.0.0.1:5000/')
256
+ mlflow.set_experiment(experiment_name)
257
+ mlflow.start_run(run_name=run_name, description=description)
258
+ mlflow.set_tag("model.name", model_type)
259
+ mlflow.set_tag("model.training_data", "EXAMPLE_STUDY_DATA")
260
+ mlflow.set_tag("model.training_year", year)
261
+ mlflow.log_param("n_cols", len(df_train.columns) - 1)
262
+ mlflow.log_param("k", k)
263
+
264
+ # Select top features using PCA feature importance
265
+ print('Feature reduction stage 1')
266
+ pca = plot_variance(df_train, 'full')
267
+ loadings = extract_pca_loadings(df_train, pca)
268
+ pc1_abs_loadings = plot_loadings(loadings)
269
+ variance_full = pca.explained_variance_ratio_.cumsum()
270
+
271
+ n_cols = np.argmax(variance_full >= 0.9) + 1
272
+
273
+ mlflow.log_param("pca_stage_1", n_cols)
274
+ columns = pc1_abs_loadings.Attribute[:n_cols].values
275
+ np.save(data_path + run_name + '_cols.npy', columns)
276
+
277
+ # Reduce data by selecting n columns
278
+ df_train_reduced = df_train[columns]
279
+ df_val_reduced = df_val[columns]
280
+
281
+ # Convert columns to principal components
282
+ print('Feature reduction stage 2')
283
+ pca_n_cols = plot_variance(df_train_reduced, 'reduced')
284
+ variance_reduced = pca_n_cols.explained_variance_ratio_.cumsum()
285
+
286
+ n_components = np.argmax(variance_reduced >= 0.8) + 1
287
+ mlflow.log_param("pca_stage_2", n_components)
288
+ pca_reduced = PCA(n_components=n_components)
289
+ data_train = extract_array(df_train_reduced, pca_reduced, 'train')
290
+ data_val = extract_array(df_val_reduced, pca_reduced, 'test')
291
+
292
+ # Find best cluster number
293
+ print('Detecting best cluster number')
294
+ plot_DB(data_train)
295
+
296
+ # Fit clustering model
297
+ print('Cluster model training')
298
+ data = np.concatenate((data_train, data_val))
299
+ cluster_model = AgglomerativeClustering(n_clusters=k, linkage="ward")
300
+ # cluster_model = KMeans(n_clusters=k, random_state=10)
301
+ cluster_model.fit(data)
302
+ cluster_model_file = data_path + "_".join((run_name, model_type, 'cluster_model.pkl'))
303
+ pickle.dump(cluster_model, open(cluster_model_file, 'wb'))
304
+
305
+ # Split labels
306
+ labels = cluster_model.labels_
307
+ train_labels = labels[:len(train_ids)]
308
+ val_labels = labels[len(train_ids):]
309
+ save_clusters('train', train_labels)
310
+ save_clusters('val', val_labels)
311
+
312
+ # Plot cluster results
313
+ plot_clust(data, labels)
314
+
315
+ # Train and validate classifier
316
+ print('BLR classifier training')
317
+
318
+ # Create a One-vs-Rest DecisionTreeClassifier
319
+ clf_pre = DecisionTreeClassifier(random_state=42)
320
+ clf = OneVsRestClassifier(clf_pre)
321
+ clf.fit(df_train_reduced.to_numpy(), train_labels)
322
+ clf_model_file = data_path + run_name + '_dtc_model.pkl'
323
+ pickle.dump(clf, open(clf_model_file, 'wb'))
324
+
325
+ # Create a figure with one feature importance subplot for each class
326
+ n_classes = len(set(train_labels))
327
+ n_features = df_train_reduced.shape[1]
328
+
329
+ fig, axs = plt.subplots(n_classes, 1, figsize=(10, 5 * n_classes))
330
+
331
+ # Set the vertical spacing between subplots
332
+ fig.subplots_adjust(hspace=0.99)
333
+
334
+ # Loop over each class
335
+ for i in range(n_classes):
336
+ # Get the feature importances for the current class
337
+ importance = clf.estimators_[i].feature_importances_
338
+
339
+ # Sort the feature importances in descending order
340
+ indices = np.argsort(importance)[::-1]
341
+
342
+ # Create a bar plot of the feature importances
343
+ axs[i].bar(range(n_features), importance[indices])
344
+ axs[i].set_xticks(range(n_features))
345
+ axs[i].set_xticklabels(np.array(df_train_reduced.columns)[indices], rotation=90, fontsize=9)
346
+ axs[i].set_xlabel('Features')
347
+ axs[i].set_ylabel('Importance')
348
+ axs[i].set_title('Class {} Feature Importance'.format(i))
349
+
350
+ # Adjust the spacing between the subplots
351
+ plt.subplots_adjust(hspace=0.5)
352
+
353
+ # save the plot to a temporary file
354
+ tmpfile = "plot.png"
355
+ fig.savefig(tmpfile)
356
+
357
+ # log the plot to MLflow
358
+ with open(tmpfile, "rb") as fig:
359
+ mlflow.log_artifact(tmpfile, "feature_importance.png")
360
+
361
+ # remove the temporary file
362
+ os.remove(tmpfile)
363
+
364
+ # Make predictions on the test data
365
+ val_pred = clf.predict(df_val_reduced.to_numpy())
366
+ accuracy = accuracy_score(val_labels, val_pred)
367
+ mlflow.log_metric('dtc accuracy', accuracy)
368
+
369
+ cm = confusion_matrix(val_labels, val_pred, labels=clf.classes_)
370
+ disp = ConfusionMatrixDisplay(
371
+ confusion_matrix=cm, display_labels=clf.classes_)
372
+ disp.plot()
373
+ plt.tight_layout()
374
+ mlflow.log_figure(disp.figure_, 'fig/' + 'confusion_matrix' + '.png')
375
+
376
+ # Stop ML Flow
377
+ mlflow.end_run()
378
+
379
+
380
+ main()
training/src/modelling/predict_clusters.py ADDED
@@ -0,0 +1,70 @@
+ import sys
+ import json
+ import pickle
+ import pandas as pd
+ import numpy as np
+
+
+ def extract_year(df, eoy_date):
+     """
+     Extract 1 year of data
+     --------
+     :param df: dataframe to extract from
+     :param eoy_date: user-specified end of year date
+     :return: data from chosen year
+     """
+     return df[df.eoy == eoy_date]
+
+
+ def read_yearly_data(data_path, data_type, eoy_date):
+     """
+     Read in data for the year required
+     --------
+     :param data_path: path to generated data
+     :param data_type: type of data to read in
+     :param eoy_date: user-specified end of year date
+     :return: data from chosen year and ids
+     """
+     df = pd.read_pickle(data_path + 'min_max_' + data_type + '.pkl')
+     df_year = extract_year(df, eoy_date)
+     ids = df_year.pop('SafeHavenID').to_list()
+     df_year = df_year.drop('eoy', axis=1)
+
+     return df_year, ids
+
+
+ def main():
+
+     # Load in config items
+     with open('../../../config.json') as json_config_file:
+         config = json.load(json_config_file)
+
+     # Set model parameters
+     eoy_date = config['date']
+     data_path = config['model_data_path']
+
+     # Get data type and run name from the command line
+     data_type = sys.argv[1]
+     run_name = sys.argv[2]
+
+     # Read data
+     print('Loading data')
+     columns = np.load(data_path + run_name + '_cols.npy', allow_pickle=True)
+     df_scaled, ids = read_yearly_data(data_path, data_type, eoy_date)
+     df_scaled_reduced = df_scaled[columns]
+     df_unscaled_full = pd.read_pickle(data_path + 'filled_' + data_type + '.pkl')
+     df_unscaled = extract_year(df_unscaled_full, eoy_date).copy()
+
+     # Load model
+     print('Loading model')
+     clf_model_file = data_path + run_name + '_dtc_model.pkl'
+     with open(clf_model_file, 'rb') as f:
+         clf = pickle.load(f)
+
+     # Predict on new data
+     print('Predicting clusters')
+     labels = clf.predict(df_scaled_reduced.to_numpy())
+     df_unscaled['cluster'] = labels
+     df_unscaled.to_pickle(data_path + '_'.join((run_name, data_type, 'clusters.pkl')))
+
+
+ if __name__ == '__main__':
+     main()
training/src/modelling/run_model.py ADDED
@@ -0,0 +1,355 @@
+ """
+ Modelling process
+ """
+ import json
+ import pickle
+ import pandas as pd
+ import numpy as np
+ import matplotlib.pyplot as plt
+ import mlflow
+ from matplotlib import rcParams
+ from sklearn.cluster import AgglomerativeClustering, KMeans
+ from sklearn.tree import DecisionTreeClassifier as DTC
+ from sklearn.decomposition import PCA
+ from sklearn.metrics import (davies_bouldin_score, silhouette_score,
+                              accuracy_score, confusion_matrix,
+                              ConfusionMatrixDisplay)
+
+
+ # Set up figures
+ rcParams['figure.figsize'] = 20, 5
+ rcParams['axes.spines.top'] = False
+ rcParams['axes.spines.right'] = False
+
+
+ def extract_year(df, eoy_date):
+     """
+     Extract 1 year of data
+     --------
+     :param df: dataframe to extract from
+     :param eoy_date: user-specified EOY date for training
+     :return: data from chosen year
+     """
+     return df[df.eoy == eoy_date]
+
+
+ def read_yearly_data(data_path, typ, eoy_date):
+     """
+     Read in data for the year required
+     --------
+     :param data_path: path to generated data
+     :param typ: type of data to read in
+     :param eoy_date: end of year date to select data from
+     :return: data from chosen year and ids
+     """
+     df = pd.read_pickle(data_path + 'min_max_' + typ + '.pkl')
+     df_year = extract_year(df, eoy_date)
+     ids = df_year.pop('SafeHavenID').to_list()
+     df_year = df_year.drop('eoy', axis=1)
+
+     return df_year, ids
+
+
+ def plot_variance(df, typ):
+     """
+     Plot PCA variance
+     ---------
+     :param df: dataframe to process with PCA
+     :param typ: type of plot - for 'full' data or 'reduced'
+     :return: pca object
+     """
+     pca = PCA().fit(df)
+     n = list(range(1, len(df.columns) + 1))
+     evr = pca.explained_variance_ratio_.cumsum()
+     fig, ax = plt.subplots()
+     ax.plot(n, evr)
+     title = 'PCA Variance - ' + typ
+     ax.set_title(title, size=20)
+     ax.set_xlabel('Number of principal components')
+     ax.set_ylabel('Cumulative explained variance')
+     ax.grid()
+     plt.tight_layout()
+     mlflow.log_figure(fig, 'fig/' + title + '.png')
+
+     return pca
+
+
+ def extract_pca_loadings(df, pca_object):
+     """
+     Extract PCA loadings
+     --------
+     :param df: dataframe to reduce with pca
+     :param pca_object: pca object with feature loadings
+     :return: loadings table
+     """
+     cols = df.columns
+     loadings = pd.DataFrame(
+         data=pca_object.components_.T * np.sqrt(pca_object.explained_variance_),
+         columns=[f'PC{i}' for i in range(1, len(cols) + 1)],
+         index=cols)
+
+     return loadings
+
+
+ def plot_loadings(loadings):
+     """
+     Plot loadings for PC1 returned from PCA
+     --------
+     :param loadings: table of feature correlations to PC1
+     :return: updated loadings table
+     """
+     loadings_abs = loadings.abs().sort_values(by='PC1', ascending=False)
+     pc1_abs = loadings_abs[['PC1']].reset_index()
+     col_map = {'index': 'Attribute', 'PC1': 'AbsCorrWithPC1'}
+     pc1_abs = pc1_abs.rename(col_map, axis=1)
+     fig, ax = plt.subplots()
+     pc1_abs.plot(ax=ax, kind='bar')
+     title = 'PCA loading scores (PC1)'
+     ax.set_title(title, size=20)
+     ax.set_xticks(ticks=pc1_abs.index, labels=pc1_abs.Attribute, rotation='vertical')
+     ax.set_xlabel('Attribute')
+     ax.set_ylabel('AbsCorrWithPC1')
+     plt.tight_layout()
+     mlflow.log_figure(fig, 'fig/' + title + '.png')
+
+     return pc1_abs
+
+
+ def extract_array(df, data_path, run_name, pca_object, typ):
+     """
+     Extract data to pass to clustering algos
+     --------
+     :param df: dataframe to convert
+     :param data_path: path to generated data
+     :param run_name: name of run in ML Flow
+     :param pca_object: initialised PCA object
+     :param typ: type of return needed, either 'train' or 'test'
+     :return: converted array (the fitted PCA object is also pickled if training)
+     """
+     if typ == 'train':
+         pca_func = pca_object.fit_transform
+     else:
+         pca_func = pca_object.transform
+
+     pca_data = pd.DataFrame(pca_func(df)).to_numpy()
+
+     if typ == 'train':
+         pca_file = data_path + run_name + '_pca.pkl'
+         with open(pca_file, 'wb') as f:
+             pickle.dump(pca_object, f)
+
+     return pca_data
+
+
+ def get_kmeans_score(data, k):
+     """
+     Calculate K-Means Davies Bouldin and Silhouette scores
+     --------
+     :param data: dataset to fit K-Means to
+     :param k: number of centers/clusters
+     :return: scores
+     """
+     kmeans = KMeans(n_clusters=k)
+     model = kmeans.fit_predict(data)
+     db_score = davies_bouldin_score(data, model)
+     sil_score = silhouette_score(data, model)
+
+     return db_score, sil_score
+
+
+ def plot_DB(df):
+     """
+     Extract the Davies Bouldin score and plot it for a range of cluster
+     numbers, applied using K-Means clustering.
+
+     "The Davies Bouldin index represents the average 'similarity' of
+     clusters, where similarity is a measure that relates cluster distance
+     to cluster size" - the lowest score indicates the best cluster set.
+     --------
+     :param df: dataframe to plot from
+     """
+     db_scores = []
+     sil_scores = []
+     centers = list(range(2, 10))
+     for center in centers:
+         db_score, sil_score = get_kmeans_score(df, center)
+         db_scores.append(db_score)
+         sil_scores.append(sil_score)
+
+     # Plot DB
+     fig, ax = plt.subplots()
+     ax.plot(centers, db_scores, linestyle='--', marker='o', color='b')
+     ax.set_xlabel('K')
+     ax.set_ylabel('Davies Bouldin score')
+     title = 'Davies Bouldin score vs. K'
+     ax.set_title(title, size=20)
+     plt.tight_layout()
+     mlflow.log_figure(fig, 'fig/' + title + '.png')
+
+     # Plot silhouette
+     fig, ax = plt.subplots()
+     ax.plot(centers, sil_scores, linestyle='--', marker='o', color='b')
+     ax.set_xlabel('K')
+     ax.set_ylabel('Silhouette score')
+     title = 'Silhouette score vs. K'
+     ax.set_title(title, size=20)
+     plt.tight_layout()
+     mlflow.log_figure(fig, 'fig/' + title + '.png')
+
+
+ def plot_clust(df, labels):
+     """
+     Plot clusters
+     --------
+     :param df: array to plot clusters from
+     :param labels: cluster labels
+     """
+     fig = plt.figure(figsize=(10, 10))
+     ax = fig.add_subplot(111, projection='3d')
+     sc = ax.scatter(df[:, 0], df[:, 1], df[:, 2], c=labels)
+     ax.set_xlabel('Principal Component 1')
+     ax.set_ylabel('Principal Component 2')
+     ax.set_zlabel('Principal Component 3')
+     ax.legend(*sc.legend_elements(), title='clusters')
+     title = 'Clusters'
+     ax.set_title(title, size=20)
+     plt.tight_layout()
+     mlflow.log_figure(fig, 'fig/' + title + '.png')
+
+
+ def save_clusters(data_path, run_name, eoy_date, typ, labels):
+     """
+     Save results from clustering
+     --------
+     :param data_path: path to generated data
+     :param run_name: name of run in ML Flow
+     :param eoy_date: user-specified end of year date
+     :param typ: type of dataset - train, val
+     :param labels: labels from clustering to add to the dataframe
+     """
+     df_full = pd.read_pickle(data_path + 'filled_' + typ + '.pkl')
+     df = df_full[df_full.eoy == eoy_date].copy()
+     df['cluster'] = labels
+     df.to_pickle(data_path + '_'.join((run_name, typ, 'clusters.pkl')))
+
+
+ def main():
+
+     # Load in config files
+     with open('../../../config.json') as json_config_file:
+         config = json.load(json_config_file)
+
+     # Set model parameters
+     eoy_date = config['date']
+     data_path = config['model_data_path']
+     model_type = config['model_type']
+
+     # Load in model config
+     with open(model_type + '_params.json') as json_params_file:
+         model_params = json.load(json_params_file)
+
+     # Create ML Flow run details
+     stamp = str(pd.Timestamp.now(tz='GMT+0'))[:16].replace(':', '').replace(' ', '_')
+     experiment_name = 'Model E - Date Specific: ' + model_type
+     run_name = "_".join((str(eoy_date), model_type, stamp))
+     description = "Clustering model for COPD data in the year prior to " + str(eoy_date)
+
+     # Set up ML Flow
+     print('Setting up ML Flow run')
+     mlflow.set_tracking_uri('http://127.0.0.1:5000/')
+     mlflow.set_experiment(experiment_name)
+     mlflow.start_run(run_name=run_name, description=description)
+     mlflow.set_tag("model.name", model_type)
+     mlflow.set_tag("model.training_data", config['extract_data_path'])
+     mlflow.set_tag("model.training_date", eoy_date)
+     mlflow.log_param("k", model_params['n_clusters'])
+
+     # Read in data
+     df_train, train_ids = read_yearly_data(data_path, 'train', eoy_date)
+     df_val, val_ids = read_yearly_data(data_path, 'val', eoy_date)
+     mlflow.log_param("n_cols", len(df_train.columns))
+
+     # Select top features using PCA feature importance
+     print('Feature reduction stage 1')
+     pca = plot_variance(df_train, 'full')
+     loadings = extract_pca_loadings(df_train, pca)
+     pc1_abs_loadings = plot_loadings(loadings)
+     variance_full = pca.explained_variance_ratio_.cumsum()
+     n_cols = np.argmax(variance_full >= 0.9) + 1
+     mlflow.log_param("pca_stage_1", n_cols)
+     columns = pc1_abs_loadings.Attribute[:n_cols].values
+     np.save(data_path + run_name + '_cols.npy', columns)
+
+     # Reduce data by selecting n columns
+     df_train_reduced = df_train[columns]
+     df_val_reduced = df_val[columns]
+
+     # Convert columns to principal components
+     print('Feature reduction stage 2')
+     pca_n_cols = plot_variance(df_train_reduced, 'reduced')
+     variance_reduced = pca_n_cols.explained_variance_ratio_.cumsum()
+     n_components = np.argmax(variance_reduced >= 0.8) + 1
+     mlflow.log_param("pca_stage_2", n_components)
+     pca_reduced = PCA(n_components=n_components)
+     data_train = extract_array(
+         df_train_reduced, data_path, run_name, pca_reduced, 'train')
+     data_val = extract_array(
+         df_val_reduced, data_path, run_name, pca_reduced, 'test')
+
+     # Find best cluster number
+     print('Detecting best cluster number')
+     plot_DB(data_train)
+
+     # Fit clustering model
+     print('Cluster model training')
+     data = np.concatenate((data_train, data_val))
+     if model_type == 'hierarchical':
+         cluster_model = AgglomerativeClustering(**model_params)
+     else:
+         cluster_model = KMeans(**model_params)
+     cluster_model.fit(data)
+     cluster_model_file = data_path + "_".join((run_name, model_type, 'cluster_model.pkl'))
+     with open(cluster_model_file, 'wb') as f:
+         pickle.dump(cluster_model, f)
+
+     # Split labels
+     labels = cluster_model.labels_
+     train_labels = labels[:len(train_ids)]
+     val_labels = labels[len(train_ids):]
+     save_clusters(data_path, run_name, eoy_date, 'train', train_labels)
+     save_clusters(data_path, run_name, eoy_date, 'val', val_labels)
+
+     # Plot cluster results
+     plot_clust(data, labels)
+
+     # Read in DTC parameters
+     with open('dtc_params.json') as dtc_params_file:
+         dtc_params = json.load(dtc_params_file)
+
+     # Train and validate classifier
+     print('Decision tree classifier training')
+     clf = DTC(**dtc_params).fit(df_train_reduced.to_numpy(), train_labels)
+     clf_model_file = data_path + run_name + '_dtc_model.pkl'
+     with open(clf_model_file, 'wb') as f:
+         pickle.dump(clf, f)
+
+     # Calculate metrics
+     val_pred = clf.predict(df_val_reduced.to_numpy())
+     accuracy = accuracy_score(val_labels, val_pred)
+     mlflow.log_metric('dtc accuracy', accuracy)
+
+     # Plot confusion matrix
+     cm = confusion_matrix(val_labels, val_pred, labels=clf.classes_)
+     disp = ConfusionMatrixDisplay(
+         confusion_matrix=cm, display_labels=clf.classes_)
+     disp.plot()
+     plt.tight_layout()
+     mlflow.log_figure(disp.figure_, 'fig/confusion_matrix.png')
+
+     # Stop ML Flow
+     mlflow.end_run()
+
+
+ if __name__ == '__main__':
+     main()
training/src/modelling/validate.py ADDED
@@ -0,0 +1,260 @@
+ """
+ Validation process
+ """
+ import sys
+ import json
+ import pandas as pd
+ import numpy as np
+ import matplotlib.pyplot as plt
+ import mlflow
+ from matplotlib import rcParams
+ from tableone import TableOne
+
+
+ # Set up figures
+ rcParams['figure.figsize'] = 20, 5
+ rcParams['axes.spines.top'] = False
+ rcParams['axes.spines.right'] = False
+
+
+ def plot_cluster_size(df, data_type):
+     """
+     Produce a bar plot of cluster size
+     --------
+     :param df: dataframe to plot
+     :param data_type: type of data - train, test, val, rec, sup
+     """
+     # Number of patients
+     fig, ax = plt.subplots()
+     df.groupby('cluster').size().plot(ax=ax, kind='barh')
+     title = "Patient Cohorts"
+     ax.set_title(title, size=20)
+     ax.set_xlabel("Number of Patients")
+     ax.set_ylabel("Cluster")
+     plt.tight_layout()
+     mlflow.log_figure(fig, 'fig/' + title.replace(' ', '_') + '_' + data_type + '.png')
+
+
+ def plot_feature_hist(df, col, data_type):
+     """
+     Produce a histogram plot for a chosen feature
+     --------
+     :param df: dataframe to plot
+     :param col: feature column to plot
+     :param data_type: type of data - train, test, val, rec, sup
+     """
+     fig, ax = plt.subplots()
+     df.groupby('cluster')[col].plot(ax=ax, kind='hist', alpha=0.5)
+     ax.set_xlabel(col)
+     title = col + ' Histogram'
+     ax.set_title(title, size=20)
+     ax.legend()
+     plt.tight_layout()
+     mlflow.log_figure(fig, 'fig/' + title.replace(' ', '_') + '_' + data_type + '.png')
+
+
+ def plot_feature_bar(data, col, typ, data_type):
+     """
+     Produce a bar plot for a chosen feature
+     --------
+     :param data: dataframe to plot
+     :param col: feature column to plot
+     :param typ: 'count' or 'percentage'
+     :param data_type: type of data - train, test, val, rec, sup
+     """
+     if typ == 'count':
+         to_plot = data.groupby(['cluster']).apply(
+             lambda x: x.groupby(col).size())
+         x_label = "Number"
+     else:
+         to_plot = data.groupby(['cluster']).apply(
+             lambda x: 100 * x.groupby(col).size() / len(x))
+         x_label = "Percentage"
+     fig, ax = plt.subplots()
+     to_plot.plot(ax=ax, kind='barh')
+     title = "Patient Cohorts"
+     ax.set_title(title, size=20)
+     ax.set_xlabel(x_label + " of patients")
+     ax.set_ylabel("Cluster")
+     plt.tight_layout()
+     mlflow.log_figure(fig, 'fig/' + '_'.join((title.replace(' ', '_'), col, data_type + '.png')))
+
+
+ def plot_cluster_bar(data, typ, data_type):
+     """
+     Produce a per-cluster percentage bar plot
+     --------
+     :param data: data to plot
+     :param typ: name of the metric being plotted - used as the title and in the filename
+     :param data_type: type of data - train, test, val, rec, sup
+     """
+     fig, ax = plt.subplots()
+     data.plot(ax=ax, kind='bar')
+     ax.set_title(typ, size=20)
+     ax.set_xlabel("Cluster")
+     ax.set_ylabel("Percentage")
+     ax.set_ylim(0, 100)
+     plt.tight_layout()
+     mlflow.log_figure(fig, 'fig/' + typ + '_' + data_type + '.png')
+
+
+ def plot_events(df, data_type):
+     """
+     Plot events in the next 12 months based on the metric table
+     --------
+     :param df: metric table
+     :param data_type: type of data - train, test, val, rec, sup
+     """
+     df = df.drop('SafeHavenID', axis=1).set_index('cluster')
+     events = df.groupby('cluster').apply(lambda x: 100 * x.apply(
+         lambda col: len(col[col == 1]) / len(col)))
+     plot_cluster_bar(events, 'events', data_type)
+
+
+ def process_deceased_metrics(col):
+     """
+     Process the deceased column for plotting
+     -------
+     :param col: column to process
+     """
+     n_deceased = 100 * ((col[col < '12+']).count()) / len(col)
+     res = pd.DataFrame({'alive': [100 - n_deceased], 'deceased': [n_deceased]})
+
+     return res
+
+
+ def plot_deceased(df, data_type):
+     """
+     Plot survival in the next 12 months based on the metric table
+     --------
+     :param df: metric table
+     :param data_type: type of data - train, test, val, rec, sup
+     """
+     survival = df.groupby('cluster')['time_to_death'].apply(
+         process_deceased_metrics).reset_index().drop(
+         'level_1', axis=1).set_index('cluster')
+     plot_cluster_bar(survival, 'survival', data_type)
+
+
+ def plot_therapies(df_year, results, data_type):
+     """
+     Plot patient therapies per cluster
+     --------
+     :param df_year: unscaled data for the current year
+     :param results: cluster results and SafeHaven id
+     :param data_type: type of data - train, test, val, rec, sup
+     """
+     # Inhaler data for training group
+     therapies = df_year[['SafeHavenID', 'single_inhaler', 'double_inhaler', 'triple_inhaler']]
+     res_therapies = pd.merge(therapies, results, on='SafeHavenID', how='inner')
+
+     # Find counts/percentage per cluster
+     inhaler_cols = ['single_inhaler', 'double_inhaler', 'triple_inhaler']
+     inhals = res_therapies[['cluster'] + inhaler_cols].set_index('cluster')
+     in_res = inhals.groupby('cluster').apply(
+         lambda x: x.apply(lambda col: 100 * (col[col > 0].count()) / len(col)))
+
+     # Number of people without an inhaler prescription
+     no_in = res_therapies.groupby('cluster').apply(
+         lambda x: 100 * len(x[(x[inhaler_cols] == 0).all(axis=1)]) / len(x)).values
+
+     # Rename columns for plotting
+     in_res.columns = [c[0] for c in in_res.columns.str.split('_')]
+
+     # Add those with no inhaler
+     in_res['no_inhaler'] = no_in
+
+     plot_cluster_bar(in_res, 'therapies', data_type)
+
+
+ def main():
+
+     # Load in config items
+     with open('../../../config.json') as json_config_file:
+         config = json.load(json_config_file)
+     data_path = config['model_data_path']
+
+     # Get data type, run name and run id from the command line
+     data_type = sys.argv[1]
+     run_name = sys.argv[2]
+     run_id = sys.argv[3]
+
+     # Set MLFlow parameters
+     model_type = 'hierarchical'
+     experiment_name = 'Model E - Date Specific: ' + model_type
+     mlflow.set_tracking_uri('http://127.0.0.1:5000/')
+     mlflow.set_experiment(experiment_name)
+     mlflow.start_run(run_id=run_id)
+
+     # Read in unscaled data, results and column names used to train the model
+     columns = np.load(data_path + run_name + '_cols.npy', allow_pickle=True)
+     df_clusters = pd.read_pickle(data_path + "_".join((run_name, data_type, 'clusters.pkl')))
+     df_reduced = df_clusters[list(columns) + ['cluster']]
+
+     # Number of patients
+     plot_cluster_size(df_reduced, data_type)
+
+     # Generate mean/std table
+     t1_year = TableOne(df_reduced, categorical=[], groupby='cluster', pval=True)
199
+ t1yr_file = data_path + 't1_year_' + run_name + '_' + data_type + '.html'
200
+ t1_year.to_html(t1yr_file)
201
+ mlflow.log_artifact(t1yr_file)
202
+
203
+ # Histogram feature plots
204
+ plot_feature_hist(df_clusters, 'age', data_type)
205
+ plot_feature_hist(df_clusters, 'albumin_med_2yr', data_type)
206
+
207
+ # Bar plots
208
+ df_clusters['sex'] = df_clusters['sex_bin'].map({0: 'Male', 1: 'Female'})
209
+ plot_feature_bar(df_clusters, 'sex', 'percent', data_type)
210
+ plot_feature_bar(df_clusters, 'simd_decile', 'percent', data_type)
211
+
212
+ # Metrics for following 12 months
213
+ df_events = pd.read_pickle(data_path + 'metric_table_events.pkl')
214
+ df_counts = pd.read_pickle(data_path + 'metric_table_counts.pkl')
215
+ df_next = pd.read_pickle(data_path + 'metric_table_next.pkl')
216
+
217
+ # Merge cluster number with SafeHavenID and metrics
218
+ clusters = df_clusters[['SafeHavenID', 'cluster']]
219
+ df_events = clusters.merge(df_events, on='SafeHavenID', how='left').fillna(0)
220
+ df_counts = clusters.merge(df_counts, on='SafeHavenID', how='left').fillna(0)
221
+ df_next = clusters.merge(df_next, on='SafeHavenID', how='left').fillna('12+')
222
+
223
+ # Generate TableOne for events
224
+ cat_cols = df_events.columns[2:]
225
+ df_events[cat_cols] = df_events[cat_cols].astype('int')
226
+ event_limit = dict(zip(cat_cols, 5 * [1]))
227
+ event_order = dict(zip(cat_cols, 5 * [[1, 0]]))
228
+ t1_events = TableOne(df_events[df_events.columns[1:]], groupby='cluster',
229
+ limit=event_limit, order=event_order)
230
+ t1_events_file = data_path + '_'.join(('t1', data_type, 'events', run_name + '.html'))
231
+ t1_events.to_html(t1_events_file)
232
+ mlflow.log_artifact(t1_events_file)
233
+
234
+ # Generate TableOne for event counts
235
+ count_cols = df_counts.columns[2:]
236
+ df_counts[count_cols] = df_counts[count_cols].astype('int')
237
+ t1_counts = TableOne(df_counts[df_counts.columns[1:]], categorical=[], groupby='cluster')
238
+ t1_counts_file = data_path + '_'.join(('t1', data_type, 'counts', run_name + '.html'))
239
+ t1_counts.to_html(t1_counts_file)
240
+ mlflow.log_artifact(t1_counts_file)
241
+
242
+ # Generate TableOne for time to next events
243
+ next_cols = df_next.columns[2:]
244
+ next_event_order = dict(zip(next_cols, 5 * [['1', '3', '6', '12', '12+']]))
245
+ t1_next = TableOne(df_next[df_next.columns[1:]], groupby='cluster',
246
+ order=next_event_order)
247
+ t1_next_file = data_path + '_'.join(('t1', data_type, 'next', run_name + '.html'))
248
+ t1_next.to_html(t1_next_file)
249
+ mlflow.log_artifact(t1_next_file)
250
+
251
+ # Plot metrics
252
+ plot_events(df_events, data_type)
253
+ plot_deceased(df_next, data_type)
254
+ plot_therapies(df_clusters, clusters, data_type)
255
+
256
+ # Stop ML Flow
257
+ mlflow.end_run()
258
+
259
+
260
+ main()
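The per-cluster percentage pattern used throughout the plotting script above (e.g. in `plot_events`) can be sketched in isolation; the cluster labels and the single event column below are illustrative, not from the study data:

```python
import pandas as pd

# Per-cluster share of patients whose event flag equals 1, mirroring the
# groupby/apply used in plot_events (illustrative data).
df = pd.DataFrame({
    'cluster': [0, 0, 0, 1, 1],
    'copd_event': [1, 0, 1, 0, 0],
}).set_index('cluster')

events = df.groupby('cluster').apply(
    lambda grp: 100 * grp.apply(lambda col: len(col[col == 1]) / len(col)))
```

The result is a DataFrame indexed by cluster with one percentage column per event flag, ready for `plot_cluster_bar`.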
training/src/processing/README.md ADDED
@@ -0,0 +1,24 @@
1
+ # Processing
2
+
3
+ This folder contains scripts for processing raw EHR data, along with the mappings required to carry out the initial processing steps.
4
+
5
+ Before running any scripts, first create a directory called 'Model_E_Extracts' within the 'S:/data' directory.
6
+
7
+ _NB: The below processing scripts can be run in any order._
8
+
9
+ ### Admissions
10
+
11
+ - process_admissions.py - SMR01 COPD/Resp admissions per patient per year
12
+ - process_comorbidities.py - SMR01 comorbidities per patient per year
13
+
14
+ ### Demographics
15
+
16
+ - process_demographics.py - DOB, sex, marital status and SIMD data
17
+
18
+ ### Labs
19
+
20
+ - process_labs.py - lab test values per patient per year, taking the median lab test value from the 2 years prior
21
+
22
+ ### Prescribing
23
+
24
+ - process_prescribing.py - prescriptions per patient per year
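The processing scripts all read paths and the reference date from `config.json` at the repo root. Based only on the keys these scripts access (`extract_data_path`, `model_data_path`, `date`), it takes roughly this shape; the values below are illustrative placeholders:

```json
{
    "extract_data_path": "<YOUR_DATA_PATH>/EXAMPLE_STUDY_DATA/",
    "model_data_path": "<YOUR_DATA_PATH>/Model_E_Extracts/",
    "date": "2019-06-30"
}
```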
training/src/processing/__init__.py ADDED
@@ -0,0 +1 @@
1
+ # Empty file for folder to be recognised as module
training/src/processing/mappings/Comorbidity feature review for models & clin summary update v2 May 2021.xlsx ADDED
Binary file (98.3 kB). View file
 
training/src/processing/mappings/README.md ADDED
@@ -0,0 +1,7 @@
1
+ # Mappings
2
+
3
+ This folder contains a range of mappings used within the processing stages of model E:
4
+ - `inhaler_mapping.json`: Inhaler mappings for any Chapter 3 BNF code inhaler prescriptions present in the SafeHaven prescribing dataset. Information on NHS inhaler types, found [here](https://www.coch.nhs.uk/media/172781/3-respiratory-system.pdf), was used to create the mapping.
5
+ - `test_mapping.json`: A mapping created for any of the top 20 most frequently occurring lab tests, plus any lab tests found relevant for indicating COPD severity in Model A. This mapping creates a common name for a specific test and lists any related names the test may appear under within the SCI Store dataset.
6
+ - `Comorbidity feature review for models & clin summary update v2 May 2021.xlsx`: A mapping between diagnosis names found in SMR01 and their associated comorbidities (taken from Model A).
7
+ - `diag_copd_resp_desc.json`: DIAGDesc values used to flag COPD, respiratory and anxiety/depression admissions
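A typical way to consume a class-to-drug-names mapping such as `inhaler_mapping.json` is to invert it into a drug-to-class lookup for labelling prescriptions. This is a hypothetical usage sketch; the two entries below are a small subset, inlined so the example is self-contained:

```python
# Invert class -> [drug names] into drug name -> class (subset for illustration).
inhaler_mapping = {
    "SABA": ["SALBUTAMOL", "TERBUTALINE SULFATE"],
    "ICS": ["BUDESONIDE", "CICLESONIDE"],
}
drug_to_class = {drug: cls
                 for cls, drugs in inhaler_mapping.items()
                 for drug in drugs}
```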
training/src/processing/mappings/diag_copd_resp_desc.json ADDED
@@ -0,0 +1,5 @@
1
+ {
2
+ "copd": "CHRONIC OBSTRUCTIVE PULMONARY DISEASE",
3
+ "resp": ["PNEUMONITIS DUE TO FOOD AND VOMIT", "RESPIRATORY FAILURE, UNSPECIFIED; TYPE UNSPECIFIED", "CHRONIC RESPIRATORY FAILURE; TYPE II [HYPERCAPNIC]", "BRONCHOPNEUMONIA, UNSPECIFIED", "DYSPNOEA", "PLEURAL EFFUSION IN CONDITIONS CLASSIFIED ELSEWHERE", "RESPIRATORY FAILURE, UNSPECIFIED; TYPE [HYPERCAPNIC]", "PLEURAL EFFUSION, NOT ELSEWHERE CLASSIFIED", "CHRONIC RESPIRATORY FAILURE", "OTHER BACTERIAL PNEUMONIA", "ABN MICROBIOLOGICAL FINDINGS IN SPECS FROM RESPIRATORY ORGANS AND THORAX", "RESPIRATORY FAILURE, UNSPECIFIED", "PNEUMONIA, UNSPECIFIED", "LOBAR PNEUMONIA, UNSPECIFIED", "COUGH", "PLEURAL PLAQUE WITH PRESENCE OF ASBESTOS", "PLEURAL PLAQUE WITHOUT ASBESTOS", "OTHER DISORDERS OF LUNG", "OTHER SPECIFIED PLEURAL CONDITIONS", "PULMONARY COLLAPSE", "ACQUIRED ABSENCE OF LUNG [PART OF]", "ASPHYXIATION", "RESPIRATORY FAILURE, UNSPECIFIED; TYPE [HYPOXIC]", "TRACHEOSTOMY STATUS", "ACUTE RESPIRATORY FAILURE", "UNSPECIFIED ACUTE LOWER RESPIRATORY INFECTION", "OTHER SPECIFIED SYMPTOMS AND SIGNS INVOLVING THE CIRC AND RESP SYSTEMS", "BACTERIAL PNEUMONIA, UNSPECIFIED", "PYOTHORAX WITHOUT FISTULA", "DISEASES OF BRONCHUS, NOT ELSEWHERE CLASSIFIED", "PNEUMONIA DUE TO HAEMOPHILUS INFLUENZAE", "ABNORMAL SPUTUM", "OTHER POSTPROCEDURAL RESPIRATORY DISORDERS", "OTHER AND UNSPECIFIED ABNORMALITIES OF BREATHING", "INFLUENZA WITH OTHER RESP MANIFESTATIONS, SEASONAL INFLUENZA VIRUS IDENTIF", "PERSONAL HISTORY OF DISEASES OF THE RESPIRATORY SYSTEM", "PNEUMONIA DUE TO STREPTOCOCCUS PNEUMONIAE", "WHEEZING", "CHEST PAIN ON BREATHING", "HAEMOPTYSIS", "INFLUENZA WITH OTHER MANIFESTATIONS, VIRUS NOT IDENTIFIED", "OTHER SPECIFIED RESPIRATORY DISORDERS", "ACUTE UPPER RESPIRATORY INFECTION, UNSPECIFIED", "T.B. OF LUNG, W/O MENTION OF BACTERIOLOGICAL OR HISTOLOGICAL CONFIRMATION", "DEPENDENCE ON RESPIRATOR", "PLEURISY", "BRONCHITIS, NOT SPECIFIED AS ACUTE OR CHRONIC"],
4
+ "anxiety_depression": ["ADVERSE EFFECTS OF OTHER SEDATIVES, HYPNOTICS AND ANTIANXIETY DRUGS", "ADVERSE EFFECTS OF SEDATIVE, HYPNOTIC AND ANTIANXIETY DRUG, UNSPECIFIED", "ANXIETY DISORDER, UNSPECIFIED", "ANXIOUS [AVOIDANT] PERSONALITY DISORDER", "GENERALIZED ANXIETY DISORDER", "MIXED ANXIETY AND DEPRESSIVE DISORDER", "OTHER MIXED ANXIETY DISORDERS", "OTHER PHOBIC ANXIETY DISORDERS", "OTHER SPECIFIED ANXIETY DISORDERS", "PANIC DISORDER [EPISODIC PAROXYSMAL ANXIETY]", "PHOBIC ANXIETY DISORDER, UNSPECIFIED", "ADVERSE EFFECTS OF MONOAMINE-OXIDASE-INHIBITOR ANTIDEPRESSANTS", "ADVERSE EFFECTS OF OTHER AND UNSPECIFIED ANTIDEPRESSANTS", "ADVERSE EFFECTS OF TRICYCLIC AND TETRACYCLIC ANTIDEPRESSANTS", "BIPOLAR AFFECT DISORDER, CUR EPISODE SEVERE DEPRESSION WITH PSYCHOTIC SYMP", "BIPOLAR AFFECTIVE DISORDER, CURR EPISODE SEV DEPRESSION W/O PSYCHOTIC SYMP", "BIPOLAR AFFECTIVE DISORDER, CURRENT EPISODE MILD OR MODERATE DEPRESSION", "BRIEF DEPRESSIVE REACTION", "DEPRESSIVE EPISODE, UNSPECIFIED", "MILD DEPRESSIVE EPISODE", "MIXED ANXIETY AND DEPRESSIVE DISORDER", "MODERATE DEPRESSIVE EPISODE", "OTHER DEPRESSIVE EPISODES", "OTHER RECURRENT DEPRESSIVE DISORDERS", "POISONING BY MONOAMINE-OXIDASE-INHIBITOR ANTIDEPRESSANTS", "POISONING BY OTHER AND UNSPECIFIED ANTIDEPRESSANTS", "POISONING BY TRICYCLIC AND TETRACYCLIC ANTIDEPRESSANTS", "POST-SCHIZOPHRENIC DEPRESSION", "RECURRENT DEPRESSIVE DISORDER, CURRENT EPISODE MODERATE", "RECURRENT DEPRESSIVE DISORDER, CURRENT EPISODE SEVERE W/O PSYCHOTIC SYMPT", "RECURRENT DEPRESSIVE DISORDER, CURRENT EPISODE SEVERE WITH PSYCHOTIC SYMPT", "RECURRENT DEPRESSIVE DISORDER, UNSPECIFIED", "SCHIZOAFFECTIVE DISORDER, DEPRESSIVE TYPE", "SEVERE DEPRESSIVE EPISODE WITH PSYCHOTIC SYMPTOMS", "SEVERE DEPRESSIVE EPISODE WITHOUT PSYCHOTIC SYMPTOMS"]
5
+ }
training/src/processing/mappings/inhaler_mapping.json ADDED
@@ -0,0 +1,55 @@
1
+ {
2
+ "SABA": [
3
+ "BAMBUTEROL HYDROCHLORIDE",
4
+ "SALBUTAMOL",
5
+ "TERBUTALINE SULFATE"
6
+ ],
7
+ "LABA": [
8
+ "FORMOTEROL FUMARATE",
9
+ "INDACATEROL",
10
+ "OLODATEROL",
11
+ "SALMETEROL"
12
+ ],
13
+ "LAMA": [
14
+ "ACLIDINIUM BROMIDE",
15
+ "TIOTROPIUM",
16
+ "UMECLIDINIUM BROMIDE"
17
+ ],
18
+ "SAMA": [
19
+ "IPRATROPIUM BROMIDE"
+ ],
+ "LABA-LAMA": [
21
+ "ACLIDINIUM BROMIDE AND FORMOTEROL FUMARATE",
22
+ "INDACATEROL WITH GLYCOPYRRONIUM BROMIDE",
23
+ "TIOTROPIUM AND OLODATEROL",
24
+ "UMECLIDINIUM BROMIDE AND VILANTEROL TRIFENATATE"
25
+ ],
26
+ "ICS": [
27
+ "BECLOMETASONE DIPROPIONATE",
28
+ "BUDESONIDE",
29
+ "CICLESONIDE",
30
+ "FLUTICASONE PROPIONATE",
31
+ "MOMETASONE FUROATE"
32
+ ],
33
+ "LABA-ICS": [
34
+ "BECLOMETASONE DIPROPIONATE AND FORMOTEROL FUMARATE",
35
+ "BUDESONIDE WITH FORMOTEROL FUMARATE",
36
+ "FLUTICASONE FUROATE AND VILANTEROL",
37
+ "FLUTICASONE PROPIONATE AND FORMOTEROL FUMARATE",
38
+ "SALMETEROL WITH FLUTICASONE PROPIONATE"
39
+ ],
40
+ "LAMA +LABA-ICS": [
41
+ "BECLOMETASONE DIPROPIONATE AND FORMOTEROL FUMARATE AND GLYCOPYRRONIUM",
42
+ "FLUTICASONE FUROATE WITH UMECLIDINIUM BROMIDE AND VILANTEROL TRIFENATATE"
43
+ ],
44
+ "LABA-LAMA-ICS": [],
45
+ "SABA + SAMA": [
46
+ "SALBUTAMOL WITH IPRATROPIUM"
47
+ ],
48
+ "MCS": [
49
+ "NEDOCROMIL SODIUM",
50
+ "SODIUM CROMOGLICATE"
51
+ ],
52
+ "Ignore": [
53
+ "MENTHOL WITH EUCALYPTUS"
54
+ ]
55
+ }
training/src/processing/mappings/test_mapping.json ADDED
@@ -0,0 +1 @@
1
+ {"ALT": ["A.L.T.", "Alanine Transaminase"], "AST": ["A.S.T.", "Aspartate Transam", "Aspartate Transamina"], "Alkaline Phosphatase": "Alkaline Phos.", "Basophils": ["ABS BASOPHIL", "BASOPHIL (MANUAL)", "BASOPHILS", "Basophil count"], "C Reactive Protein": "C-reactive Protein", "Eosinophils": ["EOSINOPHIL (MANUAL)", "EOSINOPHILS", "Eosinophil count", "EOSINOPHILS ABSOLUTE", "Eosinophils\u017d"], "Haematocrit": "HAEMATOCRIT", "Haemoglobin": ["HAEMOGLOBIN", "HAEMOGLOBIN A1c"], "Lymphocytes": ["ABSOLUTE LYMPHOCYTES", "LYMPHOCYTES", "Lymphocyte Count", "Lymphocyte count"], "Mean Cell Volume": ["MEAN CELL VOLUME", "Mean cell volume"], "Monocytes": ["ABSOLUTE MONOCYTES", "MONOCYTES", "Monocyte count"], "Neutrophils": ["ABSOLUTE NEUTROPHILS", "NEUTROPHILS", "Neutrophil count"], "PCO2 Temp Corrected": "PCO2 temp corrected", "Platelets": ["PLATELET COUNT", "PLATELETS", "Platelet Count", "Platelet count"], "Red Blood Count": ["Red Cell Count", "RED BLOOD COUNT", "RED CELL COUNT", "Red Blood Cell Count", "Red blood cell (RBC) count", "Red blood count"], "Serum Vitamin B12": ["Serum vitamin B12", "SERUM B12"], "White Blood Count": ["WBC", "WBC - Biological Fl", "WHITE BLOOD CELLS", "WHITE BLOOD COUNT", "White Cell Count", "White blood count"]}
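Note that the values in `test_mapping.json` are either a single alias string or a list of aliases (e.g. `"Alkaline Phosphatase": "Alkaline Phos."`), so any code building an alias-to-common-name lookup has to normalise both shapes. A small sketch using a subset of the mapping:

```python
# Build alias -> common name, handling both str and list values.
test_mapping = {
    "ALT": ["A.L.T.", "Alanine Transaminase"],
    "Alkaline Phosphatase": "Alkaline Phos.",
}
alias_to_common = {}
for common, aliases in test_mapping.items():
    if isinstance(aliases, str):  # single alias stored as a bare string
        aliases = [aliases]
    for alias in aliases:
        alias_to_common[alias] = common
```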
training/src/processing/misc/process_gples.py ADDED
@@ -0,0 +1,73 @@
1
+ """
2
+ Process GPLES data
3
+ --------
4
+ Extract the number of COPD GP events per patient per year
5
+ """
6
+ import pandas as pd
7
+ from utils.common import read_data, first_patient_appearance
8
+
9
+
10
+ def initialize_gples_data(file):
11
+ """
12
+ Load in and convert GPLES dataset to correct format
13
+ --------
14
+ :param file: filename to read from
15
+ :return: gples dataframe with correct column names and types
16
+ """
17
+ print('Loading GPLES data')
18
+
19
+ # Read in data
20
+ gp_cols = ['SafeHavenID', 'EventDate', 'ShortName']
21
+ gp_types = ['int', 'object', 'str']
22
+ df = read_data(file, gp_cols, gp_types)
23
+
24
+ # Drop nulls and duplicates
25
+ df = df.dropna().drop_duplicates()
26
+
27
+ # Convert date columns to correct type
28
+ df.columns = ['SafeHavenID', 'ADMDATE', 'ShortName']
29
+ df['ADMDATE'] = pd.to_datetime(df['ADMDATE'])
30
+
31
+ # Only track COPD events
32
+ df = df[df.ShortName == 'COPD'][['SafeHavenID', 'ADMDATE']]
33
+ df['gp_copd_event'] = 1
34
+
35
+ return df
36
+
37
+
38
+ def extract_yearly_data(df):
39
+ """
40
+ Extract data per year from GPLES dataset
41
+ --------
42
+ :param df: gples dataframe to be processed
43
+ :return: reduced gples dataset
44
+ """
45
+ print('Reducing GPLES data')
46
+
47
+ # Extract year column for historical features
48
+ df['year'] = df.ADMDATE.dt.year
49
+
50
+ # Extract yearly data
51
+ group_cols = ['SafeHavenID', 'year']
52
+ gples_events = df.groupby(group_cols)[['gp_copd_event']].sum()
53
+
54
+ return gples_events
55
+
56
+
57
+ def main():
58
+
59
+ # Load data
60
+ gp_file = "<YOUR_DATA_PATH>/EXAMPLE_STUDY_DATA/GPLES_Cohort3R.csv"
61
+ gples = initialize_gples_data(gp_file)
62
+
63
+ # Save first date in dataset
64
+ first_patient_appearance(gples, 'ADMDATE', 'gples')
65
+
66
+ # Reduce GPLES to 1 row per year per ID
67
+ gples_yearly = extract_yearly_data(gples)
68
+
69
+ # Save data
70
+ gples_yearly.to_pickle('<YOUR_DATA_PATH>/Model_E_Extracts/gples_proc.pkl')
71
+
72
+
73
+ main()
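The yearly reduction in `extract_yearly_data` above boils down to a groupby-sum over patient and year. A self-contained sketch with illustrative IDs and dates:

```python
import pandas as pd

# Count COPD GP events per patient per year (illustrative data).
df = pd.DataFrame({
    'SafeHavenID': [1, 1, 1, 2],
    'ADMDATE': pd.to_datetime(['2018-01-05', '2018-07-09',
                               '2019-03-02', '2018-12-31']),
})
df['gp_copd_event'] = 1
df['year'] = df.ADMDATE.dt.year

yearly = df.groupby(['SafeHavenID', 'year'])[['gp_copd_event']].sum()
```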
training/src/processing/misc/process_validation_adm.py ADDED
@@ -0,0 +1,28 @@
1
+ from utils.adm_common import (initialize_adm_data, correct_stays,
2
+ track_copd_resp)
3
+
4
+
5
+ def main():
6
+
7
+ # Load in data
8
+ adm_file = "<YOUR_DATA_PATH>/EXAMPLE_STUDY_DATA/SMR01_Cohort3R.csv"
9
+ adm = initialize_adm_data(adm_file)
10
+
11
+ # Fill null STAY data and combine transfer admissions
12
+ adm = correct_stays(adm)
13
+
14
+ # Track COPD and respiratory events
15
+ adm = track_copd_resp(adm)
16
+
17
+ # Select relevant columns
18
+ adm_reduced = adm[['SafeHavenID', 'ADMDATE', 'copd_event', 'resp_event']]
19
+
20
+ # Track events
21
+ adm_reduced['copd_resp_event'] = adm_reduced['copd_event'] | adm_reduced['resp_event']
22
+ adm_reduced['adm_event'] = 1
23
+
24
+ # Save data
25
+ adm_reduced.to_pickle('<YOUR_DATA_PATH>/Model_E_Extracts/validation_adm_proc-og.pkl')
26
+
27
+
28
+ main()
training/src/processing/misc/process_validation_presc.py ADDED
@@ -0,0 +1,20 @@
1
+ from utils.presc_common import initialize_presc_data, track_medication
2
+
3
+
4
+ def main():
5
+
6
+ # Read in data
7
+ presc_file = "<YOUR_DATA_PATH>/EXAMPLE_STUDY_DATA/Pharmacy_Cohort3R.csv"
8
+ presc = initialize_presc_data(presc_file)
9
+
10
+ # Track salbutamol and rescue meds
11
+ presc = track_medication(presc)
12
+
13
+ # Reduce columns
14
+ presc_reduced = presc[['SafeHavenID', 'PRESC_DATE', 'rescue_meds']]
15
+
16
+ # Save data
17
+ presc_reduced.to_pickle('<YOUR_DATA_PATH>/Model_E_Extracts/validation_presc_proc-og.pkl')
18
+
19
+
20
+ main()
training/src/processing/process_admissions.py ADDED
@@ -0,0 +1,153 @@
1
+ """
2
+ Process SMR01 admission data
3
+ --------
4
+ Clean and process admission data while adding tracking for COPD and respiratory
5
+ admissions per year for each SafeHavenID
6
+ """
7
+ import json
8
+ import pandas as pd
9
+ from datetime import date
10
+ from dateutil.relativedelta import relativedelta
11
+ from utils.common import add_hist_adm_presc, first_patient_appearance
12
+ from utils.adm_common import (initialize_adm_data, correct_stays,
13
+ track_copd_resp)
14
+ from utils.adm_processing import (convert_ethgrp_desc, mode_ethnicity,
15
+ search_diag)
16
+ from utils.adm_reduction import fill_missing_years, calc_adm_per_year
17
+
18
+
19
+ def process_ethnicity(df):
20
+ """
21
+ Find relevant ethnic group for each patient, accounting for null data
22
+ --------
23
+ :param df: admission dataframe to be updated
24
+ :return: admission dataframe with ethnicity cleaned and updated
25
+ """
26
+ print('Processing ethnicity')
27
+
28
+ # Fill in missing ethnicities
29
+ df = df.rename(columns={'ETHGRP': 'eth_grp'})
30
+ df['eth_grp'] = df.eth_grp.str.strip()
31
+ df['eth_grp'] = df.groupby('SafeHavenID')['eth_grp'].apply(
32
+ lambda x: x.ffill().bfill().fillna('Unknown'))
33
+
34
+ # Convert to 1 of 7 ethnic groups
35
+ df['eth_grp'] = [convert_ethgrp_desc(eth) for eth in df.eth_grp]
36
+
37
+ # Find most commonly occurring ethnicity per SafeHavenID
38
+ df = df.groupby('SafeHavenID').apply(mode_ethnicity, 'eth_grp')
39
+
40
+ return df
41
+
42
+
43
+ def add_eoy_column(df, dt_col, eoy_date):
44
+ """
45
+ Add EOY relative to user-specified end date
46
+ --------
47
+ :param df: dataframe
48
+ :param dt_col: date column in dataframe
49
+ :param eoy_date: EOY date from config
50
+ :return: updated df with EOY column added
51
+ """
52
+ # Needed to stop error with creating a new column
53
+ df = df.reset_index(drop=True)
54
+
55
+ # Add column with user-specified end of year date
56
+ end_date = pd.to_datetime(eoy_date)
57
+ end_month = end_date.month
58
+ end_day = end_date.day
59
+
60
+ # Add for every year
61
+ df['eoy'] = [date(y, end_month, end_day) for y in df[dt_col].dt.year]
62
+
63
+ # Check that EOY date is after dt_col for each entry
64
+ eoy_index = df.columns[df.columns == 'eoy']
65
+ adm_vs_eoy = df[dt_col] > df.eoy
66
+ row_index = df.index[adm_vs_eoy]
67
+ df.loc[row_index, eoy_index] = df[adm_vs_eoy].eoy + relativedelta(years=1)
68
+ df['eoy'] = pd.to_datetime(df.eoy)
69
+
70
+ return df
71
+
72
+
73
+ def extract_yearly_data(df):
74
+ """
75
+ Extract features on a yearly basis for each SafeHavenID
76
+ --------
77
+ :param df: admission dataframe to be updated
78
+ :return: dataframe with feature values per year
79
+ """
80
+ print('Reducing to 1 row SafeHavenID per year')
81
+
82
+ # Track rows which are admissions
83
+ df['adm'] = 1
84
+
85
+ # Add rows from years where patient did not have admissions
86
+ df = df.groupby('SafeHavenID').apply(fill_missing_years)
87
+ df = df.reset_index(drop=True)
88
+
89
+ # Add any historical count columns
90
+ df = df.groupby('SafeHavenID').apply(add_hist_adm_presc, 'adm', 'ADMDATE')
91
+ df = df.reset_index(drop=True)
92
+
93
+ # Reduce data to 1 row per year
94
+ df = calc_adm_per_year(df)
95
+
96
+ # Select columns in final order
97
+ final_cols = ['eth_grp', 'adm_per_year', 'total_hosp_days',
98
+ 'mean_los', 'copd_per_year', 'resp_per_year',
99
+ 'anxiety_depression_per_year', 'days_since_copd',
100
+ 'days_since_resp', 'days_since_adm', 'adm_to_date',
101
+ 'copd_to_date', 'resp_to_date', 'anxiety_depression_to_date',
102
+ 'copd_date', 'resp_date', 'adm_date']
103
+
104
+ df = df[final_cols]
105
+
106
+ return df
107
+
108
+
109
+ def main():
110
+
111
+ # Load in config items
112
+ with open('../../../config.json') as json_config_file:
113
+ config = json.load(json_config_file)
114
+
115
+ # Load in data
116
+ adm_file = config['extract_data_path'] + 'SMR01_Cohort3R.csv'
117
+ adm = initialize_adm_data(adm_file)
118
+
119
+ # Fill null STAY data and combine transfer admissions
120
+ adm = correct_stays(adm)
121
+
122
+ # Save first date in dataset
123
+ data_path = config['model_data_path']
124
+ first_patient_appearance(adm, 'ADMDATE', 'adm', data_path)
125
+
126
+ # Process ethnicity data
127
+ adm = process_ethnicity(adm)
128
+
129
+ # Track COPD and respiratory events
130
+ adm = track_copd_resp(adm)
131
+
132
+ # Track anxiety event
133
+ adm = search_diag(adm, 'anxiety_depression')
134
+
135
+ # Select relevant columns
136
+ reduced_cols = ['SafeHavenID', 'eth_grp', 'ADMDATE', 'STAY', 'copd_event',
137
+ 'resp_event', 'anxiety_depression_event']
138
+ adm_reduced = adm[reduced_cols]
139
+
140
+ # Save per event dataset
141
+ adm_reduced.to_pickle(data_path + 'validation_adm_proc.pkl')
142
+
143
+ # Add column relative to user-specified date
144
+ adm_reduced = add_eoy_column(adm_reduced, 'ADMDATE', config['date'])
145
+
146
+ # Extract yearly data
147
+ adm_yearly = extract_yearly_data(adm_reduced)
148
+
149
+ # Save data
150
+ adm_yearly.to_pickle(data_path + 'adm_proc.pkl')
151
+
152
+
153
+ main()
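The `add_eoy_column` logic above assigns each admission the next occurrence of a configured month/day boundary, rolling forward a year when the admission falls after that date. A minimal sketch of the same steps, with illustrative dates:

```python
import pandas as pd
from datetime import date
from dateutil.relativedelta import relativedelta

# Assign each row its end-of-year boundary relative to a configured date.
adm = pd.DataFrame({'ADMDATE': pd.to_datetime(['2019-02-01', '2019-11-15'])})
end = pd.to_datetime('2019-06-30')

adm['eoy'] = [date(y, end.month, end.day) for y in adm.ADMDATE.dt.year]
# Roll the boundary forward a year where the admission falls after it
late = adm.ADMDATE > pd.to_datetime(adm.eoy)
adm.loc[late, 'eoy'] = adm.loc[late, 'eoy'] + relativedelta(years=1)
adm['eoy'] = pd.to_datetime(adm.eoy)
```

Here the February admission keeps the 2019-06-30 boundary, while the November one is rolled forward to 2020-06-30.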
training/src/processing/process_comorbidities.py ADDED
@@ -0,0 +1,161 @@
1
+ """
2
+ Process SMR01 comorbidities data
3
+ --------
4
+ Clean and process comorbidities, tracking specific comorbidities and returning
5
+ the total number of comorbidities per patient per year
6
+ """
7
+ import json
8
+ import pandas as pd
9
+ from datetime import date
10
+ from dateutil.relativedelta import relativedelta
11
+ from utils.common import track_event
12
+ from utils.adm_common import initialize_adm_data, correct_stays
13
+ from utils.comorb_processing import diagnosis_mapping_lists
14
+
15
+
16
+ def track_comorbidity(df, excel_file, sheet_name, diag_names):
17
+ """
18
+ Map from admission descriptions to comorbidities using provided sheet.
19
+ Add new column for each comorbidity.
20
+ --------
21
+ :param df: pandas dataframe
22
+ :param excel_file: str filename for diagnosis mapping
23
+ :param sheet_name: str sheet name for diagnosis mapping
24
+ :param diag_names: list of diagnoses
25
+ :return: dataframe update with diagnosis mapping
26
+ """
27
+ print('Tracking comorbidities')
28
+
29
+ # Load in mappings
30
+ mapping = diagnosis_mapping_lists(excel_file, sheet_name, diag_names)
31
+
32
+ # Select relevant columns
33
+ diag_columns = ['DIAG1Desc', 'DIAG2Desc', 'DIAG3Desc', 'DIAG4Desc',
34
+ 'DIAG5Desc', 'DIAG6Desc']
35
+ df_diag = df[diag_columns]
36
+
37
+ # Create column for each comorbidity
38
+ for key in mapping:
39
+ com = mapping[key]
40
+ com_bool = df_diag.apply(lambda x: track_event(x, com, False))
41
+ com_int = com_bool.any(axis=1).astype(int)
42
+ df[key] = com_int
43
+
44
+ return df
45
+
46
+
47
+ def fill_comorbidities(df, diag_names):
48
+ """
49
+ Forward-fill comorbidity flags per patient
50
+ --------
51
+ :param df: dataframe of groupby values
52
+ :param diag_names: list of diagnoses
53
+ :return: updated dataframe
54
+ """
55
+
56
+ df[diag_names] = df[diag_names].replace(to_replace=0, method='ffill')
57
+
58
+ return df
59
+
60
+
61
+ def add_eoy_column(df, dt_col, eoy_date):
62
+ """
63
+ Add EOY relative to user-specified end date
64
+ --------
65
+ :param df: dataframe
66
+ :param dt_col: date column in dataframe
67
+ :param eoy_date: EOY date from config
68
+ :return: updated df with EOY column added
69
+ """
70
+ # Needed to stop error with creating a new column
71
+ df = df.reset_index(drop=True)
72
+
73
+ # Add column with user-specified end of year date
74
+ end_date = pd.to_datetime(eoy_date)
75
+ end_month = end_date.month
76
+ end_day = end_date.day
77
+
78
+ # Add for every year
79
+ df['eoy'] = [date(y, end_month, end_day) for y in df[dt_col].dt.year]
80
+
81
+ # Check that EOY date is after dt_col for each entry
82
+ eoy_index = df.columns[df.columns == 'eoy']
83
+ adm_vs_eoy = df[dt_col] > df.eoy
84
+ row_index = df.index[adm_vs_eoy]
85
+ df.loc[row_index, eoy_index] = df[adm_vs_eoy].eoy + relativedelta(years=1)
86
+ df['eoy'] = pd.to_datetime(df.eoy)
87
+
88
+ return df
89
+
90
+
91
+ def add_yearly_stats(df):
92
+ """
93
+ Sum comorbidities per patient per year
94
+ --------
95
+ :param df: dataframe to update
96
+ :return: sum of comorbidities per patient per year
97
+ """
98
+ print('Adding comorbidity count per year')
99
+
100
+ # Drop cols not required anymore
101
+ cols_2_drop = ['ADMDATE', 'DISDATE', 'STAY', 'ETHGRP', 'DIAG1Desc',
102
+ 'DIAG2Desc', 'DIAG3Desc', 'DIAG4Desc', 'DIAG5Desc',
103
+ 'DIAG6Desc']
104
+ df = df.drop(cols_2_drop, axis=1)
105
+
106
+ # Sum comorbidities
107
+ df = df.groupby(['SafeHavenID', 'eoy']).last().sum(axis=1)
108
+ df = df.to_frame().rename(columns={0: 'comorb_per_year'})
109
+
110
+ return df
111
+
112
+
113
+ def main():
114
+
115
+ # Load in config items
116
+ with open('../../../config.json') as json_config_file:
117
+ config = json.load(json_config_file)
118
+
119
+ # Load in data
120
+ adm_file = config['extract_data_path'] + 'SMR01_Cohort3R.csv'
121
+ adm = initialize_adm_data(adm_file)
122
+
123
+ # Fill null STAY data and combine transfer admissions
124
+ adm = correct_stays(adm)
125
+
126
+ # Prepare text data - strip string columns
127
+ adm = adm.apply(lambda x: x.str.strip() if x.dtype == 'object' else x)
128
+
129
+ # Track comorbidities
130
+ excel_file = "mappings/Comorbidity feature review for models & clin " \
131
+ "summary update v2 May 2021.xlsx"
132
+ sheet_name = 'Diagnosis category mapping3'
133
+ diag_names = ['Ischaemic_hd', 'Atrial_fib', 'pacemake', 'periph_vasc',
134
+ 'cog_imp', 'HF1', 'LV_sys', 'valv_hd', 'HF_pres_ejec',
135
+ 'hypertension', 'Cerebrovascula_dis', 'Diabetes_mel',
136
+ 'Osteoporosis', 'frailty', 'liver_dis', 'metastat_canc',
137
+ 'headneck_canc', 'breast_canc', 'gi_canc', 'other_canc',
138
+ 'kidney_dis', 'Asthma_ov', 'Pulmonary_fib',
139
+ 'Obstructive_apnoea', 'Pulmonary_hyp', 'Previous_pneum',
140
+ 'DVT_PTE', 'Lung_cancer', 'Bronchiectasis', 'Resp_fail']
141
+ adm_comorb = track_comorbidity(adm, excel_file, sheet_name, diag_names)
142
+
143
+ # Drop date column
144
+ adm_comorb = adm_comorb.sort_values('ADMDATE').reset_index(drop=True)
145
+
146
+ # Drop fill comorb cols
147
+ print('Filling comorbidities')
148
+ adm_filled = adm_comorb.groupby('SafeHavenID').apply(
149
+ fill_comorbidities, diag_names)
150
+
151
+ # Add column relative to user-specified date
152
+ adm_filled = add_eoy_column(adm_filled, 'ADMDATE', config['date'])
153
+
154
+ # Add yearly stats
155
+ adm_yearly = add_yearly_stats(adm_filled)
156
+
157
+ # Save data
158
+ adm_yearly.to_pickle(config['model_data_path'] + 'comorb_proc.pkl')
159
+
160
+
161
+ main()
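The forward-fill in `fill_comorbidities` above means that once a comorbidity flag is 1 for a patient, later 0s become 1. The `replace(to_replace=0, method='ffill')` call it uses is deprecated in recent pandas; an equivalent mask/ffill sketch (flags illustrative):

```python
import pandas as pd

# Carry comorbidity flags forward: 0s after a 1 become 1; leading 0s stay 0.
flags = pd.DataFrame({'hypertension': [0, 1, 0, 0],
                      'diabetes': [0, 0, 0, 1]})
filled = flags.mask(flags.eq(0)).ffill().fillna(0).astype(int)
```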
training/src/processing/process_demographics.py ADDED
@@ -0,0 +1,74 @@
1
+ """
2
+ Process demographics data
3
+ --------
4
+ Process DOB, sex, marital status and SIMD data
5
+ """
6
+ import json
7
+ from utils.common import read_data, correct_column_names
8
+
9
+
10
+ def initialize_demo_data(demo_file):
11
+ """
12
+ Load in demographics dataset to correct format
13
+ --------
14
+ :param demo_file: demographics data file name
15
+ :return: demographics dataframe with correct column names and types
16
+ """
17
+ print('Loading demographic data')
18
+
19
+ # Read in data
20
+ demo_cols = ['SafeHavenID', 'OBF_DOB', 'SEX', 'MARITAL_STATUS',
21
+ 'SIMD_2009_QUINTILE', 'SIMD_2009_DECILE',
22
+ 'SIMD_2009_VIGINTILE', 'SIMD_2012_QUINTILE',
23
+ 'SIMD_2012_DECILE', 'SIMD_2012_VIGINTILE',
24
+ 'SIMD_2016_QUINTILE', 'SIMD_2016_DECILE',
25
+ 'SIMD_2016_VIGINTILE']
26
+ demo_types = ['int', 'object', 'str', 'str', 'float', 'float', 'float',
27
+ 'float', 'float', 'float', 'float', 'float', 'float']
28
+ df = read_data(demo_file, demo_cols, demo_types)
29
+
30
+ # Nulls dropped later in process, only drop duplicates
31
+ df = df.drop_duplicates()
32
+
33
+ return df
34
+
35
+
36
+ def process_sex(df):
37
+ """
38
+ Process sex column in demographics
39
+ --------
40
+ :param df: dataframe to update
41
+ :return: updated dataframe
42
+ """
43
+ print('One-hot encoding sex')
44
+
45
+ df['sex_bin'] = (df.SEX == 'F').astype(int)
46
+
47
+ return df
48
+
49
+
50
+ def main():
51
+
52
+ # Load in config items
53
+ with open('../../../config.json') as json_config_file:
54
+ config = json.load(json_config_file)
55
+
56
+ # Load in data
57
+ demo_file = config['extract_data_path'] + 'Demographics_Cohort3R.csv'
58
+ demo = initialize_demo_data(demo_file)
59
+
60
+ # Create binary sex column
61
+ demo = process_sex(demo)
62
+
63
+ # Drop original columns
64
+ demo = demo.drop('SEX', axis=1)
65
+
66
+ # Correct column names
67
+ new_cols = correct_column_names(demo.columns[1:], 'demo')
68
+ demo.columns = ['SafeHavenID'] + new_cols
69
+
70
+ # Save data
71
+ demo.to_pickle(config['model_data_path'] + 'demo_proc.pkl')
72
+
73
+
74
+ main()
training/src/processing/process_labs.py ADDED
@@ -0,0 +1,247 @@
1
+ """
2
+ Script for preprocessing labs data
3
+ --------
4
+ Track median values for labs tests over the previous 2 years for patients
5
+ with resulting dataset containing 1 row of information per patient per year
6
+ """
7
+ import json
8
+ import pandas as pd
9
+ import numpy as np
10
+ from datetime import date
11
+ from dateutil.relativedelta import relativedelta
12
+ from utils.common import (read_data, correct_column_names,
13
+ first_patient_appearance)
14
+ from utils.labs_processing import add_total_labs
15
+
16
+
17
+ def initialize_labs_data(labs_file):
18
+ """
19
+ Load in labs dataset to correct format
20
+ --------
21
+ :param labs_file: labs data file name
22
+ :return: labs dataframe with correct column names and types
23
+ """
24
+ print('Loading labs data')
25
+
26
+ # Read in data
27
+ old_cols = ['SafeHavenID', 'SAMPLEDATE', 'CLINICALCODEDESCRIPTION',
28
+ 'QUANTITYVALUE', 'RANGEHIGHVALUE', 'RANGELOWVALUE']
29
+ labs_types = ['int', 'object', 'str', 'float', 'float', 'float']
30
+ df = read_data(labs_file, old_cols, labs_types)
31
+
32
+ # Rename columns to CamelCase
33
+ new_cols = ['SafeHavenID', 'SampleDate', 'ClinicalCodeDescription',
34
+ 'QuantityValue', 'RangeHighValue', 'RangeLowValue']
35
+ mapping = dict(zip(old_cols, new_cols))
36
+ df = df.rename(columns=mapping)
37
+
38
+ # Drop any nulls, duplicates or negative (broken) test values
39
+ df = df.dropna().drop_duplicates()
40
+
41
+ # Check tests are valid (values > -1)
42
+ num_cols = ['QuantityValue', 'RangeHighValue', 'RangeLowValue']
43
+ df = df[(df[num_cols] > -1).all(axis=1)]
44
+
45
+ # Select final columns
46
+ final_cols = ['SafeHavenID', 'SampleDate', 'ClinicalCodeDescription',
47
+ 'QuantityValue']
48
+ df = df[final_cols]
49
+
50
+ # Convert date
51
+ df['SampleDate'] = pd.to_datetime(df.SampleDate)
52
+
53
+ return df
54
+
55
+
56
+ def clean_labs(df):
57
+ """
58
+ Clean descriptions and select relevant tests
59
+ --------
60
+ :param df: pandas dataframe
61
+ :return: cleaned dataframe
62
+ """
63
+ print('Cleaning labs data')
64
+
65
+ lab_tests = ['ALT', 'AST', 'Albumin', 'Alkaline Phosphatase', 'Basophils',
66
+ 'C Reactive Protein', 'Chloride', 'Creatinine', 'Eosinophils',
67
+ 'Estimated GFR', 'Haematocrit', 'Haemoglobin', 'Lymphocytes',
68
+ 'MCH', 'Mean Cell Volume', 'Monocytes', 'Neutrophils',
69
+ 'PCO2 (temp corrected', 'Platelets', 'Potassium',
70
+ 'Red Blood Count', 'Serum vitamin B12', 'Sodium',
71
+ 'Total Bilirubin', 'Urea', 'White Blood Count']
72
+
73
+ # Strip any whitespaces
74
+ str_col = 'ClinicalCodeDescription'
75
+ df[str_col] = df[str_col].str.strip()
76
+
77
+ # Read in test mapping
78
+ with open('mappings/test_mapping.json') as json_file:
79
+ test_mapping = json.load(json_file)
80
+
81
+ # Correct names for relevant tests
82
+ for k, v in test_mapping.items():
83
+ df[str_col] = df[str_col].replace(v, k)
84
+
85
+ # Select relevant tests
86
+ df = df[[desc in lab_tests for desc in df[str_col]]]
87
+
88
+ return df
89
+
90
+
91
+ def add_neut_lypmh(df):
92
+ """
93
+ Pivot dataframe and calculate neut_lypmh feature
94
+ --------
95
+ :param df: pandas dataframe
96
+ :return: pivoted dataframe
97
+ """
98
+ print('Calculating neut_lypmh data')
99
+
100
+ # Pivot table with CCDesc as headers and QuantityValue as values
101
+ df = pd.pivot_table(
102
+ df, index=['SafeHavenID', 'SampleDate'],
103
+ columns=['ClinicalCodeDescription'], values='QuantityValue',
104
+ dropna=True).reset_index()
105
+
106
+ # Add neut_lymph feature
107
+ df['neut_lymph'] = df.Neutrophils / df.Lymphocytes
108
+
109
+ # Replace any infinite values
110
+ df['neut_lymph'] = df.neut_lymph.replace([np.inf, -np.inf], np.nan)
111
+
112
+ return df
113
+
114
+
115
+ def add_eoy_column(df, dt_col, eoy_date):
116
+ """
117
+ Add EOY relative to user-specified end date
118
+ --------
119
+ :param df: dataframe
120
+ :param dt_col: date column in dataframe
121
+ :param eoy_date: EOY date from config
122
+ :return: updated df with EOY column added
123
+ """
124
+ # Needed to stop error with creating a new column
125
+ df = df.reset_index(drop=True)
126
+
127
+ # Add column with user-specified end of year date
128
+ end_date = pd.to_datetime(eoy_date)
129
+ end_month = end_date.month
130
+ end_day = end_date.day
131
+
132
+ # Add for every year
133
+ df['eoy'] = [date(y, end_month, end_day) for y in df[dt_col].dt.year]
134
+
135
+ # Check that EOY date is after dt_col for each entry
136
+ eoy_index = df.columns[df.columns == 'eoy']
137
+ adm_vs_eoy = df[dt_col] > df.eoy
138
+ row_index = df.index[adm_vs_eoy]
139
+ df.loc[row_index, eoy_index] = df[adm_vs_eoy].eoy + relativedelta(years=1)
140
+ df['eoy'] = pd.to_datetime(df.eoy)
141
+
142
+ return df
143
+
144
+
145
+ def reduce_labs_data(df, dt_col):
146
+ """
147
+ Reduce dataset to 1 row per ID per year looking back at the median values
148
+ over the previous 2 years
149
+ --------
150
+ :param df: pandas dataframe
151
+ :param dt_col: date column
152
+ :return: reduced labs dataframe
153
+ """
154
+ print('Reducing labs to 1 row per patient per year')
155
+
156
+ group_cols = ['SafeHavenID', 'eoy']
157
+ med_cols = ['ALT', 'AST', 'Albumin', 'Alkaline Phosphatase', 'Basophils',
158
+ 'C Reactive Protein', 'Chloride', 'Creatinine', 'Eosinophils',
159
+ 'Estimated GFR', 'Haematocrit', 'Haemoglobin', 'Lymphocytes',
160
+ 'MCH', 'Mean Cell Volume', 'Monocytes', 'Neutrophils',
161
+ 'Platelets', 'Potassium', 'Red Blood Count', 'Sodium',
162
+ 'Total Bilirubin', 'Urea', 'White Blood Count', 'neut_lymph']
163
+
164
+ # Add column to track labs per year
165
+ df['labs'] = 1
166
+
167
+ # Sort by date and extract year
168
+ df = df.sort_values(dt_col)
169
+
170
+ # Include data from previous year
171
+ shifted = df[['eoy']] + pd.DateOffset(years=1)
172
+ new_tab = df[['SafeHavenID', dt_col] + med_cols].join(shifted)
173
+ combined_cols = ['SafeHavenID', 'eoy', dt_col] + med_cols
174
+ combined = pd.concat([df[combined_cols], new_tab])
175
+ combined = combined.sort_values(dt_col)
176
+
177
+ # Extract median data for last 2 years
178
+ df_med = combined.groupby(group_cols).median()
179
+
180
+ # Rename median columns
181
+ new_med_cols = [col + '_med_2yr' for col in df_med.columns]
182
+ df_med.columns = new_med_cols
183
+
184
+ # Only carry forward year data that appeared in df
185
+ test = []
186
+ for k, v in df.groupby('SafeHavenID')['eoy'].unique().to_dict().items():
187
+ test.append(df_med.loc[(k, v), ])
188
+ df_med = pd.concat(test)
189
+
190
+ # Extract features to find last value of
191
+ df_last = df[group_cols + ['labs_to_date']]
192
+ df_last = df_last.groupby(group_cols).last()
193
+
194
+ # Extract features to calculate sum of
195
+ df_sum = df[group_cols + ['labs']]
196
+ df_sum = df.groupby(group_cols)['labs'].sum()
197
+
198
+ # Rename sum columns
199
+ df_sum = df_sum.to_frame()
200
+ df_sum.columns = ['labs_per_year']
201
+
202
+ # Merge datasets
203
+ df_annual = df_med.join(df_last).join(df_sum)
204
+
205
+ return df_annual
206
+
207
+
208
+ def main():
209
+
210
+ # Load in config items
211
+ with open('../../../config.json') as json_config_file:
212
+ config = json.load(json_config_file)
213
+
214
+ # Load in data
215
+ labs_file = config['extract_data_path'] + 'SCI_Store_Cohort3R.csv'
216
+ labs = initialize_labs_data(labs_file)
217
+
218
+ # Clean data
219
+ labs = clean_labs(labs)
220
+
221
+ # Save first date in dataset
222
+ data_path = config['model_data_path']
223
+ first_patient_appearance(labs, 'SampleDate', 'labs', data_path)
224
+
225
+ # Pivot and add neut_lypmh
226
+ labs = add_neut_lypmh(labs)
227
+
228
+ # Add EOY column relative to user specified date
229
+ labs = add_eoy_column(labs, 'SampleDate', config['date'])
230
+ labs = labs.sort_values('SampleDate')
231
+
232
+ # Track each lab event
233
+ labs['labs_to_date'] = 1
234
+ labs = labs.groupby('SafeHavenID').apply(add_total_labs)
235
+ labs = labs.reset_index(drop=True)
236
+
237
+ # Reduce labs to 1 row per ID per year
238
+ labs_yearly = reduce_labs_data(labs, 'SampleDate')
239
+
240
+ # Correct column names
241
+ labs_yearly.columns = correct_column_names(labs_yearly.columns, 'labs')
242
+
243
+ # Save data
244
+ labs_yearly.to_pickle(data_path + 'labs_proc.pkl')
245
+
246
+
247
+ main()
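The 2-year lookback in `reduce_labs_data` works by concatenating a copy of the data with `eoy` shifted forward one year, so grouping by `(id, eoy)` and taking the median covers the current and previous year. A minimal sketch with made-up values (not real lab results):

```python
import pandas as pd

# One patient, one ALT value per year (toy data)
df = pd.DataFrame({
    'SafeHavenID': [1, 1, 1],
    'eoy': pd.to_datetime(['2018-12-31', '2019-12-31', '2020-12-31']),
    'ALT': [10.0, 20.0, 40.0],
})

# Duplicate each row with eoy pushed forward a year, then take group medians
shifted = df.assign(eoy=df.eoy + pd.DateOffset(years=1))
combined = pd.concat([df, shifted])
med = combined.groupby(['SafeHavenID', 'eoy'])['ALT'].median()

# The 2019 group now holds the 2018 and 2019 values: median of 10 and 20
result_2019 = med.loc[(1, pd.Timestamp('2019-12-31'))]
```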
training/src/processing/process_prescribing.py ADDED
@@ -0,0 +1,145 @@
+ """
+ Script for preprocessing pharmacy data
+ --------
+ Process pharmacy data and track inhaler prescriptions and rescue meds
+ """
+ import json
+ import pandas as pd
+ from datetime import date
+ from dateutil.relativedelta import relativedelta
+ from utils.common import (add_hist_adm_presc, correct_column_names,
+                           first_patient_appearance)
+ from utils.presc_common import initialize_presc_data, track_medication
+
+
+ def add_inhaler_mappings(df):
+     """
+     Load inhaler prescription mappings and track where they appear in the data
+     --------
+     :param df: dataframe
+     :return: dataframe with column added for each inhaler type
+     """
+     print('Mapping inhaler prescriptions')
+
+     # Load in inhaler mapping
+     with open('mappings/inhaler_mapping.json') as json_file:
+         inhaler_mapping = json.load(json_file)
+
+     for k, v in inhaler_mapping.items():
+         df[k + '_inhaler'] = df.PI_Approved_Name.str.contains(
+             '|'.join(v)).astype(int)
+
+     # Remove for now as empty
+     df = df.drop(['LABA-LAMA-ICS_inhaler', 'Ignore_inhaler'], axis=1)
+
+     return df
+
+
+ def add_eoy_column(df, dt_col, eoy_date):
+     """
+     Add EOY relative to user-specified end date
+     --------
+     :param df: dataframe
+     :param dt_col: date column in dataframe
+     :param eoy_date: EOY date from config
+     :return: updated df with EOY column added
+     """
+     # Needed to stop error with creating a new column
+     df = df.reset_index(drop=True)
+
+     # Add column with user-specified end of year date
+     end_date = pd.to_datetime(eoy_date)
+     end_month = end_date.month
+     end_day = end_date.day
+
+     # Add for every year
+     df['eoy'] = [date(y, end_month, end_day) for y in df[dt_col].dt.year]
+
+     # If dt_col falls after that year's EOY date, roll EOY forward a year
+     eoy_index = df.columns[df.columns == 'eoy']
+     adm_vs_eoy = df[dt_col] > df.eoy
+     row_index = df.index[adm_vs_eoy]
+     df.loc[row_index, eoy_index] = df[adm_vs_eoy].eoy + relativedelta(years=1)
+     df['eoy'] = pd.to_datetime(df.eoy)
+
+     return df
+
+
+ def calc_presc_per_year(df):
+     """
+     Reduce data to 1 row per year
+     --------
+     :param df: dataframe to be reduced
+     :return: reduced dataframe
+     """
+     print('Reducing to 1 row per year')
+
+     # Add end of year columns
+     eoy_cols = ['presc_to_date', 'days_since_rescue', 'rescue_to_date',
+                 'anxiety_depression_presc_to_date', 'rescue_date']
+     last = df.groupby(['SafeHavenID', 'eoy'])[eoy_cols].last()
+
+     # Total columns
+     sum_cols = ['SALBUTAMOL', 'SABA_inhaler', 'LABA_inhaler', 'LAMA_inhaler',
+                 'SAMA_inhaler', 'ICS_inhaler', 'LABA-ICS_inhaler',
+                 'LAMA +LABA-ICS_inhaler', 'SABA + SAMA_inhaler',
+                 'MCS_inhaler', 'rescue_meds', 'presc', 'anxiety_depression_presc']
+     total_cols = [col + '_per_year' for col in sum_cols]
+     total = df.groupby(['SafeHavenID', 'eoy'])[sum_cols].sum()
+     total.columns = total_cols
+
+     # Join together
+     results = last.join(total)
+
+     return results
+
+
+ def main():
+
+     # Load in config items
+     with open('../../../config.json') as json_config_file:
+         config = json.load(json_config_file)
+
+     # Load in data
+     presc_file = config['extract_data_path'] + 'Pharmacy_Cohort3R.csv'
+     presc = initialize_presc_data(presc_file)
+
+     # Save first date in dataset
+     data_path = config['model_data_path']
+     first_patient_appearance(presc, 'PRESC_DATE', 'presc', data_path)
+
+     # Add inhaler mapping
+     presc = add_inhaler_mappings(presc)
+
+     # Track salbutamol and rescue meds
+     presc = track_medication(presc)
+
+     # Drop columns
+     cols_2_drop = ['PI_Approved_Name', 'PI_BNF_Item_Code', 'code']
+     presc = presc.drop(cols_2_drop, axis=1)
+
+     # Add EOY column relative to user-specified date
+     presc = add_eoy_column(presc, 'PRESC_DATE', config['date'])
+
+     # Track rows which are prescriptions
+     presc['presc'] = 1
+
+     # Add any historical count columns
+     presc = presc.groupby('SafeHavenID').apply(
+         add_hist_adm_presc, 'presc', 'PRESC_DATE')
+     presc = presc.reset_index(drop=True)
+
+     # Save per event dataset
+     presc.to_pickle(data_path + 'validation_presc_proc.pkl')
+
+     # Reduce data to 1 row per year
+     presc_yearly = calc_presc_per_year(presc)
+
+     # Correct column names
+     presc_yearly.columns = correct_column_names(presc_yearly.columns, 'presc')
+
+     # Save data
+     presc_yearly.to_pickle(data_path + 'presc_proc.pkl')
+
+
+ main()
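The inhaler-mapping loop above flags each row whose approved drug name matches any name in a mapping list. A toy sketch; the drug names and mapping below are illustrative, not the contents of `inhaler_mapping.json`:

```python
import pandas as pd

# Hypothetical mapping and prescription rows
inhaler_mapping = {'SABA': ['SALBUTAMOL', 'TERBUTALINE'],
                   'ICS': ['BECLOMETASONE']}
df = pd.DataFrame({'PI_Approved_Name': ['SALBUTAMOL SULFATE',
                                        'BECLOMETASONE DIPROPIONATE',
                                        'PARACETAMOL']})

# One 0/1 column per inhaler type, matching any listed name as a substring
for k, v in inhaler_mapping.items():
    df[k + '_inhaler'] = df.PI_Approved_Name.str.contains(
        '|'.join(v)).astype(int)
```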
training/src/processing/utils/README.md ADDED
@@ -0,0 +1,11 @@
+ # Processing Utilities
+
+ This folder contains processing utilities called within the main processing scripts in the folder above.
+
+ - `adm/comorb/labs_processing.py` contain utilities for processing each specific type of data
+
+ - `adm_reduction.py` contains reduction functions required for processing admissions
+
+ - `common.py` functions are used across processing for all datasets
+
+ - `adm_common.py` functions are used in both admissions and comorbidities
training/src/processing/utils/__init__.py ADDED
@@ -0,0 +1 @@
+ # Empty file for folder to be recognised as module
training/src/processing/utils/adm_common.py ADDED
@@ -0,0 +1,77 @@
+ """
+ Utility functions common across admission processing
+ (admissions/comorbidities/gples)
+ """
+ import pandas as pd
+ from utils.common import read_data
+ from utils.adm_processing import (update_null_stay, calculate_total_stay,
+                                   search_diag)
+
+
+ def initialize_adm_data(adm_file):
+     """
+     Load in and convert admission dataset to correct format
+     --------
+     :param adm_file: admission data file name
+     :return: admission dataframe with correct column names and types
+     """
+     print('Loading admission data')
+
+     # Read in data
+     adm_cols = ['SafeHavenID', 'ETHGRP', 'ADMDATE', 'DISDATE', 'STAY',
+                 'DIAG1Desc', 'DIAG2Desc', 'DIAG3Desc', 'DIAG4Desc',
+                 'DIAG5Desc', 'DIAG6Desc']
+     adm_types = ['int', 'object', 'object', 'object', 'int',
+                  'str', 'str', 'str', 'str', 'str', 'str']
+     df = read_data(adm_file, adm_cols, adm_types)
+
+     # Drop duplicates - nulls needed in DIAGDesc columns
+     df = df.drop_duplicates()
+
+     # Convert date columns to correct type
+     df['ADMDATE'] = pd.to_datetime(df['ADMDATE'])
+     df['DISDATE'] = pd.to_datetime(df['DISDATE'])
+
+     return df
+
+
+ def correct_stays(df):
+     """
+     Fill any null STAY data and consolidate any transfer admissions into single
+     admission occurrences
+     --------
+     :param df: admission dataframe to be corrected
+     :return: admission dataframe with null stays filled and transfers combined
+     """
+     print('Correcting stays')
+
+     # Update any null STAY data using ADM and DIS dates
+     df = update_null_stay(df)
+
+     # Correct stays for patients passed across departments
+     df = df.sort_values(['SafeHavenID', 'ADMDATE', 'DISDATE'])
+     df = df.groupby('SafeHavenID').apply(calculate_total_stay)
+     df = df.reset_index(drop=True)
+
+     return df
+
+
+ def track_copd_resp(df):
+     """
+     Search for COPD and/or respiratory admissions
+     --------
+     :param df: admission dataframe to be updated
+     :return: updated dataframe with events tracked
+     """
+     print('Tracking events')
+
+     # Strip DIAGDesc columns
+     df = df.apply(lambda x: x.str.strip() if x.dtype == 'object' else x)
+
+     # Track COPD admissions
+     df = search_diag(df, 'copd')
+
+     # Track respiratory admissions
+     df = search_diag(df, 'resp')
+
+     return df
training/src/processing/utils/adm_processing.py ADDED
@@ -0,0 +1,146 @@
+ """
+ Admission processing utilities
+ """
+ import json
+ from utils.common import track_event
+
+
+ def update_null_stay(df):
+     """
+     Calculate length of stay based on ADM/DISDATE for null STAY values
+     --------
+     :param df: pandas dataframe to be updated
+     :return: updated dataframe
+     """
+     # Check for nulls
+     is_null = df.STAY.isnull()
+
+     # If null calculate total length of stay from the admission dates
+     if is_null.any():
+         stay = df.loc[is_null, 'DISDATE'] - df.loc[is_null, 'ADMDATE']
+         df.loc[is_null, 'STAY'] = stay.dt.days.astype(float)
+
+     return df
+
+
+ def calculate_total_stay(df):
+     """
+     Convert admissions with same ADMDATE as previous DISDATE to single
+     admission where patient has been transferred between departments
+     --------
+     :param df: pandas dataframe to be updated
+     :return: updated dataframe
+     """
+     df.reset_index(inplace=True, drop=True)
+     rows_to_drop = []
+
+     # If ADMDATE matches previous DISDATE, mark as transfer and combine
+     df['transfer'] = df.ADMDATE.eq(df.DISDATE.shift())
+     for index, row in df.iloc[1:].iterrows():
+         if row.transfer:
+             df.loc[index, 'ADMDATE'] = df.iloc[index - 1].ADMDATE
+             df.loc[index, 'STAY'] = row.STAY + df.iloc[index - 1].STAY
+             rows_to_drop.append(index - 1)
+
+     # Drop original individual rows in transfer
+     df.drop(rows_to_drop, inplace=True)
+
+     # Drop tracking column
+     df.drop('transfer', axis=1, inplace=True)
+
+     return df
+
+
+ def convert_ethgrp_desc(eth):
+     """
+     Find ethnic group based on given ETHGRP string
+     --------
+     :param eth: str ethnic group description in the style of SMR01 data
+     :return: string ethnicity
+     """
+     if ("White" in eth) | ("Irish" in eth) | ("Welsh" in eth) | ("English" in eth):
+         return "White"
+
+     elif eth.startswith("British"):
+         return "White"
+
+     elif "mixed" in eth:
+         return "Mixed"
+
+     elif (("Asian" in eth) | ("Pakistani" in eth) | ("Indian" in eth)
+           | ("Bangladeshi" in eth) | ("Chinese" in eth)):
+         return "Asian"
+
+     elif ("Black" in eth) | ("Caribbean" in eth) | ("African" in eth):
+         return "Black"
+
+     elif ("Arab" in eth) | ("other ethnic" in eth):
+         return "Other"
+
+     elif "Refused" in eth:
+         return "Refused"
+
+     else:
+         return "Unknown"
+
+
+ def mode_ethnicity(v, eth_col):
+     """
+     Select the most commonly occurring ethnicity for each patient in groupby
+     --------
+     :param v: pandas patient dataframe to be updated
+     :param eth_col: str ethnicity column
+     :return: updated subset of data with common ethnicity per ID
+     """
+     eth = v[eth_col]
+     n = eth.nunique()
+     has_unk = eth.str.contains('Unknown')
+     any_unk = has_unk.any()
+     wout_unk = ~has_unk
+     has_ref = eth.str.contains('Refused')
+     any_ref = has_ref.any()
+     wout_ref = ~has_ref
+
+     # Select ethnicities excluding 'Unknown' or 'Refused' where possible
+     if any_unk & any_ref & (n > 2):
+         eth = eth[wout_unk & wout_ref]
+     elif any_unk & (n > 1):
+         eth = eth[wout_unk]
+     elif any_ref & (n > 1):
+         eth = eth[wout_ref]
+
+     # Select the most commonly appearing ethnicity
+     main_eth = eth.mode().values[0]
+     v[eth_col] = main_eth
+
+     return v
+
+
+ def search_diag(df, typ):
+     """
+     Search diagnosis columns for descriptions indicative of copd or resp events
+     --------
+     :param df: dataframe to search
+     :param typ: 'copd', 'resp' or 'anxiety_depression'
+     :return: dataframe with column added tracking specific type of admission
+     """
+     # Columns to search
+     diag_cols = ['DIAG1Desc', 'DIAG2Desc', 'DIAG3Desc', 'DIAG4Desc',
+                  'DIAG5Desc', 'DIAG6Desc']
+
+     # Load mappings
+     with open('mappings/diag_copd_resp_desc.json') as json_file:
+         copd_resp_desc = json.load(json_file)
+
+     # Select mappings relevant to desired type of admission
+     desc = copd_resp_desc[typ]
+
+     # copd descriptions will only require searching a single specific phrase
+     single = typ == 'copd'
+
+     # Search columns and track
+     df[typ + '_event'] = df[diag_cols].apply(
+         lambda x: track_event(x, desc, single)).any(axis=1).astype(int)
+
+     return df
@@ -0,0 +1,65 @@
 
+ """
+ Admission reduction utilities
+ """
+ import pandas as pd
+ from datetime import date
+
+
+ def fill_missing_years(df):
+     """
+     Add admission data from years where patient is missing from the dataset
+     --------
+     :param df: dataframe to be updated
+     :return: dataframe with missing years added
+     """
+     df = df.sort_values('ADMDATE')
+     year_col = df.eoy.dt.year.tolist()
+     end_month = df.eoy.dt.month.iloc[0]
+     end_day = df.eoy.dt.day.iloc[0]
+
+     # We only want missing years
+     year_range = range(year_col[0] + 1, year_col[-1])
+     years = [y for y in year_range if y not in year_col]
+
+     # If any years missing add rows
+     if len(years) > 0:
+         sh_id = df.SafeHavenID.iloc[0]
+         eth_grp = df.eth_grp.iloc[0]
+         adm_dates = pd.to_datetime([date(y, end_month, end_day) for y in years])
+         data = {'SafeHavenID': sh_id, 'eth_grp': eth_grp, 'ADMDATE': adm_dates,
+                 'STAY': 0, 'copd_event': 0, 'resp_event': 0, 'eoy': adm_dates,
+                 'adm': 0, 'anxiety_depression_event': 0}
+         missed_years = pd.DataFrame(data)
+         df = pd.concat([df, missed_years]).sort_values('ADMDATE')
+
+     return df
+
+
+ def calc_adm_per_year(df):
+     """
+     Reduce data to 1 row per year
+     --------
+     :param df: dataframe to be reduced
+     :return: reduced dataframe
+     """
+     # Last EOY columns
+     eoy_cols = ['eth_grp', 'days_since_copd', 'days_since_resp', 'days_since_adm',
+                 'adm_to_date', 'copd_to_date', 'resp_to_date',
+                 'anxiety_depression_to_date', 'copd_date', 'resp_date', 'adm_date']
+     last = df.groupby(['SafeHavenID', 'eoy'])[eoy_cols].last()
+
+     # Average column
+     los = df.groupby(['SafeHavenID', 'eoy'])[['STAY']].mean()
+     los.columns = ['mean_los']
+
+     # Total columns
+     sum_cols = ['adm', 'copd_event', 'resp_event', 'anxiety_depression_event', 'STAY']
+     total_cols = ['adm_per_year', 'copd_per_year', 'resp_per_year',
+                   'anxiety_depression_per_year', 'total_hosp_days']
+     total = df.groupby(['SafeHavenID', 'eoy'])[sum_cols].sum()
+     total.columns = total_cols
+
+     # Join together
+     results = last.join(los).join(total)
+
+     return results
training/src/processing/utils/common.py ADDED
@@ -0,0 +1,132 @@
+ """
+ Utilities required across all processing scripts
+ """
+ import pandas as pd
+ import numpy as np
+
+
+ def read_data(file, cols, types):
+     """
+     Read in data source
+     --------
+     :param file: string filename
+     :param cols: string list of column names
+     :param types: string list of column types
+     :return: dataframe
+     """
+     schema = dict(zip(cols, types))
+     df = pd.read_csv(file, usecols=cols, encoding="cp1252", dtype=schema)
+
+     return df
+
+
+ def first_patient_appearance(df, dt_col, typ, data_path):
+     """
+     Save first appearance of patient in dataset
+     --------
+     :param df: dataframe to check
+     :param dt_col: date column to sort by
+     :param typ: type of dataset being used
+     :param data_path: path to data extracts
+     :return: None; dataframe of first dates saved to pickle
+     """
+     df = df.sort_values(dt_col)
+     df_first = df.groupby('SafeHavenID')[dt_col].first()
+     df_first = df_first.to_frame().reset_index()
+     df_first.columns = ['SafeHavenID', 'first_adm']
+     df_first.to_pickle(data_path + typ + '_first_dates.pkl')
+
+
+ def add_days_since_event(df, typ, dt_col):
+     """
+     Historical features: add days since features e.g. copd/resp/rescue
+     --------
+     :param df: dataframe to be updated
+     :param typ: 'rescue', 'adm', 'copd' or 'resp' feature to be created
+     :param dt_col: str date column name
+     :return: updated dataframe with historical column added
+     """
+     if typ == 'rescue':
+         event_col = 'rescue_meds'
+     elif typ == 'adm':
+         event_col = 'adm'
+     else:
+         event_col = typ + '_event'
+     date_col = typ + '_date'
+     days_col = 'days_since_' + typ
+     df[date_col] = df.apply(
+         lambda x: x[dt_col] if x[event_col] else np.nan, axis=1).ffill()
+     if df[date_col].isna().all():
+         df[days_col] = np.nan
+     else:
+         df[days_col] = (df.eoy - df[date_col]).dt.days
+
+     return df
+
+
+ def track_event(x, desc, single):
+     """
+     Fill nulls and search to see if x matches a description
+     --------
+     :param x: str list of features to track
+     :param desc: str list to compare
+     :param single: boolean for checking against a single description e.g.
+         "COPD" True otherwise False
+     :return: tracked feature list
+     """
+     x = x.fillna('')
+
+     # COPD only has a single description to search
+     if single:
+         result = [desc in s for s in x]
+
+     # Respiratory has a list of descriptions to search
+     else:
+         result = [s in desc for s in x]
+
+     return result
+
+
+ def add_hist_adm_presc(df, typ, dt_col):
+     """
+     Historical features: add days since and to-date features
+     --------
+     :param df: dataframe to be updated
+     :param typ: type of data - 'adm' or 'presc'
+     :param dt_col: string name of date column
+     :return: updated dataframe with historical columns added
+     """
+     if typ == 'presc':
+         df = df.sort_values(dt_col).reset_index(drop=True)
+         df = add_days_since_event(df, 'rescue', dt_col)
+         df['rescue_to_date'] = df.rescue_meds.cumsum()
+         df['anxiety_depression_presc_to_date'] = df.anxiety_depression_presc.cumsum()
+     else:
+         for col in ['adm', 'copd', 'resp']:
+             df = add_days_since_event(df, col, dt_col)
+         for col in ['copd', 'resp', 'anxiety_depression']:
+             df[col + '_to_date'] = df[col + '_event'].cumsum()
+
+     # Add counter for events to date
+     df[typ + '_to_date'] = df[typ].cumsum()
+
+     return df
+
+
+ def correct_column_names(cols, typ):
+     """
+     Convert column names to lower case and fill any spaces with underscores
+     --------
+     :param cols: string list of column names
+     :param typ: type of dataset being updated
+     :return: cleaned column names
+     """
+     print('Correcting column headers')
+
+     if typ == 'presc':
+         lower_cols = cols.str.replace('[+-]', ' ', regex=True).str.lower()
+         new_cols = ["_".join(col.split()) for col in lower_cols]
+     else:
+         new_cols = cols.str.lower().str.replace(' ', '_').tolist()
+
+     return new_cols
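For prescribing columns, `correct_column_names` turns `+`/`-` into spaces, lowercases, and collapses whitespace runs into single underscores. A quick sketch on two of the inhaler column names:

```python
import pandas as pd

# Same transform applied in the 'presc' branch of correct_column_names
cols = pd.Index(['LAMA +LABA-ICS_inhaler', 'SABA + SAMA_inhaler'])
lower_cols = cols.str.replace('[+-]', ' ', regex=True).str.lower()
new_cols = ['_'.join(col.split()) for col in lower_cols]
```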
training/src/processing/utils/comorb_processing.py ADDED
@@ -0,0 +1,20 @@
+ """
+ Comorbidities processing utilities
+ """
+ import pandas as pd
+
+
+ def diagnosis_mapping_lists(excel_file, sheet_name, diagnosis_names):
+     """
+     Create mapping between diagnoses and comorbidities
+     --------
+     :param excel_file: str filename for diagnosis mapping
+     :param sheet_name: str sheet name for diagnosis mapping
+     :param diagnosis_names: str list of diagnoses
+     :return: dictionary of diagnosis names and values
+     """
+     df_diag = pd.read_excel(excel_file, sheet_name, skiprows=range(0, 1))
+     df_lists = df_diag.T.values.tolist()
+     diag_lists = [[s.strip() for s in x if pd.notna(s)] for x in df_lists]
+
+     return dict(zip(diagnosis_names, diag_lists))
training/src/processing/utils/labs_processing.py ADDED
@@ -0,0 +1,16 @@
+ """
+ Labs processing utilities
+ """
+
+
+ def add_total_labs(df):
+     """
+     Historical features: to-date features
+     --------
+     :param df: dataframe to be updated
+     :return: updated dataframe with historical columns added
+     """
+     # Add counter for labs to date
+     df['labs_to_date'] = df.labs_to_date.cumsum()
+
+     return df
training/src/processing/utils/presc_common.py ADDED
@@ -0,0 +1,68 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import pandas as pd
2
+ from utils.common import read_data
3
+
4
+
5
+ steroid_codes = ['0603020T0AAACAC', '0603020T0AABKBK', '0603020T0AAAXAX',
+                  '0603020T0AAAGAG', '0603020T0AABHBH', '0603020T0AAACAC',
+                  '0603020T0AABKBK', '0603020T0AABNBN', '0603020T0AAAGAG',
+                  '0603020T0AABHBH']
+
+ antib_codes = ['0501013B0AAAAAA', '0501013B0AAABAB', '0501030I0AAABAB',
+                '0501030I0AAAAAA', '0501050B0AAAAAA', '0501050B0AAADAD',
+                '0501013K0AAAJAJ']
+
+ exac_meds = steroid_codes + antib_codes
+
+
+ def initialize_presc_data(presc_file):
+     """
+     Load the prescribing dataset into the correct format
+     --------
+     :param presc_file: prescribing data file name
+     :return: prescribing dataframe with correct column names and types
+     """
+     print('Loading prescribing data')
+
+     # Read in data
+     presc_cols = ['SafeHavenID', 'PRESC_DATE', 'PI_Approved_Name',
+                   'PI_BNF_Item_Code']
+     presc_types = ['int', 'object', 'str', 'str']
+     df = read_data(presc_file, presc_cols, presc_types)
+
+     # Drop any nulls or duplicates
+     df = df.dropna()
+     df = df.drop_duplicates()
+
+     # Convert date
+     df['PRESC_DATE'] = pd.to_datetime(df.PRESC_DATE)
+
+     return df
+
+
+ def track_medication(df):
+     """
+     Track salbutamol and rescue medication prescriptions
+     https://openprescribing.net/bnf/
+     --------
+     :param df: dataframe
+     :return: dataframe with tracked meds
+     """
+     print('Tracking medication')
+
+     # Extract BNF codes without brand info
+     df['code'] = df.PI_BNF_Item_Code.str[0:9]
+
+     # Add flag for salbutamol - marked important by Chris
+     df['SALBUTAMOL'] = (df.code == '0301011R0').astype(int)
+
+     # Track rescue meds
+     df['rescue_meds'] = df.PI_BNF_Item_Code.str.contains(
+         '|'.join(exac_meds)).astype(int)
+
+     # Track anxiety and depression medication, excluding DUMMY records
+     ad_bnf = ('040102', '0403', '0204000R0', '0408010AE')
+     ad_events = df.PI_BNF_Item_Code.str.startswith(ad_bnf).fillna(False)
+     drop_dummy = ~df.PI_Approved_Name.isin(['DUMMY', 'DUMMY REJECTED'])
+     df['anxiety_depression_presc'] = (drop_dummy & ad_events).astype(int)
+
+     return df
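The prefix-based flagging in `track_medication` can be sketched on a toy frame; the rows below are illustrative only, not study data:

```python
import pandas as pd

# Toy prescribing rows (illustrative codes, not study data)
df = pd.DataFrame({
    'PI_Approved_Name': ['SALBUTAMOL', 'SERTRALINE', 'DUMMY'],
    'PI_BNF_Item_Code': ['0301011R0AAAPAP', '0403030Q0AAAAAA',
                         '0403030Q0AAAAAA'],
})

# The first nine characters of a BNF item code identify the chemical,
# ignoring brand/pack information
df['code'] = df.PI_BNF_Item_Code.str[:9]
df['SALBUTAMOL'] = (df.code == '0301011R0').astype(int)

# Anxiety/depression prefixes, excluding placeholder DUMMY records
ad_bnf = ('040102', '0403', '0204000R0', '0408010AE')
ad_events = df.PI_BNF_Item_Code.str.startswith(ad_bnf)
not_dummy = ~df.PI_Approved_Name.isin(['DUMMY', 'DUMMY REJECTED'])
df['anxiety_depression_presc'] = (ad_events & not_dummy).astype(int)
```

The second row is flagged as an anxiety/depression prescription, while the third matches the prefix but is suppressed as a DUMMY record.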
training/src/reduction/README.md ADDED
@@ -0,0 +1,12 @@
+ # Reduction
+
+ This folder contains scripts for combining, reducing, filling and scaling processed EHR data for modelling.
+
+ Note that the scripts must be run in the below order:
+ 1. `combine.py` - combine datasets and perform any post-processing
+ 2. `post_proc_reduction.py` - combine columns to reduce zero-valued features
+ 3. `remove_ids.py` - remove receiver, scale-up and test IDs
+ 4. `clean_and_scale_train.py` - impute nulls and min-max scale training data
+ 5. `clean_and_scale_test.py` - impute nulls and min-max scale testing data
+
+ _NB: The data_type in `clean_and_scale_test.py` can be changed to rec, sup, val and test._
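A minimal driver for the ordering above might look like the sketch below. It assumes the scripts are launched from this folder and that each one exits non-zero on failure; the `'test'` argument is one of the data types listed in the NB:

```python
import subprocess
import sys

# The documented pipeline order; the final step takes a data_type
# argument (rec, sup, val or test)
STEPS = [
    ['combine.py'],
    ['post_proc_reduction.py'],
    ['remove_ids.py'],
    ['clean_and_scale_train.py'],
    ['clean_and_scale_test.py', 'test'],
]


def run_pipeline(steps=STEPS):
    for step in steps:
        # check=True aborts at the first failure so later stages never
        # run against stale intermediate files
        subprocess.run([sys.executable] + step, check=True)
```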
training/src/reduction/__init__.py ADDED
@@ -0,0 +1 @@
+ # Empty file for folder to be recognised as module
training/src/reduction/clean_and_scale_test.py ADDED
@@ -0,0 +1,173 @@
+ """
+ TESTING
+ Impute any null data, save ethnicity info for each ID and scale the
+ final dataset
+
+ NB: This script can be used for merged receiver, scale-up or testing data
+ """
+ import json
+ import sys
+ import joblib
+ import pandas as pd
+ import numpy as np
+ from numpy import loadtxt
+
+
+ ds_cols = ['days_since_copd_resp', 'days_since_adm', 'days_since_rescue']
+
+ null_cols = ['alt_med_2yr', 'ast_med_2yr', 'albumin_med_2yr',
+              'alkaline_phosphatase_med_2yr', 'basophils_med_2yr',
+              'c_reactive_protein_med_2yr', 'chloride_med_2yr',
+              'creatinine_med_2yr', 'eosinophils_med_2yr',
+              'estimated_gfr_med_2yr', 'haematocrit_med_2yr',
+              'haemoglobin_med_2yr', 'lymphocytes_med_2yr',
+              'mch_med_2yr', 'mean_cell_volume_med_2yr',
+              'monocytes_med_2yr', 'neutrophils_med_2yr',
+              'platelets_med_2yr', 'potassium_med_2yr',
+              'red_blood_count_med_2yr', 'sodium_med_2yr',
+              'total_bilirubin_med_2yr', 'urea_med_2yr',
+              'white_blood_count_med_2yr', 'neut_lymph_med_2yr',
+              'days_since_copd_resp', 'days_since_adm', 'days_since_rescue']
+
+ cols2drop = ['eth_grp', 'entry_dataset', 'first_entry', 'obf_dob',
+              'marital_status', 'label', 'simd_vigintile', 'simd_decile',
+              'simd_quintile', 'sex_bin']
+
+
+ def calc_age_bins_test(df, data_path):
+     """
+     Load training bin edges and assign them to the testing data
+     --------
+     :param df: dataframe to be updated
+     :param data_path: path to generated data
+     :return: updated dataframe
+     """
+     edges = loadtxt(data_path + 'age_bins_train.csv', delimiter=',')
+     # Use pd.cut with the saved training edges (rather than re-deriving
+     # quantiles from the test data) so test bins match the training bins
+     categories = pd.cut(
+         df['age'], bins=edges, labels=edges[1:], include_lowest=True)
+     df['age_bin'] = categories.astype(int)
+
+     return df
+
+
+ def create_label(df):
+     """
+     Create a label containing the age and sex bins of the data
+     --------
+     :param df: dataframe
+     :return: dataframe with label added
+     """
+     df['label'] = df['age_bin'].astype(str) + '_' + df['sex_bin'].astype(str)
+     df = df.drop('age_bin', axis=1)
+
+     return df
+
+
+ def fill_nulls(label, df, medians):
+     """
+     Fill any null values in testing/REC/SUP data with median values from
+     training data.
+     --------
+     :param label: string label containing age and sex bin values, e.g. '51_0'
+         for a male patient in the less-than-51 age bin
+     :param df: dataframe
+     :param medians: dataframe of training set medians for each label and
+         column
+     :return: filled dataframe for specified label
+     """
+     meds = medians[medians['label'] == label].iloc[0]
+     df_2_fill = df[df['label'] == label].copy()
+     for col in null_cols:
+         df_2_fill[col] = df_2_fill[col].fillna(meds[col])
+
+     return df_2_fill
+
+
+ def ds_fill_5year_test(df, col, max_vals):
+     """
+     Fill days_since_X columns where the patient has been in the dataset
+     less than 5 years
+     --------
+     :param df: dataframe to be updated
+     :param col: column to check
+     :param max_vals: series with columns and their max value from training
+     :return: dataframe with column nulls filled where patient has ggc_years < 5
+     """
+     df_5years = df.ggc_years < 5
+     df.loc[df_5years, col] = df.loc[df_5years, col].fillna(max_vals[col])
+
+     return df
+
+
+ def scale_data_test(df, scaler):
+     """
+     Min-max scale the final dataset
+     --------
+     :param df: dataframe to be scaled
+     :param scaler: scaler object fitted on the training data
+     :return: scaled dataset for modelling
+     """
+     all_cols = df.columns.drop(['SafeHavenID', 'eoy'])
+     data_scaled = scaler.transform(df[all_cols].to_numpy())
+     df_scaled = pd.DataFrame(data_scaled, columns=all_cols)
+     df_final = (df[['SafeHavenID', 'eoy']]
+                 .reset_index(drop=True)
+                 .join(df_scaled))
+
+     return df_final
+
+
+ def main():
+
+     # Load in config items
+     with open('../../../config.json') as json_config_file:
+         config = json.load(json_config_file)
+
+     # Get generated data path
+     data_path = config['model_data_path']
+
+     # Get data type from the command line (rec, sup, val or test)
+     data_type = sys.argv[1]
+
+     # Read in data
+     df = pd.read_pickle(data_path + 'merged_' + data_type + '.pkl')
+
+     # Load training age groups and apply to data
+     df = calc_age_bins_test(df, data_path)
+
+     # Load in training median for each age-bin/sex-bin labelled group
+     df_medians = pd.read_pickle(data_path + 'medians.pkl')
+     df_medians = df_medians.reset_index()
+     df_medians = create_label(df_medians)
+     df = create_label(df)
+     labels = df_medians['label']
+
+     # Fill null days_since columns for patients with ggc_years < 5
+     max_vals = pd.read_pickle(data_path + 'maxs.pkl')
+     for col in ds_cols:
+         df = ds_fill_5year_test(df, col, max_vals)
+
+     # Fill remaining nulls using training medians
+     df_filled = pd.concat([fill_nulls(x, df, df_medians) for x in labels])
+
+     # Convert ds_cols from timedelta to int days
+     day = np.timedelta64(1, 'D')
+     for col in ds_cols:
+         df_filled[col] = (df_filled[col] / day).astype(int)
+
+     # Save processed data before scaling
+     df_filled.to_pickle(data_path + 'filled_' + data_type + '.pkl')
+
+     # Drop non-modelling columns
+     df_filled = df_filled.drop(cols2drop, axis=1)
+
+     # Load in min-max scaler from training set
+     scaler = joblib.load(data_path + 'min_max_scaler_train.pkl')
+     df_filled = scale_data_test(df_filled, scaler)
+
+     # Save final dataset
+     df_filled.to_pickle(data_path + 'min_max_' + data_type + '.pkl')
+
+
+ if __name__ == '__main__':
+     main()
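The key step in `calc_age_bins_test` is reusing the training quantile edges rather than re-binning the test ages from scratch. With toy values in place of the real age data:

```python
import pandas as pd

# Training: derive decile edges from the training ages (toy values)
train_ages = pd.Series(range(40, 90))
_, edges = pd.qcut(train_ages, q=10, retbins=True)

# Testing: assign the *training* edges with pd.cut, so a test patient
# lands in the same bin a training patient of that age would
test_ages = pd.Series([41, 67, 88])
bins = pd.cut(test_ages, bins=edges, labels=edges[1:], include_lowest=True)
age_bin = bins.astype(float)
```

Calling `pd.qcut` again on the test ages would instead derive new quantiles from the test distribution, so the two datasets would no longer share a binning.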
training/src/reduction/clean_and_scale_train.py ADDED
@@ -0,0 +1,171 @@
+ """
+ TRAIN
+ Impute any null data, save ethnicity info for each ID and scale the
+ final dataset
+ """
+ import json
+ import joblib
+ import pandas as pd
+ import numpy as np
+ from numpy import savetxt
+ from sklearn.preprocessing import MinMaxScaler
+ from utils.reduction import calc_ds_med
+
+
+ demo_cols = ['age_bin', 'sex_bin']
+
+ ds_cols = ['days_since_copd_resp', 'days_since_adm', 'days_since_rescue']
+
+ null_cols = ['alt_med_2yr', 'ast_med_2yr', 'albumin_med_2yr',
+              'alkaline_phosphatase_med_2yr', 'basophils_med_2yr',
+              'c_reactive_protein_med_2yr', 'chloride_med_2yr',
+              'creatinine_med_2yr', 'eosinophils_med_2yr',
+              'estimated_gfr_med_2yr', 'haematocrit_med_2yr',
+              'haemoglobin_med_2yr', 'lymphocytes_med_2yr',
+              'mch_med_2yr', 'mean_cell_volume_med_2yr',
+              'monocytes_med_2yr', 'neutrophils_med_2yr',
+              'platelets_med_2yr', 'potassium_med_2yr',
+              'red_blood_count_med_2yr', 'sodium_med_2yr',
+              'total_bilirubin_med_2yr', 'urea_med_2yr',
+              'white_blood_count_med_2yr', 'neut_lymph_med_2yr']
+
+ cols2drop = ['eth_grp', 'entry_dataset', 'first_entry', 'obf_dob',
+              'sex_bin', 'marital_status', 'age_bin',
+              'days_since_copd_resp_med', 'days_since_adm_med',
+              'days_since_rescue_med', 'simd_vigintile', 'simd_decile',
+              'simd_quintile']
+
+
+ def calc_age_bins_train(df, data_path):
+     """
+     Split ages into 10 bins and save the bin edges for filling test data
+     --------
+     :param df: dataframe to be updated
+     :param data_path: path to generated data
+     :return: updated dataframe
+     """
+     # Split the age column into 10 quantile buckets, labelling each bin
+     # with its upper edge
+     _, edges = pd.qcut(df['age'], q=10, precision=0, retbins=True)
+     categories = pd.qcut(df['age'], q=10, precision=0, labels=edges[1:])
+     df['age_bin'] = categories.astype(int)
+
+     # Save the bin edges for the test data
+     savetxt(data_path + 'age_bins_train.csv', edges, delimiter=',')
+
+     return df
+
+
+ def calc_df_med(df, data_path):
+     """
+     Calculate the medians for all columns in the dataset
+     --------
+     :param df: dataframe to update
+     :param data_path: path to generated data
+     :return: dataframe with days_since median columns added
+     """
+     # Calculate the median for all columns except SafeHavenID and eoy
+     all_cols = df.columns.drop(['SafeHavenID', 'eoy'])
+     df_median = df[all_cols].groupby(demo_cols).median()
+
+     # Calculate medians for ds_cols
+     ds_med = df[demo_cols + ds_cols].groupby(demo_cols).apply(calc_ds_med)
+
+     # Join ds_cols medians to the median table
+     df_median = df_median.join(ds_med)
+
+     # Save medians for imputing testing data
+     df_median.to_pickle(data_path + 'medians.pkl')
+
+     # Rename and add to the original dataframe
+     ds_med.columns += '_med'
+     df = df.join(ds_med, on=demo_cols)
+
+     return df
+
+
+ def ds_fill_5year_train(df, col):
+     """
+     Fill days_since_X columns where the patient has been in the dataset
+     less than 5 years
+     --------
+     :param df: dataframe to be updated
+     :param col: column to check
+     :return: dataframe with column nulls filled where patient has ggc_years < 5
+     """
+     df_5years = df.ggc_years < 5
+     df.loc[df_5years, col] = df.loc[df_5years, col].fillna(df[col].max())
+
+     return df
+
+
+ def scale_data_train(df, data_path, scaler):
+     """
+     Min-max scale the final dataset
+     --------
+     :param df: dataframe to be scaled
+     :param data_path: path to generated data
+     :param scaler: scaler object to fit and apply to df
+     :return: scaled dataset for modelling
+     """
+     all_cols = df.columns.drop(['SafeHavenID', 'eoy'])
+     data_scaled = scaler.fit_transform(df[all_cols].to_numpy())
+     df_scaled = pd.DataFrame(data_scaled, columns=all_cols)
+     df_final = (df[['SafeHavenID', 'eoy']]
+                 .reset_index(drop=True)
+                 .join(df_scaled))
+
+     # Save the fitted scaler for testing
+     joblib.dump(scaler, data_path + 'min_max_scaler_train.pkl')
+
+     return df_final
+
+
+ def main():
+
+     # Load in config items
+     with open('../../../config.json') as json_config_file:
+         config = json.load(json_config_file)
+     data_path = config['model_data_path']
+
+     # Read in combined data
+     df = pd.read_pickle(data_path + 'merged_train.pkl')
+
+     # Calculate age bins
+     df = calc_age_bins_train(df, data_path)
+
+     # Calculate medians for each column for imputation
+     df = calc_df_med(df, data_path)
+
+     # Fill null columns with the group median (transform keeps alignment)
+     df[null_cols] = df.groupby(demo_cols)[null_cols].transform(
+         lambda x: x.fillna(x.median()))
+
+     # Fill null days_since columns and convert to int days
+     day = np.timedelta64(1, 'D')
+     df[ds_cols].max().to_pickle(data_path + 'maxs.pkl')
+     for col in ds_cols:
+         df = ds_fill_5year_train(df, col)
+         df[col] = df[col].fillna(df[col + '_med'])
+         df[col] = (df[col] / day).astype(int)
+
+     # Save processed data before scaling
+     df.to_pickle(data_path + 'filled_train.pkl')
+
+     # Drop non-modelling columns
+     df = df.drop(cols2drop, axis=1)
+
+     # Initialize scaler
+     scaler = MinMaxScaler()
+
+     # Scale final dataset
+     df_final = scale_data_train(df, data_path, scaler)
+
+     # Save final dataset
+     df_final.to_pickle(data_path + 'min_max_train.pkl')
+
+
+ if __name__ == '__main__':
+     main()
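The fit-on-train / transform-on-test contract between `scale_data_train` and `scale_data_test` reduces to the usual scikit-learn pattern; with toy numbers in place of the feature matrix:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Fit on training data only; the fitted object is what gets pickled
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[5.0], [12.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)
# The test set is transformed with the training min/max, never refitted,
# so values outside the training range can fall outside [0, 1]
test_scaled = scaler.transform(test)
```

Refitting on the test set would silently give the model differently scaled features at evaluation time, which is exactly what saving and reloading `min_max_scaler_train.pkl` avoids.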
training/src/reduction/combine.py ADDED
@@ -0,0 +1,217 @@
+ """
+ To Do:
+     - Refactor script to be more readable / smaller main function
+ """
+ import json
+ import pandas as pd
+ import numpy as np
+ from datetime import timedelta
+
+
+ def read_pkl_data(dataset, data_path, path_type):
+     """
+     Read in a pickled dataset
+     --------
+     :param dataset: type of dataset to read in
+     :param data_path: path to generated data
+     :param path_type: type of path to read from ('data' or 'date')
+     :return: dataframe
+     """
+     print('Reading in ' + dataset)
+
+     file_path = data_path + dataset
+     if path_type == 'data':
+         file_path += '_proc.pkl'
+     else:
+         file_path += '_first_dates.pkl'
+
+     return pd.read_pickle(file_path)
+
+
+ def fill_eth_grp_data(df):
+     """
+     Fill nulls in the eth_grp column introduced in joining
+     :param df: dataframe to update
+     :return: filled dataframe
+     """
+     df['eth_grp'] = df.groupby('SafeHavenID').eth_grp.apply(
+         lambda x: x.ffill().bfill())
+     df['eth_grp'] = df['eth_grp'].fillna('Unknown')
+
+     return df
+
+
+ def fill_to_date_columns(df):
+     """
+     Fill nulls in to_date columns introduced in joining
+     :param df: dataframe to update
+     :return: filled dataframe
+     """
+     to_date_cols = ['adm_to_date', 'copd_to_date', 'resp_to_date',
+                     'presc_to_date', 'rescue_to_date', 'labs_to_date',
+                     'anxiety_depression_to_date',
+                     'anxiety_depression_presc_to_date']
+     df[to_date_cols] = df.groupby('SafeHavenID')[to_date_cols].apply(
+         lambda x: x.ffill().fillna(0))
+
+     return df
+
+
+ def fill_yearly_columns(df):
+     """
+     Fill nulls in yearly columns introduced in joining
+     :param df: dataframe to update
+     :return: filled dataframe
+     """
+     zero_cols = ['adm_per_year', 'total_hosp_days', 'mean_los',
+                  'copd_per_year', 'resp_per_year', 'comorb_per_year',
+                  'salbutamol_per_year',
+                  'saba_inhaler_per_year', 'laba_inhaler_per_year',
+                  'lama_inhaler_per_year', 'sama_inhaler_per_year',
+                  'ics_inhaler_per_year', 'laba_ics_inhaler_per_year',
+                  'lama_laba_ics_inhaler_per_year', 'saba_sama_inhaler_per_year',
+                  'mcs_inhaler_per_year', 'rescue_meds_per_year',
+                  'presc_per_year', 'labs_per_year',
+                  'anxiety_depression_per_year', 'anxiety_depression_presc_per_year']
+     df[zero_cols] = df[zero_cols].fillna(0)
+
+     return df
+
+
+ def fill_days_since(df, typ):
+     """
+     Fill days_since_copd/resp/rescue/adm
+     :param df: dataframe to update
+     :param typ: type of feature to fill ('copd', 'resp', 'rescue', 'adm')
+     :return: filled dataframe
+     """
+     df['days_since_' + typ] = df.eoy - df[typ + '_date'].ffill()
+
+     return df
+
+
+ def process_first_dates(df):
+     """
+     Process the dataframe containing each patient's first date in the
+     health board region
+     --------
+     :param df: dataframe to process
+     :return: processed dataframe
+     """
+     df = df.set_index('SafeHavenID')
+     entry_dataset = df.idxmin(axis=1).apply(lambda x: x.split('_')[1])
+     first_entry = df.min(axis=1)
+     df['entry_dataset'] = entry_dataset
+     df['first_entry'] = first_entry
+     df_reduced = df[['entry_dataset', 'first_entry']].reset_index()
+
+     return df_reduced
+
+
+ def find_closest_simd(v):
+     """
+     Find the closest SIMD release for each row 'v'
+     --------
+     :param v: row of data from apply statement
+     :return: row with simd_quintile/decile/vigintile taken from the most
+         recent SIMD release at or before the row's end of year
+     """
+     simd_years = [2009, 2012, 2016]
+     bools = [v.eoy.year >= year for year in simd_years]
+     if any(bools):
+         simd_year = str(simd_years[np.where(bools)[0][-1]])
+         v['simd_quintile'] = v['simd_' + simd_year + '_quintile']
+         v['simd_decile'] = v['simd_' + simd_year + '_decile']
+         v['simd_vigintile'] = v['simd_' + simd_year + '_vigintile']
+     else:
+         v['simd_quintile'] = np.nan
+         v['simd_decile'] = np.nan
+         v['simd_vigintile'] = np.nan
+
+     return v
+
+
+ def main():
+
+     # Load in config items
+     with open('../../../config.json') as json_config_file:
+         config = json.load(json_config_file)
+     data_path = config['model_data_path']
+
+     # Read in data
+     adm = read_pkl_data('adm', data_path, 'data')
+     comorb = read_pkl_data('comorb', data_path, 'data')
+     presc = read_pkl_data('presc', data_path, 'data')
+     labs = read_pkl_data('labs', data_path, 'data')
+     demo = read_pkl_data('demo', data_path, 'data')
+
+     # Join datasets
+     df = adm.join(
+         comorb, how='left').join(
+         presc, how='outer').join(
+         labs, how='outer')
+     df = df.reset_index()
+
+     # Fill nulls introduced in joining
+     print('Filling data')
+     df = fill_eth_grp_data(df)
+     df = fill_to_date_columns(df)
+     df = fill_yearly_columns(df)
+
+     # Fill days_since columns
+     for typ in ['copd', 'resp', 'rescue', 'adm']:
+         df = df.groupby('SafeHavenID').apply(fill_days_since, typ)
+
+     # Reduce to a single column
+     ds_cols = ['days_since_copd', 'days_since_resp']
+     df['days_since_copd_resp'] = df[ds_cols].min(axis=1)
+
+     # Read in first date data
+     print('Adding first dates')
+     adm_dates = read_pkl_data('adm', data_path, 'date')
+     presc_dates = read_pkl_data('presc', data_path, 'date')
+     labs_dates = read_pkl_data('labs', data_path, 'date')
+
+     # Merge first date data
+     first_dates = pd.merge(
+         pd.merge(adm_dates, presc_dates, how='outer', on='SafeHavenID'),
+         labs_dates, how='outer', on='SafeHavenID')
+
+     # Save first dates if needed
+     first_dates.to_pickle(data_path + 'overall_first_dates.pkl')
+
+     # Process first dates
+     date_data = process_first_dates(first_dates)
+
+     # Merge first dates data with dataframe
+     print('Merging data')
+     df_merged = pd.merge(df, date_data, on='SafeHavenID', how='inner')
+
+     # Add years in health board region
+     ggc_years = (df_merged.eoy - df_merged.first_entry) / np.timedelta64(1, 'Y')
+     df_merged['ggc_years'] = round(ggc_years)
+
+     # Merge demographics
+     df_merged = pd.merge(df_merged, demo, on='SafeHavenID')
+
+     # Calculate age relative to end of year
+     dt_diff = df_merged.eoy - pd.to_datetime(df_merged.obf_dob)
+     df_merged['age'] = dt_diff // timedelta(days=365.2425)
+
+     # Find closest SIMD
+     df_merged = df_merged.apply(find_closest_simd, axis=1)
+
+     # Drop additional columns
+     cols2drop = ['copd_date', 'resp_date', 'adm_date', 'rescue_date',
+                  'simd_2009_quintile', 'simd_2009_decile',
+                  'simd_2009_vigintile', 'simd_2012_quintile',
+                  'simd_2012_decile', 'simd_2012_vigintile',
+                  'simd_2016_quintile', 'simd_2016_decile',
+                  'simd_2016_vigintile', 'days_since_copd',
+                  'days_since_resp']
+     df_merged = df_merged.drop(cols2drop, axis=1)
+
+     # Save dataset
+     df_merged.to_pickle(data_path + 'merged_full.pkl')
+
+
+ if __name__ == '__main__':
+     main()
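The release-selection logic in `find_closest_simd` picks the most recent SIMD year at or before each row's end of year. In isolation, that rule is simply:

```python
import pandas as pd

# SIMD release years handled by combine.py
SIMD_YEARS = [2009, 2012, 2016]


def closest_simd_year(eoy):
    """Most recent SIMD release at or before eoy, else None."""
    candidates = [year for year in SIMD_YEARS if eoy.year >= year]
    return candidates[-1] if candidates else None
```

So a row with end of year 2013 picks the 2012 release, a 2016 row picks 2016, and pre-2009 rows get no SIMD value (NaN in the full function).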
training/src/reduction/post_proc_reduction.py ADDED
@@ -0,0 +1,37 @@
+ import json
+ import pandas as pd
+
+
+ single_inhaler = ['saba_inhaler_per_year', 'laba_inhaler_per_year',
+                   'lama_inhaler_per_year', 'sama_inhaler_per_year',
+                   'ics_inhaler_per_year', 'mcs_inhaler_per_year']
+ double_inhaler = ['laba_ics_inhaler_per_year', 'saba_sama_inhaler_per_year']
+ triple_inhaler = 'lama_laba_ics_inhaler_per_year'
+ adm_cols = ['copd_per_year', 'resp_per_year']
+
+
+ def main():
+
+     # Load in config items
+     with open('../../../config.json') as json_config_file:
+         config = json.load(json_config_file)
+     data_path = config['model_data_path']
+
+     # Read in original data before scaling
+     df = pd.read_pickle(data_path + 'merged_full.pkl')
+
+     # Create new reduced columns
+     df['single_inhaler'] = df[single_inhaler].sum(axis=1)
+     df['double_inhaler'] = df[double_inhaler].sum(axis=1)
+     df['triple_inhaler'] = df[triple_inhaler]
+     df['copd_resp_per_year'] = df[adm_cols].sum(axis=1)
+
+     # Drop original columns
+     cols2drop = single_inhaler + double_inhaler + [triple_inhaler] + adm_cols
+     df = df.drop(cols2drop, axis=1)
+
+     # Save data
+     df.to_pickle(data_path + 'merged.pkl')
+
+
+ if __name__ == '__main__':
+     main()
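The column-collapsing step above amounts to a row-wise sum over the sparse per-year counts followed by dropping the originals. On a toy frame mirroring the `single_inhaler` reduction:

```python
import pandas as pd

# Two sparse per-year counts collapsed into one combined feature
# (toy column values, mirroring the single_inhaler reduction)
df = pd.DataFrame({
    'saba_inhaler_per_year': [2, 0, 0],
    'laba_inhaler_per_year': [0, 1, 0],
})
single = ['saba_inhaler_per_year', 'laba_inhaler_per_year']
df['single_inhaler'] = df[single].sum(axis=1)
df = df.drop(single, axis=1)
```

Summing across the group keeps the total prescribing signal while removing columns that are zero for most patients.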