Medical Concept Representation Learning from Electronic Health Records and its Application on Heart Failure Prediction

Edward Choi*, MS; Andy Schuetz**, PhD; Walter F. Stewart**, PhD; Jimeng Sun*, PhD

*Georgia Institute of Technology, Atlanta, USA

**Research Development & Dissemination, Sutter Health, Walnut Creek, USA

Corresponding Author: Jimeng Sun

Georgia Institute of Technology

266 Ferst Drive, Atlanta, GA 30313

Tel: 404.894.0482

E-mail: jsun@cc.gatech.edu

Keywords:

Neural Networks, Representation learning, Predictive modeling, Heart Failure prediction.

Word Count: 3600

ABSTRACT

Objective: To transform heterogeneous clinical data from electronic health records into clinically meaningful constructed features using a data-driven method that relies, in part, on temporal relations among the data.

Materials and Methods: Clinically meaningful representations of medical concepts and patients are key to health analytic applications. Most existing approaches directly construct features mapped to raw data (e.g., ICD or CPT codes), or utilize an ontology mapping such as SNOMED codes. However, none of the existing approaches leverage EHR data directly to learn such concept representations. We propose a new way to represent heterogeneous medical concepts (e.g., diagnoses, medications and procedures) based on co-occurrence patterns in longitudinal electronic health records. The intuition behind the method is to map medical concepts that co-occur closely in time to similar concept vectors, so that the distance between them is small. We also derive a simple method to construct patient vectors from the related medical concept vectors.

Results: For qualitative evaluation, we study similar medical concepts across diagnoses, medications and procedures. In quantitative evaluation, our proposed representation significantly improves predictive modeling performance for onset of heart failure (HF): classification methods (e.g., logistic regression, neural network, support vector machine and K-nearest neighbors) achieve up to a 23% improvement in area under the ROC curve (AUC) using the proposed representation.

Conclusion: We proposed an effective method for patient and medical concept representation learning. The resulting representation maps relevant concepts close together and also improves predictive modeling performance.

Introduction

Growth in the use of electronic health records (EHR) in health care delivery is opening unprecedented opportunities to predict patient risk, understand what works best for a given patient, and personalize clinical decision-making. But raw EHR data, represented by a heterogeneous mix of elements (e.g., clinical measures, diagnoses, medications, procedures) and voluminous unstructured content, may not be optimal for analytic uses or even for clinical care. While higher order clinical features (e.g., disease phenotypes) are intuitively more meaningful and can reduce data volume, they may fail to capture meaningful information inherent to patient data. We explored whether novel data-driven methods that rely on the temporal occurrence of EHR data elements could yield higher order, intuitively interpretable features that both capture pathophysiologic relations inherent to the data and improve the performance of predictive models.

Growth in the use of EHRs is raising fundamental questions about optimal ways to represent structured and unstructured data. Medical ontologies such as SNOMED, RxNorm and LOINC offer structured hierarchical means of compressing data and of understanding relations among data from different domains (e.g., disease diagnoses, labs, prescriptions). But these ontologies do not offer a means of extracting meaningful relations inherent to longitudinal patient data. Scalable methods that can detect pathophysiologic relations inherent to longitudinal EHR data and construct intuitive features may accelerate more effective use of EHR data in clinical care and advance the performance of predictive analytics.

The abstract concepts inherent to existing ontologies do not provide a means to connect elements in different domains to common underlying pathophysiologic constructs that are represented by how data elements co-occur in time. The data-driven approach we developed logically organizes data into higher order constructs. Heterogeneous medical data were mapped to a low-dimensional space that accounted for temporal clustering of similar concepts (e.g., A1c lab test, ICD-9 code for diabetes, prescription for metformin). Co-occurring clusters (e.g., diabetes and peripheral neuropathy) were then identified and formed into higher order pathophysiologic feature sets organized by prevalence.

We propose to learn such a medical concept representation from longitudinal EHR data based on a state-of-the-art neural network model. We also propose an efficient way to derive a patient representation from the medical concept representation (or medical concept vectors). We calculated, for a set of diseases, their closest diseases, medications and procedures to demonstrate the clinical knowledge captured by the medical concept representations. We use the learned representations for the heart failure prediction task, where a significant performance improvement of up to 23% in AUC is obtained across many classification models (logistic regression: AUC 0.766 to 0.791, SVM: AUC 0.736 to 0.791, neural network: AUC 0.779 to 0.814, KNN: AUC 0.637 to 0.785).

BACKGROUND

Representation Learning in Natural Language Processing

Recently, neural network based representation learning has shown success in many fields such as computer vision [1] [2] [3], audio processing [4] [5] and natural language processing (NLP) [6] [7] [8]. We discuss representation learning in NLP in particular, as our proposed method is based on Skip-gram [6] [9], a popular method for learning word representations. Mikolov et al. [6] [9] proposed Skip-gram, a simple neural network based model that can learn real-valued multi-dimensional vectors capturing the relations between words by training on massive amounts of text. The trained real-valued vectors will have similar values for syntactically and semantically close words such as dog and cat or would and could, but distinct values for words that are not. Pennington et al. [10] proposed GloVe, a word representation algorithm based on the global co-occurrence matrix. While GloVe and Skip-gram essentially achieve the same goal through different approaches, GloVe is computationally faster than Skip-gram, as it precomputes co-occurrence information before the actual learning. Skip-gram, however, requires fewer hyper-parameters to tune than GloVe, and generally shows better performance [11].

Just as natural language text can be seen as a sequence of words, medical records such as diagnoses, medications and procedures can be seen as a sequence of codes over time. In this work, we propose a framework for mapping raw medical concepts (e.g., ICD-9, CPT) to related concept vectors using Skip-gram, and validate the utility of the resulting medical concept vectors.

Representation Learning in the Clinical Field

A few researchers have recently applied representation learning in the clinical field. Minnaro-Gimenez et al. [12] learned representations of medical terms by applying Skip-gram to various medical texts collected from PubMed, Merck Manuals, Medscape and Wikipedia. De Vine et al. [13] learned representations of UMLS concepts from free-text patient records and medical journal abstracts. They first pre-processed the text to map words to UMLS concepts, then applied Skip-gram to learn the representations of the concepts. More recently, Choi et al. [14] applied Skip-gram to a structured dataset from a health insurance company, consisting of patient visit records with diagnosis codes (ICD-9), lab test results (LOINC), and drug usage (NDC). Their goal, to learn efficient representations of medical concepts, partially overlaps with ours. Our study, however, focuses on learning the representations of medical concepts, using them to generate patient representations, and applying them to a real-world prediction problem to demonstrate the improved performance provided by efficient representation learning.

MATERIALS AND METHODS


graph LR
    subgraph Training_Phase [Training Phase]
        EHR[(EHR)] --> MCLR[Medical Concept Representation Learning]
        MCLR --> PR[Patient Representation Construction]
        PR --> HFPTM[Heart Failure Prediction Model Training]
    end
    subgraph Prediction_Phase [Prediction Phase]
        Patient((Patient)) -- Mapping --> MCV[Medical Concept Vectors]
        MCV -- Aggregation --> PV[Patient Vectors]
        PV -- Input --> Model[Model]
        Model -- Output --> HFRRS[Heart Failure Risk Score]
    end
    MCLR --> MCV
    PR --> PV
    HFPTM --> Model
  

The flowchart illustrates the proposed method, divided into two main phases: the Training Phase and the Prediction Phase. In the Training Phase, an EHR dataset is used for Medical Concept Representation Learning, which then informs Patient Representation Construction. This patient representation is used to train a Heart Failure Prediction Model. In the Prediction Phase, a patient's medical record is mapped to Medical Concept Vectors, which are then aggregated to form Patient Vectors. These vectors are input into the trained Model to produce the Heart Failure Risk Score.

Figure 1. Flowchart of the proposed method

In Figure 1, we give a high-level overview of the steps we take to perform HF prediction. In the training phase, we first train medical concept vectors from the EHR dataset using Skip-gram. Then, we construct the patient representation using the medical concept vectors. The patient representation is then used to train heart failure prediction models using various classifiers, namely logistic regression, support vector machine (SVM), a multi-layer perceptron with one hidden layer (MLP) and a K-nearest neighbors classifier (KNN). In the prediction phase, we map the medical record of a patient to medical concept vectors and generate patient vectors by aggregating the concept vectors. Then we plug the patient vectors into the trained model, which in turn generates the risk score for heart failure.

In the following sections, we will describe medical concept representation learning and patient representation construction in more detail.

Medical Concept Representation Learning

Figure 2 illustrates two different ways to represent medical diagnoses. Part (a) shows one-hot encoding, where each diagnosis is represented by an $N$ -dimensional vector. For example, Bronchitis is represented as $[1, 0, 0, 0, 0, \dots, 0, 0, 0]$ , Pneumonia as $[0, 1, 0, 0, 0, \dots, 0, 0, 0]$ , and Obesity as $[0, 0, 1, 0, 0, \dots, 0, 0, 0]$ . This means the difference between Bronchitis and Pneumonia is the same as the difference between Pneumonia and Obesity. Part (b) shows a better representation using $D$ -dimensional vectors. For example, Bronchitis is represented as $[0.4, -0.2, \dots, 0.2]$ , Pneumonia as $[0.3, -0.3, \dots, 0.1]$ , and Obesity as $[-0.7, 1.4, \dots, 1.2]$ . This representation captures the latent relationships between diagnoses more effectively.

N-dimensional vector D-dimensional vector
Bronchitis: [1, 0, 0, 0, 0, \dots, 0, 0, 0] [0.4, -0.2, \dots, 0.2]
Pneumonia: [0, 1, 0, 0, 0, \dots, 0, 0, 0] [0.3, -0.3, \dots, 0.1]
Obesity: [0, 0, 1, 0, 0, \dots, 0, 0, 0] [-0.7, 1.4, \dots, 1.2]
Cataract: [0, 0, 0, 0, 0, \dots, 0, 0, 1] [1.2, 0.8, \dots, 1.5]

(a) One-hot encoding for diagnoses      (b) A better representation of diagnoses

Figure 2. Two different representations of diagnoses. Typically, raw data dimensionality $N(\sim 10,000)$ is much larger than concept dimensionality $D(50\sim 1,000)$
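The contrast in Figure 2 can be checked numerically. The following toy sketch uses the illustrative values from Figure 2(b), truncated to three dimensions; these are not trained vectors:

```python
import numpy as np

# One-hot encoding: every pair of distinct diagnoses is equally far apart.
bronchitis_oh = np.array([1, 0, 0])
pneumonia_oh  = np.array([0, 1, 0])
obesity_oh    = np.array([0, 0, 1])

d1 = np.linalg.norm(bronchitis_oh - pneumonia_oh)  # clinically related pair
d2 = np.linalg.norm(pneumonia_oh - obesity_oh)     # unrelated pair
assert d1 == d2  # one-hot distances carry no clinical similarity

# Learned D-dimensional vectors (toy values taken from Figure 2(b)):
bronchitis = np.array([0.4, -0.2, 0.2])
pneumonia  = np.array([0.3, -0.3, 0.1])
obesity    = np.array([-0.7, 1.4, 1.2])

# Related diagnoses are now closer than unrelated ones.
print(np.linalg.norm(bronchitis - pneumonia) < np.linalg.norm(pneumonia - obesity))  # True
```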

Figure 2 depicts a straightforward motivation for using a better representation for medical concepts. Figure 2(a) shows one-hot encoding of $N$ unique diagnoses using $N$ -dimensional vectors. It is easy to see that this is not an effective representation, in that the difference between Bronchitis and Pneumonia is the same as the difference between Pneumonia and Obesity. Figure 2(b) shows a better representation, in that Bronchitis and Pneumonia share similar values compared to other diagnoses. By using Skip-gram, we will be able to better represent not only diagnoses but also medications and procedures as multi-dimensional real-valued vectors that capture the latent relations between them.

(a) Patient medical records on a timeline

(b) Predicting neighboring medical concepts given Fever

(c) Predicting neighboring medical concepts given Pneumonia

(d) Model architecture of Skip-gram

Figure 3. Training examples and the model architecture of Skip-gram

Figure 3(a) is an example of a patient medical record in temporal order. Skip-gram assumes the meaning of a concept is determined by its context (or neighbors). Therefore, given a sequence of concepts, Skip-gram picks a target concept and tries to predict its neighbors, as shown in Figure 3(b). Then we slide the context window, pick the next target and perform the same context prediction, as shown in Figure 3(c). Since the goal of Skip-gram is to learn the vector representation of concepts, we need to convert medical concepts to $D$ -dimensional vectors, where $D$ is a user-chosen value typically between 50 and 1000. Therefore the actual prediction is conducted with vectors, as shown in Figure 3(d), where $c_t$ is the concept at the $t$ -th timestep and $\mathbf{v}(c_t)$ the vector that represents $c_t$ . The goal of Skip-gram is to maximize the following average log probability,

$$\frac{1}{T} \sum_{t=1}^T \sum_{-w \leq j \leq w, j \neq 0} \log p(c_{t+j} | c_t), \text{ where}$$

$$p(c_{t+j} | c_t) = \frac{\exp(\mathbf{v}(c_{t+j})^T \mathbf{v}(c_t))}{\sum_{c=1}^N \exp(\mathbf{v}(c)^T \mathbf{v}(c_t))}$$

where $T$ is the length of the sequence of medical concepts, $w$ the size of the context window, $c_t$ the target medical concept at timestep $t$ , $c_{t+j}$ the neighboring medical concept at timestep $t+j$ , $\mathbf{v}(c)$ the vector that represents the medical concept $c$ , and $N$ the total number of medical concepts. The size of the context window is typically set to 5, giving us 10 concepts surrounding the target concept. Note that the conditional probability is expressed as a softmax function. Simply put, by maximizing the softmax score of the inner products with the neighboring concepts, Skip-gram learns real-valued vectors that efficiently capture the fine-grained relations between concepts. Note also that our formulation of Skip-gram differs from the original. Mikolov et al. [9] distinguish the vectors for the target concept from the vectors for the neighboring concepts; in our formulation, we force the two sets of vectors to hold the same values, as suggested by [15]. This simpler formulation allowed faster training while still producing impressive results.
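The objective above can be sketched directly in code. The toy vocabulary, dimensionality and sequence below are illustrative, and the single shared embedding matrix reflects the shared-vector simplification described in the text:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 6, 4                             # toy vocabulary of 6 concepts, 4-dim vectors
V = rng.normal(scale=0.1, size=(N, D))  # one shared embedding row per concept

def log_prob(context, target):
    """log p(c_context | c_target): softmax over inner products with v(c_target)."""
    scores = V @ V[target]              # v(c)^T v(c_t) for every concept c
    scores -= scores.max()              # subtract max for numerical stability
    return scores[context] - np.log(np.exp(scores).sum())

def skipgram_objective(seq, w):
    """Average log-likelihood of each concept's neighbors within window w."""
    T = len(seq)
    total = 0.0
    for t in range(T):
        for j in range(-w, w + 1):
            if j != 0 and 0 <= t + j < T:
                total += log_prob(seq[t + j], seq[t])
    return total / T

seq = [0, 2, 1, 3, 2, 4]  # a toy sequence of concept indices
print(skipgram_objective(seq, w=2))
```

Training would ascend the gradient of this objective with respect to V; in practice the softmax denominator is approximated (e.g., with negative sampling) for large vocabularies.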

Patient Representation Construction

In this section, we describe a simple derivation of patient representation using the learned medical concept vectors. One of the impressive features of Skip-gram in Mikolov et al. [9] was that the word vectors supported syntactically and semantically meaningful linear operations that enabled word analogy calculations, such that the resulting vector of King – Man + Woman is closest to the Queen vector.

We expect that the medical concept representations learned by Skip-gram will show similar properties so that the concept vectors will support clinically meaningful vector additions. Then, an efficient representation of a patient will be as simple as converting all medical concepts in his medical history to medical concept vectors, then summing all those vectors to obtain a single representation vector, as shown in Figure 4. In the experiments, we show examples of clinically meaningful concept vector additions.
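The construction in Figure 4 amounts to a single vector sum. A minimal sketch, assuming a hypothetical code-to-vector lookup with random stand-ins in place of trained vectors:

```python
import numpy as np

# Hypothetical lookup from medical codes to trained 100-dimensional vectors.
D = 100
rng = np.random.default_rng(1)
concept_vec = {code: rng.normal(size=D) for code in
               ["cough", "benzonatate", "fever", "pneumonia",
                "chest_xray", "amoxicillin"]}

def patient_vector(record):
    """Sum the concept vectors of every code in the patient's history."""
    return np.sum([concept_vec[code] for code in record], axis=0)

record = ["cough", "benzonatate", "fever", "pneumonia", "chest_xray", "amoxicillin"]
v = patient_vector(record)
print(v.shape)  # (100,)
```

Because the patient vector is a plain sum, it is order-invariant and stays in the same $D$-dimensional space as the concept vectors.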

The diagram illustrates the construction of a patient representation from a medical record. Part (a) shows a timeline with events: Cough, Benzonatate, Fever, Pneumonia, Chest X-ray, and Amoxicillin. Part (b) shows a row of colored bars representing medical concept vectors, with a red box highlighting a subset. Part (c) shows the sum of these vectors equals a single dark red bar representing the patient representation.

Figure 4. Patient representation construction. (a) represents a medical record of a patient on a timeline. (b) The medical concepts are represented as vectors using the trained medical concept vectors. (c) The patient is represented as a vector by summing all medical concept vectors.

EXPERIMENTS AND RESULTS

Population and Source of Data

Data were from Sutter Palo Alto Medical Foundation (Sutter-PAMF) primary care patients. Sutter-PAMF is a large primary care and multispecialty group practice that has used an EHR for more than a decade. The study dataset was extracted with cases and controls identified within the interval from 05/16/2000 to 05/23/2013. The EHR data included demographics, smoking and alcohol consumption, clinical and laboratory values, International Classification of Disease version 9 (ICD-9) codes associated with encounters, orders, and referrals, procedure information in Current Procedural Terminology (CPT) codes, and medication prescription information recorded as medication names. The dataset contained 265,336 patients with 555,609 unique clinical events in total.

Configuration for Medical Concept Representation Learning

To apply Skip-gram, we scanned through the encounter, medication order, procedure order and problem list records of all 265,336 patients, and extracted the diagnosis, medication and procedure codes assigned to each patient in temporal order. If a patient received multiple diagnoses, medications or procedures at a single visit, those medical codes were given the same timestamp. The respective numbers of unique diagnoses, medications and procedures were 11,460, 17,769 and 9,370, totaling 38,599 unique medical concepts. We used 100-dimensional vectors to represent medical concepts (i.e. $D=100$ in Figure 2(b)), considering that 300 dimensions were sufficient to effectively represent a vocabulary of 692,000 words in NLP [9].
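The extraction step can be sketched as follows; the rows and code labels are hypothetical, but the logic mirrors the description above (codes issued at the same visit share a timestamp and keep their relative input order):

```python
from collections import defaultdict

# Hypothetical raw rows: (patient_id, timestamp, medical_code), e.g. drawn from
# the encounter, medication order, procedure order and problem list tables.
rows = [
    (1, "2010-01-05", "ICD9_465.9"),
    (1, "2010-01-05", "CPT_94760"),      # same visit -> same timestamp
    (1, "2010-03-20", "MED_amoxicillin"),
    (2, "2011-07-01", "ICD9_250.00"),
]

def build_sequences(rows):
    """One temporally ordered code sequence per patient; a stable sort keeps
    codes with equal timestamps in their original relative order."""
    per_patient = defaultdict(list)
    for pid, ts, code in rows:
        per_patient[pid].append((ts, code))
    return {pid: [code for _, code in sorted(events, key=lambda e: e[0])]
            for pid, events in per_patient.items()}

seqs = build_sequences(rows)
print(seqs[1])  # ['ICD9_465.9', 'CPT_94760', 'MED_amoxicillin']
```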

We used Theano [16], a Python library for evaluating mathematical expressions, to implement Skip-gram. Theano can also take advantage of GPUs to greatly improve the speed of calculations involving large matrices. For optimization, we used Adadelta [17], which employs an adaptive learning rate. Unlike stochastic gradient descent (SGD), which is widely used for training neural networks, Adadelta does not depend very strongly on the setting of the learning rate, and shows good performance. Using Theano 0.7 and CUDA 7 on an Ubuntu machine with a Xeon E5-2697 and an Nvidia Tesla K80, it took approximately 43 hours to run 10 epochs of Adadelta with a batch size of 100.

Evaluation of Medical Concept Representation Learning

Figure 5. Diagnosis vectors projected to a 2D space by t-SNE

Figure 5 shows the trained diagnosis vectors plotted in a 2D space, where we used t-SNE [18] to reduce the dimensionality from 100 to 2. t-SNE is a dimensionality reduction algorithm that was specifically developed for plotting high-dimensional data in a two or three dimensional space. We randomly chose 1,000 diagnoses from the 10 uppermost categories of ICD-9, which are displayed at the top of the figure. It is readily visible that diagnoses are generally well grouped by their corresponding categories. However, diagnoses from the same category that are in fact quite different should be far apart. This is shown by the red box and the blue box in Figure 5: even though both are from the neoplasms category, the red box indicates a group of malignant skin neoplasms (172.X, 173.X) while the blue box indicates a group of benign skin neoplasms (216.X). Detailed figures of the red and blue boxes are in the supplementary section. What is more, as the black box shows, diagnoses from different groups are located close to one another if they are actually related. In the black box, iridocyclitis and eye infections related to herpes zoster are closely located, which corresponds to the fact that approximately 43% of herpes zoster ophthalmicus (HZO) patients develop iridocyclitis [19].

In order to see how well the representation learning captured the relations between medications and procedures as well as diagnoses, we conducted the following study. We chose the 100 diagnoses that occurred most frequently in the data, obtained for each diagnosis the 50 closest vectors in terms of cosine similarity, and picked up to 5 diagnosis, medication and procedure vectors from among those 50. Table 1 depicts a portion of the entire list. Note that some cells contain fewer than 5 items because there were fewer than 5 such items among the 50 closest vectors. The entire list is provided in the supplementary section.

Table 1. Examples of diagnoses and their closest medical concepts.

Acute upper respiratory infections (465.9)
  Diagnoses:
  • Bronchitis, not specified as acute or chronic (490)
  • Cough (786.2)
  • Acute sinusitis, unspecified (461.9)
  • Acute bronchitis (466.0)
  • Acute pharyngitis (462)
  Medications:
  • Azithromycin 250 mg po tabs
  • Promethazine-Codeine 6.25-10 mg/5ml po syrup
  • Amoxicillin 500 mg po caps
  • Fluticasone Propionate 50 mcg/act na susp
  • Flonase 50 mcg/act na susp
  Procedures:
  • Pulse oximetry single
  • Serv prov during reg sched eve/wkend/hol hrs
  • Chest PA & lateral
  • Gyn cytology (pap) pa
  • Influenza vac (flu clinic only) 3+yo pa

Diabetes mellitus (250.02)
  Diagnoses:
  • Diabetes mellitus (250.00)
  • Mixed hyperlipidemia (272.2)
  • Other abnormal glucose (790.29)
  • Obesity, unspecified (278.00)
  • Pure hypercholesterolemia (272.0)
  Medications:
  • Metformin hcl 500 mg po tabs
  • Metformin hcl 1000 mg po tabs
  • Glucose blood vi strp
  • Lisinopril 10 mg po tabs
  • Lisinopril 20 mg po tabs
  Procedures:
  • Diabetic eye exam (no bill)
  • Diabetes education, int
  • Ophthalmology, int
  • Diabetic foot exam (no bill)
  • Influenza vac 3+yr (v04.81) im

Edema (782.3)
  Diagnoses:
  • Anemia, unspecified (285.9)
  • Congestive heart failure, unspecified (428.0)
  • Unspecified essential hypertension (401.9)
  • Atrial fibrillation (427.31)
  • Chronic kidney disease, Stage III (moderate) (585.3)
  Medications:
  • Furosemide 20 mg po tabs
  • Hydrochlorothiazide 25 mg po tabs
  • Hydrocodone-Acetaminophen 5-500 mg po tabs
  • Cephalexin 500 mg po caps
  • Furosemide 40 mg po tabs
  Procedures:
  • Debridement of nails, 6 or more
  • OV est pt min serv
  • EKG
  • ECG and interpretation
  • Chest PA & lateral

Tear film insufficiency, unspecified (375.15)
  Diagnoses:
  • Blepharitis, unspecified (373.00)
  • Senile cataract, unspecified (366.10)
  • Presbyopia (367.4)
  • Preglaucoma, unspecified (365.00)
  • Other chronic allergic conjunctivitis (372.14)
  Medications:
  • Glasses
  • Erythromycin 5 mg/gm op oint
  • Patanol 0.1 % op soln
  Procedures:
  • Refraction
  • Visual field exam extended
  • Visual field exam limited
  • Referral to ophthalmology, int
  • Ophthalmology, int

Benign essential hypertension (401.1)
  Diagnoses:
  • Hyperlipidemia (272.4)
  • Essential hypertension (401.9)
  • Pure hypercholesterolemia (272.0)
  • Mixed hyperlipidemia (272.2)
  • Diabetes mellitus (250.00)
  Medications:
  • Hydrochlorothiazide 25 mg po tabs
  • Atenolol 50 mg po tabs
  • Lisinopril 10 mg po tabs
  • Lisinopril 40 mg po tabs
  • Lisinopril 20 mg po tabs
  Procedures:
  • ECG and interpretation
  • Influenza vac 3+yr im
  • Immun admin im/sq/id/perc 1st vac only
  • GI, int
  • OV est pt lev 3
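The nearest-neighbor search used to build tables like Table 1 can be sketched with cosine similarity; the codes and vectors below are stand-ins, not the trained 100-dimensional vectors:

```python
import numpy as np

rng = np.random.default_rng(2)
codes = ["dx_465.9", "dx_490", "med_azithromycin", "proc_pulse_ox", "dx_250.00"]
V = rng.normal(size=(len(codes), 100))   # stand-in trained concept vectors

def closest(query_idx, k):
    """The k concepts with highest cosine similarity to the query concept."""
    q = V[query_idx]
    sims = V @ q / (np.linalg.norm(V, axis=1) * np.linalg.norm(q))
    sims[query_idx] = -np.inf            # exclude the query itself
    return [codes[i] for i in np.argsort(sims)[::-1][:k]]

print(closest(0, k=3))
```

The same routine, applied to the sum of two concept vectors instead of a single query vector, yields the vector-addition results shown in Table 2.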
Evaluation of Medical Concept Vector Additions

Table 2. Vector additions of medical concept vectors trained by Skip-gram.

Hypertension (401.9) + Obesity (278.0)
  Diagnoses:
  • Hyperlipidemia (272.4)
  • Diabetes (250.00)
  • Coronary atherosclerosis (414.00)
  • Hypertension (401.1)
  • Chronic kidney disease (585.3)
  Medications:
  • Hydrochlorothiazide
  • Valsartan
  • Nifedipine
  • Lisinopril
  • Losartan potassium
  Procedures:
  • N/A

Fever (780.60) + Cough (786.2)
  Diagnoses:
  • Pneumonia (486)
  • Acute bronchitis (466.0)
  • Acute upper respiratory infections (465.9)
  • Bronchitis (490)
  • Acute sinusitis (461.9)
  Medications:
  • Azithromycin
  • Promethazine-codeine
  • Guaifenesin-codeine
  • Proair HFA
  • Levofloxacin
  Procedures:
  • X-ray chest
  • Chest PA & Lateral
  • Pulse oximetry
  • Serv prov during reg sched eve/wkend/hol hrs
  • Inhalation Rx for obstruction MDI/NEB

Visual Disturbance (368.8) + Pain in/around Eye (379.91)
  Diagnoses:
  • Tear film insufficiency (375.15)
  • Visual discomfort (368.13)
  • Regular astigmatism (367.21)
  • Presbyopia (367.4)
  • Blepharitis (373.00)
  Medications:
  • Glasses
  • Erythromycin ointment
  • Patanol
  Procedures:
  • Ophthalmology
  • Peripheral refraction
  • Referral to ophthalmology
  • Visual field exam
  • Diabetic eye exam

Loss of Weight (783.21) + Anxiety State (300.00)
  Diagnoses:
  • Depressive disorder (311)
  • Malaise & fatigue (780.79)
  • Insomnia (780.52)
  • Generalized anxiety disorder (300.02)
  • Esophageal reflux (530.81)
  Medications:
  • Lorazepam
  • Zolpidem tartrate
  • Omeprazole
  • Alprazolam
  • Trazodone HCL
  Procedures:
  • Referral to GI
  • ECG & Interpretation
  • GI
  • EKG
  • Chest PA & Lateral

Hallucination (780.1) + Speech Disturbance (784.59)
  Diagnoses:
  • Dysarthria (784.51)
  • Secondary parkinsonism (332.1)
  • Senile dementia with delirium (290.3)
  • Mental disorder (294.9)
  • Paranoid state (297.9)
  Medications:
  • Midodrine HCL
  • Risperdal
  • Rivastigmine Tartrate
  • Rivastigmine
  Procedures:
  • Referral to geriatrics
  • Mental status exam
  • Referral to neuropsychology
  • Referral to speech therapy
  • Home visit est pt lev 2

Due to the difficulty of generating medically interesting examples, we chose 5 intuitive examples, shown in the first column of Table 2, to give a simple demonstration of medical concept vector additions. We again generated the 50 closest vectors to the sum of two medical concept vectors and picked up to 5 from each of the diagnosis, medication and procedure categories.

Setup for Heart Failure Prediction Evaluation

In this section, we first describe why we chose the heart failure (HF) prediction task as an application. Then we briefly describe the models used, followed by the data processing steps to create the training data for all models. Lastly, we present the evaluation strategy and the implementation details.

Heart failure prediction task: Onset of HF is associated with a high level of disability, health care costs and mortality (roughly 50% risk of mortality within 5 years of diagnosis) [20] [21]. There has been relatively little progress in slowing the progression of HF severity, largely because it is difficult to detect before actual diagnosis. As a consequence, intervention has primarily been confined to the time period after diagnosis, with little or no impact on disease progression. Earlier detection of HF could lead to improved outcomes through patient engagement and more assertive treatment with angiotensin converting enzyme (ACE) inhibitors or angiotensin II receptor blockers (ARBs), mild exercise, reduced salt intake, and possibly other options [22] [23] [24] [25].

Models for performance comparison: We aim to emphasize the effectiveness of the medical concept representation and the patient representation derived from it. Therefore we trained four popular classifiers, namely logistic regression, MLP, SVM, and KNN using both one-hot vectors and medical concept vectors.

Definition of Cases and Controls: Criteria for incident onset of HF are described in [26] and were adopted from [27]. The criteria are: 1) qualifying ICD-9 codes for HF appeared as a diagnosis code in either the encounter, the problem list, or the medication order fields. Qualifying ICD-9 codes are listed in the supplementary section. Qualifying ICD-9 codes attached to image and other related orders were excluded because these orders often represent a suspicion of HF, where the results are often negative; 2) a minimum of three clinical encounters with qualifying ICD-9 codes had to occur within 12 months of each other, with the date of diagnosis assigned to the earliest of the three dates. If the time span between the first and second appearances of the HF diagnostic code was greater than 12 months, the date of the second encounter was used as the first qualifying encounter; 3) age 50 or greater and less than 85 at the time of HF diagnosis.

Up to ten (nine on average) eligible primary care clinic-, sex-, and age-matched (in 5-year age intervals) controls were selected for each incident HF case. Primary care patients were eligible as controls if they had no HF diagnosis in the 12-month period before diagnosis of the incident HF case. Control subjects were required to have their first office encounter within one year of the matching HF case patient's first office visit, and have at least one office encounter 30 days before or any time after the case's HF diagnosis date to ensure similar duration of observations among cases and controls.

From 265,336 Sutter-PAMF patients, 3,884 incident HF cases and 28,903 control patients were identified.

Data processing: To train the four models, we generated the dataset again from the encounter, medication order, procedure order and problem list records of the 3,884 cases and 28,903 controls. Based on the HF diagnosis date (HFDx) of each patient, we extracted all records from the 18-month period before the HFDx. To train the models with medical concept vectors, we converted the medical records to patient vectors as shown in Figure 4. To train the models with one-hot encoding, we converted the medical records to aggregated one-hot vectors in the same fashion as Figure 4, using one-hot vectors instead of medical concept vectors.

In order to study the relation between medical concept vectors trained with different sizes of data and their influence on the models' prediction performance, we used three kinds of medical concept vectors: 1) vectors trained with only the HF cases (3,884 patients), 2) vectors trained with the HF cases and controls (32,787 patients), and 3) vectors trained with the full sample (265,336 patients). Note that medical concept vectors trained with fewer patients cover fewer medical concepts. Therefore, when converting patient records to patient vectors as in Figure 4, we excluded all medical codes that did not have matching medical concept vectors. All input vectors were normalized to zero mean and unit variance.
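The final normalization step can be sketched as follows; X is a hypothetical patient-by-feature matrix standing in for the patient vectors (or aggregated one-hot vectors):

```python
import numpy as np

# Toy patient-by-feature matrix: 3 patients, 3 input dimensions.
X = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, 3.0],
              [1.0, 1.0, 2.0]])

# Normalize every input dimension to zero mean and unit variance.
mu = X.mean(axis=0)
sigma = X.std(axis=0)
sigma[sigma == 0] = 1.0        # guard against constant features
X_norm = (X - mu) / sigma

print(np.allclose(X_norm.mean(axis=0), 0))  # True
```

In a train/test setting, the mean and standard deviation would be estimated on the training folds only and then applied to the held-out fold.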

Evaluation strategy: We used six-fold cross validation to train and evaluate all models, and to estimate how well the models generalize to independent datasets. Prediction performance was measured using area under the ROC curve (AUC) on data not used in the training. For SVM, we used the confidence score to calculate the AUC. A detailed explanation of the cross validation is given in the supplementary section.
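A minimal sketch of this evaluation protocol using Scikit-Learn, with random stand-in data in place of the patient vectors; the actual models and hyper-parameters are those described in the supplementary section:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(120, 10))          # stand-in patient vectors
y = rng.integers(0, 2, size=120)        # 1 = incident HF case, 0 = control

# Six-fold cross validation, scoring each held-out fold by AUC.
aucs = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                       cv=6, scoring="roc_auc")
print(aucs.mean(), aucs.std())
```

For an SVM without probability estimates, the `roc_auc` scorer falls back to the decision function, which corresponds to the confidence score mentioned above.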

Implementation details: Logistic regression and MLP were implemented with Theano and trained with Adadelta. SVM and KNN were implemented with the Python library Scikit-Learn. All models were trained on the same machine used for medical concept representation learning. The hyper-parameters used for training each model are described in the supplementary section.

Evaluation of Heart Failure Prediction

Figure 6. Heart failure prediction performance of various models and input vectors.

Figure 6 shows the average AUC of the 6-fold cross validation of various models and input vectors. The colors represent different training input vectors. The error bars indicate the standard deviation derived from the 6-fold cross validation. The power of medical concept representation learning is evident, as all models show significant improvement in HF prediction performance. Logistic regression and SVM, both being linear models, show similar performance when trained with medical concept vectors, although SVM benefits slightly more from the better representation of medical concepts. MLP also benefits from using medical concept vectors and, being a non-linear model, shows better performance than logistic regression and SVM. It is interesting that KNN benefits the most from using the medical concept vectors, even the ones trained on the smallest dataset. Considering that KNN classification is based on the distances between data points, this is a clear indication that a proper medical concept representation can alleviate the sparsity problem induced by simple one-hot encoding.

Figure 6 also shows that medical concept representation is best learned with a large dataset, as observed by Mikolov et al. [6] However, for most models, especially KNN, even the medical concept vectors trained with the smallest number of patients improve prediction performance. This is quite surprising: although we used less information by excluding unmatched medical codes when using concept vectors trained on a small number of patients, the models still show better prediction performance. This is further evidence that medical concept representation learning provides a more effective way to represent medical concepts than one-hot encoding.
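The contrast with one-hot encoding is easiest to see in how a patient is represented. As a hedged sketch (the sizes, code indices, and the use of a simple sum over concept vectors are illustrative assumptions, not the authors' exact construction), a dense patient vector can be built by aggregating the learned embeddings of the codes in the patient's record:

```python
import numpy as np

rng = np.random.default_rng(1)
n_codes, dim = 20000, 100                         # code vocabulary, embedding size
concept_vecs = rng.normal(size=(n_codes, dim))    # stand-in for trained concept vectors

def patient_vector(code_ids, concept_vecs):
    """Aggregate (here: sum) the concept vectors of one patient's codes."""
    return concept_vecs[code_ids].sum(axis=0)

codes_a = [12, 873, 15004]                        # hypothetical code indices
vec_a = patient_vector(codes_a, concept_vecs)

# The dense representation stays 100-dimensional regardless of vocabulary
# size, whereas a one-hot bag-of-codes vector would be 20,000-dimensional
# and almost entirely zero.
```

Because distances between such low-dimensional dense vectors are far more informative than distances between sparse binary vectors, this construction is consistent with KNN being the largest beneficiary in Figure 6.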

Table 3. Training speed improvement. (*Since KNN does not require training, we display classification time instead)

| | Logistic regression | SVM | MLP | KNN* |
|---|---|---|---|---|
| One-hot encoding | 81.3 | 20.5 | 85.7 | 2900.82 |
| Medical concept vectors | 5.3 | 1.9 | 5.6 | 36.66 |
| Speed-up | ×15.3 | ×10.8 | ×15.3 | ×79.1 |

Table 3 shows the training time for each model when using one-hot encoding and medical concept vectors. Given the high dimensionality of one-hot encoding, training the models with medical concept vectors should provide a significant speed-up, as shown in the last row of Table 3. Medical concept vectors thus not only improve prediction performance but also significantly reduce training time.

Before discussing future work, we would like to emphasize that all of our experiments were conducted without expert knowledge such as medical ontologies or features designed by medical experts. Using only the medical order records, we were able to produce clinically meaningful representations of medical concepts. This is an encouraging result that can be extended to numerous other medical problems.

Future Work

Although medical concept vectors have shown impressive results, they would be even more effective if deeper medical information, such as lab results or patient demographics, could be embedded. This would enable us to represent the medical state of patients more accurately.

Incorporating expert knowledge is another direction worth pursuing. Although we achieved strong performance using medical records alone, this does not mean we cannot benefit from well-established expert medical knowledge, such as hand-crafted features or medical ontologies.

Another natural extension of our work is to address other medical problems. Although this work focused on the early detection of heart failure, our approach is general enough to be applied to any disease prediction problem, and the medical concept vectors can be used in numerous other medical applications as well.

CONCLUSION

We proposed a new way of representing heterogeneous medical concepts as real-valued vectors and constructing efficient patient representations using a state-of-the-art deep learning method. We qualitatively showed that the trained medical concept vectors capture insights consistent with medical knowledge and experience. For the heart failure prediction task, medical concept vectors improved the performance of several classifiers, quantitatively demonstrating their effectiveness. We discussed the limitations of our method and possible future work, which includes deeper utilization of medical information, combining expert knowledge into our framework, and expanding our approach to other medical applications.

References

  • [1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097-1105.
  • [2] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol, "Extracting and composing robust features with denoising autoencoders," in International Conference on Machine learning, 2008, pp. 1096-1103.
  • [3] Quoc Le et al., "Building high-level features using large scale unsupervised learning," in arXiv preprint arXiv:1112.6209, 2012.
  • [4] Honglak Lee, Peter Pham, Yan Largman, and Andrew Ng, "Unsupervised feature learning for audio classification using convolutional deep belief networks," in Advances in Neural Information Processing Systems, 2009, pp. 1096-1104.
  • [5] Geoffrey Hinton et al., "Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups," Signal Processing Magazine, vol. 29, no. 6, pp. 82-97, 2012.
  • [6] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean, "Efficient estimation of word representations in vector space," in arXiv preprint arXiv:1301.3781, 2013.
  • [7] Kyunghyun Cho et al., "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in arXiv preprint arXiv:1406.1078, 2014.
  • [8] Richard Socher, Jeffrey Pennington, Eric Huang, Andrew Ng, and Christopher Manning, "Semi-supervised recursive autoencoders for predicting sentiment distributions," in Empirical Methods in Natural Language Processing, 2011, pp. 151-161.
  • [9] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeff Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems, 2013, pp. 3111-3119.
  • [10] Jeffrey Pennington, Richard Socher, and Christopher Manning, "Glove: Global vectors for word representation," in Empirical Methods on Natural Language Processing, 2014, pp. 1532-1543.
  • [11] Radim Rehurek. (2014, Dec.) Rare Technologies. [Online]. http://rare-technologies.com/making-sense-of-word2vec/
  • [12] Jose Minarro-Gimenez, Oscar Marin-Alonso, and Matthias Samwald, "Exploring the application of deep learning techniques on medical text corpora," Studies in health technology and informatics, vol. 205, pp. 584-588, 2013.
  • [13] Lance De Vine, Guido Zuccon, Bevan Koopman, Laurianne Sitbon, and Peter Bruza, "Medical Semantic Similarity with a Neural Language Model," in International Conference on Information and Knowledge Management, 2014, pp. 1819-1822.
  • [14] Youngduk Choi, Chill Chiu, and David Sontag, "Learning low-dimensional representations of medical concepts," 2016, to appear in AMIA CRI.
  • [15] Xin Rong, "word2vec parameter learning explained," in arXiv preprint arXiv:1411.2738, 2014.
  • [16] James Bergstra et al., "Theano: a CPU and GPU Math Expression Compiler," in Python for Scientific Computing Conference, 2010.
  • [17] Matthew Zeiler, "ADADELTA: An adaptive learning rate method," in arXiv preprint arXiv:1212.5701, 2012.
  • [18] Laurens Van der Maaten and Geoffrey Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, no. 11, pp. 2579-2605, 2008.
  • [19] Janice Thean, Anthony Hall, and Richard Stawell, "Uveitis in herpes zoster ophthalmicus," Clinical & Experimental Ophthalmology, vol. 29, no. 6, pp. 406-410, 2001.
  • [20] Veronique L. Roger et al., "Trends in heart failure incidence and survival in a community-based population," JAMA, vol. 292, no. 3, pp. 344-350, July 2004.
  • [21] Sherry L. Murphy, Jiaquan Xu, and Kenneth D. Kochanek, "Deaths: final data for 2010," National Vital Stat Rep, vol. 61, no. 4, pp. 1-117, May 2010.
  • [22] SOLVD Investigators, "Effect of enalapril on mortality and the development of heart failure in asymptomatic patients with reduced left ventricular ejection fractions," N Engl j Med, vol. 327, pp. 685-691, 1992.
  • [23] J. Arnold et al., "Prevention of heart failure in patients in the Heart Outcomes Prevention Evaluation (HOPE) study," Circulation, vol. 107, no. 9, pp. 1284-1290, 2003.
  • [24] Sebastiano Sciarretta, Francesca Palano, Giuliano Tocci, Rossella Baldini, and Massimo Volpe, "Antihypertensive treatment and development of heart failure in hypertension: a Bayesian network meta-analysis of studies in patients with hypertension and high cardiovascular risk," Archives of Internal Medicine, vol. 171, no. 5, pp. 384-394, 2011.

  • [25] Chao-Hung Wang, Richard Weisel, Peter Liu, Paul Fedak, and Subodh Verma, "Glitazones and heart failure: critical appraisal for the clinician," Circulation, vol. 107, no. 10, pp. 1350-1354, 2003.
  • [26] Rajakrishnan Vijayakrishnan et al., "Prevalence of heart failure signs and symptoms in a large primary care population identified through the use of text and data mining of the electronic health record," Journal of Cardiac Failure, vol. 20, no. 7, pp. 459-464, 2014.
  • [27] Jerry Gurwitz et al., "Contemporary prevalence and correlates of incident heart failure with preserved ejection fraction," The American Journal of Medicine, vol. 126, no. 5, pp. 393-400, 2013.

## SUPPLEMENTARY

Table 4. Qualifying ICD-9 codes for heart failure

| ICD-9 Code | Description |
|---|---|
| 398.91 | Rheumatic heart failure (congestive) |
| 402.01 | Malignant hypertensive heart disease with heart failure |
| 402.11 | Benign hypertensive heart disease with heart failure |
| 402.91 | Unspecified hypertensive heart disease with heart failure |
| 404.01 | Hypertensive heart and chronic kidney disease, malignant, with heart failure and with chronic kidney disease stage I through stage IV, or unspecified |
| 404.03 | Hypertensive heart and chronic kidney disease, malignant, with heart failure and with chronic kidney disease stage V or end stage renal disease |
| 404.11 | Hypertensive heart and chronic kidney disease, benign, with heart failure and with chronic kidney disease stage I through stage IV, or unspecified |
| 404.13 | Hypertensive heart and chronic kidney disease, benign, with heart failure and chronic kidney disease stage V or end stage renal disease |
| 404.91 | Hypertensive heart and chronic kidney disease, unspecified, with heart failure and with chronic kidney disease stage I through stage IV, or unspecified |
| 404.93 | Hypertensive heart and chronic kidney disease, unspecified, with heart failure and chronic kidney disease stage V or end stage renal disease |
| 428.0 | Congestive heart failure, unspecified |
| 428.1 | Left heart failure |
| 428.20 | Systolic heart failure, unspecified |
| 428.21 | Acute systolic heart failure |
| 428.22 | Chronic systolic heart failure |
| 428.23 | Acute on chronic systolic heart failure |
| 428.30 | Diastolic heart failure, unspecified |
| 428.31 | Acute diastolic heart failure |
| 428.32 | Chronic diastolic heart failure |
| 428.33 | Acute on chronic diastolic heart failure |
| 428.40 | Combined systolic and diastolic heart failure, unspecified |
| 428.41 | Acute combined systolic and diastolic heart failure |
| 428.42 | Chronic combined systolic and diastolic heart failure |
| 428.43 | Acute on chronic combined systolic and diastolic heart failure |
| 428.9 | Heart failure, unspecified |
**Red and Blue Box of Figure 5**

Figure 7. Detailed version of the red box of Figure 5

Table 5. List of ICD-9 codes that appear in Figure 7, and their descriptions

| ICD-9 Code | Description |
|---|---|
| 172.3 | Malignant melanoma of skin of other and unspecified parts of face |
| 172.4 | Malignant melanoma of skin of scalp and neck |
| 172.5 | Malignant melanoma of skin of trunk, except scrotum |
| 173.0 | Other and unspecified malignant neoplasm of skin of lip |
| 173.1 | Other and unspecified malignant neoplasm of skin of eyelid, including canthus |
| 173.2 | Other and unspecified malignant neoplasm of skin of ear and external auditory canal |
| 173.3 | Other and unspecified malignant neoplasm of skin of other and unspecified parts of face |
| 173.31 | Basal cell carcinoma of skin of other and unspecified parts of face |
| 173.4 | Other and unspecified malignant neoplasm of scalp and skin of neck |
| 173.41 | Basal cell carcinoma of scalp and skin of neck |
| 173.5 | Other and unspecified malignant neoplasm of skin of trunk, except scrotum |
| 173.50 | Unspecified malignant neoplasm of skin of trunk, except scrotum |
| 173.51 | Basal cell carcinoma of skin of trunk, except scrotum |
| 173.6 | Other and unspecified malignant neoplasm of skin of upper limb, including shoulder |
| 173.7 | Other and unspecified malignant neoplasm of skin of lower limb, including hip |
| 173.71 | Basal cell carcinoma of skin of lower limb, including hip |
| 173.9 | Other and unspecified malignant neoplasm of skin, site unspecified |
| 173.91 | Basal cell carcinoma of skin, site unspecified |
| 238.9 | Neoplasm of uncertain behavior, site unspecified |

Figure 8. Detailed version of the blue box of Figure 5

Table 6. List of ICD-9 codes that appear in Figure 8, and their descriptions

| ICD-9 Code | Description |
|---|---|
| 078.10 | Viral warts, unspecified |
| 216.3 | Benign neoplasm of skin of other and unspecified parts of face |
| 216.5 | Benign neoplasm of skin of trunk, except scrotum |
| 216.6 | Benign neoplasm of skin of upper limb, including shoulder |
| 216.7 | Benign neoplasm of skin of lower limb, including hip |
| 216.8 | Benign neoplasm of other specified sites of skin |
| 216.9 | Benign neoplasm of skin, site unspecified |
| 228.00 | Hemangioma of unspecified site |
| 238.2 | Neoplasm of uncertain behavior of skin |
| 448.1 | Nevus, non-neoplastic |
| 448.9 | Other and unspecified capillary diseases |
## 6-fold Cross Validation Scheme

The entire cohort of 32,787 patients is divided into 7 equal chunks. In each of the six folds, one chunk serves as the validation set and another as the test set, while the remaining five chunks form the training set; the validation and test chunks rotate across the folds.

Figure 9. Diagram of 6-fold cross validation

Figure 9 depicts the 6-fold cross validation we performed for HF prediction. As explained earlier, the entire cohort is divided into 7 chunks, and two chunks take turns serving as the validation set and the test set.

### Hyper-parameters used for training models

After experimenting with various values, the following hyper-parameter settings produced the best performance. We used Theano 0.7 and CUDA 7 for training the logistic regression, MLP, and GRU models. SVM was implemented with Scikit-Learn LinearSVC, and KNN with Scikit-Learn KNeighborsClassifier.

Table 7. Hyper-parameter settings for training the models

| Model | Hyper-parameters |
|---|---|
| Logistic regression, one-hot vectors | L2 regularization: 0.1, Max epoch: 100 |
| Logistic regression, medical concept vectors | L2 regularization: 0.01, Max epoch: 100 |
| SVM, one-hot vectors | L2 regularization: 0.000001, Dual: False |
| SVM, medical concept vectors | L2 regularization: 0.001, Dual: False |
| MLP, one-hot vectors | L2 regularization: 0.01, Hidden layer size: 15, Max epoch: 100 |
| MLP, medical concept vectors | L2 regularization: 0.001, Hidden layer size: 100, Max epoch: 100 |
| KNN, one-hot vectors | Number of neighbors: 15 |
| KNN, medical concept vectors | Number of neighbors: 100 |
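The Scikit-Learn models above could be instantiated roughly as follows. Note one hedge: LinearSVC is parameterized by an inverse regularization strength `C` rather than a direct L2 penalty weight, so passing the reported regularization values as `C` below is an assumption about the configuration, not the authors' exact code.

```python
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier

# SVM settings from Table 7 (C mapping is an assumption, see above)
svm_onehot = LinearSVC(C=1e-6, dual=False)
svm_concept = LinearSVC(C=1e-3, dual=False)

# KNN settings from Table 7
knn_onehot = KNeighborsClassifier(n_neighbors=15)
knn_concept = KNeighborsClassifier(n_neighbors=100)
```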
