new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

May 7

Efficient Variance-reduced Estimation from Generative EHR Models: The SCOPE and REACH Estimators

Generative models trained using self-supervision of tokenized electronic health record (EHR) timelines show promise for clinical outcome prediction. This is typically done using Monte Carlo simulation for future patient trajectories. However, existing approaches suffer from three key limitations: sparse estimate distributions that poorly differentiate patient risk levels, extreme computational costs, and high sampling variance. We propose two new estimators: the Sum of Conditional Outcome Probability Estimator (SCOPE) and Risk Estimation from Anticipated Conditional Hazards (REACH), that leverage next-token probability distributions discarded by standard Monte Carlo. We prove both estimators are unbiased and that REACH guarantees variance reduction over Monte Carlo sampling for any model and outcome. Empirically, on hospital mortality prediction in MIMIC-IV using the ETHOS-ARES framework, SCOPE and REACH match 100-sample Monte Carlo performance using only 10-11 samples (95% CI: [9,11]), representing a ~10x reduction in inference cost without degrading calibration. For ICU admission prediction, efficiency gains are more modest (~1.2x), which we attribute to the outcome's lower "spontaneity," a property we characterize theoretically and empirically. These methods substantially improve the feasibility of deploying generative EHR models in resource-constrained clinical settings.

  • 6 authors
·
Feb 2

Enhancing clinical decision support with physiological waveforms -- a multimodal benchmark in emergency care

Background: AI-driven prediction algorithms have the potential to enhance emergency medicine by enabling rapid and accurate decision-making regarding patient status and potential deterioration. However, the integration of multimodal data, including raw waveform signals, remains underexplored in clinical decision support. Methods: We present a dataset and benchmarking protocol designed to advance multimodal decision support in emergency care. Our models utilize demographics, biometrics, vital signs, laboratory values, and electrocardiogram (ECG) waveforms as inputs to predict both discharge diagnoses and patient deterioration. Results: The diagnostic model achieves area under the receiver operating curve (AUROC) scores above 0.8 for 609 out of 1,428 conditions, covering both cardiac (e.g., myocardial infarction) and non-cardiac (e.g., renal disease, diabetes) diagnoses. The deterioration model attains AUROC scores above 0.8 for 14 out of 15 targets, accurately predicting critical events such as cardiac arrest, mechanical ventilation, ICU admission, and mortality. Conclusions: Our study highlights the positive impact of incorporating raw waveform data into decision support models, improving predictive performance. By introducing a unique, publicly available dataset and baseline models, we provide a foundation for measurable progress in AI-driven decision support for emergency care.

  • 3 authors
·
Apr 29, 2025

Similarity-Based Self-Construct Graph Model for Predicting Patient Criticalness Using Graph Neural Networks and EHR Data

Accurately predicting the criticalness of ICU patients (such as in-ICU mortality risk) is vital for early intervention in critical care. However, conventional models often treat each patient in isolation and struggle to exploit the relational structure in Electronic Health Records (EHR). We propose a Similarity-Based Self-Construct Graph Model (SBSCGM) that dynamically builds a patient similarity graph from multi-modal EHR data, and a HybridGraphMedGNN architecture that operates on this graph to predict patient mortality and a continuous criticalness score. SBSCGM uses a hybrid similarity measure (combining feature-based and structural similarities) to connect patients with analogous clinical profiles in real-time. The HybridGraphMedGNN integrates Graph Convolutional Network (GCN), GraphSAGE, and Graph Attention Network (GAT) layers to learn robust patient representations, leveraging both local and global graph patterns. In experiments on 6,000 ICU stays from the MIMIC-III dataset, our model achieves state-of-the-art performance (AUC-ROC 0.94) outperforming baseline classifiers and single-type GNN models. We also demonstrate improved precision/recall and show that the attention mechanism provides interpretable insights into model predictions. Our framework offers a scalable and interpretable solution for critical care risk prediction, with potential to support clinicians in real-world ICU deployment.

  • 2 authors
·
Aug 1, 2025

Predicting sepsis in multi-site, multi-national intensive care cohorts using deep learning

Despite decades of clinical research, sepsis remains a global public health crisis with high mortality, and morbidity. Currently, when sepsis is detected and the underlying pathogen is identified, organ damage may have already progressed to irreversible stages. Effective sepsis management is therefore highly time-sensitive. By systematically analysing trends in the plethora of clinical data available in the intensive care unit (ICU), an early prediction of sepsis could lead to earlier pathogen identification, resistance testing, and effective antibiotic and supportive treatment, and thereby become a life-saving measure. Here, we developed and validated a machine learning (ML) system for the prediction of sepsis in the ICU. Our analysis represents the largest multi-national, multi-centre in-ICU study for sepsis prediction using ML to date. Our dataset contains 156,309 unique ICU admissions, which represent a refined and harmonised subset of five large ICU databases originating from three countries. Using the international consensus definition Sepsis-3, we derived hourly-resolved sepsis label annotations, amounting to 26,734 (17.1%) septic stays. We compared our approach, a deep self-attention model, to several clinical baselines as well as ML baselines and performed an extensive internal and external validation within and across databases. On average, our model was able to predict sepsis with an AUROC of 0.847 pm 0.050 (internal out-of sample validation) and 0.761 pm 0.052 (external validation). For a harmonised prevalence of 17%, at 80% recall our model detects septic patients with 39% precision 3.7 hours in advance.

  • 8 authors
·
Jul 12, 2021

Foundation Model of Electronic Medical Records for Adaptive Risk Estimation

Hospitals struggle to predict critical outcomes. Traditional early warning systems, like NEWS and MEWS, rely on static variables and fixed thresholds, limiting their adaptability, accuracy, and personalization. We previously developed the Enhanced Transformer for Health Outcome Simulation (ETHOS), an AI model that tokenizes patient health timelines (PHTs) from EHRs and uses transformer-based architectures to predict future PHTs. ETHOS is a versatile framework for developing a wide range of applications. In this work, we develop the Adaptive Risk Estimation System (ARES) that leverages ETHOS to compute dynamic, personalized risk probabilities for clinician-defined critical events. ARES also features a personalized explainability module that highlights key clinical factors influencing risk estimates. We evaluated ARES using the MIMIC-IV v2.2 dataset together with its Emergency Department (ED) extension and benchmarked performance against both classical early warning systems and contemporary machine learning models. The entire dataset was tokenized resulting in 285,622 PHTs, comprising over 360 million tokens. ETHOS outperformed benchmark models in predicting hospital admissions, ICU admissions, and prolonged stays, achieving superior AUC scores. Its risk estimates were robust across demographic subgroups, with calibration curves confirming model reliability. The explainability module provided valuable insights into patient-specific risk factors. ARES, powered by ETHOS, advances predictive healthcare AI by delivering dynamic, real-time, personalized risk estimation with patient-specific explainability. Although our results are promising, the clinical impact remains uncertain. Demonstrating ARES's true utility in real-world settings will be the focus of our future work. We release the source code to facilitate future research.

  • 12 authors
·
Feb 9, 2025

Yet Another ICU Benchmark: A Flexible Multi-Center Framework for Clinical ML

Medical applications of machine learning (ML) have experienced a surge in popularity in recent years. The intensive care unit (ICU) is a natural habitat for ML given the abundance of available data from electronic health records. Models have been proposed to address numerous ICU prediction tasks like the early detection of complications. While authors frequently report state-of-the-art performance, it is challenging to verify claims of superiority. Datasets and code are not always published, and cohort definitions, preprocessing pipelines, and training setups are difficult to reproduce. This work introduces Yet Another ICU Benchmark (YAIB), a modular framework that allows researchers to define reproducible and comparable clinical ML experiments; we offer an end-to-end solution from cohort definition to model evaluation. The framework natively supports most open-access ICU datasets (MIMIC III/IV, eICU, HiRID, AUMCdb) and is easily adaptable to future ICU datasets. Combined with a transparent preprocessing pipeline and extensible training code for multiple ML and deep learning models, YAIB enables unified model development. Our benchmark comes with five predefined established prediction tasks (mortality, acute kidney injury, sepsis, kidney function, and length of stay) developed in collaboration with clinicians. Adding further tasks is straightforward by design. Using YAIB, we demonstrate that the choice of dataset, cohort definition, and preprocessing have a major impact on the prediction performance - often more so than model class - indicating an urgent need for YAIB as a holistic benchmarking tool. We provide our work to the clinical ML community to accelerate method development and enable real-world clinical implementations. Software Repository: https://github.com/rvandewater/YAIB.

  • 6 authors
·
Jun 8, 2023

Temporal-spatial Correlation Attention Network for Clinical Data Analysis in Intensive Care Unit

In recent years, medical information technology has made it possible for electronic health record (EHR) to store fairly complete clinical data. This has brought health care into the era of "big data". However, medical data are often sparse and strongly correlated, which means that medical problems cannot be solved effectively. With the rapid development of deep learning in recent years, it has provided opportunities for the use of big data in healthcare. In this paper, we propose a temporal-saptial correlation attention network (TSCAN) to handle some clinical characteristic prediction problems, such as predicting death, predicting length of stay, detecting physiologic decline, and classifying phenotypes. Based on the design of the attention mechanism model, our approach can effectively remove irrelevant items in clinical data and irrelevant nodes in time according to different tasks, so as to obtain more accurate prediction results. Our method can also find key clinical indicators of important outcomes that can be used to improve treatment options. Our experiments use information from the Medical Information Mart for Intensive Care (MIMIC-IV) database, which is open to the public. Finally, we have achieved significant performance benefits of 2.0\% (metric) compared to other SOTA prediction methods. We achieved a staggering 90.7\% on mortality rate, 45.1\% on length of stay. The source code can be find: https://github.com/yuyuheintju/TSCAN.

  • 6 authors
·
Jun 2, 2023

Improving Medical Predictions by Irregular Multimodal Electronic Health Records Modeling

Health conditions among patients in intensive care units (ICUs) are monitored via electronic health records (EHRs), composed of numerical time series and lengthy clinical note sequences, both taken at irregular time intervals. Dealing with such irregularity in every modality, and integrating irregularity into multimodal representations to improve medical predictions, is a challenging problem. Our method first addresses irregularity in each single modality by (1) modeling irregular time series by dynamically incorporating hand-crafted imputation embeddings into learned interpolation embeddings via a gating mechanism, and (2) casting a series of clinical note representations as multivariate irregular time series and tackling irregularity via a time attention mechanism. We further integrate irregularity in multimodal fusion with an interleaved attention mechanism across temporal steps. To the best of our knowledge, this is the first work to thoroughly model irregularity in multimodalities for improving medical predictions. Our proposed methods for two medical prediction tasks consistently outperforms state-of-the-art (SOTA) baselines in each single modality and multimodal fusion scenarios. Specifically, we observe relative improvements of 6.5\%, 3.6\%, and 4.3\% in F1 for time series, clinical notes, and multimodal fusion, respectively. These results demonstrate the effectiveness of our methods and the importance of considering irregularity in multimodal EHRs.

  • 5 authors
·
Oct 18, 2022

Temporal Supervised Contrastive Learning for Modeling Patient Risk Progression

We consider the problem of predicting how the likelihood of an outcome of interest for a patient changes over time as we observe more of the patient data. To solve this problem, we propose a supervised contrastive learning framework that learns an embedding representation for each time step of a patient time series. Our framework learns the embedding space to have the following properties: (1) nearby points in the embedding space have similar predicted class probabilities, (2) adjacent time steps of the same time series map to nearby points in the embedding space, and (3) time steps with very different raw feature vectors map to far apart regions of the embedding space. To achieve property (3), we employ a nearest neighbor pairing mechanism in the raw feature space. This mechanism also serves as an alternative to data augmentation, a key ingredient of contrastive learning, which lacks a standard procedure that is adequately realistic for clinical tabular data, to our knowledge. We demonstrate that our approach outperforms state-of-the-art baselines in predicting mortality of septic patients (MIMIC-III dataset) and tracking progression of cognitive impairment (ADNI dataset). Our method also consistently recovers the correct synthetic dataset embedding structure across experiments, a feat not achieved by baselines. Our ablation experiments show the pivotal role of our nearest neighbor pairing.

Benchmarking LLMs for Predictive Applications in the Intensive Care Units

With the advent of LLMs, various tasks across the natural language processing domain have been transformed. However, their application in predictive tasks remains less researched. This study compares large language models, including GatorTron-Base (trained on clinical data), Llama 8B, and Mistral 7B, against models like BioBERT, DocBERT, BioClinicalBERT, Word2Vec, and Doc2Vec, setting benchmarks for predicting Shock in critically ill patients. Timely prediction of shock can enable early interventions, thus improving patient outcomes. Text data from 17,294 ICU stays of patients in the MIMIC III database were scored for length of stay > 24 hours and shock index (SI) > 0.7 to yield 355 and 87 patients with normal and abnormal SI-index, respectively. Both focal and cross-entropy losses were used during finetuning to address class imbalances. Our findings indicate that while GatorTron Base achieved the highest weighted recall of 80.5%, the overall performance metrics were comparable between SLMs and LLMs. This suggests that LLMs are not inherently superior to SLMs in predicting future clinical events despite their strong performance on text-based tasks. To achieve meaningful clinical outcomes, future efforts in training LLMs should prioritize developing models capable of predicting clinical trajectories rather than focusing on simpler tasks such as named entity recognition or phenotyping.

  • 7 authors
·
Dec 23, 2025

Early Recognition of Sepsis with Gaussian Process Temporal Convolutional Networks and Dynamic Time Warping

Sepsis is a life-threatening host response to infection associated with high mortality, morbidity, and health costs. Its management is highly time-sensitive since each hour of delayed treatment increases mortality due to irreversible organ damage. Meanwhile, despite decades of clinical research, robust biomarkers for sepsis are missing. Therefore, detecting sepsis early by utilizing the affluence of high-resolution intensive care records has become a challenging machine learning problem. Recent advances in deep learning and data mining promise to deliver a powerful set of tools to efficiently address this task. This empirical study proposes two novel approaches for the early detection of sepsis: a deep learning model and a lazy learner based on time series distances. Our deep learning model employs a temporal convolutional network that is embedded in a Multi-task Gaussian Process Adapter framework, making it directly applicable to irregularly-spaced time series data. Our lazy learner, by contrast, is an ensemble approach that employs dynamic time warping. We frame the timely detection of sepsis as a supervised time series classification task. For this, we derive the most recent sepsis definition in an hourly resolution to provide the first fully accessible early sepsis detection environment. Seven hours before sepsis onset, our methods improve area under the precision--recall curve from 0.25 to 0.35/0.40 over the state of the art. This demonstrates that they are well-suited for detecting sepsis in the crucial earlier stages when management is most effective.

  • 5 authors
·
Feb 5, 2019 2

A Comprehensive Benchmark for COVID-19 Predictive Modeling Using Electronic Health Records in Intensive Care

The COVID-19 pandemic has posed a heavy burden to the healthcare system worldwide and caused huge social disruption and economic loss. Many deep learning models have been proposed to conduct clinical predictive tasks such as mortality prediction for COVID-19 patients in intensive care units using Electronic Health Record (EHR) data. Despite their initial success in certain clinical applications, there is currently a lack of benchmarking results to achieve a fair comparison so that we can select the optimal model for clinical use. Furthermore, there is a discrepancy between the formulation of traditional prediction tasks and real-world clinical practice in intensive care. To fill these gaps, we propose two clinical prediction tasks, Outcome-specific length-of-stay prediction and Early mortality prediction for COVID-19 patients in intensive care units. The two tasks are adapted from the naive length-of-stay and mortality prediction tasks to accommodate the clinical practice for COVID-19 patients. We propose fair, detailed, open-source data-preprocessing pipelines and evaluate 17 state-of-the-art predictive models on two tasks, including 5 machine learning models, 6 basic deep learning models and 6 deep learning predictive models specifically designed for EHR data. We provide benchmarking results using data from two real-world COVID-19 EHR datasets. One dataset is publicly available without needing any inquiry and another dataset can be accessed on request. We provide fair, reproducible benchmarking results for two tasks. We deploy all experiment results and models on an online platform. We also allow clinicians and researchers to upload their data to the platform and get quick prediction results using our trained models. We hope our efforts can further facilitate deep learning and machine learning research for COVID-19 predictive modeling.

  • 7 authors
·
Sep 16, 2022

An Integrated Optimization and Machine Learning Models to Predict the Admission Status of Emergency Patients

This work proposes a framework for optimizing machine learning algorithms. The practicality of the framework is illustrated using an important case study from the healthcare domain, which is predicting the admission status of emergency department (ED) patients (e.g., admitted vs. discharged) using patient data at the time of triage. The proposed framework can mitigate the crowding problem by proactively planning the patient boarding process. A large retrospective dataset of patient records is obtained from the electronic health record database of all ED visits over three years from three major locations of a healthcare provider in the Midwest of the US. Three machine learning algorithms are proposed: T-XGB, T-ADAB, and T-MLP. T-XGB integrates extreme gradient boosting (XGB) and Tabu Search (TS), T-ADAB integrates Adaboost and TS, and T-MLP integrates multi-layer perceptron (MLP) and TS. The proposed algorithms are compared with the traditional algorithms: XGB, ADAB, and MLP, in which their parameters are tunned using grid search. The three proposed algorithms and the original ones are trained and tested using nine data groups that are obtained from different feature selection methods. In other words, 54 models are developed. Performance was evaluated using five measures: Area under the curve (AUC), sensitivity, specificity, F1, and accuracy. The results show that the newly proposed algorithms resulted in high AUC and outperformed the traditional algorithms. The T-ADAB performs the best among the newly developed algorithms. The AUC, sensitivity, specificity, F1, and accuracy of the best model are 95.4%, 99.3%, 91.4%, 95.2%, 97.2%, respectively.

  • 4 authors
·
Feb 18, 2022

Benchmarking emergency department triage prediction models with machine learning and large public electronic health records

The demand for emergency department (ED) services is increasing across the globe, particularly during the current COVID-19 pandemic. Clinical triage and risk assessment have become increasingly challenging due to the shortage of medical resources and the strain on hospital infrastructure caused by the pandemic. As a result of the widespread use of electronic health records (EHRs), we now have access to a vast amount of clinical data, which allows us to develop predictive models and decision support systems to address these challenges. To date, however, there are no widely accepted benchmark ED triage prediction models based on large-scale public EHR data. An open-source benchmarking platform would streamline research workflows by eliminating cumbersome data preprocessing, and facilitate comparisons among different studies and methodologies. In this paper, based on the Medical Information Mart for Intensive Care IV Emergency Department (MIMIC-IV-ED) database, we developed a publicly available benchmark suite for ED triage predictive models and created a benchmark dataset that contains over 400,000 ED visits from 2011 to 2019. We introduced three ED-based outcomes (hospitalization, critical outcomes, and 72-hour ED reattendance) and implemented a variety of popular methodologies, ranging from machine learning methods to clinical scoring systems. We evaluated and compared the performance of these methods against benchmark tasks. Our codes are open-source, allowing anyone with MIMIC-IV-ED data access to perform the same steps in data processing, benchmark model building, and experiments. This study provides future researchers with insights, suggestions, and protocols for managing raw data and developing risk triaging tools for emergency care.

  • 13 authors
·
Nov 22, 2021

Medical Concept Representation Learning from Electronic Health Records and its Application on Heart Failure Prediction

Objective: To transform heterogeneous clinical data from electronic health records into clinically meaningful constructed features using data driven method that rely, in part, on temporal relations among data. Materials and Methods: The clinically meaningful representations of medical concepts and patients are the key for health analytic applications. Most of existing approaches directly construct features mapped to raw data (e.g., ICD or CPT codes), or utilize some ontology mapping such as SNOMED codes. However, none of the existing approaches leverage EHR data directly for learning such concept representation. We propose a new way to represent heterogeneous medical concepts (e.g., diagnoses, medications and procedures) based on co-occurrence patterns in longitudinal electronic health records. The intuition behind the method is to map medical concepts that are co-occuring closely in time to similar concept vectors so that their distance will be small. We also derive a simple method to construct patient vectors from the related medical concept vectors. Results: For qualitative evaluation, we study similar medical concepts across diagnosis, medication and procedure. In quantitative evaluation, our proposed representation significantly improves the predictive modeling performance for onset of heart failure (HF), where classification methods (e.g. logistic regression, neural network, support vector machine and K-nearest neighbors) achieve up to 23% improvement in area under the ROC curve (AUC) using this proposed representation. Conclusion: We proposed an effective method for patient and medical concept representation learning. The resulting representation can map relevant concepts together and also improves predictive modeling performance.

  • 4 authors
·
Feb 11, 2016

Predicting COVID-19 Pneumonia Severity on Chest X-ray with Deep Learning

Purpose: The need to streamline patient management for COVID-19 has become more pressing than ever. Chest X-rays provide a non-invasive (potentially bedside) tool to monitor the progression of the disease. In this study, we present a severity score prediction model for COVID-19 pneumonia for frontal chest X-ray images. Such a tool can gauge severity of COVID-19 lung infections (and pneumonia in general) that can be used for escalation or de-escalation of care as well as monitoring treatment efficacy, especially in the ICU. Methods: Images from a public COVID-19 database were scored retrospectively by three blinded experts in terms of the extent of lung involvement as well as the degree of opacity. A neural network model that was pre-trained on large (non-COVID-19) chest X-ray datasets is used to construct features for COVID-19 images which are predictive for our task. Results: This study finds that training a regression model on a subset of the outputs from an this pre-trained chest X-ray model predicts our geographic extent score (range 0-8) with 1.14 mean absolute error (MAE) and our lung opacity score (range 0-6) with 0.78 MAE. Conclusions: These results indicate that our model's ability to gauge severity of COVID-19 lung infections could be used for escalation or de-escalation of care as well as monitoring treatment efficacy, especially in the intensive care unit (ICU). A proper clinical trial is needed to evaluate efficacy. To enable this we make our code, labels, and data available online at https://github.com/mlmed/torchxrayvision/tree/master/scripts/covid-severity and https://github.com/ieee8023/covid-chestxray-dataset

  • 11 authors
·
Jun 29, 2020

A Corpus for Detecting High-Context Medical Conditions in Intensive Care Patient Notes Focusing on Frequently Readmitted Patients

A crucial step within secondary analysis of electronic health records (EHRs) is to identify the patient cohort under investigation. While EHRs contain medical billing codes that aim to represent the conditions and treatments patients may have, much of the information is only present in the patient notes. Therefore, it is critical to develop robust algorithms to infer patients' conditions and treatments from their written notes. In this paper, we introduce a dataset for patient phenotyping, a task that is defined as the identification of whether a patient has a given medical condition (also referred to as clinical indication or phenotype) based on their patient note. Nursing Progress Notes and Discharge Summaries from the Intensive Care Unit of a large tertiary care hospital were manually annotated for the presence of several high-context phenotypes relevant to treatment and risk of re-hospitalization. This dataset contains 1102 Discharge Summaries and 1000 Nursing Progress Notes. Each Discharge Summary and Progress Note has been annotated by at least two expert human annotators (one clinical researcher and one resident physician). Annotated phenotypes include treatment non-adherence, chronic pain, advanced/metastatic cancer, as well as 10 other phenotypes. This dataset can be utilized for academic and industrial research in medicine and computer science, particularly within the field of medical natural language processing.

  • 10 authors
·
Mar 6, 2020

Forecasting Clinical Risk from Textual Time Series: Structuring Narratives for Temporal AI in Healthcare

Clinical case reports encode temporal patient trajectories that are often underexploited by traditional machine learning methods relying on structured data. In this work, we introduce the forecasting problem from textual time series, where timestamped clinical findings -- extracted via an LLM-assisted annotation pipeline -- serve as the primary input for prediction. We systematically evaluate a diverse suite of models, including fine-tuned decoder-based large language models and encoder-based transformers, on tasks of event occurrence prediction, temporal ordering, and survival analysis. Our experiments reveal that encoder-based models consistently achieve higher F1 scores and superior temporal concordance for short- and long-horizon event forecasting, while fine-tuned masking approaches enhance ranking performance. In contrast, instruction-tuned decoder models demonstrate a relative advantage in survival analysis, especially in early prognosis settings. Our sensitivity analyses further demonstrate the importance of time ordering, which requires clinical time series construction, as compared to text ordering, the format of the text inputs that LLMs are classically trained on. This highlights the additional benefit that can be ascertained from time-ordered corpora, with implications for temporal tasks in the era of widespread LLM use.

Forecasting Patient Flows with Pandemic Induced Concept Drift using Explainable Machine Learning

Accurately forecasting patient arrivals at Urgent Care Clinics (UCCs) and Emergency Departments (EDs) is important for effective resourcing and patient care. However, correctly estimating patient flows is not straightforward since it depends on many drivers. The predictability of patient arrivals has recently been further complicated by the COVID-19 pandemic conditions and the resulting lockdowns. This study investigates how a suite of novel quasi-real-time variables like Google search terms, pedestrian traffic, the prevailing incidence levels of influenza, as well as the COVID-19 Alert Level indicators can both generally improve the forecasting models of patient flows and effectively adapt the models to the unfolding disruptions of pandemic conditions. This research also uniquely contributes to the body of work in this domain by employing tools from the eXplainable AI field to investigate more deeply the internal mechanics of the models than has previously been done. The Voting ensemble-based method combining machine learning and statistical techniques was the most reliable in our experiments. Our study showed that the prevailing COVID-19 Alert Level feature together with Google search terms and pedestrian traffic were effective at producing generalisable forecasts. The implications of this study are that proxy variables can effectively augment standard autoregressive features to ensure accurate forecasting of patient flows. The experiments showed that the proposed features are potentially effective model inputs for preserving forecast accuracies in the event of future pandemic outbreaks.

  • 2 authors
·
Nov 1, 2022

DeepSurv: Personalized Treatment Recommender System Using A Cox Proportional Hazards Deep Neural Network

Medical practitioners use survival models to explore and understand the relationships between patients' covariates (e.g. clinical and genetic features) and the effectiveness of various treatment options. Standard survival models like the linear Cox proportional hazards model require extensive feature engineering or prior medical knowledge to model treatment interaction at an individual level. While nonlinear survival methods, such as neural networks and survival forests, can inherently model these high-level interaction terms, they have yet to be shown as effective treatment recommender systems. We introduce DeepSurv, a Cox proportional hazards deep neural network and state-of-the-art survival method for modeling interactions between a patient's covariates and treatment effectiveness in order to provide personalized treatment recommendations. We perform a number of experiments training DeepSurv on simulated and real survival data. We demonstrate that DeepSurv performs as well as or better than other state-of-the-art survival models and validate that DeepSurv successfully models increasingly complex relationships between a patient's covariates and their risk of failure. We then show how DeepSurv models the relationship between a patient's features and effectiveness of different treatment options to show how DeepSurv can be used to provide individual treatment recommendations. Finally, we train DeepSurv on real clinical studies to demonstrate how it's personalized treatment recommendations would increase the survival time of a set of patients. The predictive and modeling capabilities of DeepSurv will enable medical researchers to use deep neural networks as a tool in their exploration, understanding, and prediction of the effects of a patient's characteristics on their risk of failure.

  • 6 authors
·
Aug 8, 2017

State-Change Learning for Prediction of Future Events in Endoscopic Videos

Surgical future prediction, driven by real-time AI analysis of surgical video, is critical for operating room safety and efficiency. It provides actionable insights into upcoming events, their timing, and risks-enabling better resource allocation, timely instrument readiness, and early warnings for complications (e.g., bleeding, bile duct injury). Despite this need, current surgical AI research focuses on understanding what is happening rather than predicting future events. Existing methods target specific tasks in isolation, lacking unified approaches that span both short-term (action triplets, events) and long-term horizons (remaining surgery duration, phase transitions). These methods rely on coarse-grained supervision while fine-grained surgical action triplets and steps remain underexplored. Furthermore, methods based only on future feature prediction struggle to generalize across different surgical contexts and procedures. We address these limits by reframing surgical future prediction as state-change learning. Rather than forecasting raw observations, our approach classifies state transitions between current and future timesteps. We introduce SurgFUTR, implementing this through a teacher-student architecture. Video clips are compressed into state representations via Sinkhorn-Knopp clustering; the teacher network learns from both current and future clips, while the student network predicts future states from current videos alone, guided by our Action Dynamics (ActDyn) module. We establish SFPBench with five prediction tasks spanning short-term (triplets, events) and long-term (remaining surgery duration, phase and step transitions) horizons. Experiments across four datasets and three procedures show consistent improvements. Cross-procedure transfer validates generalizability.

  • 4 authors
·
Oct 14, 2025

Summarizing Patients Problems from Hospital Progress Notes Using Pre-trained Sequence-to-Sequence Models

Automatically summarizing patients' main problems from daily progress notes using natural language processing methods helps to battle against information and cognitive overload in hospital settings and potentially assists providers with computerized diagnostic decision support. Problem list summarization requires a model to understand, abstract, and generate clinical documentation. In this work, we propose a new NLP task that aims to generate a list of problems in a patient's daily care plan using input from the provider's progress notes during hospitalization. We investigate the performance of T5 and BART, two state-of-the-art seq2seq transformer architectures, in solving this problem. We provide a corpus built on top of progress notes from publicly available electronic health record progress notes in the Medical Information Mart for Intensive Care (MIMIC)-III. T5 and BART are trained on general domain text, and we experiment with a data augmentation method and a domain adaptation pre-training method to increase exposure to medical vocabulary and knowledge. Evaluation methods include ROUGE, BERTScore, cosine similarity on sentence embedding, and F-score on medical concepts. Results show that T5 with domain adaptive pre-training achieves significant performance gains compared to a rule-based system and general domain pre-trained language models, indicating a promising direction for tackling the problem summarization task.

  • 6 authors
·
Aug 17, 2022

Generalization in Healthcare AI: Evaluation of a Clinical Large Language Model

Advances in large language models (LLMs) provide new opportunities in healthcare for improved patient care, clinical decision-making, and enhancement of physician and administrator workflows. However, the potential of these models importantly depends on their ability to generalize effectively across clinical environments and populations, a challenge often underestimated in early development. To better understand reasons for these challenges and inform mitigation approaches, we evaluated ClinicLLM, an LLM trained on [HOSPITAL]'s clinical notes, analyzing its performance on 30-day all-cause readmission prediction focusing on variability across hospitals and patient characteristics. We found poorer generalization particularly in hospitals with fewer samples, among patients with government and unspecified insurance, the elderly, and those with high comorbidities. To understand reasons for lack of generalization, we investigated sample sizes for fine-tuning, note content (number of words per note), patient characteristics (comorbidity level, age, insurance type, borough), and health system aspects (hospital, all-cause 30-day readmission, and mortality rates). We used descriptive statistics and supervised classification to identify features. We found that, along with sample size, patient age, number of comorbidities, and the number of words in notes are all important factors related to generalization. Finally, we compared local fine-tuning (hospital specific), instance-based augmented fine-tuning and cluster-based fine-tuning for improving generalization. Among these, local fine-tuning proved most effective, increasing AUC by 0.25% to 11.74% (most helpful in settings with limited data). Overall, this study provides new insights for enhancing the deployment of large language models in the societally important domain of healthcare, and improving their performance for broader populations.

  • 6 authors
·
Feb 14, 2024

Prototype Learning to Create Refined Interpretable Digital Phenotypes from ECGs

Prototype-based neural networks offer interpretable predictions by comparing inputs to learned, representative signal patterns anchored in training data. While such models have shown promise in the classification of physiological data, it remains unclear whether their prototypes capture an underlying structure that aligns with broader clinical phenotypes. We use a prototype-based deep learning model trained for multi-label ECG classification using the PTB-XL dataset. Then without modification we performed inference on the MIMIC-IV clinical database. We assess whether individual prototypes, trained solely for classification, are associated with hospital discharge diagnoses in the form of phecodes in this external population. Individual prototypes demonstrate significantly stronger and more specific associations with clinical outcomes compared to the classifier's class predictions, NLP-extracted concepts, or broader prototype classes across all phecode categories. Prototype classes with mixed significance patterns exhibit significantly greater intra-class distances (p < 0.0001), indicating the model learned to differentiate clinically meaningful variations within diagnostic categories. The prototypes achieve strong predictive performance across diverse conditions, with AUCs ranging from 0.89 for atrial fibrillation to 0.91 for heart failure, while also showing substantial signal for non-cardiac conditions such as sepsis and renal disease. These findings suggest that prototype-based models can support interpretable digital phenotyping from physiologic time-series data, providing transferable intermediate phenotypes that capture clinically meaningful physiologic signatures beyond their original training objectives.

  • 6 authors
·
Aug 2, 2025

Forecasting Patient Demand at Urgent Care Clinics using Machine Learning

Urgent care clinics and emergency departments around the world periodically suffer from extended wait times beyond patient expectations due to inadequate staffing levels. These delays have been linked with adverse clinical outcomes. Previous research into forecasting demand this domain has mostly used a collection of statistical techniques, with machine learning approaches only now beginning to emerge in recent literature. The forecasting problem for this domain is difficult and has also been complicated by the COVID-19 pandemic which has introduced an additional complexity to this estimation due to typical demand patterns being disrupted. This study explores the ability of machine learning methods to generate accurate patient presentations at two large urgent care clinics located in Auckland, New Zealand. A number of machine learning algorithms were explored in order to determine the most effective technique for this problem domain, with the task of making forecasts of daily patient demand three months in advance. The study also performed an in-depth analysis into the model behaviour in respect to the exploration of which features are most effective at predicting demand and which features are capable of adaptation to the volatility caused by the COVID-19 pandemic lockdowns. The results showed that ensemble-based methods delivered the most accurate and consistent solutions on average, generating improvements in the range of 23%-27% over the existing in-house methods for estimating the daily demand.

  • 2 authors
·
May 25, 2022

NOTE: Notable generation Of patient Text summaries through Efficient approach based on direct preference optimization

The discharge summary is a one of critical documents in the patient journey, encompassing all events experienced during hospitalization, including multiple visits, medications, tests, surgery/procedures, and admissions/discharge. Providing a summary of the patient's progress is crucial, as it significantly influences future care and planning. Consequently, clinicians face the laborious and resource-intensive task of manually collecting, organizing, and combining all the necessary data for a discharge summary. Therefore, we propose "NOTE", which stands for "Notable generation Of patient Text summaries through an Efficient approach based on direct preference optimization". NOTE is based on Medical Information Mart for Intensive Care- III dataset and summarizes a single hospitalization of a patient. Patient events are sequentially combined and used to generate a discharge summary for each hospitalization. In the present circumstances, large language models' application programming interfaces (LLMs' APIs) are widely available, but importing and exporting medical data presents significant challenges due to privacy protection policies in healthcare institutions. Moreover, to ensure optimal performance, it is essential to implement a lightweight model for internal server or program within the hospital. Therefore, we utilized DPO and parameter efficient fine tuning (PEFT) techniques to apply a fine-tuning method that guarantees superior performance. To demonstrate the practical application of the developed NOTE, we provide a webpage-based demonstration software. In the future, we will aim to deploy the software available for actual use by clinicians in hospital. NOTE can be utilized to generate various summaries not only discharge summaries but also throughout a patient's journey, thereby alleviating the labor-intensive workload of clinicians and aiming for increased efficiency.

  • 5 authors
·
Feb 19, 2024

Foresight -- Generative Pretrained Transformer (GPT) for Modelling of Patient Timelines using EHRs

Background: Electronic Health Records hold detailed longitudinal information about each patient's health status and general clinical history, a large portion of which is stored within the unstructured text. Existing approaches focus mostly on structured data and a subset of single-domain outcomes. We explore how temporal modelling of patients from free text and structured data, using deep generative transformers can be used to forecast a wide range of future disorders, substances, procedures or findings. Methods: We present Foresight, a novel transformer-based pipeline that uses named entity recognition and linking tools to convert document text into structured, coded concepts, followed by providing probabilistic forecasts for future medical events such as disorders, substances, procedures and findings. We processed the entire free-text portion from three different hospital datasets totalling 811336 patients covering both physical and mental health. Findings: On tests in two UK hospitals (King's College Hospital, South London and Maudsley) and the US MIMIC-III dataset precision@10 0.68, 0.76 and 0.88 was achieved for forecasting the next disorder in a patient timeline, while precision@10 of 0.80, 0.81 and 0.91 was achieved for forecasting the next biomedical concept. Foresight was also validated on 34 synthetic patient timelines by five clinicians and achieved relevancy of 97% for the top forecasted candidate disorder. As a generative model, it can forecast follow-on biomedical concepts for as many steps as required. Interpretation: Foresight is a general-purpose model for biomedical concept modelling that can be used for real-world risk forecasting, virtual trials and clinical research to study the progression of disorders, simulate interventions and counterfactuals, and educational purposes.

  • 12 authors
·
Dec 13, 2022

The Alzheimer's Disease Prediction Of Longitudinal Evolution (TADPOLE) Challenge: Results after 1 Year Follow-up

We present the findings of "The Alzheimer's Disease Prediction Of Longitudinal Evolution" (TADPOLE) Challenge, which compared the performance of 92 algorithms from 33 international teams at predicting the future trajectory of 219 individuals at risk of Alzheimer's disease. Challenge participants were required to make a prediction, for each month of a 5-year future time period, of three key outcomes: clinical diagnosis, Alzheimer's Disease Assessment Scale Cognitive Subdomain (ADAS-Cog13), and total volume of the ventricles. The methods used by challenge participants included multivariate linear regression, machine learning methods such as support vector machines and deep neural networks, as well as disease progression models. No single submission was best at predicting all three outcomes. For clinical diagnosis and ventricle volume prediction, the best algorithms strongly outperform simple baselines in predictive ability. However, for ADAS-Cog13 no single submitted prediction method was significantly better than random guesswork. Two ensemble methods based on taking the mean and median over all predictions, obtained top scores on almost all tasks. Better than average performance at diagnosis prediction was generally associated with the additional inclusion of features from cerebrospinal fluid (CSF) samples and diffusion tensor imaging (DTI). On the other hand, better performance at ventricle volume prediction was associated with inclusion of summary statistics, such as the slope or maxima/minima of biomarkers. TADPOLE's unique results suggest that current prediction algorithms provide sufficient accuracy to exploit biomarkers related to clinical diagnosis and ventricle volume, for cohort refinement in clinical trials for Alzheimer's disease. However, results call into question the usage of cognitive test scores for patient selection and as a primary endpoint in clinical trials.

  • 96 authors
·
Feb 9, 2020

Site-Level Fine-Tuning with Progressive Layer Freezing: Towards Robust Prediction of Bronchopulmonary Dysplasia from Day-1 Chest Radiographs in Extremely Preterm Infants

Bronchopulmonary dysplasia (BPD) is a chronic lung disease affecting 35% of extremely low birth weight infants. Defined by oxygen dependence at 36 weeks postmenstrual age, it causes lifelong respiratory complications. However, preventive interventions carry severe risks, including neurodevelopmental impairment, ventilator-induced lung injury, and systemic complications. Therefore, early BPD prognosis and prediction of BPD outcome is crucial to avoid unnecessary toxicity in low risk infants. Admission radiographs of extremely preterm infants are routinely acquired within 24h of life and could serve as a non-invasive prognostic tool. In this work, we developed and investigated a deep learning approach using chest X-rays from 163 extremely low-birth-weight infants (leq32 weeks gestation, 401-999g) obtained within 24 hours of birth. We fine-tuned a ResNet-50 pretrained specifically on adult chest radiographs, employing progressive layer freezing with discriminative learning rates to prevent overfitting and evaluated a CutMix augmentation and linear probing. For moderate/severe BPD outcome prediction, our best performing model with progressive freezing, linear probing and CutMix achieved an AUROC of 0.78 pm 0.10, balanced accuracy of 0.69 pm 0.10, and an F1-score of 0.67 pm 0.11. In-domain pre-training significantly outperformed ImageNet initialization (p = 0.031) which confirms domain-specific pretraining to be important for BPD outcome prediction. Routine IRDS grades showed limited prognostic value (AUROC 0.57 pm 0.11), confirming the need of learned markers. Our approach demonstrates that domain-specific pretraining enables accurate BPD prediction from routine day-1 radiographs. Through progressive freezing and linear probing, the method remains computationally feasible for site-level implementation and future federated learning deployments.

  • 16 authors
·
Jul 16, 2025

Representation Before Training: A Fixed-Budget Benchmark for Generative Medical Event Models

Every prediction from a generative medical event model is bounded by how clinical events are tokenized, yet input representation is rarely isolated from other system and architectural choices. We evaluate how representation decisions affect downstream prediction after a shared one-epoch pretraining budget. We train 28 matched transformers on MIMIC-IV and evaluate them on 30 clinical outcomes in three experiments: (1) quantization granularity, reference-range anchoring, and code-value fusion; (2) value encoding (hard bins, soft discretization, code-normalized xVal) crossed with temporal encoding (event order, time tokens, admission-relative RoPE); and (3) native MIMIC laboratory/vital codes versus the Common Longitudinal ICU Format (CLIF)-remapped laboratory/vital codes with compression-preserving perturbation arms. In Experiment 1, fused code-value tokenization improves mortality AUROC from 0.891 to 0.915 (BH-adjusted p < 0.001), hospital length-of-stay AUROC from 0.763 to 0.788 (BH-adjusted p < 0.001), and, for the decile fused-vs-unfused comparison, mean regression Spearman rho across the 13 regression outcomes from 0.414 to 0.494. Across the three temporal encodings, event order only and admission-relative RoPE match or exceed inserting time tokens on average while shortening sequences by 11%. CLIF remapping preserves downstream performance in our single-site setting while yielding a smaller, clinically interpretable token set compatible with multi-site use. Finer-than-decile quantization, reference-range anchoring, and soft discretization help in selective outcomes, while code-normalized xVal remains well below the discrete and soft families, consistent with near-median suppression that persists after the affine variant.

  • 6 authors
·
Apr 17

Generalist Foundation Models Are Not Clinical Enough for Hospital Operations

Hospitals and healthcare systems rely on operational decisions that determine patient flow, cost, and quality of care. Despite strong performance on medical knowledge and conversational benchmarks, foundation models trained on general text may lack the specialized knowledge required for these operational decisions. We introduce Lang1, a family of models (100M-7B parameters) pretrained on a specialized corpus blending 80B clinical tokens from NYU Langone Health's EHRs and 627B tokens from the internet. To rigorously evaluate Lang1 in real-world settings, we developed the REalistic Medical Evaluation (ReMedE), a benchmark derived from 668,331 EHR notes that evaluates five critical tasks: 30-day readmission prediction, 30-day mortality prediction, length of stay, comorbidity coding, and predicting insurance claims denial. In zero-shot settings, both general-purpose and specialized models underperform on four of five tasks (36.6%-71.7% AUROC), with mortality prediction being an exception. After finetuning, Lang1-1B outperforms finetuned generalist models up to 70x larger and zero-shot models up to 671x larger, improving AUROC by 3.64%-6.75% and 1.66%-23.66% respectively. We also observed cross-task scaling with joint finetuning on multiple tasks leading to improvement on other tasks. Lang1-1B effectively transfers to out-of-distribution settings, including other clinical tasks and an external health system. Our findings suggest that predictive capabilities for hospital operations require explicit supervised finetuning, and that this finetuning process is made more efficient by in-domain pretraining on EHR. Our findings support the emerging view that specialized LLMs can compete with generalist models in specialized tasks, and show that effective healthcare systems AI requires the combination of in-domain pretraining, supervised finetuning, and real-world evaluation beyond proxy benchmarks.

newyorkuniversity New York University
·
Nov 17, 2025 3

Hierarchical Pretraining for Biomedical Term Embeddings

Electronic health records (EHR) contain narrative notes that provide extensive details on the medical condition and management of patients. Natural language processing (NLP) of clinical notes can use observed frequencies of clinical terms as predictive features for downstream applications such as clinical decision making and patient trajectory prediction. However, due to the vast number of highly similar and related clinical concepts, a more effective modeling strategy is to represent clinical terms as semantic embeddings via representation learning and use the low dimensional embeddings as feature vectors for predictive modeling. To achieve efficient representation, fine-tuning pretrained language models with biomedical knowledge graphs may generate better embeddings for biomedical terms than those from standard language models alone. These embeddings can effectively discriminate synonymous pairs of from those that are unrelated. However, they often fail to capture different degrees of similarity or relatedness for concepts that are hierarchical in nature. To overcome this limitation, we propose HiPrBERT, a novel biomedical term representation model trained on additionally complied data that contains hierarchical structures for various biomedical terms. We modify an existing contrastive loss function to extract information from these hierarchies. Our numerical experiments demonstrate that HiPrBERT effectively learns the pair-wise distance from hierarchical information, resulting in a substantially more informative embeddings for further biomedical applications

  • 6 authors
·
Jul 1, 2023

From Cities to Series: Complex Networks and Deep Learning for Improved Spatial and Temporal Analytics*

Graphs have often been used to answer questions about the interaction between real-world entities by taking advantage of their capacity to represent complex topologies. Complex networks are known to be graphs that capture such non-trivial topologies; they are able to represent human phenomena such as epidemic processes, the dynamics of populations, and the urbanization of cities. The investigation of complex networks has been extrapolated to many fields of science, with particular emphasis on computing techniques, including artificial intelligence. In such a case, the analysis of the interaction between entities of interest is transposed to the internal learning of algorithms, a paradigm whose investigation is able to expand the state of the art in Computer Science. By exploring this paradigm, this thesis puts together complex networks and machine learning techniques to improve the understanding of the human phenomena observed in pandemics, pendular migration, and street networks. Accordingly, we contribute with: (i) a new neural network architecture capable of modeling dynamic processes observed in spatial and temporal data with applications in epidemics propagation, weather forecasting, and patient monitoring in intensive care units; (ii) a machine-learning methodology for analyzing and predicting links in the scope of human mobility between all the cities of Brazil; and, (iii) techniques for identifying inconsistencies in the urban planning of cities while tracking the most influential vertices, with applications over Brazilian and worldwide cities. We obtained results sustained by sound evidence of advances to the state of the art in artificial intelligence, rigorous formalisms, and ample experimentation. Our findings rely upon real-world applications in a range of domains, demonstrating the applicability of our methodologies.

  • 2 authors
·
Jun 1, 2022

Enhanced Mortality Prediction In Patients With Subarachnoid Haemorrhage Using A Deep Learning Model Based On The Initial CT Scan

PURPOSE: Subarachnoid hemorrhage (SAH) entails high morbidity and mortality rates. Convolutional neural networks (CNN), a form of deep learning, are capable of generating highly accurate predictions from imaging data. Our objective was to predict mortality in SAH patients by processing the initial CT scan on a CNN based algorithm. METHODS: Retrospective multicentric study of a consecutive cohort of patients with SAH between 2011-2022. Demographic, clinical and radiological variables were analyzed. Pre-processed baseline CT scan images were used as the input for training a CNN using AUCMEDI Framework. Our model's architecture leverages the DenseNet-121 structure, employing transfer learning principles. The output variable was mortality in the first three months. Performance of the model was evaluated by statistical parameters conventionally used in studies involving artificial intelligence methods. RESULTS: Images from 219 patients were processed, 175 for training and validation of the CNN and 44 for its evaluation. 52%(115/219) of patients were female, and the median age was 58(SD=13.06) years. 18.5%(39/219) were idiopathic SAH. Mortality rate was 28.5%(63/219). The model showed good accuracy at predicting mortality in SAH patients exclusively using the images of the initial CT scan (Accuracy=74%, F1=75% and AUC=82%). CONCLUSION: Modern image processing techniques based on AI and CNN make possible to predict mortality in SAH patients with high accuracy using CT scan images as the only input. These models might be optimized by including more data and patients resulting in better training, development and performance on tasks which are beyond the skills of conventional clinical knowledge.

  • 9 authors
·
Aug 25, 2023

MIRA: Medical Time Series Foundation Model for Real-World Health Data

A unified foundation model for medical time series -- pretrained on open access and ethics board-approved medical corpora -- offers the potential to reduce annotation burdens, minimize model customization, and enable robust transfer across clinical institutions, modalities, and tasks, particularly in data-scarce or privacy-constrained environments. However, existing generalist time series foundation models struggle to handle medical time series data due to their inherent challenges, including irregular intervals, heterogeneous sampling rates, and frequent missing values. To address these challenges, we introduce MIRA, a unified foundation model specifically designed for medical time series forecasting. MIRA incorporates a Continuous-Time Rotary Positional Encoding that enables fine-grained modeling of variable time intervals, a frequency-specific mixture-of-experts layer that routes computation across latent frequency regimes to further promote temporal specialization, and a Continuous Dynamics Extrapolation Block based on Neural ODE that models the continuous trajectory of latent states, enabling accurate forecasting at arbitrary target timestamps. Pretrained on a large-scale and diverse medical corpus comprising over 454 billion time points collect from publicly available datasets, MIRA achieves reductions in forecasting errors by an average of 10% and 7% in out-of-distribution and in-distribution scenarios, respectively, when compared to other zero-shot and fine-tuned baselines. We also introduce a comprehensive benchmark spanning multiple downstream clinical tasks, establishing a foundation for future research in medical time series modeling.

  • 11 authors
·
Jun 9, 2025

A Multi-View Joint Learning Framework for Embedding Clinical Codes and Text Using Graph Neural Networks

Learning to represent free text is a core task in many clinical machine learning (ML) applications, as clinical text contains observations and plans not otherwise available for inference. State-of-the-art methods use large language models developed with immense computational resources and training data; however, applying these models is challenging because of the highly varying syntax and vocabulary in clinical free text. Structured information such as International Classification of Disease (ICD) codes often succinctly abstracts the most important facts of a clinical encounter and yields good performance, but is often not as available as clinical text in real-world scenarios. We propose a multi-view learning framework that jointly learns from codes and text to combine the availability and forward-looking nature of text and better performance of ICD codes. The learned text embeddings can be used as inputs to predictive algorithms independent of the ICD codes during inference. Our approach uses a Graph Neural Network (GNN) to process ICD codes, and Bi-LSTM to process text. We apply Deep Canonical Correlation Analysis (DCCA) to enforce the two views to learn a similar representation of each patient. In experiments using planned surgical procedure text, our model outperforms BERT models fine-tuned to clinical data, and in experiments using diverse text in MIMIC-III, our model is competitive to a fine-tuned BERT at a tiny fraction of its computational effort.

  • 4 authors
·
Jan 27, 2023

Pre-training A Neural Language Model Improves The Sample Efficiency of an Emergency Room Classification Model

To build a French national electronic injury surveillance system based on emergency room visits, we aim to develop a coding system to classify their causes from clinical notes in free-text. Supervised learning techniques have shown good results in this area but require a large amount of expert annotated dataset which is time consuming and costly to obtain. We hypothesize that the Natural Language Processing Transformer model incorporating a generative self-supervised pre-training step can significantly reduce the required number of annotated samples for supervised fine-tuning. In this preliminary study, we test our hypothesis in the simplified problem of predicting whether a visit is the consequence of a traumatic event or not from free-text clinical notes. Using fully re-trained GPT-2 models (without OpenAI pre-trained weights), we assess the gain of applying a self-supervised pre-training phase with unlabeled notes prior to the supervised learning task. Results show that the number of data required to achieve a ginve level of performance (AUC>0.95) was reduced by a factor of 10 when applying pre-training. Namely, for 16 times more data, the fully-supervised model achieved an improvement <1% in AUC. To conclude, it is possible to adapt a multi-purpose neural language model such as the GPT-2 to create a powerful tool for classification of free-text notes with only a small number of labeled samples.

  • 6 authors
·
Aug 30, 2019

Clinically-Inspired Multi-Agent Transformers for Disease Trajectory Forecasting from Multimodal Data

Deep neural networks are often applied to medical images to automate the problem of medical diagnosis. However, a more clinically relevant question that practitioners usually face is how to predict the future trajectory of a disease. Current methods for prognosis or disease trajectory forecasting often require domain knowledge and are complicated to apply. In this paper, we formulate the prognosis prediction problem as a one-to-many prediction problem. Inspired by a clinical decision-making process with two agents -- a radiologist and a general practitioner -- we predict prognosis with two transformer-based components that share information with each other. The first transformer in this framework aims to analyze the imaging data, and the second one leverages its internal states as inputs, also fusing them with auxiliary clinical data. The temporal nature of the problem is modeled within the transformer states, allowing us to treat the forecasting problem as a multi-task classification, for which we propose a novel loss. We show the effectiveness of our approach in predicting the development of structural knee osteoarthritis changes and forecasting Alzheimer's disease clinical status directly from raw multi-modal data. The proposed method outperforms multiple state-of-the-art baselines with respect to performance and calibration, both of which are needed for real-world applications. An open-source implementation of our method is made publicly available at https://github.com/Oulu-IMEDS/CLIMATv2.

  • 4 authors
·
Oct 25, 2022

Boosting EfficientNets Ensemble Performance via Pseudo-Labels and Synthetic Images by pix2pixHD for Infection and Ischaemia Classification in Diabetic Foot Ulcers

Diabetic foot ulcers are a common manifestation of lesions on the diabetic foot, a syndrome acquired as a long-term complication of diabetes mellitus. Accompanying neuropathy and vascular damage promote acquisition of pressure injuries and tissue death due to ischaemia. Affected areas are prone to infections, hindering the healing progress. The research at hand investigates an approach on classification of infection and ischaemia, conducted as part of the Diabetic Foot Ulcer Challenge (DFUC) 2021. Different models of the EfficientNet family are utilized in ensembles. An extension strategy for the training data is applied, involving pseudo-labeling for unlabeled images, and extensive generation of synthetic images via pix2pixHD to cope with severe class imbalances. The resulting extended training dataset features 8.68 times the size of the baseline and shows a real to synthetic image ratio of 1:3. Performances of models and ensembles trained on the baseline and extended training dataset are compared. Synthetic images featured a broad qualitative variety. Results show that models trained on the extended training dataset as well as their ensemble benefit from the large extension. F1-Scores for rare classes receive outstanding boosts, while those for common classes are either not harmed or boosted moderately. A critical discussion concretizes benefits and identifies limitations, suggesting improvements. The work concludes that classification performance of individual models as well as that of ensembles can be boosted utilizing synthetic images. Especially performance for rare classes benefits notably.

  • 3 authors
·
Nov 30, 2021

A medical coding language model trained on clinical narratives from a population-wide cohort of 1.8 million patients

Medical coding translates clinical documentation into standardized codes for billing, research, and public health, but manual coding is time-consuming and error-prone. Existing automation efforts rely on small datasets that poorly represent real-world patient heterogeneity. We trained a language model on 5.8 million electronic health records from 1.8 million patients across nearly all specialties in Eastern Denmark (2006--2016) to predict ICD-10 codes from clinical notes, medications, and laboratory results. Evaluated on 270,000 held-out patients, the model achieved a micro F1 of 71.8% and a top-10 recall of 95.5%. Performance varied by specialty (F1: 53--91%), with higher scores in specialties with well-defined diagnostic criteria. Codes appearing predominantly as secondary diagnoses had markedly lower F1 scores. For three such codes (suicide-related behaviors, weight disorders, and hypertension), the model identified thousands of uncoded cases, of which 76-86% were confirmed valid upon manual review, suggesting systematic under-coding rather than model error. These findings suggest under-coding of secondary diagnoses in Eastern Denmark during this period, with potential implications for epidemiological research, public health surveillance, and understanding of multimorbidity. Similar time constraints and reimbursement structures in other healthcare systems suggest this may not be isolated to this dataset. The model can automate coding for approximately 50% of cases and provide accurate suggestions for most others, and may offer a practical solution to help capture missed secondary conditions.

  • 6 authors
·
Mar 2

TransICD: Transformer Based Code-wise Attention Model for Explainable ICD Coding

International Classification of Disease (ICD) coding procedure which refers to tagging medical notes with diagnosis codes has been shown to be effective and crucial to the billing system in medical sector. Currently, ICD codes are assigned to a clinical note manually which is likely to cause many errors. Moreover, training skilled coders also requires time and human resources. Therefore, automating the ICD code determination process is an important task. With the advancement of artificial intelligence theory and computational hardware, machine learning approach has emerged as a suitable solution to automate this process. In this project, we apply a transformer-based architecture to capture the interdependence among the tokens of a document and then use a code-wise attention mechanism to learn code-specific representations of the entire document. Finally, they are fed to separate dense layers for corresponding code prediction. Furthermore, to handle the imbalance in the code frequency of clinical datasets, we employ a label distribution aware margin (LDAM) loss function. The experimental results on the MIMIC-III dataset show that our proposed model outperforms other baselines by a significant margin. In particular, our best setting achieves a micro-AUC score of 0.923 compared to 0.868 of bidirectional recurrent neural networks. We also show that by using the code-wise attention mechanism, the model can provide more insights about its prediction, and thus it can support clinicians to make reliable decisions. Our code is available online (https://github.com/biplob1ly/TransICD)

  • 3 authors
·
Mar 28, 2021

Medical Triage as Pairwise Ranking: A Benchmark for Urgency in Patient Portal Messages

Medical triage is the task of allocating medical resources and prioritizing patients based on medical need. This paper introduces the first large-scale public dataset for studying medical triage in the context of asynchronous outpatient portal messages. Our novel task formulation views patient message triage as a pairwise inference problem, where we train LLMs to choose `"which message is more medically urgent" in a head-to-head tournament-style re-sort of a physician's inbox. Our novel benchmark PMR-Bench contains 1569 unique messages and 2,000+ high-quality test pairs for pairwise medical urgency assessment alongside a scalable training data generation pipeline. PMR-Bench includes samples that contain both unstructured patient-written messages alongside real electronic health record (EHR) data, emulating a real-world medical triage scenario. We develop a novel automated data annotation strategy to provide LLMs with in-domain guidance on this task. The resulting data is used to train two model classes, UrgentReward and UrgentSFT, leveraging Bradley-Terry and next token prediction objective, respectively to perform pairwise urgency classification. We find that UrgentSFT achieves top performance on PMR-Bench, with UrgentReward showing distinct advantages in low-resource settings. For example, UrgentSFT-8B and UrgentReward-8B provide a 15- and 16-point boost, respectively, on inbox sorting metrics over off-the-shelf 8B models. Paper resources can be found at https://tinyurl.com/Patient-Message-Triage

  • 7 authors
·
Jan 19

SurgWound-Bench: A Benchmark for Surgical Wound Diagnosis

Surgical site infection (SSI) is one of the most common and costly healthcare-associated infections and and surgical wound care remains a significant clinical challenge in preventing SSIs and improving patient outcomes. While recent studies have explored the use of deep learning for preliminary surgical wound screening, progress has been hindered by concerns over data privacy and the high costs associated with expert annotation. Currently, no publicly available dataset or benchmark encompasses various types of surgical wounds, resulting in the absence of an open-source Surgical-Wound screening tool. To address this gap: (1) we present SurgWound, the first open-source dataset featuring a diverse array of surgical wound types. It contains 697 surgical wound images annotated by 3 professional surgeons with eight fine-grained clinical attributes. (2) Based on SurgWound, we introduce the first benchmark for surgical wound diagnosis, which includes visual question answering (VQA) and report generation tasks to comprehensively evaluate model performance. (3) Furthermore, we propose a three-stage learning framework, WoundQwen, for surgical wound diagnosis. In the first stage, we employ five independent MLLMs to accurately predict specific surgical wound characteristics. In the second stage, these predictions serve as additional knowledge inputs to two MLLMs responsible for diagnosing outcomes, which assess infection risk and guide subsequent interventions. In the third stage, we train a MLLM that integrates the diagnostic results from the previous two stages to produce a comprehensive report. This three-stage framework can analyze detailed surgical wound characteristics and provide subsequent instructions to patients based on surgical images, paving the way for personalized wound care, timely intervention, and improved patient outcomes.

  • 9 authors
·
Aug 20, 2025

Enhancing End Stage Renal Disease Outcome Prediction: A Multi-Sourced Data-Driven Approach

Objective: To improve prediction of Chronic Kidney Disease (CKD) progression to End Stage Renal Disease (ESRD) using machine learning (ML) and deep learning (DL) models applied to an integrated clinical and claims dataset of varying observation windows, supported by explainable AI (XAI) to enhance interpretability and reduce bias. Materials and Methods: We utilized data about 10,326 CKD patients, combining their clinical and claims information from 2009 to 2018. Following data preprocessing, cohort identification, and feature engineering, we evaluated multiple statistical, ML and DL models using data extracted from five distinct observation windows. Feature importance and Shapley value analysis were employed to understand key predictors. Models were tested for robustness, clinical relevance, misclassification errors and bias issues. Results: Integrated data models outperformed those using single data sources, with the Long Short-Term Memory (LSTM) model achieving the highest AUC (0.93) and F1 score (0.65). A 24-month observation window was identified as optimal for balancing early detection and prediction accuracy. The 2021 eGFR equation improved prediction accuracy and reduced racial bias, notably for African American patients. Discussion: Improved ESRD prediction accuracy, results interpretability and bias mitigation strategies presented in this study have the potential to significantly enhance CKD and ESRD management, support targeted early interventions and reduce healthcare disparities. Conclusion: This study presents a robust framework for predicting ESRD outcomes in CKD patients, improving clinical decision-making and patient care through multi-sourced, integrated data and AI/ML methods. Future research will expand data integration and explore the application of this framework to other chronic diseases.

  • 2 authors
·
Oct 1, 2024

Multi-resolution Networks For Flexible Irregular Time Series Modeling (Multi-FIT)

Missing values, irregularly collected samples, and multi-resolution signals commonly occur in multivariate time series data, making predictive tasks difficult. These challenges are especially prevalent in the healthcare domain, where patients' vital signs and electronic records are collected at different frequencies and have occasionally missing information due to the imperfections in equipment or patient circumstances. Researchers have handled each of these issues differently, often handling missing data through mean value imputation and then using sequence models over the multivariate signals while ignoring the different resolution of signals. We propose a unified model named Multi-resolution Flexible Irregular Time series Network (Multi-FIT). The building block for Multi-FIT is the FIT network. The FIT network creates an informative dense representation at each time step using signal information such as last observed value, time difference since the last observed time stamp and overall mean for the signal. Vertical FIT (FIT-V) is a variant of FIT which also models the relationship between different temporal signals while creating the informative dense representations for the signal. The multi-FIT model uses multiple FIT networks for sets of signals with different resolutions, further facilitating the construction of flexible representations. Our model has three main contributions: a.) it does not impute values but rather creates informative representations to provide flexibility to the model for creating task-specific representations b.) it models the relationship between different signals in the form of support signals c.) it models different resolutions in parallel before merging them for the final prediction task. The FIT, FIT-V and Multi-FIT networks improve upon the state-of-the-art models for three predictive tasks, including the forecasting of patient survival.

  • 7 authors
·
Apr 30, 2019

Evaluation of Popular XAI Applied to Clinical Prediction Models: Can They be Trusted?

The absence of transparency and explainability hinders the clinical adoption of Machine learning (ML) algorithms. Although various methods of explainable artificial intelligence (XAI) have been suggested, there is a lack of literature that delves into their practicality and assesses them based on criteria that could foster trust in clinical environments. To address this gap this study evaluates two popular XAI methods used for explaining predictive models in the healthcare context in terms of whether they (i) generate domain-appropriate representation, i.e. coherent with respect to the application task, (ii) impact clinical workflow and (iii) are consistent. To that end, explanations generated at the cohort and patient levels were analysed. The paper reports the first benchmarking of the XAI methods applied to risk prediction models obtained by evaluating the concordance between generated explanations and the trigger of a future clinical deterioration episode recorded by the data collection system. We carried out an analysis using two Electronic Medical Records (EMR) datasets sourced from Australian major hospitals. The findings underscore the limitations of state-of-the-art XAI methods in the clinical context and their potential benefits. We discuss these limitations and contribute to the theoretical development of trustworthy XAI solutions where clinical decision support guides the choice of intervention by suggesting the pattern or drivers for clinical deterioration in the future.

  • 5 authors
·
Jun 20, 2023

Deep Learning Segmentation of Ascites on Abdominal CT Scans for Automatic Volume Quantification

Purpose: To evaluate the performance of an automated deep learning method in detecting ascites and subsequently quantifying its volume in patients with liver cirrhosis and ovarian cancer. Materials and Methods: This retrospective study included contrast-enhanced and non-contrast abdominal-pelvic CT scans of patients with cirrhotic ascites and patients with ovarian cancer from two institutions, National Institutes of Health (NIH) and University of Wisconsin (UofW). The model, trained on The Cancer Genome Atlas Ovarian Cancer dataset (mean age, 60 years +/- 11 [s.d.]; 143 female), was tested on two internal (NIH-LC and NIH-OV) and one external dataset (UofW-LC). Its performance was measured by the Dice coefficient, standard deviations, and 95% confidence intervals, focusing on ascites volume in the peritoneal cavity. Results: On NIH-LC (25 patients; mean age, 59 years +/- 14 [s.d.]; 14 male) and NIH-OV (166 patients; mean age, 65 years +/- 9 [s.d.]; all female), the model achieved Dice scores of 0.855 +/- 0.061 (CI: 0.831-0.878) and 0.826 +/- 0.153 (CI: 0.764-0.887), with median volume estimation errors of 19.6% (IQR: 13.2-29.0) and 5.3% (IQR: 2.4-9.7) respectively. On UofW-LC (124 patients; mean age, 46 years +/- 12 [s.d.]; 73 female), the model had a Dice score of 0.830 +/- 0.107 (CI: 0.798-0.863) and median volume estimation error of 9.7% (IQR: 4.5-15.1). The model showed strong agreement with expert assessments, with r^2 values of 0.79, 0.98, and 0.97 across the test sets. Conclusion: The proposed deep learning method performed well in segmenting and quantifying the volume of ascites in concordance with expert radiologist assessments.

  • 7 authors
·
Jun 22, 2024

Interpretable Clinical Classification with Kolmogorov-Arnold Networks

The increasing use of machine learning in clinical decision support has been limited by the lack of transparency of many high-performing models. In clinical settings, predictions must be interpretable, auditable, and actionable. This study investigates Kolmogorov-Arnold Networks (KANs) as intrinsically interpretable alternatives to conventional black-box models for clinical classification of tabular health data, aiming to balance predictive performance with clinically meaningful transparency. We introduce two KAN-based models: the Logistic KAN, a flexible generalization of logistic regression, and the Kolmogorov-Arnold Additive Model (KAAM), an additive variant that yields transparent symbolic representations through feature-wise decomposability. Both models are evaluated on multiple public clinical datasets and compared with standard linear, tree-based, and neural baselines. Across all datasets, the proposed models achieve predictive performance comparable to or exceeding that of commonly used baselines while remaining fully interpretable. Logistic-KAN obtains the highest overall ranking across evaluation metrics, with a mean reciprocal rank of 0.76, indicating consistently strong performance across tasks. KAAM provides competitive accuracy while offering enhanced transparency through feature-wise decomposability, patient-level visualizations, and nearest-patient retrieval, enabling direct inspection of individual predictions. KAN-based models provide a practical and trustworthy alternative to black-box models for clinical classification, offering a strong balance between predictive performance and interpretability for clinical decision support. By enabling transparent, patient-level reasoning and clinically actionable insights, the proposed models represent a promising step toward trustworthy AI in healthcare (code: https://github.com/Patricia-A-Apellaniz/classification_with_kans).

  • 6 authors
·
Apr 8

Automated Coding of Under-Studied Medical Concept Domains: Linking Physical Activity Reports to the International Classification of Functioning, Disability, and Health

Linking clinical narratives to standardized vocabularies and coding systems is a key component of unlocking the information in medical text for analysis. However, many domains of medical concepts lack well-developed terminologies that can support effective coding of medical text. We present a framework for developing natural language processing (NLP) technologies for automated coding of under-studied types of medical information, and demonstrate its applicability via a case study on physical mobility function. Mobility is a component of many health measures, from post-acute care and surgical outcomes to chronic frailty and disability, and is coded in the International Classification of Functioning, Disability, and Health (ICF). However, mobility and other types of functional activity remain under-studied in medical informatics, and neither the ICF nor commonly-used medical terminologies capture functional status terminology in practice. We investigated two data-driven paradigms, classification and candidate selection, to link narrative observations of mobility to standardized ICF codes, using a dataset of clinical narratives from physical therapy encounters. Recent advances in language modeling and word embedding were used as features for established machine learning models and a novel deep learning approach, achieving a macro F-1 score of 84% on linking mobility activity reports to ICF codes. Both classification and candidate selection approaches present distinct strengths for automated coding in under-studied domains, and we highlight that the combination of (i) a small annotated data set; (ii) expert definitions of codes of interest; and (iii) a representative text corpus is sufficient to produce high-performing automated coding systems. This study has implications for the ongoing growth of NLP tools for a variety of specialized applications in clinical care and research.

  • 2 authors
·
Nov 27, 2020

The Patient is not a Moving Document: A World Model Training Paradigm for Longitudinal EHR

Large language models (LLMs) trained with next-word-prediction have achieved success as clinical foundation models. Representations from these language backbones yield strong linear probe performance across biomedical tasks, suggesting that patient semantics emerge from next-token prediction at scale. However, this paradigm treats patients as a document to be summarized rather than a dynamical system to be simulated; a patient's trajectory emerges from their state evolving under interventions and time, requiring models that simulate dynamics rather than predict tokens. To address this, we introduce SMB-Structure, a world model for structured EHR that grounds a joint-embedding prediction architecture (JEPA) with next-token prediction (SFT). SFT grounds our model to reconstruct future patient states in token space, while JEPA predicts those futures in latent space from the initial patient representation alone, forcing trajectory dynamics to be encoded before the next state is observed. We validate across two large-scale cohorts: Memorial Sloan Kettering (23,319 oncology patients; 323,000+ patient-years) and INSPECT (19,402 pulmonary embolism patients). Using a linear probe evaluated at multiple points along the disease trajectory, we demonstrate that our training paradigm learns embeddings that capture disease dynamics not recoverable by autoregressive baselines, enabling SMB-Structure to achieve competitive performance on complex tasks characterized by high patient heterogeneity. Model weights are available at https://huggingface.co/standardmodelbio/SMB-v1-1.7B-Structure.

  • 8 authors
·
Jan 29

Advancements in Machine Learning and Deep Learning for Early Detection and Management of Mental Health Disorder

For the early identification, diagnosis, and treatment of mental health illnesses, the integration of deep learning (DL) and machine learning (ML) has started playing a significant role. By evaluating complex data from imaging, genetics, and behavioral assessments, these technologies have the potential to significantly improve clinical outcomes. However, they also present unique challenges related to data integration and ethical issues. This survey reviews the development of ML and DL methods for the early diagnosis and treatment of mental health issues. It examines a range of applications, with a particular emphasis on behavioral assessments, genetic and biomarker analysis, and medical imaging for diagnosing diseases like depression, bipolar disorder, and schizophrenia. Predictive modeling for illness progression is further discussed, focusing on the role of risk prediction models and longitudinal studies. Key findings highlight how ML and DL can improve diagnostic accuracy and treatment outcomes while addressing methodological inconsistencies, data integration challenges, and ethical concerns. The study emphasizes the importance of building real-time monitoring systems for individualized treatment, enhancing data fusion techniques, and fostering interdisciplinary collaboration. Future research should focus on overcoming these obstacles to ensure the valuable and ethical application of ML and DL in mental health services.

  • 6 authors
·
Dec 8, 2024

CXR-LLaVA: Multimodal Large Language Model for Interpreting Chest X-ray Images

Purpose: Recent advancements in large language models (LLMs) have expanded their capabilities in a multimodal fashion, potentially replicating the image interpretation of human radiologists. This study aimed to develop open-source multimodal large language model for interpreting chest X-ray images (CXR-LLaVA). We also examined the effect of prompt engineering and model parameters such as temperature and nucleus sampling. Materials and Methods: For training, we collected 659,287 publicly available CXRs: 417,336 CXRs had labels for certain radiographic abnormalities (dataset 1); 241,951 CXRs provided free-text radiology reports (dataset 2). After pre-training the Resnet50 as an image encoder, the contrastive language-image pre-training was used to align CXRs and corresponding radiographic abnormalities. Then, the Large Language Model Meta AI-2 was fine-tuned using dataset 2, which were refined using GPT-4, with generating various question answering scenarios. The code can be found at https://github.com/ECOFRI/CXR_LLaVA. Results: In the test set, we observed that the model's performance fluctuated based on its parameters. On average, it achieved F1 score of 0.34 for five pathologic findings (atelectasis, cardiomegaly, consolidation, edema, and pleural effusion), which was improved to 0.46 through prompt engineering. In the independent set, the model achieved an average F1 score of 0.30 for the same pathologic findings. Notably, for the pediatric chest radiograph dataset, which was unseen during training, the model differentiated abnormal radiographs with an F1 score ranging from 0.84 to 0.85. Conclusion: CXR-LLaVA demonstrates promising potential in CXR interpretation. Both prompt engineering and model parameter adjustments can play pivotal roles in interpreting CXRs.

  • 4 authors
·
Oct 22, 2023

AD-BERT: Using Pre-trained contextualized embeddings to Predict the Progression from Mild Cognitive Impairment to Alzheimer's Disease

Objective: We develop a deep learning framework based on the pre-trained Bidirectional Encoder Representations from Transformers (BERT) model using unstructured clinical notes from electronic health records (EHRs) to predict the risk of disease progression from Mild Cognitive Impairment (MCI) to Alzheimer's Disease (AD). Materials and Methods: We identified 3657 patients diagnosed with MCI together with their progress notes from Northwestern Medicine Enterprise Data Warehouse (NMEDW) between 2000-2020. The progress notes no later than the first MCI diagnosis were used for the prediction. We first preprocessed the notes by deidentification, cleaning and splitting, and then pretrained a BERT model for AD (AD-BERT) based on the publicly available Bio+Clinical BERT on the preprocessed notes. The embeddings of all the sections of a patient's notes processed by AD-BERT were combined by MaxPooling to compute the probability of MCI-to-AD progression. For replication, we conducted a similar set of experiments on 2563 MCI patients identified at Weill Cornell Medicine (WCM) during the same timeframe. Results: Compared with the 7 baseline models, the AD-BERT model achieved the best performance on both datasets, with Area Under receiver operating characteristic Curve (AUC) of 0.8170 and F1 score of 0.4178 on NMEDW dataset and AUC of 0.8830 and F1 score of 0.6836 on WCM dataset. Conclusion: We developed a deep learning framework using BERT models which provide an effective solution for prediction of MCI-to-AD progression using clinical note analysis.

  • 12 authors
·
Nov 6, 2022