new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Apr 21

SiliconHealth: A Complete Low-Cost Blockchain Healthcare Infrastructure for Resource-Constrained Regions Using Repurposed Bitcoin Mining ASICs

This paper presents SiliconHealth, a comprehensive blockchain-based healthcare infrastructure designed for resource-constrained regions, particularly sub-Saharan Africa. We demonstrate that obsolete Bitcoin mining Application-Specific Integrated Circuits (ASICs) can be repurposed to create a secure, low-cost, and energy-efficient medical records system. The proposed architecture employs a four-tier hierarchical network: regional hospitals using Antminer S19 Pro (90+ TH/s), urban health centers with Antminer S9 (14 TH/s), rural clinics equipped with Lucky Miner LV06 (500 GH/s, 13W), and mobile health points with portable ASIC devices. We introduce the Deterministic Hardware Fingerprinting (DHF) paradigm, which repurposes SHA-256 mining ASICs as cryptographic proof generators, achieving 100% verification rate across 23 test proofs during 300-second validation sessions. The system incorporates Reed-Solomon LSB watermarking for medical image authentication with 30-40% damage tolerance, semantic Retrieval-Augmented Generation (RAG) for intelligent medical record queries, and offline synchronization protocols for intermittent connectivity. Economic analysis demonstrates 96% cost reduction compared to GPU-based alternatives, with total deployment cost of $847 per rural clinic including 5-year solar power infrastructure. Validation experiments on Lucky Miner LV06 (BM1366 chip, 5nm) achieve 2.93 MH/W efficiency and confirm hardware universality. This work establishes a practical framework for deploying verifiable, tamper-proof electronic health records in regions where traditional healthcare IT infrastructure is economically unfeasible, potentially benefiting over 600 million people lacking access to basic health information systems.

  • 3 authors
·
Jan 14

Data Storage in the Decentralized World: Blockchain and Derivatives

We have entered an era where the importance of decentralized solutions has become more obvious. Blockchain technology and its derivatives are distributed ledger technologies that keep the registry of data between peers of a network. This ledger is secured within a successive over looping cryptographic chain. The accomplishment of the Bitcoin cryptocurrency proved that blockchain technology and its derivatives could be used to eliminate intermediaries and provide security for cyberspace. However, there are some challenges in the implementation of blockchain technology. This chapter first explains the concept of blockchain technology and the data that we can store therein. The main advantage of blockchain is the security services that it provides. This section continues by describing these services.. The challenges of blockchain; blockchain anomalies, energy consumption, speed, scalability, interoperability, privacy and cryptology in the age of quantum computing are described. Selected solutions for these challenges are given. Remarkable derivatives of blockchain, which use different solutions (directed acyclic graph, distributed hash table, gossip consensus protocol) to solve some of these challenges are described. Then the data storage in blockchain and evolving data solutions are explained. The comparison of decentralized solutions with the lcentralized database systems is given. A multi-platform interoperable scalable architecture (MPISA) is proposed. In the conclusion we include the evolution assumptions of data storage in a decentralized world.

  • 2 authors
·
Dec 18, 2020

zkBridge: Trustless Cross-chain Bridges Made Practical

Blockchains have seen growing traction with cryptocurrencies reaching a market cap of over 1 trillion dollars, major institution investors taking interests, and global impacts on governments, businesses, and individuals. Also growing significantly is the heterogeneity of the ecosystem where a variety of blockchains co-exist. Cross-chain bridge is a necessary building block in this multi-chain ecosystem. Existing solutions, however, either suffer from performance issues or rely on trust assumptions of committees that significantly lower the security. Recurring attacks against bridges have cost users more than 1.5 billion USD. In this paper, we introduce zkBridge, an efficient cross-chain bridge that guarantees strong security without external trust assumptions. With succinct proofs, zkBridge not only guarantees correctness, but also significantly reduces on-chain verification cost. We propose novel succinct proof protocols that are orders-of-magnitude faster than existing solutions for workload in zkBridge. With a modular design, zkBridge enables a broad spectrum of use cases and capabilities, including message passing, token transferring, and other computational logic operating on state changes from different chains. To demonstrate the practicality of zkBridge, we implemented a prototype bridge from Cosmos to Ethereum, a particularly challenging direction that involves large proof circuits that existing systems cannot efficiently handle. Our evaluation shows that zkBridge achieves practical performance: proof generation takes less than 20 seconds, while verifying proofs on-chain costs less than 230K gas. For completeness, we also implemented and evaluated the direction from Ethereum to other EVM-compatible chains (such as BSC) which involves smaller circuits and incurs much less overhead.

  • 8 authors
·
Oct 1, 2022

Vision Language Models in Medicine

With the advent of Vision-Language Models (VLMs), medical artificial intelligence (AI) has experienced significant technological progress and paradigm shifts. This survey provides an extensive review of recent advancements in Medical Vision-Language Models (Med-VLMs), which integrate visual and textual data to enhance healthcare outcomes. We discuss the foundational technology behind Med-VLMs, illustrating how general models are adapted for complex medical tasks, and examine their applications in healthcare. The transformative impact of Med-VLMs on clinical practice, education, and patient care is highlighted, alongside challenges such as data scarcity, narrow task generalization, interpretability issues, and ethical concerns like fairness, accountability, and privacy. These limitations are exacerbated by uneven dataset distribution, computational demands, and regulatory hurdles. Rigorous evaluation methods and robust regulatory frameworks are essential for safe integration into healthcare workflows. Future directions include leveraging large-scale, diverse datasets, improving cross-modal generalization, and enhancing interpretability. Innovations like federated learning, lightweight architectures, and Electronic Health Record (EHR) integration are explored as pathways to democratize access and improve clinical relevance. This review aims to provide a comprehensive understanding of Med-VLMs' strengths and limitations, fostering their ethical and balanced adoption in healthcare.

  • 3 authors
·
Feb 24, 2025

Digital Twins for Patient Care via Knowledge Graphs and Closed-Form Continuous-Time Liquid Neural Networks

Digital twin technology has is anticipated to transform healthcare, enabling personalized medicines and support, earlier diagnoses, simulated treatment outcomes, and optimized surgical plans. Digital twins are readily gaining traction in industries like manufacturing, supply chain logistics, and civil infrastructure. Not in patient care, however. The challenge of modeling complex diseases with multimodal patient data and the computational complexities of analyzing it have stifled digital twin adoption in the biomedical vertical. Yet, these major obstacles can potentially be handled by approaching these models in a different way. This paper proposes a novel framework for addressing the barriers to clinical twin modeling created by computational costs and modeling complexities. We propose structuring patient health data as a knowledge graph and using closed-form continuous-time liquid neural networks, for real-time analytics. By synthesizing multimodal patient data and leveraging the flexibility and efficiency of closed form continuous time networks and knowledge graph ontologies, our approach enables real time insights, personalized medicine, early diagnosis and intervention, and optimal surgical planning. This novel approach provides a comprehensive and adaptable view of patient health along with real-time analytics, paving the way for digital twin simulations and other anticipated benefits in healthcare.

  • 1 authors
·
Jul 8, 2023

A Survey on Medical Large Language Models: Technology, Application, Trustworthiness, and Future Directions

With the advent of Large Language Models (LLMs), medical artificial intelligence (AI) has experienced substantial technological progress and paradigm shifts, highlighting the potential of LLMs to streamline healthcare delivery and improve patient outcomes. Considering this rapid technical progress, in this survey, we trace the recent advances of Medical Large Language Models (Med-LLMs), including the background, key findings, and mainstream techniques, especially for the evolution from general-purpose models to medical-specialized applications. Firstly, we delve into the foundational technology of Med-LLMs, indicating how general models can be progressively adapted and refined for the complicated medical tasks. Secondly, the wide-ranging applications of Med-LLMs are investigated across various healthcare domains, as well as an up-to-date review of existing Med-LLMs. The transformative impact of these models on daily medical practice is evident through their ability to assist clinicians, educators, and patients. Recognizing the importance of responsible innovation, we discuss the challenges associated with ensuring fairness, accountability, privacy, and robustness. Ethical considerations, rigorous evaluation methodologies, and the establishment of regulatory frameworks are crucial for building trustworthiness in the real-world system. We emphasize the need for ongoing scrutiny and development to maintain high standards of safety and reliability. Finally, we anticipate possible future trajectories for Med-LLMs, identifying key avenues for prudent expansion. By consolidating these insights, our review aims to provide professionals and researchers with a thorough understanding of the strengths and limitations of Med-LLMs, fostering a balanced and ethical approach to their integration into the healthcare ecosystem.

  • 9 authors
·
Jun 5, 2024

The Aloe Family Recipe for Open and Specialized Healthcare LLMs

Purpose: With advancements in Large Language Models (LLMs) for healthcare, the need arises for competitive open-source models to protect the public interest. This work contributes to the field of open medical LLMs by optimizing key stages of data preprocessing and training, while showing how to improve model safety (through DPO) and efficacy (through RAG). The evaluation methodology used, which includes four different types of tests, defines a new standard for the field. The resultant models, shown to be competitive with the best private alternatives, are released with a permisive license. Methods: Building on top of strong base models like Llama 3.1 and Qwen 2.5, Aloe Beta uses a custom dataset to enhance public data with synthetic Chain of Thought examples. The models undergo alignment with Direct Preference Optimization, emphasizing ethical and policy-aligned performance in the presence of jailbreaking attacks. Evaluation includes close-ended, open-ended, safety and human assessments, to maximize the reliability of results. Results: Recommendations are made across the entire pipeline, backed by the solid performance of the Aloe Family. These models deliver competitive performance across healthcare benchmarks and medical fields, and are often preferred by healthcare professionals. On bias and toxicity, the Aloe Beta models significantly improve safety, showing resilience to unseen jailbreaking attacks. For a responsible release, a detailed risk assessment specific to healthcare is attached to the Aloe Family models. Conclusion: The Aloe Beta models, and the recipe that leads to them, are a significant contribution to the open-source medical LLM field, offering top-of-the-line performance while maintaining high ethical requirements. This work sets a new standard for developing and reporting aligned LLMs in healthcare.

  • 13 authors
·
May 7, 2025 2

Fidelity and Privacy of Synthetic Medical Data

The digitization of medical records ushered in a new era of big data to clinical science, and with it the possibility that data could be shared, to multiply insights beyond what investigators could abstract from paper records. The need to share individual-level medical data to accelerate innovation in precision medicine continues to grow, and has never been more urgent, as scientists grapple with the COVID-19 pandemic. However, enthusiasm for the use of big data has been tempered by a fully appropriate concern for patient autonomy and privacy. That is, the ability to extract private or confidential information about an individual, in practice, renders it difficult to share data, since significant infrastructure and data governance must be established before data can be shared. Although HIPAA provided de-identification as an approved mechanism for data sharing, linkage attacks were identified as a major vulnerability. A variety of mechanisms have been established to avoid leaking private information, such as field suppression or abstraction, strictly limiting the amount of information that can be shared, or employing mathematical techniques such as differential privacy. Another approach, which we focus on here, is creating synthetic data that mimics the underlying data. For synthetic data to be a useful mechanism in support of medical innovation and a proxy for real-world evidence, one must demonstrate two properties of the synthetic dataset: (1) any analysis on the real data must be matched by analysis of the synthetic data (statistical fidelity) and (2) the synthetic data must preserve privacy, with minimal risk of re-identification (privacy guarantee). In this paper we propose a framework for quantifying the statistical fidelity and privacy preservation properties of synthetic datasets and demonstrate these metrics for synthetic data generated by Syntegra technology.

  • 2 authors
·
Jan 18, 2021

CLINIC: Evaluating Multilingual Trustworthiness in Language Models for Healthcare

Integrating language models (LMs) in healthcare systems holds great promise for improving medical workflows and decision-making. However, a critical barrier to their real-world adoption is the lack of reliable evaluation of their trustworthiness, especially in multilingual healthcare settings. Existing LMs are predominantly trained in high-resource languages, making them ill-equipped to handle the complexity and diversity of healthcare queries in mid- and low-resource languages, posing significant challenges for deploying them in global healthcare contexts where linguistic diversity is key. In this work, we present CLINIC, a Comprehensive Multilingual Benchmark to evaluate the trustworthiness of language models in healthcare. CLINIC systematically benchmarks LMs across five key dimensions of trustworthiness: truthfulness, fairness, safety, robustness, and privacy, operationalized through 18 diverse tasks, spanning 15 languages (covering all the major continents), and encompassing a wide array of critical healthcare topics like disease conditions, preventive actions, diagnostic tests, treatments, surgeries, and medications. Our extensive evaluation reveals that LMs struggle with factual correctness, demonstrate bias across demographic and linguistic groups, and are susceptible to privacy breaches and adversarial attacks. By highlighting these shortcomings, CLINIC lays the foundation for enhancing the global reach and safety of LMs in healthcare across diverse languages.

UVASDS UVA Data Science
·
Dec 12, 2025 2

Building Flexible, Scalable, and Machine Learning-ready Multimodal Oncology Datasets

The advancements in data acquisition, storage, and processing techniques have resulted in the rapid growth of heterogeneous medical data. Integrating radiological scans, histopathology images, and molecular information with clinical data is essential for developing a holistic understanding of the disease and optimizing treatment. The need for integrating data from multiple sources is further pronounced in complex diseases such as cancer for enabling precision medicine and personalized treatments. This work proposes Multimodal Integration of Oncology Data System (MINDS) - a flexible, scalable, and cost-effective metadata framework for efficiently fusing disparate data from public sources such as the Cancer Research Data Commons (CRDC) into an interconnected, patient-centric framework. MINDS offers an interface for exploring relationships across data types and building cohorts for developing large-scale multimodal machine learning models. By harmonizing multimodal data, MINDS aims to potentially empower researchers with greater analytical ability to uncover diagnostic and prognostic insights and enable evidence-based personalized care. MINDS tracks granular end-to-end data provenance, ensuring reproducibility and transparency. The cloud-native architecture of MINDS can handle exponential data growth in a secure, cost-optimized manner while ensuring substantial storage optimization, replication avoidance, and dynamic access capabilities. Auto-scaling, access controls, and other mechanisms guarantee pipelines' scalability and security. MINDS overcomes the limitations of existing biomedical data silos via an interoperable metadata-driven approach that represents a pivotal step toward the future of oncology data integration.

  • 5 authors
·
Sep 30, 2023

Benchmarking emergency department triage prediction models with machine learning and large public electronic health records

The demand for emergency department (ED) services is increasing across the globe, particularly during the current COVID-19 pandemic. Clinical triage and risk assessment have become increasingly challenging due to the shortage of medical resources and the strain on hospital infrastructure caused by the pandemic. As a result of the widespread use of electronic health records (EHRs), we now have access to a vast amount of clinical data, which allows us to develop predictive models and decision support systems to address these challenges. To date, however, there are no widely accepted benchmark ED triage prediction models based on large-scale public EHR data. An open-source benchmarking platform would streamline research workflows by eliminating cumbersome data preprocessing, and facilitate comparisons among different studies and methodologies. In this paper, based on the Medical Information Mart for Intensive Care IV Emergency Department (MIMIC-IV-ED) database, we developed a publicly available benchmark suite for ED triage predictive models and created a benchmark dataset that contains over 400,000 ED visits from 2011 to 2019. We introduced three ED-based outcomes (hospitalization, critical outcomes, and 72-hour ED reattendance) and implemented a variety of popular methodologies, ranging from machine learning methods to clinical scoring systems. We evaluated and compared the performance of these methods against benchmark tasks. Our codes are open-source, allowing anyone with MIMIC-IV-ED data access to perform the same steps in data processing, benchmark model building, and experiments. This study provides future researchers with insights, suggestions, and protocols for managing raw data and developing risk triaging tools for emergency care.

  • 13 authors
·
Nov 22, 2021

A Survey of Large Language Models for Healthcare: from Data, Technology, and Applications to Accountability and Ethics

The utilization of large language models (LLMs) in the Healthcare domain has generated both excitement and concern due to their ability to effectively respond to freetext queries with certain professional knowledge. This survey outlines the capabilities of the currently developed LLMs for Healthcare and explicates their development process, with the aim of providing an overview of the development roadmap from traditional Pretrained Language Models (PLMs) to LLMs. Specifically, we first explore the potential of LLMs to enhance the efficiency and effectiveness of various Healthcare applications highlighting both the strengths and limitations. Secondly, we conduct a comparison between the previous PLMs and the latest LLMs, as well as comparing various LLMs with each other. Then we summarize related Healthcare training data, training methods, optimization strategies, and usage. Finally, the unique concerns associated with deploying LLMs in Healthcare settings are investigated, particularly regarding fairness, accountability, transparency and ethics. Our survey provide a comprehensive investigation from perspectives of both computer science and Healthcare specialty. Besides the discussion about Healthcare concerns, we supports the computer science community by compiling a collection of open source resources, such as accessible datasets, the latest methodologies, code implementations, and evaluation benchmarks in the Github. Summarily, we contend that a significant paradigm shift is underway, transitioning from PLMs to LLMs. This shift encompasses a move from discriminative AI approaches to generative AI approaches, as well as a shift from model-centered methodologies to datacentered methodologies.

  • 7 authors
·
Oct 9, 2023 1

D-VRE: From a Jupyter-enabled Private Research Environment to Decentralized Collaborative Research Ecosystem

Today, scientific research is increasingly data-centric and compute-intensive, relying on data and models across distributed sources. However, it still faces challenges in the traditional cooperation mode, due to the high storage and computing cost, geo-location barriers, and local confidentiality regulations. The Jupyter environment has recently emerged and evolved as a vital virtual research environment for scientific computing, which researchers can use to scale computational analyses up to larger datasets and high-performance computing resources. Nevertheless, existing approaches lack robust support of a decentralized cooperation mode to unlock the full potential of decentralized collaborative scientific research, e.g., seamlessly secure data sharing. In this work, we change the basic structure and legacy norms of current research environments via the seamless integration of Jupyter with Ethereum blockchain capabilities. As such, it creates a Decentralized Virtual Research Environment (D-VRE) from private computational notebooks to decentralized collaborative research ecosystem. We propose a novel architecture for the D-VRE and prototype some essential D-VRE elements for enabling secure data sharing with decentralized identity, user-centric agreement-making, membership, and research asset management. To validate our method, we conducted an experimental study to test all functionalities of D-VRE smart contracts and their gas consumption. In addition, we deployed the D-VRE prototype on a test net of the Ethereum blockchain for demonstration. The feedback from the studies showcases the current prototype's usability, ease of use, and potential and suggests further improvements.

  • 4 authors
·
May 24, 2024

Question Answering on Patient Medical Records with Private Fine-Tuned LLMs

Healthcare systems continuously generate vast amounts of electronic health records (EHRs), commonly stored in the Fast Healthcare Interoperability Resources (FHIR) standard. Despite the wealth of information in these records, their complexity and volume make it difficult for users to retrieve and interpret crucial health insights. Recent advances in Large Language Models (LLMs) offer a solution, enabling semantic question answering (QA) over medical data, allowing users to interact with their health records more effectively. However, ensuring privacy and compliance requires edge and private deployments of LLMs. This paper proposes a novel approach to semantic QA over EHRs by first identifying the most relevant FHIR resources for a user query (Task1) and subsequently answering the query based on these resources (Task2). We explore the performance of privately hosted, fine-tuned LLMs, evaluating them against benchmark models such as GPT-4 and GPT-4o. Our results demonstrate that fine-tuned LLMs, while 250x smaller in size, outperform GPT-4 family models by 0.55% in F1 score on Task1 and 42% on Meteor Task in Task2. Additionally, we examine advanced aspects of LLM usage, including sequential fine-tuning, model self-evaluation (narcissistic evaluation), and the impact of training data size on performance. The models and datasets are available here: https://huggingface.co/genloop

  • 2 authors
·
Jan 23, 2025 2

Generalist Foundation Models Are Not Clinical Enough for Hospital Operations

Hospitals and healthcare systems rely on operational decisions that determine patient flow, cost, and quality of care. Despite strong performance on medical knowledge and conversational benchmarks, foundation models trained on general text may lack the specialized knowledge required for these operational decisions. We introduce Lang1, a family of models (100M-7B parameters) pretrained on a specialized corpus blending 80B clinical tokens from NYU Langone Health's EHRs and 627B tokens from the internet. To rigorously evaluate Lang1 in real-world settings, we developed the REalistic Medical Evaluation (ReMedE), a benchmark derived from 668,331 EHR notes that evaluates five critical tasks: 30-day readmission prediction, 30-day mortality prediction, length of stay, comorbidity coding, and predicting insurance claims denial. In zero-shot settings, both general-purpose and specialized models underperform on four of five tasks (36.6%-71.7% AUROC), with mortality prediction being an exception. After finetuning, Lang1-1B outperforms finetuned generalist models up to 70x larger and zero-shot models up to 671x larger, improving AUROC by 3.64%-6.75% and 1.66%-23.66% respectively. We also observed cross-task scaling with joint finetuning on multiple tasks leading to improvement on other tasks. Lang1-1B effectively transfers to out-of-distribution settings, including other clinical tasks and an external health system. Our findings suggest that predictive capabilities for hospital operations require explicit supervised finetuning, and that this finetuning process is made more efficient by in-domain pretraining on EHR. Our findings support the emerging view that specialized LLMs can compete with generalist models in specialized tasks, and show that effective healthcare systems AI requires the combination of in-domain pretraining, supervised finetuning, and real-world evaluation beyond proxy benchmarks.

newyorkuniversity New York University
·
Nov 17, 2025 3

Clinical Evaluation of Medical Image Synthesis: A Case Study in Wireless Capsule Endoscopy

Synthetic Data Generation (SDG) based on Artificial Intelligence (AI) can transform the way clinical medicine is delivered by overcoming privacy barriers that currently render clinical data sharing difficult. This is the key to accelerating the development of digital tools contributing to enhanced patient safety. Such tools include robust data-driven clinical decision support systems, and example-based digital training tools that will enable healthcare professionals to improve their diagnostic performance for enhanced patient safety. This study focuses on the clinical evaluation of medical SDG, with a proof-of-concept investigation on diagnosing Inflammatory Bowel Disease (IBD) using Wireless Capsule Endoscopy (WCE) images. Its scientific contributions include a) a novel protocol for the systematic Clinical Evaluation of Medical Image Synthesis (CEMIS); b) a novel variational autoencoder-based model for the generation of high-resolution synthetic WCE images; and c) a comprehensive evaluation of the synthetic images using the CEMIS protocol by 10 international WCE specialists, in terms of image quality, diversity, and realism, as well as their utility for clinical decision-making. The results show that TIDE-II generates clinically plausible, very realistic WCE images, of improved quality compared to relevant state-of-the-art generative models. Concludingly, CEMIS can serve as a reference for future research on medical image-generation techniques, while the adaptation/extension of the architecture of TIDE-II to other imaging domains can be promising.

  • 13 authors
·
Oct 31, 2024

Efficient and Personalized Mobile Health Event Prediction via Small Language Models

Healthcare monitoring is crucial for early detection, timely intervention, and the ongoing management of health conditions, ultimately improving individuals' quality of life. Recent research shows that Large Language Models (LLMs) have demonstrated impressive performance in supporting healthcare tasks. However, existing LLM-based healthcare solutions typically rely on cloud-based systems, which raise privacy concerns and increase the risk of personal information leakage. As a result, there is growing interest in running these models locally on devices like mobile phones and wearables to protect users' privacy. Small Language Models (SLMs) are potential candidates to solve privacy and computational issues, as they are more efficient and better suited for local deployment. However, the performance of SLMs in healthcare domains has not yet been investigated. This paper examines the capability of SLMs to accurately analyze health data, such as steps, calories, sleep minutes, and other vital statistics, to assess an individual's health status. Our results show that, TinyLlama, which has 1.1 billion parameters, utilizes 4.31 GB memory, and has 0.48s latency, showing the best performance compared other four state-of-the-art (SOTA) SLMs on various healthcare applications. Our results indicate that SLMs could potentially be deployed on wearable or mobile devices for real-time health monitoring, providing a practical solution for efficient and privacy-preserving healthcare.

  • 4 authors
·
Sep 16, 2024

MedBench v4: A Robust and Scalable Benchmark for Evaluating Chinese Medical Language Models, Multimodal Models, and Intelligent Agents

Recent advances in medical large language models (LLMs), multimodal models, and agents demand evaluation frameworks that reflect real clinical workflows and safety constraints. We present MedBench v4, a nationwide, cloud-based benchmarking infrastructure comprising over 700,000 expert-curated tasks spanning 24 primary and 91 secondary specialties, with dedicated tracks for LLMs, multimodal models, and agents. Items undergo multi-stage refinement and multi-round review by clinicians from more than 500 institutions, and open-ended responses are scored by an LLM-as-a-judge calibrated to human ratings. We evaluate 15 frontier models. Base LLMs reach a mean overall score of 54.1/100 (best: Claude Sonnet 4.5, 62.5/100), but safety and ethics remain low (18.4/100). Multimodal models perform worse overall (mean 47.5/100; best: GPT-5, 54.9/100), with solid perception yet weaker cross-modal reasoning. Agents built on the same backbones substantially improve end-to-end performance (mean 79.8/100), with Claude Sonnet 4.5-based agents achieving up to 85.3/100 overall and 88.9/100 on safety tasks. MedBench v4 thus reveals persisting gaps in multimodal reasoning and safety for base models, while showing that governance-aware agentic orchestration can markedly enhance benchmarked clinical readiness without sacrificing capability. By aligning tasks with Chinese clinical guidelines and regulatory priorities, the platform offers a practical reference for hospitals, developers, and policymakers auditing medical AI.

  • 18 authors
·
Nov 18, 2025

The Role of AI in Early Detection of Life-Threatening Diseases: A Retinal Imaging Perspective

Retinal imaging has emerged as a powerful, non-invasive modality for detecting and quantifying biomarkers of systemic diseases-ranging from diabetes and hypertension to Alzheimer's disease and cardiovascular disorders but current insights remain dispersed across platforms and specialties. Recent technological advances in optical coherence tomography (OCT/OCTA) and adaptive optics (AO) now deliver ultra-high-resolution scans (down to 5 {\mu}m ) with superior contrast and spatial integration, allowing early identification of microvascular abnormalities and neurodegenerative changes. At the same time, AI-driven and machine learning (ML) algorithms have revolutionized the analysis of large-scale retinal datasets, increasing sensitivity and specificity; for example, deep learning models achieve > 90 \% sensitivity for diabetic retinopathy and AUC = 0.89 for the prediction of cardiovascular risk from fundus photographs. The proliferation of mobile health technologies and telemedicine platforms further extends access, reduces costs, and facilitates community-based screening and longitudinal monitoring. Despite these breakthroughs, translation into routine practice is hindered by heterogeneous imaging protocols, limited external validation of AI models, and integration challenges within clinical workflows. In this review, we systematically synthesize the latest OCT/OCT and AO developments, AI/ML approaches, and mHealth/Tele-ophthalmology initiatives and quantify their diagnostic performance across disease domains. Finally, we propose a roadmap for multicenter protocol standardization, prospective validation trials, and seamless incorporation of retinal screening into primary and specialty care pathways-paving the way for precision prevention, early intervention, and ongoing treatment of life-threatening systemic diseases.

  • 3 authors
·
May 27, 2025

Proof-of-Contribution-Based Design for Collaborative Machine Learning on Blockchain

We consider a project (model) owner that would like to train a model by utilizing the local private data and compute power of interested data owners, i.e., trainers. Our goal is to design a data marketplace for such decentralized collaborative/federated learning applications that simultaneously provides i) proof-of-contribution based reward allocation so that the trainers are compensated based on their contributions to the trained model; ii) privacy-preserving decentralized model training by avoiding any data movement from data owners; iii) robustness against malicious parties (e.g., trainers aiming to poison the model); iv) verifiability in the sense that the integrity, i.e., correctness, of all computations in the data market protocol including contribution assessment and outlier detection are verifiable through zero-knowledge proofs; and v) efficient and universal design. We propose a blockchain-based marketplace design to achieve all five objectives mentioned above. In our design, we utilize a distributed storage infrastructure and an aggregator aside from the project owner and the trainers. The aggregator is a processing node that performs certain computations, including assessing trainer contributions, removing outliers, and updating hyper-parameters. We execute the proposed data market through a blockchain smart contract. The deployed smart contract ensures that the project owner cannot evade payment, and honest trainers are rewarded based on their contributions at the end of training. Finally, we implement the building blocks of the proposed data market and demonstrate their applicability in practical scenarios through extensive experiments.

  • 8 authors
·
Feb 27, 2023

Blockchain-Based Federated Learning: Incentivizing Data Sharing and Penalizing Dishonest Behavior

With the increasing importance of data sharing for collaboration and innovation, it is becoming more important to ensure that data is managed and shared in a secure and trustworthy manner. Data governance is a common approach to managing data, but it faces many challenges such as data silos, data consistency, privacy, security, and access control. To address these challenges, this paper proposes a comprehensive framework that integrates data trust in federated learning with InterPlanetary File System, blockchain, and smart contracts to facilitate secure and mutually beneficial data sharing while providing incentives, access control mechanisms, and penalizing any dishonest behavior. The experimental results demonstrate that the proposed model is effective in improving the accuracy of federated learning models while ensuring the security and fairness of the data-sharing process. The research paper also presents a decentralized federated learning platform that successfully trained a CNN model on the MNIST dataset using blockchain technology. The platform enables multiple workers to train the model simultaneously while maintaining data privacy and security. The decentralized architecture and use of blockchain technology allow for efficient communication and coordination between workers. This platform has the potential to facilitate decentralized machine learning and support privacy-preserving collaboration in various domains.

  • 6 authors
·
Jul 19, 2023

A Comprehensive Benchmark for COVID-19 Predictive Modeling Using Electronic Health Records in Intensive Care

The COVID-19 pandemic has posed a heavy burden to the healthcare system worldwide and caused huge social disruption and economic loss. Many deep learning models have been proposed to conduct clinical predictive tasks such as mortality prediction for COVID-19 patients in intensive care units using Electronic Health Record (EHR) data. Despite their initial success in certain clinical applications, there is currently a lack of benchmarking results to achieve a fair comparison so that we can select the optimal model for clinical use. Furthermore, there is a discrepancy between the formulation of traditional prediction tasks and real-world clinical practice in intensive care. To fill these gaps, we propose two clinical prediction tasks, Outcome-specific length-of-stay prediction and Early mortality prediction for COVID-19 patients in intensive care units. The two tasks are adapted from the naive length-of-stay and mortality prediction tasks to accommodate the clinical practice for COVID-19 patients. We propose fair, detailed, open-source data-preprocessing pipelines and evaluate 17 state-of-the-art predictive models on two tasks, including 5 machine learning models, 6 basic deep learning models and 6 deep learning predictive models specifically designed for EHR data. We provide benchmarking results using data from two real-world COVID-19 EHR datasets. One dataset is publicly available without needing any inquiry and another dataset can be accessed on request. We provide fair, reproducible benchmarking results for two tasks. We deploy all experiment results and models on an online platform. We also allow clinicians and researchers to upload their data to the platform and get quick prediction results using our trained models. We hope our efforts can further facilitate deep learning and machine learning research for COVID-19 predictive modeling.

  • 7 authors
·
Sep 16, 2022

Multimodal Deep Learning for Low-Resource Settings: A Vector Embedding Alignment Approach for Healthcare Applications

Large-scale multi-modal deep learning models have revolutionized domains such as healthcare, highlighting the importance of computational power. However, in resource-constrained regions like Low and Middle-Income Countries (LMICs), limited access to GPUs and data poses significant challenges, often leaving CPUs as the sole resource. To address this, we advocate for leveraging vector embeddings to enable flexible and efficient computational methodologies, democratizing multimodal deep learning across diverse contexts. Our paper investigates the efficiency and effectiveness of using vector embeddings from single-modal foundation models and multi-modal Vision-Language Models (VLMs) for multimodal deep learning in low-resource environments, particularly in healthcare. Additionally, we propose a simple yet effective inference-time method to enhance performance by aligning image-text embeddings. Comparing these approaches with traditional methods, we assess their impact on computational efficiency and model performance using metrics like accuracy, F1-score, inference time, training time, and memory usage across three medical modalities: BRSET (ophthalmology), HAM10000 (dermatology), and SatelliteBench (public health). Our findings show that embeddings reduce computational demands without compromising model performance. Furthermore, our alignment method improves performance in medical tasks. This research promotes sustainable AI practices by optimizing resources in constrained environments, highlighting the potential of embedding-based approaches for efficient multimodal learning. Vector embeddings democratize multimodal deep learning in LMICs, particularly in healthcare, enhancing AI adaptability in varied use cases.

  • 6 authors
·
Jun 1, 2024

EHRMamba: Towards Generalizable and Scalable Foundation Models for Electronic Health Records

Transformers have significantly advanced the modeling of Electronic Health Records (EHR), yet their deployment in real-world healthcare is limited by several key challenges. Firstly, the quadratic computational cost and insufficient context length of these models pose significant obstacles for hospitals in processing the extensive medical histories typical in EHR data. Additionally, existing models employ separate finetuning for each clinical task, complicating maintenance in healthcare environments. Moreover, these models focus exclusively on either clinical prediction or EHR forecasting, lacking the flexibility to perform well across both. To overcome these limitations, we introduce EHRMamba, a robust foundation model built on the Mamba architecture. EHRMamba can process sequences up to four times longer than previous models due to its linear computational cost. We also introduce a novel approach to Multitask Prompted Finetuning (MTF) for EHR data, which enables EHRMamba to simultaneously learn multiple clinical tasks in a single finetuning phase, significantly enhancing deployment and cross-task generalization. Furthermore, our model leverages the HL7 FHIR data standard to simplify integration into existing hospital systems. Alongside EHRMamba, we open-source Odyssey, a toolkit designed to support the development and deployment of EHR foundation models, with an emphasis on data standardization and interpretability. Our evaluations on the MIMIC-IV dataset demonstrate that EHRMamba advances state-of-the-art performance across 6 major clinical tasks and excels in EHR forecasting, marking a significant leap forward in the field.

  • 4 authors
·
May 23, 2024

EHRSHOT: An EHR Benchmark for Few-Shot Evaluation of Foundation Models

While the general machine learning (ML) community has benefited from public datasets, tasks, and models, the progress of ML in healthcare has been hampered by a lack of such shared assets. The success of foundation models creates new challenges for healthcare ML by requiring access to shared pretrained models to validate performance benefits. We help address these challenges through three contributions. First, we publish a new dataset, EHRSHOT, which contains deidentified structured data from the electronic health records (EHRs) of 6,739 patients from Stanford Medicine. Unlike MIMIC-III/IV and other popular EHR datasets, EHRSHOT is longitudinal and not restricted to ICU/ED patients. Second, we publish the weights of CLMBR-T-base, a 141M parameter clinical foundation model pretrained on the structured EHR data of 2.57M patients. We are one of the first to fully release such a model for coded EHR data; in contrast, most prior models released for clinical data (e.g. GatorTron, ClinicalBERT) only work with unstructured text and cannot process the rich, structured data within an EHR. We provide an end-to-end pipeline for the community to validate and build upon its performance. Third, we define 15 few-shot clinical prediction tasks, enabling evaluation of foundation models on benefits such as sample efficiency and task adaptation. Our model and dataset are available via a research data use agreement from the Stanford AIMI Center. Code to reproduce our results are available at our Github repo: https://github.com/som-shahlab/ehrshot-benchmark

  • 5 authors
·
Jul 5, 2023

CP-Env: Evaluating Large Language Models on Clinical Pathways in a Controllable Hospital Environment

Medical care follows complex clinical pathways that extend beyond isolated physician-patient encounters, emphasizing decision-making and transitions between different stages. Current benchmarks focusing on static exams or isolated dialogues inadequately evaluate large language models (LLMs) in dynamic clinical scenarios. We introduce CP-Env, a controllable agentic hospital environment designed to evaluate LLMs across end-to-end clinical pathways. CP-Env simulates a hospital ecosystem with patient and physician agents, constructing scenarios ranging from triage and specialist consultation to diagnostic testing and multidisciplinary team meetings for agent interaction. Following real hospital adaptive flow of healthcare, it enables branching, long-horizon task execution. We propose a three-tiered evaluation framework encompassing Clinical Efficacy, Process Competency, and Professional Ethics. Results reveal that most models struggle with pathway complexity, exhibiting hallucinations and losing critical diagnostic details. Interestingly, excessive reasoning steps can sometimes prove counterproductive, while top models tend to exhibit reduced tool dependency through internalized knowledge. CP-Env advances medical AI agents development through comprehensive end-to-end clinical evaluation. We provide the benchmark and evaluation tools for further research and development at https://github.com/SPIRAL-MED/CP_ENV.

  • 8 authors
·
Dec 10, 2025

Hippocrates: An Open-Source Framework for Advancing Large Language Models in Healthcare

The integration of Large Language Models (LLMs) into healthcare promises to transform medical diagnostics, research, and patient care. Yet, the progression of medical LLMs faces obstacles such as complex training requirements, rigorous evaluation demands, and the dominance of proprietary models that restrict academic exploration. Transparent, comprehensive access to LLM resources is essential for advancing the field, fostering reproducibility, and encouraging innovation in healthcare AI. We present Hippocrates, an open-source LLM framework specifically developed for the medical domain. In stark contrast to previous efforts, it offers unrestricted access to its training datasets, codebase, checkpoints, and evaluation protocols. This open approach is designed to stimulate collaborative research, allowing the community to build upon, refine, and rigorously evaluate medical LLMs within a transparent ecosystem. Also, we introduce Hippo, a family of 7B models tailored for the medical domain, fine-tuned from Mistral and LLaMA2 through continual pre-training, instruction tuning, and reinforcement learning from human and AI feedback. Our models outperform existing open medical LLMs models by a large-margin, even surpassing models with 70B parameters. Through Hippocrates, we aspire to unlock the full potential of LLMs not just to advance medical knowledge and patient care but also to democratize the benefits of AI research in healthcare, making them available across the globe.

  • 7 authors
·
Apr 25, 2024

LLMs-in-the-Loop Part 2: Expert Small AI Models for Anonymization and De-identification of PHI Across Multiple Languages

The rise of chronic diseases and pandemics like COVID-19 has emphasized the need for effective patient data processing while ensuring privacy through anonymization and de-identification of protected health information (PHI). Anonymized data facilitates research without compromising patient confidentiality. This paper introduces expert small AI models developed using the LLM-in-the-loop methodology to meet the demand for domain-specific de-identification NER models. These models overcome the privacy risks associated with large language models (LLMs) used via APIs by eliminating the need to transmit or store sensitive data. More importantly, they consistently outperform LLMs in de-identification tasks, offering superior performance and reliability. Our de-identification NER models, developed in eight languages (English, German, Italian, French, Romanian, Turkish, Spanish, and Arabic) achieved f1-micro score averages of 0.966, 0.975, 0.976, 0.970, 0.964, 0.974, 0.978, and 0.953 respectively. These results establish them as the most accurate healthcare anonymization solutions, surpassing existing small models and even general-purpose LLMs such as GPT-4o. While Part-1 of this series introduced the LLM-in-the-loop methodology for bio-medical document translation, this second paper showcases its success in developing cost-effective expert small NER models in de-identification tasks. Our findings lay the groundwork for future healthcare AI innovations, including biomedical entity and relation extraction, demonstrating the value of specialized models for domain-specific challenges.

  • 3 authors
·
Dec 14, 2024

GenHPF: General Healthcare Predictive Framework with Multi-task Multi-source Learning

Despite the remarkable progress in the development of predictive models for healthcare, applying these algorithms on a large scale has been challenging. Algorithms trained on a particular task, based on specific data formats available in a set of medical records, tend to not generalize well to other tasks or databases in which the data fields may differ. To address this challenge, we propose General Healthcare Predictive Framework (GenHPF), which is applicable to any EHR with minimal preprocessing for multiple prediction tasks. GenHPF resolves heterogeneity in medical codes and schemas by converting EHRs into a hierarchical textual representation while incorporating as many features as possible. To evaluate the efficacy of GenHPF, we conduct multi-task learning experiments with single-source and multi-source settings, on three publicly available EHR datasets with different schemas for 12 clinically meaningful prediction tasks. Our framework significantly outperforms baseline models that utilize domain knowledge in multi-source learning, improving average AUROC by 1.2%P in pooled learning and 2.6%P in transfer learning while also showing comparable results when trained on a single EHR dataset. Furthermore, we demonstrate that self-supervised pretraining using multi-source datasets is effective when combined with GenHPF, resulting in a 0.6%P AUROC improvement compared to models without pretraining. By eliminating the need for preprocessing and feature engineering, we believe that this work offers a solid framework for multi-task and multi-source learning that can be leveraged to speed up the scaling and usage of predictive algorithms in healthcare.

  • 10 authors
·
Jul 20, 2022

Localising In-Domain Adaptation of Transformer-Based Biomedical Language Models

In the era of digital healthcare, the huge volumes of textual information generated every day in hospitals constitute an essential but underused asset that could be exploited with task-specific, fine-tuned biomedical language representation models, improving patient care and management. For such specialized domains, previous research has shown that fine-tuning models stemming from broad-coverage checkpoints can largely benefit additional training rounds over large-scale in-domain resources. However, these resources are often unreachable for less-resourced languages like Italian, preventing local medical institutions to employ in-domain adaptation. In order to reduce this gap, our work investigates two accessible approaches to derive biomedical language models in languages other than English, taking Italian as a concrete use-case: one based on neural machine translation of English resources, favoring quantity over quality; the other based on a high-grade, narrow-scoped corpus natively written in Italian, thus preferring quality over quantity. Our study shows that data quantity is a harder constraint than data quality for biomedical adaptation, but the concatenation of high-quality data can improve model performance even when dealing with relatively size-limited corpora. The models published from our investigations have the potential to unlock important research opportunities for Italian hospitals and academia. Finally, the set of lessons learned from the study constitutes valuable insights towards a solution to build biomedical language models that are generalizable to other less-resourced languages and different domain settings.

  • 5 authors
·
Dec 20, 2022

Conversational LLMs Simplify Secure Clinical Data Access, Understanding, and Analysis

Large-scale clinical databases offer opportunities for medical research, but their complexity creates barriers to effective use. The Medical Information Mart for Intensive Care (MIMIC-IV), one of the world's largest open-source electronic health record databases, traditionally requires both SQL proficiency and clinical domain expertise. We introduce M3, a system that enables natural language querying of MIMIC-IV data through the Model Context Protocol. With a single command, M3 retrieves MIMIC-IV from PhysioNet, launches a local SQLite instance or connects to hosted BigQuery, and allows researchers to pose clinical questions in plain English. We evaluated M3 using one hundred questions from the EHRSQL 2024 benchmark with two language models: the proprietary Claude Sonnet 4 achieved 94% accuracy, while the open-source gpt-oss-20B (deployable locally on consumer hardware) achieved 93% accuracy. Both models translate natural language into SQL, execute queries against MIMIC-IV, and return structured results alongside the underlying query for verification. Error analysis revealed that most failures stemmed from complex temporal reasoning or ambiguous question phrasing rather than fundamental architectural limitations. The comparable performance of a smaller open-source model demonstrates that privacy-preserving local deployment is viable for sensitive clinical data analysis. M3 lowers technical barriers to critical care data analysis while maintaining security through OAuth2 authentication, query validation, and comprehensive audit logging.

  • 6 authors
·
Jun 27, 2025

Similarity-Based Self-Construct Graph Model for Predicting Patient Criticalness Using Graph Neural Networks and EHR Data

Accurately predicting the criticalness of ICU patients (such as in-ICU mortality risk) is vital for early intervention in critical care. However, conventional models often treat each patient in isolation and struggle to exploit the relational structure in Electronic Health Records (EHR). We propose a Similarity-Based Self-Construct Graph Model (SBSCGM) that dynamically builds a patient similarity graph from multi-modal EHR data, and a HybridGraphMedGNN architecture that operates on this graph to predict patient mortality and a continuous criticalness score. SBSCGM uses a hybrid similarity measure (combining feature-based and structural similarities) to connect patients with analogous clinical profiles in real-time. The HybridGraphMedGNN integrates Graph Convolutional Network (GCN), GraphSAGE, and Graph Attention Network (GAT) layers to learn robust patient representations, leveraging both local and global graph patterns. In experiments on 6,000 ICU stays from the MIMIC-III dataset, our model achieves state-of-the-art performance (AUC-ROC 0.94) outperforming baseline classifiers and single-type GNN models. We also demonstrate improved precision/recall and show that the attention mechanism provides interpretable insights into model predictions. Our framework offers a scalable and interpretable solution for critical care risk prediction, with potential to support clinicians in real-world ICU deployment.

  • 2 authors
·
Aug 1, 2025

DeID-GPT: Zero-shot Medical Text De-Identification by GPT-4

The digitization of healthcare has facilitated the sharing and re-using of medical data but has also raised concerns about confidentiality and privacy. HIPAA (Health Insurance Portability and Accountability Act) mandates removing re-identifying information before the dissemination of medical records. Thus, effective and efficient solutions for de-identifying medical data, especially those in free-text forms, are highly needed. While various computer-assisted de-identification methods, including both rule-based and learning-based, have been developed and used in prior practice, such solutions still lack generalizability or need to be fine-tuned according to different scenarios, significantly imposing restrictions in wider use. The advancement of large language models (LLM), such as ChatGPT and GPT-4, have shown great potential in processing text data in the medical domain with zero-shot in-context learning, especially in the task of privacy protection, as these models can identify confidential information by their powerful named entity recognition (NER) capability. In this work, we developed a novel GPT4-enabled de-identification framework (``DeID-GPT") to automatically identify and remove the identifying information. Compared to existing commonly used medical text data de-identification methods, our developed DeID-GPT showed the highest accuracy and remarkable reliability in masking private information from the unstructured medical text while preserving the original structure and meaning of the text. This study is one of the earliest to utilize ChatGPT and GPT-4 for medical text data processing and de-identification, which provides insights for further research and solution development on the use of LLMs such as ChatGPT/GPT-4 in healthcare. Codes and benchmarking data information are available at https://github.com/yhydhx/ChatGPT-API.

  • 18 authors
·
Mar 20, 2023

On-device Computation of Single-lead ECG Parameters for Real-time Remote Cardiac Health Assessment: A Real-world Validation Study

Accurate, continuous out-of-hospital electrocardiogram (ECG) parameter measurement is vital for real-time cardiac health monitoring and telemedicine. On-device computation of single-lead ECG parameters enables timely assessment without reliance on centralized data processing, advancing personalized, ubiquitous cardiac care-yet comprehensive validation across heterogeneous real-world populations remains limited. This study validated the on-device algorithm FeatureDB (https://github.com/PKUDigitalHealth/FeatureDB) using two datasets: HeartVoice-ECG-lite (369 participants with single-lead ECGs annotated by two physicians) and PTB-XL/PTB-XL+ (21,354 patients with 12-lead ECGs and physicians' diagnostic annotations). FeatureDB computed PR, QT, and QTc intervals, with accuracy evaluated against physician annotations via mean absolute error (MAE), correlation analysis, and Bland-Altman analysis. Diagnostic performance for first-degree atrioventricular block (AVBI, PR-based) and long QT syndrome (LQT, QTc-based) was benchmarked against commercial 12-lead systems (12SL, Uni-G) and open-source algorithm Deli, using AUC, accuracy, sensitivity, and specificity. Results showed high concordance with expert annotations (Pearson correlations: 0.836-0.960), MAEs matching inter-observer variability, and minimal bias. AVBI AUC reached 0.787 (12SL: 0.859; Uni-G: 0.812; Deli: 0.501); LQT AUC was 0.684 (12SL: 0.716; Uni-G: 0.605; Deli: 0.569)-comparable to commercial tools and superior to open-source alternatives. FeatureDB delivers physician-level parameter accuracy and commercial-grade abnormality detection via single-lead devices, supporting scalable telemedicine, decentralized cardiac screening, and continuous monitoring in community and outpatient settings.

  • 12 authors
·
Feb 21, 2025

Synthetic Observational Health Data with GANs: from slow adoption to a boom in medical research and ultimately digital twins?

After being collected for patient care, Observational Health Data (OHD) can further benefit patient well-being by sustaining the development of health informatics and medical research. Vast potential is unexploited because of the fiercely private nature of patient-related data and regulations to protect it. Generative Adversarial Networks (GANs) have recently emerged as a groundbreaking way to learn generative models that produce realistic synthetic data. They have revolutionized practices in multiple domains such as self-driving cars, fraud detection, digital twin simulations in industrial sectors, and medical imaging. The digital twin concept could readily apply to modelling and quantifying disease progression. In addition, GANs posses many capabilities relevant to common problems in healthcare: lack of data, class imbalance, rare diseases, and preserving privacy. Unlocking open access to privacy-preserving OHD could be transformative for scientific research. In the midst of COVID-19, the healthcare system is facing unprecedented challenges, many of which of are data related for the reasons stated above. Considering these facts, publications concerning GAN applied to OHD seemed to be severely lacking. To uncover the reasons for this slow adoption, we broadly reviewed the published literature on the subject. Our findings show that the properties of OHD were initially challenging for the existing GAN algorithms (unlike medical imaging, for which state-of-the-art model were directly transferable) and the evaluation synthetic data lacked clear metrics. We find more publications on the subject than expected, starting slowly in 2017, and since then at an increasing rate. The difficulties of OHD remain, and we discuss issues relating to evaluation, consistency, benchmarking, data modelling, and reproducibility.

  • 2 authors
·
May 27, 2020

Towards Secure and Private AI: A Framework for Decentralized Inference

The rapid advancement of ML models in critical sectors such as healthcare, finance, and security has intensified the need for robust data security, model integrity, and reliable outputs. Large multimodal foundational models, while crucial for complex tasks, present challenges in scalability, reliability, and potential misuse. Decentralized systems offer a solution by distributing workload and mitigating central points of failure, but they introduce risks of unauthorized access to sensitive data across nodes. We address these challenges with a comprehensive framework designed for responsible AI development. Our approach incorporates: 1) Zero-knowledge proofs for secure model verification, enhancing trust without compromising privacy. 2) Consensus-based verification checks to ensure consistent outputs across nodes, mitigating hallucinations and maintaining model integrity. 3) Split Learning techniques that segment models across different nodes, preserving data privacy by preventing full data access at any point. 4) Hardware-based security through trusted execution environments (TEEs) to protect data and computations. This framework aims to enhance security and privacy and improve the reliability and fairness of multimodal AI systems. Promoting efficient resource utilization contributes to more sustainable AI development. Our state-of-the-art proofs and principles demonstrate the framework's effectiveness in responsibly democratizing artificial intelligence, offering a promising approach for building secure and private foundational models.

  • 8 authors
·
Jul 28, 2024

WiNGPT-3.0 Technical Report

Current Large Language Models (LLMs) exhibit significant limitations, notably in structured, interpretable, and verifiable medical reasoning, alongside practical deployment challenges related to computational resources and data privacy. This report focused on the development of WiNGPT-3.0, the 32-billion parameter LLMs, engineered with the objective of enhancing its capacity for medical reasoning and exploring its potential for effective integration within healthcare IT infrastructures. The broader aim is to advance towards clinically applicable models. The approach involved a multi-stage training pipeline tailored for general, medical, and clinical reasoning. This pipeline incorporated supervised fine-tuning (SFT) and reinforcement learning (RL), leveraging curated Long Chain-of-Thought (CoT) datasets, auxiliary reward models, and an evidence-based diagnostic chain simulation. WiNGPT-3.0 demonstrated strong performance: specific model variants achieved scores of 66.6 on MedCalc and 87.1 on MedQA-USMLE. Furthermore, targeted training improved performance on a clinical reasoning task from a baseline score of 58.1 to 62.5. These findings suggest that reinforcement learning, even when applied with a limited dataset of only a few thousand examples, can enhance medical reasoning accuracy. Crucially, this demonstration of RL's efficacy with limited data and computation paves the way for more trustworthy and practically deployable LLMs within clinical workflows and health information infrastructures.

  • 13 authors
·
May 22, 2025

Foundation Model of Electronic Medical Records for Adaptive Risk Estimation

Hospitals struggle to predict critical outcomes. Traditional early warning systems, like NEWS and MEWS, rely on static variables and fixed thresholds, limiting their adaptability, accuracy, and personalization. We previously developed the Enhanced Transformer for Health Outcome Simulation (ETHOS), an AI model that tokenizes patient health timelines (PHTs) from EHRs and uses transformer-based architectures to predict future PHTs. ETHOS is a versatile framework for developing a wide range of applications. In this work, we develop the Adaptive Risk Estimation System (ARES) that leverages ETHOS to compute dynamic, personalized risk probabilities for clinician-defined critical events. ARES also features a personalized explainability module that highlights key clinical factors influencing risk estimates. We evaluated ARES using the MIMIC-IV v2.2 dataset together with its Emergency Department (ED) extension and benchmarked performance against both classical early warning systems and contemporary machine learning models. The entire dataset was tokenized resulting in 285,622 PHTs, comprising over 360 million tokens. ETHOS outperformed benchmark models in predicting hospital admissions, ICU admissions, and prolonged stays, achieving superior AUC scores. Its risk estimates were robust across demographic subgroups, with calibration curves confirming model reliability. The explainability module provided valuable insights into patient-specific risk factors. ARES, powered by ETHOS, advances predictive healthcare AI by delivering dynamic, real-time, personalized risk estimation with patient-specific explainability. Although our results are promising, the clinical impact remains uncertain. Demonstrating ARES's true utility in real-world settings will be the focus of our future work. We release the source code to facilitate future research.

  • 12 authors
·
Feb 9, 2025

README: Bridging Medical Jargon and Lay Understanding for Patient Education through Data-Centric NLP

The advancement in healthcare has shifted focus toward patient-centric approaches, particularly in self-care and patient education, facilitated by access to Electronic Health Records (EHR). However, medical jargon in EHRs poses significant challenges in patient comprehension. To address this, we introduce a new task of automatically generating lay definitions, aiming to simplify complex medical terms into patient-friendly lay language. We first created the README dataset, an extensive collection of over 50,000 unique (medical term, lay definition) pairs and 300,000 mentions, each offering context-aware lay definitions manually annotated by domain experts. We have also engineered a data-centric Human-AI pipeline that synergizes data filtering, augmentation, and selection to improve data quality. We then used README as the training data for models and leveraged a Retrieval-Augmented Generation method to reduce hallucinations and improve the quality of model outputs. Our extensive automatic and human evaluations demonstrate that open-source mobile-friendly models, when fine-tuned with high-quality data, are capable of matching or even surpassing the performance of state-of-the-art closed-source large language models like ChatGPT. This research represents a significant stride in closing the knowledge gap in patient education and advancing patient-centric healthcare solutions.

  • 9 authors
·
Dec 24, 2023

MLB: A Scenario-Driven Benchmark for Evaluating Large Language Models in Clinical Applications

The proliferation of Large Language Models (LLMs) presents transformative potential for healthcare, yet practical deployment is hindered by the absence of frameworks that assess real-world clinical utility. Existing benchmarks test static knowledge, failing to capture the dynamic, application-oriented capabilities required in clinical practice. To bridge this gap, we introduce a Medical LLM Benchmark MLB, a comprehensive benchmark evaluating LLMs on both foundational knowledge and scenario-based reasoning. MLB is structured around five core dimensions: Medical Knowledge (MedKQA), Safety and Ethics (MedSE), Medical Record Understanding (MedRU), Smart Services (SmartServ), and Smart Healthcare (SmartCare). The benchmark integrates 22 datasets (17 newly curated) from diverse Chinese clinical sources, covering 64 clinical specialties. Its design features a rigorous curation pipeline involving 300 licensed physicians. Besides, we provide a scalable evaluation methodology, centered on a specialized judge model trained via Supervised Fine-Tuning (SFT) on expert annotations. Our comprehensive evaluation of 10 leading models reveals a critical translational gap: while the top-ranked model, Kimi-K2-Instruct (77.3% accuracy overall), excels in structured tasks like information extraction (87.8% accuracy in MedRU), performance plummets in patient-facing scenarios (61.3% in SmartServ). Moreover, the exceptional safety score (90.6% in MedSE) of the much smaller Baichuan-M2-32B highlights that targeted training is equally critical. Our specialized judge model, trained via SFT on a 19k expert-annotated medical dataset, achieves 92.1% accuracy, an F1-score of 94.37%, and a Cohen's Kappa of 81.3% for human-AI consistency, validating a reproducible and expert-aligned evaluation protocol. MLB thus provides a rigorous framework to guide the development of clinically viable LLMs.

  • 23 authors
·
Jan 7

The Patient is not a Moving Document: A World Model Training Paradigm for Longitudinal EHR

Large language models (LLMs) trained with next-word-prediction have achieved success as clinical foundation models. Representations from these language backbones yield strong linear probe performance across biomedical tasks, suggesting that patient semantics emerge from next-token prediction at scale. However, this paradigm treats patients as a document to be summarized rather than a dynamical system to be simulated; a patient's trajectory emerges from their state evolving under interventions and time, requiring models that simulate dynamics rather than predict tokens. To address this, we introduce SMB-Structure, a world model for structured EHR that grounds a joint-embedding prediction architecture (JEPA) with next-token prediction (SFT). SFT grounds our model to reconstruct future patient states in token space, while JEPA predicts those futures in latent space from the initial patient representation alone, forcing trajectory dynamics to be encoded before the next state is observed. We validate across two large-scale cohorts: Memorial Sloan Kettering (23,319 oncology patients; 323,000+ patient-years) and INSPECT (19,402 pulmonary embolism patients). Using a linear probe evaluated at multiple points along the disease trajectory, we demonstrate that our training paradigm learns embeddings that capture disease dynamics not recoverable by autoregressive baselines, enabling SMB-Structure to achieve competitive performance on complex tasks characterized by high patient heterogeneity. Model weights are available at https://huggingface.co/standardmodelbio/SMB-v1-1.7B-Structure.

  • 8 authors
·
Jan 29

Medical Hallucinations in Foundation Models and Their Impact on Healthcare

Foundation Models that are capable of processing and generating multi-modal data have transformed AI's role in medicine. However, a key limitation of their reliability is hallucination, where inaccurate or fabricated information can impact clinical decisions and patient safety. We define medical hallucination as any instance in which a model generates misleading medical content. This paper examines the unique characteristics, causes, and implications of medical hallucinations, with a particular focus on how these errors manifest themselves in real-world clinical scenarios. Our contributions include (1) a taxonomy for understanding and addressing medical hallucinations, (2) benchmarking models using medical hallucination dataset and physician-annotated LLM responses to real medical cases, providing direct insight into the clinical impact of hallucinations, and (3) a multi-national clinician survey on their experiences with medical hallucinations. Our results reveal that inference techniques such as Chain-of-Thought (CoT) and Search Augmented Generation can effectively reduce hallucination rates. However, despite these improvements, non-trivial levels of hallucination persist. These findings underscore the ethical and practical imperative for robust detection and mitigation strategies, establishing a foundation for regulatory policies that prioritize patient safety and maintain clinical integrity as AI becomes more integrated into healthcare. The feedback from clinicians highlights the urgent need for not only technical advances but also for clearer ethical and regulatory guidelines to ensure patient safety. A repository organizing the paper resources, summaries, and additional information is available at https://github.com/mitmedialab/medical hallucination.

  • 25 authors
·
Feb 25, 2025

SoftTiger: A Clinical Foundation Model for Healthcare Workflows

We introduce SoftTiger, a clinical large language model (CLaM) designed as a foundation model for healthcare workflows. The narrative and unstructured nature of clinical notes is a major obstacle for healthcare intelligentization. We address a critical problem of structuring clinical notes into clinical data, according to international interoperability standards. We collect and annotate data for three subtasks, namely, international patient summary, clinical impression and medical encounter. We then supervised fine-tuned a state-of-the-art LLM using public and credentialed clinical data. The training is orchestrated in a way that the target model can first support basic clinical tasks such as abbreviation expansion and temporal information extraction, and then learn to perform more complex downstream clinical tasks. Moreover, we address several modeling challenges in the healthcare context, e.g., extra long context window. Our blind pairwise evaluation shows that SoftTiger outperforms other popular open-source models and GPT-3.5, comparable to Gemini-pro, with a mild gap from GPT-4. We believe that LLMs may become a step-stone towards healthcare digitalization and democratization. Therefore, we publicly release SoftTiger models at scales of 13 billion and 70 billion parameters, as well as datasets and code for our innovative scalable evaluation, hopefully, making a significant contribution to the healthcare industry.

  • 5 authors
·
Feb 29, 2024

Text2Node: a Cross-Domain System for Mapping Arbitrary Phrases to a Taxonomy

Electronic health record (EHR) systems are used extensively throughout the healthcare domain. However, data interchangeability between EHR systems is limited due to the use of different coding standards across systems. Existing methods of mapping coding standards based on manual human experts mapping, dictionary mapping, symbolic NLP and classification are unscalable and cannot accommodate large scale EHR datasets. In this work, we present Text2Node, a cross-domain mapping system capable of mapping medical phrases to concepts in a large taxonomy (such as SNOMED CT). The system is designed to generalize from a limited set of training samples and map phrases to elements of the taxonomy that are not covered by training data. As a result, our system is scalable, robust to wording variants between coding systems and can output highly relevant concepts when no exact concept exists in the target taxonomy. Text2Node operates in three main stages: first, the lexicon is mapped to word embeddings; second, the taxonomy is vectorized using node embeddings; and finally, the mapping function is trained to connect the two embedding spaces. We compared multiple algorithms and architectures for each stage of the training, including GloVe and FastText word embeddings, CNN and Bi-LSTM mapping functions, and node2vec for node embeddings. We confirmed the robustness and generalisation properties of Text2Node by mapping ICD-9-CM Diagnosis phrases to SNOMED CT and by zero-shot training at comparable accuracy. This system is a novel methodological contribution to the task of normalizing and linking phrases to a taxonomy, advancing data interchangeability in healthcare. When applied, the system can use electronic health records to generate an embedding that incorporates taxonomical medical knowledge to improve clinical predictive models.

  • 2 authors
·
Apr 11, 2019

A Decentralized Retrieval Augmented Generation System with Source Reliabilities Secured on Blockchain

Existing retrieval-augmented generation (RAG) systems typically use a centralized architecture, causing a high cost of data collection, integration, and management, as well as privacy concerns. There is a great need for a decentralized RAG system that enables foundation models to utilize information directly from data owners who maintain full control over their sources. However, decentralization brings a challenge: the numerous independent data sources vary significantly in reliability, which can diminish retrieval accuracy and response quality. To address this, our decentralized RAG system has a novel reliability scoring mechanism that dynamically evaluates each source based on the quality of responses it contributes to generate and prioritizes high-quality sources during retrieval. To ensure transparency and trust, the scoring process is securely managed through blockchain-based smart contracts, creating verifiable and tamper-proof reliability records without relying on a central authority. We evaluate our decentralized system with two Llama models (3B and 8B) in two simulated environments where six data sources have different levels of reliability. Our system achieves a +10.7\% performance improvement over its centralized counterpart in the real world-like unreliable data environments. Notably, it approaches the upper-bound performance of centralized systems under ideally reliable data environments. The decentralized infrastructure enables secure and trustworthy scoring management, achieving approximately 56\% marginal cost savings through batched update operations. Our code and system are open-sourced at github.com/yining610/Reliable-dRAG.

Improving Clinical Document Understanding on COVID-19 Research with Spark NLP

Following the global COVID-19 pandemic, the number of scientific papers studying the virus has grown massively, leading to increased interest in automated literate review. We present a clinical text mining system that improves on previous efforts in three ways. First, it can recognize over 100 different entity types including social determinants of health, anatomy, risk factors, and adverse events in addition to other commonly used clinical and biomedical entities. Second, the text processing pipeline includes assertion status detection, to distinguish between clinical facts that are present, absent, conditional, or about someone other than the patient. Third, the deep learning models used are more accurate than previously available, leveraging an integrated pipeline of state-of-the-art pretrained named entity recognition models, and improving on the previous best performing benchmarks for assertion status detection. We illustrate extracting trends and insights, e.g. most frequent disorders and symptoms, and most common vital signs and EKG findings, from the COVID-19 Open Research Dataset (CORD-19). The system is built using the Spark NLP library which natively supports scaling to use distributed clusters, leveraging GPUs, configurable and reusable NLP pipelines, healthcare specific embeddings, and the ability to train models to support new entity types or human languages with no code changes.

  • 2 authors
·
Dec 6, 2020

TPM-Based Continuous Remote Attestation and Integrity Verification for 5G VNFs on Kubernetes

In the rapidly evolving landscape of 5G technology, the adoption of cloud-based infrastructure for the deployment of 5G services has become increasingly common. Using a service-based architecture, critical 5G components, such as the Access and Mobility Management Function (AMF), Session Management Function (SMF), and User Plane Function (UPF), now run as containerized pods on Kubernetes clusters. Although this approach improves scalability, flexibility, and resilience, it also introduces new security challenges, particularly to ensure the integrity and trustworthiness of these components. Current 5G security specifications (for example, 3GPP TS 33.501) focus on communication security and assume that network functions remain trustworthy after authentication, consequently lacking mechanisms to continuously validate the integrity of NVFs at runtime. To close this gap, and to align with Zero Trust principles of 'never trust, always verify', we present a TPM 2.0-based continuous remote attestation solution for core 5G components deployed on Kubernetes. Our approach uses the Linux Integrity Measurement Architecture (IMA) and a Trusted Platform Module (TPM) to provide hardware-based runtime validation. We integrate the open-source Keylime framework with a custom IMA template that isolates pod-level measurements, allowing per-pod integrity verification. A prototype on a k3s cluster (consisting of 1 master, 2 worker nodes) was implemented to attest to core functions, including AMF, SMF and UPF. The experimental results show that the system detects unauthorized modifications in real time, labels each pod's trust state, and generates detailed audit logs. This work provides hardware-based continuous attestation for cloud native and edge deployments, strengthening the resilience of 5G as critical infrastructure in multi-vendor and mission-critical scenarios of 5G.

  • 5 authors
·
Oct 3, 2025

Temporal-spatial Correlation Attention Network for Clinical Data Analysis in Intensive Care Unit

In recent years, medical information technology has made it possible for electronic health record (EHR) to store fairly complete clinical data. This has brought health care into the era of "big data". However, medical data are often sparse and strongly correlated, which means that medical problems cannot be solved effectively. With the rapid development of deep learning in recent years, it has provided opportunities for the use of big data in healthcare. In this paper, we propose a temporal-saptial correlation attention network (TSCAN) to handle some clinical characteristic prediction problems, such as predicting death, predicting length of stay, detecting physiologic decline, and classifying phenotypes. Based on the design of the attention mechanism model, our approach can effectively remove irrelevant items in clinical data and irrelevant nodes in time according to different tasks, so as to obtain more accurate prediction results. Our method can also find key clinical indicators of important outcomes that can be used to improve treatment options. Our experiments use information from the Medical Information Mart for Intensive Care (MIMIC-IV) database, which is open to the public. Finally, we have achieved significant performance benefits of 2.0\% (metric) compared to other SOTA prediction methods. We achieved a staggering 90.7\% on mortality rate, 45.1\% on length of stay. The source code can be find: https://github.com/yuyuheintju/TSCAN.

  • 6 authors
·
Jun 2, 2023

Deep Learning, Machine Learning, Advancing Big Data Analytics and Management

Advancements in artificial intelligence, machine learning, and deep learning have catalyzed the transformation of big data analytics and management into pivotal domains for research and application. This work explores the theoretical foundations, methodological advancements, and practical implementations of these technologies, emphasizing their role in uncovering actionable insights from massive, high-dimensional datasets. The study presents a systematic overview of data preprocessing techniques, including data cleaning, normalization, integration, and dimensionality reduction, to prepare raw data for analysis. Core analytics methodologies such as classification, clustering, regression, and anomaly detection are examined, with a focus on algorithmic innovation and scalability. Furthermore, the text delves into state-of-the-art frameworks for data mining and predictive modeling, highlighting the role of neural networks, support vector machines, and ensemble methods in tackling complex analytical challenges. Special emphasis is placed on the convergence of big data with distributed computing paradigms, including cloud and edge computing, to address challenges in storage, computation, and real-time analytics. The integration of ethical considerations, including data privacy and compliance with global standards, ensures a holistic perspective on data management. Practical applications across healthcare, finance, marketing, and policy-making illustrate the real-world impact of these technologies. Through comprehensive case studies and Python-based implementations, this work equips researchers, practitioners, and data enthusiasts with the tools to navigate the complexities of modern data analytics. It bridges the gap between theory and practice, fostering the development of innovative solutions for managing and leveraging data in the era of artificial intelligence.

  • 26 authors
·
Dec 3, 2024

MedS^3: Towards Medical Small Language Models with Self-Evolved Slow Thinking

Medical language models (MLMs) have become pivotal in advancing medical natural language processing. However, prior models that rely on pre-training or supervised fine-tuning often exhibit low data efficiency and limited practicality in real-world clinical applications. While OpenAIs O1 highlights test-time scaling in mathematics, attempts to replicate this approach in medicine typically distill responses from GPT-series models to open-source models, focusing primarily on multiple-choice tasks. This strategy, though straightforward, neglects critical concerns like data privacy and realistic deployment in clinical settings. In this work, we present a deployable, small-scale medical language model, \mone, designed for long-chain reasoning in clinical tasks using a self-evolution paradigm. Starting with a seed dataset of around 8,000 instances spanning five domains and 16 datasets, we prompt a base policy model to perform Monte Carlo Tree Search (MCTS) to construct verifiable reasoning chains. Each reasoning step is assigned an evolution rollout value, allowing verified trajectories to train the policy model and the reward model. During inference, the policy model generates multiple responses, and the reward model selects the one with the highest reward score. Experiments on eleven evaluation datasets demonstrate that \mone outperforms prior open-source models by 2 points, with the addition of the reward model further boosting performance (sim13 points), surpassing GPT-4o-mini. Code and data are available at https://github.com/pixas/MedSSS.

  • 6 authors
·
Jan 21, 2025

Algorithms Trained on Normal Chest X-rays Can Predict Health Insurance Types

Artificial intelligence is revealing what medicine never intended to encode. Deep vision models, trained on chest X-rays, can now detect not only disease but also invisible traces of social inequality. In this study, we show that state-of-the-art architectures (DenseNet121, SwinV2-B, MedMamba) can predict a patient's health insurance type, a strong proxy for socioeconomic status, from normal chest X-rays with significant accuracy (AUC around 0.70 on MIMIC-CXR-JPG, 0.68 on CheXpert). The signal was unlikely contributed by demographic features by our machine learning study combining age, race, and sex labels to predict health insurance types; it also remains detectable when the model is trained exclusively on a single racial group. Patch-based occlusion reveals that the signal is diffuse rather than localized, embedded in the upper and mid-thoracic regions. This suggests that deep networks may be internalizing subtle traces of clinical environments, equipment differences, or care pathways; learning socioeconomic segregation itself. These findings challenge the assumption that medical images are neutral biological data. By uncovering how models perceive and exploit these hidden social signatures, this work reframes fairness in medical AI: the goal is no longer only to balance datasets or adjust thresholds, but to interrogate and disentangle the social fingerprints embedded in clinical data itself.

  • 11 authors
·
Nov 14, 2025

SynLLM: A Comparative Analysis of Large Language Models for Medical Tabular Synthetic Data Generation via Prompt Engineering

Access to real-world medical data is often restricted due to privacy regulations, posing a significant barrier to the advancement of healthcare research. Synthetic data offers a promising alternative; however, generating realistic, clinically valid, and privacy-conscious records remains a major challenge. Recent advancements in Large Language Models (LLMs) offer new opportunities for structured data generation; however, existing approaches frequently lack systematic prompting strategies and comprehensive, multi-dimensional evaluation frameworks. In this paper, we present SynLLM, a modular framework for generating high-quality synthetic medical tabular data using 20 state-of-the-art open-source LLMs, including LLaMA, Mistral, and GPT variants, guided by structured prompts. We propose four distinct prompt types, ranging from example-driven to rule-based constraints, that encode schema, metadata, and domain knowledge to control generation without model fine-tuning. Our framework features a comprehensive evaluation pipeline that rigorously assesses generated data across statistical fidelity, clinical consistency, and privacy preservation. We evaluate SynLLM across three public medical datasets, including Diabetes, Cirrhosis, and Stroke, using 20 open-source LLMs. Our results show that prompt engineering significantly impacts data quality and privacy risk, with rule-based prompts achieving the best privacy-quality balance. SynLLM establishes that, when guided by well-designed prompts and evaluated with robust, multi-metric criteria, LLMs can generate synthetic medical data that is both clinically plausible and privacy-aware, paving the way for safer and more effective data sharing in healthcare research.

  • 3 authors
·
Aug 11, 2025

EHRCon: Dataset for Checking Consistency between Unstructured Notes and Structured Tables in Electronic Health Records

Electronic Health Records (EHRs) are integral for storing comprehensive patient medical records, combining structured data (e.g., medications) with detailed clinical notes (e.g., physician notes). These elements are essential for straightforward data retrieval and provide deep, contextual insights into patient care. However, they often suffer from discrepancies due to unintuitive EHR system designs and human errors, posing serious risks to patient safety. To address this, we developed EHRCon, a new dataset and task specifically designed to ensure data consistency between structured tables and unstructured notes in EHRs. EHRCon was crafted in collaboration with healthcare professionals using the MIMIC-III EHR dataset, and includes manual annotations of 3,943 entities across 105 clinical notes checked against database entries for consistency. EHRCon has two versions, one using the original MIMIC-III schema, and another using the OMOP CDM schema, in order to increase its applicability and generalizability. Furthermore, leveraging the capabilities of large language models, we introduce CheckEHR, a novel framework for verifying the consistency between clinical notes and database tables. CheckEHR utilizes an eight-stage process and shows promising results in both few-shot and zero-shot settings. The code is available at https://github.com/dustn1259/EHRCon.

  • 9 authors
·
Jun 24, 2024 7

Beyond Benchmarks: Dynamic, Automatic And Systematic Red-Teaming Agents For Trustworthy Medical Language Models

Ensuring the safety and reliability of large language models (LLMs) in clinical practice is critical to prevent patient harm and promote trustworthy healthcare applications of AI. However, LLMs are advancing so rapidly that static safety benchmarks often become obsolete upon publication, yielding only an incomplete and sometimes misleading picture of model trustworthiness. We demonstrate that a Dynamic, Automatic, and Systematic (DAS) red-teaming framework that continuously stress-tests LLMs can reveal significant weaknesses of current LLMs across four safety-critical domains: robustness, privacy, bias/fairness, and hallucination. A suite of adversarial agents is applied to autonomously mutate test cases, identify/evolve unsafe-triggering strategies, and evaluate responses, uncovering vulnerabilities in real time without human intervention. Applying DAS to 15 proprietary and open-source LLMs revealed a stark contrast between static benchmark performance and vulnerability under adversarial pressure. Despite a median MedQA accuracy exceeding 80\%, 94\% of previously correct answers failed our dynamic robustness tests. We observed similarly high failure rates across other domains: privacy leaks were elicited in 86\% of scenarios, cognitive-bias priming altered clinical recommendations in 81\% of fairness tests, and we identified hallucination rates exceeding 66\% in widely used models. Such profound residual risks are incompatible with routine clinical practice. By converting red-teaming from a static checklist into a dynamic stress-test audit, DAS red-teaming offers the surveillance that hospitals/regulators/technology vendors require as LLMs become embedded in patient chatbots, decision-support dashboards, and broader healthcare workflows. Our framework delivers an evolvable, scalable, and reliable safeguard for the next generation of medical AI.

  • 21 authors
·
Jul 30, 2025

Federated Learning for ICD Classification with Lightweight Models and Pretrained Embeddings

This study investigates the feasibility and performance of federated learning (FL) for multi-label ICD code classification using clinical notes from the MIMIC-IV dataset. Unlike previous approaches that rely on centralized training or fine-tuned large language models, we propose a lightweight and scalable pipeline combining frozen text embeddings with simple multilayer perceptron (MLP) classifiers. This design offers a privacy-preserving and deployment-efficient alternative for clinical NLP applications, particularly suited to distributed healthcare settings. Extensive experiments across both centralized and federated configurations were conducted, testing six publicly available embedding models from Massive Text Embedding Benchmark leaderboard and three MLP classifier architectures under two medical coding (ICD-9 and ICD-10). Additionally, ablation studies over ten random stratified splits assess performance stability. Results show that embedding quality substantially outweighs classifier complexity in determining predictive performance, and that federated learning can closely match centralized results in idealized conditions. While the models are orders of magnitude smaller than state-of-the-art architectures and achieved competitive micro and macro F1 scores, limitations remain including the lack of end-to-end training and the simplified FL assumptions. Nevertheless, this work demonstrates a viable way toward scalable, privacy-conscious medical coding systems and offers a step toward for future research into federated, domain-adaptive clinical AI.

  • 2 authors
·
Jul 3, 2025

EHR-R1: A Reasoning-Enhanced Foundational Language Model for Electronic Health Record Analysis

Electronic Health Records (EHRs) contain rich yet complex information, and their automated analysis is critical for clinical decision-making. Despite recent advances of large language models (LLMs) in clinical workflows, their ability to analyze EHRs remains limited due to narrow task coverage and lack of EHR-oriented reasoning capabilities. This paper aims to bridge the gap, specifically, we present EHR-Ins, a large-scale, comprehensive EHR reasoning instruction dataset, comprising 300k high-quality reasoning cases and 4M non-reasoning cases across 42 distinct EHR tasks. Its core innovation is a thinking-graph-driven framework that enables to generate high-quality reasoning data at scale. Based on it, we develop EHR-R1, a series of reasoning-enhanced LLMs with up to 72B parameters tailored for EHR analysis. Through a multi-stage training paradigm, including domain adaptation, reasoning enhancement, and reinforcement learning, EHR-R1 systematically acquires domain knowledge and diverse reasoning capabilities, enabling accurate and robust EHR analysis. Lastly, we introduce EHR-Bench, a new benchmark curated from MIMIC-IV, spanning 42 tasks, to comprehensively assess reasoning and prediction across EHR scenarios. In experiments, we show that the resulting EHR-R1 consistently outperforms state-of-the-art commercial and open-source LLMs (including DeepSeek-V3 and GPT-4o), surpassing GPT-4o by over 30 points on MIMIC-Bench and achieving a 10\% higher zero-shot AUROC on EHRSHOT. Collectively, EHR-Ins, EHR-R1, and EHR-Bench have significantly advanced the development for more reliable and clinically relevant EHR analysis.

Refine Medical Diagnosis Using Generation Augmented Retrieval and Clinical Practice Guidelines

Current medical language models, adapted from large language models (LLMs), typically predict ICD code-based diagnosis from electronic health records (EHRs) because these labels are readily available. However, ICD codes do not capture the nuanced, context-rich reasoning clinicians use for diagnosis. Clinicians synthesize diverse patient data and reference clinical practice guidelines (CPGs) to make evidence-based decisions. This misalignment limits the clinical utility of existing models. We introduce GARMLE-G, a Generation-Augmented Retrieval framework that grounds medical language model outputs in authoritative CPGs. Unlike conventional Retrieval-Augmented Generation based approaches, GARMLE-G enables hallucination-free outputs by directly retrieving authoritative guideline content without relying on model-generated text. It (1) integrates LLM predictions with EHR data to create semantically rich queries, (2) retrieves relevant CPG knowledge snippets via embedding similarity, and (3) fuses guideline content with model output to generate clinically aligned recommendations. A prototype system for hypertension diagnosis was developed and evaluated on multiple metrics, demonstrating superior retrieval precision, semantic relevance, and clinical guideline adherence compared to RAG-based baselines, while maintaining a lightweight architecture suitable for localized healthcare deployment. This work provides a scalable, low-cost, and hallucination-free method for grounding medical language models in evidence-based clinical practice, with strong potential for broader clinical deployment.

  • 8 authors
·
Jun 22, 2025

ChiMed-GPT: A Chinese Medical Large Language Model with Full Training Regime and Better Alignment to Human Preferences

Recently, the increasing demand for superior medical services has highlighted the discrepancies in the medical infrastructure. With big data, especially texts, forming the foundation of medical services, there is an exigent need for effective natural language processing (NLP) solutions tailored to the healthcare domain. Conventional approaches leveraging pre-trained models present promising results in this domain and current large language models (LLMs) offer advanced foundation for medical text processing. However, most medical LLMs are trained only with supervised fine-tuning (SFT), even though it efficiently empowers LLMs to understand and respond to medical instructions but is ineffective in learning domain knowledge and aligning with human preference. Another engineering barrier that prevents current medical LLM from better text processing ability is their restricted context length (e.g., 2,048 tokens), making it hard for the LLMs to process long context, which is frequently required in the medical domain. In this work, we propose ChiMed-GPT, a new benchmark LLM designed explicitly for Chinese medical domain, with enlarged context length to 4,096 tokens and undergoes a comprehensive training regime with pre-training, SFT, and RLHF. Evaluations on real-world tasks including information extraction, question answering, and dialogue generation demonstrate ChiMed-GPT's superior performance over general domain LLMs. Furthermore, we analyze possible biases through prompting ChiMed-GPT to perform attitude scales regarding discrimination of patients, so as to contribute to further responsible development of LLMs in the medical domain. The code and model are released at https://github.com/synlp/ChiMed-GPT.

  • 5 authors
·
Nov 10, 2023

Large Language Models Illuminate a Progressive Pathway to Artificial Healthcare Assistant: A Review

With the rapid development of artificial intelligence, large language models (LLMs) have shown promising capabilities in mimicking human-level language comprehension and reasoning. This has sparked significant interest in applying LLMs to enhance various aspects of healthcare, ranging from medical education to clinical decision support. However, medicine involves multifaceted data modalities and nuanced reasoning skills, presenting challenges for integrating LLMs. This paper provides a comprehensive review on the applications and implications of LLMs in medicine. It begins by examining the fundamental applications of general-purpose and specialized LLMs, demonstrating their utilities in knowledge retrieval, research support, clinical workflow automation, and diagnostic assistance. Recognizing the inherent multimodality of medicine, the review then focuses on multimodal LLMs, investigating their ability to process diverse data types like medical imaging and EHRs to augment diagnostic accuracy. To address LLMs' limitations regarding personalization and complex clinical reasoning, the paper explores the emerging development of LLM-powered autonomous agents for healthcare. Furthermore, it summarizes the evaluation methodologies for assessing LLMs' reliability and safety in medical contexts. Overall, this review offers an extensive analysis on the transformative potential of LLMs in modern medicine. It also highlights the pivotal need for continuous optimizations and ethical oversight before these models can be effectively integrated into clinical practice. Visit https://github.com/mingze-yuan/Awesome-LLM-Healthcare for an accompanying GitHub repository containing latest papers.

  • 11 authors
·
Nov 3, 2023

Project Imaging-X: A Survey of 1000+ Open-Access Medical Imaging Datasets for Foundation Model Development

Foundation models have demonstrated remarkable success across diverse domains and tasks, primarily due to the thrive of large-scale, diverse, and high-quality datasets. However, in the field of medical imaging, the curation and assembling of such medical datasets are highly challenging due to the reliance on clinical expertise and strict ethical and privacy constraints, resulting in a scarcity of large-scale unified medical datasets and hindering the development of powerful medical foundation models. In this work, we present the largest survey to date of medical image datasets, covering over 1,000 open-access datasets with a systematic catalog of their modalities, tasks, anatomies, annotations, limitations, and potential for integration. Our analysis exposes a landscape that is modest in scale, fragmented across narrowly scoped tasks, and unevenly distributed across organs and modalities, which in turn limits the utility of existing medical image datasets for developing versatile and robust medical foundation models. To turn fragmentation into scale, we propose a metadata-driven fusion paradigm (MDFP) that integrates public datasets with shared modalities or tasks, thereby transforming multiple small data silos into larger, more coherent resources. Building on MDFP, we release an interactive discovery portal that enables end-to-end, automated medical image dataset integration, and compile all surveyed datasets into a unified, structured table that clearly summarizes their key characteristics and provides reference links, offering the community an accessible and comprehensive repository. By charting the current terrain and offering a principled path to dataset consolidation, our survey provides a practical roadmap for scaling medical imaging corpora, supporting faster data discovery, more principled dataset creation, and more capable medical foundation models.

MOD-X: A Modular Open Decentralized eXchange Framework proposal for Heterogeneous Interoperable Artificial Agents

As Artificial Intelligence systems evolve from monolithic models to ecosystems of specialized agents, the need for standardized communication protocols becomes increasingly critical. This paper introduces MOD-X (Modular Open Decentralized eXchange), a novel architectural framework proposal for agent interoperability that addresses key limitations of existing protocols. Unlike current approaches, MOD-X proposes a layered architecture with a Universal Message Bus, thorough state management, translation capabilities, and blockchain-based security mechanisms. We present MOD-X's architecture, compare it with existing protocols, and demonstrate its application through a worked example how it enables integration between heterogeneous specialist agents (agents with different architectures, vendors, capabilities, and knowledge representations--including rule-based systems, neural networks, symbolic reasoning engines, and legacy software with agent wrappers). MOD-X's key innovations include a publish-subscribe communication model, semantic capability discovery, and dynamic workflow orchestration--providing a framework that bridges theoretical formalism with practical implementation. This architecture addresses the growing need for truly decentralized, interoperable agent ecosystems that can scale effectively without the need for central coordination.

  • 5 authors
·
Jul 6, 2025 1

MedGemma Technical Report

Artificial intelligence (AI) has significant potential in healthcare applications, but its training and deployment faces challenges due to healthcare's diverse data, complex tasks, and the need to preserve privacy. Foundation models that perform well on medical tasks and require less task-specific tuning data are critical to accelerate the development of healthcare AI applications. We introduce MedGemma, a collection of medical vision-language foundation models based on Gemma 3 4B and 27B. MedGemma demonstrates advanced medical understanding and reasoning on images and text, significantly exceeding the performance of similar-sized generative models and approaching the performance of task-specific models, while maintaining the general capabilities of the Gemma 3 base models. For out-of-distribution tasks, MedGemma achieves 2.6-10% improvement on medical multimodal question answering, 15.5-18.1% improvement on chest X-ray finding classification, and 10.8% improvement on agentic evaluations compared to the base models. Fine-tuning MedGemma further improves performance in subdomains, reducing errors in electronic health record information retrieval by 50% and reaching comparable performance to existing specialized state-of-the-art methods for pneumothorax classification and histopathology patch classification. We additionally introduce MedSigLIP, a medically-tuned vision encoder derived from SigLIP. MedSigLIP powers the visual understanding capabilities of MedGemma and as an encoder achieves comparable or better performance than specialized medical image encoders. Taken together, the MedGemma collection provides a strong foundation of medical image and text capabilities, with potential to significantly accelerate medical research and development of downstream applications. The MedGemma collection, including tutorials and model weights, can be found at https://goo.gle/medgemma.

  • 80 authors
·
Jul 7, 2025 2