# PrMed — Perturbation-Resilient Medical Foundation Model
Large language models (LLMs) have achieved strong performance on medical benchmarks, yet their reliability in real-world clinical settings remains insufficient. We identify a key source of this gap: a mismatch between real patient expressions — which often contain linguistic perturbations such as colloquial, vague, dialectal, and emotionally charged language — and the relatively clean and standardized corpora on which most existing LLMs are trained.
We curated 569,913 real-world Chinese patient utterances from six clinical specialties and found that 95.1% contained at least one perturbation type, while 83.6% contained two or more, indicating that linguistic perturbations are pervasive in real medical communication. Perturbation-gradient experiments showed that, although several leading LLMs approached or even exceeded open-book physician performance under clean inputs, their performance declined sharply under mild-to-severe perturbations, whereas physicians remained substantially more stable.
Error-pattern analysis revealed that linguistic perturbations not only impaired key-information extraction, but more importantly disrupted reasoning accuracy and induced reasoning drift, suggesting that the main limitation of current medical LLMs lies not in insufficient medical knowledge, but in fragile understanding and reasoning under non-standard patient language.
To address this gap, we developed PrMed, a perturbation-resilient medical foundation model trained in two stages on 1.2 million multi-source medical samples: Stage 1 performs LoRA fine-tuning on perturbation-resilient chain-of-thought data, and Stage 2 applies GRPO-based reinforcement learning with a patient simulator to strengthen multi-turn interactive reasoning. PrMed consistently showed stronger robustness than other LLMs, with an accuracy drop of only 2.71 percentage points from formal to heavily perturbed input, while better preserving reasoning stability, safety, completeness, and actionable advice in long-form dialogues.
## Model Training
We developed a two-stage training framework to enable LLMs to perform perturbation-resilient complex medical reasoning through structured multi-step inference.
### Stage 1: Perturbation-Resilient Reasoning CoT
**Training data construction.** We curate high-quality training samples by searching for correct reasoning trajectories under a strict rubric-based verification system. The rubric comprises three layers: a CoT layer with five axes, a response layer with five axes, and a cross layer with three axes that quantify the coherence and alignment between the CoT and the final response. The reasoning procedure follows five ordered steps:
- Emotion perception — recognizing implicit emotional signals in perturbations to guide response tone and style
- Perturbation identification — determining whether perturbations are present, labeling them at corresponding spans, and interpreting intended meaning
- Utterance correction — reconstructing the patient message into a more clinically interpretable form
- Chief complaint extraction — filtering distractions to focus on the core clinical request
- Medical reasoning — conducting thorough and rigorous medical reasoning grounded in the extracted chief complaint
After generation, an independent judge agent scores the output using the predefined rubric on a 5-point Likert scale. A sample is included in the final training corpus only if all axes receive scores > 4. This generate–evaluate–refine loop is repeated for up to three iterations.
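The generate–evaluate–refine loop above can be sketched as follows. The generator, judge, and axis names are stand-ins (the actual agents and rubric wording are not released), but the acceptance rule — all 13 axes must score above 4 on the 5-point scale, with at most three iterations — mirrors the description.

```python
# Minimal sketch of the generate-evaluate-refine loop; generator, judge,
# and axis names are hypothetical stand-ins for the agents described above.
from typing import Callable, Dict, Optional

RUBRIC_AXES = [  # 5 CoT + 5 response + 3 cross axes (names illustrative)
    *(f"cot_{i}" for i in range(1, 6)),
    *(f"resp_{i}" for i in range(1, 6)),
    *(f"cross_{i}" for i in range(1, 4)),
]

def refine_until_accepted(
    generate: Callable[[str, Optional[Dict[str, int]]], str],
    judge: Callable[[str], Dict[str, int]],  # axis -> 1..5 Likert score
    prompt: str,
    max_iters: int = 3,
) -> Optional[str]:
    """Keep a sample only if every rubric axis scores above 4."""
    feedback = None
    for _ in range(max_iters):
        sample = generate(prompt, feedback)
        scores = judge(sample)
        if all(scores[axis] > 4 for axis in RUBRIC_AXES):
            return sample      # accepted into the training corpus
        feedback = scores      # judge scores guide the next refinement
    return None                # discarded after max_iters failed attempts
```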
**Fine-tuning procedure.** We select Qwen3-32B as the base model and perform parameter-efficient fine-tuning using LoRA with the following configuration:
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3-32B |
| PEFT method | LoRA |
| LoRA modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Rank (r) | 16 |
| Alpha (α) | 32 |
| Dropout | 0.05 |
| Max context length | 8192 tokens |
| Precision | bfloat16 (mixed precision) |
| Batch size | 1 per GPU, gradient accumulation 4 (effective batch size = 4) |
| Optimizer | AdamW, lr = 5×10⁻⁵, cosine schedule, 3% warmup |
| Training | Up to 5 epochs with early stopping on validation loss |
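Assuming the `peft` library (which the Quick Start installs), the table's LoRA settings map onto a config roughly like this — a sketch for orientation, not the released training script:

```python
# Illustrative peft LoRA configuration matching the hyperparameter table;
# the actual PrMed training script is not part of this card.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                 # rank
    lora_alpha=32,        # alpha
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",
    task_type="CAUSAL_LM",
)
```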
### Stage 2: Reinforcement Learning with GRPO
We further refine the Stage 1 model using Group Relative Policy Optimization (GRPO). For each prompt, GRPO generates multiple candidate responses from the current policy, scores them using a reward function, and updates the policy based on the relative advantage within each group. Training proceeds in two complementary phases:
- Single-turn phase: The model generates candidate responses to individual patient queries and is optimized based on rubric scores.
- Multi-turn phase: A DeepSeek-V3-based patient simulator generates follow-up utterances, and the model's next-turn response is evaluated under the same rubric, yielding an adaptive closed loop of simulate–evaluate–optimize.
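The core of GRPO's update — scoring a group of candidates and computing each candidate's advantage relative to the group — can be sketched in a few lines. The reward model, sampling, and policy-gradient step are omitted; only the group-relative normalization is shown.

```python
# Sketch of GRPO's group-relative advantage: each candidate's rubric reward
# is normalized against the mean and std of its own group, so candidates
# above the group average get positive advantage and are reinforced.
from statistics import mean, pstdev
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """rewards: rubric scores of all candidates sampled for one prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```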
## Quick Start
### Install Dependencies

```bash
pip install torch transformers peft accelerate
```
### Download Base Model
Via ModelScope (recommended for users in China):
```python
from modelscope import snapshot_download

model_dir = snapshot_download("Qwen/Qwen3-32B", cache_dir="./")
```
Or via HuggingFace:
```python
from huggingface_hub import snapshot_download

snapshot_download("Qwen/Qwen3-32B", local_dir="./Qwen3-32B")
```
### Load Model with PrMed
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_path = "./Qwen3-32B"
PrMed_path = "./PrMed"

tokenizer = AutoTokenizer.from_pretrained(base_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    base_model_path,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(model, PrMed_path)
```
### Inference
```python
# Use one of the two message lists below.

# Chinese (primary)
messages = [
    {"role": "system", "content": "你是一个抗语言扰动的医疗专家,通过多步骤思考过程,给出高质量的医学回复。"},
    {"role": "user", "content": "医生你好,我最近总是头疼,有时候还会恶心,这是怎么回事?"},
]

# English
messages = [
    {"role": "system", "content": "You are a perturbation-resilient medical expert. Reason step by step and provide a high-quality medical response."},
    {"role": "user", "content": "Hi doctor, I've been having headaches a lot lately, sometimes with nausea. What could be going on?"},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

output = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=8192,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
response = tokenizer.decode(output[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```
## Limitations
- This model is a research prototype and should NOT be used for actual clinical decision-making.
- Performance is optimized for Chinese medical text with linguistic perturbations.
- Requires Qwen3-32B as the base model (~60 GB in bfloat16).
## Authors
Xinti Sun, Yuexuan Long, Qiyang Hong, Yinbo Xiao, Erping Long
Chinese Academy of Medical Sciences and Peking Union Medical College, Peking Union Medical College Hospital
Contact: sunxinti@tmu.edu.cn
## Citation
```bibtex
@misc{prmed2026,
  title={PrMed: Perturbation-Resilient Medical Foundation Model},
  author={Xinti Sun and Yuexuan Long and Qiyang Hong and Yinbo Xiao and Erping Long},
  year={2026},
  url={https://huggingface.co/Xinti/PrMed}
}
```