# PrMed — Perturbation-Resilient Medical Foundation Model
Large language models (LLMs) have achieved strong performance on medical benchmarks, yet their reliability in real-world clinical settings remains insufficient. We identify a key source of this gap: a mismatch between real patient expressions — which often contain linguistic perturbations such as colloquial, vague, dialectal, and emotionally charged language — and the relatively clean and standardized corpora on which most existing LLMs are trained.
We curated 569,913 real-world Chinese patient utterances from six clinical specialties and found that 95.1% contained at least one perturbation type, while 83.6% contained two or more, indicating that linguistic perturbations are pervasive in real medical communication. Perturbation-gradient experiments showed that, although several leading LLMs approached or even exceeded open-book physician performance under clean inputs, their performance declined sharply under mild-to-severe perturbations, whereas physicians remained substantially more stable.
Error-pattern analysis revealed that linguistic perturbations not only impaired key-information extraction, but more importantly disrupted reasoning accuracy and induced reasoning drift, suggesting that the main limitation of current medical LLMs lies not in insufficient medical knowledge, but in fragile understanding and reasoning under non-standard patient language.
To address this gap, we developed PrMed, a perturbation-resilient medical foundation model trained in two stages on 1.2 million multi-source medical samples: Stage 1 performs LoRA fine-tuning on perturbation-resilient chain-of-thought data, and Stage 2 applies GRPO-based reinforcement learning with a patient simulator to strengthen multi-turn interactive reasoning. PrMed consistently showed stronger robustness than other LLMs, with an accuracy drop of only 2.71 percentage points from formal to heavily perturbed input, while better preserving reasoning stability, safety, completeness, and actionable advice in long-form dialogues.
## Model Training
We developed a two-stage training framework to enable LLMs to perform perturbation-resilient complex medical reasoning through structured multi-step inference.
### Stage 1: Perturbation-Resilient Reasoning CoT
**Training data construction.** We curate high-quality training samples by searching for correct reasoning trajectories under a strict rubric-based verification system. The rubric comprises three layers: a CoT layer with five axes, a response layer with five axes, and a cross layer with three axes that quantify the coherence and alignment between the CoT and the final response. The reasoning procedure follows five ordered steps:
- Emotion perception — recognizing implicit emotional signals in perturbations to guide response tone and style
- Perturbation identification — determining whether perturbations are present, labeling them at corresponding spans, and interpreting intended meaning
- Utterance correction — reconstructing the patient message into a more clinically interpretable form
- Chief complaint extraction — filtering distractions to focus on the core clinical request
- Medical reasoning — conducting thorough and rigorous medical reasoning grounded in the extracted chief complaint
After generation, an independent judge agent scores the output using the predefined rubric on a 5-point Likert scale. A sample is included in the final training corpus only if all axes receive scores > 4. This generate–evaluate–refine loop is repeated for up to three iterations.
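The generate–evaluate–refine loop above can be sketched as follows. The generator, judge, and axis names are stand-ins (the actual agents and rubric wording are not released), but the acceptance rule — all 13 axes must score above 4 on the 5-point scale, with at most three iterations — mirrors the description.

```python
# Minimal sketch of the generate-evaluate-refine loop; generator, judge,
# and axis names are hypothetical stand-ins for the agents described above.
from typing import Callable, Dict, Optional

RUBRIC_AXES = [  # 5 CoT + 5 response + 3 cross axes (names illustrative)
    *(f"cot_{i}" for i in range(1, 6)),
    *(f"resp_{i}" for i in range(1, 6)),
    *(f"cross_{i}" for i in range(1, 4)),
]

def refine_until_accepted(
    generate: Callable[[str, Optional[Dict[str, int]]], str],
    judge: Callable[[str], Dict[str, int]],  # axis -> 1..5 Likert score
    prompt: str,
    max_iters: int = 3,
) -> Optional[str]:
    """Keep a sample only if every rubric axis scores above 4."""
    feedback = None
    for _ in range(max_iters):
        sample = generate(prompt, feedback)
        scores = judge(sample)
        if all(scores[axis] > 4 for axis in RUBRIC_AXES):
            return sample      # accepted into the training corpus
        feedback = scores      # judge scores guide the next refinement
    return None                # discarded after max_iters failed attempts
```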
**Fine-tuning procedure.** We select Qwen3-32B as the base model and perform parameter-efficient fine-tuning using LoRA with the following configuration:
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3-32B |
| PEFT method | LoRA |
| LoRA modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Rank (r) | 16 |
| Alpha (α) | 32 |
| Dropout | 0.05 |
| Max context length | 8192 tokens |
| Precision | bfloat16 (mixed precision) |
| Batch size | 1 per GPU, gradient accumulation 4 (effective batch size = 4) |
| Optimizer | AdamW, lr = 5×10⁻⁵, cosine schedule, 3% warmup |
| Training | Up to 5 epochs with early stopping on validation loss |
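Assuming the `peft` library (which the Quick Start installs), the table's LoRA settings map onto a config roughly like this — a sketch for orientation, not the released training script:

```python
# Illustrative peft LoRA configuration matching the hyperparameter table;
# the actual PrMed training script is not part of this card.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                 # rank
    lora_alpha=32,        # alpha
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",
    task_type="CAUSAL_LM",
)
```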
### Stage 2: Reinforcement Learning with GRPO
We further refine the Stage 1 model using Group Relative Policy Optimization (GRPO). For each prompt, GRPO generates multiple candidate responses from the current policy, scores them using a reward function, and updates the policy based on the relative advantage within each group. Training proceeds in two complementary phases:
- Single-turn phase: The model generates candidate responses to individual patient queries and is optimized based on rubric scores.
- Multi-turn phase: A DeepSeek-V3-based patient simulator generates follow-up utterances, and the model's next-turn response is evaluated under the same rubric, yielding an adaptive closed loop of simulate–evaluate–optimize.
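The core of GRPO's update — scoring a group of candidates and computing each candidate's advantage relative to the group — can be sketched in a few lines. The reward model, sampling, and policy-gradient step are omitted; only the group-relative normalization is shown.

```python
# Sketch of GRPO's group-relative advantage: each candidate's rubric reward
# is normalized against the mean and std of its own group, so candidates
# above the group average get positive advantage and are reinforced.
from statistics import mean, pstdev
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """rewards: rubric scores of all candidates sampled for one prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```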
## Quick Start
### Install Dependencies

```bash
pip install torch transformers peft accelerate
```
### Download Base Model
Via ModelScope (recommended for users in China):
```python
from modelscope import snapshot_download

model_dir = snapshot_download("Qwen/Qwen3-32B", cache_dir="./")
```
Or via HuggingFace:
```python
from huggingface_hub import snapshot_download

snapshot_download("Qwen/Qwen3-32B", local_dir="./Qwen3-32B")
```
### Load Model with PrMed
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_path = "./Qwen3-32B"
PrMed_path = "./PrMed"

tokenizer = AutoTokenizer.from_pretrained(base_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    base_model_path,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(model, PrMed_path)
```
### Inference
```python
# Use one of the two message lists below.

# Chinese (primary)
messages = [
    {"role": "system", "content": "你是一个抗语言扰动的医疗专家,通过多步骤思考过程,给出高质量的医学回复。"},
    {"role": "user", "content": "医生你好,我最近总是头疼,有时候还会恶心,这是怎么回事?"},
]

# English
messages = [
    {"role": "system", "content": "You are a perturbation-resilient medical expert. Reason step by step and provide a high-quality medical response."},
    {"role": "user", "content": "Hi doctor, I've been having headaches a lot lately, sometimes with nausea. What could be going on?"},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

output = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=8192,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
response = tokenizer.decode(output[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```
## Limitations
- This model is a research prototype and should NOT be used for actual clinical decision-making.
- Performance is optimized for Chinese medical text with linguistic perturbations.
- Requires Qwen3-32B as the base model (~60 GB in bfloat16).
## Authors
Xinti Sun, Yuexuan Long, Qiyang Hong, Yinbo Xiao, Erping Long
Chinese Academy of Medical Sciences and Peking Union Medical College, Peking Union Medical College Hospital
Contact: sunxinti@tmu.edu.cn
## Citation
```bibtex
@misc{prmed2026,
  title={PrMed: Perturbation-Resilient Medical Foundation Model},
  author={Xinti Sun and Yuexuan Long and Qiyang Hong and Yinbo Xiao and Erping Long},
  year={2026},
  url={https://huggingface.co/Xinti/PrMed}
}
```