---
base_model: Qwen/Qwen3-32B
library_name: peft
pipeline_tag: text-generation
license: apache-2.0
language:
- zh
- en
tags:
- lora
- transformers
- medical
- perturbation-robust
- qwen3
- chain-of-thought
- grpo
- reinforcement-learning
---

# PrMed — Perturbation-Resilient Medical Foundation Model

Large language models (LLMs) have achieved strong performance on medical benchmarks, yet their reliability in real-world clinical settings remains insufficient. We identify a key source of this gap: a mismatch between real patient expressions — which often contain linguistic perturbations such as colloquial, vague, dialectal, and emotionally charged language — and the relatively clean and standardized corpora on which most existing LLMs are trained.

We curated **569,913 real-world Chinese patient utterances** from six clinical specialties and found that **95.1%** contained at least one perturbation type, while **83.6%** contained two or more, indicating that linguistic perturbations are pervasive in real medical communication. Perturbation-gradient experiments showed that, although several leading LLMs approached or even exceeded open-book physician performance under clean inputs, their performance **declined sharply** under mild-to-severe perturbations, whereas physicians remained substantially more stable. Error-pattern analysis revealed that linguistic perturbations not only impaired key-information extraction but, more importantly, **disrupted reasoning accuracy and induced reasoning drift**, suggesting that the main limitation of current medical LLMs lies not in insufficient medical knowledge, but in fragile understanding and reasoning under non-standard patient language.
To address this gap, we developed **PrMed**, a perturbation-resilient medical foundation model trained in two stages on **1.2 million multi-source medical samples**: stage 1 performs LoRA fine-tuning on perturbation-resilient chain-of-thought data, and stage 2 applies GRPO-based reinforcement learning with a patient simulator to enhance multi-turn interactive reasoning. PrMed consistently showed stronger robustness than other LLMs, with an accuracy drop of only **2.71 percentage points** from formal to heavy perturbation, while better preserving reasoning stability, safety, completeness, and actionable advice in long-form dialogues.

## Model Training

We developed a two-stage training framework that enables LLMs to perform perturbation-resilient complex medical reasoning through structured multi-step inference.

### Stage 1: Perturbation-Resilient Reasoning CoT

**Training data construction.** We curate high-quality training samples by searching for correct reasoning trajectories under a strict rubric-based verification system. The rubric comprises three layers: a CoT layer with five axes, a response layer with five axes, and a cross layer with three axes that quantify the coherence and alignment between the CoT and the final response.

The reasoning procedure follows five ordered steps:

1. **Emotion perception** — recognizing implicit emotional signals in perturbations to guide response tone and style
2. **Perturbation identification** — determining whether perturbations are present, labeling them at the corresponding spans, and interpreting the intended meaning
3. **Utterance correction** — reconstructing the patient message into a more clinically interpretable form
4. **Chief complaint extraction** — filtering distractions to focus on the core clinical request
5. **Medical reasoning** — conducting thorough and rigorous medical reasoning grounded in the extracted chief complaint

After generation, an independent judge agent scores the output against the predefined rubric on a 5-point Likert scale. A sample is included in the final training corpus only if **all axes receive scores > 4**. This generate–evaluate–refine loop is repeated for up to three iterations.

**Fine-tuning procedure.** We select Qwen3-32B as the base model and perform parameter-efficient fine-tuning with LoRA.

| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3-32B |
| PEFT method | LoRA |
| LoRA modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Rank (r) | 16 |
| Alpha (α) | 32 |
| Dropout | 0.05 |
| Max context length | 8192 tokens |
| Precision | bfloat16 (mixed precision) |
| Batch size | 1 per GPU, gradient accumulation 4 (effective batch size = 4) |
| Optimizer | AdamW, lr = 5×10⁻⁵, cosine schedule, 3% warmup |
| Training | Up to 5 epochs with early stopping on validation loss |

### Stage 2: Reinforcement Learning with GRPO

We further refine the Stage 1 model using Group Relative Policy Optimization (GRPO). For each prompt, GRPO samples multiple candidate responses from the current policy, scores them with a reward function, and updates the policy based on each candidate's relative advantage within its group. Training proceeds in two complementary phases:

- **Single-turn phase**: The model generates candidate responses to individual patient queries and is optimized based on rubric scores.
- **Multi-turn phase**: A DeepSeek-V3-based patient simulator generates follow-up utterances, and the model's next-turn response is evaluated under the same rubric, yielding an adaptive closed loop of simulate–evaluate–optimize.
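The group-relative update described above can be sketched as follows. This is a minimal illustration of the advantage normalization at the core of GRPO, not the full policy-gradient update; the group size and rubric-style reward values are hypothetical.

```python
# Sketch: group-relative advantages as used in GRPO.
# For one prompt, the policy samples a group of candidate responses; each is
# scored by a reward function, and each candidate's advantage is its reward
# normalized against the rest of its group.
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rubric scores (5-point scale) for four sampled responses:
rewards = [4.5, 3.0, 5.0, 3.5]
advantages = group_relative_advantages(rewards)
```

Candidates scoring above the group mean receive positive advantage and are reinforced; below-mean candidates are pushed down, so no separate value network is needed to form a baseline.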
## Quick Start

### Install Dependencies

```bash
pip install torch transformers peft accelerate
```

### Download Base Model

Via ModelScope (recommended for users in China):

```python
from modelscope import snapshot_download

model_dir = snapshot_download("Qwen/Qwen3-32B", cache_dir="./")
```

Or via Hugging Face:

```python
from huggingface_hub import snapshot_download

snapshot_download("Qwen/Qwen3-32B", local_dir="./Qwen3-32B")
```

### Load Model with PrMed

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_path = "./Qwen3-32B"
PrMed_path = "./PrMed"

tokenizer = AutoTokenizer.from_pretrained(base_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    base_model_path,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(model, PrMed_path)
```

### Inference

```python
# Chinese (primary)
messages = [
    {"role": "system", "content": "你是一个抗语言扰动的医疗专家,通过多步骤思考过程,给出高质量的医学回复。"},
    {"role": "user", "content": "医生你好,我最近总是头疼,有时候还会恶心,这是怎么回事?"}
]

# English alternative (uncomment to use instead of the Chinese messages above):
# messages = [
#     {"role": "system", "content": "You are a perturbation-resilient medical expert. Reason step by step and provide a high-quality medical response."},
#     {"role": "user", "content": "Hi doctor, I've been having headaches a lot lately, sometimes with nausea. What could be going on?"}
# ]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

output = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=8192,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
response = tokenizer.decode(output[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```

## Limitations

- This model is a **research prototype** and should **NOT** be used for actual clinical decision-making.
- Performance is optimized for Chinese medical text with linguistic perturbations.
- Requires Qwen3-32B as the base model (~60 GB in bfloat16).

## Authors

**Xinti Sun, Yuexuan Long, Qiyang Hong, Yinbo Xiao, Erping Long**

Chinese Academy of Medical Sciences and Peking Union Medical College, Peking Union Medical College Hospital

Contact: sunxinti@tmu.edu.cn

## Citation

```bibtex
@misc{prmed2026,
  title={PrMed: Perturbation-Resilient Medical Foundation Model},
  author={Xinti Sun and Yuexuan Long and Qiyang Hong and Yinbo Xiao and Erping Long},
  year={2026},
  url={https://huggingface.co/Xinti/PrMed}
}
```