---
base_model: Qwen/Qwen3-32B
library_name: peft
pipeline_tag: text-generation
license: apache-2.0
language:
- zh
- en
tags:
- lora
- transformers
- medical
- perturbation-robust
- qwen3
- chain-of-thought
- grpo
- reinforcement-learning
---

# PrMed — Perturbation-Resilient Medical Foundation Model

Large language models (LLMs) have achieved strong performance on medical benchmarks, yet their reliability in real-world clinical settings remains insufficient. We identify a key source of this gap: a mismatch between real patient expressions — which often contain linguistic perturbations such as colloquial, vague, dialectal, and emotionally charged language — and the relatively clean and standardized corpora on which most existing LLMs are trained.

We curated **569,913 real-world Chinese patient utterances** from six clinical specialties and found that **95.1%** contained at least one perturbation type, while **83.6%** contained two or more, indicating that linguistic perturbations are pervasive in real medical communication. Perturbation-gradient experiments showed that, although several leading LLMs approached or even exceeded open-book physician performance under clean inputs, their performance **declined sharply** under mild-to-severe perturbations, whereas physicians remained substantially more stable. Error-pattern analysis revealed that linguistic perturbations not only impaired key-information extraction but, more importantly, **disrupted reasoning accuracy and induced reasoning drift**, suggesting that the main limitation of current medical LLMs lies not in insufficient medical knowledge, but in fragile understanding and reasoning under non-standard patient language.
To address this gap, we developed **PrMed**, a perturbation-resilient medical foundation model trained in two stages on **1.2 million multi-source medical samples**: stage 1 performs LoRA fine-tuning on perturbation-resilient chain-of-thought data, and stage 2 applies GRPO-based reinforcement learning with a patient simulator to enhance multi-turn interactive reasoning. PrMed consistently showed stronger robustness than other LLMs, with an accuracy drop of only **2.71 percentage points** from formal to heavy perturbation, while better preserving reasoning stability, safety, completeness, and actionable advice in long-form dialogues.

## Model Training

We developed a two-stage training framework that enables LLMs to perform perturbation-resilient complex medical reasoning through structured multi-step inference.

### Stage 1: Perturbation-Resilient Reasoning CoT

**Training data construction.** We curate high-quality training samples by searching for correct reasoning trajectories under a strict rubric-based verification system. The rubric comprises three layers: a CoT layer with five axes, a response layer with five axes, and a cross layer with three axes that quantify the coherence and alignment between the CoT and the final response.

The reasoning procedure follows five ordered steps:

1. **Emotion perception** — recognizing implicit emotional signals in perturbations to guide response tone and style
2. **Perturbation identification** — determining whether perturbations are present, labeling them at the corresponding spans, and interpreting the intended meaning
3. **Utterance correction** — reconstructing the patient message into a more clinically interpretable form
4. **Chief complaint extraction** — filtering distractions to focus on the core clinical request
5. **Medical reasoning** — conducting thorough and rigorous medical reasoning grounded in the extracted chief complaint

After generation, an independent judge agent scores the output against the predefined rubric on a 5-point Likert scale. A sample is included in the final training corpus only if **all axes receive scores > 4**. This generate–evaluate–refine loop is repeated for up to three iterations.

**Fine-tuning procedure.** We select Qwen3-32B as the base model and perform parameter-efficient fine-tuning with LoRA.

| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3-32B |
| PEFT method | LoRA |
| LoRA modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Rank (r) | 16 |
| Alpha (α) | 32 |
| Dropout | 0.05 |
| Max context length | 8192 tokens |
| Precision | bfloat16 (mixed precision) |
| Batch size | 1 per GPU, gradient accumulation 4 (effective batch size = 4) |
| Optimizer | AdamW, lr = 5×10⁻⁵, cosine schedule, 3% warmup |
| Training | Up to 5 epochs with early stopping on validation loss |

### Stage 2: Reinforcement Learning with GRPO

We further refine the Stage 1 model using Group Relative Policy Optimization (GRPO). For each prompt, GRPO samples multiple candidate responses from the current policy, scores them with a reward function, and updates the policy based on each candidate's relative advantage within its group. Training proceeds in two complementary phases:

- **Single-turn phase**: The model generates candidate responses to individual patient queries and is optimized based on rubric scores.
- **Multi-turn phase**: A DeepSeek-V3-based patient simulator generates follow-up utterances, and the model's next-turn response is evaluated under the same rubric, yielding an adaptive closed loop of simulate–evaluate–optimize.
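The group-relative update described above can be sketched as follows. This is a minimal illustration of the advantage normalization at the core of GRPO, not the full policy-gradient update; the group size and rubric-style reward values are hypothetical.

```python
# Sketch: group-relative advantages as used in GRPO.
# For one prompt, the policy samples a group of candidate responses; each is
# scored by a reward function, and each candidate's advantage is its reward
# normalized against the rest of its group.
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rubric scores (5-point scale) for four sampled responses:
rewards = [4.5, 3.0, 5.0, 3.5]
advantages = group_relative_advantages(rewards)
```

Candidates scoring above the group mean receive positive advantage and are reinforced; below-mean candidates are pushed down, so no separate value network is needed to form a baseline.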
## Quick Start

### Install Dependencies

```bash
pip install torch transformers peft accelerate
```

### Download Base Model

Via ModelScope (recommended for users in China):

```python
from modelscope import snapshot_download

model_dir = snapshot_download("Qwen/Qwen3-32B", cache_dir="./")
```

Or via Hugging Face:

```python
from huggingface_hub import snapshot_download

snapshot_download("Qwen/Qwen3-32B", local_dir="./Qwen3-32B")
```

### Load Model with PrMed

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_path = "./Qwen3-32B"
PrMed_path = "./PrMed"

tokenizer = AutoTokenizer.from_pretrained(base_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    base_model_path,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(model, PrMed_path)
```

### Inference

```python
# Chinese (primary)
messages = [
    {"role": "system", "content": "你是一个抗语言扰动的医疗专家,通过多步骤思考过程,给出高质量的医学回复。"},
    {"role": "user", "content": "医生你好,我最近总是头疼,有时候还会恶心,这是怎么回事?"}
]

# English alternative (uncomment to use instead of the Chinese messages above):
# messages = [
#     {"role": "system", "content": "You are a perturbation-resilient medical expert. Reason step by step and provide a high-quality medical response."},
#     {"role": "user", "content": "Hi doctor, I've been having headaches a lot lately, sometimes with nausea. What could be going on?"}
# ]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

output = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=8192,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
response = tokenizer.decode(output[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```

## Limitations

- This model is a **research prototype** and should **NOT** be used for actual clinical decision-making.
- Performance is optimized for Chinese medical text with linguistic perturbations.
- Requires Qwen3-32B as the base model (~60 GB in bfloat16).

## Authors

**Xinti Sun, Yuexuan Long, Qiyang Hong, Yinbo Xiao, Erping Long**

Chinese Academy of Medical Sciences and Peking Union Medical College, Peking Union Medical College Hospital

Contact: sunxinti@tmu.edu.cn

## Citation

```bibtex
@misc{prmed2026,
  title={PrMed: Perturbation-Resilient Medical Foundation Model},
  author={Xinti Sun and Yuexuan Long and Qiyang Hong and Yinbo Xiao and Erping Long},
  year={2026},
  url={https://huggingface.co/Xinti/PrMed}
}
```