Qwen3-14B MedMCQA LoRA Adapter
A LoRA fine-tuned adapter for Qwen/Qwen3-14B on the MedMCQA dataset — 182K medical multiple-choice questions covering 21 subjects from Indian medical entrance exams (AIIMS/PG style).
Model Details
- Developed by: James Oon (@jamezoon), SUTD MSTR-DAIE Deep Learning Project
- Model type: Causal LM with LoRA adapter (PEFT)
- Base model: Qwen/Qwen3-14B (dense, 14B parameters, BF16, standard transformer)
- Language: English
- License: Follows base model license (Qwen3)
- Adapter size: ~81 MB (adapter_model.safetensors)
Intended Use
Medical multiple-choice question answering. Given a clinical question and 4 options (A–D), the model selects the correct answer with a step-by-step explanation. Subjects covered include Physiology, Anatomy, Biochemistry, Pathology, Pharmacology, Surgery, Medicine, Dental, Gynaecology, Paediatrics, and more.
Not intended for real clinical decision-making. This is a research/educational model.
How to Get Started
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

base_model_id = "Qwen/Qwen3-14B"
adapter_id = "jamezoon/qwen3-14b-medmcqa-lora"

# Load the base model in BF16, then attach the LoRA adapter on top of it.
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, adapter_id)

# System/user prompt layout with the requested answer format.
messages = [
    {"role": "system", "content": "You are a helpful tutor for pre-med students preparing for medical entrance exams. Answer the following multiple choice question by thinking step by step, then give the answer."},
    {"role": "user", "content": (
        "Question: Which of the following is the most common cause of mitral stenosis?\n"
        "Options: A. Rheumatic fever B. Congenital C. Infective endocarditis D. SLE\n"
        "Think step by step. Then respond in the format:\n"
        "Explanation: ...\nAnswer: <one of A, B, C, D>"
    )},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Greedy decoding; only the newly generated tokens are decoded and printed.
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
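To score a generation, the answer letter has to be pulled out of the `Answer: ...` line requested in the prompt. The following is a minimal sketch, assuming the model follows that format; `extract_answer` is a hypothetical helper that is not part of this repository, and it returns `None` when no letter can be found.

```python
import re
from typing import Optional

# Hypothetical pattern: accepts "Answer: B", "Answer: <B>", "answer - b", etc.
_ANSWER_RE = re.compile(r"Answer\s*[:\-]\s*<?\s*([A-D])\b", re.IGNORECASE)


def extract_answer(generated_text: str) -> Optional[str]:
    """Return the last answer letter (A-D) found in the text, or None."""
    matches = _ANSWER_RE.findall(generated_text)
    return matches[-1].upper() if matches else None


if __name__ == "__main__":
    print(extract_answer("Explanation: Rheumatic fever is the usual cause.\nAnswer: A"))
    print(extract_answer("I am not sure about this one."))
```

Taking the last match rather than the first is a design choice: if the explanation happens to repeat the word "Answer", the final line is more likely to hold the model's actual choice.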
Training Details
Dataset
- MedMCQA — 182,822 training samples, 4,183 validation samples
- 21 medical subjects (Dental, Surgery, Medicine, Pathology, Pharmacology, etc.)
- Each sample: question + 4 options + correct answer (1-indexed) + explanation
- Formatted as chat messages with system/user/assistant roles
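The chat formatting can be sketched as follows. This is an illustration, not the actual preprocessing script: the field names (`question`, `opa` to `opd`, `cop`, `exp`) are assumed from the public MedMCQA release, the answer index is treated as 1-indexed as stated above, and the assistant target layout is a guess based on the prompt format shown in the quick-start example. `to_chat_messages` is a hypothetical name.

```python
from typing import Dict, List

SYSTEM_PROMPT = (
    "You are a helpful tutor for pre-med students preparing for medical "
    "entrance exams. Answer the following multiple choice question by "
    "thinking step by step, then give the answer."
)


def to_chat_messages(sample: Dict) -> List[Dict[str, str]]:
    """Convert one MedMCQA-style record into system/user/assistant messages."""
    letters = "ABCD"
    options = [sample["opa"], sample["opb"], sample["opc"], sample["opd"]]
    # Assumption: `cop` is 1-indexed (1 -> A). Adjust if your copy is 0-indexed.
    answer = letters[sample["cop"] - 1]
    option_text = " ".join(f"{l}. {o}" for l, o in zip(letters, options))
    user = (
        f"Question: {sample['question']}\n"
        f"Options: {option_text}\n"
        "Think step by step. Then respond in the format:\n"
        "Explanation: ...\nAnswer: <one of A, B, C, D>"
    )
    # Some records may have no explanation; fall back to an empty string.
    explanation = sample.get("exp") or ""
    assistant = f"Explanation: {explanation}\nAnswer: {answer}"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user},
        {"role": "assistant", "content": assistant},
    ]


if __name__ == "__main__":
    example = {
        "question": "Which of the following is the most common cause of mitral stenosis?",
        "opa": "Rheumatic fever", "opb": "Congenital",
        "opc": "Infective endocarditis", "opd": "SLE",
        "cop": 1, "exp": "Rheumatic heart disease is the leading cause.",
    }
    for message in to_chat_messages(example):
        print(message["role"], "->", message["content"])
```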
Training Procedure
| Hyperparameter | Value |
|---|---|
| Training steps | 1,000 (max_steps — ~8.7% of 1 full epoch) |
| Epochs | 1 (partial) |
| Per-device batch size | 4 |
| Gradient accumulation | 4 (effective batch = 16) |
| Learning rate | 2e-4 |
| LR scheduler | Cosine |
| Warmup steps | 100 |
| Max sequence length | 512 tokens |
| Precision | BF16 |
| Optimizer | AdamW (fused) |
| Max grad norm | 1.0 |
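The arithmetic behind the effective batch size and the partial-epoch figure, together with the shape of the learning-rate curve, can be checked with a short script. This is a sketch under stated assumptions: a single device, linear warmup, and cosine decay to zero at the final step (a common default shape, not confirmed by this card). `lr_at` is a hypothetical helper.

```python
import math

PER_DEVICE_BATCH = 4
GRAD_ACCUM = 4
MAX_STEPS = 1_000
TRAIN_SAMPLES = 182_822
PEAK_LR = 2e-4
WARMUP_STEPS = 100

# Assumption: one device, so effective batch = per-device batch x accumulation.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM
samples_seen = MAX_STEPS * effective_batch
epoch_fraction = samples_seen / TRAIN_SAMPLES


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (linear warmup, cosine decay)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print(f"effective batch: {effective_batch}")
    print(f"samples seen:    {samples_seen}")
    print(f"epoch fraction:  {epoch_fraction:.2%}")
    for step in (0, 50, 100, 550, 1000):
        print(f"step {step:>4}: lr = {lr_at(step):.2e}")
```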
LoRA Configuration
| Parameter | Value |
|---|---|
| Rank (r) | 16 |
| Alpha (α) | 32 |
| Dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj |
| Trainable parameters | 20,971,520 (0.1418% of 14B) |
| Bias | none |
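The trainable-parameter count can be reproduced by hand: a LoRA adapter of rank r on a linear layer with `in_features` inputs and `out_features` outputs adds r × (in_features + out_features) weights. The sketch below assumes the Qwen3-14B dimensions listed in the constants (hidden size 5120, 40 layers, 40 query heads, 8 key/value heads, head dimension 128); these are not stated in this card and should be checked against the base model's `config.json`. The size estimate further assumes the adapter weights are stored in FP32.

```python
# Assumed Qwen3-14B dimensions (verify against the base model's config.json).
HIDDEN = 5120
NUM_LAYERS = 40
NUM_HEADS = 40
NUM_KV_HEADS = 8
HEAD_DIM = 128
RANK = 16


def lora_params(in_features: int, out_features: int, r: int = RANK) -> int:
    """Weights added by one LoRA pair: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


q_out = NUM_HEADS * HEAD_DIM       # query projection output width
kv_out = NUM_KV_HEADS * HEAD_DIM   # key/value projection output width

per_layer = (
    lora_params(HIDDEN, q_out)        # q_proj
    + lora_params(HIDDEN, kv_out)     # k_proj
    + lora_params(HIDDEN, kv_out)     # v_proj
    + lora_params(q_out, HIDDEN)      # o_proj
)
total = per_layer * NUM_LAYERS

if __name__ == "__main__":
    print(f"per layer: {per_layer:,}")
    print(f"total:     {total:,}")
    # Assumption: 4 bytes per weight (FP32 storage).
    print(f"approx. size: {total * 4 / 1e6:.1f} MB")
```

Under these assumptions the total matches the 20,971,520 figure in the table, and the FP32 size estimate is roughly in line with the reported adapter size.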
Hardware & Training Time
- Hardware: NVIDIA GB10 Grace Blackwell (NVIDIA DGX Spark), 121 GB unified CPU+GPU memory
- Training duration: ~26 hours total (1,000 training steps + 5 evaluation passes × ~95 min each)
- Actual training steps: ~83 minutes (1,000 steps at ~5s/step)
- Framework: PyTorch 2.x, HuggingFace Transformers, PEFT 0.18.1, TRL (SFTTrainer)
Architecture Note
Qwen/Qwen3-14B is a standard dense transformer (not the Qwen3.5 hybrid variant). It does not use GatedDeltaNet linear attention layers, making it fully compatible with standard CUDA training without special kernel requirements.
Evaluation
Training Loss Progression
| Step | Train Loss | Token Accuracy |
|---|---|---|
| 10 | 2.972 | 55.5% |
| 50 | 1.502 | 69.5% |
| 100 | 1.124 | 76.6% |
| 200 | 1.061 | 77.3% |
| 600 | 1.052 | 77.3% |
| 1,000 | 1.068 | 76.7% |
Validation Loss (Dev Set, 4,183 samples)
| Checkpoint | Eval Loss | Token Accuracy |
|---|---|---|
| Step 200 | 0.9825 | 78.96% |
| Step 600 | 0.9746 | 79.02% |
| Step 800 | 0.9681 | 79.14% |
| Step 1000 | 0.9664 | 79.18% |
| Best (saved) | 0.9649 | 79.20% |
Eval loss decreased at every listed checkpoint, from 0.9825 at step 200 to 0.9664 at step 1000, with no sign of overfitting within this partial epoch.
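For readers who prefer perplexity, the eval losses convert directly, assuming the reported loss is the mean per-token cross-entropy in nats (the usual convention for this kind of trainer, but not confirmed here). A minimal sketch:

```python
import math

# Eval losses copied from the validation table above.
eval_losses = {
    "Step 200": 0.9825,
    "Step 600": 0.9746,
    "Step 800": 0.9681,
    "Step 1000": 0.9664,
    "Best (saved)": 0.9649,
}

# Assumption: loss is mean token-level cross-entropy in nats, so ppl = exp(loss).
perplexities = {name: math.exp(loss) for name, loss in eval_losses.items()}

if __name__ == "__main__":
    for name, ppl in perplexities.items():
        print(f"{name:<13} perplexity = {ppl:.3f}")
```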
MCQ Accuracy Comparison (Dev Split, 4,183 samples)
| Model | Accuracy | Notes |
|---|---|---|
| Qwen3-14B zero-shot | 27.4% | Format failures common (~12.5% None responses) |
| Qwen3-14B + LoRA (this adapter) | TBD | Evaluation in progress |
Zero-shot accuracy is depressed by format non-compliance: the base model frequently fails to output a clean A/B/C/D answer in zero-shot settings. LoRA fine-tuning is intended to improve both format adherence and domain knowledge; the adapter's MCQ accuracy is still being evaluated.
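To separate the effect of format failures from the effect of wrong answers, the zero-shot figure can be rescaled to the responses that were actually parseable. This is a rough sketch, assuming that `None` responses were counted as incorrect and that the ~12.5% rate is measured over all 4,183 dev questions; neither assumption is confirmed by this card.

```python
DEV_SAMPLES = 4_183
OVERALL_ACCURACY = 0.274   # zero-shot accuracy from the table
NONE_RATE = 0.125          # approximate share of unparseable responses

# Assumption: unparseable responses are scored as wrong.
parsed_fraction = 1.0 - NONE_RATE
accuracy_on_parsed = OVERALL_ACCURACY / parsed_fraction
chance_level = 1 / 4       # four options per question

if __name__ == "__main__":
    print(f"parseable responses: ~{round(DEV_SAMPLES * parsed_fraction)}")
    print(f"accuracy on parsed:  {accuracy_on_parsed:.1%}")
    print(f"chance level:        {chance_level:.1%}")
```

Under these assumptions, accuracy on parseable responses comes out at about 31%, compared with a 25% chance level for four options.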
Per-Subject Zero-Shot Baseline (for reference)
- Best subjects: Anaesthesia (38.2%), Psychiatry (37.5%), Radiology (31.9%)
- Weakest subjects: Orthopaedics (10.0%), Skin (11.8%), Anatomy (19.2%)
Comparison with Qwen3.5-9B Adapter
| | Qwen3.5-9B adapter | This adapter (Qwen3-14B) |
|---|---|---|
| Base model params | 9B | 14B |
| Architecture | Hybrid (GatedDeltaNet) | Standard transformer |
| Trainable params | 7.1M (0.079%) | 21.0M (0.142%) |
| Best eval loss | 0.9669 | 0.9649 |
| Best token acc | 78.7% | 79.20% |
| Adapter size | 28 MB | 81 MB |
Citation
@inproceedings{pmlr-v174-pal22a,
title = {MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering},
author = {Pal, Ankit and Umapathi, Logesh Kumar and Sankarasubbu, Malaikannan},
booktitle = {Proceedings of the Conference on Health, Inference, and Learning},
year = {2022},
publisher = {PMLR}
}
Framework Versions
- PEFT 0.18.1
- Transformers (latest as of March 2026)
- TRL (SFTTrainer)
- PyTorch 2.x + CUDA