Qwen3-14B MedMCQA LoRA Adapter

A LoRA adapter fine-tuned from Qwen/Qwen3-14B on the MedMCQA dataset: roughly 182K medical multiple-choice questions covering 21 subjects, drawn from Indian medical entrance exams (AIIMS/PG style).

Model Details

  • Developed by: James Oon (@jamezoon), SUTD MSTR-DAIE Deep Learning Project
  • Model type: Causal LM with LoRA adapter (PEFT)
  • Base model: Qwen/Qwen3-14B (dense, 14B parameters, BF16, standard transformer)
  • Language: English
  • License: Follows base model license (Qwen3)
  • Adapter size: ~81 MB (adapter_model.safetensors)

Intended Use

Medical multiple-choice question answering. Given a clinical question and four options (A–D), the model selects one answer and gives a step-by-step explanation. Subjects covered include Physiology, Anatomy, Biochemistry, Pathology, Pharmacology, Surgery, Medicine, Dental, Gynaecology, Paediatrics, and more.

Not intended for real clinical decision-making. This is a research/educational model.

How to Get Started

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

base_model_id = "Qwen/Qwen3-14B"
adapter_id = "jamezoon/qwen3-14b-medmcqa-lora"

# Load the tokenizer and the BF16 base model, then attach the LoRA adapter weights.
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, adapter_id)

messages = [
    {"role": "system", "content": "You are a helpful tutor for pre-med students preparing for medical entrance exams. Answer the following multiple choice question by thinking step by step, then give the answer."},
    {"role": "user", "content": (
        "Question: Which of the following is the most common cause of mitral stenosis?\n"
        "Options: A. Rheumatic fever  B. Congenital  C. Infective endocarditis  D. SLE\n"
        "Think step by step. Then respond in the format:\n"
        "Explanation: ...\nAnswer: <one of A, B, C, D>"
    )},
]

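# Render the chat template to a prompt string, then tokenize it.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
# Greedy decoding (do_sample=False) keeps the output deterministic.
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))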

Training Details

Dataset

  • MedMCQA — 182,822 training samples, 4,183 validation samples
  • 21 medical subjects (Dental, Surgery, Medicine, Pathology, Pharmacology, etc.)
  • Each sample: question + 4 options + correct answer (1-indexed) + explanation
  • Formatted as chat messages with system/user/assistant roles
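
The chat formatting described above can be sketched roughly as follows. This is an illustration, not the exact training template, which this card does not show. The field names (question, opa..opd, cop, exp) are assumed from the public MedMCQA release, the prompts are copied from the How to Get Started example, and to_chat_messages is a hypothetical helper name.

```python
SYSTEM_PROMPT = (
    "You are a helpful tutor for pre-med students preparing for medical entrance exams. "
    "Answer the following multiple choice question by thinking step by step, then give the answer."
)


def to_chat_messages(sample: dict) -> list:
    """Convert one MedMCQA-style record into system/user/assistant messages."""
    letters = "ABCD"
    options = [sample["opa"], sample["opb"], sample["opc"], sample["opd"]]
    option_text = "  ".join(f"{letter}. {opt}" for letter, opt in zip(letters, options))
    user = (
        f"Question: {sample['question']}\n"
        f"Options: {option_text}\n"
        "Think step by step. Then respond in the format:\n"
        "Explanation: ...\nAnswer: <one of A, B, C, D>"
    )
    # Assumption: `cop` is 1-indexed, as stated in the dataset description.
    answer = letters[sample["cop"] - 1]
    # Some records may have no explanation; fall back to an empty string.
    explanation = sample.get("exp") or ""
    assistant = f"Explanation: {explanation}\nAnswer: {answer}"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user},
        {"role": "assistant", "content": assistant},
    ]
```

Some copies of the dataset may store the correct option 0-indexed, so it is worth checking the indexing of the copy you load before subtracting one.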

Training Procedure

| Hyperparameter | Value |
|---|---|
| Training steps | 1,000 (max_steps; ~8.7% of one full epoch) |
| Epochs | 1 (partial) |
| Per-device batch size | 4 |
| Gradient accumulation | 4 (effective batch = 16) |
| Learning rate | 2e-4 |
| LR scheduler | Cosine |
| Warmup steps | 100 |
| Max sequence length | 512 tokens |
| Precision | BF16 |
| Optimizer | AdamW (fused) |
| Max grad norm | 1.0 |
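
The arithmetic behind the effective batch size, the partial-epoch figure, and the learning-rate schedule can be checked with a short sketch. It assumes a single device, one sample per batch slot (no sequence packing), and a linear warmup followed by a half-cosine decay to zero, which is the common form of a cosine schedule with warmup. The card does not state these details, and lr_at is a hypothetical helper.

```python
import math

TRAIN_SAMPLES = 182_822
MAX_STEPS = 1_000
PER_DEVICE_BATCH = 4
GRAD_ACCUM = 4
BASE_LR = 2e-4
WARMUP_STEPS = 100

# Samples consumed per optimizer step (single device assumed).
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM
# Total samples seen over the run, and the share of one epoch that represents.
samples_seen = MAX_STEPS * effective_batch
epoch_fraction = samples_seen / TRAIN_SAMPLES


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print(f"effective batch: {effective_batch}")
    print(f"samples seen:    {samples_seen}")
    print(f"epoch fraction:  {epoch_fraction:.2%}")
    for step in (0, 50, 100, 550, 1000):
        print(f"step {step:>4}: lr = {lr_at(step):.2e}")
```

Under these assumptions the run sees 16,000 samples, which is where the roughly 8.7% of one epoch comes from.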

LoRA Configuration

| Parameter | Value |
|---|---|
| Rank (r) | 16 |
| Alpha (α) | 32 |
| Dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj |
| Trainable parameters | 20,971,520 (0.1418% of 14B) |
| Bias | none |
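
The trainable-parameter count can be reproduced from the LoRA rank and the attention projection shapes. The sketch below assumes the dimensions from the published Qwen3-14B configuration (hidden size 5120, 40 layers, 40 query heads, 8 key/value heads, head dimension 128); those values are not stated in this card. Each adapted linear layer adds two matrices, A (r × in) and B (out × r).

```python
HIDDEN = 5120      # assumed hidden size of Qwen3-14B
LAYERS = 40        # assumed number of transformer layers
Q_HEADS = 40       # assumed number of query heads
KV_HEADS = 8       # assumed number of key/value heads (grouped-query attention)
HEAD_DIM = 128     # assumed per-head dimension
RANK = 16          # LoRA rank from the table above


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Parameters added by one LoRA pair: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


q_out = Q_HEADS * HEAD_DIM    # output width of q_proj
kv_out = KV_HEADS * HEAD_DIM  # output width of k_proj and v_proj

per_layer = (
    lora_params(HIDDEN, q_out, RANK)     # q_proj
    + lora_params(HIDDEN, kv_out, RANK)  # k_proj
    + lora_params(HIDDEN, kv_out, RANK)  # v_proj
    + lora_params(q_out, HIDDEN, RANK)   # o_proj
)
total = per_layer * LAYERS

if __name__ == "__main__":
    print(f"per layer: {per_layer:,}")
    print(f"total:     {total:,}")
```

With these assumed dimensions the total matches the 20,971,520 trainable parameters reported above.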

Hardware & Training Time

  • Hardware: NVIDIA GB10 Grace Blackwell (NVIDIA DGX Spark), 121 GB unified CPU+GPU memory
  • Training duration: ~26 hours total (1,000 training steps + 5 evaluation passes × ~95 min each)
  • Training steps alone: ~83 minutes (1,000 steps at ~5 s/step)
  • Framework: PyTorch 2.x, HuggingFace Transformers, PEFT 0.18.1, TRL (SFTTrainer)

Architecture Note

Qwen/Qwen3-14B is a standard dense transformer (not the Qwen3.5 hybrid variant). It does not use GatedDeltaNet linear attention layers, making it fully compatible with standard CUDA training without special kernel requirements.

Evaluation

Training Loss Progression

| Step | Train Loss | Token Accuracy |
|---|---|---|
| 10 | 2.972 | 55.5% |
| 50 | 1.502 | 69.5% |
| 100 | 1.124 | 76.6% |
| 200 | 1.061 | 77.3% |
| 600 | 1.052 | 77.3% |
| 1,000 | 1.068 | 76.7% |

Validation Loss (Dev Set, 4,183 samples)

| Checkpoint | Eval Loss | Token Accuracy |
|---|---|---|
| Step 200 | 0.9825 | 78.96% |
| Step 600 | 0.9746 | 79.02% |
| Step 800 | 0.9681 | 79.14% |
| Step 1000 | 0.9664 | 79.18% |
| Best (saved) | 0.9649 | 79.20% |

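Eval loss decreased at every reported checkpoint, from 0.9825 at step 200 to 0.9664 at step 1,000, which suggests the adapter was still generalising rather than overfitting when training stopped.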
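
For readers who prefer perplexity, the eval losses in the table above can be converted with a one-line formula. This sketch assumes the reported loss is the mean per-token cross-entropy in nats, which is the usual convention for causal LM training but is not stated explicitly in this card.

```python
import math

# Eval losses at the reported checkpoints, copied from the table above.
eval_losses = {200: 0.9825, 600: 0.9746, 800: 0.9681, 1000: 0.9664}


def perplexity(loss: float) -> float:
    """Perplexity is the exponential of the mean per-token cross-entropy."""
    return math.exp(loss)


# True when each reported checkpoint has a lower loss than the one before it.
steps = sorted(eval_losses)
strictly_decreasing = all(
    eval_losses[a] > eval_losses[b] for a, b in zip(steps, steps[1:])
)

if __name__ == "__main__":
    for step in steps:
        print(f"step {step:>4}: loss {eval_losses[step]:.4f}, ppl {perplexity(eval_losses[step]):.3f}")
    print("strictly decreasing:", strictly_decreasing)
```

At these loss values the perplexity is roughly 2.6, so the differences between checkpoints are small in absolute terms.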

MCQ Accuracy Comparison (Dev Split, 4,183 samples)

| Model | Accuracy | Notes |
|---|---|---|
| Qwen3-14B zero-shot | 27.4% | Format failures common (~12.5% None responses) |
| Qwen3-14B + LoRA (this adapter) | TBD | Evaluation in progress |

Format non-compliance depresses the zero-shot accuracy: the base model frequently fails to output a clean A/B/C/D answer in zero-shot settings. LoRA fine-tuning is intended to improve both format adherence and domain knowledge; the adapter's MCQ accuracy is still being evaluated.
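
One way to score responses in the "Answer: <letter>" format is sketched below. The evaluation script used for this card is not shown, so extract_answer and score are hypothetical helpers, and counting unparseable (None) responses as wrong is an assumption.

```python
import re
from typing import List, Optional

# Matches "Answer: A" or "Answer: (A)"; the word boundary rejects e.g. "Answer: Anatomy".
ANSWER_RE = re.compile(r"Answer:\s*\(?([A-D])\b")


def extract_answer(text: str) -> Optional[str]:
    """Return the last A-D letter that follows 'Answer:', or None if there is none."""
    matches = ANSWER_RE.findall(text)
    return matches[-1] if matches else None


def score(outputs: List[str], gold: List[str]) -> dict:
    """Compute accuracy with None counted as wrong, plus the None rate."""
    parsed = [extract_answer(o) for o in outputs]
    n = len(gold)
    correct = sum(p == g for p, g in zip(parsed, gold))
    failures = sum(p is None for p in parsed)
    answered = n - failures
    return {
        "accuracy": correct / n,
        "none_rate": failures / n,
        "accuracy_when_parsed": correct / answered if answered else 0.0,
    }


if __name__ == "__main__":
    outputs = ["Explanation: ...\nAnswer: A", "I think it is B", "Answer: C. foo"]
    print(score(outputs, ["A", "B", "D"]))
```

If the ~12.5% None responses are counted as wrong, the 27.4% overall figure corresponds to roughly 31% accuracy on the responses that did parse (0.274 / 0.875).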

Per-Subject Zero-Shot Baseline (for reference)

  • Best subjects: Anaesthesia (38.2%), Psychiatry (37.5%), Radiology (31.9%)
  • Weakest subjects: Orthopaedics (10.0%), Skin (11.8%), Anatomy (19.2%)
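
A per-subject breakdown like the one above can be tallied with a small helper. This is a minimal sketch that assumes each evaluation record is a (subject, predicted letter or None, gold letter) tuple; the record layout and the name per_subject_accuracy are illustrative and not taken from the author's evaluation code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

Record = Tuple[str, Optional[str], str]  # (subject, predicted, gold)


def per_subject_accuracy(records: Iterable[Record]) -> Dict[str, float]:
    """Group records by subject and return the fraction answered correctly."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for subject, predicted, gold in records:
        totals[subject] += 1
        # A None prediction (unparseable output) never equals the gold letter.
        correct[subject] += int(predicted == gold)
    return {subject: correct[subject] / totals[subject] for subject in totals}


if __name__ == "__main__":
    toy = [
        ("Anatomy", "A", "A"),
        ("Anatomy", None, "B"),
        ("Skin", "C", "D"),
    ]
    for subject, acc in sorted(per_subject_accuracy(toy).items()):
        print(f"{subject}: {acc:.1%}")
```

Subjects with few dev-set questions will have noisy percentages, so it helps to report the per-subject counts alongside the accuracies.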

Comparison with Qwen3.5-9B Adapter

| | Qwen3.5-9B adapter | This adapter (Qwen3-14B) |
|---|---|---|
| Base model params | 9B | 14B |
| Architecture | Hybrid (GatedDeltaNet) | Standard transformer |
| Trainable params | 7.1M (0.079%) | 21.0M (0.142%) |
| Best eval loss | 0.9669 | 0.9649 |
| Best token acc | 78.7% | 79.20% |
| Adapter size | 28 MB | 81 MB |

Citation

@inproceedings{pmlr-v174-pal22a,
  title     = {MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering},
  author    = {Pal, Ankit and Umapathi, Logesh Kumar and Sankarasubbu, Malaikannan},
  booktitle = {Proceedings of the Conference on Health, Inference, and Learning},
  year      = {2022},
  publisher = {PMLR}
}

Framework Versions

  • PEFT 0.18.1
  • Transformers (latest as of March 2026)
  • TRL (SFTTrainer)
  • PyTorch 2.x + CUDA