Qwen3.5-9B MedMCQA LoRA Adapter

A LoRA fine-tuned adapter for Qwen/Qwen3.5-9B on the MedMCQA dataset: 182K medical multiple-choice questions covering 21 subjects from Indian medical entrance exams (AIIMS/PG style).

Model Details

  • Developed by: James Oon (@jamezoon), SUTD MSTR-DAIE Deep Learning Project
  • Model type: Causal LM with LoRA adapter (PEFT)
  • Base model: Qwen/Qwen3.5-9B (dense, 9B parameters, BF16)
  • Language: English
  • License: Follows base model license (Qwen3.5)
  • Adapter size: ~28 MB (adapter_model.safetensors)

Intended Use

Medical multiple-choice question answering. Given a clinical question and 4 options (A–D), the model selects the correct answer with a step-by-step explanation. Subjects covered include Physiology, Anatomy, Biochemistry, Pathology, Pharmacology, Surgery, Medicine, Dental, Gynaecology, Paediatrics, and more.

Not intended for real clinical decision-making. This is a research/educational model.

How to Get Started

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

base_model_id = "Qwen/Qwen3.5-9B"
adapter_id = "jamezoon/qwen3-5-9b-medmcqa-lora"

# Load the base model in BF16 (matching training precision), then attach the adapter.
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, adapter_id)
model.eval()

messages = [
    {"role": "system", "content": "You are a helpful tutor for pre-med students preparing for medical entrance exams. Answer the following multiple choice question by thinking step by step, then give the answer."},
    {"role": "user", "content": (
        "Question: Which of the following is the most common cause of mitral stenosis?\n"
        "Options: A. Rheumatic fever  B. Congenital  C. Infective endocarditis  D. SLE\n"
        "Think step by step. Then respond in the format:\n"
        "Explanation: ...\nAnswer: <one of A, B, C, D>"
    )},
]

# Render the chat template, generate greedily, and decode only the newly generated tokens.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Training Details

Dataset

  • MedMCQA: 182,822 training samples, 4,183 validation samples
  • 21 medical subjects (Dental, Surgery, Medicine, Pathology, Pharmacology, etc.)
  • Each sample: question + 4 options + correct answer (1-indexed) + explanation
  • Formatted as chat messages with system/user/assistant roles
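The mapping from a raw MedMCQA record to chat messages can be sketched as follows. This is illustrative, not the actual preprocessing script: field names (`question`, `opa`–`opd`, `cop`, `exp`) follow the public MedMCQA release, and per the card the correct answer index is 1-indexed.

```python
# Illustrative sketch: convert one MedMCQA record into the system/user/assistant
# chat format described above. Field names are assumptions based on the public
# MedMCQA release; the card states the correct answer (cop) is 1-indexed.
LETTERS = "ABCD"

def to_chat(sample: dict) -> list[dict]:
    options = [sample["opa"], sample["opb"], sample["opc"], sample["opd"]]
    option_text = "  ".join(f"{l}. {o}" for l, o in zip(LETTERS, options))
    answer = LETTERS[sample["cop"] - 1]  # 1-indexed integer -> letter
    return [
        {"role": "system", "content": "You are a helpful tutor for pre-med students ..."},
        {"role": "user", "content": (
            f"Question: {sample['question']}\n"
            f"Options: {option_text}\n"
            "Think step by step. Then respond in the format:\n"
            "Explanation: ...\nAnswer: <one of A, B, C, D>"
        )},
        {"role": "assistant", "content": (
            f"Explanation: {sample.get('exp') or ''}\nAnswer: {answer}"
        )},
    ]
```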

Training Procedure

  • Training steps: 1,000 (max_steps)
  • Epochs: 1
  • Per-device batch size: 4
  • Gradient accumulation: 4 (effective batch = 16)
  • Learning rate: 2e-4
  • LR scheduler: cosine
  • Warmup steps: 100
  • Max sequence length: 512 tokens
  • Precision: BF16
  • Optimizer: AdamW (fused)
  • Max grad norm: 1.0
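In Transformers/TRL terms, the hyperparameters above correspond roughly to the following `TrainingArguments` keyword arguments. This is a sketch of the configuration, not the actual training script:

```python
# Sketch: the hyperparameter table above expressed as TrainingArguments kwargs.
# The exact training script may differ; this only mirrors the documented values.
training_kwargs = dict(
    max_steps=1000,
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch = 4 * 4 = 16
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_steps=100,
    bf16=True,
    optim="adamw_torch_fused",
    max_grad_norm=1.0,
    eval_steps=200,
)

effective_batch = (training_kwargs["per_device_train_batch_size"]
                   * training_kwargs["gradient_accumulation_steps"])
# Then e.g.: args = transformers.TrainingArguments(output_dir="out", **training_kwargs)
```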

LoRA Configuration

  • Rank (r): 16
  • Alpha (α): 32
  • Dropout: 0.05
  • Target modules: q_proj, k_proj, v_proj, o_proj, out_proj
  • Trainable parameters: 7,077,888 (0.079% of 9B)
  • Bias: none
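With PEFT, the configuration above would be expressed roughly as follows (a sketch; `task_type="CAUSAL_LM"` is an assumption consistent with causal-LM fine-tuning):

```python
from peft import LoraConfig

# Sketch of the LoRA configuration described in the table above.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "out_proj"],
    bias="none",
    task_type="CAUSAL_LM",  # assumption: standard setting for causal-LM SFT
)
```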

Hardware & Training Time

  • Hardware: NVIDIA GB10 Grace Blackwell (NVIDIA DGX Spark), 121 GB unified CPU+GPU memory
  • Training duration: ~50 minutes (1,000 steps at ~3s/step)
  • Total run time including evaluation: ~10 hours (eval_steps=200 triggered 5 evaluation passes × ~55 min each, caused by GatedDeltaNet inference overhead)
  • Framework: PyTorch 2.x, HuggingFace Transformers, PEFT 0.18.1, TRL (SFTTrainer)

Known Technical Issue Fixed During Development

Qwen3.5-9B uses a hybrid GatedDeltaNet architecture (linear attention layers interleaved with standard softmax attention). This required a monkey-patch to transformers.masking_utils and modeling_qwen3_5.py to fix a 5D attention mask shape mismatch ([batch, 1, 1, seq, seq] → [batch, 1, seq, seq]) that caused runtime errors during training.
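The mismatch itself is an extra singleton dimension in the mask tensor. Purely as an illustration of the shape change (this is not the actual patch, which modifies transformers internals):

```python
import torch

# Illustrative only: collapse the extra singleton dim behind the mismatch,
# [batch, 1, 1, seq, seq] -> [batch, 1, seq, seq]. The real fix patches
# transformers.masking_utils / modeling_qwen3_5.py, not user code.
mask = torch.zeros(2, 1, 1, 8, 8)
if mask.dim() == 5 and mask.shape[2] == 1:
    mask = mask.squeeze(2)
print(mask.shape)  # torch.Size([2, 1, 8, 8])
```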

Without the CUDA kernel (causal-conv1d, flash-linear-attention), GatedDeltaNet falls back to a PyTorch implementation running at ~3โ€“5s/step instead of the expected ~0.5s/step. This affected both training and evaluation speed on this hardware.

Evaluation

Baseline Comparison (dev split, 4,183 samples)

  • Qwen3.5-9B zero-shot: 26.6% (barely above random, 25%; format failures common)
  • Qwen3.5-9B + LoRA (this adapter): TBD (evaluation in progress)

Zero-shot accuracy is low primarily due to format non-compliance: the base model frequently fails to output a clean A/B/C/D answer in zero-shot settings. Fine-tuning addresses both format adherence and medical knowledge.

Per-Subject Baseline Accuracy (zero-shot, for reference)

Best subjects: Forensic Medicine (35.8%), Gynaecology (31.7%), Psychiatry (31.3%)
Weakest subjects: Ophthalmology (19.0%), Orthopaedics (20.0%), Radiology (23.2%)

Citation

If you use this adapter, please cite the MedMCQA dataset:

@inproceedings{pmlr-v174-pal22a,
  title     = {MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering},
  author    = {Pal, Ankit and Umapathi, Logesh Kumar and Sankarasubbu, Malaikannan},
  booktitle = {Proceedings of the Conference on Health, Inference, and Learning},
  year      = {2022},
  publisher = {PMLR}
}

Framework Versions

  • PEFT 0.18.1
  • Transformers (latest as of March 2026)
  • TRL (SFTTrainer)
  • PyTorch 2.x + CUDA
