🧠 Diagnostic-Reasoning-Q3
The highest-performing open-source sub-10B clinical reasoning model on MedXpertQA
🩺 8B parameters · ⚡ ~$300 training cost · 🛡️ 98.3% safety · 🏥 Runs on a single consumer GPU
Diagnostic-Reasoning-Q3X1 (Q3) is an 8-billion parameter clinical reasoning model built on the Qwen3-8B base using the Pentabrid training framework. It achieves competitive performance with frontier models 9–84× larger on the most challenging medical reasoning benchmark available.
📄 Paper: Training for Reasoning, Not Retrieval: How Behavioural Fine-Tuning Enables a Sub-10B Parameter Model to Compete with Frontier Clinical AI
👨‍⚕️ Authors: Adnan Agha, Eram Anwar — College of Medicine and Health Sciences, UAE University
📹 Live Evaluation: Watch the full 10,578-question evaluation session
Key Results
| Metric | Value |
|---|---|
| MedXpertQA Text (expert reasoning) | 23.8% (584/2450) |
| MedQA — USMLE | 72.7% (926/1273) |
| MMLU Medical Genetics | 91.0% (91/100) |
| MMLU Professional Medicine | 88.2% (240/272) |
| MMLU Clinical Knowledge | 87.9% (233/265) |
| MMLU Anatomy | 79.3% (107/135) |
| PubMedQA | 75.2% (752/1000) |
| MedMCQA | 60.5% (2531/4183) |
| MedSafetyBench (900 items) | 98.3% refusal rate |
| Parameter Efficiency Ratio | 2.98 pp/B (MedXpertQA accuracy in percentage points per billion parameters) |
| Training Cost | ~$300 USD |
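The parameter-efficiency ratio in the table can be reproduced directly from the raw MedXpertQA counts:

```python
# Reproduce the parameter-efficiency ratio: MedXpertQA accuracy in
# percentage points divided by model size in billions of parameters.
accuracy_pp = 584 / 2450 * 100      # 23.8% on MedXpertQA Text
efficiency = accuracy_pp / 8        # 8B parameters
print(f"{efficiency:.2f} pp/B")     # 2.98
```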
MedXpertQA Leaderboard Position
Q3 ranks alongside models 9–84× larger on the official MedXpertQA Text evaluation:
| Rank | Model | Parameters | Accuracy | Type |
|---|---|---|---|---|
| 1 | o1 | Proprietary | 44.7% | Inference-time scaled |
| 2 | DeepSeek-R1 | 671B | 37.8% | Inference-time scaled |
| 3 | o3-mini | Proprietary | 37.3% | Inference-time scaled |
| 4 | GPT-4o | ~200B† | 30.4% | Vanilla |
| 5 | LLaMA-3.3-70B | 70B | 24.5% | Vanilla |
| 6 | DeepSeek-V3 | 685B (37B active) | 24.2% | Vanilla |
| 7 | Q3 (Ours) | 8B | 23.8% | Training-optimised |
| 8 | Claude-3.5 Sonnet | ~175B† | 21.3% | Vanilla |
| 9 | Gemini-2.0 Flash | MoE† | 20.6% | Vanilla |
| 10 | Qwen2.5-72B | 72B | 18.9% | Vanilla |
| 11 | QwQ-32B-Preview | 32B | 18.0% | Inference-time scaled |
†Estimated parameter counts. Comparator scores from Zuo et al. (ICML 2025). Q3 evaluated using identical official generative methodology.
No other sub-10B model approaches this performance tier.
Evaluation-Format Mismatch Penalty (Dual-Scoring Analysis)
Q3 was evaluated under both generative chain-of-thought and zero-shot log-likelihood scoring to investigate format sensitivity:
| Benchmark | Q3 Generative | Q3 Log-Likelihood | Qwen3-8B Base LL | Gen–LL Gap | Q3–Base LL Gap |
|---|---|---|---|---|---|
| Medical Genetics | 91.0% | 87.0% | 82.0% | +4.0pp | +5.0pp |
| Professional Medicine | 88.2% | 89.0% | 81.6% | −0.7pp | +7.4pp |
| Clinical Knowledge | 87.9% | 87.2% | 79.2% | +0.8pp | +7.9pp |
| Anatomy | 79.3% | 82.2% | 71.1% | −3.0pp | +11.1pp |
| MedQA — USMLE | 72.7% | 65.9% | 64.2% | +6.8pp | +1.7pp |
| MedMCQA | 60.5% | 58.5% | 59.8% | +2.0pp | −1.3pp |
On 4/6 benchmarks, generative accuracy exceeded log-likelihood accuracy. The largest gap (+6.8pp) occurred on MedQA, the most reasoning-intensive standard benchmark.
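The two scoring modes differ in mechanism: generative scoring extracts a letter from a full chain-of-thought completion, while log-likelihood scoring simply picks the option with the highest mean per-token log-probability under the model. A minimal sketch of the log-likelihood rule, using a toy stand-in for the model's log-probabilities (a real harness would read them from the LM's logits; no generation is involved):

```python
import math

def score_option(logprob_fn, question, option):
    """Mean per-token log-probability of the option text, conditioned on the question."""
    token_logprobs = logprob_fn(question, option)
    return sum(token_logprobs) / len(token_logprobs)

def pick_by_loglikelihood(logprob_fn, question, options):
    """Zero-shot log-likelihood scoring: choose the option the model finds most probable."""
    scores = {label: score_option(logprob_fn, question, text)
              for label, text in options.items()}
    return max(scores, key=scores.get), scores

# Toy stand-in for the model: assigns higher probability to the plausible answer.
def toy_logprob_fn(question, option):
    p = 0.5 if "hypothyroidism" in option else 0.1
    return [math.log(p)] * len(option.split())

options = {"A": "Graves disease", "B": "Primary hypothyroidism"}
label, scores = pick_by_loglikelihood(
    toy_logprob_fn, "Elevated TSH, fatigue, cold intolerance", options)
print(label)  # B
```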
Body-System Performance (MedXpertQA Text, n=2450)
| Body System | Correct | Total | Accuracy |
|---|---|---|---|
| Nervous | 110 | 386 | 28.5% |
| Integumentary | 13 | 48 | 27.1% |
| Lymphatic | 22 | 85 | 25.9% |
| Endocrine | 43 | 176 | 24.4% |
| Reproductive | 49 | 201 | 24.4% |
| Cardiovascular | 73 | 306 | 23.9% |
| Other / NA | 30 | 126 | 23.8% |
| Urinary | 29 | 123 | 23.6% |
| Skeletal | 82 | 355 | 23.1% |
| Respiratory | 44 | 193 | 22.8% |
| Digestive | 56 | 274 | 20.4% |
| Muscular | 33 | 177 | 18.6% |
| Overall | 584 | 2450 | 23.8% |
Range: 18.6–28.5% (9.9pp spread across 12 systems), indicating balanced reasoning acquisition rather than domain-specific memorisation.
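The reported spread can be checked against the table's raw counts:

```python
# Recompute per-system accuracies and the 9.9pp spread from the raw
# counts in the body-system table above.
systems = {
    "Nervous": (110, 386), "Integumentary": (13, 48), "Lymphatic": (22, 85),
    "Endocrine": (43, 176), "Reproductive": (49, 201), "Cardiovascular": (73, 306),
    "Other/NA": (30, 126), "Urinary": (29, 123), "Skeletal": (82, 355),
    "Respiratory": (44, 193), "Digestive": (56, 274), "Muscular": (33, 177),
}
acc = {name: correct / total * 100 for name, (correct, total) in systems.items()}
spread = max(acc.values()) - min(acc.values())
print(f"range {min(acc.values()):.1f}-{max(acc.values()):.1f}%, spread {spread:.1f}pp")
```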
Live Evaluation Recording
The complete evaluation session — all benchmarks including MedXpertQA (2450 questions), MedSafetyBench (900 items), and seven standard benchmarks — was recorded end-to-end in a single uninterrupted session.
▶️ Watch the full evaluation recording on asciinema
Total evaluation: 10,578 questions across 9 benchmarks. Single NVIDIA H100 80GB GPU. ~30 minutes total inference time. No-extract rate: 0.43% (46/10,578).
The Pentabrid Framework
Q3 was trained using the Pentabrid five-phase self-correcting reasoning protocol, which embeds structured clinical reasoning directly into model weights:
- Read All Options First — Prevents anchoring bias by requiring systematic review before evaluation
- Read the Question — Systematic extraction of clinical features, demographics, and key findings
- Evaluate Each Option — Explicit RIGHT/WRONG determination with mechanistic reasoning for every choice
- Self-Correction Check — Structured cognitive debiasing audit targeting anchoring, premature closure, availability heuristic, and search satisficing
- Final Selection — Deterministic answer extraction following the complete reasoning chain
This protocol mirrors how expert clinicians transition from Type 1 (pattern recognition) to Type 2 (analytical, probabilistic) reasoning when confronted with ambiguous clinical presentations.
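The exact Pentabrid templates are not public (the methodology is IP-protected; see Training Details), but a hypothetical prompt scaffold covering the five phases might look like:

```python
# Hypothetical scaffold only: the actual Pentabrid templates are not published.
PHASES = [
    "Phase 1 - Read all options first: list every answer choice before judging any.",
    "Phase 2 - Read the question: extract demographics, findings, and key clinical features.",
    "Phase 3 - Evaluate each option: mark each choice RIGHT or WRONG with mechanistic reasoning.",
    "Phase 4 - Self-correction check: audit for anchoring, premature closure, availability, satisficing.",
    "Phase 5 - Final selection: state the single best answer as 'Answer: <letter>'.",
]

def build_prompt(question):
    header = "You are an expert clinical reasoning assistant. Follow every phase in order.\n"
    return header + "\n".join(PHASES) + "\n\nQuestion:\n" + question

print(build_prompt("A 45-year-old woman presents with... A) ... B) ..."))
```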
Training Details
| Parameter | Value |
|---|---|
| Base model | Qwen3-8B |
| Method | QLoRA (4-bit base, BF16 adapters) |
| LoRA rank / alpha | 128 / 256 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Max sequence length | 8192 tokens |
| Effective batch size | 32 (micro-batch 2 × 16 gradient-accumulation steps) |
| Learning rate | 2×10⁻⁴ cosine, 5% warmup |
| Stabiliser-epoch learning rate | 5×10⁻⁵ |
| Training data | ~75,000 effective examples |
| Training time | 12–17 hours |
| Hardware | Single NVIDIA H100 80GB |
| Cost | ~$300 USD |
Full methodology details are protected under institutional intellectual property (UAEU reference IDF-00388).
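For readers reproducing a generic QLoRA setup from the table, the hyperparameters map onto the usual `peft.LoraConfig` and `TrainingArguments` keywords roughly as follows. This is a sketch of a standard QLoRA configuration only; it does not reproduce the protected Pentabrid data pipeline.

```python
# Hyperparameters from the table above, arranged as plain dicts mirroring
# the kwargs a typical peft.LoraConfig / TrainingArguments setup would take.
lora_kwargs = dict(
    r=128,
    lora_alpha=256,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
training_kwargs = dict(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    bf16=True,                      # BF16 adapters over the 4-bit base
)
effective_batch = (training_kwargs["per_device_train_batch_size"]
                   * training_kwargs["gradient_accumulation_steps"])
print(effective_batch)  # 32
```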
Inference
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Clinical-Reasoning-Hub/Diagnostic-Reasoning-Q3X1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

prompt = """You are an expert clinical reasoning assistant. Answer the following medical question using the five-phase reasoning protocol.

Question: A 45-year-old woman presents with fatigue, weight gain, and cold intolerance. TSH is 12 mIU/L. What is the most likely diagnosis?
A) Graves' disease
B) Primary hypothyroidism
C) Secondary hypothyroidism
D) Euthyroid sick syndrome
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Greedy decoding: temperature=0.0 is not valid when sampling, so disable sampling instead.
outputs = model.generate(**inputs, max_new_tokens=3000, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
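The generated output is a full reasoning chain, so a post-processing step is needed to recover the answer letter. The card does not publish its official extraction logic; a hypothetical regex-based extractor, assuming an "Answer: &lt;letter&gt;" style conclusion, could look like:

```python
import re

# Hypothetical extractor: take the LAST "Answer: X" style match so the final
# selection wins over letters mentioned earlier in the reasoning chain.
def extract_answer(text):
    matches = re.findall(r"(?:answer|final selection)\s*[:\-]?\s*\(?([A-D])\)?",
                         text, flags=re.IGNORECASE)
    return matches[-1] if matches else None

print(extract_answer("...Phase 5 reasoning... Answer: B"))  # B
```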
Hardware Requirements
| Precision | Minimum VRAM |
|---|---|
| BF16 (full) | ~18 GB |
| GPTQ 4-bit | ~6 GB |
| GGUF Q4_K_M | ~6 GB |
Runs on a single consumer GPU (RTX 3090/4090, A6000, or equivalent). No cloud API dependency required.
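The VRAM figures are consistent with a back-of-envelope weights-only estimate; the remaining headroom covers KV cache, activations, and (for the quantised formats) scale metadata:

```python
# Rough weights-only memory estimate for an 8B-parameter model.
params = 8e9
bf16_gb = params * 2 / 1024**3      # 2 bytes per parameter  -> ~14.9 GB
int4_gb = params * 0.5 / 1024**3    # ~0.5 bytes per parameter -> ~3.7 GB
print(f"BF16 weights ~{bf16_gb:.1f} GB, 4-bit weights ~{int4_gb:.1f} GB")
```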
Safety
Q3 achieved a 98.3% refusal rate (885/900) on MedSafetyBench, exceeding the unmodified Qwen3-8B base model (95.9%). Clinical reasoning optimisation improved safety rather than compromising it.
⚠️ This model is for research purposes only. It has not been validated for clinical use and should not be used as a substitute for professional medical advice, diagnosis, or treatment.
Citation
```bibtex
@article{agha2026training,
  title={Training for Reasoning, Not Retrieval: How Behavioural Fine-Tuning Enables a Sub-10B Parameter Model to Compete with Frontier Clinical AI},
  author={Agha, Adnan and Anwar, Eram},
  year={2026},
  institution={College of Medicine and Health Sciences, United Arab Emirates University}
}
```
Links
- Model weights: Clinical-Reasoning-Hub/Diagnostic-Reasoning-Q3X1
- Evaluation recording: asciinema.org/a/822289
- MedXpertQA benchmark: Zuo et al. (ICML 2025)
- Organisation: Clinical-Reasoning-Hub on HuggingFace
Licence
CC-BY-NC-ND-4.0
Developed by Dr Adnan Agha, Department of Internal Medicine, College of Medicine and Health Sciences, United Arab Emirates University, Al Ain, UAE.