🧠 Diagnostic-Reasoning-Q3
The highest-performing open-source sub-10B clinical reasoning model on MedXpertQA
🩺 8B parameters · ⚡ ~$300 training cost · 🛡️ 98.3% safety · 🏥 Runs on a single consumer GPU
Diagnostic-Reasoning-Q3X1 (Q3) is an 8-billion parameter clinical reasoning model built on the Qwen3-8B base using the Pentabrid training framework. It achieves competitive performance with frontier models 9–84× larger on the most challenging medical reasoning benchmark available.
📄 Paper: Training for Reasoning, Not Retrieval: How Behavioural Fine-Tuning Enables a Sub-10B Parameter Model to Compete with Frontier Clinical AI
👨‍⚕️ Authors: Adnan Agha, Eram Anwar — College of Medicine and Health Sciences, UAE University
📹 Live Evaluation: Watch the full 10,578-question evaluation session
Key Results
| Metric | Value |
|---|---|
| MedXpertQA Text (expert reasoning) | 23.8% (584/2450) |
| MedQA — USMLE | 72.7% (926/1273) |
| MMLU Medical Genetics | 91.0% (91/100) |
| MMLU Professional Medicine | 88.2% (240/272) |
| MMLU Clinical Knowledge | 87.9% (233/265) |
| MMLU Anatomy | 79.3% (107/135) |
| PubMedQA | 75.2% (752/1000) |
| MedMCQA | 60.5% (2531/4183) |
| MedSafetyBench (900 items) | 98.3% refusal rate |
| Parameter Efficiency Ratio | 2.98 pp/B (MedXpertQA accuracy in percentage points per billion parameters) |
| Training Cost | ~$300 USD |
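The parameter-efficiency ratio in the table can be reproduced directly from the raw MedXpertQA counts:

```python
# Reproduce the parameter-efficiency ratio: MedXpertQA accuracy in
# percentage points divided by model size in billions of parameters.
accuracy_pp = 584 / 2450 * 100      # 23.8% on MedXpertQA Text
efficiency = accuracy_pp / 8        # 8B parameters
print(f"{efficiency:.2f} pp/B")     # 2.98
```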
MedXpertQA Leaderboard Position
Q3 ranks alongside models 9–84× larger on the official MedXpertQA Text evaluation:
| Rank | Model | Parameters | Accuracy | Type |
|---|---|---|---|---|
| 1 | o1 | Proprietary | 44.7% | Inference-time scaled |
| 2 | DeepSeek-R1 | 671B | 37.8% | Inference-time scaled |
| 3 | o3-mini | Proprietary | 37.3% | Inference-time scaled |
| 4 | GPT-4o | ~200B† | 30.4% | Vanilla |
| 5 | LLaMA-3.3-70B | 70B | 24.5% | Vanilla |
| 6 | DeepSeek-V3 | 685B (37B active) | 24.2% | Vanilla |
| 7 | Q3 (Ours) | 8B | 23.8% | Training-optimised |
| 8 | Claude-3.5 Sonnet | ~175B† | 21.3% | Vanilla |
| 9 | Gemini-2.0 Flash | MoE† | 20.6% | Vanilla |
| 10 | Qwen2.5-72B | 72B | 18.9% | Vanilla |
| 11 | QwQ-32B-Preview | 32B | 18.0% | Inference-time scaled |
†Estimated parameter counts. Comparator scores from Zuo et al. (ICML 2025). Q3 evaluated using identical official generative methodology.
No other sub-10B model approaches this performance tier.
Evaluation-Format Mismatch Penalty (Dual-Scoring Analysis)
Q3 was evaluated under both generative chain-of-thought and zero-shot log-likelihood scoring to investigate format sensitivity:
| Benchmark | Q3 Generative | Q3 Log-Likelihood | Qwen3-8B Base LL | Gen–LL Gap | Q3–Base LL Gap |
|---|---|---|---|---|---|
| Medical Genetics | 91.0% | 87.0% | 82.0% | +4.0pp | +5.0pp |
| Professional Medicine | 88.2% | 89.0% | 81.6% | −0.7pp | +7.4pp |
| Clinical Knowledge | 87.9% | 87.2% | 79.2% | +0.8pp | +7.9pp |
| Anatomy | 79.3% | 82.2% | 71.1% | −3.0pp | +11.1pp |
| MedQA — USMLE | 72.7% | 65.9% | 64.2% | +6.8pp | +1.7pp |
| MedMCQA | 60.5% | 58.5% | 59.8% | +2.0pp | −1.3pp |
On 4/6 benchmarks, generative accuracy exceeded log-likelihood accuracy. The largest gap (+6.8pp) occurred on MedQA, the most reasoning-intensive standard benchmark.
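The two scoring modes differ in mechanism: generative scoring extracts a letter from a full chain-of-thought completion, while log-likelihood scoring simply picks the option with the highest mean per-token log-probability under the model. A minimal sketch of the log-likelihood rule, using a toy stand-in for the model's log-probabilities (a real harness would read them from the LM's logits; no generation is involved):

```python
import math

def score_option(logprob_fn, question, option):
    """Mean per-token log-probability of the option text, conditioned on the question."""
    token_logprobs = logprob_fn(question, option)
    return sum(token_logprobs) / len(token_logprobs)

def pick_by_loglikelihood(logprob_fn, question, options):
    """Zero-shot log-likelihood scoring: choose the option the model finds most probable."""
    scores = {label: score_option(logprob_fn, question, text)
              for label, text in options.items()}
    return max(scores, key=scores.get), scores

# Toy stand-in for the model: assigns higher probability to the plausible answer.
def toy_logprob_fn(question, option):
    p = 0.5 if "hypothyroidism" in option else 0.1
    return [math.log(p)] * len(option.split())

options = {"A": "Graves disease", "B": "Primary hypothyroidism"}
label, scores = pick_by_loglikelihood(
    toy_logprob_fn, "Elevated TSH, fatigue, cold intolerance", options)
print(label)  # B
```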
Body-System Performance (MedXpertQA Text, n=2450)
| Body System | Correct | Total | Accuracy |
|---|---|---|---|
| Nervous | 110 | 386 | 28.5% |
| Integumentary | 13 | 48 | 27.1% |
| Lymphatic | 22 | 85 | 25.9% |
| Endocrine | 43 | 176 | 24.4% |
| Reproductive | 49 | 201 | 24.4% |
| Cardiovascular | 73 | 306 | 23.9% |
| Other / NA | 30 | 126 | 23.8% |
| Urinary | 29 | 123 | 23.6% |
| Skeletal | 82 | 355 | 23.1% |
| Respiratory | 44 | 193 | 22.8% |
| Digestive | 56 | 274 | 20.4% |
| Muscular | 33 | 177 | 18.6% |
| Overall | 584 | 2450 | 23.8% |
Range: 18.6–28.5% (9.9pp spread across 12 systems), indicating balanced reasoning acquisition rather than domain-specific memorisation.
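The reported spread can be checked against the table's raw counts:

```python
# Recompute per-system accuracies and the 9.9pp spread from the raw
# counts in the body-system table above.
systems = {
    "Nervous": (110, 386), "Integumentary": (13, 48), "Lymphatic": (22, 85),
    "Endocrine": (43, 176), "Reproductive": (49, 201), "Cardiovascular": (73, 306),
    "Other/NA": (30, 126), "Urinary": (29, 123), "Skeletal": (82, 355),
    "Respiratory": (44, 193), "Digestive": (56, 274), "Muscular": (33, 177),
}
acc = {name: correct / total * 100 for name, (correct, total) in systems.items()}
spread = max(acc.values()) - min(acc.values())
print(f"range {min(acc.values()):.1f}-{max(acc.values()):.1f}%, spread {spread:.1f}pp")
```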
Live Evaluation Recording
The complete evaluation session — all benchmarks including MedXpertQA (2450 questions), MedSafetyBench (900 items), and seven standard benchmarks — was recorded end-to-end in a single uninterrupted session.
▶️ Watch the full evaluation recording on asciinema
Total evaluation: 10,578 questions across 9 benchmarks. Single NVIDIA H100 80GB GPU. ~30 minutes total inference time. No-extract rate: 0.43% (46/10,578).
The Pentabrid Framework
Q3 was trained using the Pentabrid five-phase self-correcting reasoning protocol, which embeds structured clinical reasoning directly into model weights:
- Read All Options First — Prevents anchoring bias by requiring systematic review before evaluation
- Read the Question — Systematic extraction of clinical features, demographics, and key findings
- Evaluate Each Option — Explicit RIGHT/WRONG determination with mechanistic reasoning for every choice
- Self-Correction Check — Structured cognitive debiasing audit targeting anchoring, premature closure, availability heuristic, and search satisficing
- Final Selection — Deterministic answer extraction following the complete reasoning chain
This protocol mirrors how expert clinicians transition from Type 1 (pattern recognition) to Type 2 (analytical, probabilistic) reasoning when confronted with ambiguous clinical presentations.
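The exact Pentabrid templates are not public (the methodology is IP-protected; see Training Details), but a hypothetical prompt scaffold covering the five phases might look like:

```python
# Hypothetical scaffold only: the actual Pentabrid templates are not published.
PHASES = [
    "Phase 1 - Read all options first: list every answer choice before judging any.",
    "Phase 2 - Read the question: extract demographics, findings, and key clinical features.",
    "Phase 3 - Evaluate each option: mark each choice RIGHT or WRONG with mechanistic reasoning.",
    "Phase 4 - Self-correction check: audit for anchoring, premature closure, availability, satisficing.",
    "Phase 5 - Final selection: state the single best answer as 'Answer: <letter>'.",
]

def build_prompt(question):
    header = "You are an expert clinical reasoning assistant. Follow every phase in order.\n"
    return header + "\n".join(PHASES) + "\n\nQuestion:\n" + question

print(build_prompt("A 45-year-old woman presents with... A) ... B) ..."))
```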
Training Details
| Parameter | Value |
|---|---|
| Base model | Qwen3-8B |
| Method | QLoRA (4-bit base, BF16 adapters) |
| LoRA rank / alpha | 128 / 256 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Max sequence length | 8192 tokens |
| Effective batch size | 32 (micro-batch 2 × 16 gradient-accumulation steps) |
| Learning rate | 2×10⁻⁴ cosine, 5% warmup |
| Stabiliser-epoch learning rate | 5×10⁻⁵ |
| Training data | ~75,000 effective examples |
| Training time | 12–17 hours |
| Hardware | Single NVIDIA H100 80GB |
| Cost | ~$300 USD |
Full methodology details are protected under institutional intellectual property (UAEU reference IDF-00388).
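For readers reproducing a generic QLoRA setup from the table, the hyperparameters map onto the usual `peft.LoraConfig` and `TrainingArguments` keywords roughly as follows. This is a sketch of a standard QLoRA configuration only; it does not reproduce the protected Pentabrid data pipeline.

```python
# Hyperparameters from the table above, arranged as plain dicts mirroring
# the kwargs a typical peft.LoraConfig / TrainingArguments setup would take.
lora_kwargs = dict(
    r=128,
    lora_alpha=256,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
training_kwargs = dict(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    bf16=True,                      # BF16 adapters over the 4-bit base
)
effective_batch = (training_kwargs["per_device_train_batch_size"]
                   * training_kwargs["gradient_accumulation_steps"])
print(effective_batch)  # 32
```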
Inference
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Clinical-Reasoning-Hub/Diagnostic-Reasoning-Q3X1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

prompt = """You are an expert clinical reasoning assistant. Answer the following medical question using the five-phase reasoning protocol.

Question: A 45-year-old woman presents with fatigue, weight gain, and cold intolerance. TSH is 12 mIU/L. What is the most likely diagnosis?
A) Graves' disease
B) Primary hypothyroidism
C) Secondary hypothyroidism
D) Euthyroid sick syndrome
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Greedy decoding: temperature=0.0 is not valid when sampling, so disable sampling instead.
outputs = model.generate(**inputs, max_new_tokens=3000, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
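The generated output is a full reasoning chain, so a post-processing step is needed to recover the answer letter. The card does not publish its official extraction logic; a hypothetical regex-based extractor, assuming an "Answer: &lt;letter&gt;" style conclusion, could look like:

```python
import re

# Hypothetical extractor: take the LAST "Answer: X" style match so the final
# selection wins over letters mentioned earlier in the reasoning chain.
def extract_answer(text):
    matches = re.findall(r"(?:answer|final selection)\s*[:\-]?\s*\(?([A-D])\)?",
                         text, flags=re.IGNORECASE)
    return matches[-1] if matches else None

print(extract_answer("...Phase 5 reasoning... Answer: B"))  # B
```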
Hardware Requirements
| Precision | Minimum VRAM |
|---|---|
| BF16 (full) | ~18 GB |
| GPTQ 4-bit | ~6 GB |
| GGUF Q4_K_M | ~6 GB |
Runs on a single consumer GPU (RTX 3090/4090, A6000, or equivalent). No cloud API dependency required.
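The VRAM figures are consistent with a back-of-envelope weights-only estimate; the remaining headroom covers KV cache, activations, and (for the quantised formats) scale metadata:

```python
# Rough weights-only memory estimate for an 8B-parameter model.
params = 8e9
bf16_gb = params * 2 / 1024**3      # 2 bytes per parameter  -> ~14.9 GB
int4_gb = params * 0.5 / 1024**3    # ~0.5 bytes per parameter -> ~3.7 GB
print(f"BF16 weights ~{bf16_gb:.1f} GB, 4-bit weights ~{int4_gb:.1f} GB")
```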
Safety
Q3 achieved a 98.3% refusal rate (885/900) on MedSafetyBench, exceeding the unmodified Qwen3-8B base model (95.9%). Clinical reasoning optimisation improved safety rather than compromising it.
⚠️ This model is for research purposes only. It has not been validated for clinical use and should not be used as a substitute for professional medical advice, diagnosis, or treatment.
Citation
```bibtex
@article{agha2026training,
  title={Training for Reasoning, Not Retrieval: How Behavioural Fine-Tuning Enables a Sub-10B Parameter Model to Compete with Frontier Clinical AI},
  author={Agha, Adnan and Anwar, Eram},
  year={2026},
  institution={College of Medicine and Health Sciences, United Arab Emirates University}
}
```
Links
- Model weights: Clinical-Reasoning-Hub/Diagnostic-Reasoning-Q3X1
- Evaluation recording: asciinema.org/a/822289
- MedXpertQA benchmark: Zuo et al. (ICML 2025)
- Organisation: Clinical-Reasoning-Hub on HuggingFace
Licence
CC-BY-NC-ND-4.0
Developed by Dr Adnan Agha, Department of Internal Medicine, College of Medicine and Health Sciences, United Arab Emirates University, Al Ain, UAE.