
🧠 Diagnostic-Reasoning-Q3

The highest-performing open-source sub-10B clinical reasoning model on MedXpertQA

🩺 8B parameters · ⚡ ~$300 training cost · 🛡️ 98.3% safety · 🏥 Runs on a single consumer GPU

Diagnostic-Reasoning-Q3X1 (Q3) is an 8-billion parameter clinical reasoning model built on the Qwen3-8B base using the Pentabrid training framework. It achieves competitive performance with frontier models 9–84× larger on the most challenging medical reasoning benchmark available.

📄 **Paper:** Training for Reasoning, Not Retrieval: How Behavioural Fine-Tuning Enables a Sub-10B Parameter Model to Compete with Frontier Clinical AI
👨‍⚕️ **Authors:** Adnan Agha, Eram Anwar — College of Medicine and Health Sciences, UAE University
📹 **Live Evaluation:** Watch the full 10,578-question evaluation session


Key Results

| Metric | Value |
|---|---|
| MedXpertQA Text (expert reasoning) | 23.8% (584/2450) |
| MedQA — USMLE | 72.7% (926/1273) |
| MMLU Medical Genetics | 91.0% (91/100) |
| MMLU Professional Medicine | 88.2% (240/272) |
| MMLU Clinical Knowledge | 87.9% (233/265) |
| MMLU Anatomy | 79.3% (107/135) |
| PubMedQA | 75.2% (752/1000) |
| MedMCQA | 60.5% (2531/4183) |
| MedSafetyBench (900 items) | 98.3% refusal rate |
| Parameter efficiency ratio | 2.98 accuracy % per B params |
| Training cost | ~$300 USD |
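The parameter-efficiency figure appears to be MedXpertQA Text accuracy divided by parameter count; the card does not define it, so this reading is an assumption:

```python
# Assumed definition: MedXpertQA Text accuracy (%) / parameters (B).
medxpertqa_accuracy = 23.8  # %
params_billions = 8
print(round(medxpertqa_accuracy / params_billions, 2))  # 2.98
```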

MedXpertQA Leaderboard Position

Q3 ranks alongside models 9–84× larger on the official MedXpertQA Text evaluation:

| Rank | Model | Parameters | Accuracy | Type |
|---|---|---|---|---|
| 1 | o1 | Proprietary | 44.7% | Inference-time scaled |
| 2 | DeepSeek-R1 | 671B | 37.8% | Inference-time scaled |
| 3 | o3-mini | Proprietary | 37.3% | Inference-time scaled |
| 4 | GPT-4o | ~200B† | 30.4% | Vanilla |
| 5 | LLaMA-3.3-70B | 70B | 24.5% | Vanilla |
| 6 | DeepSeek-V3 | 685B (37B active) | 24.2% | Vanilla |
| 7 | **Q3 (Ours)** | 8B | 23.8% | Training-optimised |
| 8 | Claude-3.5 Sonnet | ~175B† | 21.3% | Vanilla |
| 9 | Gemini-2.0 Flash | MoE† | 20.6% | Vanilla |
| 10 | Qwen2.5-72B | 72B | 18.9% | Vanilla |
| 11 | QwQ-32B-Preview | 32B | 18.0% | Inference-time scaled |

†Estimated parameter counts. Comparator scores from Zuo et al. (ICML 2025). Q3 evaluated using identical official generative methodology.

No other sub-10B model approaches this performance tier.


Evaluation-Format Mismatch Penalty (Dual-Scoring Analysis)

Q3 was evaluated under both generative chain-of-thought and zero-shot log-likelihood scoring to investigate format sensitivity:

| Benchmark | Q3 generative | Q3 log-likelihood | Qwen3-8B base (LL) | Gen−LL gap | Q3−base LL gap |
|---|---|---|---|---|---|
| Medical Genetics | 91.0% | 87.0% | 82.0% | +4.0pp | +5.0pp |
| Professional Medicine | 88.2% | 89.0% | 81.6% | −0.7pp | +7.4pp |
| Clinical Knowledge | 87.9% | 87.2% | 79.2% | +0.8pp | +7.9pp |
| Anatomy | 79.3% | 82.2% | 71.1% | −3.0pp | +11.1pp |
| MedQA — USMLE | 72.7% | 65.9% | 64.2% | +6.8pp | +1.7pp |
| MedMCQA | 60.5% | 58.5% | 59.8% | +2.0pp | −1.3pp |

On 4/6 benchmarks, generative accuracy exceeded log-likelihood accuracy. The largest gap (+6.8pp) occurred on MedQA, the most reasoning-intensive standard benchmark.
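The two scoring formats differ in mechanism: generative scoring parses an answer letter out of a free-form chain of thought, while log-likelihood scoring picks the option whose text the model assigns the highest (typically length-normalised) probability. A minimal sketch of the log-likelihood side, with made-up token log-probs standing in for real model outputs:

```python
# Zero-shot log-likelihood scoring: append each option to the question and
# score it by the mean per-token log-probability the model assigns to the
# option's tokens; the highest mean wins. The numbers below are invented
# stand-ins, not real model outputs.
def score_options(option_logprobs):
    means = {label: sum(lps) / len(lps) for label, lps in option_logprobs.items()}
    best = max(means, key=means.get)
    return best, means

best, means = score_options({
    "A": [-2.1, -0.9, -1.4],
    "B": [-0.6, -0.8, -0.5],
    "C": [-1.9, -2.2],
    "D": [-1.2, -1.7, -2.0, -1.1],
})
print(best)  # B
```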


Body-System Performance (MedXpertQA Text, n=2450)

| Body System | Correct | Total | Accuracy |
|---|---|---|---|
| Nervous | 110 | 386 | 28.5% |
| Integumentary | 13 | 48 | 27.1% |
| Lymphatic | 22 | 85 | 25.9% |
| Endocrine | 43 | 176 | 24.4% |
| Reproductive | 49 | 201 | 24.4% |
| Cardiovascular | 73 | 306 | 23.9% |
| Other / NA | 30 | 126 | 23.8% |
| Urinary | 29 | 123 | 23.6% |
| Skeletal | 82 | 355 | 23.1% |
| Respiratory | 44 | 193 | 22.8% |
| Digestive | 56 | 274 | 20.4% |
| Muscular | 33 | 177 | 18.6% |
| **Overall** | 584 | 2450 | 23.8% |

Range: 18.6–28.5% (9.9pp spread across 12 systems), indicating balanced reasoning acquisition rather than domain-specific memorisation.


Live Evaluation Recording

The complete evaluation session — all benchmarks including MedXpertQA (2450 questions), MedSafetyBench (900 items), and seven standard benchmarks — was recorded end-to-end in a single uninterrupted session.

▶️ Watch the full evaluation recording on asciinema


Total evaluation: 10,578 questions across 9 benchmarks. Single NVIDIA H100 80GB GPU. ~30 minutes total inference time. No-extract rate: 0.43% (46/10,578).
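The 0.43% no-extract rate implies a deterministic answer extractor over the generated text: questions where no final letter can be parsed count as no-extract. The actual extractor is not published; a hypothetical regex-based version might look like:

```python
import re

# Hypothetical extractor (not the official one): pull the final
# "Answer: X" letter from a generated chain of thought.
ANSWER_RE = re.compile(r"(?:final\s+answer|answer)\s*[:\-]?\s*\(?([A-J])\)?",
                       re.IGNORECASE)

def extract_answer(generation: str):
    matches = ANSWER_RE.findall(generation)
    return matches[-1] if matches else None  # last match = final selection

print(extract_answer("Phase 5, Final Selection. Answer: B"))      # B
print(extract_answer("The patient likely has hypothyroidism."))   # None
```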


The Pentabrid Framework

Q3 was trained using the Pentabrid five-phase self-correcting reasoning protocol, which embeds structured clinical reasoning directly into model weights:

  1. Read All Options First — Prevents anchoring bias by requiring systematic review before evaluation
  2. Read the Question — Systematic extraction of clinical features, demographics, and key findings
  3. Evaluate Each Option — Explicit RIGHT/WRONG determination with mechanistic reasoning for every choice
  4. Self-Correction Check — Structured cognitive debiasing audit targeting anchoring, premature closure, availability heuristic, and search satisficing
  5. Final Selection — Deterministic answer extraction following the complete reasoning chain

This protocol mirrors how expert clinicians transition from Type 1 (pattern recognition) to Type 2 (analytical, probabilistic) reasoning when confronted with ambiguous clinical presentations.
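The exact training-time prompt wording is not public (see the IP note under Training Details), but the five phases could be encoded as an explicit instruction template along these lines — an illustrative assumption, not the official protocol text:

```python
# Hypothetical prompt template encoding the five Pentabrid phases as
# explicit instructions; wording is assumed, not the official template.
PENTABRID_TEMPLATE = """You are an expert clinical reasoning assistant.
Work through the question in five phases:

1. READ ALL OPTIONS FIRST: restate every option before judging any of them.
2. READ THE QUESTION: extract demographics, findings, and key clinical features.
3. EVALUATE EACH OPTION: mark each RIGHT or WRONG with mechanistic reasoning.
4. SELF-CORRECTION CHECK: audit for anchoring, premature closure,
   availability heuristic, and search satisficing.
5. FINAL SELECTION: end with a single line 'Answer: <letter>'.

Question: {question}

{options}
"""

def build_prompt(question, options):
    opts = "\n".join(f"{label}) {text}" for label, text in options)
    return PENTABRID_TEMPLATE.format(question=question, options=opts)

print(build_prompt(
    "A 45-year-old woman has fatigue, weight gain, cold intolerance; TSH 12 mIU/L.",
    [("A", "Graves' disease"), ("B", "Primary hypothyroidism")],
))
```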


Training Details

| Parameter | Value |
|---|---|
| Base model | Qwen3-8B |
| Method | QLoRA (4-bit base, BF16 adapters) |
| LoRA rank / alpha | 128 / 256 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Max sequence length | 8192 tokens |
| Effective batch size | 32 (2 × 16 gradient accumulation) |
| Learning rate | 2×10⁻⁴, cosine schedule, 5% warmup |
| Stabiliser-epoch learning rate | 5×10⁻⁵ |
| Training data | ~75,000 effective examples |
| Training time | 12–17 hours |
| Hardware | Single NVIDIA H100 80GB |
| Cost | ~$300 USD |

Full methodology details are protected under institutional intellectual property (UAEU reference IDF-00388).
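The hyperparameters listed above map directly onto a standard Hugging Face PEFT adapter configuration. This is a sketch of that mapping only, not the protected training pipeline; the dropout value is an assumption the card does not state:

```python
# Sketch of the QLoRA adapter setup from the table above, expressed as a
# standard PEFT LoraConfig. Illustrative only; the actual pipeline is not
# public (UAEU IDF-00388).
from peft import LoraConfig

lora_config = LoraConfig(
    r=128,                 # LoRA rank
    lora_alpha=256,        # alpha = 2 × rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,     # assumption: dropout is not stated in the card
    task_type="CAUSAL_LM",
)
```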


Inference

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Clinical-Reasoning-Hub/Diagnostic-Reasoning-Q3X1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

prompt = """You are an expert clinical reasoning assistant. Answer the following medical question using the five-phase reasoning protocol.

Question: A 45-year-old woman presents with fatigue, weight gain, and cold intolerance. TSH is 12 mIU/L. What is the most likely diagnosis?

A) Graves' disease
B) Primary hypothyroidism
C) Secondary hypothyroidism
D) Euthyroid sick syndrome
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Greedy decoding for deterministic answers (temperature=0.0 with sampling
# enabled raises an error in recent transformers versions).
outputs = model.generate(**inputs, max_new_tokens=3000, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Hardware Requirements

| Precision | Minimum VRAM |
|---|---|
| BF16 (full) | ~18 GB |
| GPTQ 4-bit | ~6 GB |
| GGUF Q4_K_M | ~6 GB |

Runs on a single consumer GPU (RTX 3090/4090, A6000, or equivalent). No cloud API dependency required.
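The VRAM figures in the table are consistent with back-of-envelope arithmetic: weight memory is parameters × bits per parameter, plus a few gigabytes of headroom for KV cache and runtime overhead (the 2 GB allowance below is our assumption, not a figure from the card):

```python
def vram_estimate_gb(n_params_b: float, bits_per_param: float,
                     overhead_gb: float = 2.0) -> float:
    """Rough weight-memory estimate: params (B) × bits / 8, plus an assumed
    overhead allowance for KV cache, activations, and CUDA context."""
    weights_gb = n_params_b * bits_per_param / 8
    return weights_gb + overhead_gb

print(round(vram_estimate_gb(8, 16)))  # ~18 GB for BF16
print(round(vram_estimate_gb(8, 4)))   # ~6 GB for 4-bit quantisation
```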


Safety

Q3 achieved a 98.3% refusal rate (885/900) on MedSafetyBench, exceeding the unmodified Qwen3-8B base model (95.9%). Clinical reasoning optimisation improved safety rather than compromising it.

⚠️ This model is for research purposes only. It has not been validated for clinical use and should not be used as a substitute for professional medical advice, diagnosis, or treatment.


Citation

```bibtex
@techreport{agha2026training,
  title={Training for Reasoning, Not Retrieval: How Behavioural Fine-Tuning Enables a Sub-10B Parameter Model to Compete with Frontier Clinical AI},
  author={Agha, Adnan and Anwar, Eram},
  year={2026},
  institution={College of Medicine and Health Sciences, United Arab Emirates University}
}
```

Licence

CC-BY-NC-ND-4.0


Developed by Dr Adnan Agha, Department of Internal Medicine, College of Medicine and Health Sciences, United Arab Emirates University, Al Ain, UAE.
