SAHELI: Smart Adaptive Health Engine for Local Intelligence
A LoRA fine-tune of Google Gemma 4 E4B-it with 7 technical novelties for clinical decision support in low-resource settings.
| # | Novelty | Description | Based On |
|---|---|---|---|
| 1 | Transparent Reasoning | enable_thinking=True shows step-by-step clinical reasoning chains | ArgMed-Agents (arxiv:2403.06294) |
| 2 | FHIR Function Calling | Native tool use generates HL7 FHIR R4 records with ICD-10 codes | MedAgentBench (arxiv:2501.14654) |
| 3 | Semantic RAG | GTE-small + FAISS vector search over 11 WHO guideline areas | MIRAGE (arxiv:2402.13178) |
| 4 | Multi-Agent Triage | 3-tier severity routing adapts response depth to case complexity | MDAgents (NeurIPS 2024, arxiv:2404.15155) |
| 5 | Multimodal | Single model handles text + images + audio (no separate pipelines) | Gemma 4 native |
| 6 | Medical Benchmarks | Evaluated on MedQA, MedMCQA, PubMedQA, AfriMed-QA v2 | — |
| 7 | Edge Deployment | GGUF Q4_K_M for Ollama/llama.cpp on $150 Android phone | — |
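A minimal sketch of the semantic RAG step (novelty 3 above), assuming GTE-small is loaded via sentence-transformers and retrieval uses a flat FAISS index; the guideline snippets and chunking are placeholders, and the shipped pipeline may differ:

```python
# Sketch of the Semantic RAG step: GTE-small embeddings + FAISS search over
# WHO guideline snippets. Snippet texts here are placeholder examples.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("thenlper/gte-small")  # 384-dim embeddings

guideline_chunks = [
    "IMCI: fast breathing in a child aged 12-59 months is >= 40 breaths/min.",
    "Fever >= 38.5C with cough for 3+ days warrants pneumonia assessment.",
    # ... one entry per chunk across the 11 WHO guideline areas
]

# Build a cosine-similarity index (normalize embeddings, then inner product).
vecs = embedder.encode(guideline_chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(vecs.shape[1])
index.add(np.asarray(vecs, dtype="float32"))

def retrieve(query: str, k: int = 3) -> list[str]:
    q = embedder.encode([query], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q, dtype="float32"), k)
    return [guideline_chunks[i] for i in ids[0]]
```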
Training configuration:

| Property | Value |
|---|---|
| Base Model | google/gemma-4-E4B-it (8B params, 4.5B effective) |
| Method | QLoRA (4-bit NF4) via Unsloth + TRL SFT |
| LoRA Rank | 16, Alpha 16 |
| Target Modules | q/k/v/o_proj, gate/up/down_proj |
| Training Epochs | 3 |
| Learning Rate | 2e-4 (cosine) |
| Max Seq Length | 2048 |
| Optimizer | AdamW 8-bit |
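A minimal sketch of this configuration with Unsloth, assuming its `FastLanguageModel` API; `lora_dropout` and other arguments not listed in the table are assumptions (see `train_saheli.py` for the actual script):

```python
# Sketch of the QLoRA setup from the table above: 4-bit load + LoRA adapters.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="google/gemma-4-E4B-it",
    max_seq_length=2048,
    load_in_4bit=True,  # QLoRA: 4-bit NF4 base weights
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,               # LoRA rank
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.0,   # assumption; not specified in the table
)
```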
Fine-tuning datasets:

| Dataset | Size | Purpose |
|---|---|---|
| FreedomIntelligence/medical-o1-reasoning-SFT | ~31MB | Chain-of-thought clinical reasoning |
| lavita/medical-qa-datasets | 148MB | Broad medical QA dialogues |
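A sketch of the SFT loop with TRL on the reasoning dataset; the column names, text formatting, and batch size are assumptions, and some argument names (e.g. `dataset_text_field`) differ across TRL versions:

```python
# Sketch of the SFT stage (TRL) using the hyperparameters listed above.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

raw = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en", split="train")

def to_text(example):
    # Fold question + chain-of-thought + answer into one training string
    # (column names assumed from the dataset card).
    return {"text": f"Question: {example['Question']}\n"
                    f"Thinking: {example['Complex_CoT']}\n"
                    f"Answer: {example['Response']}"}

trainer = SFTTrainer(
    model=model,                        # LoRA-wrapped model from the sketch above
    processing_class=tokenizer,
    train_dataset=raw.map(to_text),
    args=SFTConfig(
        output_dir="saheli-lora",
        num_train_epochs=3,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        optim="adamw_8bit",
        per_device_train_batch_size=2,  # assumption; not in the table
        dataset_text_field="text",
    ),
)
trainer.train()
```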
Architecture:

```
Patient Input (Voice / Photo / Text)
        |
[Complexity Triage] → LOW / MODERATE / HIGH
        |
[Gemma 4 E4B + Thinking Mode] → Clinical reasoning
        |
[Semantic RAG: GTE-small + FAISS] → WHO guidelines
        |
[Function Calling: FHIR Tools] → Structured records
        |
Answer + Reasoning Chain + FHIR JSON + Triage Level
```
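A sketch of the triage gate at the top of this pipeline; the keyword lists and the depth each tier unlocks are illustrative assumptions, not the shipped routing logic:

```python
# Sketch of the 3-tier complexity triage (novelty 4). Keyword lists and
# thresholds are illustrative assumptions.
RED_FLAGS = {"unconscious", "convulsion", "chest pain", "not breathing"}
MODERATE_SIGNS = {"fever", "fast breathing", "vomiting", "dehydration"}

def triage(case_text: str) -> str:
    text = case_text.lower()
    if any(flag in text for flag in RED_FLAGS):
        return "HIGH"      # full reasoning + RAG + referral advice
    if any(sign in text for sign in MODERATE_SIGNS):
        return "MODERATE"  # reasoning + RAG over WHO guidelines
    return "LOW"           # direct answer, no retrieval

# The triage level then controls response depth: e.g. max_new_tokens,
# whether thinking mode and RAG are enabled, and escalation messaging.
```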
Repository files:

| File | Description |
|---|---|
| `train_saheli.py` | Complete fine-tuning script (Unsloth + TRL) |
| `app_v2.py` | Enhanced Gradio app with all 7 novelties |
| `eval_benchmarks.py` | Medical benchmark evaluation (MedQA / MedMCQA / PubMedQA / AfriMed-QA) |
| `Modelfile` | Ollama deployment config |
| `setup.sh` | One-command setup |
| `KAGGLE_WRITEUP.md` | Hackathon submission writeup |
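For context on `eval_benchmarks.py`, a minimal sketch of how a MedQA-style multiple-choice item can be scored; the actual harness may prompt and parse differently:

```python
# Sketch of multiple-choice scoring: ask for a single option letter and
# compare to gold. eval_benchmarks.py may use a different prompt/parser.
import re

def score_mcq(generate, question: str, options: dict[str, str], gold: str) -> bool:
    prompt = question + "\n" + "\n".join(f"{k}. {v}" for k, v in sorted(options.items()))
    prompt += "\nAnswer with the single letter of the best option."
    reply = generate(prompt)                   # any text-generation callable
    match = re.search(r"\b([A-E])\b", reply)   # first option letter in the reply
    return bool(match) and match.group(1) == gold
```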
Quick start with Transformers:

```python
from transformers import AutoProcessor, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-E4B-it", dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("google/gemma-4-E4B-it")

messages = [
    {"role": "system", "content": "You are SAHELI, a medical AI for community health workers."},
    {"role": "user", "content": "2-year-old, cough 3 days, breathing fast 52/min, temp 38.5C"},
]

# With thinking mode: the chat template emits a reasoning chain before the answer.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
inputs = processor(text=text, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs, max_new_tokens=1024,
    do_sample=True, temperature=1.0, top_p=0.95, top_k=64,
)
# Decode only the newly generated tokens; keep special tokens so the
# thinking and answer sections can be separated afterwards.
response = processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=False)
parsed = processor.parse_response(response)
print("Thinking:", parsed.get("thinking", ""))
print("Answer:", parsed.get("answer", ""))
```
Links:

| Resource | Link |
|---|---|
| Live Demo | HF Spaces |
| Model | HuggingFace |
| Base Model | Gemma 4 E4B-it |
Tracks: Main Track · Health & Sciences · Digital Equity · Safety & Trust. Built with Unsloth, Ollama, and llama.cpp.
License: Apache 2.0 (following the base model).