Biomedical Text Retrieval

A curated map of papers, models, datasets, and benchmarks for dense retrieval in the biomedical domain.

15 papers · 10+ models · 12 benchmarks · 7 training datasets

📜 Evolution of BioRetrieval

From domain pretraining to LLM-based retrievers — key milestones that shaped the field.

2019

BioBERT

First BERT variant continually pretrained on PubMed abstracts + PMC full texts. Proved domain pretraining consistently improves biomedical NER, RE, and QA. The baseline everything is measured against.
2019

SciBERT

BERT pretrained from scratch on 1.14M scientific papers (82% biomedical, 18% CS). Its broader scientific vocabulary makes it competitive on retrieval, and it serves as the backbone for SLEDGE-Z (TREC-COVID SOTA).
2020

PubMedBERT

Showed that pretraining from scratch on PubMed beats continual pretraining from general-domain BERT. Introduced the BLURB benchmark. Became the de facto backbone for biomedical retrieval fine-tuning.
2021

SapBERT

Self-alignment pretraining using UMLS ontology + metric learning. SOTA on medical entity linking without task-specific supervision — the go-to for entity disambiguation and concept normalization.
2021

BEIR Benchmark

18 datasets across 9 task types — revealed that dense models trained on MS MARCO generalize poorly to biomedical domains; BM25 often wins out of distribution. The standard zero-shot evaluation framework.
2022

BioLinkBERT

Pretrains via hyperlinked documents in the same context window + Document Relation Prediction. Excels at multi-hop biomedical reasoning (BioASQ, USMLE QA).
2023

MedCPT / BioCPT

Trained on 255M PubMed user click logs via contrastive learning — zero-shot SOTA on 5 biomedical IR tasks. Released as query/article encoder pair + cross-encoder reranker. Click logs as free supervision.
2023

BioLORD-2023

Grounds biomedical concepts in UMLS definitions via multi-phase contrastive learning + LLM self-distillation + weight averaging. SOTA on MedSTS, MedNLI-S, EHR-Rel-B. Multilingual variants available.
2024

BMRetriever

LLM-based retriever: unsupervised contrastive pretraining on PubMed/textbooks/StatPearls, then instruction fine-tuning on 11 datasets. 410M model outperforms baselines 11.7× larger. 2B matches 5B+ models.
2024

BiCA

Citation-aware hard negatives: 2-hop citation graphs from PubMed articles for semantic hard-negative mining. Fine-tunes GTE-small/base with only 20K examples — consistent BEIR + LoTTE gains.
2025

MedTE + MedTEB

51-task medical embedding benchmark (classification, clustering, retrieval). MedTE model (GTE-Base fine-tuned on 7 medical corpora) achieves mean 0.578 vs 0.539 next-best. The new comprehensive eval standard.
2025

BioHiCL

Hierarchical MeSH supervision: depth-weighted contrastive loss + LoRA on BGE models. 0.1B model achieves IR Avg 0.543, beating BMRetriever-1B. Best on NFCorpus and SCIDOCS. Current efficiency SOTA.

🧠 Models

Production-ready retrieval models available on the Hugging Face Hub.

MedCPT (Query + Article)

Model
Asymmetric bi-encoder from NCBI, trained on 255M PubMed click logs. Separate query/article encoders + cross-encoder reranker. Zero-shot SOTA on biomedical IR.

BioLORD-2023

Model
UMLS-grounded sentence embeddings via multi-phase contrastive + LLM distillation + weight averaging. SOTA on clinical STS, entity linking, and concept similarity. Multilingual variants available.

BMRetriever (410M–7B)

Model
LLM-based retriever family. Instruction-formatted queries + last-token pooling. 410M outperforms 5B+ baselines. Models at 410M (GPT-NeoX), 2B (Gemma), 7B (Mistral).

BioHiCL

Model
MeSH hierarchy-supervised BGE model. Depth-weighted contrastive loss + LoRA. 0.1B params achieves IR Avg 0.543, beating BMRetriever-1B. Best efficiency/performance ratio.

SapBERT

Model
Self-alignment on UMLS synonyms via metric learning. Go-to model for medical entity linking and concept normalization. No task-specific labels needed.

MedTE

Model
GTE-Base fine-tuned on 7 diverse medical corpora (including PubMed, MIMIC-IV, ClinicalTrials, and bioRxiv/medRxiv). Mean 0.578 on MedTEB — the best medical general-purpose embedding model.

BioLinkBERT

Model
Document-link pretraining on PubMed hyperlinks. Excels at multi-hop biomedical reasoning — top performer on BioASQ QA and USMLE-style questions.

PubMedBERT

Model
The backbone model for biomedical fine-tuning. Pretrained from scratch on PubMed — not continued from general BERT. Foundation for MedCPT, SapBERT, and many others.

📊 Benchmarks

Standard evaluation suites for biomedical retrieval — use these to measure your models.

| Benchmark | Task | Domain | Scale | Metric | Link |
|---|---|---|---|---|---|
| NFCorpus | Ad-hoc search | Nutrition / Medicine | 323 queries · 3.6K docs | nDCG@10 | 🤗 BeIR/nfcorpus |
| TREC-COVID | Ad-hoc retrieval | COVID-19 / CORD-19 | 50 queries · 171K docs | nDCG@10 | 🤗 BeIR/trec-covid |
| SciFact | Claim verification | Scientific claims | ~300 queries · 5K abstracts | nDCG@10 | 🤗 BeIR/scifact |
| BioASQ | QA retrieval | Biomedical QA | Varies annually | MAP, nDCG | bioasq.org |
| SCIDOCS | Document similarity | Scientific papers | 1K queries · 25K docs | nDCG@10 | 🤗 BeIR/scidocs |
| BIOSSES | Sentence similarity | Biomedical | 100 sentence pairs | Pearson r | 🤗 tabilab/biosses |
| PubMedQA | QA retrieval | PubMed abstracts | 1K labeled questions | Accuracy | 🤗 PubMedQA |
| MedTEB | 51 medical tasks | Pan-medical | Comprehensive | Multi-metric | GitHub |
| R2MED | Reasoning retrieval | Clinical decision | Multi-type | nDCG@10 | arXiv |
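
Most of the suites above report nDCG@10. For reference, the metric fits in a few lines — a minimal sketch; real evaluations should use an established toolkit (e.g. the BEIR evaluation code) rather than this:

```python
import math

def ndcg_at_k(ranked_ids, relevance, k=10):
    """nDCG@k for one query.

    ranked_ids: doc ids in the order the system returned them.
    relevance:  dict mapping doc id -> graded relevance (0 if absent).
    """
    gains = [relevance.get(doc, 0) for doc in ranked_ids[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    # Ideal DCG: the same gains sorted best-first over all judged docs.
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

A perfect ranking scores 1.0; swapping a highly relevant document below a marginal one discounts the gain logarithmically by rank.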

🗂️ Training Datasets

Key corpora and labeled data for training biomedical retrieval models.

MedRAG/pubmed

Dataset
PubMed abstracts corpus — the core pretraining data for biomedical models. Used by BMRetriever, MedTE, and most domain-adapted models.

MedRAG/textbooks

Dataset
Medical textbook passages — high-quality, structured biomedical knowledge. Core fine-tuning data for BMRetriever and RAG applications.

MedRAG/statpearls

Dataset
StatPearls clinical reference articles — continuously updated clinical content used for retriever fine-tuning and medical Q&A.

BMRetriever Training Mix

Training
11-task instruction mixture for biomedical retrieval fine-tuning — query-document pairs spanning medical QA, entity linking, and scientific claim verification.

BioLORD Dataset

Dataset
UMLS concept definition pairs for contrastive learning. Powers BioLORD-2023's clinical concept embeddings and medical entity similarity.

CORD-19

Dataset
COVID-19 Open Research Dataset — 400K+ research papers. The corpus behind TREC-COVID, used for pandemic-era retrieval research and benchmarking.

🏆 Training Recipes Leaderboard

Ranked by result quality — the best published approaches for training a biomedical retriever.

| # | Model | Params | Training Recipe | Best Result | Paper |
|---|---|---|---|---|---|
| 1 | BioHiCL-Base | 0.1B | BGE + MeSH hierarchy contrastive (depth-weighted) + LoRA | IR Avg 0.543, NFCorpus 0.379 | 2604.15591 |
| 2 | BMRetriever-2B | 2B | LLM + unsupervised contrastive on PubMed/textbooks + instruction FT | Matches 5B+ across 11 tasks | 2404.18443 |
| 3 | MedTE | ~0.1B | GTE-Base + self-supervised contrastive on 7 medical corpora | MedTEB mean 0.578 | 2507.19407 |
| 4 | BiCA-Base | ~0.1B | GTE-Base + 2-hop citation hard negatives, 20K examples | Consistent BEIR + LoTTE ↑ | 2511.08029 |
| 5 | MedCPT | ~0.1B | PubMedBERT + 255M click-log contrastive (retriever + reranker) | Zero-shot SOTA on 5 bio IR tasks | 2307.00589 |
| 6 | BioLORD-2023 | ~0.1B | PubMedBERT + UMLS definitions contrastive + LLM distillation + WA | SOTA MedSTS, EHR-Rel-B | 2311.16075 |
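
Nearly every recipe above optimizes some variant of the in-batch InfoNCE contrastive objective: query i should score its paired document highest, with the other documents in the batch acting as negatives. A minimal numpy sketch (the temperature value is illustrative, not taken from any of these papers):

```python
import numpy as np

def info_nce(q, d, tau=0.05):
    """In-batch InfoNCE loss.

    q, d: (batch, dim) L2-normalized query / document embeddings,
    where q[i] and d[i] form a positive pair.
    """
    logits = (q @ d.T) / tau                        # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))             # NLL of the matched pairs
```

Hard-negative mining (click logs, citation graphs, MeSH hierarchy) changes *which* documents land in the batch, not this loss.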

🚀 Get Started

Recommended learning path for biomedical text retrieval.

1

Understand the evaluation landscape

Read the BEIR paper to understand why domain generalization is hard. Run BM25 as your baseline on NFCorpus — it's surprisingly competitive and sets a meaningful floor.
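
The BM25 floor needs almost no machinery — the Okapi scoring function fits in one short function. A didactic sketch with default k1/b; for real baselines, use a tuned implementation such as Pyserini or Elasticsearch:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Okapi BM25: one score per tokenized document in `docs`."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # document frequency of each query term
    df = {t: sum(1 for d in docs if t in d) for t in set(query_terms)}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        s = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            s += idf * tf[t] * (k1 + 1) / norm
        scores.append(s)
    return scores
```

Lexical overlap like this is exactly what MS MARCO-trained dense models fail to beat out of distribution — hence the advice to set it as the floor.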

2

Try a zero-shot retriever

Use MedCPT — the cleanest example of domain-specific contrastive pretraining. Separate query + article encoders make it intuitive. Evaluate on BEIR biomedical subsets.
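
The asymmetric bi-encoder workflow reduces to: embed the query with the query encoder, embed articles with the article encoder, rank by inner product. A sketch with random stand-in vectors — the encoder calls are deliberately omitted, and in real use the embeddings would come from MedCPT's query/article encoder pair:

```python
import numpy as np

def retrieve(query_vec, article_vecs, article_ids, k=3):
    """Rank articles by inner product with the query embedding.

    Asymmetric setup: query_vec and article_vecs are produced by
    *different* encoders, as in MedCPT."""
    scores = article_vecs @ query_vec              # (n_articles,)
    order = np.argsort(-scores)[:k]
    return [(article_ids[i], float(scores[i])) for i in order]

# Toy stand-ins for encoder outputs.
rng = np.random.default_rng(0)
article_vecs = rng.normal(size=(5, 64))
query_vec = article_vecs[2] + 0.01 * rng.normal(size=64)   # near article 3
ids = ["PMID:1", "PMID:2", "PMID:3", "PMID:4", "PMID:5"]
top = retrieve(query_vec, article_vecs, ids)
```

For corpora at PubMed scale, the brute-force matrix product would be replaced by an ANN index (e.g. FAISS), but the scoring logic is the same.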

3

Scale up with LLM retrievers

Deploy BMRetriever-410M for production — it outperforms baselines up to 11.7× larger. Use instruction-formatted queries with last-token pooling. The eval code is clean and well-documented.
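
Last-token pooling — using the hidden state at each sequence's final non-padding position as its embedding — can be sketched as follows (assumes right-padded batches; the array shapes are illustrative):

```python
import numpy as np

def last_token_pool(hidden, attention_mask):
    """Embed each sequence as the hidden state of its last *real* token.

    hidden:         (batch, seq_len, dim) final-layer states
    attention_mask: (batch, seq_len), 1 = real token, 0 = padding
    With right padding, the last real token sits at index
    attention_mask.sum(axis=1) - 1.
    """
    last = attention_mask.sum(axis=1) - 1
    return hidden[np.arange(hidden.shape[0]), last]
```

Decoder-only LLM retrievers favor this over mean or CLS pooling because, with causal attention, only the final position has attended to the whole sequence.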

4

Comprehensive evaluation

Benchmark on MedTEB — 51 medical embedding tasks, much broader than BEIR biomedical subsets alone. This is the new comprehensive standard (2025).

5

Fine-tune your own retriever

Use BiCA's citation-graph hard negatives for cheap, effective training data. Or BioHiCL's MeSH hierarchy supervision — 0.1B params matching 1B+ models.
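
BiCA's core idea — mine hard negatives from the citation graph — can be sketched as a 2-hop traversal. This is a simplified reading of the approach; the function and its filtering rules are illustrative, not the paper's exact pipeline:

```python
def two_hop_negatives(paper, citations, positives):
    """Candidate hard negatives for `paper` from its citation graph.

    Intuition: papers exactly two citation hops away are topically
    close (hard) but usually not relevant (negative).
    citations: dict mapping paper id -> list of cited paper ids.
    positives: ids known to be relevant, which must be excluded.
    """
    one_hop = set(citations.get(paper, []))
    two_hop = set()
    for cited in one_hop:
        two_hop.update(citations.get(cited, []))
    # Drop the anchor itself, direct citations, and known positives.
    return two_hop - one_hop - set(positives) - {paper}
```

The mined candidates then slot into a standard contrastive fine-tuning loop as the negatives — which is why ~20K examples can already move BEIR scores.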