Biomedical Text Retrieval

A curated map of papers, models, datasets, and benchmarks for dense retrieval in the biomedical domain.

15 papers · 10+ models · 12 benchmarks · 7 training datasets

📜 Evolution of BioRetrieval

From domain pretraining to LLM-based retrievers — key milestones that shaped the field.

2019

BioBERT

First BERT variant continually pretrained on PubMed abstracts + PMC full texts. Proved domain pretraining consistently improves biomedical NER, RE, and QA. The baseline everything is measured against.
2019

SciBERT

BERT pretrained from scratch on 1.14M scientific papers (82% biomedical, 18% CS). Its broader scientific vocabulary makes it competitive on retrieval, and it serves as the backbone for SLEDGE-Z (TREC-COVID SOTA).
2020

PubMedBERT

Showed that pretraining from scratch on PubMed beats continual pretraining from general-domain BERT. Introduced the BLURB benchmark. Became the de facto backbone for biomedical retrieval fine-tuning.
2021

SapBERT

Self-alignment pretraining using UMLS ontology + metric learning. SOTA on medical entity linking without task-specific supervision — the go-to for entity disambiguation and concept normalization.
2021

BEIR Benchmark

18 datasets across 9 task types — revealed that dense models trained on MS MARCO generalize poorly to biomedical domains; BM25 often wins out of distribution. The standard zero-shot evaluation framework.
2022

BioLinkBERT

Pretrains via hyperlinked documents in the same context window + Document Relation Prediction. Excels at multi-hop biomedical reasoning (BioASQ, USMLE QA).
2023

MedCPT / BioCPT

Trained on 255M PubMed user click logs via contrastive learning — zero-shot SOTA on 5 biomedical IR tasks. Released as query/article encoder pair + cross-encoder reranker. Click logs as free supervision.
2023

BioLORD-2023

Grounds biomedical concepts in UMLS definitions via multi-phase contrastive learning + LLM self-distillation + weight averaging. SOTA on MedSTS, MedNLI-S, EHR-Rel-B. Multilingual variants available.
2024

BMRetriever

LLM-based retriever: unsupervised contrastive pretraining on PubMed/textbooks/StatPearls, then instruction fine-tuning on 11 datasets. 410M model outperforms baselines 11.7× larger. 2B matches 5B+ models.
2024

BiCA

Citation-aware hard negatives: 2-hop citation graphs from PubMed articles for semantic hard-negative mining. Fine-tunes GTE-small/base with only 20K examples — consistent BEIR + LoTTE gains.
2025

MedTE + MedTEB

51-task medical embedding benchmark (classification, clustering, retrieval). MedTE model (GTE-Base fine-tuned on 7 medical corpora) achieves mean 0.578 vs 0.539 next-best. The new comprehensive eval standard.
2025

BioHiCL

Hierarchical MeSH supervision: depth-weighted contrastive loss + LoRA on BGE models. 0.1B model achieves IR Avg 0.543, beating BMRetriever-1B. Best on NFCorpus and SCIDOCS. Current efficiency SOTA.

🧠 Models

Production-ready retrieval models available on the Hugging Face Hub.

MedCPT (Query + Article)

Model
Asymmetric bi-encoder from NCBI, trained on 255M PubMed click logs. Separate query/article encoders + cross-encoder reranker. Zero-shot SOTA on biomedical IR.

BioLORD-2023

Model
UMLS-grounded sentence embeddings via multi-phase contrastive + LLM distillation + weight averaging. SOTA on clinical STS, entity linking, and concept similarity. Multilingual variants available.

BMRetriever (410M–7B)

Model
LLM-based retriever family. Instruction-formatted queries + last-token pooling. 410M outperforms 5B+ baselines. Models at 410M (GPT-NeoX), 2B (Gemma), 7B (Mistral).

BioHiCL

Model
MeSH hierarchy-supervised BGE model. Depth-weighted contrastive loss + LoRA. 0.1B params achieves IR Avg 0.543, beating BMRetriever-1B. Best efficiency/performance ratio.

SapBERT

Model
Self-alignment on UMLS synonyms via metric learning. Go-to model for medical entity linking and concept normalization. No task-specific labels needed.

MedTE

Model
GTE-Base fine-tuned on 7 diverse medical corpora (including PubMed, MIMIC-IV, ClinicalTrials, and bioRxiv/medRxiv). Mean 0.578 on MedTEB — the best medical general-purpose embedding model.

BioLinkBERT

Model
Document-link pretraining on PubMed hyperlinks. Excels at multi-hop biomedical reasoning — top performer on BioASQ QA and USMLE-style questions.

PubMedBERT

Model
The backbone model for biomedical fine-tuning. Pretrained from scratch on PubMed — not continued from general BERT. Foundation for MedCPT, SapBERT, and many others.

📊 Benchmarks

Standard evaluation suites for biomedical retrieval — use these to measure your models.

| Benchmark | Task | Domain | Scale | Metric | Link |
|---|---|---|---|---|---|
| NFCorpus | Ad-hoc search | Nutrition / Medicine | 323 queries · 3.6K docs | nDCG@10 | 🤗 BeIR/nfcorpus |
| TREC-COVID | Ad-hoc retrieval | COVID-19 / CORD-19 | 50 queries · 171K docs | nDCG@10 | 🤗 BeIR/trec-covid |
| SciFact | Claim verification | Scientific claims | ~300 queries · 5K abstracts | nDCG@10 | 🤗 BeIR/scifact |
| BioASQ | QA retrieval | Biomedical QA | Varies annually | MAP, nDCG | bioasq.org |
| SCIDOCS | Document similarity | Scientific papers | 1K queries · 25K docs | nDCG@10 | 🤗 BeIR/scidocs |
| BIOSSES | Sentence similarity | Biomedical | 100 sentence pairs | Pearson r | 🤗 tabilab/biosses |
| PubMedQA | QA retrieval | PubMed abstracts | 1K labeled questions | Accuracy | 🤗 PubMedQA |
| MedTEB | 51 medical tasks | Pan-medical | Comprehensive | Multi-metric | GitHub |
| R2MED | Reasoning retrieval | Clinical decision | Multi-type | nDCG@10 | arXiv |
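
Most of the suites above report nDCG@10. For reference, the metric fits in a few lines — a minimal sketch; real evaluations should use an established toolkit (e.g. the BEIR evaluation code) rather than this:

```python
import math

def ndcg_at_k(ranked_ids, relevance, k=10):
    """nDCG@k for one query.

    ranked_ids: doc ids in the order the system returned them.
    relevance:  dict mapping doc id -> graded relevance (0 if absent).
    """
    gains = [relevance.get(doc, 0) for doc in ranked_ids[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    # Ideal DCG: the same gains sorted best-first over all judged docs.
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

A perfect ranking scores 1.0; swapping a highly relevant document below a marginal one discounts the gain logarithmically by rank.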

🗂️ Training Datasets

Key corpora and labeled data for training biomedical retrieval models.

MedRAG/pubmed

Dataset
PubMed abstracts corpus — the core pretraining data for biomedical models. Used by BMRetriever, MedTE, and most domain-adapted models.

MedRAG/textbooks

Dataset
Medical textbook passages — high-quality, structured biomedical knowledge. Core fine-tuning data for BMRetriever and RAG applications.

MedRAG/statpearls

Dataset
StatPearls clinical reference articles — continuously updated clinical content used for retriever fine-tuning and medical Q&A.

BMRetriever Training Mix

Training
11-task instruction mixture for biomedical retrieval fine-tuning — query-document pairs spanning medical QA, entity linking, and scientific claim verification.

BioLORD Dataset

Dataset
UMLS concept definition pairs for contrastive learning. Powers BioLORD-2023's clinical concept embeddings and medical entity similarity.

CORD-19

Dataset
COVID-19 Open Research Dataset — 400K+ research papers. The corpus behind TREC-COVID, used for pandemic-era retrieval research and benchmarking.

🏆 Training Recipes Leaderboard

Ranked by result quality — the best published approaches for training a biomedical retriever.

| # | Model | Params | Training Recipe | Best Result | Paper |
|---|---|---|---|---|---|
| 1 | BioHiCL-Base | 0.1B | BGE + MeSH hierarchy contrastive (depth-weighted) + LoRA | IR Avg 0.543, NFCorpus 0.379 | 2604.15591 |
| 2 | BMRetriever-2B | 2B | LLM + unsupervised contrastive on PubMed/textbooks + instruction FT | Matches 5B+ across 11 tasks | 2404.18443 |
| 3 | MedTE | ~0.1B | GTE-Base + self-supervised contrastive on 7 medical corpora | MedTEB mean 0.578 | 2507.19407 |
| 4 | BiCA-Base | ~0.1B | GTE-Base + 2-hop citation hard negatives, 20K examples | Consistent BEIR + LoTTE ↑ | 2511.08029 |
| 5 | MedCPT | ~0.1B | PubMedBERT + 255M click-log contrastive (retriever + reranker) | Zero-shot SOTA on 5 bio IR tasks | 2307.00589 |
| 6 | BioLORD-2023 | ~0.1B | PubMedBERT + UMLS definitions contrastive + LLM distillation + WA | SOTA MedSTS, EHR-Rel-B | 2311.16075 |
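
Nearly every recipe above optimizes some variant of the in-batch InfoNCE contrastive objective: query i should score its paired document highest, with the other documents in the batch acting as negatives. A minimal numpy sketch (the temperature value is illustrative, not taken from any of these papers):

```python
import numpy as np

def info_nce(q, d, tau=0.05):
    """In-batch InfoNCE loss.

    q, d: (batch, dim) L2-normalized query / document embeddings,
    where q[i] and d[i] form a positive pair.
    """
    logits = (q @ d.T) / tau                        # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))             # NLL of the matched pairs
```

Hard-negative mining (click logs, citation graphs, MeSH hierarchy) changes *which* documents land in the batch, not this loss.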

🚀 Get Started

Recommended learning path for biomedical text retrieval.

1

Understand the evaluation landscape

Read the BEIR paper to understand why domain generalization is hard. Run BM25 as your baseline on NFCorpus — it's surprisingly competitive and sets a meaningful floor.
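
The BM25 floor needs almost no machinery — the Okapi scoring function fits in one short function. A didactic sketch with default k1/b; for real baselines, use a tuned implementation such as Pyserini or Elasticsearch:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Okapi BM25: one score per tokenized document in `docs`."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # document frequency of each query term
    df = {t: sum(1 for d in docs if t in d) for t in set(query_terms)}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        s = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            s += idf * tf[t] * (k1 + 1) / norm
        scores.append(s)
    return scores
```

Lexical overlap like this is exactly what MS MARCO-trained dense models fail to beat out of distribution — hence the advice to set it as the floor.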

2

Try a zero-shot retriever

Use MedCPT — the cleanest example of domain-specific contrastive pretraining. Separate query + article encoders make it intuitive. Evaluate on BEIR biomedical subsets.
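
The asymmetric bi-encoder workflow reduces to: embed the query with the query encoder, embed articles with the article encoder, rank by inner product. A sketch with random stand-in vectors — the encoder calls are deliberately omitted, and in real use the embeddings would come from MedCPT's query/article encoder pair:

```python
import numpy as np

def retrieve(query_vec, article_vecs, article_ids, k=3):
    """Rank articles by inner product with the query embedding.

    Asymmetric setup: query_vec and article_vecs are produced by
    *different* encoders, as in MedCPT."""
    scores = article_vecs @ query_vec              # (n_articles,)
    order = np.argsort(-scores)[:k]
    return [(article_ids[i], float(scores[i])) for i in order]

# Toy stand-ins for encoder outputs.
rng = np.random.default_rng(0)
article_vecs = rng.normal(size=(5, 64))
query_vec = article_vecs[2] + 0.01 * rng.normal(size=64)   # near article 3
ids = ["PMID:1", "PMID:2", "PMID:3", "PMID:4", "PMID:5"]
top = retrieve(query_vec, article_vecs, ids)
```

For corpora at PubMed scale, the brute-force matrix product would be replaced by an ANN index (e.g. FAISS), but the scoring logic is the same.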

3

Scale up with LLM retrievers

Deploy BMRetriever-410M for production — it outperforms baselines up to 11.7× larger. Use instruction-formatted queries with last-token pooling. The eval code is clean and well-documented.
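
Last-token pooling — using the hidden state at each sequence's final non-padding position as its embedding — can be sketched as follows (assumes right-padded batches; the array shapes are illustrative):

```python
import numpy as np

def last_token_pool(hidden, attention_mask):
    """Embed each sequence as the hidden state of its last *real* token.

    hidden:         (batch, seq_len, dim) final-layer states
    attention_mask: (batch, seq_len), 1 = real token, 0 = padding
    With right padding, the last real token sits at index
    attention_mask.sum(axis=1) - 1.
    """
    last = attention_mask.sum(axis=1) - 1
    return hidden[np.arange(hidden.shape[0]), last]
```

Decoder-only LLM retrievers favor this over mean or CLS pooling because, with causal attention, only the final position has attended to the whole sequence.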

4

Comprehensive evaluation

Benchmark on MedTEB — 51 medical embedding tasks, much broader than BEIR biomedical subsets alone. This is the new comprehensive standard (2025).

5

Fine-tune your own retriever

Use BiCA's citation-graph hard negatives for cheap, effective training data. Or BioHiCL's MeSH hierarchy supervision — 0.1B params matching 1B+ models.
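
BiCA's core idea — mine hard negatives from the citation graph — can be sketched as a 2-hop traversal. This is a simplified reading of the approach; the function and its filtering rules are illustrative, not the paper's exact pipeline:

```python
def two_hop_negatives(paper, citations, positives):
    """Candidate hard negatives for `paper` from its citation graph.

    Intuition: papers exactly two citation hops away are topically
    close (hard) but usually not relevant (negative).
    citations: dict mapping paper id -> list of cited paper ids.
    positives: ids known to be relevant, which must be excluded.
    """
    one_hop = set(citations.get(paper, []))
    two_hop = set()
    for cited in one_hop:
        two_hop.update(citations.get(cited, []))
    # Drop the anchor itself, direct citations, and known positives.
    return two_hop - one_hop - set(positives) - {paper}
```

The mined candidates then slot into a standard contrastive fine-tuning loop as the negatives — which is why ~20K examples can already move BEIR scores.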