A curated map of papers, models, datasets, and benchmarks for dense retrieval in the biomedical domain.
From domain pretraining to LLM-based retrievers — key milestones that shaped the field.
Production-ready retrieval models available on the Hugging Face Hub.
Standard evaluation suites for biomedical retrieval — use these to measure your models.
| Benchmark | Task | Domain | Scale | Metric | Link |
|---|---|---|---|---|---|
| NFCorpus | Ad-hoc search | Nutrition / Medicine | 323 queries · 3.6K docs | nDCG@10 | 🤗 BeIR/nfcorpus |
| TREC-COVID | Ad-hoc retrieval | COVID-19 / CORD-19 | 50 queries · 171K docs | nDCG@10 | 🤗 BeIR/trec-covid |
| SciFact | Claim verification | Scientific claims | ~300 queries · 5K abstracts | nDCG@10 | 🤗 BeIR/scifact |
| BioASQ | QA retrieval | Biomedical QA | Varies annually | MAP, nDCG | bioasq.org |
| SCIDOCS | Document similarity | Scientific papers | 1K queries · 25K docs | nDCG@10 | 🤗 BeIR/scidocs |
| BIOSSES | Sentence similarity | Biomedical | 100 sentence pairs | Pearson r | 🤗 tabilab/biosses |
| PubMedQA | QA retrieval | PubMed abstracts | 1K labeled | Accuracy | 🤗 PubMedQA |
| MedTEB | Multi-task embedding suite | Pan-medical | 51 tasks | Multi-metric | GitHub |
| R2MED | Reasoning retrieval | Clinical decision | Multi-type | nDCG@10 | arXiv |
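
Nearly every suite above reports nDCG@10, so it helps to know exactly what that number measures. The sketch below computes it for a single query; `ndcg_at_10` and the toy data are illustrative. Note that implementations differ on the gain function (trec_eval-style tooling uses linear gains), so rely on the official tooling for paper-comparable numbers.

```python
import math

def ndcg_at_10(ranked_doc_ids, qrels):
    """nDCG@10 for one query, exponential-gain variant.

    ranked_doc_ids: doc ids ordered by descending retrieval score.
    qrels: dict mapping doc id -> graded relevance (0 if absent).
    """
    # DCG over the top 10 retrieved documents.
    dcg = sum(
        (2 ** qrels.get(doc_id, 0) - 1) / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_doc_ids[:10])
    )
    # Ideal DCG: the best possible ordering of the judged documents.
    ideal_gains = sorted(qrels.values(), reverse=True)[:10]
    idcg = sum((2 ** g - 1) / math.log2(r + 2) for r, g in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example: the highly relevant doc "d2" is only ranked second.
print(ndcg_at_10(["d1", "d2", "d3"], {"d2": 2, "d3": 1}))
```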
Key corpora and labeled data for training biomedical retrieval models.
Ranked by result quality — the best published approaches for training a biomedical retriever.
| # | Model | Params | Training Recipe | Best Result | Paper |
|---|---|---|---|---|---|
| 1 | BioHiCL-Base | 0.1B | BGE + MeSH hierarchy contrastive (depth-weighted) + LoRA | IR Avg 0.543, NFCorpus 0.379 | 2604.15591 |
| 2 | BMRetriever-2B | 2B | LLM + unsupervised contrastive on PubMed/textbooks + instruction FT | Matches 5B+ across 11 tasks | 2404.18443 |
| 3 | MedTE | ~0.1B | GTE-Base + self-supervised contrastive on 7 medical corpora | MedTEB mean 0.578 | 2507.19407 |
| 4 | BiCA-Base | ~0.1B | GTE-Base + 2-hop citation hard negatives, 20K examples | Consistent BEIR + LoTTE ↑ | 2511.08029 |
| 5 | MedCPT | ~0.1B | PubMedBERT + 255M click-log contrastive (retriever + reranker) | Zero-shot SOTA on 5 bio IR tasks | 2307.00589 |
| 6 | BioLORD-2023 | ~0.1B | PubMedBERT + UMLS definitions contrastive + LLM distillation + weight averaging | SOTA MedSTS, EHR-Rel-B | 2311.16075 |
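
The common denominator across these recipes is a contrastive objective: pull each query toward its positive passage and push it away from negatives, whether in-batch, hard-mined, or click-derived. Below is a minimal PyTorch sketch of that InfoNCE loss with in-batch negatives; `info_nce_loss` and the temperature value are illustrative choices, not any one paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q_emb, d_emb, temperature=0.05):
    """In-batch contrastive (InfoNCE) loss over paired embeddings.

    q_emb, d_emb: (batch, dim) tensors; row i of d_emb is the positive
    passage for query i, and every other row is an in-batch negative.
    """
    q = F.normalize(q_emb, dim=-1)
    d = F.normalize(d_emb, dim=-1)
    # Similarity matrix: entry (i, j) scores query i against passage j.
    logits = q @ d.T / temperature
    # The diagonal holds the positive pairs.
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)

# Toy usage with random embeddings.
loss = info_nce_loss(torch.randn(8, 768), torch.randn(8, 768))
print(loss.item())
```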
Recommended learning path for biomedical text retrieval.
Understand the evaluation landscape first: read the BEIR paper to see why domain generalization is hard, then run BM25 as your baseline on NFCorpus. It's surprisingly competitive and sets a meaningful floor.
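
A minimal BM25 baseline sketch using `rank_bm25` and the Hub copy of NFCorpus. The config and split names follow the BeIR dataset-card layout but should be verified; for numbers comparable to published results, use BEIR's official Elasticsearch-backed BM25 instead.

```python
# pip install rank_bm25 datasets
from datasets import load_dataset
from rank_bm25 import BM25Okapi

# Config/split names follow the BeIR Hub layout; check the dataset card.
corpus = load_dataset("BeIR/nfcorpus", "corpus", split="corpus")
queries = load_dataset("BeIR/nfcorpus", "queries", split="queries")

docs = [(row["_id"], f'{row["title"]} {row["text"]}') for row in corpus]
bm25 = BM25Okapi([text.lower().split() for _, text in docs])

# Score one query against the whole corpus and print the top 5 doc ids.
q = queries[0]["text"].lower().split()
scores = bm25.get_scores(q)
top5 = sorted(zip((d for d, _ in docs), scores), key=lambda x: -x[1])[:5]
print(queries[0]["_id"], top5)
```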
Use MedCPT — the cleanest example of domain-specific contrastive pretraining. Separate query + article encoders make it intuitive. Evaluate on BEIR biomedical subsets.
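
A sketch of MedCPT's dual-encoder usage, following the pattern on the model cards (`ncbi/MedCPT-Query-Encoder` and `ncbi/MedCPT-Article-Encoder`, CLS pooling, articles encoded as [title, abstract] pairs); the example query and article text are made up.

```python
import torch
from transformers import AutoTokenizer, AutoModel

q_tok = AutoTokenizer.from_pretrained("ncbi/MedCPT-Query-Encoder")
q_enc = AutoModel.from_pretrained("ncbi/MedCPT-Query-Encoder")
a_tok = AutoTokenizer.from_pretrained("ncbi/MedCPT-Article-Encoder")
a_enc = AutoModel.from_pretrained("ncbi/MedCPT-Article-Encoder")

with torch.no_grad():
    q = q_tok(["does vitamin d prevent fractures"], truncation=True,
              padding=True, max_length=64, return_tensors="pt")
    q_emb = q_enc(**q).last_hidden_state[:, 0, :]   # CLS embedding

    # Articles go in as [title, abstract] pairs.
    a = a_tok([["Vitamin D and fracture risk", "We review trials of ..."]],
              truncation=True, padding=True, max_length=512,
              return_tensors="pt")
    a_emb = a_enc(**a).last_hidden_state[:, 0, :]

print(q_emb @ a_emb.T)  # dot-product relevance score
```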
Deploy BMRetriever-410M for production — it outperforms models 11× larger. Use instruction-formatted queries with last-token pooling. The eval code is clean and well-documented.
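
A sketch of last-token pooling with an instruction-formatted query. The model id follows the authors' Hub naming and the instruction string is illustrative; check the BMRetriever model card for the exact prompt template and pooling code.

```python
import torch
from transformers import AutoTokenizer, AutoModel

name = "BMRetriever/BMRetriever-410M"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)
if tok.pad_token is None:            # decoder-style tokenizers often lack one
    tok.pad_token = tok.eos_token

def last_token_pool(hidden, attention_mask):
    """Embedding = hidden state of the last non-padding token (right padding)."""
    last = attention_mask.sum(dim=1) - 1
    return hidden[torch.arange(hidden.size(0)), last]

query = "Represent this query for retrieving relevant documents: statins and muscle pain"
batch = tok([query], padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    out = model(**batch)
emb = last_token_pool(out.last_hidden_state, batch["attention_mask"])
print(emb.shape)
```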
Benchmark on MedTEB — 51 medical embedding tasks, much broader than BEIR biomedical subsets alone. This is the new comprehensive standard (2025).
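
A hedged sketch of an MTEB-style run, assuming MedTEB ships tasks compatible with the `mteb` package; the task names below are standard MTEB biomedical tasks used as stand-ins, not the MedTEB task list, so consult the MedTEB repo for the real runner and task names.

```python
# pip install mteb sentence-transformers
import mteb
from sentence_transformers import SentenceTransformer

# Any encoder exposing .encode() works; MiniLM is just a placeholder.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
# Stand-in task names; swap in the actual MedTEB task list.
tasks = mteb.get_tasks(tasks=["NFCorpus", "SciFact", "TRECCOVID"])
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, output_folder="results/minilm")
```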