---
title: >-
BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol
Understanding and Reasoning
license: cc-by-4.0
task_categories:
- text-generation
- question-answering
language:
- en
tags:
- biology
- protocol
- benchmark
- ai4science
---
# BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning
[Paper (arXiv)](https://arxiv.org/pdf/2505.07889) | [Hugging Face](https://huggingface.co/BioProBench) | [GitHub](https://github.com/YuyangSunshine/bioprotocolbench) | [Project Page](https://yuyangsunshine.github.io/BioPro-Project/) | [License: CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
---
## 📢 Latest News
* ✨ **[2026-03-31]** **Data Split Update!** We have officially released the **Train/Test splits** for each task (PQA, ORD, ERR, GEN, REA), making it easier for the community to train and evaluate models consistently.
* 🔥 **[2026-03-18]** Our **BioProAgent** is now live on AI4S LAB! [Try it out and order wet-lab experiments here](https://yuyangsunshine.github.io/BioPro-Project/).
* 🎉 **[2026-03-03]** Our BioProAgent has been accepted by the **ICLR 2026 LLA Workshop!**
* 📝 **[2026-01-21]** The BioProBench paper has been updated with new experimental results: [arXiv](https://arxiv.org/pdf/2505.07889).
* 🚀 **[2025-12-01]** Code and dataset (v1.0) are released on GitHub.
---
## 🌟 Introduction
**BioProBench** is the first large-scale, integrated multi-task benchmark for biological protocol understanding and reasoning, specifically designed for Large Language Models (LLMs). It moves beyond simple QA to encompass a comprehensive suite of tasks critical for procedural text comprehension in life sciences.
### Key Features:
* 📚 **Large-scale Data:** Built upon **27K original biological protocols**, yielding nearly **556K high-quality structured instances**.
* 🎯 **Comprehensive Tasks:** 5 core tasks: **PQA** (Question Answering), **ORD** (Step Ordering), **ERR** (Error Correction), **GEN** (Generation), and **REA** (Reasoning).
* 🧬 **Broad Domain Coverage:** Covers **16 biological subdomains** from 6 major repositories.
* 🔬 **Standardized Evaluation:** A robust framework combining NLP metrics with novel domain-specific measures.
---
## 📊 Dataset Structure & Tasks
We provide standardized JSON files for each task, now including **Train** and **Test** splits:
| Task | Description | Files |
| :--- | :--- | :--- |
| **PQA** | Protocol Question Answering | `PQA_train.json`, `PQA_test.json` |
| **ORD** | Step Ordering | `ORD_train.json`, `ORD_test.json` |
| **ERR** | Error Correction | `ERR_train.json`, `ERR_test.json` |
| **GEN** | Protocol Generation | `GEN_train.json`, `GEN_test.json` |
| **REA** | Protocol Reasoning | `REA_train.json`, `REA_test.json` |
| **Raw** | Full Protocol Corpus | `Bio-protocol.json`, `Protocol-io.json`, etc. |
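Each split file can be read with the standard `json` module. As a minimal sketch (the field names in the mock instance below are illustrative assumptions, not the actual BioProBench schema; check a real file for the exact keys), loading a task split might look like:

```python
import json
import os
import tempfile

def load_split(path):
    """Load a BioProBench task split, assumed to be a JSON array of instances."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

# Demo with a mock split file; in practice, point load_split at the
# downloaded files, e.g. PQA_train.json or PQA_test.json.
mock = [{"question": "Which buffer is used in step 3?", "answer": "PBS"}]  # hypothetical fields
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "PQA_test.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(mock, f)
    data = load_split(path)
    print(len(data))  # number of instances in the split
```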
### 🔗 Useful Links
* **Official Website:** [BioPro-Project Page](https://yuyangsunshine.github.io/BioPro-Project/)
* **GitHub Repository:** [bioprotocolbench](https://github.com/YuyangSunshine/bioprotocolbench) (Code for evaluation & training)
---
## 🔬 Key Findings
We evaluated 12 mainstream LLMs. Our findings reveal:
* **Surface vs. Deep Understanding:** Models perform well on surface-level QA (~70% accuracy) but struggle with deep procedural logic.
* **Reasoning Bottleneck:** Performance drops significantly on **Step Ordering** and **Protocol Generation** (BLEU < 15%), highlighting the difficulty of managing temporal dependencies.
* **Bio-specific Models:** Interestingly, some bio-specific models lag behind general LLMs in capturing intricate procedural dependencies, suggesting a need for larger reasoning capacity.
---
## 🤝 Contributing & Contact
We welcome contributions such as new protocol sources, additional domains, or novel tasks!
- **Email:** sunshineliuyuyang@gmail.com
- **Issues:** Feel free to open an issue on our [GitHub](https://github.com/YuyangSunshine/bioprotocolbench).
## 📜 Citation
```bibtex
@misc{bioprotocolbench2025,
  title={BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning},
  author={Yuyang Liu and Liuzhenghao Lv and Xiancheng Zhang and Jingya Wang and Li Yuan and Yonghong Tian},
  year={2025},
  url={https://arxiv.org/pdf/2505.07889}
}
```