# BioRLHF GRPO Model (biorlhf-grpo-mistral-7b)
A LoRA adapter trained with Group Relative Policy Optimization (GRPO) on top of the BioRLHF SFT checkpoint, using composable biological verifiers (V1-V4) for multi-dimensional reward scoring.
This model improves calibration and factual accuracy on spaceflight transcriptomic reasoning tasks compared to the SFT baseline.
Part of the BioRLHF framework for training LLMs with verifier-based reinforcement learning on biological reasoning tasks.
## Model Details

| Field | Value |
|---|---|
| Base model | mistralai/Mistral-7B-v0.3 (7.4B parameters) |
| SFT base | jang1563/biorlhf-sft-mistral-7b |
| Adapter type | LoRA (r=32, alpha=64, dropout=0.05) |
| Adapter size | 161 MB |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Training method | GRPO with V1-V4 composable verifiers |
| Training data | 363 examples (via SFT base) + GRPO self-play |
| Domain | Spaceflight biology, transcriptomics, drug-stressor interactions |
## Performance
Evaluated on a 107-sample held-out set (eye + thymus tissues):
| Metric | GRPO | SFT Baseline | Change |
|---|---|---|---|
| Mean Reward | 0.566 | 0.428 | +32.1% |
| ECE | 0.183 | 0.478 | -61.6% |
| Brier Score | 0.281 | — | — |
| Overconfidence Rate | 40.8% | — | — |
| Mean Accuracy | 61.7% | — | — |
Note: The SFT baseline metrics in this table come from the same evaluation run as the GRPO model. Absolute metrics (reward, ECE) are comparable across evaluation runs; percentage changes are meaningful only within a single run.
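ECE (Expected Calibration Error) measures the gap between the model's stated confidence and its observed accuracy. A minimal sketch of a binned ECE computation, for reference only; this is not the exact evaluation code behind the table, and the example values are illustrative:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean |accuracy - confidence| over equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Assign each prediction to one of n_bins equal-width bins on [0, 1].
    bins = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight bin by its share of samples
    return ece

# Perfectly calibrated: 80% confidence, 4/5 correct -> ECE = 0
print(expected_calibration_error([0.8] * 5, [1, 1, 1, 1, 0]))
# Overconfident: 85% confidence, 1/4 correct -> large gap
print(expected_calibration_error([0.85] * 4, [1, 0, 0, 0]))
```

A lower ECE means the model's confidence statements track its actual accuracy more closely, which is the behavior the V4 verifier rewards.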
## Usage
Important: This GRPO adapter was trained on top of the merged SFT model. You must load the base model, merge the SFT adapter, then apply this GRPO adapter (3-step chain).
### With PEFT

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Step 1: Load base model
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.3",
    torch_dtype="auto",
    device_map="auto",
)

# Step 2: Merge SFT adapter into base
sft_model = PeftModel.from_pretrained(base, "jang1563/biorlhf-sft-mistral-7b")
merged = sft_model.merge_and_unload()

# Step 3: Apply GRPO adapter
model = PeftModel.from_pretrained(merged, "jang1563/biorlhf-grpo-mistral-7b")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")

prompt = "### Instruction:\nHow does hindlimb unloading affect gene expression in the liver?\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### With BioRLHF

```python
from biorlhf.utils import load_model_for_inference, generate_response

model, tokenizer = load_model_for_inference(
    model_path="jang1563/biorlhf-grpo-mistral-7b",
    base_model="mistralai/Mistral-7B-v0.3",
    sft_adapter="jang1563/biorlhf-sft-mistral-7b",
)

response = generate_response(
    model, tokenizer,
    "### Instruction:\nHow does Kaempferol affect the liver under radiation?\n\n### Response:\n"
)
print(response)
```
## Training Details

### Method
GRPO (Group Relative Policy Optimization) generates G=16 completions per prompt, scores each with composable verifiers, and updates the policy using group-normalized advantages with a KL penalty against the reference model.
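The group-normalized advantage step can be sketched as follows. This is an illustrative implementation, not the exact training code; GRPO variants differ in whether the group standard deviation is used for scaling (this card's `scale_rewards=group` setting implies it is):

```python
import numpy as np

def group_advantages(rewards, eps=1e-4):
    """Center and scale verifier rewards within one prompt's group of G completions.

    Each completion's advantage is its reward relative to the group mean,
    scaled by the group standard deviation (eps avoids division by zero
    when all G completions score identically).
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four completions for one prompt: above-mean rewards get positive
# advantage, below-mean get negative, and advantages sum to ~0.
adv = group_advantages([0.2, 0.8, 0.5, 0.5])
print(adv)
```

The policy gradient then pushes up the log-probability of positively-advantaged completions and pushes down the rest, with the KL penalty (beta) keeping the policy close to the reference model.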
### Verifier System
| Verifier | Weight | Description |
|---|---|---|
| V1 Factual | 0.30 | Exact match for DEG counts, tissue names, directions |
| V2 Pathway | 0.15 | Pathway/gene set enrichment validation |
| V3 Consistency | 0.10 | Internal logical consistency |
| V4 Uncertainty | 0.45 | Calibration and epistemic humility (dominant) |
V4 operates in "legacy" mode with a moderate-confidence prior, providing the strongest calibration signal.
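The composite reward is the weighted sum of the four verifier scores using the weights in the table above. A minimal sketch, assuming each verifier returns a score in [0, 1]; the per-verifier score values below are made up for illustration:

```python
# Verifier weights from the table above (sum to 1.0); V4 dominates.
WEIGHTS = {
    "V1_factual": 0.30,
    "V2_pathway": 0.15,
    "V3_consistency": 0.10,
    "V4_uncertainty": 0.45,
}

def composite_reward(scores):
    """Weighted sum of per-verifier scores in [0, 1]."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

# Hypothetical scores for one completion: factually correct and
# internally consistent, but middling pathway and calibration scores.
r = composite_reward({
    "V1_factual": 1.0,
    "V2_pathway": 0.5,
    "V3_consistency": 1.0,
    "V4_uncertainty": 0.6,
})
print(r)
```

Because V4 carries the largest weight, a completion cannot score well overall without expressing well-calibrated confidence, which is what drives the ECE improvement over the SFT baseline.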
### Hyperparameters
| Parameter | Value |
|---|---|
| Generations per prompt (G) | 16 |
| KL penalty (beta) | 0.02 |
| Learning rate | 5e-7 |
| LR scheduler | Cosine |
| Epochs | 1 (2308 steps) |
| Gradient accumulation | 8 |
| LoRA rank (r) | 32 |
| LoRA alpha | 64 |
| LoRA dropout | 0.05 |
| scale_rewards | group |
| loss_type | grpo |
| num_iterations | 2 |
| Precision | bf16 |
| Hardware | NVIDIA A40 48GB |
| Training time | ~38h |
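The table above maps onto TRL's `GRPOConfig` roughly as follows. This is a sketch against TRL 0.26 field names, not the full original training configuration; the `output_dir` path is illustrative and remaining fields are left at their defaults:

```python
from trl import GRPOConfig

# Sketch only: values taken from the hyperparameter table above.
config = GRPOConfig(
    output_dir="outputs/grpo",   # illustrative path
    num_generations=16,          # G completions per prompt
    beta=0.02,                   # KL penalty vs. reference model
    learning_rate=5e-7,
    lr_scheduler_type="cosine",
    num_train_epochs=1,
    gradient_accumulation_steps=8,
    scale_rewards="group",
    loss_type="grpo",
    num_iterations=2,
    bf16=True,
)
```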
### Training Progression
This GRPO model is Phase 4 of the BioRLHF training pipeline:
| Phase | Verifiers | G | Reward | ECE | Key Finding |
|---|---|---|---|---|---|
| SFT | — | — | baseline | baseline | 90% accuracy on 20-Q test |
| MVE | V1+V4 | 4 | 0.650 | 0.078 | Proof-of-concept, best ECE |
| Full v2 | V1-V4 (equal) | 16 | 0.691 | 0.172 | Best absolute reward |
| Phase 4 (this model) | V1-V4 (V4=0.45) | 16 | 0.566 | 0.183 | Best calibration with full verifiers |
### Framework versions
- PEFT: 0.18.0
- TRL: 0.26.2
- Transformers: 4.57.3
- PyTorch: 2.5.1+cu121
- Datasets: 4.4.2
- Tokenizers: 0.22.2
## Limitations
- 3-step loading required: Must load base model, merge SFT adapter, then apply GRPO adapter
- Domain-specific: Trained only on KMP spaceflight transcriptomic data (4 tissues)
- Small evaluation set: 107 samples with held-out eye and thymus tissues
- ECE target not met: Achieved 0.183 vs target of 0.15
- May not generalize to other biological domains without additional training
## Citation

```bibtex
@software{biorlhf2026,
  author = {Kim, JangKeun},
  title = {BioRLHF: Verifier-Based Reinforcement Learning for Biological Reasoning},
  year = {2026},
  url = {https://github.com/jang1563/BioRLHF}
}
```
Cite TRL as:

```bibtex
@misc{vonwerra2022trl,
  title = {{TRL: Transformer Reinforcement Learning}},
  author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
  year = 2020,
  journal = {GitHub repository},
  publisher = {GitHub},
  howpublished = {\url{https://github.com/huggingface/trl}}
}
```