# BioRLHF GRPO Model (biorlhf-grpo-mistral-7b)
A LoRA adapter trained with Group Relative Policy Optimization (GRPO) on top of the BioRLHF SFT checkpoint, using composable biological verifiers (V1-V4) for multi-dimensional reward scoring.
This model improves calibration and factual accuracy on spaceflight transcriptomic reasoning tasks compared to the SFT baseline.
Part of the BioRLHF framework for training LLMs with verifier-based reinforcement learning on biological reasoning tasks.
## Model Details

| Field | Value |
|---|---|
| Base model | mistralai/Mistral-7B-v0.3 (7.4B parameters) |
| SFT base | jang1563/biorlhf-sft-mistral-7b |
| Adapter type | LoRA (r=32, alpha=64, dropout=0.05) |
| Adapter size | 161 MB |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Training method | GRPO with V1-V4 composable verifiers |
| Training data | 363 examples (via SFT base) + GRPO self-play |
| Domain | Spaceflight biology, transcriptomics, drug-stressor interactions |
## Performance
Evaluated on a 107-sample held-out set (eye + thymus tissues):
| Metric | GRPO | SFT Baseline | Change |
|---|---|---|---|
| Mean Reward | 0.566 | 0.428 | +32.1% |
| ECE | 0.183 | 0.478 | -61.6% |
| Brier Score | 0.281 | — | — |
| Overconfidence Rate | 40.8% | — | — |
| Mean Accuracy | 61.7% | — | — |
Note: The SFT baseline metrics in this table come from the same evaluation run as the GRPO model. Absolute metrics (reward, ECE) are comparable across evaluation runs; percentage changes are meaningful only within a single run.
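ECE (Expected Calibration Error) measures the gap between the model's stated confidence and its observed accuracy. A minimal sketch of a binned ECE computation, for reference only; this is not the exact evaluation code behind the table, and the example values are illustrative:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean |accuracy - confidence| over equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Assign each prediction to one of n_bins equal-width bins on [0, 1].
    bins = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight bin by its share of samples
    return ece

# Perfectly calibrated: 80% confidence, 4/5 correct -> ECE = 0
print(expected_calibration_error([0.8] * 5, [1, 1, 1, 1, 0]))
# Overconfident: 85% confidence, 1/4 correct -> large gap
print(expected_calibration_error([0.85] * 4, [1, 0, 0, 0]))
```

A lower ECE means the model's confidence statements track its actual accuracy more closely, which is the behavior the V4 verifier rewards.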
## Usage
Important: This GRPO adapter was trained on top of the merged SFT model. You must load the base model, merge the SFT adapter, then apply this GRPO adapter (3-step chain).
### With PEFT

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Step 1: Load base model
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.3",
    torch_dtype="auto",
    device_map="auto",
)

# Step 2: Merge SFT adapter into base
sft_model = PeftModel.from_pretrained(base, "jang1563/biorlhf-sft-mistral-7b")
merged = sft_model.merge_and_unload()

# Step 3: Apply GRPO adapter
model = PeftModel.from_pretrained(merged, "jang1563/biorlhf-grpo-mistral-7b")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")

prompt = "### Instruction:\nHow does hindlimb unloading affect gene expression in the liver?\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### With BioRLHF

```python
from biorlhf.utils import load_model_for_inference, generate_response

model, tokenizer = load_model_for_inference(
    model_path="jang1563/biorlhf-grpo-mistral-7b",
    base_model="mistralai/Mistral-7B-v0.3",
    sft_adapter="jang1563/biorlhf-sft-mistral-7b",
)

response = generate_response(
    model, tokenizer,
    "### Instruction:\nHow does Kaempferol affect the liver under radiation?\n\n### Response:\n"
)
print(response)
```
## Training Details

### Method
GRPO (Group Relative Policy Optimization) generates G=16 completions per prompt, scores each with composable verifiers, and updates the policy using group-normalized advantages with a KL penalty against the reference model.
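The group-normalized advantage step can be sketched as follows. This is an illustrative implementation, not the exact training code; GRPO variants differ in whether the group standard deviation is used for scaling (this card's `scale_rewards=group` setting implies it is):

```python
import numpy as np

def group_advantages(rewards, eps=1e-4):
    """Center and scale verifier rewards within one prompt's group of G completions.

    Each completion's advantage is its reward relative to the group mean,
    scaled by the group standard deviation (eps avoids division by zero
    when all G completions score identically).
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four completions for one prompt: above-mean rewards get positive
# advantage, below-mean get negative, and advantages sum to ~0.
adv = group_advantages([0.2, 0.8, 0.5, 0.5])
print(adv)
```

The policy gradient then pushes up the log-probability of positively-advantaged completions and pushes down the rest, with the KL penalty (beta) keeping the policy close to the reference model.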
### Verifier System
| Verifier | Weight | Description |
|---|---|---|
| V1 Factual | 0.30 | Exact match for DEG counts, tissue names, directions |
| V2 Pathway | 0.15 | Pathway/gene set enrichment validation |
| V3 Consistency | 0.10 | Internal logical consistency |
| V4 Uncertainty | 0.45 | Calibration and epistemic humility (dominant) |
V4 operates in "legacy" mode with a moderate-confidence prior, providing the strongest calibration signal.
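The composite reward is the weighted sum of the four verifier scores using the weights in the table above. A minimal sketch, assuming each verifier returns a score in [0, 1]; the per-verifier score values below are made up for illustration:

```python
# Verifier weights from the table above (sum to 1.0); V4 dominates.
WEIGHTS = {
    "V1_factual": 0.30,
    "V2_pathway": 0.15,
    "V3_consistency": 0.10,
    "V4_uncertainty": 0.45,
}

def composite_reward(scores):
    """Weighted sum of per-verifier scores in [0, 1]."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

# Hypothetical scores for one completion: factually correct and
# internally consistent, but middling pathway and calibration scores.
r = composite_reward({
    "V1_factual": 1.0,
    "V2_pathway": 0.5,
    "V3_consistency": 1.0,
    "V4_uncertainty": 0.6,
})
print(r)
```

Because V4 carries the largest weight, a completion cannot score well overall without expressing well-calibrated confidence, which is what drives the ECE improvement over the SFT baseline.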
### Hyperparameters
| Parameter | Value |
|---|---|
| Generations per prompt (G) | 16 |
| KL penalty (beta) | 0.02 |
| Learning rate | 5e-7 |
| LR scheduler | Cosine |
| Epochs | 1 (2308 steps) |
| Gradient accumulation | 8 |
| LoRA rank (r) | 32 |
| LoRA alpha | 64 |
| LoRA dropout | 0.05 |
| scale_rewards | group |
| loss_type | grpo |
| num_iterations | 2 |
| Precision | bf16 |
| Hardware | NVIDIA A40 48GB |
| Training time | ~38h |
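The table above maps onto TRL's `GRPOConfig` roughly as follows. This is a sketch against TRL 0.26 field names, not the full original training configuration; the `output_dir` path is illustrative and remaining fields are left at their defaults:

```python
from trl import GRPOConfig

# Sketch only: values taken from the hyperparameter table above.
config = GRPOConfig(
    output_dir="outputs/grpo",   # illustrative path
    num_generations=16,          # G completions per prompt
    beta=0.02,                   # KL penalty vs. reference model
    learning_rate=5e-7,
    lr_scheduler_type="cosine",
    num_train_epochs=1,
    gradient_accumulation_steps=8,
    scale_rewards="group",
    loss_type="grpo",
    num_iterations=2,
    bf16=True,
)
```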
### Training Progression
This GRPO model is Phase 4 of the BioRLHF training pipeline:
| Phase | Verifiers | G | Reward | ECE | Key Finding |
|---|---|---|---|---|---|
| SFT | — | — | baseline | baseline | 90% accuracy on 20-Q test |
| MVE | V1+V4 | 4 | 0.650 | 0.078 | Proof-of-concept, best ECE |
| Full v2 | V1-V4 (equal) | 16 | 0.691 | 0.172 | Best absolute reward |
| Phase 4 (this model) | V1-V4 (V4=0.45) | 16 | 0.566 | 0.183 | Best calibration with full verifiers |
### Framework versions
- PEFT: 0.18.0
- TRL: 0.26.2
- Transformers: 4.57.3
- PyTorch: 2.5.1+cu121
- Datasets: 4.4.2
- Tokenizers: 0.22.2
## Limitations
- 3-step loading required: Must load base model, merge SFT adapter, then apply GRPO adapter
- Domain-specific: Trained only on KMP spaceflight transcriptomic data (4 tissues)
- Small evaluation set: 107 samples with held-out eye and thymus tissues
- ECE target not met: Achieved 0.183 vs target of 0.15
- May not generalize to other biological domains without additional training
## Citation

```bibtex
@software{biorlhf2026,
  author = {Kim, JangKeun},
  title = {BioRLHF: Verifier-Based Reinforcement Learning for Biological Reasoning},
  year = {2026},
  url = {https://github.com/jang1563/BioRLHF}
}
```
Cite TRL as:

```bibtex
@misc{vonwerra2022trl,
  title = {{TRL: Transformer Reinforcement Learning}},
  author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
  year = 2020,
  journal = {GitHub repository},
  publisher = {GitHub},
  howpublished = {\url{https://github.com/huggingface/trl}}
}
```