BioRLHF GRPO Model (biorlhf-grpo-mistral-7b)

A LoRA adapter trained with Group Relative Policy Optimization (GRPO) on top of the BioRLHF SFT checkpoint, using composable biological verifiers (V1-V4) for multi-dimensional reward scoring.

This model improves calibration and factual accuracy on spaceflight transcriptomic reasoning tasks compared to the SFT baseline.

Part of the BioRLHF framework for training LLMs with verifier-based reinforcement learning on biological reasoning tasks.

Model Details

| Field | Value |
|---|---|
| Base model | mistralai/Mistral-7B-v0.3 (7.4B parameters) |
| SFT base | jang1563/biorlhf-sft-mistral-7b |
| Adapter type | LoRA (r=32, alpha=64, dropout=0.05) |
| Adapter size | 161 MB |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Training method | GRPO with V1-V4 composable verifiers |
| Training data | 363 examples (via SFT base) + GRPO self-play |
| Domain | Spaceflight biology, transcriptomics, drug-stressor interactions |

Performance

Evaluated on a 107-sample held-out set (eye + thymus tissues):

| Metric | GRPO | SFT Baseline | Change |
|---|---|---|---|
| Mean Reward | 0.566 | 0.428 | +32.1% |
| ECE | 0.183 | 0.478 | -61.6% |
| Brier Score | 0.281 | — | — |
| Overconfidence Rate | 40.8% | — | — |
| Mean Accuracy | 61.7% | — | — |

Note: SFT baseline metrics in this table are from the same evaluation run as the GRPO model. Absolute metrics (reward, ECE) are comparable across evaluation runs; percentage improvements are valid only within the same run.
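ECE here refers to expected calibration error. A minimal sketch of the standard equal-width-binning computation (the card does not state the binning details, so 10 equal-width bins is an assumption):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted mean of |accuracy - confidence| over confidence bins.

    Illustrative only; bin count and bin edges are assumptions, not
    values taken from this card's evaluation code.
    """
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # bins are (lo, hi]; the first bin also includes confidence == 0.0
        idx = [i for i, c in enumerate(confidences)
               if (c > lo or (b == 0 and c == lo)) and c <= hi]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(acc - conf)
    return ece
```

A model that is 90% confident but always right in one bin, and 60% confident but right half the time in another, gets penalized in both: ECE rewards confidence that tracks accuracy, not high accuracy alone.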

Usage

Important: This GRPO adapter was trained on top of the merged SFT model. You must load the base model, merge the SFT adapter, then apply this GRPO adapter (3-step chain).

With PEFT

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Step 1: Load base model
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.3",
    torch_dtype="auto",
    device_map="auto",
)

# Step 2: Merge SFT adapter into base
sft_model = PeftModel.from_pretrained(base, "jang1563/biorlhf-sft-mistral-7b")
merged = sft_model.merge_and_unload()

# Step 3: Apply GRPO adapter
model = PeftModel.from_pretrained(merged, "jang1563/biorlhf-grpo-mistral-7b")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")

prompt = "### Instruction:\nHow does hindlimb unloading affect gene expression in the liver?\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

With BioRLHF

```python
from biorlhf.utils import load_model_for_inference, generate_response

model, tokenizer = load_model_for_inference(
    model_path="jang1563/biorlhf-grpo-mistral-7b",
    base_model="mistralai/Mistral-7B-v0.3",
    sft_adapter="jang1563/biorlhf-sft-mistral-7b",
)

response = generate_response(
    model, tokenizer,
    "### Instruction:\nHow does Kaempferol affect the liver under radiation?\n\n### Response:\n"
)
print(response)
```

Training Details

Method

GRPO (Group Relative Policy Optimization) generates G=16 completions per prompt, scores each with composable verifiers, and updates the policy using group-normalized advantages with a KL penalty against the reference model.
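The group-normalized advantage can be sketched as follows. This is a minimal illustration: TRL's implementation typically adds a small epsilon to the denominator, and whether the group standard deviation is the population or sample estimate is an implementation detail assumed here.

```python
import statistics

def group_advantages(rewards):
    """Group-relative advantages as in GRPO: for the G completions of one
    prompt, subtract the group mean reward and divide by the group std.

    Sketch only: uses the population std and returns zeros for a
    zero-variance group instead of adding an epsilon.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```

Because advantages are normalized within each group, only the *relative* ranking of completions for the same prompt matters; a uniformly hard prompt does not drown out the learning signal from easier ones.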

Verifier System

| Verifier | Weight | Description |
|---|---|---|
| V1 Factual | 0.30 | Exact match for DEG counts, tissue names, directions |
| V2 Pathway | 0.15 | Pathway/gene set enrichment validation |
| V3 Consistency | 0.10 | Internal logical consistency |
| V4 Uncertainty | 0.45 | Calibration and epistemic humility (dominant) |

V4 operates in "legacy" mode with a moderate-confidence prior, providing the strongest calibration signal.
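The composite reward is a weighted sum of the per-verifier scores. A minimal sketch, assuming each verifier returns a score in [0, 1]; the dictionary keys are illustrative names, not API identifiers from the BioRLHF package:

```python
# Weights taken from the verifier table above; key names are illustrative.
WEIGHTS = {
    "V1_factual": 0.30,
    "V2_pathway": 0.15,
    "V3_consistency": 0.10,
    "V4_uncertainty": 0.45,
}

def composite_reward(scores):
    """Weighted sum of per-verifier scores, each assumed to lie in [0, 1]."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights form a convex combination
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
```

With V4 at 0.45, a factually correct but overconfident answer can still score below a correct, well-hedged one, which is the intended pressure toward calibration.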

Hyperparameters

| Parameter | Value |
|---|---|
| Generations per prompt (G) | 16 |
| KL penalty (beta) | 0.02 |
| Learning rate | 5e-7 |
| LR scheduler | Cosine |
| Epochs | 1 (2308 steps) |
| Gradient accumulation | 8 |
| LoRA rank (r) | 32 |
| LoRA alpha | 64 |
| LoRA dropout | 0.05 |
| scale_rewards | group |
| loss_type | grpo |
| num_iterations | 2 |
| Precision | bf16 |
| Hardware | NVIDIA A40 48GB |
| Training time | ~38h |
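These hyperparameters map onto TRL's `GRPOConfig` roughly as follows. This is a hedged sketch, not the card's actual training script: field names assume the TRL version listed below, and `output_dir` is an illustrative placeholder.

```python
from trl import GRPOConfig

# Illustrative mapping of the hyperparameters above onto TRL's GRPOConfig.
# output_dir is a placeholder, not a value from this card.
config = GRPOConfig(
    output_dir="biorlhf-grpo",
    num_generations=16,             # G: completions sampled per prompt
    beta=0.02,                      # KL penalty vs. the reference model
    learning_rate=5e-7,
    lr_scheduler_type="cosine",
    num_train_epochs=1,
    gradient_accumulation_steps=8,
    num_iterations=2,               # policy updates per generation batch
    scale_rewards="group",          # group-wise reward normalization
    loss_type="grpo",
    bf16=True,
)
```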

Training Progression

This GRPO model is Phase 4 of the BioRLHF training pipeline:

| Phase | Verifiers | G | Reward | ECE | Key Finding |
|---|---|---|---|---|---|
| SFT | — | — | baseline | baseline | 90% accuracy on 20-question test |
| MVE | V1+V4 | 4 | 0.650 | 0.078 | Proof of concept, best ECE |
| Full v2 | V1-V4 (equal) | 16 | 0.691 | 0.172 | Best absolute reward |
| Phase 4 (this model) | V1-V4 (V4=0.45) | 16 | 0.566 | 0.183 | Best calibration with full verifiers |

Training procedure

Framework versions

  • PEFT: 0.18.0
  • TRL: 0.26.2
  • Transformers: 4.57.3
  • PyTorch: 2.5.1+cu121
  • Datasets: 4.4.2
  • Tokenizers: 0.22.2

Limitations

  • 3-step loading required: Must load base model, merge SFT adapter, then apply GRPO adapter
  • Domain-specific: Trained only on KMP spaceflight transcriptomic data (4 tissues)
  • Small evaluation set: 107 samples with held-out eye and thymus tissues
  • ECE target not met: Achieved 0.183 vs target of 0.15
  • May not generalize to other biological domains without additional training

Citation

@software{biorlhf2026,
  author = {Kim, JangKeun},
  title = {BioRLHF: Verifier-Based Reinforcement Learning for Biological Reasoning},
  year = {2026},
  url = {https://github.com/jang1563/BioRLHF}
}

Cite TRL as:

@misc{vonwerra2022trl,
    title        = {{TRL: Transformer Reinforcement Learning}},
    author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
    year         = 2020,
    journal      = {GitHub repository},
    publisher    = {GitHub},
    howpublished = {\url{https://github.com/huggingface/trl}}
}