BioRLHF SFT Model (biorlhf-sft-mistral-7b)
A LoRA adapter fine-tuned on Mistral-7B-v0.3 for biological reasoning tasks, trained on spaceflight transcriptomic data from a Kaempferol (KMP) 2x2x2 factorial study.
This model is designed to answer questions about gene expression changes under spaceflight stressors (microgravity, radiation) and drug interventions, with calibrated confidence statements.
Part of the BioRLHF framework for training LLMs with verifier-based reinforcement learning on biological reasoning tasks.
Model Details
| Field | Value |
|---|---|
| Base model | mistralai/Mistral-7B-v0.3 (7.4B parameters) |
| Adapter type | LoRA (r=64, alpha=128, dropout=0.05) |
| Adapter size | 640 MB (~168M trainable parameters) |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Quantization | 4-bit QLoRA (NF4, double quantization) |
| Training data | 363 examples from KMP spaceflight transcriptomic study |
| Domain | Spaceflight biology, transcriptomics, drug-stressor interactions |
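The "~168M trainable parameters" figure can be sanity-checked from the published Mistral-7B architecture (32 layers, hidden size 4096, 8 KV heads of head dim 128 giving a 1024-dim k/v projection, MLP width 14336): each LoRA adapter on a d_in × d_out projection adds r·(d_in + d_out) parameters for its A and B matrices.

```python
# Back-of-the-envelope check of the "~168M trainable parameters" figure.
# Each LoRA adapter on a d_in x d_out projection adds r*(d_in + d_out)
# parameters (low-rank matrices A and B). Dimensions are the published
# Mistral-7B architecture.
r = 64
hidden, kv, mlp, layers = 4096, 1024, 14336, 32

projections = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv),
    "v_proj": (hidden, kv),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, mlp),
    "up_proj": (hidden, mlp),
    "down_proj": (mlp, hidden),
}

per_layer = sum(r * (d_in + d_out) for d_in, d_out in projections.values())
total = per_layer * layers
print(f"{total / 1e6:.0f}M trainable parameters")  # -> 168M
```

The result, 167,772,160 parameters, matches the ~168M stated in the table.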
Performance
Evaluated on a held-out 20-question test set covering factual recall, multi-step reasoning, and uncertainty expression:
| Metric | Score |
|---|---|
| Overall Accuracy | 90.0% |
| Factual Accuracy | 80.0% |
| Reasoning Accuracy | 100.0% |
| Calibration Accuracy | 100.0% |
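For intuition, the per-category scores are consistent with the 90% overall figure under a 10/5/5 question split. That split is an assumption for illustration only; the card does not state how the 20 questions divide across categories.

```python
# Illustration of how the per-category scores could combine to 90%
# overall. The 10/5/5 question split is an ASSUMPTION (not stated in
# the card), chosen only to show the arithmetic.
factual = (8, 10)     # 80% on 10 factual questions (assumed split)
reasoning = (5, 5)    # 100% on 5 reasoning questions
calibration = (5, 5)  # 100% on 5 calibration questions

correct = factual[0] + reasoning[0] + calibration[0]
total = factual[1] + reasoning[1] + calibration[1]
print(f"overall = {correct}/{total} = {correct/total:.0%}")  # 18/20 = 90%
```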
Note: This test set is small (20 questions). For more robust evaluation, this SFT checkpoint was evaluated as a baseline alongside GRPO-trained models on a 107-sample held-out set covering multiple data sources. See the BioRLHF repository for full evaluation results.
This SFT checkpoint serves as the starting point for GRPO training with automated verifiers (V1-V4), which further improves mean reward by 18% and reduces calibration error on the 107-sample evaluation set.
Usage
With PEFT
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base_model = AutoModelForCausalLM.from_pretrained(
"mistralai/Mistral-7B-v0.3",
torch_dtype="auto",
device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "jang1563/biorlhf-sft-mistral-7b")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")
prompt = "### Instruction:\nHow does hindlimb unloading affect gene expression in the liver?\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
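The snippet above loads the base model in full precision. To mirror the 4-bit setting the adapter was trained with (NF4 with double quantization, per the table), the base model can instead be loaded with a quantization config. This is a sketch using the standard Transformers API; it requires bitsandbytes and a CUDA-capable GPU.

```python
# Optional: load the base model in 4-bit, matching the NF4 + double
# quantization used during QLoRA training (requires bitsandbytes and
# a CUDA GPU).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.3",
    quantization_config=bnb_config,
    device_map="auto",
)
```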
With BioRLHF
from biorlhf.utils import load_model_for_inference, generate_response
model, tokenizer = load_model_for_inference(
model_path="jang1563/biorlhf-sft-mistral-7b",
base_model="mistralai/Mistral-7B-v0.3",
)
response = generate_response(
model, tokenizer,
"### Instruction:\nHow does Kaempferol affect the liver under radiation?\n\n### Response:\n"
)
print(response)
Training Details
Data
Training data is derived from a 2x2x2 factorial transcriptomic study:
- Drug: Kaempferol (KMP) vs Control
- Stressor 1: Hindlimb Unloading (HU) -- simulates microgravity
- Stressor 2: Ionizing Radiation (IR) -- simulates space radiation
- Tissues: Heart, Hippocampus, Liver, Soleus
The 363 training examples cover factual Q&A, chain-of-thought reasoning, and uncertainty calibration, generated through iterative refinement over 5 model versions.
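The 2x2x2 design above (drug × HU × IR) yields 8 treatment groups, and crossing them with the 4 tissues gives 32 tissue-level conditions. A quick enumeration:

```python
# Enumerate the 2x2x2 factorial design described above: 8 treatment
# groups per tissue, 32 tissue-level conditions across 4 tissues.
from itertools import product

drugs = ["Control", "KMP"]
hu = ["no-HU", "HU"]  # hindlimb unloading (simulated microgravity)
ir = ["no-IR", "IR"]  # ionizing radiation
tissues = ["Heart", "Hippocampus", "Liver", "Soleus"]

groups = list(product(drugs, hu, ir))
conditions = list(product(tissues, groups))
print(len(groups), len(conditions))  # -> 8 32
```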
Hyperparameters
| Parameter | Value |
|---|---|
| Epochs | 10 |
| Learning rate | 1e-4 |
| LR scheduler | Cosine |
| Warmup ratio | 0.1 |
| Weight decay | 0.01 |
| Batch size | 4 |
| Gradient accumulation | 4 |
| Effective batch size | 16 |
| LoRA rank (r) | 64 |
| LoRA alpha | 128 |
| LoRA dropout | 0.05 |
| Max sequence length | 1536 |
| Quantization | 4-bit QLoRA (NF4) |
| Optimizer | AdamW |
| Precision | bf16 |
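The table implies a short training schedule. Assuming a single GPU (the table lists no device count) and rounding each epoch up to whole optimizer steps, the arithmetic works out as follows; exact counts depend on dataloader drop_last behavior.

```python
# Rough training-schedule arithmetic implied by the hyperparameter
# table (single GPU assumed; exact step counts depend on the
# dataloader's drop_last setting).
import math

examples, epochs = 363, 10
per_device_batch, grad_accum = 4, 4
warmup_ratio = 0.1

effective_batch = per_device_batch * grad_accum          # 16
steps_per_epoch = math.ceil(examples / effective_batch)  # 23
total_steps = steps_per_epoch * epochs                   # 230
warmup_steps = round(total_steps * warmup_ratio)         # 23

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
```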
Training Progression
| Version | Accuracy | Key Improvement |
|---|---|---|
| v1 | ~20% | Format learned, facts incorrect |
| v2 | ~60% | Expanded training set |
| v3 | ~80% | Fact drilling via targeted repetition |
| v4 | ~85% | Chain-of-thought and calibration examples |
| Final | 90% | Targeted drilling for remaining errors |
Framework versions
- PEFT: 0.18.0
- TRL: 0.26.2
- Transformers: 4.57.3
- PyTorch: 2.5.1+cu121
- Datasets: 4.4.2
- Tokenizers: 0.22.2
Limitations
- Domain-specific: trained only on KMP spaceflight transcriptomic data (4 tissues)
- May not generalize to other biological domains without additional training
- Numeric recall (exact DEG counts) is challenging and may require GRPO fine-tuning
- Small evaluation set (20 questions); broader evaluations available in the BioRLHF framework
Citation
@software{biorlhf2026,
author = {Kim, JangKeun},
title = {BioRLHF: Verifier-Based Reinforcement Learning for Biological Reasoning},
year = {2026},
url = {https://github.com/jang1563/BioRLHF}
}
Cite TRL as:
@misc{vonwerra2022trl,
title = {{TRL: Transformer Reinforcement Learning}},
author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
year = 2020,
journal = {GitHub repository},
publisher = {GitHub},
howpublished = {\url{https://github.com/huggingface/trl}}
}