BioRLHF SFT Model (biorlhf-sft-mistral-7b)
A LoRA adapter fine-tuned on Mistral-7B-v0.3 for biological reasoning tasks, trained on spaceflight transcriptomic data from a Kaempferol (KMP) 2x2x2 factorial study.
This model is designed to answer questions about gene expression changes under spaceflight stressors (microgravity, radiation) and drug interventions, with calibrated confidence statements.
Part of the BioRLHF framework for training LLMs with verifier-based reinforcement learning on biological reasoning tasks.
Model Details
| Field | Value |
|---|---|
| Base model | mistralai/Mistral-7B-v0.3 (7.4B parameters) |
| Adapter type | LoRA (r=64, alpha=128, dropout=0.05) |
| Adapter size | 640 MB (~168M trainable parameters) |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Quantization | 4-bit QLoRA (NF4, double quantization) |
| Training data | 363 examples from KMP spaceflight transcriptomic study |
| Domain | Spaceflight biology, transcriptomics, drug-stressor interactions |
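The "~168M trainable parameters" figure can be sanity-checked from the published Mistral-7B architecture (32 layers, hidden size 4096, 8 KV heads of head dim 128 giving a 1024-dim k/v projection, MLP width 14336): each LoRA adapter on a d_in × d_out projection adds r·(d_in + d_out) parameters for its A and B matrices.

```python
# Back-of-the-envelope check of the "~168M trainable parameters" figure.
# Each LoRA adapter on a d_in x d_out projection adds r*(d_in + d_out)
# parameters (low-rank matrices A and B). Dimensions are the published
# Mistral-7B architecture.
r = 64
hidden, kv, mlp, layers = 4096, 1024, 14336, 32

projections = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv),
    "v_proj": (hidden, kv),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, mlp),
    "up_proj": (hidden, mlp),
    "down_proj": (mlp, hidden),
}

per_layer = sum(r * (d_in + d_out) for d_in, d_out in projections.values())
total = per_layer * layers
print(f"{total / 1e6:.0f}M trainable parameters")  # -> 168M
```

The result, 167,772,160 parameters, matches the ~168M stated in the table.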
Performance
Evaluated on a held-out 20-question test set covering factual recall, multi-step reasoning, and uncertainty expression:
| Metric | Score |
|---|---|
| Overall Accuracy | 90.0% |
| Factual Accuracy | 80.0% |
| Reasoning Accuracy | 100.0% |
| Calibration Accuracy | 100.0% |
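For intuition, the per-category scores are consistent with the 90% overall figure under a 10/5/5 question split. That split is an assumption for illustration only; the card does not state how the 20 questions divide across categories.

```python
# Illustration of how the per-category scores could combine to 90%
# overall. The 10/5/5 question split is an ASSUMPTION (not stated in
# the card), chosen only to show the arithmetic.
factual = (8, 10)     # 80% on 10 factual questions (assumed split)
reasoning = (5, 5)    # 100% on 5 reasoning questions
calibration = (5, 5)  # 100% on 5 calibration questions

correct = factual[0] + reasoning[0] + calibration[0]
total = factual[1] + reasoning[1] + calibration[1]
print(f"overall = {correct}/{total} = {correct/total:.0%}")  # 18/20 = 90%
```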
Note: This test set is small (20 questions). For more robust evaluation, this SFT checkpoint was evaluated as a baseline alongside GRPO-trained models on a 107-sample held-out set covering multiple data sources. See the BioRLHF repository for full evaluation results.
This SFT checkpoint serves as the starting point for GRPO training with automated verifiers (V1-V4), which further improves mean reward by 18% and reduces calibration error on the 107-sample evaluation set.
Usage
With PEFT
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base_model = AutoModelForCausalLM.from_pretrained(
"mistralai/Mistral-7B-v0.3",
torch_dtype="auto",
device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "jang1563/biorlhf-sft-mistral-7b")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")
prompt = "### Instruction:\nHow does hindlimb unloading affect gene expression in the liver?\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
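The snippet above loads the base model in full precision. To mirror the 4-bit setting the adapter was trained with (NF4 with double quantization, per the table), the base model can instead be loaded with a quantization config. This is a sketch using the standard Transformers API; it requires bitsandbytes and a CUDA-capable GPU.

```python
# Optional: load the base model in 4-bit, matching the NF4 + double
# quantization used during QLoRA training (requires bitsandbytes and
# a CUDA GPU).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.3",
    quantization_config=bnb_config,
    device_map="auto",
)
```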
With BioRLHF
from biorlhf.utils import load_model_for_inference, generate_response
model, tokenizer = load_model_for_inference(
model_path="jang1563/biorlhf-sft-mistral-7b",
base_model="mistralai/Mistral-7B-v0.3",
)
response = generate_response(
model, tokenizer,
"### Instruction:\nHow does Kaempferol affect the liver under radiation?\n\n### Response:\n"
)
print(response)
Training Details
Data
Training data is derived from a 2x2x2 factorial transcriptomic study:
- Drug: Kaempferol (KMP) vs Control
- Stressor 1: Hindlimb Unloading (HU) -- simulates microgravity
- Stressor 2: Ionizing Radiation (IR) -- simulates space radiation
- Tissues: Heart, Hippocampus, Liver, Soleus
The 363 training examples cover factual Q&A, chain-of-thought reasoning, and uncertainty calibration, generated through iterative refinement over 5 model versions.
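The 2x2x2 design above (drug × HU × IR) yields 8 treatment groups, and crossing them with the 4 tissues gives 32 tissue-level conditions. A quick enumeration:

```python
# Enumerate the 2x2x2 factorial design described above: 8 treatment
# groups per tissue, 32 tissue-level conditions across 4 tissues.
from itertools import product

drugs = ["Control", "KMP"]
hu = ["no-HU", "HU"]  # hindlimb unloading (simulated microgravity)
ir = ["no-IR", "IR"]  # ionizing radiation
tissues = ["Heart", "Hippocampus", "Liver", "Soleus"]

groups = list(product(drugs, hu, ir))
conditions = list(product(tissues, groups))
print(len(groups), len(conditions))  # -> 8 32
```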
Hyperparameters
| Parameter | Value |
|---|---|
| Epochs | 10 |
| Learning rate | 1e-4 |
| LR scheduler | Cosine |
| Warmup ratio | 0.1 |
| Weight decay | 0.01 |
| Batch size | 4 |
| Gradient accumulation | 4 |
| Effective batch size | 16 |
| LoRA rank (r) | 64 |
| LoRA alpha | 128 |
| LoRA dropout | 0.05 |
| Max sequence length | 1536 |
| Quantization | 4-bit QLoRA (NF4) |
| Optimizer | AdamW |
| Precision | bf16 |
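The table implies a short training schedule. Assuming a single GPU (the table lists no device count) and rounding each epoch up to whole optimizer steps, the arithmetic works out as follows; exact counts depend on dataloader drop_last behavior.

```python
# Rough training-schedule arithmetic implied by the hyperparameter
# table (single GPU assumed; exact step counts depend on the
# dataloader's drop_last setting).
import math

examples, epochs = 363, 10
per_device_batch, grad_accum = 4, 4
warmup_ratio = 0.1

effective_batch = per_device_batch * grad_accum          # 16
steps_per_epoch = math.ceil(examples / effective_batch)  # 23
total_steps = steps_per_epoch * epochs                   # 230
warmup_steps = round(total_steps * warmup_ratio)         # 23

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
```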
Training Progression
| Version | Accuracy | Key Improvement |
|---|---|---|
| v1 | ~20% | Format learned, facts incorrect |
| v2 | ~60% | Expanded training set |
| v3 | ~80% | Fact drilling via targeted repetition |
| v4 | ~85% | Chain-of-thought and calibration examples |
| Final | 90% | Targeted drilling for remaining errors |
Framework versions
- PEFT: 0.18.0
- TRL: 0.26.2
- Transformers: 4.57.3
- PyTorch: 2.5.1+cu121
- Datasets: 4.4.2
- Tokenizers: 0.22.2
Limitations
- Domain-specific: trained only on KMP spaceflight transcriptomic data (4 tissues)
- May not generalize to other biological domains without additional training
- Numeric recall (exact DEG counts) is challenging and may require GRPO fine-tuning
- Small evaluation set (20 questions); broader evaluations available in the BioRLHF framework
Citation
@software{biorlhf2026,
author = {Kim, JangKeun},
title = {BioRLHF: Verifier-Based Reinforcement Learning for Biological Reasoning},
year = {2026},
url = {https://github.com/jang1563/BioRLHF}
}
Cite TRL as:
@misc{vonwerra2022trl,
title = {{TRL: Transformer Reinforcement Learning}},
author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
year = 2020,
journal = {GitHub repository},
publisher = {GitHub},
howpublished = {\url{https://github.com/huggingface/trl}}
}