BioRLHF SFT Model (biorlhf-sft-mistral-7b)

A LoRA adapter fine-tuned on Mistral-7B-v0.3 for biological reasoning tasks, trained on spaceflight transcriptomic data from a Kaempferol (KMP) 2x2x2 factorial study.

This model is designed to answer questions about gene expression changes under spaceflight stressors (microgravity, radiation) and drug interventions, with calibrated confidence statements.

Part of the BioRLHF framework for training LLMs with verifier-based reinforcement learning on biological reasoning tasks.

Model Details

Base model mistralai/Mistral-7B-v0.3 (7.4B parameters)
Adapter type LoRA (r=64, alpha=128, dropout=0.05)
Adapter size 640 MB (~168M trainable parameters)
Target modules q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Quantization 4-bit QLoRA (NF4, double quantization)
Training data 363 examples from KMP spaceflight transcriptomic study
Domain Spaceflight biology, transcriptomics, drug-stressor interactions
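The "~168M trainable parameters" figure can be sanity-checked from the LoRA settings above and Mistral-7B-v0.3's public config (hidden size 4096, 32 layers, 8 KV heads of head dim 128, MLP intermediate size 14336); each adapted projection adds r * (d_in + d_out) parameters for its A and B matrices. A quick arithmetic sketch:

```python
# Sanity-check the "~168M trainable parameters" figure from the adapter card.
# Model dimensions are Mistral-7B-v0.3 config values; per adapted projection,
# LoRA adds r * (d_in + d_out) parameters (the A and B matrices).

HIDDEN, LAYERS, KV_DIM, INTERMEDIATE = 4096, 32, 1024, 14336
R = 64  # LoRA rank from the card

# (d_in, d_out) for each target module listed above
modules = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

per_layer = sum(R * (d_in + d_out) for d_in, d_out in modules.values())
total = per_layer * LAYERS
print(f"{total:,} trainable LoRA parameters")  # 167,772,160 ~= 168M
```

This lands exactly on 167,772,160 parameters, consistent with the reported adapter size.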

Performance

Evaluated on a held-out 20-question test set covering factual recall, multi-step reasoning, and uncertainty expression:

Metric Score
Overall Accuracy 90.0%
Factual Accuracy 80.0%
Reasoning Accuracy 100.0%
Calibration Accuracy 100.0%

Note: this test set is small (20 questions). For a more robust assessment, this SFT checkpoint was also evaluated as a baseline alongside GRPO-trained models on a 107-sample held-out set spanning multiple data sources; see the BioRLHF repository for full evaluation results.

This SFT checkpoint serves as the starting point for GRPO training with automated verifiers (V1-V4), which further improves mean reward by +18% and reduces calibration error on a 107-sample evaluation set.

Usage

With PEFT

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model, then attach the LoRA adapter on top of it
base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.3",
    torch_dtype="auto",
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "jang1563/biorlhf-sft-mistral-7b")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")

# The adapter expects the Alpaca-style "### Instruction / ### Response" template
prompt = "### Instruction:\nHow does hindlimb unloading affect gene expression in the liver?\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

With BioRLHF

from biorlhf.utils import load_model_for_inference, generate_response

model, tokenizer = load_model_for_inference(
    model_path="jang1563/biorlhf-sft-mistral-7b",
    base_model="mistralai/Mistral-7B-v0.3",
)

response = generate_response(
    model, tokenizer,
    "### Instruction:\nHow does Kaempferol affect the liver under radiation?\n\n### Response:\n"
)
print(response)

Training Details

Data

Training data is derived from a 2x2x2 factorial transcriptomic study:

  • Drug: Kaempferol (KMP) vs Control
  • Stressor 1: Hindlimb Unloading (HU) -- simulates microgravity
  • Stressor 2: Ionizing Radiation (IR) -- simulates space radiation
  • Tissues: Heart, Hippocampus, Liver, Soleus

The 363 training examples cover factual Q&A, chain-of-thought reasoning, and uncertainty calibration, generated through iterative refinement over 5 model versions.
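The 2x2x2 design above yields 8 treatment groups, or 32 tissue-condition combinations across the 4 tissues. A minimal sketch of that enumeration (group labels here are illustrative, not the study's actual sample IDs):

```python
# Enumerate the experimental conditions behind the training data: a 2x2x2
# factorial (drug x hindlimb unloading x irradiation) across four tissues.
# Labels are illustrative placeholders, not the study's sample identifiers.
from itertools import product

drugs = ["Control", "KMP"]
hu = ["noHU", "HU"]  # hindlimb unloading (simulated microgravity)
ir = ["noIR", "IR"]  # ionizing radiation (simulated space radiation)
tissues = ["Heart", "Hippocampus", "Liver", "Soleus"]

groups = [
    f"{tissue}_{d}_{h}_{r}"
    for tissue, d, h, r in product(tissues, drugs, hu, ir)
]
print(len(groups))  # 4 tissues x 2 x 2 x 2 = 32 tissue-condition groups
```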

Hyperparameters

Parameter Value
Epochs 10
Learning rate 1e-4
LR scheduler Cosine
Warmup ratio 0.1
Weight decay 0.01
Batch size 4
Gradient accumulation 4
Effective batch size 16
LoRA rank (r) 64
LoRA alpha 128
LoRA dropout 0.05
Max sequence length 1536
Quantization 4-bit QLoRA (NF4)
Optimizer AdamW
Precision bf16
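For reference, the LoRA and 4-bit QLoRA settings in the table map onto PEFT and bitsandbytes configuration objects roughly as follows. This is a sketch of the implied configuration, not the exact training script:

```python
# Sketch of the LoRA / QLoRA configuration implied by the table above,
# assuming peft, transformers, and bitsandbytes are installed.
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NF4 quantization
    bnb_4bit_use_double_quant=True,      # double quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```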

Training Progression

Version Accuracy Key Improvement
v1 ~20% Output format learned; facts largely incorrect
v2 ~60% Expanded training set
v3 ~80% Fact drilling via targeted repetition
v4 ~85% Chain-of-thought and calibration examples
Final 90% Targeted drilling for remaining errors


Framework versions

  • PEFT: 0.18.0
  • TRL: 0.26.2
  • Transformers: 4.57.3
  • PyTorch: 2.5.1+cu121
  • Datasets: 4.4.2
  • Tokenizers: 0.22.2

Limitations

  • Domain-specific: trained only on KMP spaceflight transcriptomic data (4 tissues)
  • May not generalize to other biological domains without additional training
  • Numeric recall (exact DEG counts) is challenging and may require GRPO fine-tuning
  • Small evaluation set (20 questions); broader evaluations available in the BioRLHF framework

Citation

@software{biorlhf2026,
  author = {Kim, JangKeun},
  title = {BioRLHF: Verifier-Based Reinforcement Learning for Biological Reasoning},
  year = {2026},
  url = {https://github.com/jang1563/BioRLHF}
}

Cite TRL as:

@misc{vonwerra2022trl,
    title        = {{TRL: Transformer Reinforcement Learning}},
    author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
    year         = 2020,
    journal      = {GitHub repository},
    publisher    = {GitHub},
    howpublished = {\url{https://github.com/huggingface/trl}}
}