# Changelog

All notable changes to BioRLHF will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

## 0.2.0 - 2026-03-22

### Added
- GRPO training pipeline with verifier-based reward models
- `GRPOConfig` and `run_grpo_training` for Group Relative Policy Optimization
- CLI command `biorlhf-grpo --config <path>` for GRPO training
- Verifier system (V1-V4) for multi-dimensional reward scoring
  - V1 (Factual): Exact-match scoring for DEG counts, tissue names, and directions
  - V2 (Pathway): Pathway/gene set enrichment validation (Hallmark, KEGG)
  - V3 (Consistency): Internal logical consistency checking
  - V4 (Uncertainty): Calibration and epistemic humility scoring
- `RewardComposer` for weighted multi-reward composition
- GRPO dataset module (`grpo_dataset.py`) for prompt-based training data with hold-out tissues
- GeneLab data loader (`genelabloader.py`) for NES conservation questions
- Calibration evaluation (`calibration.py`) with Expected Calibration Error (ECE) scoring
- Question generator (`question_generator.py`) for automated biological question creation
- GRPO training configs: `grpo_mve.json` (MVE) and `grpo_full_v2.json` (full multi-reward)
- SLURM job scripts for GRPO training on HPC clusters
- Hold-out tissue evaluation (eye, thymus) for generalization testing
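As an illustrative sketch of the weighted multi-reward composition a `RewardComposer` performs (the function and key names below are assumptions, not the project's actual API):

```python
# Hypothetical sketch: combine per-verifier scores (V1-V4) into one scalar
# reward as a weight-normalized sum. Names here are illustrative assumptions.

def compose_rewards(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weight-normalized sum of verifier scores."""
    total_weight = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total_weight

# Example: equal emphasis on factual (V1) and calibration (V4) rewards.
scores = {"v1_factual": 0.8, "v4_uncertainty": 0.5}
weights = {"v1_factual": 0.5, "v4_uncertainty": 0.5}
reward = compose_rewards(scores, weights)  # 0.65
```

Normalizing by the total weight keeps the composed reward on the same [0, 1] scale as the individual verifiers, regardless of how many are enabled.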
### Changed
- Bumped version to 0.2.0
- Updated README with GRPO architecture, verifier system, and latest results
- V1 factual verifier: reduced negation window from 30 to 12 characters to prevent cross-clause false negation
- V1/V4 verifiers: smoothed reward scoring for GRPO (continuous instead of binary)
- Updated HPC training guide with GRPO workflow and SLURM configurations
- Updated dependencies: TRL >= 0.14.0 (GRPO support), PEFT >= 0.6.0
- Lazy imports in `evaluation/__init__.py` to avoid a torch dependency at import time
### Training Results
- MVE experiment (G=4, V1+V4): Reward improved from 0.547 (SFT) to 0.650 (+19%), ECE reduced from 0.258 to 0.078 (-70%)
- Full v2 experiment (G=16, V1-V4): Multi-reward training with zero-variance batch fraction <5% (vs 50% in MVE)
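The ECE figures above follow the standard binned definition: group predictions by confidence, then average the gap between each bin's accuracy and its mean confidence, weighted by bin size. A minimal sketch:

```python
# Minimal sketch of Expected Calibration Error (ECE) with equal-width bins.

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin weight) * |accuracy - mean confidence|."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == lo)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(accuracy - avg_conf)
    return ece

round(expected_calibration_error([0.9, 0.9, 0.1], [1, 0, 0]), 3)  # 0.3
```

A drop from 0.258 to 0.078 means the model's stated confidence tracks its actual accuracy far more closely after GRPO training.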
### Fixed
- LoRA adapter loading: properly load base model first, then merge SFT adapter
- Tokenizer loading from adapter directories in Transformers 4.57+
- TRL `GRPOConfig`: `scale_rewards` as string type, explicit `loss_type="grpo"`
- Batch size compatibility: both `per_device_eval_batch_size` and `generation_batch_size` divisible by `num_generations`
- BioEval ground-truth serialization for dict-type answers
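The batch-size fix above reflects a GRPO constraint: each prompt yields `num_generations` completions, so batches must divide evenly into generation groups. A hedged sketch of such a pre-flight check (the function is hypothetical; the parameter names mirror the TRL `GRPOConfig` fields named in this entry):

```python
# Hypothetical pre-flight check: GRPO produces num_generations completions per
# prompt, so both batch sizes must be divisible by it.

def check_grpo_batch_sizes(per_device_eval_batch_size: int,
                           generation_batch_size: int,
                           num_generations: int) -> None:
    for name, size in [("per_device_eval_batch_size", per_device_eval_batch_size),
                       ("generation_batch_size", generation_batch_size)]:
        if size % num_generations != 0:
            raise ValueError(
                f"{name}={size} must be divisible by num_generations={num_generations}"
            )

check_grpo_batch_sizes(32, 32, 16)  # OK: both divisible by 16
```

Failing fast here is preferable to a mid-training shape mismatch deep inside the trainer's generation loop.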
## 0.1.0 - 2025-01-09

### Added
- Initial release of BioRLHF framework
- SFT (Supervised Fine-Tuning) training pipeline
- DPO (Direct Preference Optimization) training pipeline
- Ground truth biological data from KMP 2x2x2 factorial study
- Automated SFT dataset generation with multiple example types:
  - Factual Q&A examples
  - Chain-of-thought reasoning examples
  - Uncertainty calibration examples
  - Interaction prediction examples
  - Experimental design critique examples
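A template-based generator for one of the example types above might look like the following sketch; the field names and question template are illustrative assumptions, not the project's actual schema:

```python
# Hypothetical sketch of generating a factual Q&A training example from a
# ground-truth record. Schema and template are assumptions for illustration.

def make_factual_qa(gene: str, tissue: str, direction: str) -> dict:
    """Build one instruction/response pair from a ground-truth DEG record."""
    return {
        "type": "factual_qa",
        "instruction": f"How is {gene} regulated in {tissue}?",
        "response": f"{gene} is {direction}-regulated in {tissue}.",
    }

example = make_factual_qa("Cdkn1a", "liver", "up")
```

Generating each example type from the same ground-truth records keeps the SFT dataset internally consistent across factual, reasoning, and calibration examples.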
- Model evaluation with accuracy metrics:
  - Overall accuracy
  - Factual accuracy
  - Reasoning accuracy
  - Calibration accuracy
- Support for 4-bit quantization (QLoRA)
- LoRA adapter training
- Weights & Biases integration for experiment tracking
- HPC support with SLURM job scripts
- GitHub Actions CI workflow for automated testing
- Pre-commit hooks configuration
- Unit tests for ground truth data and dataset creation
- Example scripts (quickstart, train_sft, evaluate_model)
- CONTRIBUTING.md guidelines
### Training Results
- Achieved 90% overall accuracy on biological reasoning tasks
- 100% calibration accuracy (appropriate uncertainty expression)
- Successfully trained on 363 examples
- Model comparison study: Mistral-7B (90%) > Qwen2.5-7B (40%) > Phi-2 (25%)
### Data

- `kmp_sft_final.json`: 363 training examples
- `kmp_test_set.json`: 20-question evaluation set
- `kmp_dpo_preferences.json`: preference pairs for DPO training
## Version History Summary
| Version | Date | Highlights |
|---|---|---|
| 0.2.0 | 2026-03-22 | GRPO pipeline, V1-V4 verifiers, multi-reward training |
| 0.1.0 | 2025-01-09 | Initial release with SFT/DPO pipelines |