# Changelog

All notable changes to BioRLHF will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

## 0.2.0 - 2026-03-22

### Added
- GRPO training pipeline with verifier-based reward models
- `GRPOConfig` and `run_grpo_training` for Group Relative Policy Optimization
- CLI command `biorlhf-grpo --config <path>` for GRPO training
- Verifier system (V1-V4) for multi-dimensional reward scoring
  - V1 (Factual): Exact-match scoring for DEG counts, tissue names, and directions
  - V2 (Pathway): Pathway/gene set enrichment validation (Hallmark, KEGG)
  - V3 (Consistency): Internal logical consistency checking
  - V4 (Uncertainty): Calibration and epistemic humility scoring
- `RewardComposer` for weighted multi-reward composition
- GRPO dataset module (`grpo_dataset.py`) for prompt-based training data with hold-out tissues
- GeneLab data loader (`genelabloader.py`) for NES conservation questions
- Calibration evaluation (`calibration.py`) with Expected Calibration Error (ECE) scoring
- Question generator (`question_generator.py`) for automated biological question creation
- GRPO training configs: `grpo_mve.json` (MVE) and `grpo_full_v2.json` (full multi-reward)
- SLURM job scripts for GRPO training on HPC clusters
- Hold-out tissue evaluation (eye, thymus) for generalization testing
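As an illustrative sketch of the weighted multi-reward composition a `RewardComposer` performs (the function and key names below are assumptions, not the project's actual API):

```python
# Hypothetical sketch: combine per-verifier scores (V1-V4) into one scalar
# reward as a weight-normalized sum. Names here are illustrative assumptions.

def compose_rewards(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weight-normalized sum of verifier scores."""
    total_weight = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total_weight

# Example: equal emphasis on factual (V1) and calibration (V4) rewards.
scores = {"v1_factual": 0.8, "v4_uncertainty": 0.5}
weights = {"v1_factual": 0.5, "v4_uncertainty": 0.5}
reward = compose_rewards(scores, weights)  # 0.65
```

Normalizing by the total weight keeps the composed reward on the same [0, 1] scale as the individual verifiers, regardless of how many are enabled.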
### Changed
- Bumped version to 0.2.0
- Updated README with GRPO architecture, verifier system, and latest results
- V1 factual verifier: reduced negation window from 30 to 12 characters to prevent cross-clause false negation
- V1/V4 verifiers: smoothed reward scoring for GRPO (continuous instead of binary)
- Updated HPC training guide with GRPO workflow and SLURM configurations
- Updated dependencies: TRL >= 0.14.0 (GRPO support), PEFT >= 0.6.0
- Lazy imports in `evaluation/__init__.py` to avoid a torch dependency at import time
### Training Results
- MVE experiment (G=4, V1+V4): Reward improved from 0.547 (SFT) to 0.650 (+19%), ECE reduced from 0.258 to 0.078 (-70%)
- Full v2 experiment (G=16, V1-V4): Multi-reward training with zero-variance batch fraction <5% (vs 50% in MVE)
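The ECE figures above follow the standard binned definition: group predictions by confidence, then average the gap between each bin's accuracy and its mean confidence, weighted by bin size. A minimal sketch:

```python
# Minimal sketch of Expected Calibration Error (ECE) with equal-width bins.

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin weight) * |accuracy - mean confidence|."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == lo)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(accuracy - avg_conf)
    return ece

round(expected_calibration_error([0.9, 0.9, 0.1], [1, 0, 0]), 3)  # 0.3
```

A drop from 0.258 to 0.078 means the model's stated confidence tracks its actual accuracy far more closely after GRPO training.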
### Fixed
- LoRA adapter loading: properly load base model first, then merge SFT adapter
- Tokenizer loading from adapter directories in Transformers 4.57+
- TRL `GRPOConfig`: `scale_rewards` as string type, explicit `loss_type="grpo"`
- Batch size compatibility: both `per_device_eval_batch_size` and `generation_batch_size` divisible by `num_generations`
- BioEval ground-truth serialization for dict-type answers
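The batch-size fix above reflects a GRPO constraint: each prompt yields `num_generations` completions, so batches must divide evenly into generation groups. A hedged sketch of such a pre-flight check (the function is hypothetical; the parameter names mirror the TRL `GRPOConfig` fields named in this entry):

```python
# Hypothetical pre-flight check: GRPO produces num_generations completions per
# prompt, so both batch sizes must be divisible by it.

def check_grpo_batch_sizes(per_device_eval_batch_size: int,
                           generation_batch_size: int,
                           num_generations: int) -> None:
    for name, size in [("per_device_eval_batch_size", per_device_eval_batch_size),
                       ("generation_batch_size", generation_batch_size)]:
        if size % num_generations != 0:
            raise ValueError(
                f"{name}={size} must be divisible by num_generations={num_generations}"
            )

check_grpo_batch_sizes(32, 32, 16)  # OK: both divisible by 16
```

Failing fast here is preferable to a mid-training shape mismatch deep inside the trainer's generation loop.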
## 0.1.0 - 2025-01-09

### Added
- Initial release of BioRLHF framework
- SFT (Supervised Fine-Tuning) training pipeline
- DPO (Direct Preference Optimization) training pipeline
- Ground truth biological data from KMP 2x2x2 factorial study
- Automated SFT dataset generation with multiple example types:
  - Factual Q&A examples
  - Chain-of-thought reasoning examples
  - Uncertainty calibration examples
  - Interaction prediction examples
  - Experimental design critique examples
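A template-based generator for one of the example types above might look like the following sketch; the field names and question template are illustrative assumptions, not the project's actual schema:

```python
# Hypothetical sketch of generating a factual Q&A training example from a
# ground-truth record. Schema and template are assumptions for illustration.

def make_factual_qa(gene: str, tissue: str, direction: str) -> dict:
    """Build one instruction/response pair from a ground-truth DEG record."""
    return {
        "type": "factual_qa",
        "instruction": f"How is {gene} regulated in {tissue}?",
        "response": f"{gene} is {direction}-regulated in {tissue}.",
    }

example = make_factual_qa("Cdkn1a", "liver", "up")
```

Generating each example type from the same ground-truth records keeps the SFT dataset internally consistent across factual, reasoning, and calibration examples.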
- Model evaluation with accuracy metrics:
  - Overall accuracy
  - Factual accuracy
  - Reasoning accuracy
  - Calibration accuracy
- Support for 4-bit quantization (QLoRA)
- LoRA adapter training
- Weights & Biases integration for experiment tracking
- HPC support with SLURM job scripts
- GitHub Actions CI workflow for automated testing
- Pre-commit hooks configuration
- Unit tests for ground truth data and dataset creation
- Example scripts (quickstart, train_sft, evaluate_model)
- CONTRIBUTING.md guidelines
### Training Results
- Achieved 90% overall accuracy on biological reasoning tasks
- 100% calibration accuracy (appropriate uncertainty expression)
- Successfully trained on 363 examples
- Model comparison study: Mistral-7B (90%) > Qwen2.5-7B (40%) > Phi-2 (25%)
### Data

- `kmp_sft_final.json`: 363 training examples
- `kmp_test_set.json`: 20-question evaluation set
- `kmp_dpo_preferences.json`: preference pairs for DPO training
## Version History Summary
| Version | Date | Highlights |
|---|---|---|
| 0.2.0 | 2026-03-22 | GRPO pipeline, V1-V4 verifiers, multi-reward training |
| 0.1.0 | 2025-01-09 | Initial release with SFT/DPO pipelines |