# BioRLHF

[![CI](https://github.com/jang1563/BioRLHF/actions/workflows/ci.yml/badge.svg)](https://github.com/jang1563/BioRLHF/actions/workflows/ci.yml)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](CONTRIBUTING.md)

**Biological Reinforcement Learning from Human Feedback** — A framework for fine-tuning LLMs on biological reasoning tasks using SFT, DPO, and GRPO with verifier-based reward models for factual accuracy, calibrated uncertainty, and chain-of-thought reasoning.

## Highlights

- **Three-stage training pipeline**: SFT → DPO → GRPO with verifier-based rewards
- **Multi-reward GRPO**: Four composable verifiers (factual, pathway, consistency, uncertainty) with configurable weights
- **+19% reward improvement** over SFT baseline using GRPO (0.650 vs 0.547)
- **-70% calibration error**: ECE reduced from 0.258 to 0.078 after GRPO
- **90% accuracy** on domain-specific biological reasoning tasks (SFT stage)
- **Learns from 363 examples** — efficient domain adaptation from spaceflight transcriptomics data

## Key Results

### GRPO Training (Phase 3)

| Metric | SFT Baseline | After GRPO | Improvement |
|--------|--------------|------------|-------------|
| Avg Reward | 0.547 | 0.650 | +19% |
| ECE (Calibration Error) | 0.258 | 0.078 | -70% |

**GRPO Configuration (Full v2):**

- 16 generations per prompt (G=16) for robust advantage estimation
- Multi-reward: V1 (factual, 0.35) + V2 (pathway, 0.30) + V3 (consistency, 0.15) + V4 (uncertainty, 0.20)
- KL penalty beta=0.02, 2 iterations per batch, group-normalized rewards

### Model Comparison (SFT, 20-question evaluation)

| Model | Overall | Factual | Reasoning | Calibration |
|-------|---------|---------|-----------|-------------|
| **Mistral-7B** | **90.0%** | 80.0% | 100.0% | 100.0% |
| Qwen2.5-7B | 40.0% | 30.0% | 80.0% | 20.0% |
| Phi-2 | 25.0% | 20.0% | 60.0% | 0.0% |

### SFT Training Progression

| Version | Accuracy | Key Improvement |
|---------|----------|-----------------|
| v1 (Base SFT) | ~20% | Format learned, facts wrong |
| v2 (Expanded) | ~60% | More examples helped |
| v3 (Fact Drilling) | ~80% | Repetition fixed key facts |
| v4 (Advanced) | ~85% | Chain-of-thought, calibration |
| **Final** | **90%** | Targeted drilling for remaining errors |

## Installation

### From Source

```bash
git clone https://github.com/jang1563/BioRLHF.git
cd BioRLHF
pip install -e .
```

### With Development Dependencies

```bash
pip install -e ".[dev]"
```

### GPU Requirements

- NVIDIA GPU with 48GB+ VRAM recommended (A40 or A100)
- 24GB+ VRAM sufficient for SFT/DPO with 4-bit quantization
- CUDA 12.1+ recommended

## Quick Start

### SFT Training

```python
from biorlhf import SFTTrainingConfig, run_sft_training

config = SFTTrainingConfig(
    model_name="mistralai/Mistral-7B-v0.3",
    dataset_path="data/kmp_sft_final.json",
    output_dir="./my_sft_model",
    num_epochs=10,
    learning_rate=1e-4,
)
model_path = run_sft_training(config)
```

### GRPO Training with Verifiers

```bash
# Using the CLI
biorlhf-grpo --config configs/grpo_full_v2.json
```

```python
# Or programmatically
from biorlhf.training.grpo import GRPOConfig, run_grpo_training

config = GRPOConfig.from_json("configs/grpo_full_v2.json")
run_grpo_training(config)
```

### Creating a Dataset

```python
from biorlhf.data import create_sft_dataset

dataset = create_sft_dataset(
    output_path="my_dataset.json",
    include_calibration=True,
    include_chain_of_thought=True,
)
print(f"Created {len(dataset)} training examples")
```

### Evaluating a Model

```python
from biorlhf import evaluate_model

result = evaluate_model(
    model_path="./my_sft_model",
    test_questions_path="data/kmp_test_set.json",
)
print(f"Overall Accuracy: {result.overall_accuracy:.1%}")
print(f"Factual: {result.factual_accuracy:.1%}")
print(f"Reasoning: {result.reasoning_accuracy:.1%}")
print(f"Calibration: {result.calibration_accuracy:.1%}")
```

### Running Inference

```python
from biorlhf.utils import load_model_for_inference, generate_response

model, tokenizer = load_model_for_inference(
    model_path="./my_sft_model",
    base_model="mistralai/Mistral-7B-v0.3",
)

prompt = "### Instruction:\nWhich tissue is most sensitive to ionizing radiation?\n\n### Response:\n"
response = generate_response(model, tokenizer, prompt)
print(response)
```

## Architecture

### Three-Stage Training Pipeline

```
Stage 1: SFT                 Stage 2: DPO                Stage 3: GRPO
(Supervised Fine-Tuning)     (Direct Preference          (Group Relative Policy
                              Optimization)               Optimization)

Mistral-7B-v0.3              SFT model                   SFT model (merged)
        |                          |                           |
LoRA (r=64, alpha=128)       Preference pairs            Generate G=16 completions
        |                          |                           |
363 training examples        Ranked responses            Score with V1-V4 verifiers
        |                          |                           |
10 epochs, lr=1e-4           beta=0.1                    Multi-reward composition
        |                          |                           |
SFT Adapter                  DPO Model                   GRPO Model
```

### Verifier-Based Reward System (V1-V4)

| Verifier | Name | Weight | What It Scores |
|----------|------|--------|----------------|
| **V1** | Factual | 0.35 | Exact match of biological facts (DEG counts, tissue names, directions) |
| **V2** | Pathway | 0.30 | Correct pathway/gene set enrichment references (Hallmark, KEGG) |
| **V3** | Consistency | 0.15 | Internal logical consistency within the response |
| **V4** | Uncertainty | 0.20 | Appropriate confidence calibration and epistemic humility |

The verifiers are composable via `RewardComposer` and can be individually weighted:

```python
from biorlhf.verifiers import RewardComposer

composer = RewardComposer(
    active_verifiers=["V1", "V2", "V3", "V4"],
    weights={"V1": 0.35, "V2": 0.30, "V3": 0.15, "V4": 0.20},
)
reward = composer.score(question, response, ground_truth)
```

## Dataset

Training data is derived from a 2x2x2 factorial transcriptomic study:

- **Drug**: Kaempferol (KMP) vs Control
- **Stressor 1**: Hindlimb Unloading (HU) — simulates microgravity
- **Stressor 2**: Ionizing Radiation (IR) — simulates space radiation
- **Tissues**: Heart, Hippocampus, Liver, Soleus (+ Eye, Thymus for GRPO hold-out)

### Training Example Types

| Type | Count | Purpose |
|------|-------|---------|
| Factual Q&A | ~150 | Specific facts (DEG counts, tissue types) |
| Chain-of-Thought | ~50 | Step-by-step reasoning |
| Calibration | ~30 | Uncertainty expression |
| Multi-hop Reasoning | ~30 | Integrating multiple facts |
| Error Correction | ~20 | Learning from mistakes |

### Ground Truth Data

```python
from biorlhf.data import (
    STRESSOR_EFFECTS,
    KMP_EFFECTS,
    INTERACTIONS,
    TISSUE_TYPES,
    OXPHOS_PATTERNS,
)

# Example: Get DEG counts for stressors
print(STRESSOR_EFFECTS["Hippocampus"])
# {'HU': 1555, 'IR': 5477, 'HU_IR': 5510}
```

## Project Structure

```
BioRLHF/
├── src/biorlhf/               # Main package
│   ├── training/              # SFT, DPO, and GRPO trainers
│   ├── data/                  # Dataset creation & ground truth
│   ├── evaluation/            # Model evaluation & calibration
│   ├── verifiers/             # V1-V4 reward verifiers
│   │   ├── factual.py         # V1: Factual accuracy scoring
│   │   ├── pathway.py         # V2: Pathway enrichment scoring
│   │   ├── consistency.py     # V3: Logical consistency scoring
│   │   ├── uncertainty.py     # V4: Calibration/uncertainty scoring
│   │   └── composer.py        # Multi-reward composition
│   ├── utils/                 # Model loading, inference helpers
│   └── cli.py                 # Command-line interface
├── configs/                   # Training configurations
│   ├── grpo_mve.json          # Minimum viable experiment
│   └── grpo_full_v2.json      # Full multi-reward training
├── data/                      # Training datasets
│   ├── kmp_sft_final.json     # 363 SFT training examples
│   └── kmp_test_set.json      # 20-question evaluation set
├── examples/                  # Usage examples
├── scripts/                   # SLURM job scripts & HPC guide
├── tests/
                           # Unit tests
└── docs/                      # Documentation
```

## Scientific Contributions

### 1. Verifier-Based GRPO Improves Calibration

- GRPO with V1-V4 verifiers reduced calibration error (ECE) by 70%
- Multi-reward composition outperforms single-reward training
- G=16 generations dramatically reduces zero-variance batches (from 50% to <5%)

### 2. Fact Drilling Works for SFT

- Initial training: 20% accuracy on key facts
- After targeted repetition: 100% accuracy on drilled facts
- LLMs need explicit reinforcement of specific domain facts

### 3. Calibration is Learnable

- Trained on "I cannot determine X from this data" examples
- Mistral achieved 100% calibration accuracy at SFT stage
- GRPO further improved calibration via the V4 uncertainty verifier

### 4. DPO is Fragile for Domain Knowledge

- Aggressive DPO (beta=0.05) destroyed learned knowledge
- Model hallucinated unrelated content
- Preference learning needs careful tuning in specialized domains

### 5. Architecture Matters More Than Size

- Mistral-7B >> Qwen2.5-7B despite similar parameter counts
- Phi-2 (2.7B) insufficient for complex biological reasoning
- Model selection is critical for domain fine-tuning

## Key Learnings for AI Safety

1. **Honesty is trainable** — Models can learn appropriate epistemic humility
2. **Domain grounding matters** — Anchoring to experimental truth prevents hallucination
3. **Multi-reward > single reward** — Decomposing correctness into verifiable dimensions improves learning signal
4. **Preference learning is fragile** — DPO can catastrophically forget domain knowledge
5. **Evaluation drives improvement** — Systematic testing reveals specific failure modes

## Related Projects

- **[SpaceOmicsBench](https://github.com/jang1563/SpaceOmicsBench)** — 115-question benchmark for LLMs on spaceflight biomedical data

## Citation

If you use BioRLHF in your research, please cite:

```bibtex
@software{biorlhf2026,
  author = {Kim, JangKeun},
  title = {BioRLHF: Biological Reinforcement Learning from Human Feedback},
  year = {2026},
  url = {https://github.com/jang1563/BioRLHF},
  note = {Fine-tuning LLMs for biological reasoning with verifier-based GRPO}
}
```

## Contributing

Contributions are welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

## License

This project is licensed under the MIT License — see the [LICENSE](LICENSE) file for details.

---

*Developed by JangKeun Kim (jak4013@med.cornell.edu), Weill Cornell Medicine*
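As an addendum, the group-normalized, multi-reward scoring described under Key Results can be sketched in a few lines. This is an illustrative sketch, not the biorlhf implementation: `composite_reward` and `group_advantages` are hypothetical helpers. The weights follow the Full v2 configuration, and each completion's advantage is its weighted-sum reward normalized by the mean and standard deviation of its G-completion group.

```python
from statistics import mean, pstdev

# Full v2 verifier weights (V1 factual, V2 pathway, V3 consistency, V4 uncertainty)
WEIGHTS = {"V1": 0.35, "V2": 0.30, "V3": 0.15, "V4": 0.20}

def composite_reward(scores: dict) -> float:
    """Weighted sum of per-verifier scores (each assumed in [0, 1])."""
    return sum(WEIGHTS[v] * scores[v] for v in WEIGHTS)

def group_advantages(rewards: list) -> list:
    """Group-relative advantages: normalize each completion's reward by
    its group's mean/std. A zero-variance group yields zero advantage."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:  # all G completions scored identically: no learning signal
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]

# Four of the G completions sampled for one prompt (G=16 in the Full v2 config)
group = [
    composite_reward({"V1": 1.0, "V2": 0.8, "V3": 1.0, "V4": 0.6}),
    composite_reward({"V1": 0.5, "V2": 0.6, "V3": 1.0, "V4": 0.4}),
    composite_reward({"V1": 0.0, "V2": 0.2, "V3": 0.8, "V4": 0.2}),
    composite_reward({"V1": 1.0, "V2": 1.0, "V3": 1.0, "V4": 1.0}),
]
adv = group_advantages(group)
```

Completions above the group mean receive positive advantage and are reinforced; a group whose rewards are all identical contributes nothing, which is why the larger G=16 sharply reduces zero-variance batches.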
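The ECE numbers reported in Key Results can be illustrated with a minimal sketch of the standard equal-width-binning definition of expected calibration error. This is a generic, hypothetical helper, not the repo's evaluation code (whose binning details may differ): predictions are grouped by stated confidence, and the |accuracy − confidence| gap of each bin is averaged, weighted by bin size.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE over equal-width confidence bins: the bin-size-weighted
    average gap between stated confidence and observed accuracy."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # half-open bins [lo, hi); the top bin also includes confidence 1.0
        idx = [i for i, c in enumerate(confidences)
               if lo <= c < hi or (b == n_bins - 1 and c == 1.0)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(acc - conf)
    return ece

# A model that claims 85% confidence but answers correctly 80% of the time
ece = expected_calibration_error([0.85] * 10, [1] * 8 + [0] * 2)
# ≈ 0.05: overconfident by five points in the 0.8-0.9 bin
```

Lower is better: the reported drop from 0.258 to 0.078 means post-GRPO confidence statements track actual accuracy much more closely.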