
BioRLHF Model Comparison Study

Executive Summary

This study compared three language models fine-tuned on biological reasoning tasks using identical training data (363 examples) and identical hyperparameters. Mistral-7B achieved 90% overall accuracy, far outperforming Qwen2.5-7B (40%) and Phi-2 (25%).

Methodology

Training Configuration

  • Dataset: 363 examples (factual recall + chain-of-thought + calibration)
  • Epochs: 10
  • Learning Rate: 1e-4
  • LoRA: r=64, α=128
  • Max Length: 1536 tokens
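The configuration above maps onto the Hugging Face `peft`/`transformers` APIs. The sketch below is illustrative only: the report does not specify target modules, output path, or other arguments, so those are assumptions.

```python
# Illustrative sketch of the reported hyperparameters using the Hugging Face
# `peft` and `transformers` APIs. The output path is an assumption; the report
# only specifies r, alpha, epochs, learning rate, and max length.
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=64,                           # LoRA rank (as reported)
    lora_alpha=128,                 # alpha = 2 * r (as reported)
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="biorlhf-finetune",  # hypothetical path
    num_train_epochs=10,            # as reported
    learning_rate=1e-4,             # as reported
)
# The max sequence length (1536 tokens) is typically applied at tokenization,
# e.g. tokenizer(..., max_length=1536, truncation=True).
```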

Evaluation

  • 20 test questions across 3 categories:
    • Factual Recall (10 questions)
    • Reasoning (5 questions)
    • Calibration/Uncertainty (5 questions)
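Per-category accuracy on a split like this can be tallied in a few lines; the helper below is a generic sketch, not the study's actual harness (the example outcomes are placeholders).

```python
# Generic sketch: tally per-category accuracy from (category, correct) pairs.
from collections import defaultdict

def score_by_category(results):
    """results: list of (category, is_correct) pairs -> accuracy per category."""
    totals, correct = defaultdict(int), defaultdict(int)
    for category, is_correct in results:
        totals[category] += 1
        correct[category] += int(is_correct)
    return {c: correct[c] / totals[c] for c in totals}

# Example with the study's category sizes (10/5/5); outcomes are illustrative.
results = ([("factual", True)] * 8 + [("factual", False)] * 2
           + [("reasoning", True)] * 5 + [("calibration", True)] * 5)
print(score_by_category(results))
# → {'factual': 0.8, 'reasoning': 1.0, 'calibration': 1.0}
```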

Results

| Model      | Parameters | Overall | Factual | Reasoning | Calibration |
|------------|------------|---------|---------|-----------|-------------|
| Mistral-7B | 7B         | 90.0%   | 80.0%   | 100.0%    | 100.0%      |
| Qwen2.5-7B | 7B         | 40.0%   | 30.0%   | 80.0%     | 20.0%       |
| Phi-2      | 2.7B       | 25.0%   | 20.0%   | 60.0%     | 0.0%        |
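The overall scores above are consistent with a question-weighted average of the category scores (10 factual, 5 reasoning, 5 calibration questions). A quick arithmetic check:

```python
# Verify that overall accuracy equals the question-weighted average of the
# per-category accuracies, using the evaluation's category sizes (10/5/5).
weights = {"factual": 10, "reasoning": 5, "calibration": 5}

models = {
    "Mistral-7B": {"factual": 0.80, "reasoning": 1.00, "calibration": 1.00},
    "Qwen2.5-7B": {"factual": 0.30, "reasoning": 0.80, "calibration": 0.20},
    "Phi-2":      {"factual": 0.20, "reasoning": 0.60, "calibration": 0.00},
}

for name, scores in models.items():
    overall = sum(weights[c] * scores[c] for c in weights) / sum(weights.values())
    print(f"{name}: {overall:.1%}")
# → Mistral-7B: 90.0%, Qwen2.5-7B: 40.0%, Phi-2: 25.0%
```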

Key Findings

1. Mistral-7B Shows Superior Fine-tuning Capability

Despite a similar parameter count, Mistral-7B learned the domain knowledge far more effectively than Qwen2.5-7B. This suggests that Mistral's architecture or pretraining is more amenable to domain-specific fine-tuning.

2. Calibration Requires Explicit Training

  • Mistral-7B: 100% calibration accuracy
  • Qwen2.5-7B: 20% calibration accuracy
  • Phi-2: 0% calibration accuracy

Only Mistral-7B learned to express appropriate uncertainty. This indicates that calibration is a learnable skill, but one that requires sufficient model capacity and training signal.
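The study's grading rubric for calibration answers is not given; one simple automated proxy is to check whether an answer contains explicit uncertainty markers. The keyword list below is a hypothetical sketch, not the rubric actually used.

```python
# Hypothetical sketch: grade a calibration answer by scanning for explicit
# uncertainty markers. The marker list is an assumption; the study's actual
# rubric is not described in the report.
HEDGES = ("not certain", "uncertain", "might", "may be",
          "approximately", "i don't know")

def expresses_uncertainty(answer: str) -> bool:
    a = answer.lower()
    return any(h in a for h in HEDGES)

print(expresses_uncertainty("I'm not certain, but it might be around 112."))  # → True
print(expresses_uncertainty("The answer is exactly 112."))                    # → False
```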

3. Smaller Models Struggle with Domain Knowledge

Phi-2 (2.7B parameters) achieved only 25% accuracy, suggesting there may be a minimum model-size threshold for effective fine-tuning on biological reasoning.

4. Hardest Questions

All models struggled with specific numeric recall:

  • Heart baseline DEGs (112) - 0/3 correct
  • Heart stress DEGs (2,110) - 0/3 correct

This suggests these facts need more aggressive drilling or alternative training strategies.
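One such strategy is oversampling: duplicating the examples that carry the missed facts so they are seen more often per epoch. The sketch below assumes a simple list-of-dicts dataset; the IDs and texts are illustrative, not the actual dataset schema.

```python
# Sketch of oversampling hard facts in a list-of-dicts training set.
# The "id"/"text" schema and the duplication factor are assumptions.
import random

def oversample_hard_facts(examples, hard_ids, factor=4, seed=0):
    """Repeat examples covering hard facts `factor` times total, then shuffle."""
    boosted = list(examples)
    for ex in examples:
        if ex["id"] in hard_ids:
            boosted.extend([ex] * (factor - 1))  # add the extra copies
    random.Random(seed).shuffle(boosted)
    return boosted

examples = [
    {"id": "heart_baseline_degs", "text": "Heart baseline DEGs: 112"},
    {"id": "heart_stress_degs", "text": "Heart stress DEGs: 2,110"},
    {"id": "other_fact", "text": "An easier fact the models already recall."},
]
boosted = oversample_hard_facts(
    examples, {"heart_baseline_degs", "heart_stress_degs"})
print(len(boosted))  # → 9 (2 hard examples x 4 copies + 1 other)
```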

Conclusions

  1. Model selection matters: Mistral-7B is recommended for biological domain fine-tuning
  2. Calibration is learnable: With appropriate training examples, models can learn epistemic humility
  3. Size threshold exists: Models below ~7B parameters may lack capacity for complex domain reasoning

Implications for AI in Life Sciences

This study demonstrates that:

  • Small-scale fine-tuning (363 examples) can achieve high accuracy on domain-specific tasks
  • Uncertainty calibration can be explicitly trained
  • Model architecture significantly impacts fine-tuning effectiveness

These findings inform best practices for deploying LLMs in scientific research contexts where accuracy and appropriate uncertainty expression are critical.


Study conducted: January 9, 2026
Dataset: KMP spaceflight countermeasure transcriptomic data
Framework: BioRLHF (Biological Reinforcement Learning from Human Feedback)